What Have We Learned About Measuring Health System Performance?

Assessments of performance are widely used to inform value-based payment systems, to help consumers choose health systems that best meet their needs, and to compare performance across health systems for quality improvement. The absence of a consistent definition of what constitutes high performance, and of consensus about how to measure it, hinders our ability to compare, motivate, and reward health systems based on performance.

In a series of recent studies, researchers at the RAND Center of Excellence on Health System Performance have explored some of the challenges inherent in measuring health system performance and suggested ways to address them.

In brief, we learned:

We lack consensus on what makes a health system high performing.

Core to the mission of the RAND Center of Excellence is describing what makes a health system high performing. Developing that description requires a shared understanding of what high performing means. Although a wide range of stakeholders have a substantial interest in improving the performance of health systems, surprisingly, no such shared understanding exists.

Searching for a generally accepted definition, a RAND team conducted a systematic review of the peer-reviewed literature from 2005 to 2015 and of the grey literature (materials produced by organizations outside traditional publishing channels) from 1999 to 2016. We included studies of any design as well as a wide range of reports, editorials, commentaries, and other materials that used and defined the term ‘high performance’ when describing a health care delivery system.

We found ample literature defining high performance with respect to a specific clinical area (for example, cardiac surgery or glycemic control) but only about 60 articles applied the term ‘high performance’ to health systems or health care organizations.

After reviewing these articles, the team concluded that there is no consistently used definition of high performing applied either to an entire health care delivery system or to one of its components. Across the studies, more than a dozen performance dimensions were used, and most studies used multiple dimensions. Clinical quality was almost always included, often paired with cost, and sometimes with access. The most commonly used dimensions – quality, cost, access, equity, patient experience, and patient safety – generally aligned with the aims for health care improvement proposed by the Institute of Medicine in its 2001 report, Crossing the Quality Chasm. However, the studies also identified other dimensions such as organizational responsiveness, coordination, physician work-life satisfaction, governance, and innovation, suggesting that additional dimensions may be important in defining high performance.

People who build and operate health systems bring real world experience to the challenge of defining and measuring performance.

To further explore dimensions of high performance, a RAND team turned to executives who have real world experience building, operating, and managing health systems. The RAND team assembled a technical expert panel comprising C-suite-level leaders of health systems and large multispecialty physician organizations; the panel also included researchers with expertise in economics, business administration, public health, and medicine. The team implemented a three-round modified Delphi process. First, panel members used a web-based survey to rank dimensions of performance drawn from a targeted literature review. Then, the panel met in person to determine which performance domains and attributes were essential elements in health system performance. In a final rating round, the panel prioritized which health system attributes were the most important to include in data collection.

The literature identified attributes associated with performance across eight performance-related domains:

  1. Market
  2. Culture
  3. Care Delivery
  4. Health IT
  5. Organizational structure
  6. Leadership
  7. Quality improvement
  8. Human resources

Panelists added a ninth domain (Business execution), revised the list, and rated the importance of attributes. They viewed some attributes used in past studies (size, ownership, profit status) as only somewhat or not important. They ranked aspects of culture, leadership, and business execution as highly important.

Information about some attributes (market context, size, ownership, and EHR functions) is available in secondary data. But information about highly important attributes such as business execution, culture, and leadership requires primary data collection and new data collection tools to systematically examine how these factors affect performance.

The panelists constructed a logic model to illustrate how the various domains interact to affect performance.

How Do Domains Affect Health System Outcomes?

Note: The nine performance-related domains are denoted with blue boxes.

The logic model shows the interrelationships among the various domains that affect health system performance.

The model starts with the market, whose influence follows two paths.

  1. For the first path:
    1. The market affects organizational structure, culture, and leadership.
    2. Organizational structure, culture, and leadership drive care delivery and quality improvement.
    3. Care delivery and quality improvement produce outcomes.
      1. However, care delivery and quality improvement also require health IT and human resources infrastructure to produce these outcomes.
    4. Additionally, patient characteristics affect the risk adjustment for outcomes.
  2. For the second path:
    1. The market affects business execution and strategic intent.
    2. Likewise, business execution and strategic intent affect the market.

Panelists viewed the market as affecting a health system’s business model and strategic approach, but also influencing a health system’s culture and leadership. They agreed that culture and leadership drive delivery of care and quality improvement efforts. However, they also viewed health IT and human resources as necessary to achieving desired patient outcomes, which are ultimately influenced by the characteristics of patients themselves.

The RAND team also examined the subjective performance assessments of executives in 24 health systems in four states, comparing them to a set of objective measures of clinical performance based on nationally endorsed standards of care. Subjective assessments were higher than objective assessments and captured more factors than are typically considered in performance assessment and value-based performance initiatives. Executives whose views were consistent with the objective measures were those who cited clinical quality measures as the basis for their subjective assessment. Those executives whose assessments were inconsistent with the objective measures focused instead on customer satisfaction, market competition, and financial performance. In line with what we learned from the technical expert panel, executives identified organizational culture, organizational governance, and staff engagement as key levers for achieving high performance. Future research should explore the benefits and drawbacks of including a wider range of metrics in performance assessment.

These assessments are especially valuable because they take into account the broader context in which health systems function, and reflect the experience and real world perspective of individuals who are responsible for driving health system performance. This kind of experiential evidence can help policymakers craft better policies to incentivize high performance and help health leaders build better health systems.

How you measure health system performance affects the answer you get.

Payers and policymakers are using multiple tools to motivate better performance from health care providers. Pay-for-performance and value-based payment programs link provider reimbursement to achievement of high performance on cost and quality, based on widely accepted measures. It is reasonable for consumers, who are encouraged to choose providers based on publicly available score cards, to expect that the providers they choose rank high on multiple dimensions of care.

A RAND team examined how different performance domains and different performance thresholds affect medical group performance rankings, using publicly available performance data from the Minnesota Community Measurement Health Care Quality Report. The team examined performance data on quality, total costs of care, access, and patient experience, and chose a subset of measures in each domain reported by the largest number of medical groups.

Two common approaches were used to establish thresholds to classify providers: relative value thresholds (where groups are ranked by performance compared with each other—e.g., top 25 percent, top 50 percent); and absolute value thresholds (where groups are ranked according to pre-set standards—e.g., scores above 75 percent, scores above 90 percent).
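The difference between the two threshold types can be sketched in a few lines of Python. This is only an illustration: the scores, group labels, and function names are hypothetical and are not drawn from the Minnesota Community Measurement data, and ties between groups are ignored for simplicity.

```python
import math

# Hypothetical scores (0-100) for eight medical groups; illustration only.
scores = {"A": 92.0, "B": 81.0, "C": 77.0, "D": 64.0,
          "E": 58.0, "F": 45.0, "G": 39.0, "H": 30.0}

def top_fraction(scores, fraction):
    """Relative threshold: groups ranked in the top `fraction` of all groups."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    n_keep = math.floor(len(ranked) * fraction)
    return set(ranked[:n_keep])

def above_cutoff(scores, cutoff):
    """Absolute threshold: groups whose score is above a pre-set standard."""
    return {group for group, score in scores.items() if score > cutoff}

relative = top_fraction(scores, 0.25)  # top 25 percent of the eight groups
absolute = above_cutoff(scores, 75.0)  # scores above 75 percent

print(sorted(relative))
print(sorted(absolute))
```

With these made-up numbers, the two approaches flag different sets of groups even though they are applied to the same scores. A relative threshold always flags a fixed share of groups, while an absolute threshold can flag all of them or none.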

When relative value thresholds were used, no medical groups were identified as high performing at the top 10 percent, 25 percent, or 35 percent thresholds. One medical group was identified using a 40 percent threshold, and another using a 50 percent threshold. However, there was little agreement about which medical groups were ranked as high performing when using different combinations of performance domains. The medical groups that performed in the top 35 percent for quality, access, and patient experience were not the same groups that performed in the top 35 percent for quality, access, and cost.

One domain
  Quality: 9 medical groups
  Access: 21 medical groups
  Patient experience: 11 medical groups
  Cost: 21 medical groups
Two domains
  Quality and Access: 7 medical groups
  Quality and Patient experience: 3 medical groups
  Quality and Cost: 3 medical groups
Three domains
  Quality, Access, and Patient experience: 3 medical groups
  Quality, Cost, and Patient experience: 0 medical groups
  Quality, Access, and Cost: 3 medical groups
Four domains
  Quality, Access, Cost, and Patient experience: 0 medical groups
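The pattern in the counts above, in which fewer groups qualify as domains are added and different combinations single out different groups, follows from requiring a group to clear the threshold in every domain at once. A minimal sketch, assuming hypothetical group labels and domain memberships (not the Minnesota data):

```python
from itertools import combinations

# Hypothetical sets of groups that clear the threshold in each domain.
high = {
    "quality": {"A", "B", "C"},
    "access": {"A", "B", "D", "E"},
    "patient_experience": {"A", "C", "F"},
    "cost": {"B", "D", "G"},
}

def high_on_all(high, domains):
    """Groups that clear the threshold in every listed domain."""
    return set.intersection(*(high[d] for d in domains))

def counts_by_combination(high, size):
    """Number of qualifying groups for each combination of `size` domains."""
    return {combo: len(high_on_all(high, combo))
            for combo in combinations(sorted(high), size)}

for size in range(1, len(high) + 1):
    print(size, counts_by_combination(high, size))
```

In this toy example, the group that qualifies on quality, access, and patient experience is not the group that qualifies on quality, access, and cost, and no group qualifies on all four domains.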

When absolute value thresholds were used to measure performance, no groups were designated as high performing with thresholds of 90 percent, 80 percent, or 70 percent. Three groups were identified as high performing when the threshold was 60 percent. Sixteen groups qualified as high performing at a threshold of 50 percent.

As with the relative value approach, more medical groups qualified as high performing when only two performance domains were included, compared with three or four. There was moderate agreement across performance domains regarding which groups were high performing—for example, seven medical groups were high performing when groups were classified based on clinical quality and patient experience, but only three of the seven groups were high performing when groups were classified based on clinical quality and cost.

Overall, very few Minnesota medical groups performed in the top 50 percent when assessed across all measures, and the number of medical groups designated as high performing decreased when more domains were added to the assessment.

The absence of a consistent way to measure and classify high performance has important implications for both consumers and providers. For example, the Centers for Medicare & Medicaid Services (CMS) Star Ratings Program uses a performance rating algorithm based on relative thresholds. In contrast, California’s Integrated Healthcare Association (which operates the largest pay-for-performance program in the United States) uses an absolute threshold of 50 percent. As a result, medical groups designated as high performing by one payer might not be so designated by the other. Given the absence of agreed upon standards for high performance, signals to stakeholders can be confusing rather than enlightening. Consumers, providers, payers, and policymakers can’t respond appropriately when the very same providers are classified as high performing in one approach to measurement but not in another. Consensus about how to define and measure high performance is urgently needed.

A composite measure based on publicly reported data can produce valid, reliable, and stable quality rankings for ambulatory care.

Summary measures of provider performance are already widely used. One example is the Medicare Advantage and Medicare Part D Star Ratings Program, through which CMS determines payment for health and prescription drug plans based on a composite summary score. Methods for ranking health care providers, hospitals, and health plans on a single measure are well established, but the measurement science is less advanced when it comes to summarizing performance across multiple dimensions of care. Multilevel models using multiple measures have been used to rank hospitals and providers. Such models are useful because not all providers collect and report the same performance measures, and the number of patients may differ widely across providers; multilevel models automatically account for these discrepancies.

A RAND team sought to build a single composite measure of ambulatory quality of care using clinical quality measures that are already widely used and accepted. The goal was to see if a single composite measure would produce quality rankings that accurately summarize the information in multiple measures. The team used data from 27 health systems in California and 28 in Minnesota that publicly reported performance data from 2014 to 2016. We sought to evaluate three critical properties of a good performance measure: the composite measure’s validity (does it measure what it is intended to measure); reliability (are differences in performance larger than random statistical noise); and stability (are rankings reasonably stable from year to year).

In detailed analyses, the team determined that their composite measure was valid: the composite was associated with many of the component measures, but no one component dominated. Health systems that performed better on the composite measure tended to rank higher on the individual components of the composite measure. The composite measure was reliable, especially for designating high ranking health systems and differentiating them from low performing systems. Most health system rankings were stable from year to year.
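Two of these properties lend themselves to a simple numerical check. The sketch below is not the RAND team's analysis. It assumes one common way to express reliability (the share of observed variance that is not noise) and uses Spearman rank correlation as a stand-in for year-to-year stability. All scores and noise variances are hypothetical.

```python
from statistics import mean, pvariance

def ranks(values):
    """Rank values from 1 (highest) downward; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman(x, y):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

def reliability(scores, noise_variances):
    """Assumed definition: (observed variance - average noise) / observed variance."""
    observed = pvariance(scores)
    signal = max(observed - mean(noise_variances), 0.0)
    return signal / observed

# Hypothetical composite scores for five health systems in two years.
year_1 = [0.9, 0.4, 0.1, -0.3, -0.8]
year_2 = [0.8, 0.5, -0.2, 0.0, -0.9]

stability = spearman(year_1, year_2)
signal_share = reliability(year_1, [0.02] * 5)
print(round(stability, 2), round(signal_share, 2))
```

A stability value near 1 means rankings barely move between years, and a reliability value near 1 means differences between systems are large relative to statistical noise.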

The advantage of the RAND composite measure is that it does not require all health systems to report on all measures. The composite reduces the complexity and maximizes the interpretability of rankings – features that are critical to end users such as payers who want to recognize and reward high performing health systems and consumers who want to select high performing providers for their own care.
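To illustrate how a composite can tolerate unreported measures, the sketch below averages standardized scores over whichever measures each system reports. This is a deliberately simple stand-in, not the multilevel model the RAND team used. The system names, measure names, and rates are hypothetical.

```python
from statistics import mean, pstdev

# Hypothetical measure rates; None marks a measure a system did not report.
reported = {
    "System 1": {"m1": 0.80, "m2": 0.70, "m3": None},
    "System 2": {"m1": 0.60, "m2": None, "m3": 0.50},
    "System 3": {"m1": 0.70, "m2": 0.90, "m3": 0.70},
}

def composite(reported):
    """Average z-score per system across the measures that system reported."""
    measures = sorted({m for values in reported.values() for m in values})
    z_scores = {system: [] for system in reported}
    for m in measures:
        observed = {s: v[m] for s, v in reported.items() if v.get(m) is not None}
        xs = list(observed.values())
        mu, sd = mean(xs), pstdev(xs)
        for system, x in observed.items():
            # Standardize so measures on different scales are comparable.
            z_scores[system].append((x - mu) / sd if sd > 0 else 0.0)
    return {s: mean(zs) for s, zs in z_scores.items() if zs}

scores = composite(reported)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Unlike a multilevel model, this simple average does not account for differences in the number of patients behind each rate, so it is useful only as a picture of the idea.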

Health systems could use a similar approach to rank and classify medical groups (or units within their hospitals) that report common quality measures. Systems could also use such a composite to benchmark themselves against other health systems if they have access to publicly reported measures. Finally, stakeholders in other regions could use the methodology described by the RAND team to develop composites using available performance measures. They can use this analysis as a guide for subjecting proposed composite measures to statistical scrutiny and for identifying whether a proposed composite has the properties they want it to have.
