Consumer Tolerance for Inaccuracy in Physician Performance Ratings
Originally published by the Center for Studying Health System Change
Published: March 2007
Updated: April 8, 2026
How Much Error Will Consumers Accept in Physician Ratings?
Originally published as Issue Brief No. 110 in March 2007 by Matthew M. Davis, Judith H. Hibbard, and Arnold Milstein, this study tackled a question that had been largely absent from an increasingly heated policy debate: How much measurement error are consumers willing to accept when health plans rate and rank their physicians?
By the mid-2000s, health plans were expanding their use of physician performance ratings as a tool to improve clinical practice and steer patients toward higher-performing clinicians. But the underlying data sources and performance measures available at the time were far from perfect. Every rating system carried some degree of imprecision, meaning that some physicians would inevitably be misclassified -- labeled as higher performing when their actual performance was lower, or vice versa.
Physicians had been vocal in their opposition to ratings they considered unreliable, warning that flawed performance scores could damage reputations and incomes unfairly. Health plans, purchasers, and consumer advocates countered that even imperfect ratings could drive more improvement than operating in a performance-blind environment with no measurement at all. What neither side had examined closely was where consumers themselves stood on the question of acceptable error.
Study Design and How Inaccuracy Was Explained to Participants
The Center for Studying Health System Change commissioned a nationally representative household survey through Knowledge Networks, Inc., in December 2006. The sample included 1,057 adults aged 18 and older, drawn from a standing panel assembled through random-digit-dialing methods. Knowledge Networks provided free Internet access to households that lacked it, helping the panel more closely resemble the broader U.S. population. The sample was stratified so that half of respondents reported chronic, doctor-diagnosed conditions such as heart disease, asthma, or diabetes. Results were weighted to be nationally representative, and the response rate among panel members contacted was 64 percent.
The survey measured consumer acceptance of measurement error in physician performance ratings across four distinct applications: releasing ratings to the general public, using ratings to select one's own primary care physician, using ratings to adjust physicians' payment rates through pay-for-performance programs, and using ratings to encourage patients to seek care from higher-rated physicians through tiered-benefit insurance plans.
Inaccuracy was described in concrete terms. Respondents were told that a rating system described as 80 percent accurate and 20 percent inaccurate would incorrectly classify 20 out of every 100 physicians -- placing some higher-performing doctors into the lower-performing category or vice versa. The survey offered choices of acceptable inaccuracy ranging from 1 percent or less (at least 99 percent accurate) up to 50 percent (at least 50 percent accurate). Importantly, the researchers found that whether the question was framed as misclassifying higher-performing or lower-performing physicians made no difference to respondents' answers.
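The arithmetic behind this framing can be made concrete with a small simulation. The sketch below is purely illustrative and not drawn from the study's methods; the function name, the even split between higher- and lower-performing physicians, and the seed are assumptions for demonstration.

```python
import random

def simulate_ratings(n_physicians=100, accuracy=0.80, seed=1):
    """Illustrate the survey's framing: a rating system that is
    `accuracy` accurate misclassifies the remaining physicians."""
    random.seed(seed)
    # Assume half the physicians are truly higher performing, half lower.
    truth = ["higher"] * (n_physicians // 2) + ["lower"] * (n_physicians // 2)
    rated = []
    for t in truth:
        if random.random() < accuracy:
            rated.append(t)  # correctly classified
        else:
            # Misclassified: a higher performer is rated lower, or vice versa.
            rated.append("lower" if t == "higher" else "higher")
    # Count physicians whose rating disagrees with their true performance.
    return sum(1 for t, r in zip(truth, rated) if t != r)

# With 80 percent accuracy, roughly 20 of every 100 physicians end up mislabeled.
print(simulate_ratings())
```

Running the simulation repeatedly with different seeds would show the misclassification count clustering around 20 per 100, matching the description given to respondents.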
Wide Variation in Tolerance Across All Four Rating Uses
Regardless of the specific application of physician performance ratings, the survey revealed a remarkably wide range of consumer tolerance for measurement error. No single standard of acceptable accuracy commanded a consensus.
Across all four uses, the most common response was low tolerance for inaccuracy -- defined as 5 percent or less. For three of the four applications, at least 40 percent of consumers fell into this low-tolerance category. These respondents demanded near-perfect accuracy before they were comfortable with ratings being used in any capacity.
At the other end of the spectrum, more than 20 percent of consumers indicated they would be comfortable with ratings that were 20 to 50 percent inaccurate -- what the researchers classified as high tolerance. These consumers appeared to view imperfect information as preferable to having no performance data at all.
Consumers demonstrated relatively more tolerance for inaccuracy when ratings were used for public reporting and for tiered-network insurance products. Their tolerance dropped significantly in two circumstances: when the ratings would be used to choose their own personal physician and when health plans would use the ratings to adjust physician payment through pay-for-performance initiatives.
The lower tolerance for pay-for-performance applications was particularly noteworthy. Consumers appeared to have broader concerns about using financial incentives to reward or penalize physicians based on performance scores, suggesting unease with the concept of tying physician income to metrics that might be flawed.
One striking finding was the consistency of individual consumers' tolerance levels across the different rating uses. Among respondents who expressed the lowest tolerance for inaccuracy when choosing their own physician, 76 percent also held the lowest tolerance for inaccuracy in pay-for-performance applications. This suggested that tolerance for error was more of a personal trait than a context-dependent judgment.
Public Information and Choosing a Personal Physician
People who demanded the highest accuracy for public reporting and for selecting their own physician shared certain characteristics. They were disproportionately middle-aged, between 45 and 64 years old, and more likely to have an established relationship with a regular doctor.
The survey asked respondents about their beliefs regarding variation in physician adherence to quality-of-care guidelines and about the importance of specific performance characteristics such as timeliness of care and prevention of treatment complications. For the public reporting use, none of these attitudes were associated with tolerance for inaccuracy.
When it came to choosing one's own physician, however, the pattern was different. Tolerance for inaccuracy was lowest among people who believed physicians generally do not differ in their adherence to quality guidelines. This seemingly counterintuitive finding suggested that consumers who perceive little natural variation among physicians are especially reliant on ratings to help them make distinctions they cannot make on their own.
Low tolerance for error was also concentrated among those who rated timeliness of care and prevention of treatment complications as very important factors in physician performance. These respondents viewed punctuality and patient safety as critical metrics and wanted the measurement systems to reflect those priorities with high fidelity.
Notably, tolerance for rating inaccuracy when choosing a physician did not differ by education, income, race or ethnicity, gender, insurance type, presence of a chronic health condition, prior use of consumer ratings, or how involved consumers were in managing their own health care.
Pay-for-Performance and Financial Incentives
The pattern for pay-for-performance applications closely mirrored what the researchers observed for personal physician selection. People who insisted on the lowest levels of inaccuracy for pay-for-performance programs were more likely to believe that physicians generally do not differ in following quality-of-care guidelines. They also placed greater emphasis on a physician's ability to prevent complications as a critical factor in judging performance.
Beyond those two associations, low tolerance for inaccuracy in pay-for-performance contexts was not connected to any of the health or sociodemographic characteristics measured in the survey. It was not related to prior use of consumer ratings, attitudes about other dimensions of physician performance, or beliefs about the importance of bedside manner, cost-effectiveness, or other quality metrics.
Tiered Networks and Steering Patients Toward High Performers
Consumers who demanded the greatest accuracy for tiered-network applications -- where health plans encourage patients to use higher-rated physicians by offering lower cost sharing -- exhibited a distinct profile compared to the other rating uses. In this context, low tolerance for inaccuracy was significantly associated with having private insurance as opposed to other coverage types and with reporting excellent self-rated health status.
These respondents were also more likely to say that bedside manner should not be an important factor in evaluating physician performance. Otherwise, tolerance for inaccuracy in the tiered-network context was not connected to other health characteristics, sociodemographic factors, prior ratings use, or attitudes about physician performance measurement.
Consumer Experience with Ratings of All Kinds
The survey also explored whether consumers' broader experience with ratings of goods and services -- such as those published by Consumer Reports -- or with physician-specific ratings sources like Healthgrades.com or local magazine rankings might shape their tolerance for inaccuracy in physician performance measurement.
Overall, 45 percent of respondents reported having used consumer ratings of some kind in the past. Usage was significantly higher among people with more education and higher incomes. Non-Hispanic Black respondents were significantly less likely to have used ratings compared with non-Hispanic White respondents, Hispanics, and other non-Hispanic groups. People with chronic conditions had used consumer ratings less frequently than those without chronic conditions. Usage did not otherwise vary by age, gender, or whether someone had a regular doctor.
Among those who had used ratings, 93 percent had consulted general consumer ratings sources, while 50 percent had looked at ratings specifically about physicians. The perceived value differed sharply between the two categories: 44 percent of users found general consumer ratings helpful, but only 13 percent said physician-specific rating sources had been helpful. This gap suggested that physician ratings, as they existed in the mid-2000s, were not meeting consumer needs or expectations.
Implications for Health Plans and Policymakers
The findings carried several practical implications for health plans working to implement physician performance ratings with the measurement tools available at the time.
Consumer tolerance for inaccuracy was almost certainly higher than physician tolerance. Consumers may have viewed imperfect information as better than no information at all, while physicians -- who understood the technical challenges of performance measurement and had their reputations and livelihoods at stake -- demanded greater precision. This divergence helped explain why plans and physicians clashed so frequently over rating programs.
Given the limitations in available clinical data, the actual error rate in most individual physician ratings likely exceeded 5 percent -- the threshold at which at least one-third of consumers would find the ratings unacceptable. This raised a critical question: should health plans disclose the level of inaccuracy in their rating systems? Transparency would allow consumers to decide for themselves whether the ratings were reliable enough for their purposes, but it would also add complexity to an already complicated landscape. Research had already shown that complexity was a major barrier to consumers using public performance reports, and adding another layer of information about accuracy could further discourage engagement.
Consumers who chose not to use ratings they considered inaccurate would face real consequences. They might select physicians whose performance was genuinely worse than alternatives, or they might forgo financial rewards available through tiered insurance plans that incentivized visits to higher-rated providers. If a substantial share of consumers declined to act on performance data, the entire value proposition of physician rating systems would weaken, potentially forcing plans and provider organizations to find other approaches to quality improvement.
The Accuracy Challenge and Paths Forward
In the near term, health plans faced significant obstacles in improving rating accuracy. Most ratings at the time relied on analyses of a single insurer's claims and enrollment data, which had well-known limitations in measurement reliability. Small sample sizes for individual physicians, incomplete diagnostic coding, and the inability to capture all dimensions of clinical quality meant that substantial measurement error was baked into the process.
Several potential solutions existed, though each carried its own limitations. Broader adoption of electronic medical records designed to generate robust performance measures would eventually improve accuracy, but widespread implementation was not imminent in 2007. Health plans could encourage physicians to self-report supplementary clinical data to fill gaps in claims-based measures, but this approach required paying physicians incentives, which added cost to insurance coverage, and the self-reported data were rarely audited.
Plans could also incorporate patient-reported assessments of physician performance, including evaluations posted on patient-driven websites. These might cover experience-based metrics like bedside manner and observable quality events such as whether a physician checked the feet of diabetic patients during routine visits. Another option was pooling claims data across multiple private insurers or, if legally permitted, combining private claims with Medicare data to achieve the sample sizes needed for more reliable measurement of individual physician performance.
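Why pooling claims data helps can be seen from the basic statistics of proportions. The sketch below is an illustrative calculation, not one performed in the study: the standard error of an observed guideline-adherence rate shrinks with the square root of the patient count, so quadrupling a physician's sample roughly halves the measurement error. The adherence rate of 70 percent and the sample sizes shown are assumed values for demonstration.

```python
import math

def standard_error(p, n):
    """Standard error of an observed adherence rate p measured on n patients."""
    return math.sqrt(p * (1 - p) / n)

# A hypothetical physician whose true guideline-adherence rate is 70 percent:
p = 0.70
for n in (10, 40, 160, 640):  # pooling data across insurers grows n
    se = standard_error(p, n)
    low, high = p - 1.96 * se, p + 1.96 * se
    print(f"n={n:4d}  SE={se:.3f}  95% CI approx ({low:.2f}, {high:.2f})")
```

With only 10 patients, the confidence interval spans much of the plausible range of performance, making misclassification against any cut point likely; at several hundred patients the interval narrows enough to distinguish physicians more reliably, which is the statistical case for combining private claims with each other or with Medicare data.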
Some physicians advocated limiting performance measurement and rating to multi-physician groups rather than individual clinicians. Group-level measurement would reduce some reliability problems associated with small sample sizes, but it would introduce validity concerns because performance can vary significantly among physicians within the same group practice.
Communicating Error: No Easy Answers
Given the broad variation in consumer tolerance, the researchers suggested that a workable short-term compromise might involve payers disclosing the level of inaccuracy associated with their rating methodology. This would at least allow consumers to make informed decisions about how much weight to place on the scores.
But significant questions remained about how to convey inaccuracy in a way that was both meaningful and understandable. The study found that framing inaccuracy as a problem of misclassifying higher-performing versus lower-performing doctors did not affect consumer responses, which simplified one aspect of the communication challenge. However, little was known about consumers' baseline expectations for accuracy in ratings outside health care, or about the most effective methods for explaining measurement error to a general audience.
The implementation of physician performance ratings would likely compel further research into these questions, as health plans sought to accelerate quality improvement, physicians pushed back against imperfect measurement systems, and consumers and plan sponsors tried to figure out where they could find the best value for their health care spending.
Sources and Further Reading
AHRQ -- Health Care Quality -- Federal agency research on quality measurement and improvement.
CMS -- Quality of Care -- Medicare quality initiatives and physician performance data.
NCQA -- National Committee for Quality Assurance -- Health plan accreditation and physician recognition programs.
Health Affairs -- Peer-reviewed health policy research, including studies on public reporting and quality measurement.
Commonwealth Fund -- Research on health system quality, performance measurement, and patient experience.