Physician Performance Measurement: A Key to Higher Quality and Lower Cost Growth
Originally published by the Center for Studying Health System Change
Published: June 2009
Updated: April 6, 2026
Measuring Physician Quality and Cost Was Gaining Momentum, but Flawed Methods Threatened to Undermine It
The United States was spending more than $2 trillion annually on health care by 2007, yet patient outcomes lagged those of many developed countries that spent far less per person. Physicians -- directly and indirectly -- shaped the quality and cost of a substantial share of that spending, influencing hospital admissions, prescription choices, diagnostic testing, and referral patterns. This made physician performance measurement an attractive lever for improvement. A Center for Studying Health System Change (HSC) Commentary (No. 3, June 2009) by Debra A. Draper examined the growing wave of measurement programs, the methodological problems dogging them, and the risk that a promising tool could become a missed opportunity if its shortcomings were not addressed.
Why Physician Performance Mattered So Much
At a minimum, physicians influenced spending in three categories that together accounted for roughly 64 percent of national health expenditures: hospital care (32 percent), physician and clinical services (22 percent), and prescription drugs (10 percent). The gap between what the country spent and the results it got had created growing pressure to figure out which physicians were delivering high-quality, efficient care and which were not -- and to use that information to drive improvement.
Much of the push came from large national employers who wanted to give their workers better information about the doctors they were seeing. As cost sharing increased and employees shouldered more financial responsibility for health care decisions, the argument for providing physician quality and cost data grew stronger. Health plans responded by building performance measurement programs they could market to employer clients and present to enrollees.
How Plans Built Their Programs
Health plans developed physician ranking programs under names like Aexcel Specialist Network (Aetna), Blue Precision (Blue Cross Blue Shield), Care Network (CIGNA), Preferred Network (Humana), and Premium Designation Program (UnitedHealthcare). The basic approach was similar across plans: use claims and administrative data to evaluate physicians on quality and cost metrics, then make those evaluations available to enrollees. In some cases, patients received financial incentives -- lower copayments, for example -- to visit physicians rated as high performers. Plans rarely paid bonuses directly to top-rated doctors.
The theory behind these programs was straightforward: publish physician ratings, patients shift to higher-performing doctors, lower-performing doctors respond by improving their care, and overall quality and efficiency go up. In practice, the execution was considerably messier.
The Credibility Problem
Draper identified four categories of methodological weakness that plagued plan programs. First, data credibility was shaky. Plans relied on their own claims data, which were considerably less reliable than medical record review for capturing what physicians actually did and why. Claims data could miss services provided, fail to record legitimate clinical reasons for deviating from standard protocols, and misattribute care to the wrong physician. Plans generally did not validate their claims data against other sources, such as the electronic medical records maintained by large physician groups.
Second, sample sizes were often too small. Because any single plan's patients represented only a fraction of a physician's total panel, assessments based on just that slice could produce skewed results. If a plan's enrollees happened to be disproportionately sicker than the physician's overall patient population, the resulting evaluation might tag the doctor as a poor performer when the opposite was true. Minimum sample thresholds were typically set low -- sometimes fewer than a dozen patients -- partly because plans had no way to access data from other payers.
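The small-sample problem can be made concrete with a minimal simulation. The numbers below (an 85 percent true adherence rate, a 75 percent rating threshold, a 10-patient panel) are purely illustrative, not drawn from the Commentary, but they show how often a single plan's thin slice of a physician's panel can mislabel a genuinely good performer:

```python
import random

random.seed(0)

def misclassification_rate(true_rate, n_patients, threshold, trials=10_000):
    """Fraction of simulated evaluations that flag a physician whose true
    adherence rate clears the threshold, based on only n_patients outcomes."""
    flagged = 0
    for _ in range(trials):
        successes = sum(random.random() < true_rate for _ in range(n_patients))
        if successes / n_patients < threshold:
            flagged += 1
    return flagged / trials

# A physician whose true rate (0.85) comfortably clears the 0.75 bar.
small = misclassification_rate(0.85, n_patients=10, threshold=0.75)
large = misclassification_rate(0.85, n_patients=100, threshold=0.75)
print(f"10-patient sample flags the physician {small:.0%} of the time")
print(f"100-patient sample flags the physician {large:.0%} of the time")
```

With only ten patients, random variation alone flags this good performer in roughly one evaluation out of five; with a hundred patients the error all but disappears, which is why pooling data across payers mattered so much.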
Third, there was no standardized set of measures. Different plans used different quality indicators, defined them differently, and applied different risk-adjustment methods. Efficiency was generally assessed using episode-of-care groupers that bundled all related costs and attributed them to a single physician, even when that doctor had limited control over what other providers did. The episode-grouper methodology was still maturing, and physicians objected to being held accountable for costs generated by other providers' decisions.
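The attribution problem with episode groupers can be sketched in a few lines. The claim records, provider names, and dollar figures below are hypothetical, and real groupers used far more elaborate attribution rules, but the core mechanic is the same: every cost in an episode rolls up to one attributed physician, regardless of who generated it.

```python
from collections import defaultdict

# Hypothetical claim records: (episode_id, billing_provider, cost).
claims = [
    ("migraine-001", "Dr. Lee",    120.00),  # office visit
    ("migraine-001", "ER Group",   950.00),  # emergency department visit
    ("migraine-001", "Pharmacy",    40.00),  # rescue medication
    ("backpain-002", "Dr. Lee",    200.00),
    ("backpain-002", "Imaging Co", 600.00),  # MRI performed elsewhere
]

# The physician each episode is attributed to (e.g., the doctor with
# the most evaluation-and-management visits in the episode).
attribution = {"migraine-001": "Dr. Lee", "backpain-002": "Dr. Lee"}

# Bundle all related costs into episodes, then assign each episode's
# total to its attributed physician.
episode_cost = defaultdict(float)
for episode, _provider, cost in claims:
    episode_cost[episode] += cost

physician_cost = defaultdict(float)
for episode, total in episode_cost.items():
    physician_cost[attribution[episode]] += total

print(physician_cost["Dr. Lee"])  # 1910.0 -- includes ER, pharmacy, imaging
```

Dr. Lee billed only $320 of the $1,910 attributed to her; the rest came from emergency, pharmacy, and imaging providers whose decisions she may not have controlled, which is exactly the objection physicians raised.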
Fourth, how plans weighed and combined quality and cost measures into an overall assessment was proprietary and opaque. One plan might emphasize quality; another might tilt toward cost. The result was that the same physician could be rated high-performing by one plan and not by another. Virginia Mason Medical Center in Seattle, a well-regarded integrated system, experienced exactly this when multiple plans rolled out their programs in that market and produced conflicting ratings for the same doctors.
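Because the weighting schemes were proprietary, any reconstruction is guesswork; the weights and threshold below are invented for illustration. Still, a simple weighted composite shows how the same physician could clear one plan's bar and miss another's:

```python
def composite(quality, efficiency, quality_weight):
    """Weighted composite on a 0-100 scale. Actual plan formulas were
    opaque; this linear blend and these weights are illustrative only."""
    return quality_weight * quality + (1 - quality_weight) * efficiency

# One physician, scored identically by two hypothetical plans.
quality_score, efficiency_score = 90, 55

plan_a = composite(quality_score, efficiency_score, quality_weight=0.7)  # quality-heavy
plan_b = composite(quality_score, efficiency_score, quality_weight=0.3)  # cost-heavy

THRESHOLD = 70  # cutoff for a "high performer" designation
for name, score in [("Plan A", plan_a), ("Plan B", plan_b)]:
    label = "high performer" if score >= THRESHOLD else "not designated"
    print(f"{name}: {score:.1f} -> {label}")
```

Identical inputs yield 79.5 under the quality-heavy weighting and 65.5 under the cost-heavy one, so one plan designates the physician and the other does not: the Virginia Mason scenario in miniature.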
Legal and Political Fallout
The methodological weaknesses generated real-world consequences. In 2006, the Washington State Medical Association sued Regence Blue Shield, alleging the plan used flawed methods and outdated data to exclude physicians from its high-performance network. Regence suspended the program. In 2007, New York Attorney General Andrew Cuomo investigated physician ranking programs across the state, raising concerns that plans' profit motives were influencing the accuracy of rankings and pushing consumers toward physicians chosen primarily for low cost rather than quality. The investigation resulted in agreements requiring plans to use national quality measures, base assessments on more than just costs, and achieve 100 percent compliance on external reviews of their ranking methodologies.
The American Medical Association described plan methodologies as "black-box" approaches and called for greater transparency. When problems emerged with one plan's program, physicians tended to extend their skepticism to all plans' measurement efforts, making physician engagement in performance improvement even harder to achieve.
Measurement Without Support or Rewards Was Not Enough
Beyond methodology, Draper argued that measurement alone was insufficient to drive improvement. Plans needed to couple their assessments with two additional elements: actionable support for physicians willing to improve and meaningful rewards for those demonstrating good results. On the support side, most plans failed to present their data in a way physicians could actually use to change practice. An exception was Aetna's collaboration with Virginia Mason, where the plan provided detailed claims data broken out by individual physician, practice site, patient, and cost center. That granular information allowed the system to identify specific cost-reduction opportunities -- for example, discovering that high migraine-episode costs stemmed from patients going to the emergency department because they lacked prescribed rescue medication.
Benchmarking individual physicians against relevant peer groups could also drive improvement by tapping into professional competitiveness. But assessments needed to be timely: plans typically ran evaluations at most once a year, using data that were already a year old, so results might no longer reflect current practice patterns. On the rewards side, incentives were generally too small to capture physician attention. Often the only reward for high performance was a designation in the plan's provider directory. Plans were reluctant to increase aggregate payments, and they lacked confidence in their methodologies' ability to support penalizing lower-performing physicians.
The Case for Treating Measurement as a Public Good
Draper argued that physician performance measurement had implications far beyond any single plan's enrollees -- it affected the health care of the entire population and should be treated as a public good rather than a competitive differentiator. Several steps could move measurement in that direction. Standardizing measures and assessment methods would eliminate the conflicting results that plagued the current plan-by-plan approach. Combining data from all payers -- commercial plans, Medicare, and Medicaid -- would solve the sample-size problem and give a complete picture of each physician's practice. The Centers for Medicare and Medicaid Services' Generating Medicare Physician Quality Performance Measurement Results (GEM) project offered a potential framework, providing Medicare-derived performance data to regional collaboratives that could merge it with commercial data.
But achieving this vision required a convening entity with the authority and credibility to cut through competitive dynamics. CMS was perhaps the only organization positioned to play that role, potentially serving as a central data repository, standardizing the process, and eliminating the conflicting results that individual plan efforts produced. Without such steps, Draper warned, physician performance measurement risked becoming a squandered opportunity -- an idea with enormous potential to improve American health care that instead generated controversy, distrust, and negligible results.
Sources and Further Reading
CMS -- National Health Expenditure Data -- Official data on U.S. health spending trends.
Kaiser Family Foundation -- Health Costs -- Analysis of health care costs and spending.
Health Affairs -- Peer-reviewed health policy research.
Robert Wood Johnson Foundation -- Health policy research and programs. Funded HSC's research.
Commonwealth Fund -- Research on health care costs and system performance.