CMS levies billions of dollars in overpayments a year against healthcare providers, based on the use of extrapolation audits.
The use of extrapolation in Medicare and private payer audits has been around for quite some time now. And lest you be of the opinion that extrapolation is not appropriate for claims-based audits, there are many, many court cases that have supported its use, both specifically and in general. Arguing that extrapolation should not have been used in a given audit, unless that argument is supported by specific statistical challenges, is mostly a waste of time.
For background purposes, extrapolation, as it is used in statistics, is a “statistical technique aimed at inferring the unknown from the known. It attempts to predict future data by relying on historical data, such as estimating the size of a population a few years in the future on the basis of the current population size and its rate of growth,” according to a definition created by Eurostat, a component of the European Union. For our purposes, extrapolation is used to estimate what the actual overpayment amount might likely be for a population of claims, based on auditing a smaller sample of that population. For example, say a Uniform Program Integrity Contractor (UPIC) pulls 30 claims from a medical practice from a population of 10,000 claims. The audit finds that 10 of those claims had some type of coding error, resulting in an overpayment of $500. To extrapolate this to the entire population of claims, one might take the average overpayment, which is the $500 divided by the 30 claims ($16.67 per claim) and multiply this by the total number of claims in the population. In this case, we would multiply the $16.67 per claim by 10,000 for an extrapolated overpayment estimate of $166,667.
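The arithmetic in the example above can be sketched in a few lines of Python. This is only an illustration of the simple average-per-claim approach described here. The function name and the assumption that each of the 10 flawed claims was overpaid by exactly $50 are hypothetical, chosen only so the total comes to $500.

```python
def extrapolate_overpayment(sample_overpayments, population_size):
    """Average overpayment per sampled claim, projected onto the whole population."""
    mean_overpayment = sum(sample_overpayments) / len(sample_overpayments)
    return mean_overpayment, mean_overpayment * population_size


# Hypothetical audit sample: 10 claims overpaid by $50 each, 20 claims with no error
sample = [50.0] * 10 + [0.0] * 20

mean_per_claim, estimate = extrapolate_overpayment(sample, population_size=10_000)
print(f"${mean_per_claim:.2f} per claim, ${estimate:,.0f} extrapolated")
# prints: $16.67 per claim, $166,667 extrapolated
```

Note that the 20 claims with no error still count in the denominator, which is why the average is $16.67 rather than $50.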
The big question that normally crops up around extrapolation is this: how accurate are the estimates? And the answer is (wait for it …), it depends. It depends on just how well the sample was created, meaning: was the sample size appropriate, were the units pulled properly from the population, was the sample truly random, and was it representative of the population? The last point is particularly important, because if the sample is not representative of the population (in other words, if the sample data does not look like the population data), then it is likely that the extrapolated estimate will be anything but accurate.
To account for this issue, referred to as “sampling error,” statisticians will calculate something called a confidence interval (CI), which is a range around the estimate that allows for some acceptable amount of error. The higher the confidence level, the wider the range. For example, in the hypothetical audit outlined above, maybe the real average for a 90-percent confidence interval is somewhere between $15 and $18, while, for a 95-percent confidence interval, the true average is somewhere between $14 and $19. And if we were to calculate for a 99-percent confidence interval, the range might be somewhere between $12 and $21. So, the wider the range, the more confident I feel that it captures the true average. Some express the confidence interval as a sense of true confidence, like “I am 90 percent confident the real average is somewhere between $15 and $18,” and while this is not necessarily wrong, per se, it does not communicate the real value of the CI. I have found that the best way to define it would be more like “if I were to pull 100 random samples of 30 claims, audit all of them, and calculate a 90-percent confidence interval for each, about 90 of those intervals would contain the true average,” meaning that for some 1 out of 10 samples, the true average would fall outside of the calculated range – either below the lower boundary or above the upper boundary. The main reason that auditors use this technique is to avoid challenges based on sampling error.
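To make the relationship between confidence level and range concrete, here is a minimal sketch that computes intervals at three confidence levels for a hypothetical 30-claim sample. It assumes a simple normal approximation (a t-distribution would be slightly wider at this sample size), and the sample values are invented for illustration, so the resulting dollar ranges will not match the illustrative $15 to $18 figures quoted above.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(sample, confidence):
    """Two-sided interval for the average overpayment (normal approximation)."""
    center = mean(sample)
    standard_error = stdev(sample) / sqrt(len(sample))
    # Critical value for a two-sided interval at the requested confidence level
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center - z * standard_error, center + z * standard_error


# Hypothetical sample: 10 claims overpaid by $50 each, 20 claims with no error
sample = [50.0] * 10 + [0.0] * 20

widths = {}
for confidence in (0.90, 0.95, 0.99):
    low, high = mean_confidence_interval(sample, confidence)
    widths[confidence] = high - low
    print(f"{confidence:.0%} CI: ${low:.2f} to ${high:.2f}")
```

The interval widens as the confidence level rises. That is the trade-off described above: more confidence that the range captures the true average, at the cost of a less precise range.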
To the crux of the issue, the Centers for Medicare & Medicaid Services (CMS) levies billions of dollars in overpayments a year against healthcare providers, based on the use of extrapolation audits. And while the use of extrapolation is well-established and well-accepted, its use in an audit is not automatic, and depends upon the creation of a statistically valid and representative sample. Thousands of extrapolation audits are completed each year, and for many of these, the targeted provider or organization will appeal the use of extrapolation. In most cases, the appeal is focused on one or more flaws in the methodology used to create the sample and calculate the extrapolated overpayment estimate. For government audits, such as with UPICs, there is a specific appeal process, as outlined in the Medicare Learning Network booklet titled “Medicare Parts A & B Appeals Process.”
On Aug. 20, 2020, the U.S. Department of Health and Human Services Office of Inspector General (HHS OIG) released a report titled “Medicare Contractors Were Not Consistent in How They Reviewed Extrapolated Overpayments in the Provider Appeals Process.” This report opens with the following statement: “although MACs (Medicare Administrative Contractors) and QICs (Qualified Independent Contractors) generally reviewed appealed extrapolated overpayments in a manner that conforms with existing CMS requirements, CMS did not always provide sufficient guidance and oversight to ensure that these reviews were performed in a consistent manner.” These inconsistencies were associated with $42 million in extrapolated overpayments from fiscal years 2017 and 2018 that were overturned in favor of the provider. It’s important to note that at this point, we are only talking about appeal determinations at the first and second level, known as redetermination and reconsideration, respectively.
Redetermination is the first level of appeal, and is adjudicated by the MAC. And while the staff that review the appeals at this level are supposed to have not been involved in the initial claim determination, I believe that most would agree that this step is mostly a rubber stamp of approval for the extrapolation results. In fact, of the hundreds of post-audit extrapolation mitigation cases in which I have been the statistical expert, not a single one was ever overturned at redetermination.
The second level of appeal, reconsideration, is handled by a QIC. In theory, the QIC is supposed to independently review the administrative records, including the appeal results of redetermination. Continuing the point from the prior paragraph, I have to date had only five extrapolation appeals reversed at reconsideration; however, all were due to the fact that the auditor failed to provide the practice with the requisite data, and not due to any specific issues with the statistical methodology. In two of those cases, the QIC notified the auditor that if the auditor were to supply the required information, the QIC would reconsider its decision. And in two other cases, the auditor appealed the decision, and it was reversed again. Only the fifth case held without objection and was adjudicated in favor of the provider.
Maybe this is a good place to note that the entire process for conducting extrapolations in government audits is covered under Chapter 8 of the Medicare Program Integrity Manual (PIM). Altogether, there are only 12 pages within the entire Manual that actually deal with the statistical methodology behind sampling and extrapolation; this is certainly not enough to provide the degree of guidance required to ensure consistency among the different government contractors that perform such audits. And this is what the OIG report is talking about.
Back to the $42 million that was overturned at either redetermination or reconsideration: the OIG report found that this was due to a “type of simulation testing that was performed only by a subset of contractors.” The report goes on to say that “[if] CMS did not intend that the contractors use this procedure, these extrapolations should not have been overturned. Conversely, if CMS intended that contractors use this procedure, it is possible that other extrapolations should have been overturned but were not.” This was quite confusing for me at first, because this “simulation” testing was not well-defined, and also because it seemed to say that if this procedure was appropriate to use, then more contractors should have used it, which would have resulted in more reversals in favor of the provider.
Interestingly, CMS seems to have written itself an out in Chapter 8, section 8.4.1.1 of the PIM, which states that “[f]ailure by a contractor to follow one or more of the requirements contained herein does not necessarily affect the validity of the statistical sampling that was conducted or the projection of the overpayment.” The use of the term “does not necessarily” leaves wide open the possibility that the failure by a contractor to follow one or more of the requirements may affect the validity of the statistical sample, which in turn would affect the validity of the extrapolated overpayment estimate.
Regarding the simulation testing, the report stated that “one MAC performed this type of simulation testing for all extrapolation reviews, and two MACs recently changed their policies to include simulation testing for sample designs that are not well-supported by the program integrity contractor. In contrast, both QICs and three MACs did not perform simulation testing and had no plans to start using it in the future.” And even though it was referenced some 20 times, with the exception of an example given as Figure 2 on page 10, the report never did describe in any detail the type of simulation testing that went on. From the example, it was evident to me that the MACs and QICs involved were using what is known as a Monte Carlo simulation. In statistics, simulation is used to assess the performance of a method, typically when there is a lack of theoretical background. With simulations, the statistician knows and controls the truth. Simulation is used advantageously in a number of situations, including providing the empirical estimation of sampling distributions. Footnote 10 in the report stated that “reviewers used the specific simulation test referenced here to provide information about whether the lower limit for a given sampling design was likely to achieve the target confidence level.” If you are really interested in learning more about it, there is a great paper called “The design of simulation studies in medical statistics” by Burton et al. (2006).
The application of simulation in these types of audits is to “simulate” the audit many thousands of times to see whether the results land where the stated confidence level says they should (for example, whether the lower limit falls at or below the true value about 90 percent of the time for a 90-percent confidence level), thereby testing whether the audit’s reliance on what is known as the Central Limit Theorem (CLT) actually holds.
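Since the report never spells out the procedure, the following is only a rough sketch of what a Monte Carlo coverage test of this kind might look like, not a reconstruction of what any MAC actually ran. It assumes a synthetic right-skewed (lognormal) population of per-claim overpayments, a simple random sample of 30, and a one-sided 90-percent lower limit based on a normal approximation. The function names and parameter values are all hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def lower_limit(sample, confidence=0.90):
    """One-sided lower confidence limit for the average (normal approximation)."""
    z = NormalDist().inv_cdf(confidence)
    return mean(sample) - z * stdev(sample) / sqrt(len(sample))


def simulate_coverage(population, sample_size, trials=2000, confidence=0.90, seed=1):
    """Share of simulated audits whose lower limit is at or below the true average."""
    rng = random.Random(seed)
    true_mean = mean(population)  # known here because we control the population
    covered = 0
    for _ in range(trials):
        sample = rng.sample(population, sample_size)  # one simulated audit
        if lower_limit(sample, confidence) <= true_mean:
            covered += 1
    return covered / trials


# Synthetic, right-skewed population of per-claim overpayments (an assumption)
pop_rng = random.Random(0)
population = [pop_rng.lognormvariate(3.0, 1.2) for _ in range(10_000)]

coverage = simulate_coverage(population, sample_size=30)
print(f"target 90%, simulated coverage {coverage:.1%}")
```

Because the “truth” is known in a simulation, the reviewer can compare the simulated coverage rate directly against the target confidence level. A material gap between the two would suggest that the sampling design does not deliver the confidence it claims.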
Often, the sample sizes used in recoupment-type audits are too small, and this is usually due to a conflict between the sample size calculations and the distributions of the data. For example, in RAT-STATS, the statistical program maintained by the OIG, and a favorite of government auditors, sample size estimates are based on an assumption that the data are normally (or near normally) distributed. A normal distribution is defined by the mean and the standard deviation, and includes a bunch of characteristics that make sample size calculations relatively straightforward. But the truth is, because most auditors use the paid amount as the variable of interest, population data are rarely, if ever, normally distributed. Unfortunately, there is simply not enough room or time to get into the details of distributions, but suffice it to say that, because paid data are bounded on the left at zero (meaning that payments are never less than zero), paid data sets are almost always right-skewed. This means that the tail of the distribution extends a long way to the right.
In these types of skewed situations, sample size normally has to be much larger in order to meet the CLT requirements. So, what one can do is simulate the random sample over and over again to see whether the distribution of the simulated sample averages approaches a normal distribution – and if not, it means that the results of that sample should not be used for extrapolation. And this seems to be what the OIG was talking about in this report. Basically, they said that some of the appeals entities (MACs and QICs) did this type of simulation testing, and others did not. But for those that did perform the tests, the report stated that $41.5 million of the $42 million involved in the reversals of the extrapolations was due to the use of this simulation testing. The OIG seems to be saying this: if this was an unintended consequence, meaning that there wasn’t any guidance in place authorizing this type of testing, then it should not have been done, and those extrapolations should not have been overturned. But if it should have been done, meaning that there should have been some written guidance to authorize that type of testing, then it means that there are likely many other extrapolations that should have been reversed in favor of the provider. A sticky wicket, at best.
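The effect of sample size on skewed data can be illustrated with a small simulation. The sketch below is an assumption-laden illustration, not the OIG's or any contractor's method. It draws repeated samples from a synthetic right-skewed (lognormal) population and measures how skewed the distribution of sample averages remains at different sample sizes. Skewness near zero is used here as one rough indicator of approximate normality, and all names and parameters are hypothetical.

```python
import random
from statistics import fmean, pstdev


def skewness(values):
    """Standardized third moment: roughly 0 for symmetric data, positive for a long right tail."""
    center = fmean(values)
    spread = pstdev(values)
    return sum(((v - center) / spread) ** 3 for v in values) / len(values)


def sample_mean_skewness(population, sample_size, trials=2000, seed=2):
    """Skewness of the averages from repeated simulated samples of a given size."""
    rng = random.Random(seed)
    means = [fmean(rng.sample(population, sample_size)) for _ in range(trials)]
    return skewness(means)


# Synthetic, right-skewed population of paid amounts (an assumption)
pop_rng = random.Random(0)
population = [pop_rng.lognormvariate(3.0, 1.2) for _ in range(10_000)]

print(f"population skewness: {skewness(population):.2f}")
results = {n: sample_mean_skewness(population, n) for n in (30, 100, 400)}
for n, skew in results.items():
    print(f"sample size {n}: skewness of sample averages {skew:.2f}")
```

In a run like this, the skewness of the sample averages should shrink toward zero as the sample size grows. That is the CLT at work, and it also shows why a sample of 30 may not be enough when the underlying paid data have a long right tail.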
Under the heading “Opportunity To Improve Contractor Understanding of Policy Updates,” the report also stated that “the MACs and QICs have interpreted these requirements differently. The MAC that previously used simulation testing to identify the coverage of the lower limit stated that it planned to continue to use that approach. Two MACs that previously did not perform simulation testing indicated that they would start using such testing if they had concerns about a program integrity contractor’s sample design. Two other MACs, which did not use simulation testing, did not plan to change their review procedures.” One QIC indicated that it would defer to the administrative QIC (AdQIC, the central manager for all Medicare fee-for-service claim case files appealed to the QIC) regarding any changes. But it ended this paragraph by stating that “AdQIC did not plan to change the QIC Manual in response to the updated PIM.”
With respect to this issue and this issue alone, the OIG submitted two specific recommendations, as follows:
- Provide additional guidance to MACs and QICs to ensure reasonable consistency in procedures used to review extrapolated overpayments during the first two levels of the Medicare Parts A and B appeals process; and
- Take steps to identify and resolve discrepancies in the procedures that MACs and QICs use to review extrapolations during the appeals process.
In the end, I am not encouraged that we will see any degree of consistency between and within the QIC and MAC appeals in the near future.
Basically, it would appear that the OIG, while having some oversight in the area of recommendations, doesn’t really have any teeth when it comes to enforcing change. I expect that while some reviewers may respond appropriately to the use of simulation testing, most will not, if it means a reversal of the extrapolated findings. In these cases, it is incumbent upon the provider to ensure that these issues are brought up during the Administrative Law Judge (ALJ) appeal.
Programming Note: Listen to Frank Cohen report this story live during the next edition of Monitor Mondays, 10 a.m. Eastern.