Friday, September 18, 2020

The accuracy of COVID-19 statistics

Nobody really knows how many people have died in the UK due to COVID-19. There are currently 3 numbers that might give a bit of a clue:

  • The daily statistic of those who've died after a positive test
  • The number who have had COVID-19 mentioned on the death certificate
  • The excess deaths reported compared to a normal year

The last one is more amenable to statistical analysis, but is also subject to a variety of errors: the baseline varies from year to year, and lockdown may increase some types of deaths, while decreasing others (for the latter, consider the reduction in pollution). Still, it's a reasonably well defined number.

The second one is harder to pin down: absent lots of tests and post-mortems, apportioning the cause of death is sometimes going to be a judgement call.

The first one has attracted a lot of attention, and it's the headline number you see in the news. It has the advantage that it's quite well defined (you know, definitively, who has had a test, the outcome, and whether they died). The disadvantage is that it doesn't give you any clue as to whether COVID-19 was actually the cause of death or not - perhaps they got run over by a bus.

Early on, this didn't make much difference. But, over time, the probability that someone would die from another cause obviously increases. So, early in August, the statistic was changed to add a 28-day cut off.

The idea of a cut off is that, very roughly, the number of people who die from COVID-19 after the cut off is offset by those who die of other causes before the cut off. At a high level, it makes sense, because without any cut off the figure is simply going to be wrong, and the error will grow over time.

The question really is whether the correction applied is correct. After all, that 28 days is basically a guess - it's commonly used, so is reasonable for comparison purposes, but it's still a guess.

There are a couple of ways to see if the value for the cut off is reasonable. And for that, we need data. It turns out that we can download the time series from the portal, and have a look at the numbers.

One fortunate thing we have is that the dataset actually includes 3 numbers - raw numbers, with the 28-day cut off, and with a 60-day cut off. As of today, the 28-day cut off removes 6,634 deaths from the starting 44,115, and the 60-day cut off less than half that, at 2,695. These removed deaths are the ones presumed to be from other causes.
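For concreteness, here's a minimal sketch of that comparison (the cumulative totals under each cut off below are implied by the figures quoted above, not copied verbatim from the dataset):

```python
# Totals as of today: 44,115 raw deaths after a positive test; the
# cut-off totals here are implied by the removed counts quoted above.
totals = {"raw": 44_115, "cut_28": 37_481, "cut_60": 41_420}

removed_28 = totals["raw"] - totals["cut_28"]  # removed by the 28-day cut off
removed_60 = totals["raw"] - totals["cut_60"]  # removed by the 60-day cut off

print(removed_28, removed_60)  # 6634 2695
```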

First, given the number of positive tests, does that correction look sensible? We know, roughly, that there were ~300,000 positive tests in the first peak, so that's mostly 4-5 months ago. For the removed deaths to all be from other causes, the 28-day cut off implies a remaining life expectancy of 15-19 years for those tested. For 60 days, the range is 37-46 years.
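The arithmetic behind those ranges is simple enough to sketch. The 300,000 positives and 4-5 month window are the rough figures above, and the model - other-cause deaths accruing at a constant rate of population/E per year, where E is remaining life expectancy - is a deliberate simplification:

```python
positives = 300_000  # rough number of positive tests in the first peak
removed = {"28-day": 6_634, "60-day": 2_695}  # deaths removed by each cut off

for label, deaths in removed.items():
    for months in (4, 5):
        # If other-cause deaths accrue at positives / E per year, then
        # deaths = positives * (months / 12) / E, so:
        implied_e = positives * (months / 12) / deaths
        print(f"{label}, {months} months: ~{implied_e:.0f} years")
```

Running this reproduces the 15-19 year range for the 28-day cut off and 37-46 years for the 60-day one.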

The problem with this is that we don't know the demographics of those tested. However (you can look these things up in actuarial tables), for the 28-day number to be accurate, those tested have to be quite old - over 65, whereas the 60 day number would be right for a population of working age - say typically in their 40s. Given that we know that there was little testing in nursing homes, the 28-day life expectancy looks a bit wrong, whereas the 60-day version looks reasonable if you're testing a lot of health workers. It's not definitive, but there's a hint that 28 days is overcorrecting.

Another thing to do is look at the rate of corrected deaths over time. This is what it looks like for 28 days:

and this is for the 60-day data:

There's quite a difference there. Note that our expectation is that the probability of death from other causes is (approximately) constant over time per person, so the overall rate of such deaths will increase over time. It would be constant if there were just a single cohort entering at a fixed point in time, but we're testing more people as time goes on, so the population is growing - the relatively recent big spike in positive tests hasn't worked its way into the data yet.
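That expectation can be checked with a toy model - the numbers here are entirely made up, just to illustrate the shape. Each positive test adds someone who thereafter dies of other causes at a constant daily hazard; while the tested population keeps growing, the daily count of such deaths can only rise:

```python
hazard = 1e-4  # assumed constant daily other-cause death rate per person
# Crude arrival pattern: a big first peak, then a tail of new positives.
new_positives = [3_000] * 60 + [500] * 120

alive = 0.0
daily_other_deaths = []
for arrivals in new_positives:
    alive += arrivals
    deaths = alive * hazard
    alive -= deaths
    daily_other_deaths.append(deaths)

# While arrivals outpace deaths, the tracked population grows, so the
# daily other-cause death count never declines - a dip like the one in
# the 28-day chart can't come from this mechanism.
assert all(b >= a for a, b in zip(daily_other_deaths, daily_other_deaths[1:]))
```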

From this point of view, the 28-day graph looks a little suspect - it starts to rise 28 days after significant testing, but after day 90 there's a decline. That's plain wrong, indicating that there's a correlation with the time of the test (there are a lot of positive tests in days 30-90 which is when the big first peak was). If they're correlated with the positive test, there's going to be some element of correlation with the cause of the test - namely COVID-19 itself.

By contrast, the 60-day chart has the right shape. The problem is that any large cut off will have the right shape, so it's not telling us anything about the correct number to cut the data at, just that 60 days is beyond it.

The thing is, if you had all the data (including demographics) you could do this properly, and work out what the optimal cut off to minimize errors should be. I just haven't seen that yet. But I'm fairly sure that, while the original quoted numbers overestimated the number of deaths, the new numbers with a 28-day cut off are underestimating the true impact, and they might even be further out the other way than the original figures were.
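To show what that optimisation would look like, here's a sketch with made-up parameters - the COVID time-to-death scale, the other-cause hazard, and the fatality rate per positive test are all assumptions, not estimates from the data. The idea is to pick the cut off where the other-cause deaths wrongly counted before it balance the COVID deaths wrongly excluded after it:

```python
import math

covid_mean_days = 10.0           # assumed time-to-death scale after a positive test
other_daily_hazard = 1 / 15_000  # assumed other-cause hazard (~40-year life expectancy)
covid_deaths_per_case = 0.05     # assumed fatality rate among positives

def net_error(cutoff_days):
    # Per positive test: other-cause deaths wrongly counted before the
    # cut off, minus COVID deaths wrongly excluded after it (assuming
    # an exponential tail for time to death).
    counted_other = other_daily_hazard * cutoff_days
    missed_covid = covid_deaths_per_case * math.exp(-cutoff_days / covid_mean_days)
    return counted_other - missed_covid

# The optimal cut off is where the two error terms balance.
best = min(range(1, 120), key=lambda t: abs(net_error(t)))
print(best)
```

With real demographic and time-to-death data in place of these guesses, the same balancing act would give a defensible cut off rather than a round number.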
