John Appleyard, updated 16th April 2020
Disclaimer: I trained as a physicist to PhD level many years ago, and my career required good numeracy. However, I have no special expertise in epidemiology or statistics.
I’ve done some analysis on data downloaded from worldometers.info, with a view to understanding why different countries have such different case fatality rates.
The data comprises 3 variables for each of 110 countries with more than 100 cases, and at least one death. China is excluded because testing data is not available. The variables are the number of confirmed cases (c), deaths (d), and tests (t) per million of population. The graph below shows that a good fit to the observed apparent fatality rate, d/c, can be obtained using the equation:
d/c=0.6*sqrt(d/t)
Only a dozen countries diverge from the prediction by more than a factor of 2 (indicated by the tramlines). It is possible to obtain slightly better fit metrics by multiple regression, but the difference is small, and not obvious on a scatter plot.
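As a minimal sketch of how the fitted equation and the factor-of-2 tramlines can be checked for a single country, the Python below evaluates d/c = 0.6*sqrt(d/t) and compares it with the observed ratio. The function names and the example figures are hypothetical and chosen only for illustration; they are not taken from the worldometers data.

```python
import math

K = 0.6  # empirical constant from the fitted equation d/c = 0.6*sqrt(d/t)


def predicted_cfr(d, t, k=K):
    """Apparent fatality rate d/c predicted from deaths and tests per million."""
    return k * math.sqrt(d / t)


def within_tramlines(observed, predicted, factor=2.0):
    """True if observed and predicted differ by no more than `factor` either way."""
    ratio = observed / predicted
    return 1.0 / factor <= ratio <= factor


# Hypothetical country (illustrative numbers only, not from the dataset):
c, d, t = 2000.0, 100.0, 10000.0  # cases, deaths, tests per million
model = predicted_cfr(d, t)       # 0.6 * sqrt(100 / 10000)
observed = d / c                  # apparent fatality rate from the raw counts
print(f"observed {observed:.3f}, model {model:.3f}, "
      f"inside tramlines: {within_tramlines(observed, model)}")
```

Comparing the ratio rather than the difference matches the way the tramlines are drawn on a logarithmic scatter plot.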
There are all sorts of reasons to expect different countries to have different mortality rates, including:
- Different health systems – some primitive, and others advanced
- Overloading makes even advanced health systems less effective
- Different demographics – Covid-19 is worse for older populations
- Time lags – e.g. confirmed cases may not survive, and reports may be delayed
- Different criteria for counting deaths
- Political massaging of published data
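The conclusion is that, for all but about a dozen countries, these factors taken together shift the apparent fatality rate by no more than a factor of about 2 from the model's prediction.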
Figure 2 shows a prediction of the number of reported cases per million, using the formula
c = sqrt(d*t)/0.6
This is not a prediction of the true number of cases, but of the number that are confirmed, given the current deaths and tests per million population. The true number of cases is generally many times larger.
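The formula for confirmed cases is the same relationship as the fatality-rate equation, rearranged. A short Python sketch, using made-up (d, t) pairs rather than real country data, confirms that the two forms agree and shows the square-root dependence on testing; the function names are illustrative only.

```python
import math

K = 0.6  # same empirical constant as in d/c = 0.6*sqrt(d/t)


def predicted_cases(d, t, k=K):
    """Confirmed cases per million predicted from deaths and tests per million."""
    return math.sqrt(d * t) / k


def predicted_cfr(d, t, k=K):
    """Apparent fatality rate d/c from the first form of the model."""
    return k * math.sqrt(d / t)


# Made-up (deaths, tests) pairs per million, for illustration only.
for d, t in [(1.0, 500.0), (50.0, 8000.0), (300.0, 20000.0)]:
    c_pred = predicted_cases(d, t)
    # Dividing deaths by predicted cases recovers the fatality-rate form.
    assert math.isclose(d / c_pred, predicted_cfr(d, t))
    print(f"d={d:7.1f}  t={t:8.1f}  predicted c={c_pred:9.1f}")

# With deaths held fixed, quadrupling the tests doubles the predicted cases.
print(predicted_cases(100.0, 40000.0) / predicted_cases(100.0, 10000.0))
```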
In general terms, countries move from bottom left to top right as the pandemic progresses. The prediction spans almost 4 orders of magnitude, but even so, almost all countries are within the tramlines.
Figure 3 shows how the model fits the time series of UK confirmed case numbers:
The UK time series is also superimposed, as an orange trace, on Figures 1 and 2.
If we assume c = t = 1 million (i.e. the entire population is tested and found to be infected), the model equation reduces to sqrt(d/c) = 0.6, so it predicts a mortality of 0.6 squared, or 36%. This is clearly wrong, and suggests that as the proportion infected increases, the curve turns up so that the mortality at saturation is about 1%. This point is indicated by a red spot on Figures 2 and 3.
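The 36% figure can be reproduced by inverting the model for d. The sketch below assumes nothing beyond the model equation itself; the helper name implied_deaths is hypothetical.

```python
K = 0.6  # empirical constant from c = sqrt(d*t)/0.6


def implied_deaths(c, t, k=K):
    """Deaths per million the model requires for given c and t.

    Obtained by inverting c = sqrt(d*t)/k, which gives d = (k*c)**2 / t.
    """
    return (k * c) ** 2 / t


N = 1_000_000.0            # everyone tested, everyone confirmed (per million)
d = implied_deaths(N, N)   # deaths per million implied at saturation
print(f"implied deaths per million: {d:.0f}")
print(f"implied mortality d/c: {d / N:.0%}")
```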
If testing were random, we would expect the number of cases found to be proportional both to the number of tests and to the proportion of the population that is infected. The second factor is unknown, but the number of deaths is probably a reasonable proxy, so we have:
c ∝ d.t
In the early stages of the pandemic, testing is not random, but is designed to find as many cases as possible. It focuses on a small part of the population comprising the contacts of existing cases, medical staff, and those with symptoms. If the size of that reduced population is proportional to c, then the number of cases found will be
c ∝ d.t/c
Multiplying both sides by c gives c.c ∝ d.t, and so
c ∝ sqrt(d.t)
which is in the form of the model equation.
As the number of cases and tests increases to include a significant proportion of the population, testing is likely to become more random. If the true mortality rate is 1%, that suggests that the actual course of the pandemic veers away from the extrapolation of the model and on to the dashed line with c ∝ d.t passing through the red spot in Figures 2 and 3.
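To make the two regimes concrete, here is a rough Python sketch. It assumes that the red spot corresponds to c = t = 1E6 with d/c = 1%, which fixes the constant of proportionality in c ∝ d.t; that normalisation is an interpretation of the text, not something taken from the spreadsheet, and the function names are hypothetical.

```python
import math

K = 0.6                  # empirical constant of the square-root model
N = 1_000_000.0          # per-million normalisation
TRUE_MORTALITY = 0.01    # the 1% saturation mortality assumed in the text


def model_cases(d, t, k=K):
    """Targeted-testing regime: c = sqrt(d*t)/0.6."""
    return math.sqrt(d * t) / k


def random_testing_cases(d, t, mortality=TRUE_MORTALITY):
    """Random-testing regime, c proportional to d*t.

    Assumed scaling: true infections (d / mortality) times the fraction
    of the population tested (t / N), so the line passes through
    c = t = 1e6 when d/c equals `mortality`.
    """
    return (d / mortality) * (t / N)


# At the assumed red spot: everyone tested, deaths at 1% of the population.
d_sat, t_sat = TRUE_MORTALITY * N, N
c_random = random_testing_cases(d_sat, t_sat)
c_model = model_cases(d_sat, t_sat)
print(f"random-testing line: {c_random:,.0f} cases per million")
print(f"square-root model:   {c_model:,.0f} cases per million")
print(f"ratio: {c_random / c_model:.1f}")
```

Under these assumptions the square-root model falls short of the saturation point by the ratio printed on the last line, which is the gap the dashed line has to close.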
The model predicts that the number of reported cases varies with the square root of the number of tests. We can therefore get an estimate of the true number of infections per million (c0), using
c0 = c . sqrt(1E6 / t)
For the UK, on 14th April, that is about 1.8% of the population. However, that is probably an underestimate, as we suspect that the model breaks down as the number of tests is increased towards saturation. In the case where c = t = 1E6, a factor of 6 increase in the extrapolated value of c was required to bring the predicted mortality down from 36% to 1%. If the same were true for lower infection rates, we’d have to increase the computed value of c0 by a factor of 6, meaning that around 10% of the population has been infected. The true correction factor is most likely between 1 and 6, which puts the proportion infected somewhere between about 1.8% and 10%. Figure 4 shows how the transition from the model to a different regime might work.
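The extrapolation to c0, and the range implied by a correction factor of between 1 and 6, can be sketched as follows. The inputs are round hypothetical numbers, not the UK figures, and the function name is illustrative.

```python
import math

N = 1_000_000.0  # per-million normalisation


def estimated_infections(c, t):
    """c0 = c * sqrt(1e6 / t): confirmed cases extrapolated to testing everyone."""
    return c * math.sqrt(N / t)


# Hypothetical inputs per million (round numbers, not UK data).
c, t = 1000.0, 10000.0
c0 = estimated_infections(c, t)

low = c0 / N        # correction factor of 1: the model holds all the way up
high = 6 * c0 / N   # correction factor of 6: the saturation correction applies fully
print(f"c0 = {c0:,.0f} per million")
print(f"estimated infected: between {low:.1%} and {high:.1%} of the population")
```

Because c0 scales with the inverse square root of t, a country that has tested only 1% of its population has its confirmed count multiplied by 10 before any further correction.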
You can download an Excel spreadsheet containing the data, workings and charts here.