This section gives some basic information about the data set and the idea of a mortality rate. Skip it to go directly to the RLS approach.
As of today we have 4689 observations across 55 unique states (the count includes territories such as Guam), with 9 columns per observation, ranging in date from 2020-01-21 to 2020-05-26.
A few sample observations below.
date | state | fips | cases | deaths | incr.cases | incr.deaths | smooth.incr.cases | smooth.incr.deaths |
---|---|---|---|---|---|---|---|---|
2020-04-03 | New Hampshire | 33 | 540 | 7 | 61 | 2 | 54 | 1.33 |
2020-03-09 | Vermont | 50 | 1 | 0 | 0 | 0 | NA | NA |
2020-05-02 | Montana | 30 | 454 | 16 | 2 | 0 | 1 | 0.00 |
2020-05-02 | Hawaii | 15 | 611 | 16 | 1 | 0 | 2 | 0.00 |
2020-03-18 | Illinois | 17 | 286 | 1 | 127 | 0 | 42 | 0.33 |
2020-05-24 | Connecticut | 9 | 40468 | 3693 | 446 | 18 | 392 | 37.00 |
2020-03-24 | Minnesota | 27 | 262 | 1 | 27 | 0 | 31 | 0.00 |
2020-03-24 | Texas | 48 | 857 | 11 | 129 | 4 | 115 | 2.00 |
2020-01-30 | California | 6 | 2 | 0 | 0 | 0 | NA | 0.00 |
2020-04-15 | Virginia | 51 | 6499 | 195 | 329 | 41 | 410 | 18.00 |
2020-05-06 | Texas | 48 | 35441 | 985 | 1155 | 30 | 1062 | 31.67 |
2020-02-19 | Massachusetts | 25 | 1 | 0 | 0 | 0 | 0 | 0.00 |
2020-03-19 | Georgia | 13 | 282 | 10 | 89 | 7 | 40 | 3.00 |
2020-04-21 | Georgia | 13 | 19189 | 810 | 742 | 43 | 768 | 46.67 |
2020-04-13 | Guam | 66 | 719 | 6 | 3 | 1 | 74 | 0.67 |
In theory, the mortality rate in any time period = (cumulative deaths / cumulative cases) during that period, assuming that we measure both metrics (cases and deaths) correctly in each period and attribute each metric to the correct period.
A naive approach for computing mortality rate is simply, on any given day, to compute (# deaths)/(# cases). Of course, these numbers will change every day – and in particular the ratio might change every day – so the answer will be a vector of numbers rather than a particular rate.
Notice that the mortality rate (visualized by state above) varies by state but is rather unstable over time. True, during a 2-month period one should expect some variation in mortality rate - based on levels of congestion in the health system, innovations in care, and other factors. However, such factors should cause a few discrete jumps - and a definition of mortality rate in terms of cumulative deaths and cases is simply not consistent with this expectation. Moreover, once an innovation or regime change occurs, its effect should be to move the mortality rate (perhaps over a period of a few days rather than just one) to a new stable level.
Therefore, it makes sense to compute mortality rate in terms of daily data, and then deal with the challenge of temporal variations in this rate. Our goal is to identify a definition and computational method that will minimize the noise or variation in mortality rate - that is, yield as stable a metric as possible.
We’ll use the quantreg and RLS packages for testing. The analysis uses the smoothed incremental (i.e., daily) cases and deaths, and is restricted to a few states for now. We’ll drop the data for every (date, state) pair where cases = 0 or NA.
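As a rough illustration of that filtering step, here is a minimal Python sketch. The original analysis is done in R, so this is not the author's code. The helper names `is_missing` and `drop_empty_case_rows` are hypothetical, and applying the filter to the `smooth.incr.cases` column is an assumption, since the text only says "cases". The three rows are taken from the sample table earlier in this section.

```python
def is_missing(value):
    """Treat None and the literal string 'NA' as missing."""
    return value is None or value == "NA"


def drop_empty_case_rows(rows, column="smooth.incr.cases"):
    """Keep only rows whose `column` is present and non-zero.

    Assumption: the filter applies to the smoothed daily case count.
    """
    return [r for r in rows if not is_missing(r[column]) and r[column] != 0]


# Three (date, state) rows from the sample table above.
rows = [
    {"date": "2020-03-09", "state": "Vermont", "smooth.incr.cases": None},
    {"date": "2020-02-19", "state": "Massachusetts", "smooth.incr.cases": 0},
    {"date": "2020-03-24", "state": "Texas", "smooth.incr.cases": 115},
]

kept = drop_empty_case_rows(rows)
print([r["state"] for r in kept])
```

Rows with zero cases would also break any later step that takes a logarithm of the case count, which is one practical reason to remove them up front.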
date | state | cases | deaths | smooth.incr.cases | smooth.incr.deaths | lead1.incr.deaths | lead5.incr.deaths | lead6.incr.deaths | lead9.incr.deaths |
---|---|---|---|---|---|---|---|---|---|
2020-03-27 | Arizona | 665 | 15 | 94 | 3.0 | 3.0 | 3.7 | 5.0 | 9.7 |
2020-03-28 | Arizona | 773 | 15 | 103 | 3.0 | 3.3 | 5.0 | 5.7 | 8.7 |
2020-03-29 | Arizona | 929 | 18 | 116 | 3.3 | 1.7 | 5.7 | 8.0 | 8.0 |
2020-03-30 | Arizona | 1169 | 20 | 140 | 1.7 | 3.0 | 8.0 | 9.7 | 5.3 |
2020-03-31 | Arizona | 1298 | 24 | 149 | 3.0 | 3.7 | 9.7 | 8.7 | 7.3 |
2020-04-01 | Arizona | 1413 | 29 | 151 | 3.7 | 5.0 | 8.7 | 8.0 | 6.7 |
2020-04-02 | Arizona | 1600 | 35 | 156 | 5.0 | 5.7 | 8.0 | 5.3 | 10.7 |
2020-04-03 | Arizona | 1769 | 41 | 166 | 5.7 | 8.0 | 5.3 | 7.3 | 9.3 |
2020-04-04 | Arizona | 2019 | 53 | 182 | 8.0 | 9.7 | 7.3 | 6.7 | 8.3 |
2020-04-05 | Arizona | 2269 | 64 | 183 | 9.7 | 8.7 | 6.7 | 10.7 | 6.3 |
2020-04-06 | Arizona | 2465 | 67 | 194 | 8.7 | 8.0 | 10.7 | 9.3 | 8.3 |
2020-04-07 | Arizona | 2575 | 77 | 194 | 8.0 | 5.3 | 9.3 | 8.3 | 10.0 |
2020-04-08 | Arizona | 2726 | 80 | 188 | 5.3 | 7.3 | 8.3 | 6.3 | 14.3 |
2020-04-09 | Arizona | 3018 | 89 | 208 | 7.3 | 6.7 | 6.3 | 8.3 | 13.0 |
2020-04-10 | Arizona | 3112 | 97 | 182 | 6.7 | 10.7 | 8.3 | 10.0 | 12.0 |
The data set df.RLS has 9 states (Arizona, California, Florida, Maryland, Michigan, Minnesota, New York, Texas, Washington). For deaths, we recognize that deaths on date t should really be attributed to the cases on date t-k, i.e., cases are a leading indicator of deaths.
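The lead-k columns in the table above look like the smoothed daily deaths shifted k days forward: for example, lead1.incr.deaths on 2020-03-27 (3.0) equals smooth.incr.deaths on 2020-03-28, and lead5.incr.deaths on 2020-03-27 (3.7) equals smooth.incr.deaths on 2020-04-01. A minimal Python sketch of that shift follows. The function name `lead` is hypothetical, and the original columns were presumably built in R.

```python
def lead(values, k):
    """Shift a series k steps earlier: result[t] = values[t + k].

    Positions that would read past the end of the series become None.
    """
    n = len(values)
    return [values[t + k] if t + k < n else None for t in range(n)]


# Arizona smooth.incr.deaths, 2020-03-27 through 2020-04-05, from the table above.
smooth_deaths = [3.0, 3.0, 3.3, 1.7, 3.0, 3.7, 5.0, 5.7, 8.0, 9.7]

lead1 = lead(smooth_deaths, 1)  # deaths one day after each case date
lead5 = lead(smooth_deaths, 5)  # deaths five days after each case date
print(lead1)
print(lead5)
```

In a multi-state data set the shift would have to be applied within each state separately, so that one state's deaths never leak into another state's rows.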
Let’s start by computing a naive mortality rate, defined simply as 100*deaths/cases for each state.
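For concreteness, the naive rate can be sketched in a few lines of Python. This is an illustration, not the author's R code, and the function name `naive_mortality_rate` is hypothetical. The inputs are the cumulative Arizona cases and deaths for 2020-03-27 through 2020-03-31 from the table above.

```python
def naive_mortality_rate(cases, deaths):
    """Return 100 * deaths / cases per day, or None where cases is 0 or missing."""
    rates = []
    for c, d in zip(cases, deaths):
        if c is None or d is None or c == 0:
            rates.append(None)
        else:
            rates.append(100.0 * d / c)
    return rates


# Cumulative Arizona figures, 2020-03-27 to 2020-03-31.
cases = [665, 773, 929, 1169, 1298]
deaths = [15, 15, 18, 20, 24]

rates = naive_mortality_rate(cases, deaths)
print([round(r, 2) for r in rates])
```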
The notable thing here is how unstable the mortality rate is over time – in other words, if you were to measure it on any particular day, the picture you get is quite far from the truth or from what you might see on another day.
Now let’s run a log-log recursive least squares regression. Let y represent “daily change in daily smoothed-lagged deaths”, and x represent “daily change in smoothed cases”. We believe the relationship between them is y = b*x where b = mortality rate. Taking logs of both sides, we have log(y) = log(b) + log(x).
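To make the idea concrete, here is a minimal, self-contained Python sketch of a textbook recursive least squares update for a regression of log(y) on an intercept and log(x). It is not the implementation in the RLS package used for this analysis. The function name `rls_loglog`, the forgetting factor `lam`, and the initial covariance scale `delta` are all assumptions made for illustration, and the data are synthetic, generated with a known b = 0.04 so the recovered value can be checked.

```python
import math


def rls_loglog(x, y, lam=1.0, delta=1e6):
    """Recursive least squares for log(y) = intercept + slope * log(x).

    lam is the forgetting factor (1.0 means no forgetting); delta scales the
    initial covariance. Returns one (intercept, slope, exp(intercept)) tuple
    per processed observation.
    """
    theta = [0.0, 0.0]                        # [intercept, slope]
    P = [[delta, 0.0], [0.0, delta]]          # covariance-like matrix
    path = []
    for xi, yi in zip(x, y):
        if xi <= 0 or yi <= 0:
            continue                          # log is undefined, skip the row
        phi = [1.0, math.log(xi)]             # regressor vector
        z = math.log(yi)                      # target
        # P @ phi (P stays symmetric, so this is also (phi^T P)^T)
        Pphi = [P[0][0] * phi[0] + P[0][1] * phi[1],
                P[1][0] * phi[0] + P[1][1] * phi[1]]
        denom = lam + phi[0] * Pphi[0] + phi[1] * Pphi[1]
        gain = [Pphi[0] / denom, Pphi[1] / denom]
        err = z - (theta[0] * phi[0] + theta[1] * phi[1])
        theta = [theta[0] + gain[0] * err, theta[1] + gain[1] * err]
        # P <- (P - gain * phi^T P) / lam
        P = [[(P[i][j] - gain[i] * Pphi[j]) / lam for j in range(2)]
             for i in range(2)]
        path.append((theta[0], theta[1], math.exp(theta[0])))
    return path


# Synthetic check: y = 0.04 * x exactly, so b should come out near 0.04.
x = [50, 100, 150, 200, 400, 800]
y = [0.04 * xi for xi in x]
path = rls_loglog(x, y)
intercept, slope, b_hat = path[-1]
print(intercept, slope, b_hat)
```

Because the estimate is updated one observation at a time, `path` traces how the implied mortality rate evolves as data arrive. A forgetting factor below 1 would weight recent days more heavily, at the cost of a noisier estimate.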
We’ll run the RLS on
log(y) ~ int + slope * log(x)
so that the “int” = log(b), hence b = exp(int). If the model y = b*x holds, the fitted slope should be close to 1. Here’s what we get.