As of today we have 4524 observations across 55 unique states, with 9 columns for each observation, ranging in date from 2020-01-21 to 2020-05-23.
A few sample observations below.
date | state | fips | cases | deaths | incr.cases | incr.deaths | smooth.incr.cases | smooth.incr.deaths |
---|---|---|---|---|---|---|---|---|
2020-04-10 | Colorado | 8 | 6510 | 253 | 308 | 26 | 312.0 | 22.6 |
2020-04-28 | Georgia | 13 | 23607 | 1022 | 378 | 41 | 540.4 | 30.2 |
2020-04-15 | Indiana | 18 | 8955 | 436 | 428 | 49 | 409.6 | 27.2 |
2020-04-16 | Nebraska | 31 | 1094 | 24 | 100 | 2 | 75.2 | 1.2 |
2020-04-30 | Nebraska | 31 | 4332 | 70 | 456 | 3 | 286.8 | 3.6 |
2020-04-04 | Nebraska | 31 | 339 | 8 | 43 | 2 | 30.8 | 1.0 |
2020-04-20 | Georgia | 13 | 18447 | 767 | 828 | 85 | 772.8 | 39.6 |
2020-04-21 | Iowa | 19 | 3641 | 83 | 466 | 4 | 300.0 | 4.6 |
2020-05-14 | Iowa | 19 | 13675 | 318 | 386 | 12 | 400.8 | 13.2 |
2020-04-18 | Ohio | 39 | 10222 | 451 | 1115 | 33 | 649.4 | 35.4 |
2020-05-05 | Massachusetts | 25 | 70271 | 4212 | 1184 | 122 | 1613.2 | 130.0 |
2020-05-14 | California | 6 | 74947 | 3039 | 1729 | 25 | 1624.6 | 61.4 |
2020-04-16 | Georgia | 13 | 15644 | 611 | 1061 | 42 | 676.6 | 35.8 |
2020-04-02 | Alaska | 2 | 146 | 2 | 3 | 0 | 8.8 | 0.2 |
2020-04-19 | Maryland | 24 | 12892 | 548 | 513 | 14 | 684.0 | 49.2 |
Let’s see if there is a lot of noise in the daily data. We can check for this by looking at daily cases (or deaths) to see if they oscillate, or by looking at the cumulative counts to see if there are flat stretches followed by spikes.
Let’s start by looking at the daily data (on a log scale).
We’ll want to look at cases and deaths together. However, the two are on different scales: we know that mortality rate = 100*(#deaths)/(#cases), so deaths are a small fraction of cases. Mortality rate can vary over time, and certainly can also vary by geographic unit (such as state or county). For now, in order to plot cases and deaths on the same axis, let’s define a state-specific scale parameter scl.new = max(cases)/max(deaths), that is, roughly the number of cases that lead to one death.
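As a rough illustration of the scale parameter, here is a minimal Python sketch. It assumes the maxima are taken over the cumulative cases and deaths columns (the text does not say which columns are used), and the function name `state_scale` is hypothetical. The rows are copied from the sample table above.

```python
from collections import defaultdict

# (state, cumulative cases, cumulative deaths), copied from the sample table.
rows = [
    ("Nebraska", 339, 8),
    ("Nebraska", 1094, 24),
    ("Nebraska", 4332, 70),
    ("Iowa", 3641, 83),
    ("Iowa", 13675, 318),
]

def state_scale(rows):
    """Return {state: max(cases) / max(deaths)}, skipping states with no deaths."""
    max_cases = defaultdict(int)
    max_deaths = defaultdict(int)
    for state, cases, deaths in rows:
        max_cases[state] = max(max_cases[state], cases)
        max_deaths[state] = max(max_deaths[state], deaths)
    # A state with zero deaths has no defined scale, so it is left out.
    return {s: max_cases[s] / max_deaths[s] for s in max_cases if max_deaths[s] > 0}

scl_new = state_scale(rows)
for state, scale in sorted(scl_new.items()):
    print(f"{state}: about {scale:.1f} cases per death")
```

Multiplying a state's deaths by its scale puts the deaths line at the same height as the cases line, which is what makes a shared axis readable.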
Separating out by state, here are daily cases (black) and deaths (red):
There are a lot of up-and-down spikes in the daily data, likely due to reporting problems. This suggests that multi-day smoothing would be useful. Here’s the same picture with smoothed cases (5-day rolling average).
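A 5-day rolling average can be sketched as follows. This version uses a trailing window (the current day and the four days before it) and leaves the first four days undefined. Whether the smooth.incr columns use a trailing or a centered window is not stated, so that choice is an assumption, and the daily counts below are made up for illustration.

```python
def rolling_mean(values, window=5):
    """Trailing rolling mean; the first window - 1 entries are None."""
    out = []
    running = 0.0
    for i, v in enumerate(values):
        running += v
        if i >= window:
            # Drop the value that just fell out of the window.
            running -= values[i - window]
        out.append(running / window if i >= window - 1 else None)
    return out

# Made-up daily counts with the kind of reporting spikes described above.
daily = [10, 0, 35, 5, 50, 20, 40]
smooth = rolling_mean(daily)
print(smooth)  # [None, None, None, None, 20.0, 22.0, 30.0]
```

The smoothed series moves from 20 to 30 without the zero-then-spike pattern of the raw counts.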
Still, a few states behave badly because their numbers are extremely low (few cases and generally very few deaths). These are: Alaska, Hawaii, Montana, North Dakota, South Dakota, Vermont, Wyoming, Guam, Northern Mariana Islands, Virgin Islands. Let’s drop them from further analysis. These states collectively account for 366499 cases out of a total of 52359280 cases in the US (just 0.7% of cases).
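The filtering step and the share calculation can be sketched in a few lines of Python. The helper names `drop_states` and `percent_share` are hypothetical, and the three example rows are taken from the sample table above. The two totals are the ones quoted in the text.

```python
# States and territories excluded for having very low counts (list from the text).
EXCLUDED = {
    "Alaska", "Hawaii", "Montana", "North Dakota", "South Dakota",
    "Vermont", "Wyoming", "Guam", "Northern Mariana Islands", "Virgin Islands",
}

def drop_states(rows, excluded=EXCLUDED):
    """Keep only the rows whose state is not in the excluded set."""
    return [row for row in rows if row["state"] not in excluded]

def percent_share(part, total):
    """Share of `part` in `total`, as a percentage."""
    return 100.0 * part / total

rows = [
    {"state": "Alaska", "cases": 146},
    {"state": "Ohio", "cases": 10222},
    {"state": "Iowa", "cases": 3641},
]
kept = drop_states(rows)
share = percent_share(366499, 52359280)
print([row["state"] for row in kept], f"{share:.1f}%")
```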
After eliminating these states that have minimal covid infections, here’s the picture for the remaining states.
Now let’s count daily deaths differently. Among people whose covid19 infection is severe enough to be fatal, death usually occurs some days after infection, so let’s introduce a time lag between cases and deaths. News reports paint a varied picture: some people arrive at a hospital nearly dead, while others spend multiple days in the hospital or at home. Generally, however, there is a gap of multiple days between infection (specifically, discovery of infection) and death. Eventually we’ll let the data identify the optimal time lag. For the moment, based on some other analysis (with covidtracking.com data at the US level), let’s pick a 6-day lag, common across all states, and see how that works.
We can see here that a 6-day lag makes the “cases” and “deaths” lines (with the latter on an adjusted scale) pretty much coincide.
Research Goal: formalize the ideas conveyed in the graphs. That is, write an “algorithm” that identifies the optimal cases-to-death lag for each state and then establishes the mortality rate by state, showing a) how high it is for each state and b) that it can vary by state. If we can then make a compelling argument that it is important to establish the mortality rate correctly, we have something.
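One possible shape for such an algorithm, offered only as a sketch: for each candidate lag, pair cases on day t with deaths on day t + lag and keep the lag with the highest Pearson correlation. The correlation criterion, the function names, and the parameters `max_lag` and `min_overlap` are all assumptions, not something the analysis above has settled on. The series below is synthetic, built so that deaths are 5% of the cases reported 4 days earlier.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences, or None if undefined."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None  # a constant series has no defined correlation
    return sxy / sqrt(sxx * syy)

def best_lag(cases, deaths, max_lag=10, min_overlap=5):
    """Return (lag, correlation) for the lag that best aligns deaths with cases."""
    best = None
    for lag in range(max_lag + 1):
        shifted = deaths[lag:]  # deaths on day t + lag line up with cases on day t
        n = min(len(cases), len(shifted))
        if n < min_overlap:
            break  # too few overlapping days to trust the correlation
        r = pearson(cases[:n], shifted[:n])
        if r is not None and (best is None or r > best[1]):
            best = (lag, r)
    return best

# Synthetic daily cases; deaths follow 4 days later at a 5% rate.
cases = [5, 9, 20, 44, 80, 120, 150, 140, 110, 90, 70, 60, 40, 30, 20, 15]
deaths = [0.0] * 4 + [0.05 * c for c in cases[:12]]

lag, r = best_lag(cases, deaths)
print(lag, round(r, 3))
```

Running this per state on the smoothed daily series would give one lag per state. Real data is noisier than this synthetic example, so the correlation peak may be broad rather than sharp.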
Another way to minimize the impact of daily-level noise (misallocation of counts) is to work with cumulative data. Again we’ll do this on a log scale so we can reasonably compare states with widely disparate levels of cases.
Let’s look at this with a 6-day lag from cases to deaths.
What we see is that it is not sensible to compute the mortality rate from cumulative cases and deaths, for two reasons. First, after sufficient time has passed, a lag doesn’t really work well on cumulatives, especially when the daily cases and deaths are not monotonic. It makes more sense to do this computation using the incremental data for each day.
Second, in the early days, when the value of quick detection is highest, cumulative-based analysis is way off and unstable, as the above pictures show.
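To make the incremental computation concrete, here is a small sketch that applies the mortality rate formula 100*(#deaths)/(#cases) to daily counts, pairing cases on day t with deaths on day t + lag. The function name `lagged_mortality_rate` is hypothetical, and the numbers are made up: 100 new cases every day, with 5 deaths per day starting 6 days later.

```python
def lagged_mortality_rate(incr_cases, incr_deaths, lag=6):
    """Mortality rate (%) from daily counts, with deaths shifted back by `lag` days."""
    # zip stops at the shorter sequence, so only days with both values are used.
    paired = list(zip(incr_cases, incr_deaths[lag:]))
    total_cases = sum(c for c, _ in paired)
    total_deaths = sum(d for _, d in paired)
    if total_cases == 0:
        return None  # rate is undefined without any cases
    return 100.0 * total_deaths / total_cases

# Made-up example: steady cases, deaths begin 6 days after the first cases.
incr_cases = [100] * 10
incr_deaths = [0] * 6 + [5] * 4

with_lag = lagged_mortality_rate(incr_cases, incr_deaths, lag=6)
no_lag = lagged_mortality_rate(incr_cases, incr_deaths, lag=0)
print(with_lag, no_lag)  # 5.0 2.0
```

In this example, ignoring the lag counts cases whose outcomes are not yet known and reports 2% instead of the 5% built into the data.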
As of today we have 170803 observations across 1744 unique counties, with 6 columns for each observation, ranging in date from 2020-01-21 to 2020-05-23. Note: the fips column holds the Federal Information Processing Standard (FIPS) code, a numeric identifier assigned to each county.
A few sample observations below.
date | county | state | fips | cases | deaths |
---|---|---|---|---|---|
2020-05-18 | Switzerland | Indiana | 18155 | 18 | 0 |
2020-05-23 | Brookings | South Dakota | 46011 | 15 | 0 |
2020-03-25 | Waldo | Maine | 23027 | 1 | 0 |
2020-05-07 | Webster | Georgia | 13307 | 10 | 2 |
2020-03-17 | Wilson | North Carolina | 37195 | 1 | 0 |
2020-05-12 | Lincoln | Mississippi | 28085 | 189 | 14 |
2020-04-09 | Hampshire | Massachusetts | 25015 | 177 | 3 |
2020-05-18 | Montrose | Colorado | 8085 | 136 | 11 |
2020-05-08 | Warren | North Carolina | 37185 | 22 | 0 |
2020-03-29 | Pickens | South Carolina | 45077 | 11 | 0 |
2020-05-15 | Tipton | Indiana | 18159 | 21 | 1 |
2020-04-19 | Kershaw | South Carolina | 45055 | 197 | 8 |
2020-04-14 | Crawford | Kansas | 20037 | 6 | 1 |
2020-05-09 | Oakland | Michigan | 26125 | 7692 | 841 |
2020-04-11 | Danville city | Virginia | 51590 | 20 | 0 |