State-Level Data

As of today we have 4,524 observations across 55 unique states and territories, with 9 columns per observation, ranging in date from 2020-01-21 to 2020-05-23.
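
The column layout (date, state, fips, cases, deaths) matches the NYT covid-19-data files, so as a hedged sketch, the table might be assembled like this. The source URL is an assumption based on the column names; the incr.* columns are day-over-day differences of the cumulative counts, and the smoothed columns are derived later.

```r
library(dplyr)

# Load the raw state-level file (assumed source) and derive daily increments.
states <- read.csv(
  "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
  stringsAsFactors = FALSE
) %>%
  arrange(state, date) %>%
  group_by(state) %>%
  mutate(
    incr.cases  = cases  - lag(cases,  default = 0),   # daily new cases
    incr.deaths = deaths - lag(deaths, default = 0)    # daily new deaths
  ) %>%
  ungroup()
```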

A few sample observations below.

date        state          fips  cases  deaths  incr.cases  incr.deaths  smooth.incr.cases  smooth.incr.deaths
2020-04-10  Colorado          8   6510     253         308           26              312.0                22.6
2020-04-28  Georgia          13  23607    1022         378           41              540.4                30.2
2020-04-15  Indiana          18   8955     436         428           49              409.6                27.2
2020-04-16  Nebraska         31   1094      24         100            2               75.2                 1.2
2020-04-30  Nebraska         31   4332      70         456            3              286.8                 3.6
2020-04-04  Nebraska         31    339       8          43            2               30.8                 1.0
2020-04-20  Georgia          13  18447     767         828           85              772.8                39.6
2020-04-21  Iowa             19   3641      83         466            4              300.0                 4.6
2020-05-14  Iowa             19  13675     318         386           12              400.8                13.2
2020-04-18  Ohio             39  10222     451        1115           33              649.4                35.4
2020-05-05  Massachusetts    25  70271    4212        1184          122             1613.2               130.0
2020-05-14  California        6  74947    3039        1729           25             1624.6                61.4
2020-04-16  Georgia          13  15644     611        1061           42              676.6                35.8
2020-04-02  Alaska            2    146       2           3            0                8.8                 0.2
2020-04-19  Maryland         24  12892     548         513           14              684.0                49.2

Data Quality (Noise in Daily Data?)

Let’s see whether there is a lot of noise in the daily data. We can check for this by looking at daily cases (or deaths) to see whether they oscillate, or by looking at the cumulative counts to see whether there are flat stretches and spikes.

Daily Cases

Let’s start by looking at the daily data (on a log scale).

We’ll want to look at both cases and deaths together. However, there is a scale difference between the two; specifically, mortality rate = 100 * (#deaths)/(#cases). The mortality rate can vary across time, and certainly can also vary by geographic unit (such as state or county). For now, in order to plot cases and deaths together, let’s define a state-specific scale parameter scl.new = max(cases)/max(deaths), i.e., the number of cases that lead to one death.
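
As a one-line sketch using the columns above:

```r
# State-specific scale parameter: the number of cases per death, used to put
# deaths on the same visual scale as cases. States with zero recorded deaths
# would give Inf here; those low-count states are dropped shortly anyway.
scl <- states %>%
  group_by(state) %>%
  summarise(scl.new = max(cases) / max(deaths), .groups = "drop")
```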

Separating out by state and looking at daily cases (black) and deaths (red):

There are a lot of up-and-down spikes in the daily data, likely due to reporting problems. This suggests that multi-day smoothing would be useful. Here’s the same picture with smoothed cases (5-day rolling average).
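
A sketch of the smoothing step (whether the original window is trailing or centered is an assumption):

```r
library(zoo)

# 5-day rolling average of the daily increments, computed per state.
# align = "right" gives a trailing window ending on the current day.
states <- states %>%
  group_by(state) %>%
  mutate(
    smooth.incr.cases  = rollmean(incr.cases,  k = 5, fill = NA, align = "right"),
    smooth.incr.deaths = rollmean(incr.deaths, k = 5, fill = NA, align = "right")
  ) %>%
  ungroup()
```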

Still, a few states behave badly because their numbers are extremely low (few cases and, generally, even fewer deaths). These are: Alaska, Hawaii, Montana, North Dakota, South Dakota, Vermont, Wyoming, Guam, Northern Mariana Islands, and the Virgin Islands. Let’s drop them from further analysis. These states collectively account for 366,499 cases out of a total of 52,359,280 in the US (just 0.7% of cases).
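
Dropping them might look like this:

```r
# Exclude the low-count states and territories listed above.
low.count <- c("Alaska", "Hawaii", "Montana", "North Dakota", "South Dakota",
               "Vermont", "Wyoming", "Guam", "Northern Mariana Islands",
               "Virgin Islands")

states <- states %>% filter(!(state %in% low.count))
```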

After eliminating these states with minimal COVID-19 infections, here’s the picture for the remaining states.

Lagged Deaths and Mortality Rate

Now let’s count daily deaths differently. Among people whose COVID-19 outcome is severe enough to be fatal, death usually occurs several days after the infection, so let’s introduce a time lag between cases and deaths. News reports paint a varied picture: some people are brought nearly dead to a hospital, while others spend multiple days in the hospital or at home. Generally, however, there is a multi-day gap between infection (more precisely, discovery of infection) and death. We’ll let the data identify the optimal time lag. Based on some other analysis (with covidtracking.com data at the US level), let’s for the moment pick a 6-day lag, common across all states, and see how that works.
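
As a sketch (the names lag.days and deaths.lagged are mine), applying the common 6-day lag means pairing the cases on day t with the deaths on day t + 6, i.e. shifting the deaths series back by six days:

```r
# Shift deaths back by 6 days so they line up with the cases that,
# under this hypothesis, produced them.
lag.days <- 6

states <- states %>%
  group_by(state) %>%
  mutate(deaths.lagged = lead(smooth.incr.deaths, n = lag.days)) %>%
  ungroup()
```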

We can see here that a 6-day lag makes the “cases” and “deaths” lines (with the latter on an adjusted scale) pretty much coincident.

Research Goal: formalize the ideas conveyed in the graphs. In essence, write an “algorithm” that identifies the optimal cases-to-deaths lag for each state and then establishes the mortality rate by state, showing (a) what the rate is for each state and (b) that it can vary by state. If we can then make a compelling argument that it is important to correctly establish the mortality rate, we have something.
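
A hypothetical sketch of such an algorithm follows. The criterion (maximize the correlation between smoothed daily cases and deaths shifted back by k days) and the 14-day search window are assumptions, not an established method:

```r
# Shift a series back by k days (k = 0 leaves it unchanged).
shift.back <- function(x, k) if (k == 0) x else dplyr::lead(x, k)

# For one state, pick the lag that best aligns daily deaths with daily cases.
best.lag <- function(cases, deaths, max.lag = 14) {
  cors <- sapply(0:max.lag, function(k)
    cor(cases, shift.back(deaths, k), use = "complete.obs"))
  (0:max.lag)[which.max(cors)]
}

lag.by.state <- states %>%
  group_by(state) %>%
  summarise(lag.opt = best.lag(smooth.incr.cases, smooth.incr.deaths),
            .groups = "drop")
```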

Cumulative Cases

Another way to minimize the impact of daily-level noise (misallocation of counts) is to work with cumulative data. Again we’ll do this on a log scale so we can reasonably compare states with widely disparate levels of cases.

Let’s look at this with a 6-day lag from cases to deaths.

What we see is that it is not sensible to estimate the mortality rate from cumulative case and death counts. First, once sufficient time has passed, a lag doesn’t really work on cumulatives, especially when the daily cases and deaths are not monotonic. It makes more sense to do this computation using the incremental data for each day.
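
A sketch of the incremental computation, keeping the fixed 6-day lag for illustration (the per-state optimal lag from the algorithm above could be substituted):

```r
# Mortality rate from lagged daily increments rather than cumulatives:
# pair each day's new cases with the new deaths 6 days later, then aggregate.
mort.by.state <- states %>%
  group_by(state) %>%
  mutate(deaths.ahead = lead(incr.deaths, n = 6)) %>%
  filter(!is.na(deaths.ahead)) %>%
  summarise(mort.rate = 100 * sum(deaths.ahead) / sum(incr.cases),
            .groups = "drop")
```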

Second, in the early days, when the value of quick detection is highest, cumulative-based analysis is way off and unstable, as the pictures above show.

County Level Data

As of today we have 170,803 observations across 1,744 unique counties, with 6 columns per observation, ranging in date from 2020-01-21 to 2020-05-23. Note: fips is the county’s Federal Information Processing Standards (FIPS) code, a numeric identifier assigned to each county.

A few sample observations below.

date        county         state           fips  cases  deaths
2020-05-18  Switzerland    Indiana         18155     18       0
2020-05-23  Brookings      South Dakota    46011     15       0
2020-03-25  Waldo          Maine           23027      1       0
2020-05-07  Webster        Georgia         13307     10       2
2020-03-17  Wilson         North Carolina  37195      1       0
2020-05-12  Lincoln        Mississippi     28085    189      14
2020-04-09  Hampshire      Massachusetts   25015    177       3
2020-05-18  Montrose       Colorado         8085    136      11
2020-05-08  Warren         North Carolina  37185     22       0
2020-03-29  Pickens        South Carolina  45077     11       0
2020-05-15  Tipton         Indiana         18159     21       1
2020-04-19  Kershaw        South Carolina  45055    197       8
2020-04-14  Crawford       Kansas          20037      6       1
2020-05-09  Oakland        Michigan        26125   7692     841
2020-04-11  Danville city  Virginia        51590     20       0