State-Level Data

As of today we have 4,524 observations across 55 unique states and territories, with 9 columns per observation, ranging in date from 2020-01-21 to 2020-05-23.
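
The column layout (date, state, fips, cases, deaths) matches the NYT covid-19-data files, so as a hedged sketch, the table might be assembled like this. The source URL is an assumption based on the column names; the incr.* columns are day-over-day differences of the cumulative counts, and the smoothed columns are derived later.

```r
library(dplyr)

# Load the raw state-level file (assumed source) and derive daily increments.
states <- read.csv(
  "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv",
  stringsAsFactors = FALSE
) %>%
  arrange(state, date) %>%
  group_by(state) %>%
  mutate(
    incr.cases  = cases  - lag(cases,  default = 0),   # daily new cases
    incr.deaths = deaths - lag(deaths, default = 0)    # daily new deaths
  ) %>%
  ungroup()
```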

A few sample observations below.

date        state          fips  cases  deaths  incr.cases  incr.deaths  smooth.incr.cases  smooth.incr.deaths
2020-04-10  Colorado          8   6510     253         308           26              312.0                22.6
2020-04-28  Georgia          13  23607    1022         378           41              540.4                30.2
2020-04-15  Indiana          18   8955     436         428           49              409.6                27.2
2020-04-16  Nebraska         31   1094      24         100            2               75.2                 1.2
2020-04-30  Nebraska         31   4332      70         456            3              286.8                 3.6
2020-04-04  Nebraska         31    339       8          43            2               30.8                 1.0
2020-04-20  Georgia          13  18447     767         828           85              772.8                39.6
2020-04-21  Iowa             19   3641      83         466            4              300.0                 4.6
2020-05-14  Iowa             19  13675     318         386           12              400.8                13.2
2020-04-18  Ohio             39  10222     451        1115           33              649.4                35.4
2020-05-05  Massachusetts    25  70271    4212        1184          122             1613.2               130.0
2020-05-14  California        6  74947    3039        1729           25             1624.6                61.4
2020-04-16  Georgia          13  15644     611        1061           42              676.6                35.8
2020-04-02  Alaska            2    146       2           3            0                8.8                 0.2
2020-04-19  Maryland         24  12892     548         513           14              684.0                49.2

Data Quality (Noise in Daily Data?)

Let’s see whether there is a lot of noise in the daily data. We can check for this by looking at daily cases (or deaths) to see whether they oscillate, or by looking at the cumulative counts to see whether there are flat stretches and spikes.

Daily Cases

Let’s start by looking at the daily data (on a log scale).

We’ll want to look at both cases and deaths together. However, there is a scale difference between the two; specifically, mortality rate = 100 * (#deaths)/(#cases). The mortality rate can vary across time, and certainly can also vary by geographic unit (such as state or county). For now, in order to plot cases and deaths together, let’s define a state-specific scale parameter scl.new = max(cases)/max(deaths), i.e., the number of cases that lead to one death.
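
As a one-line sketch using the columns above:

```r
# State-specific scale parameter: the number of cases per death, used to put
# deaths on the same visual scale as cases. States with zero recorded deaths
# would give Inf here; those low-count states are dropped shortly anyway.
scl <- states %>%
  group_by(state) %>%
  summarise(scl.new = max(cases) / max(deaths), .groups = "drop")
```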

Separating out by state and looking at daily cases (black) and deaths (red):

There are a lot of up-and-down spikes in the daily data, likely due to reporting problems. This suggests that multi-day smoothing would be useful. Here’s the same picture with smoothed cases (5-day rolling average).
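
A sketch of the smoothing step (whether the original window is trailing or centered is an assumption):

```r
library(zoo)

# 5-day rolling average of the daily increments, computed per state.
# align = "right" gives a trailing window ending on the current day.
states <- states %>%
  group_by(state) %>%
  mutate(
    smooth.incr.cases  = rollmean(incr.cases,  k = 5, fill = NA, align = "right"),
    smooth.incr.deaths = rollmean(incr.deaths, k = 5, fill = NA, align = "right")
  ) %>%
  ungroup()
```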

Still, a few states behave badly because their numbers are extremely low (few cases and, generally, even fewer deaths). These are: Alaska, Hawaii, Montana, North Dakota, South Dakota, Vermont, Wyoming, Guam, Northern Mariana Islands, and the Virgin Islands. Let’s drop them from further analysis. These states collectively account for 366,499 cases out of a total of 52,359,280 in the US (just 0.7% of cases).
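
Dropping them might look like this:

```r
# Exclude the low-count states and territories listed above.
low.count <- c("Alaska", "Hawaii", "Montana", "North Dakota", "South Dakota",
               "Vermont", "Wyoming", "Guam", "Northern Mariana Islands",
               "Virgin Islands")

states <- states %>% filter(!(state %in% low.count))
```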

After eliminating these states with minimal COVID-19 infections, here’s the picture for the remaining states.

Lagged Deaths and Mortality Rate

Now let’s count daily deaths differently. Among people whose COVID-19 outcome is severe enough to be fatal, death usually occurs several days after the infection, so let’s introduce a time lag between cases and deaths. News reports paint a varied picture: some people are brought nearly dead to a hospital, while others spend multiple days in the hospital or at home. Generally, however, there is a multi-day gap between infection (more precisely, discovery of infection) and death. We’ll let the data identify the optimal time lag. Based on some other analysis (with covidtracking.com data at the US level), let’s for the moment pick a 6-day lag, common across all states, and see how that works.
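
As a sketch (the names lag.days and deaths.lagged are mine), applying the common 6-day lag means pairing the cases on day t with the deaths on day t + 6, i.e. shifting the deaths series back by six days:

```r
# Shift deaths back by 6 days so they line up with the cases that,
# under this hypothesis, produced them.
lag.days <- 6

states <- states %>%
  group_by(state) %>%
  mutate(deaths.lagged = lead(smooth.incr.deaths, n = lag.days)) %>%
  ungroup()
```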

We can see here that a 6-day lag makes the “cases” and “deaths” lines (with the latter on an adjusted scale) pretty much coincident.

Research Goal: formalize the ideas conveyed in the graphs. In essence, write an “algorithm” that identifies the optimal cases-to-deaths lag for each state and then establishes the mortality rate by state, showing (a) what the rate is for each state and (b) that it can vary by state. If we can then make a compelling argument that it is important to correctly establish the mortality rate, we have something.
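
A hypothetical sketch of such an algorithm follows. The criterion (maximize the correlation between smoothed daily cases and deaths shifted back by k days) and the 14-day search window are assumptions, not an established method:

```r
# Shift a series back by k days (k = 0 leaves it unchanged).
shift.back <- function(x, k) if (k == 0) x else dplyr::lead(x, k)

# For one state, pick the lag that best aligns daily deaths with daily cases.
best.lag <- function(cases, deaths, max.lag = 14) {
  cors <- sapply(0:max.lag, function(k)
    cor(cases, shift.back(deaths, k), use = "complete.obs"))
  (0:max.lag)[which.max(cors)]
}

lag.by.state <- states %>%
  group_by(state) %>%
  summarise(lag.opt = best.lag(smooth.incr.cases, smooth.incr.deaths),
            .groups = "drop")
```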

Cumulative Cases

Another way to minimize the impact of daily-level noise (misallocation of counts) is to work with cumulative data. Again we’ll do this on a log scale so we can reasonably compare states with widely disparate levels of cases.

Let’s look at this with a 6-day lag from cases to deaths.

What we see is that it is not sensible to estimate the mortality rate from cumulative case and death counts. First, once sufficient time has passed, a lag doesn’t really work on cumulatives, especially when the daily cases and deaths are not monotonic. It makes more sense to do this computation using the incremental data for each day.
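
A sketch of the incremental computation, keeping the fixed 6-day lag for illustration (the per-state optimal lag from the algorithm above could be substituted):

```r
# Mortality rate from lagged daily increments rather than cumulatives:
# pair each day's new cases with the new deaths 6 days later, then aggregate.
mort.by.state <- states %>%
  group_by(state) %>%
  mutate(deaths.ahead = lead(incr.deaths, n = 6)) %>%
  filter(!is.na(deaths.ahead)) %>%
  summarise(mort.rate = 100 * sum(deaths.ahead) / sum(incr.cases),
            .groups = "drop")
```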

Second, in the early days, when the value of quick detection is highest, cumulative-based analysis is way off and unstable, as the pictures above show.

County Level Data

As of today we have 170,803 observations across 1,744 unique counties, with 6 columns per observation, ranging in date from 2020-01-21 to 2020-05-23. Note: fips is the county’s Federal Information Processing Standards (FIPS) code, a numeric identifier assigned to each county.

A few sample observations below.

date        county         state           fips  cases  deaths
2020-05-18  Switzerland    Indiana         18155     18       0
2020-05-23  Brookings      South Dakota    46011     15       0
2020-03-25  Waldo          Maine           23027      1       0
2020-05-07  Webster        Georgia         13307     10       2
2020-03-17  Wilson         North Carolina  37195      1       0
2020-05-12  Lincoln        Mississippi     28085    189      14
2020-04-09  Hampshire      Massachusetts   25015    177       3
2020-05-18  Montrose       Colorado         8085    136      11
2020-05-08  Warren         North Carolina  37185     22       0
2020-03-29  Pickens        South Carolina  45077     11       0
2020-05-15  Tipton         Indiana         18159     21       1
2020-04-19  Kershaw        South Carolina  45055    197       8
2020-04-14  Crawford       Kansas          20037      6       1
2020-05-09  Oakland        Michigan        26125   7692     841
2020-04-11  Danville city  Virginia        51590     20       0