Let’s read the data from https://covidtracking.com/. There are two data sets, one for the US and the second by state.

The df.US data set looks like this,

	1	2
date	2020-07-19	2020-07-18
states	56	56
positive	3755968	3692061
negative	41978359	41273443
pending	3052	3032
hospitalizedCurrently	57792	57562
hospitalizedCumulative	277378	276439
inIcuCurrently	6391	6396
inIcuCumulative	12391	12342
onVentilatorCurrently	2362	2343
onVentilatorCumulative	1216	1211
recovered	1131121	1122720
dateChecked	2020-07-19T00:00:00Z	2020-07-18T00:00:00Z
death	132918	132395
hospitalized	277378	276439
lastModified	2020-07-19T00:00:00Z	2020-07-18T00:00:00Z
total	45737379	44968536
totalTestResults	45734327	44965504

while the df.states data set has this,

	1	2
date	2020-07-19	2020-07-19
state	AK	AL
positive	2277	67011
negative	170733	518522
pending	NA	NA
hospitalizedCurrently	27	1465
hospitalizedCumulative	NA	7782
inIcuCurrently	NA	NA
inIcuCumulative	NA	988
onVentilatorCurrently	1	NA
onVentilatorCumulative	NA	526
recovered	712	29736
dataQualityGrade	A	B
lastUpdateEt	7/19/2020 00:00	7/18/2020 11:00
dateModified	2020-07-19T00:00:00Z	2020-07-18T11:00:00Z
checkTimeEt	07/18 20:00	07/18 07:00
death	18	1287
hospitalized	NA	7782

We’ll work with the raw data (and ignore the derived columns, like total results = positive + negative + pending results), clean up some names, and then create some useful derived columns (daily or “new” x, where x = cases, deaths etc.).

	1	2	3	4	5	6
date	2020-07-19	2020-07-18	2020-07-17	2020-07-16	2020-07-15	2020-07-14
n.states	56	56	56	56	56	56
n.cases	3755968	3692061	3626881	3549648	3478695	3413313
n.neg	41978359	41273443	40576852	39802297	39048662	38351244
n.pend	3052	3032	3002	2929	2947	960
in.hosp	57792	57562	57705	57369	56143	55533
n.hosp	277378	276439	274436	271758	269543	267127
new.ICU	6391	6396	6453	6359	6317	6235
in.ICU	12391	12342	12243	12091	12002	11857
new.vent	2362	2343	2353	2317	2317	2263
n.vent	1216	1211	1200	1175	1166	1154
n.recov	1131121	1122720	1106844	1090645	1075882	1049098
n.death	277378	276439	274436	271758	269543	267127
n.tests	45734327	44965504	44203733	43351945	42527357	41764557
new.tests	768823	761771	851788	824588	762800	760282
new.cases	63907	65180	77233	70953	65382	62879
new.hosp	939	2003	2678	2215	2416	2262
new.death	939	2003	2678	2215	2416	2262
pct.inf	8.3	8.6	9.1	8.6	8.6	8.3

Cleaning and Smoothing the Data

Daily data measurements have lots of errors. Metrics reflecting date t may only be captured at time t+k, and the mismatch k varies widely across time and across sub-geos. So, we’ll smooth these out by computing rolling means. We’ll use different means for different metrics (e.g., 7 days for test results (cases), small numbers for hospitalizations and deaths).

n.days = dim(df.US)[1]

# smooth the data (name starts with s.)

df.US <- df.US %>% mutate( 
  s.daily.tests = rollmean(new.tests, k = 5, fill = NA), 
  s.daily.cases = rollmean(new.cases, k = 5, fill = NA),
  s.daily.death = rollmean(new.death, k = 3, fill = NA), 
  s.new.ICU = rollmean(new.ICU, k = 3, fill = NA), 
  s.new.vent = rollmean(new.vent, k = 3, fill = NA), 
  s.daily.hosp = rollmean(in.hosp, k = 3, fill = NA)
  ) %>% ungroup() 

# now do a similar smoothing as in df.US but one state at a time
df.states <- df.states %>% group_by(state) %>% add_tally() %>% filter(n > 35) %>% mutate( 
s.daily.tests = rollmean(new.tests, k = 5, fill = NA), 
  s.daily.cases = rollmean(new.cases, k = 5, fill = NA),
  s.daily.death = rollmean(new.death, k = 3, fill = NA), 
  s.in.ICU = rollmean(in.ICU, k = 3, fill = NA), 
  s.daily.hosp = rollmean(in.hosp, k = 3, fill = NA)
  ) %>% ungroup() 

df.ST <- function(ST) {
  df.states %>% filter(state==ST) 
}

US-level data with the added smoothed metrics.

	1	2	3	4	5	6
date	2020-07-19	2020-07-18	2020-07-17	2020-07-16	2020-07-15	2020-07-14
n.states	56	56	56	56	56	56
n.cases	3755968	3692061	3626881	3549648	3478695	3413313
n.neg	41978359	41273443	40576852	39802297	39048662	38351244
n.pend	3052	3032	3002	2929	2947	960
in.hosp	57792	57562	57705	57369	56143	55533
n.hosp	277378	276439	274436	271758	269543	267127
new.ICU	6391	6396	6453	6359	6317	6235
in.ICU	12391	12342	12243	12091	12002	11857
new.vent	2362	2343	2353	2317	2317	2263
n.vent	1216	1211	1200	1175	1166	1154
n.recov	1131121	1122720	1106844	1090645	1075882	1049098
n.death	277378	276439	274436	271758	269543	267127
n.tests	45734327	44965504	44203733	43351945	42527357	41764557
new.tests	768823	761771	851788	824588	762800	760282
new.cases	63907	65180	77233	70953	65382	62879
new.hosp	939	2003	2678	2215	2416	2262
new.death	939	2003	2678	2215	2416	2262
pct.inf	8.3	8.6	9.1	8.6	8.6	8.3
s.daily.tests	NA	NA	793954	792246	784311	759417
s.daily.cases	NA	NA	68531	68325	66982	63731
s.daily.death	NA	1873	2299	2436	2298	1964
s.new.ICU	NA	6413	6403	6376	6304	6209
s.new.vent	NA	2353	2338	2329	2299	2280
s.daily.hosp	NA	57686	57545	57072	56348	55221

A sample of the states data with smoothed metrics.

date	2020-07-19	2020-07-19	2020-07-19	2020-07-19	2020-07-19	2020-07-19
state	AK	AL	AR	AZ	CA	CO
n.cases	2277	67011	32533	143624	384692	39788
n.neg	170733	518522	386638	642355	5902160	396150
n.pend	NA	NA	NA	NA	NA	NA
in.hosp	27	1465	453	3136	8398	401
n.hosp	NA	7782	2177	6632	NA	6019
in.ICU	NA	NA	NA	894	2107	NA
n.ICU	NA	988	NA	NA	NA	NA
on.vent	1	NA	97	622	NA	NA
n.vent	NA	526	307	NA	NA	NA
n.recov	712	29736	25292	18149	NA	4948
n.death	18	1287	357	2761	7685	1615
n.tests	173010	585533	419171	785979	6286852	435938
new.tests	4647	11038	6166	13988	119634	5155
new.cases	118	1777	771	2359	9329	444
new.hosp	NA	0	42	65	NA	25
new.death	0	1	4	31	90	0
pct.inf	2.5	16.1	12.5	16.9	7.8	8.6
n	127	130	124	134	135	133
s.daily.tests	NA	NA	NA	NA	NA	NA
s.daily.cases	NA	NA	NA	NA	NA	NA
s.daily.death	NA	NA	NA	NA	NA	NA
s.in.ICU	NA	NA	NA	NA	NA	NA
s.daily.hosp	NA	NA	NA	NA	NA	NA

Exploring Cases and Death Rates

Looking at cases and deaths visually. Let’s start with test results vs. cases (positive test results) to get a sense of the positivity rate and variation across time.

Now look at cases vs deaths, i.e., the mortality rate.

The first panel simply lays out daily cases and daily deaths. It looks like deaths are suddenly up the last few days. The second panel pushes daily deaths data back 5 days - and how you see how the cases and deaths line up. The third panel presents an important metric - change in mortality rate over time and the importance of recognizing the lag between cases and deaths. If you ignore the lag it looks like the mortality rate is increasing, but if you adjust for a delay then you can see that- while health systems were initially overwhelmed and experienced high death rates - this rate has declined over time, perhaps because health systems have learnt and become better at dealing with critical cases.

Looking Deeper into Mortality Rate

Another way to see that the right “death lag” is needed to understand mortality rate is to plot cases against deaths, with different levels of lag.

Based on this, it “looks” like we should employ a 6-day gap between cases (i.e., someone tested, diagnosed and counted as being infected) vs. deaths. The slope of the regression line then gives us the mortality rate.

## 
## Call:
## lm(formula = lag(s.daily.death, 3) ~ -1 + new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1936   -675   -204    876   4816 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## new.ICU   0.2441     0.0148    16.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1460 on 111 degrees of freedom
##   (56 observations deleted due to missingness)
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.708 
## F-statistic:  273 on 1 and 111 DF,  p-value: <2e-16

Deaths vs ICU

Let’s look at how deaths relate to ICU admissions.

## 
## Call:
## lm(formula = lag(s.daily.death, 0) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1868   -643   -156    705   4746 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2517     0.0136    18.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1350 on 113 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.75 
## F-statistic:  343 on 1 and 113 DF,  p-value: <2e-16

## 
## Call:
## lm(formula = lag(s.daily.death, 1) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1890   -659   -195    750   4760 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2493     0.0139    17.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1380 on 112 degrees of freedom
##   (55 observations deleted due to missingness)
## Multiple R-squared:  0.74,   Adjusted R-squared:  0.738 
## F-statistic:  320 on 1 and 112 DF,  p-value: <2e-16

## 
## Call:
## lm(formula = lag(s.daily.death, 2) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1912   -687   -190    794   4787 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2467     0.0143    17.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1420 on 111 degrees of freedom
##   (56 observations deleted due to missingness)
## Multiple R-squared:  0.728,  Adjusted R-squared:  0.726 
## F-statistic:  297 on 1 and 111 DF,  p-value: <2e-16

## 
## Call:
## lm(formula = lag(s.daily.death, 3) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1936   -700   -209    810   4797 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2441     0.0147    16.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1450 on 110 degrees of freedom
##   (57 observations deleted due to missingness)
## Multiple R-squared:  0.715,  Adjusted R-squared:  0.713 
## F-statistic:  277 on 1 and 110 DF,  p-value: <2e-16

## 
## Call:
## lm(formula = lag(s.daily.death, 4) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1949   -690   -225    869   4774 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2415     0.0149    16.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1470 on 109 degrees of freedom
##   (58 observations deleted due to missingness)
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.703 
## F-statistic:  261 on 1 and 109 DF,  p-value: <2e-16

## 
## Call:
## lm(formula = lag(s.daily.death, 5) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1972   -697   -216    966   4749 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2388     0.0151    15.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1490 on 108 degrees of freedom
##   (59 observations deleted due to missingness)
## Multiple R-squared:  0.698,  Adjusted R-squared:  0.695 
## F-statistic:  250 on 1 and 108 DF,  p-value: <2e-16

## 
## Call:
## lm(formula = lag(s.daily.death, 6) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1998   -669   -220    918   4711 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2361     0.0151    15.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1480 on 107 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.695,  Adjusted R-squared:  0.692 
## F-statistic:  244 on 1 and 107 DF,  p-value: <2e-16

Multiple Regression

We do a regression with deaths against cases and ICU, and find that adding ICU doesn’t add anything to the regression vs. doing just cases.

## 
## Call:
## lm(formula = s.daily.death ~ -1 + lead(s.daily.cases, 6) + lead(s.new.ICU, 
##     0), data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1915   -736   -339    474   4768 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## lead(s.daily.cases, 6)   0.0175     0.0080    2.19    0.031 *  
## lead(s.new.ICU, 0)       0.2019     0.0264    7.64  7.9e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1330 on 112 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:  0.762,  Adjusted R-squared:  0.758 
## F-statistic:  180 on 2 and 112 DF,  p-value: <2e-16

## 
## Call:
## lm(formula = s.daily.death ~ -1 + lead(s.daily.cases, 6), data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2525   -509     -8   1265   5269 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## lead(s.daily.cases, 6)  0.07049    0.00464    15.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1520 on 134 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.633,  Adjusted R-squared:  0.63 
## F-statistic:  231 on 1 and 134 DF,  p-value: <2e-16

State-level

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

	Estimate.(Intercept)	Std. Error.(Intercept)	t value.(Intercept)	Pr(>\|t\|).(Intercept)	Estimate.s.daily.cases	Std. Error.s.daily.cases	t value.s.daily.cases	Pr(>\|t\|).s.daily.cases
AK	0.11	2.7	0.04	0.97	0.00	0.15	0.02	0.98
AL	5.73	3.0	1.95	0.05	0.01	0.01	2.17	0.03
AR	1.58	3.0	0.54	0.59	0.01	0.01	0.76	0.45
AZ	8.29	2.4	3.41	0.00	0.01	0.00	8.66	0.00
CA	42.19	2.9	14.76	0.00	0.01	0.00	8.35	0.00
CO	3.47	4.6	0.76	0.45	0.03	0.01	2.34	0.02
CT	-0.91	3.2	-0.28	0.78	0.09	0.01	16.10	0.00
DC	-0.63	3.8	-0.17	0.87	0.06	0.04	1.67	0.10
DE	-1.23	3.7	-0.33	0.74	0.05	0.03	1.71	0.09
FL	25.72	2.3	11.10	0.00	0.01	0.00	9.76	0.00
GA	30.44	3.3	9.25	0.00	0.00	0.00	-1.21	0.23
GU	-0.01	3.1	0.00	1.00	0.02	0.80	0.02	0.98
HI	0.01	3.0	0.00	1.00	0.02	0.22	0.09	0.93
IA	2.39	4.5	0.54	0.59	0.01	0.01	1.13	0.26
ID	0.83	2.6	0.32	0.75	0.00	0.02	0.15	0.88
IL	6.17	3.8	1.62	0.10	0.04	0.00	16.79	0.00
IN	-2.64	5.2	-0.51	0.61	0.06	0.01	5.37	0.00
KS	2.58	3.5	0.75	0.45	0.00	0.01	0.34	0.73
KY	4.34	4.0	1.10	0.27	0.01	0.02	0.48	0.63
LA	24.82	3.2	7.80	0.00	0.01	0.00	1.76	0.08
MA	10.31	3.3	3.14	0.00	0.06	0.00	23.26	0.00
MD	2.37	4.1	0.57	0.57	0.04	0.01	6.94	0.00
ME	0.71	6.9	0.10	0.92	0.01	0.22	0.05	0.96
MI	-25.94	4.0	-6.47	0.00	0.13	0.01	22.13	0.00
MN	5.53	3.7	1.49	0.14	0.02	0.01	2.52	0.01
MO	9.42	3.8	2.47	0.01	0.00	0.01	0.06	0.95
MP	0.00	NaN	NaN	NaN	0.00	NaN	NaN	NaN
MS	7.49	3.4	2.18	0.03	0.01	0.01	1.56	0.12
MT	0.13	2.5	0.05	0.96	0.01	0.10	0.13	0.90
NC	10.09	3.2	3.15	0.00	0.00	0.00	1.34	0.18
ND	0.43	3.7	0.12	0.91	0.01	0.09	0.07	0.94
NE	1.22	3.4	0.36	0.72	0.01	0.02	0.45	0.65
NH	0.17	4.9	0.04	0.97	0.06	0.08	0.80	0.43
NJ	44.22	2.9	15.28	0.00	0.06	0.00	38.26	0.00
NM	2.94	5.0	0.59	0.56	0.02	0.03	0.47	0.64
NV	4.56	2.7	1.68	0.09	0.00	0.01	0.38	0.70
NY	-35.16	2.8	-12.78	0.00	0.07	0.00	118.85	0.00
OH	22.46	4.2	5.31	0.00	0.01	0.01	1.04	0.30
OK	3.87	2.8	1.38	0.17	0.00	0.01	-0.07	0.95
OR	1.78	3.2	0.55	0.58	0.00	0.02	0.16	0.87
PA	-1.69	4.2	-0.40	0.69	0.08	0.00	16.31	0.00
PR	1.76	3.6	0.48	0.63	0.00	0.03	-0.09	0.93
RI	4.58	3.7	1.23	0.22	0.03	0.02	1.54	0.12
SC	5.45	2.8	1.93	0.05	0.01	0.00	2.54	0.01
SD	0.48	4.0	0.12	0.91	0.01	0.06	0.14	0.89
TN	3.85	3.1	1.22	0.22	0.01	0.00	1.37	0.17
TX	13.96	2.7	5.21	0.00	0.01	0.00	10.71	0.00
UT	0.98	3.2	0.30	0.76	0.00	0.01	0.40	0.69
VA	6.17	4.0	1.55	0.12	0.02	0.01	2.93	0.00
VI	0.07	2.7	0.03	0.98	-0.01	0.78	-0.01	0.99
VT	0.13	2.8	0.05	0.96	0.03	0.18	0.17	0.87
WA	8.57	3.8	2.27	0.02	0.01	0.01	0.47	0.64
WI	7.75	4.3	1.81	0.07	0.00	0.01	-0.17	0.86
WV	0.84	3.0	0.28	0.78	0.00	0.07	-0.01	0.99
WY	0.26	4.1	0.06	0.95	0.00	0.20	-0.01	0.99

covid19