Let’s read the data from https://covidtracking.com/. There are two data sets, one for the US and the second by state.

The df.US data set looks like this,

1 2
date 2020-07-19 2020-07-18
states 56 56
positive 3755968 3692061
negative 41978359 41273443
pending 3052 3032
hospitalizedCurrently 57792 57562
hospitalizedCumulative 277378 276439
inIcuCurrently 6391 6396
inIcuCumulative 12391 12342
onVentilatorCurrently 2362 2343
onVentilatorCumulative 1216 1211
recovered 1131121 1122720
dateChecked 2020-07-19T00:00:00Z 2020-07-18T00:00:00Z
death 132918 132395
hospitalized 277378 276439
lastModified 2020-07-19T00:00:00Z 2020-07-18T00:00:00Z
total 45737379 44968536
totalTestResults 45734327 44965504

while the df.states data set has this,

1 2
date 2020-07-19 2020-07-19
state AK AL
positive 2277 67011
negative 170733 518522
pending NA NA
hospitalizedCurrently 27 1465
hospitalizedCumulative NA 7782
inIcuCurrently NA NA
inIcuCumulative NA 988
onVentilatorCurrently 1 NA
onVentilatorCumulative NA 526
recovered 712 29736
dataQualityGrade A B
lastUpdateEt 7/19/2020 00:00 7/18/2020 11:00
dateModified 2020-07-19T00:00:00Z 2020-07-18T11:00:00Z
checkTimeEt 07/18 20:00 07/18 07:00
death 18 1287
hospitalized NA 7782

We’ll work with the raw data (and ignore the derived columns, like total results = positive + negative + pending results), clean up some names, and then create some useful derived columns (daily or “new” x, where x = cases, deaths etc.).

1 2 3 4 5 6
date 2020-07-19 2020-07-18 2020-07-17 2020-07-16 2020-07-15 2020-07-14
n.states 56 56 56 56 56 56
n.cases 3755968 3692061 3626881 3549648 3478695 3413313
n.neg 41978359 41273443 40576852 39802297 39048662 38351244
n.pend 3052 3032 3002 2929 2947 960
in.hosp 57792 57562 57705 57369 56143 55533
n.hosp 277378 276439 274436 271758 269543 267127
new.ICU 6391 6396 6453 6359 6317 6235
in.ICU 12391 12342 12243 12091 12002 11857
new.vent 2362 2343 2353 2317 2317 2263
n.vent 1216 1211 1200 1175 1166 1154
n.recov 1131121 1122720 1106844 1090645 1075882 1049098
n.death 277378 276439 274436 271758 269543 267127
n.tests 45734327 44965504 44203733 43351945 42527357 41764557
new.tests 768823 761771 851788 824588 762800 760282
new.cases 63907 65180 77233 70953 65382 62879
new.hosp 939 2003 2678 2215 2416 2262
new.death 939 2003 2678 2215 2416 2262
pct.inf 8.3 8.6 9.1 8.6 8.6 8.3

Cleaning and Smoothing the Data

Daily data measurements have lots of errors. Metrics reflecting date t may only be captured at time t+k, and the mismatch k varies widely across time and across sub-geos. So, we’ll smooth these out by computing rolling means. We’ll use different means for different metrics (e.g., 7 days for test results (cases), small numbers for hospitalizations and deaths).

n.days = dim(df.US)[1]

# smooth the data (name starts with s.)

df.US <- df.US %>% mutate( 
  s.daily.tests = rollmean(new.tests, k = 5, fill = NA), 
  s.daily.cases = rollmean(new.cases, k = 5, fill = NA),
  s.daily.death = rollmean(new.death, k = 3, fill = NA), 
  s.new.ICU = rollmean(new.ICU, k = 3, fill = NA), 
  s.new.vent = rollmean(new.vent, k = 3, fill = NA), 
  s.daily.hosp = rollmean(in.hosp, k = 3, fill = NA)
  ) %>% ungroup() 

# now do a similar smoothing as in df.US but one state at a time
df.states <- df.states %>% group_by(state) %>% add_tally() %>% filter(n > 35) %>% mutate( 
s.daily.tests = rollmean(new.tests, k = 5, fill = NA), 
  s.daily.cases = rollmean(new.cases, k = 5, fill = NA),
  s.daily.death = rollmean(new.death, k = 3, fill = NA), 
  s.in.ICU = rollmean(in.ICU, k = 3, fill = NA), 
  s.daily.hosp = rollmean(in.hosp, k = 3, fill = NA)
  ) %>% ungroup() 

df.ST <- function(ST) {
  df.states %>% filter(state==ST) 
}

US-level data with the added smoothed metrics.

1 2 3 4 5 6
date 2020-07-19 2020-07-18 2020-07-17 2020-07-16 2020-07-15 2020-07-14
n.states 56 56 56 56 56 56
n.cases 3755968 3692061 3626881 3549648 3478695 3413313
n.neg 41978359 41273443 40576852 39802297 39048662 38351244
n.pend 3052 3032 3002 2929 2947 960
in.hosp 57792 57562 57705 57369 56143 55533
n.hosp 277378 276439 274436 271758 269543 267127
new.ICU 6391 6396 6453 6359 6317 6235
in.ICU 12391 12342 12243 12091 12002 11857
new.vent 2362 2343 2353 2317 2317 2263
n.vent 1216 1211 1200 1175 1166 1154
n.recov 1131121 1122720 1106844 1090645 1075882 1049098
n.death 277378 276439 274436 271758 269543 267127
n.tests 45734327 44965504 44203733 43351945 42527357 41764557
new.tests 768823 761771 851788 824588 762800 760282
new.cases 63907 65180 77233 70953 65382 62879
new.hosp 939 2003 2678 2215 2416 2262
new.death 939 2003 2678 2215 2416 2262
pct.inf 8.3 8.6 9.1 8.6 8.6 8.3
s.daily.tests NA NA 793954 792246 784311 759417
s.daily.cases NA NA 68531 68325 66982 63731
s.daily.death NA 1873 2299 2436 2298 1964
s.new.ICU NA 6413 6403 6376 6304 6209
s.new.vent NA 2353 2338 2329 2299 2280
s.daily.hosp NA 57686 57545 57072 56348 55221

A sample of the states data with smoothed metrics.

date 2020-07-19 2020-07-19 2020-07-19 2020-07-19 2020-07-19 2020-07-19
state AK AL AR AZ CA CO
n.cases 2277 67011 32533 143624 384692 39788
n.neg 170733 518522 386638 642355 5902160 396150
n.pend NA NA NA NA NA NA
in.hosp 27 1465 453 3136 8398 401
n.hosp NA 7782 2177 6632 NA 6019
in.ICU NA NA NA 894 2107 NA
n.ICU NA 988 NA NA NA NA
on.vent 1 NA 97 622 NA NA
n.vent NA 526 307 NA NA NA
n.recov 712 29736 25292 18149 NA 4948
n.death 18 1287 357 2761 7685 1615
n.tests 173010 585533 419171 785979 6286852 435938
new.tests 4647 11038 6166 13988 119634 5155
new.cases 118 1777 771 2359 9329 444
new.hosp NA 0 42 65 NA 25
new.death 0 1 4 31 90 0
pct.inf 2.5 16.1 12.5 16.9 7.8 8.6
n 127 130 124 134 135 133
s.daily.tests NA NA NA NA NA NA
s.daily.cases NA NA NA NA NA NA
s.daily.death NA NA NA NA NA NA
s.in.ICU NA NA NA NA NA NA
s.daily.hosp NA NA NA NA NA NA

Exploring Cases and Death Rates

Looking at cases and deaths visually. Let’s start with test results vs. cases (positive test results) to get a sense of the positivity rate and variation across time.

Now look at cases vs deaths, i.e., the mortality rate.

The first panel simply lays out daily cases and daily deaths. It looks like deaths are suddenly up the last few days. The second panel pushes daily deaths data back 5 days - and how you see how the cases and deaths line up. The third panel presents an important metric - change in mortality rate over time and the importance of recognizing the lag between cases and deaths. If you ignore the lag it looks like the mortality rate is increasing, but if you adjust for a delay then you can see that- while health systems were initially overwhelmed and experienced high death rates - this rate has declined over time, perhaps because health systems have learnt and become better at dealing with critical cases.

Looking Deeper into Mortality Rate

Another way to see that the right “death lag” is needed to understand mortality rate is to plot cases against deaths, with different levels of lag.

Based on this, it “looks” like we should employ a 6-day gap between cases (i.e., someone tested, diagnosed and counted as being infected) vs. deaths. The slope of the regression line then gives us the mortality rate.

## 
## Call:
## lm(formula = lag(s.daily.death, 3) ~ -1 + new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1936   -675   -204    876   4816 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## new.ICU   0.2441     0.0148    16.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1460 on 111 degrees of freedom
##   (56 observations deleted due to missingness)
## Multiple R-squared:  0.711,  Adjusted R-squared:  0.708 
## F-statistic:  273 on 1 and 111 DF,  p-value: <2e-16

Deaths vs ICU

Let’s look at how deaths relate to ICU admissions.

## 
## Call:
## lm(formula = lag(s.daily.death, 0) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1868   -643   -156    705   4746 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2517     0.0136    18.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1350 on 113 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:  0.752,  Adjusted R-squared:  0.75 
## F-statistic:  343 on 1 and 113 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = lag(s.daily.death, 1) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1890   -659   -195    750   4760 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2493     0.0139    17.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1380 on 112 degrees of freedom
##   (55 observations deleted due to missingness)
## Multiple R-squared:  0.74,   Adjusted R-squared:  0.738 
## F-statistic:  320 on 1 and 112 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = lag(s.daily.death, 2) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1912   -687   -190    794   4787 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2467     0.0143    17.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1420 on 111 degrees of freedom
##   (56 observations deleted due to missingness)
## Multiple R-squared:  0.728,  Adjusted R-squared:  0.726 
## F-statistic:  297 on 1 and 111 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = lag(s.daily.death, 3) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1936   -700   -209    810   4797 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2441     0.0147    16.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1450 on 110 degrees of freedom
##   (57 observations deleted due to missingness)
## Multiple R-squared:  0.715,  Adjusted R-squared:  0.713 
## F-statistic:  277 on 1 and 110 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = lag(s.daily.death, 4) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1949   -690   -225    869   4774 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2415     0.0149    16.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1470 on 109 degrees of freedom
##   (58 observations deleted due to missingness)
## Multiple R-squared:  0.705,  Adjusted R-squared:  0.703 
## F-statistic:  261 on 1 and 109 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = lag(s.daily.death, 5) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1972   -697   -216    966   4749 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2388     0.0151    15.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1490 on 108 degrees of freedom
##   (59 observations deleted due to missingness)
## Multiple R-squared:  0.698,  Adjusted R-squared:  0.695 
## F-statistic:  250 on 1 and 108 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = lag(s.daily.death, 6) ~ -1 + s.new.ICU, data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1998   -669   -220    918   4711 
## 
## Coefficients:
##           Estimate Std. Error t value Pr(>|t|)    
## s.new.ICU   0.2361     0.0151    15.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1480 on 107 degrees of freedom
##   (60 observations deleted due to missingness)
## Multiple R-squared:  0.695,  Adjusted R-squared:  0.692 
## F-statistic:  244 on 1 and 107 DF,  p-value: <2e-16

Multiple Regression

We do a regression with deaths against cases and ICU, and find that adding ICU doesn’t add anything to the regression vs. doing just cases.

## 
## Call:
## lm(formula = s.daily.death ~ -1 + lead(s.daily.cases, 6) + lead(s.new.ICU, 
##     0), data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1915   -736   -339    474   4768 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## lead(s.daily.cases, 6)   0.0175     0.0080    2.19    0.031 *  
## lead(s.new.ICU, 0)       0.2019     0.0264    7.64  7.9e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1330 on 112 degrees of freedom
##   (54 observations deleted due to missingness)
## Multiple R-squared:  0.762,  Adjusted R-squared:  0.758 
## F-statistic:  180 on 2 and 112 DF,  p-value: <2e-16
## 
## Call:
## lm(formula = s.daily.death ~ -1 + lead(s.daily.cases, 6), data = df.US)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -2525   -509     -8   1265   5269 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## lead(s.daily.cases, 6)  0.07049    0.00464    15.2   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1520 on 134 degrees of freedom
##   (33 observations deleted due to missingness)
## Multiple R-squared:  0.633,  Adjusted R-squared:  0.63 
## F-statistic:  231 on 1 and 134 DF,  p-value: <2e-16

State-level

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
Estimate.(Intercept) Std. Error.(Intercept) t value.(Intercept) Pr(>|t|).(Intercept) Estimate.s.daily.cases Std. Error.s.daily.cases t value.s.daily.cases Pr(>|t|).s.daily.cases
AK 0.11 2.7 0.04 0.97 0.00 0.15 0.02 0.98
AL 5.73 3.0 1.95 0.05 0.01 0.01 2.17 0.03
AR 1.58 3.0 0.54 0.59 0.01 0.01 0.76 0.45
AZ 8.29 2.4 3.41 0.00 0.01 0.00 8.66 0.00
CA 42.19 2.9 14.76 0.00 0.01 0.00 8.35 0.00
CO 3.47 4.6 0.76 0.45 0.03 0.01 2.34 0.02
CT -0.91 3.2 -0.28 0.78 0.09 0.01 16.10 0.00
DC -0.63 3.8 -0.17 0.87 0.06 0.04 1.67 0.10
DE -1.23 3.7 -0.33 0.74 0.05 0.03 1.71 0.09
FL 25.72 2.3 11.10 0.00 0.01 0.00 9.76 0.00
GA 30.44 3.3 9.25 0.00 0.00 0.00 -1.21 0.23
GU -0.01 3.1 0.00 1.00 0.02 0.80 0.02 0.98
HI 0.01 3.0 0.00 1.00 0.02 0.22 0.09 0.93
IA 2.39 4.5 0.54 0.59 0.01 0.01 1.13 0.26
ID 0.83 2.6 0.32 0.75 0.00 0.02 0.15 0.88
IL 6.17 3.8 1.62 0.10 0.04 0.00 16.79 0.00
IN -2.64 5.2 -0.51 0.61 0.06 0.01 5.37 0.00
KS 2.58 3.5 0.75 0.45 0.00 0.01 0.34 0.73
KY 4.34 4.0 1.10 0.27 0.01 0.02 0.48 0.63
LA 24.82 3.2 7.80 0.00 0.01 0.00 1.76 0.08
MA 10.31 3.3 3.14 0.00 0.06 0.00 23.26 0.00
MD 2.37 4.1 0.57 0.57 0.04 0.01 6.94 0.00
ME 0.71 6.9 0.10 0.92 0.01 0.22 0.05 0.96
MI -25.94 4.0 -6.47 0.00 0.13 0.01 22.13 0.00
MN 5.53 3.7 1.49 0.14 0.02 0.01 2.52 0.01
MO 9.42 3.8 2.47 0.01 0.00 0.01 0.06 0.95
MP 0.00 NaN NaN NaN 0.00 NaN NaN NaN
MS 7.49 3.4 2.18 0.03 0.01 0.01 1.56 0.12
MT 0.13 2.5 0.05 0.96 0.01 0.10 0.13 0.90
NC 10.09 3.2 3.15 0.00 0.00 0.00 1.34 0.18
ND 0.43 3.7 0.12 0.91 0.01 0.09 0.07 0.94
NE 1.22 3.4 0.36 0.72 0.01 0.02 0.45 0.65
NH 0.17 4.9 0.04 0.97 0.06 0.08 0.80 0.43
NJ 44.22 2.9 15.28 0.00 0.06 0.00 38.26 0.00
NM 2.94 5.0 0.59 0.56 0.02 0.03 0.47 0.64
NV 4.56 2.7 1.68 0.09 0.00 0.01 0.38 0.70
NY -35.16 2.8 -12.78 0.00 0.07 0.00 118.85 0.00
OH 22.46 4.2 5.31 0.00 0.01 0.01 1.04 0.30
OK 3.87 2.8 1.38 0.17 0.00 0.01 -0.07 0.95
OR 1.78 3.2 0.55 0.58 0.00 0.02 0.16 0.87
PA -1.69 4.2 -0.40 0.69 0.08 0.00 16.31 0.00
PR 1.76 3.6 0.48 0.63 0.00 0.03 -0.09 0.93
RI 4.58 3.7 1.23 0.22 0.03 0.02 1.54 0.12
SC 5.45 2.8 1.93 0.05 0.01 0.00 2.54 0.01
SD 0.48 4.0 0.12 0.91 0.01 0.06 0.14 0.89
TN 3.85 3.1 1.22 0.22 0.01 0.00 1.37 0.17
TX 13.96 2.7 5.21 0.00 0.01 0.00 10.71 0.00
UT 0.98 3.2 0.30 0.76 0.00 0.01 0.40 0.69
VA 6.17 4.0 1.55 0.12 0.02 0.01 2.93 0.00
VI 0.07 2.7 0.03 0.98 -0.01 0.78 -0.01 0.99
VT 0.13 2.8 0.05 0.96 0.03 0.18 0.17 0.87
WA 8.57 3.8 2.27 0.02 0.01 0.01 0.47 0.64
WI 7.75 4.3 1.81 0.07 0.00 0.01 -0.17 0.86
WV 0.84 3.0 0.28 0.78 0.00 0.07 -0.01 0.99
WY 0.26 4.1 0.06 0.95 0.00 0.20 -0.01 0.99

Using Autoregressive Models