Daily US COVID deaths as an Autoregression

Discussion Week 3 DATA621

Bonnie Cooper

A quick and very simple application of autoregression to model the US Daily COVID Deaths

start with taking a look at the dataset us_covid19_daily.csv which is freely available for download from kaggle.

library( dplyr )
library( ggplot2 )
df <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/us_covid19_daily.csv')
df <- transform(df, date = as.Date(as.character(date), "%Y%m%d"))
glimpse( df )
## Rows: 320
## Columns: 25
## $ date                     <date> 2020-12-06, 2020-12-05, 2020-12-04, 2020-12…
## $ states                   <int> 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, …
## $ positive                 <int> 14534035, 14357264, 14146191, 13921360, 1371…
## $ negative                 <int> 161986294, 160813704, 159286709, 158026052, …
## $ pending                  <int> 13592, 13433, 12714, 15106, 14368, 8764, 148…
## $ hospitalizedCurrently    <int> 101487, 101190, 101276, 100755, 100322, 9877…
## $ hospitalizedCumulative   <int> 585676, 583420, 580104, 575452, 570121, 5650…
## $ inIcuCurrently           <int> 20145, 19950, 19858, 19723, 19680, 19295, 18…
## $ inIcuCumulative          <int> 31946, 31831, 31608, 31276, 31038, 30749, 30…
## $ onVentilatorCurrently    <int> 7094, 7005, 6999, 6867, 6855, 6649, 6520, 62…
## $ onVentilatorCumulative   <int> 3322, 3321, 3305, 3280, 3252, 3223, 3205, 31…
## $ recovered                <int> 5624444, 5576026, 5470389, 5404018, 5322128,…
## $ dateChecked              <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00…
## $ death                    <int> 273374, 272236, 269791, 267228, 264522, 2617…
## $ hospitalized             <int> 585676, 583420, 580104, 575452, 570121, 5650…
## $ totalTestResults         <int> 204063869, 202429337, 200259581, 198404712, …
## $ lastModified             <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00…
## $ total                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ posNeg                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ deathIncrease            <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 80…
## $ hospitalizedIncrease     <int> 2256, 3316, 4652, 5331, 5028, 5222, 3394, 24…
## $ negativeIncrease         <int> 1172590, 1526995, 1260657, 1238465, 982032, …
## $ positiveIncrease         <int> 176771, 211073, 224831, 210204, 195796, 1767…
## $ totalTestResultsIncrease <int> 1634532, 2169756, 1854869, 1828230, 1459202,…
## $ hash                     <chr> "9cf16504f91958e803a2197daf8c2528a4eddc18", …

When we visualize the daily increase in deaths, the weekly cyclical behavior of the data is obvious:

COVID_ddp <- ggplot( df, aes( x = date, y = deathIncrease ) ) +
  geom_line() +
  theme_classic() +
  ggtitle( 'Daily US COVID deaths' ) +
  ylab( 'daily deaths' )
COVID_ddp

To try to capture this behavior I apply the same autoregression method from LMR section 4.3.

lagdeaths <- embed( df$deathIncrease, 9 )
colnames( lagdeaths ) <- c( 'death', paste0('lag',1:8) )
lagdeaths <- data.frame( lagdeaths )
glimpse( lagdeaths )
## Rows: 312
## Columns: 9
## $ death <int> 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, 1864,…
## $ lag1  <int> 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, …
## $ lag2  <int> 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, …
## $ lag3  <int> 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, …
## $ lag4  <int> 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, …
## $ lag5  <int> 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955,…
## $ lag6  <int> 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066…
## $ lag7  <int> 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289…
## $ lag8  <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336…

Fit a linear model that incorporates the lagged values as an autoregressive process

armod <- lm( death ~ lag1 + lag7 + lag8, lagdeaths )
summary( armod )
## 
## Call:
## lm(formula = death ~ lag1 + lag7 + lag8, data = lagdeaths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -694.82  -74.66  -23.83  106.92 1020.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.50411   20.85922   1.271    0.205    
## lag1         0.68327    0.04218  16.200  < 2e-16 ***
## lag7         0.68466    0.03791  18.060  < 2e-16 ***
## lag8        -0.41587    0.04849  -8.577 4.83e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 212.6 on 308 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8841 
## F-statistic: 791.4 on 3 and 308 DF,  p-value: < 2.2e-16

All three of the lagged variables are strongly significant and the \(R^2\) value suggests that the model accounts for for quite a bit of the variance in the data.

Now to visualize the model fit with the data:

death_lag <- df[9:320,]
death_lag[ 'predict' ] <- predict( armod ) 
COVID_ddp_ar <- COVID_ddp +
  geom_line( aes( x = date, y = predict, color = 'red' ), data = death_lag, show.legend = FALSE )
COVID_ddp_ar

The the autoregressive linear model prediction follows the envelope of the data reasonably well.