A quick and very simple application of autoregression to model the US Daily COVID Deaths
start with taking a look at the dataset us_covid19_daily.csv which is freely available for download from kaggle.
library( dplyr )
library( ggplot2 )
df <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/us_covid19_daily.csv')
df <- transform(df, date = as.Date(as.character(date), "%Y%m%d"))
glimpse( df )## Rows: 320
## Columns: 25
## $ date <date> 2020-12-06, 2020-12-05, 2020-12-04, 2020-12…
## $ states <int> 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, …
## $ positive <int> 14534035, 14357264, 14146191, 13921360, 1371…
## $ negative <int> 161986294, 160813704, 159286709, 158026052, …
## $ pending <int> 13592, 13433, 12714, 15106, 14368, 8764, 148…
## $ hospitalizedCurrently <int> 101487, 101190, 101276, 100755, 100322, 9877…
## $ hospitalizedCumulative <int> 585676, 583420, 580104, 575452, 570121, 5650…
## $ inIcuCurrently <int> 20145, 19950, 19858, 19723, 19680, 19295, 18…
## $ inIcuCumulative <int> 31946, 31831, 31608, 31276, 31038, 30749, 30…
## $ onVentilatorCurrently <int> 7094, 7005, 6999, 6867, 6855, 6649, 6520, 62…
## $ onVentilatorCumulative <int> 3322, 3321, 3305, 3280, 3252, 3223, 3205, 31…
## $ recovered <int> 5624444, 5576026, 5470389, 5404018, 5322128,…
## $ dateChecked <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00…
## $ death <int> 273374, 272236, 269791, 267228, 264522, 2617…
## $ hospitalized <int> 585676, 583420, 580104, 575452, 570121, 5650…
## $ totalTestResults <int> 204063869, 202429337, 200259581, 198404712, …
## $ lastModified <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00…
## $ total <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ posNeg <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ deathIncrease <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 80…
## $ hospitalizedIncrease <int> 2256, 3316, 4652, 5331, 5028, 5222, 3394, 24…
## $ negativeIncrease <int> 1172590, 1526995, 1260657, 1238465, 982032, …
## $ positiveIncrease <int> 176771, 211073, 224831, 210204, 195796, 1767…
## $ totalTestResultsIncrease <int> 1634532, 2169756, 1854869, 1828230, 1459202,…
## $ hash <chr> "9cf16504f91958e803a2197daf8c2528a4eddc18", …
When we visualize the daily increase in deaths, the weekly cyclical behavior of the data is obvious:
COVID_ddp <- ggplot( df, aes( x = date, y = deathIncrease ) ) +
geom_line() +
theme_classic() +
ggtitle( 'Daily US COVID deaths' ) +
ylab( 'daily deaths' )
COVID_ddpTo try to capture this behavior I apply the same autoregression method from LMR section 4.3.
lagdeaths <- embed( df$deathIncrease, 9 )
colnames( lagdeaths ) <- c( 'death', paste0('lag',1:8) )
lagdeaths <- data.frame( lagdeaths )
glimpse( lagdeaths )## Rows: 312
## Columns: 9
## $ death <int> 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, 1864,…
## $ lag1 <int> 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, …
## $ lag2 <int> 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, …
## $ lag3 <int> 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, …
## $ lag4 <int> 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, …
## $ lag5 <int> 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955,…
## $ lag6 <int> 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066…
## $ lag7 <int> 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289…
## $ lag8 <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336…
Fit a linear model that incorporates the lagged values as an autoregressive process
armod <- lm( death ~ lag1 + lag7 + lag8, lagdeaths )
summary( armod )##
## Call:
## lm(formula = death ~ lag1 + lag7 + lag8, data = lagdeaths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -694.82 -74.66 -23.83 106.92 1020.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.50411 20.85922 1.271 0.205
## lag1 0.68327 0.04218 16.200 < 2e-16 ***
## lag7 0.68466 0.03791 18.060 < 2e-16 ***
## lag8 -0.41587 0.04849 -8.577 4.83e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 212.6 on 308 degrees of freedom
## Multiple R-squared: 0.8852, Adjusted R-squared: 0.8841
## F-statistic: 791.4 on 3 and 308 DF, p-value: < 2.2e-16
All three of the lagged variables are strongly significant and the \(R^2\) value suggests that the model accounts for for quite a bit of the variance in the data.
Now to visualize the model fit with the data:
death_lag <- df[9:320,]
death_lag[ 'predict' ] <- predict( armod )
COVID_ddp_ar <- COVID_ddp +
geom_line( aes( x = date, y = predict, color = 'red' ), data = death_lag, show.legend = FALSE )
COVID_ddp_arThe the autoregressive linear model prediction follows the envelope of the data reasonably well.