Attempting autoregression to model the US Daily COVID Deaths.
Will demonstrate an Ordinary Least Squares (OLS) approach and a Generalized Least Squares (GLS) models.
How can I improve my approach here? ….
Importing the data
start with taking a look at the dataset us_covid19_daily.csv which is freely available for download from kaggle.
library( dplyr )
library( ggplot2 )
library( nlme )
df <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/us_covid19_daily.csv')
df <- transform(df, date = as.Date(as.character(date), "%Y%m%d"))
glimpse( df )## Rows: 320
## Columns: 25
## $ date <date> 2020-12-06, 2020-12-05, 2020-12-04, 2020-12-…
## $ states <int> 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 5…
## $ positive <int> 14534035, 14357264, 14146191, 13921360, 13711…
## $ negative <int> 161986294, 160813704, 159286709, 158026052, 1…
## $ pending <int> 13592, 13433, 12714, 15106, 14368, 8764, 1488…
## $ hospitalizedCurrently <int> 101487, 101190, 101276, 100755, 100322, 98777…
## $ hospitalizedCumulative <int> 585676, 583420, 580104, 575452, 570121, 56509…
## $ inIcuCurrently <int> 20145, 19950, 19858, 19723, 19680, 19295, 188…
## $ inIcuCumulative <int> 31946, 31831, 31608, 31276, 31038, 30749, 304…
## $ onVentilatorCurrently <int> 7094, 7005, 6999, 6867, 6855, 6649, 6520, 624…
## $ onVentilatorCumulative <int> 3322, 3321, 3305, 3280, 3252, 3223, 3205, 318…
## $ recovered <int> 5624444, 5576026, 5470389, 5404018, 5322128, …
## $ dateChecked <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00Z…
## $ death <int> 273374, 272236, 269791, 267228, 264522, 26178…
## $ hospitalized <int> 585676, 583420, 580104, 575452, 570121, 56509…
## $ totalTestResults <int> 204063869, 202429337, 200259581, 198404712, 1…
## $ lastModified <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00Z…
## $ total <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ posNeg <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deathIncrease <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 803…
## $ hospitalizedIncrease <int> 2256, 3316, 4652, 5331, 5028, 5222, 3394, 242…
## $ negativeIncrease <int> 1172590, 1526995, 1260657, 1238465, 982032, 1…
## $ positiveIncrease <int> 176771, 211073, 224831, 210204, 195796, 17675…
## $ totalTestResultsIncrease <int> 1634532, 2169756, 1854869, 1828230, 1459202, …
## $ hash <chr> "9cf16504f91958e803a2197daf8c2528a4eddc18", "…
When we visualize the daily increase in deaths, the weekly cyclical behavior of the data is obvious:
COVID_ddp <- ggplot( df, aes( x = date, y = deathIncrease ) ) +
geom_line() +
theme_classic() +
ggtitle( 'Daily US COVID deaths' ) +
ylab( 'daily COVID deaths' )
COVID_ddpDaily COVID deaths in the US
To try to capture this behavior I apply the same autoregression method from LMR section 4.3.
lagdeaths <- embed( df$deathIncrease, 9 )
colnames( lagdeaths ) <- c( 'death', paste0('lag',1:8) )
lagdeaths <- data.frame( lagdeaths )
glimpse( lagdeaths )## Rows: 312
## Columns: 9
## $ death <int> 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, 1864, …
## $ lag1 <int> 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, 1…
## $ lag2 <int> 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1…
## $ lag3 <int> 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1…
## $ lag4 <int> 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1…
## $ lag5 <int> 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, …
## $ lag6 <int> 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066,…
## $ lag7 <int> 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289,…
## $ lag8 <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336,…
Autoreggressive Model Fit
Fit a linear model that incorporates the lagged values as an autoregressive process
armod <- lm( death ~ lag1 + lag7 + lag8, lagdeaths )
summary( armod )##
## Call:
## lm(formula = death ~ lag1 + lag7 + lag8, data = lagdeaths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -694.82 -74.66 -23.83 106.92 1020.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.50411 20.85922 1.271 0.205
## lag1 0.68327 0.04218 16.200 < 2e-16 ***
## lag7 0.68466 0.03791 18.060 < 2e-16 ***
## lag8 -0.41587 0.04849 -8.577 4.83e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 212.6 on 308 degrees of freedom
## Multiple R-squared: 0.8852, Adjusted R-squared: 0.8841
## F-statistic: 791.4 on 3 and 308 DF, p-value: < 2.2e-16
All three of the lagged variables are strongly significant and the \(R^2\) value suggests that the model accounts for for quite a bit of the variance in the data.
Now to visualize the model fit with the data:
death_lag <- df[9:320,]
death_lag[ 'predict' ] <- predict( armod )
COVID_ddp_ar <- COVID_ddp +
geom_line( aes( x = date, y = predict, color = 'red' ), data = death_lag, show.legend = FALSE )
COVID_ddp_arAutoregression of Daily Death that considers several past time points. OLS = red, data = black
The the autoregressive linear model prediction follows the envelope of the data reasonably well.
OLS Model Fit
armod_ols <- lm( deathIncrease ~ hospitalizedCurrently +
inIcuCurrently + onVentilatorCurrently +
hospitalizedIncrease + negativeIncrease + positiveIncrease +
totalTestResultsIncrease + positive + negative + recovered +
death + hospitalized + totalTestResults,
data = df )
summary( armod_ols )##
## Call:
## lm(formula = deathIncrease ~ hospitalizedCurrently + inIcuCurrently +
## onVentilatorCurrently + hospitalizedIncrease + negativeIncrease +
## positiveIncrease + totalTestResultsIncrease + positive +
## negative + recovered + death + hospitalized + totalTestResults,
## data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -830.77 -189.71 3.03 167.68 1042.21
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.017e+02 1.182e+02 2.553 0.01129 *
## hospitalizedCurrently -7.275e-03 5.157e-03 -1.411 0.15967
## inIcuCurrently 1.506e-01 2.335e-02 6.451 6.00e-10 ***
## onVentilatorCurrently 1.173e-01 4.320e-02 2.716 0.00708 **
## hospitalizedIncrease 4.972e-02 1.145e-02 4.342 2.08e-05 ***
## negativeIncrease -1.454e-03 5.385e-04 -2.701 0.00741 **
## positiveIncrease 1.008e-02 2.339e-03 4.310 2.38e-05 ***
## totalTestResultsIncrease 1.239e-03 5.108e-04 2.426 0.01599 *
## positive -7.594e-04 1.745e-04 -4.352 1.99e-05 ***
## negative 2.971e-04 3.828e-05 7.762 2.35e-13 ***
## recovered 3.443e-03 5.340e-04 6.447 6.12e-10 ***
## death 1.047e-02 7.222e-03 1.449 0.14849
## hospitalized -9.881e-03 4.118e-03 -2.399 0.01718 *
## totalTestResults -2.808e-04 3.288e-05 -8.538 1.53e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 287.9 on 242 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.7504, Adjusted R-squared: 0.737
## F-statistic: 55.98 on 13 and 242 DF, p-value: < 2.2e-16
df2 <- df[-armod_ols$na.action, ]
death_lag2 <- df2
death_lag2[ 'predict_ols' ] <- predict( armod_ols )
COVID_ddp_ols <- COVID_ddp +
geom_line( aes( x = date, y = predict ), data = death_lag, show.legend = FALSE, color = 'red', size = 0.2 ) +
geom_line( aes( x = date, y = predict_ols ), data = death_lag2, show.legend = FALSE, color = "blue", size = 1 )
COVID_ddp_olsStandard OLS linear regression fit for daily US COVID deaths. OLS = blue, autoreg = red, data = black
GLS Model Fit
armod_gls <- gls( deathIncrease ~ hospitalizedCurrently +
inIcuCurrently + onVentilatorCurrently +
hospitalizedIncrease + negativeIncrease + positiveIncrease +
totalTestResultsIncrease + positive + negative + recovered +
death + hospitalized + totalTestResults,
correlation = corAR1( form = ~ 1 ),
data = na.omit( df ) )
summary( armod_gls )## Generalized least squares fit by REML
## Model: deathIncrease ~ hospitalizedCurrently + inIcuCurrently + onVentilatorCurrently + hospitalizedIncrease + negativeIncrease + positiveIncrease + totalTestResultsIncrease + positive + negative + recovered + death + hospitalized + totalTestResults
## Data: na.omit(df)
## AIC BIC logLik
## 3678.884 3734.305 -1823.442
##
## Correlation Structure: AR(1)
## Formula: ~1
## Parameter estimate(s):
## Phi
## 0.4698509
##
## Coefficients:
## Value Std.Error t-value p-value
## (Intercept) 503.0936 280.27397 1.795007 0.0739
## hospitalizedCurrently -0.0036 0.00794 -0.452267 0.6515
## inIcuCurrently 0.1404 0.03443 4.078588 0.0001
## onVentilatorCurrently 0.1336 0.06160 2.168366 0.0311
## hospitalizedIncrease 0.0265 0.01028 2.579824 0.0105
## negativeIncrease -0.0009 0.00047 -1.873626 0.0622
## positiveIncrease 0.0128 0.00263 4.862226 0.0000
## totalTestResultsIncrease 0.0008 0.00046 1.622397 0.1061
## positive -0.0009 0.00030 -3.022331 0.0028
## negative 0.0003 0.00006 6.201739 0.0000
## recovered 0.0039 0.00084 4.608347 0.0000
## death 0.0183 0.01160 1.578837 0.1157
## hospitalized -0.0158 0.00686 -2.302848 0.0222
## totalTestResults -0.0003 0.00005 -6.630688 0.0000
##
## Correlation:
## (Intr) hsptlC inIcCr onVntC hsptlI ngtvIn pstvIn
## hospitalizedCurrently -0.339
## inIcuCurrently -0.209 -0.479
## onVentilatorCurrently 0.045 -0.053 -0.626
## hospitalizedIncrease 0.018 0.022 0.026 -0.071
## negativeIncrease 0.034 -0.153 0.104 0.023 -0.043
## positiveIncrease 0.105 -0.453 0.234 0.134 -0.101 0.405
## totalTestResultsIncrease -0.045 0.180 -0.108 -0.031 0.034 -0.934 -0.567
## positive 0.572 -0.568 -0.029 0.081 0.120 -0.035 -0.028
## negative -0.062 0.027 -0.104 0.555 -0.010 0.064 0.359
## recovered -0.350 0.530 0.013 -0.079 -0.050 0.029 0.040
## death 0.182 0.008 0.280 -0.243 0.316 0.055 0.175
## hospitalized -0.391 0.058 -0.159 0.085 -0.305 -0.066 -0.174
## totalTestResults 0.082 -0.070 0.069 -0.470 0.004 -0.047 -0.340
## ttlTRI positv negatv recvrd death hsptlz
## hospitalizedCurrently
## inIcuCurrently
## onVentilatorCurrently
## hospitalizedIncrease
## negativeIncrease
## positiveIncrease
## totalTestResultsIncrease
## positive 0.033
## negative -0.104 -0.116
## recovered -0.018 -0.840 0.285
## death -0.108 0.160 0.047 -0.039
## hospitalized 0.099 -0.330 -0.204 0.135 -0.949
## totalTestResults 0.081 0.181 -0.971 -0.442 -0.046 0.182
##
## Standardized residuals:
## Min Q1 Med Q3 Max
## -2.383223874 -0.672133333 -0.002322932 0.602509752 3.422973358
##
## Residual standard error: 306.3897
## Degrees of freedom: 250 total; 236 residual
death_lag3 <- na.omit( df )
death_lag3[ 'predict_gls' ] <- predict( armod_gls )
COVID_ddp_gls <- COVID_ddp +
geom_line( aes( x = date, y = predict ), data = death_lag, show.legend = FALSE, color = 'red', size = 0.2 ) +
geom_line( aes( x = date, y = predict_ols ), data = death_lag2, show.legend = FALSE, color = "blue", size = 0.2 ) +
geom_line( aes( x = date, y = predict_gls ), data = death_lag3, show.legend = FALSE, color = 'green', size = 1 )
COVID_ddp_glsGLS fit for daily US COVID deaths. GLS = green, OLS = blue, autoreg = red, data = black
…and that is where I’ve left off
What are some next steps?
- combine the lagged features from the 1st autoregression with the original data for another OLS approach
- am I using the
gls()function correctly?…will RTFM to see if I can make improvements with that approach - ….any siggestions?….
- comparing models