Modelling Daily US COVID deaths

Discussion Week 11 DATA621

Bonnie Cooper

Attempting autoregression to model the US Daily COVID Deaths.
Will demonstrate an Ordinary Least Squares (OLS) approach and a Generalized Least Squares (GLS) models.
How can I improve my approach here? ….

Importing the data

start with taking a look at the dataset us_covid19_daily.csv which is freely available for download from kaggle.

library( dplyr )
library( ggplot2 )
library( nlme )
df <- read.csv( '/home/bonzilla/Documents/MSDS/DATA621/us_covid19_daily.csv')
df <- transform(df, date = as.Date(as.character(date), "%Y%m%d"))
glimpse( df )
## Rows: 320
## Columns: 25
## $ date                     <date> 2020-12-06, 2020-12-05, 2020-12-04, 2020-12-…
## $ states                   <int> 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 56, 5…
## $ positive                 <int> 14534035, 14357264, 14146191, 13921360, 13711…
## $ negative                 <int> 161986294, 160813704, 159286709, 158026052, 1…
## $ pending                  <int> 13592, 13433, 12714, 15106, 14368, 8764, 1488…
## $ hospitalizedCurrently    <int> 101487, 101190, 101276, 100755, 100322, 98777…
## $ hospitalizedCumulative   <int> 585676, 583420, 580104, 575452, 570121, 56509…
## $ inIcuCurrently           <int> 20145, 19950, 19858, 19723, 19680, 19295, 188…
## $ inIcuCumulative          <int> 31946, 31831, 31608, 31276, 31038, 30749, 304…
## $ onVentilatorCurrently    <int> 7094, 7005, 6999, 6867, 6855, 6649, 6520, 624…
## $ onVentilatorCumulative   <int> 3322, 3321, 3305, 3280, 3252, 3223, 3205, 318…
## $ recovered                <int> 5624444, 5576026, 5470389, 5404018, 5322128, …
## $ dateChecked              <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00Z…
## $ death                    <int> 273374, 272236, 269791, 267228, 264522, 26178…
## $ hospitalized             <int> 585676, 583420, 580104, 575452, 570121, 56509…
## $ totalTestResults         <int> 204063869, 202429337, 200259581, 198404712, 1…
## $ lastModified             <chr> "2020-12-06T24:00:00Z", "2020-12-05T24:00:00Z…
## $ total                    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ posNeg                   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deathIncrease            <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 803…
## $ hospitalizedIncrease     <int> 2256, 3316, 4652, 5331, 5028, 5222, 3394, 242…
## $ negativeIncrease         <int> 1172590, 1526995, 1260657, 1238465, 982032, 1…
## $ positiveIncrease         <int> 176771, 211073, 224831, 210204, 195796, 17675…
## $ totalTestResultsIncrease <int> 1634532, 2169756, 1854869, 1828230, 1459202, …
## $ hash                     <chr> "9cf16504f91958e803a2197daf8c2528a4eddc18", "…

When we visualize the daily increase in deaths, the weekly cyclical behavior of the data is obvious:

COVID_ddp <- ggplot( df, aes( x = date, y = deathIncrease ) ) +
  geom_line() +
  theme_classic() +
  ggtitle( 'Daily US COVID deaths' ) +
  ylab( 'daily COVID deaths' )
COVID_ddp
Daily COVID deaths in the US

Daily COVID deaths in the US

To try to capture this behavior I apply the same autoregression method from LMR section 4.3.

lagdeaths <- embed( df$deathIncrease, 9 )
colnames( lagdeaths ) <- c( 'death', paste0('lag',1:8) )
lagdeaths <- data.frame( lagdeaths )
glimpse( lagdeaths )
## Rows: 312
## Columns: 9
## $ death <int> 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, 1864, …
## $ lag1  <int> 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1967, 1…
## $ lag2  <int> 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1884, 1…
## $ lag3  <int> 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1512, 1…
## $ lag4  <int> 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, 898, 1…
## $ lag5  <int> 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066, 955, …
## $ lag6  <int> 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289, 2066,…
## $ lag7  <int> 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336, 2289,…
## $ lag8  <int> 1138, 2445, 2563, 2706, 2733, 2473, 1136, 803, 1245, 1372, 1336,…

Autoreggressive Model Fit

Fit a linear model that incorporates the lagged values as an autoregressive process

armod <- lm( death ~ lag1 + lag7 + lag8, lagdeaths )
summary( armod )
## 
## Call:
## lm(formula = death ~ lag1 + lag7 + lag8, data = lagdeaths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -694.82  -74.66  -23.83  106.92 1020.08 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.50411   20.85922   1.271    0.205    
## lag1         0.68327    0.04218  16.200  < 2e-16 ***
## lag7         0.68466    0.03791  18.060  < 2e-16 ***
## lag8        -0.41587    0.04849  -8.577 4.83e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 212.6 on 308 degrees of freedom
## Multiple R-squared:  0.8852, Adjusted R-squared:  0.8841 
## F-statistic: 791.4 on 3 and 308 DF,  p-value: < 2.2e-16

All three of the lagged variables are strongly significant and the \(R^2\) value suggests that the model accounts for for quite a bit of the variance in the data.

Now to visualize the model fit with the data:

death_lag <- df[9:320,]
death_lag[ 'predict' ] <- predict( armod ) 
COVID_ddp_ar <- COVID_ddp +
  geom_line( aes( x = date, y = predict, color = 'red' ), data = death_lag, show.legend = FALSE )
COVID_ddp_ar
Autoregression of Daily Death that considers several past time points. OLS = red, data = black

Autoregression of Daily Death that considers several past time points. OLS = red, data = black

The the autoregressive linear model prediction follows the envelope of the data reasonably well.

OLS Model Fit

armod_ols <- lm( deathIncrease ~ hospitalizedCurrently + 
                    inIcuCurrently + onVentilatorCurrently +
                    hospitalizedIncrease + negativeIncrease + positiveIncrease +
                    totalTestResultsIncrease + positive + negative + recovered + 
                   death + hospitalized + totalTestResults, 
                  data = df )
summary( armod_ols )
## 
## Call:
## lm(formula = deathIncrease ~ hospitalizedCurrently + inIcuCurrently + 
##     onVentilatorCurrently + hospitalizedIncrease + negativeIncrease + 
##     positiveIncrease + totalTestResultsIncrease + positive + 
##     negative + recovered + death + hospitalized + totalTestResults, 
##     data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -830.77 -189.71    3.03  167.68 1042.21 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               3.017e+02  1.182e+02   2.553  0.01129 *  
## hospitalizedCurrently    -7.275e-03  5.157e-03  -1.411  0.15967    
## inIcuCurrently            1.506e-01  2.335e-02   6.451 6.00e-10 ***
## onVentilatorCurrently     1.173e-01  4.320e-02   2.716  0.00708 ** 
## hospitalizedIncrease      4.972e-02  1.145e-02   4.342 2.08e-05 ***
## negativeIncrease         -1.454e-03  5.385e-04  -2.701  0.00741 ** 
## positiveIncrease          1.008e-02  2.339e-03   4.310 2.38e-05 ***
## totalTestResultsIncrease  1.239e-03  5.108e-04   2.426  0.01599 *  
## positive                 -7.594e-04  1.745e-04  -4.352 1.99e-05 ***
## negative                  2.971e-04  3.828e-05   7.762 2.35e-13 ***
## recovered                 3.443e-03  5.340e-04   6.447 6.12e-10 ***
## death                     1.047e-02  7.222e-03   1.449  0.14849    
## hospitalized             -9.881e-03  4.118e-03  -2.399  0.01718 *  
## totalTestResults         -2.808e-04  3.288e-05  -8.538 1.53e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 287.9 on 242 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.7504, Adjusted R-squared:  0.737 
## F-statistic: 55.98 on 13 and 242 DF,  p-value: < 2.2e-16
df2 <- df[-armod_ols$na.action, ]
death_lag2 <- df2
death_lag2[ 'predict_ols' ] <- predict( armod_ols ) 
COVID_ddp_ols <- COVID_ddp +
  geom_line( aes( x = date, y = predict ), data = death_lag, show.legend = FALSE, color = 'red', size = 0.2 ) +
  geom_line( aes( x = date, y = predict_ols ), data = death_lag2, show.legend = FALSE, color = "blue", size = 1 ) 
COVID_ddp_ols
Standard OLS linear regression fit for daily US COVID deaths. OLS = blue, autoreg = red, data = black

Standard OLS linear regression fit for daily US COVID deaths. OLS = blue, autoreg = red, data = black

GLS Model Fit

armod_gls <- gls( deathIncrease ~ hospitalizedCurrently + 
                    inIcuCurrently + onVentilatorCurrently +
                    hospitalizedIncrease + negativeIncrease + positiveIncrease +
                    totalTestResultsIncrease + positive + negative + recovered + 
                    death + hospitalized + totalTestResults, 
                  correlation = corAR1( form = ~ 1 ),
                  data = na.omit( df ) )
summary( armod_gls )
## Generalized least squares fit by REML
##   Model: deathIncrease ~ hospitalizedCurrently + inIcuCurrently + onVentilatorCurrently +      hospitalizedIncrease + negativeIncrease + positiveIncrease +      totalTestResultsIncrease + positive + negative + recovered +      death + hospitalized + totalTestResults 
##   Data: na.omit(df) 
##        AIC      BIC    logLik
##   3678.884 3734.305 -1823.442
## 
## Correlation Structure: AR(1)
##  Formula: ~1 
##  Parameter estimate(s):
##       Phi 
## 0.4698509 
## 
## Coefficients:
##                             Value Std.Error   t-value p-value
## (Intercept)              503.0936 280.27397  1.795007  0.0739
## hospitalizedCurrently     -0.0036   0.00794 -0.452267  0.6515
## inIcuCurrently             0.1404   0.03443  4.078588  0.0001
## onVentilatorCurrently      0.1336   0.06160  2.168366  0.0311
## hospitalizedIncrease       0.0265   0.01028  2.579824  0.0105
## negativeIncrease          -0.0009   0.00047 -1.873626  0.0622
## positiveIncrease           0.0128   0.00263  4.862226  0.0000
## totalTestResultsIncrease   0.0008   0.00046  1.622397  0.1061
## positive                  -0.0009   0.00030 -3.022331  0.0028
## negative                   0.0003   0.00006  6.201739  0.0000
## recovered                  0.0039   0.00084  4.608347  0.0000
## death                      0.0183   0.01160  1.578837  0.1157
## hospitalized              -0.0158   0.00686 -2.302848  0.0222
## totalTestResults          -0.0003   0.00005 -6.630688  0.0000
## 
##  Correlation: 
##                          (Intr) hsptlC inIcCr onVntC hsptlI ngtvIn pstvIn
## hospitalizedCurrently    -0.339                                          
## inIcuCurrently           -0.209 -0.479                                   
## onVentilatorCurrently     0.045 -0.053 -0.626                            
## hospitalizedIncrease      0.018  0.022  0.026 -0.071                     
## negativeIncrease          0.034 -0.153  0.104  0.023 -0.043              
## positiveIncrease          0.105 -0.453  0.234  0.134 -0.101  0.405       
## totalTestResultsIncrease -0.045  0.180 -0.108 -0.031  0.034 -0.934 -0.567
## positive                  0.572 -0.568 -0.029  0.081  0.120 -0.035 -0.028
## negative                 -0.062  0.027 -0.104  0.555 -0.010  0.064  0.359
## recovered                -0.350  0.530  0.013 -0.079 -0.050  0.029  0.040
## death                     0.182  0.008  0.280 -0.243  0.316  0.055  0.175
## hospitalized             -0.391  0.058 -0.159  0.085 -0.305 -0.066 -0.174
## totalTestResults          0.082 -0.070  0.069 -0.470  0.004 -0.047 -0.340
##                          ttlTRI positv negatv recvrd death  hsptlz
## hospitalizedCurrently                                             
## inIcuCurrently                                                    
## onVentilatorCurrently                                             
## hospitalizedIncrease                                              
## negativeIncrease                                                  
## positiveIncrease                                                  
## totalTestResultsIncrease                                          
## positive                  0.033                                   
## negative                 -0.104 -0.116                            
## recovered                -0.018 -0.840  0.285                     
## death                    -0.108  0.160  0.047 -0.039              
## hospitalized              0.099 -0.330 -0.204  0.135 -0.949       
## totalTestResults          0.081  0.181 -0.971 -0.442 -0.046  0.182
## 
## Standardized residuals:
##          Min           Q1          Med           Q3          Max 
## -2.383223874 -0.672133333 -0.002322932  0.602509752  3.422973358 
## 
## Residual standard error: 306.3897 
## Degrees of freedom: 250 total; 236 residual
death_lag3 <- na.omit( df )
death_lag3[ 'predict_gls' ] <- predict( armod_gls ) 
COVID_ddp_gls <- COVID_ddp +
  geom_line( aes( x = date, y = predict ), data = death_lag, show.legend = FALSE, color = 'red', size = 0.2 ) +
  geom_line( aes( x = date, y = predict_ols ), data = death_lag2, show.legend = FALSE, color = "blue", size = 0.2 ) +  
  geom_line( aes( x = date, y = predict_gls ), data = death_lag3, show.legend = FALSE, color = 'green', size = 1 ) 
COVID_ddp_gls
GLS fit for daily US COVID deaths. GLS = green, OLS = blue, autoreg = red, data = black

GLS fit for daily US COVID deaths. GLS = green, OLS = blue, autoreg = red, data = black

…and that is where I’ve left off

What are some next steps?

  1. combine the lagged features from the 1st autoregression with the original data for another OLS approach
  2. am I using the gls() function correctly?…will RTFM to see if I can make improvements with that approach
  3. ….any siggestions?….
  4. comparing models