Finding the best predictor for the number of deceases in the COVID-19 pandemic in Spain

Introduction

Predicting the worst possible outcome in this crisis can be done in a number of ways. The first is using the dynamics of the deceases time series, but that has got its own uncertainties and it’s never possible to find out too precise values. It’s convenient to use different predictors for that. Given that the decease usually follows a course, starting with detection, followed by hospitalization, then by moving to an ICU, those three quantities might predict, several days in advance, the number of deaths that are going to happen today. For several days starting in March 30th, the “peak deceases” was announced; hoever, April 2nd saw the highest number so far, so it’s impossible or very difficult to predict what’s going to happen from now on. This is the main reason why we try to use time series correlations to find out the value on April 3rd.

Methodology

We will mainly use the cross-correlation function ccf of the language R. The Spanish ministry of Health publishes daily accumulated values of new hospital and ICU check-ins, as well as new cases. We will discard this one initially, since testing is done patchily, if at all, so it is the quantity with the biggest uncertainty of them all.

Looking at the cross-correlation in those time series, we will try to find the day before decease that best predicts them, and them fit a linear model to it. A good fit will indicate a good predictor, the greater the lag, the better. After today’s values are published, we will also try and find out which one has been the closer.

Cross-correlation, hospital and ICU cases.

We will plot here the cross-correlation between new daily cases and new deceases. Below the cross-correlation between new cases in hospital and daily deceases.

ccf(data$Hospitalizaciones.nuevas,data$Fallecimientos.nuevos,na.action = na.pass,lag.max = 20)

Main problem with this series is that there are so few cases published. In fact, until today there were not a sufficient amount of them; they go back only to March 14th. It, however, exhibits a peak five days before decease. We will use that date to create the model.

ccf(data$Uci.nuevos,data$Fallecimientos.nuevos,na.action = na.pass,lag.max = 28)

There’s a negative cross-correlation for 11 days before, probably indicating that early check-in into the ICU ward will probably avoid premature deaths; there’s also a positive cross-correlation which is stronger 6 days before. We will try and use both for the models.

Fitting linear models

We will first try to compute correlations betweeh hospital check ins 5 days in advance.

data <- data %>% mutate(H.plus.6 = lag(Hospitalizaciones.nuevas,6))
ggplot(data,aes(x=H.plus.6,y=Fallecimientos.nuevos))+ geom_point()+geom_smooth(method='lm', formula= y~x)+theme_tufte()

Although the fit seems to be adequate, there are so few data points that it will probably be useless. Let’s try to fit a linear model anyway.

H.lm <- lm(Fallecimientos.nuevos  ~ H.plus.6 , data=data)
summary(H.lm)

## 
## Call:
## lm(formula = Fallecimientos.nuevos ~ H.plus.6, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -221.95  -51.09   15.65   76.66  112.87 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 553.00647  143.27282   3.860  0.00621 **
## H.plus.6      0.07312    0.03825   1.912  0.09751 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 110 on 7 degrees of freedom
##   (30 observations deleted due to missingness)
## Multiple R-squared:  0.343,  Adjusted R-squared:  0.2492 
## F-statistic: 3.655 on 1 and 7 DF,  p-value: 0.09751

The slope has a bad p-value; the intercept, however, is rather good.

data <- data %>% mutate(H.plus.7 = lag(Hospitalizaciones.nuevas,7))
H.lm.7 <- lm(  Fallecimientos.nuevos ~ H.plus.7, data=data)
summary(H.lm.7)

## 
## Call:
## lm(formula = Fallecimientos.nuevos ~ H.plus.7, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.447 -27.975   8.397  28.569  50.264 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 672.04781   57.57622  11.672 2.38e-05 ***
## H.plus.7      0.05092    0.01571   3.242   0.0177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.58 on 6 degrees of freedom
##   (31 observations deleted due to missingness)
## Multiple R-squared:  0.6365, Adjusted R-squared:  0.5759 
## F-statistic: 10.51 on 1 and 6 DF,  p-value: 0.01765

The intercept is quite good, which might imply a very good prediction capability; however, the slope is at a 5% so it is not so valid. The problem is still that there’s very little data to work with, so this will vary every single day.

Let’s try next with ICU check-ins

data <- data %>% mutate(ICU.plus.17 = lag(Uci.nuevos,17))
ggplot(data,aes(x=ICU.plus.17,y=Fallecimientos.nuevos))+ geom_point()+geom_smooth(method='lm', formula= y~x)+theme_tufte()

Again, looks like a relatively good fit, although it seems to go in the opposite direction of what the cross-correlation would indicate. Also, not a lot of data.

ICU.lm.17 <- lm(  Fallecimientos.nuevos ~ ICU.plus.17 , data=data)
summary(ICU.lm.17)

## 
## Call:
## lm(formula = Fallecimientos.nuevos ~ ICU.plus.17, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -222.19  -89.58   12.22   83.88  179.48 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  609.755     74.301   8.207 3.63e-05 ***
## ICU.plus.17    3.215      1.177   2.732   0.0258 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 143.9 on 8 degrees of freedom
##   (29 observations deleted due to missingness)
## Multiple R-squared:  0.4826, Adjusted R-squared:  0.4179 
## F-statistic: 7.462 on 1 and 8 DF,  p-value: 0.02578

The fit is quite good at the intercept, not so good at the slope level. Anyway, let’s check for the next good fit, which should be at t -2.

data <- data %>% mutate(ICU.plus.2 = lag(Uci.nuevos,2))
ggplot(data,aes(x=ICU.plus.2,y=Fallecimientos.nuevos))+ geom_point()+geom_smooth(method='lm', formula= y~x)+theme_tufte()

The fit looks a bit better than before, although most data points fall outside the confidence interval. Let’s try the fit.

ICU.lm.2 <- lm( Fallecimientos.nuevos ~ ICU.plus.2, data=data)
summary(ICU.lm.2)

## 
## Call:
## lm(formula = Fallecimientos.nuevos ~ ICU.plus.2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -363.91  -89.09  -38.44   85.74  442.61 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  56.4830    59.3002   0.952    0.351    
## ICU.plus.2    1.6336     0.2007   8.141 3.18e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 183.2 on 23 degrees of freedom
##   (14 observations deleted due to missingness)
## Multiple R-squared:  0.7424, Adjusted R-squared:  0.7312 
## F-statistic: 66.27 on 1 and 23 DF,  p-value: 3.176e-08

The fit is now not so good.

Conclusions

In “Past the peak”, the logistic growth model of deseases, using data published April 2nd, predicted 802 deceases today April 3rd. Published data has been above that. Once data for today has been published, and all models re-evaluated, there’s no model that gives a good fith with a p-value below 0.001 for both slope and intercept. It’s not possible, then, to give good estimations of new values for deceases for the time being.

Acknowledgement

This report uses data provided by Datadista and has a free license; it can be regenerated with data and scripts from this repository.