This document evaluates the relationship between the age of deceased COVID-19 patients and the time from symptom onset to death. The hypothesis is that older people (60+, the cutoff used by the UN) tend to have shorter periods from the onset of the disease to death. This has direct implications for the average time spent in intensive care, which in turn affects the burden on medical facilities.

The data set contains 3,074 reported deaths due to COVID-19 in Colombia as of June 29th, 2020; the inferential analysis uses the 3,007 deaths among patients older than 30.

library(ggplot2)
library(dplyr)
library(lubridate)
library(car)

Exploratory Analysis

# Days elapsed between symptom onset and death, plus 10-year age bands
# (the last band, (90,110], gathers the oldest patients).
covid.deaths <- covid.deaths %>%
  mutate(
    death_days = as.numeric(difftime(death_date, symptoms_date, units = "days")),
    age_group = cut(age,
                    breaks = c(seq(0, 90, 10), 110),
                    include.lowest = TRUE))
summary(covid.deaths)
   city_code                    city                   department 
 8001   :635   Barranquilla       :635   Barranquilla D.E.  :635  
 11001  :610   Bogotá D.C.        :610   Bogotá D.C.        :610  
 13001  :300   Cartagena de Indias:300   Atlántico          :426  
 8758   :259   Soledad            :259   Valle del Cauca    :308  
 76001  :252   Cali               :252   Cartagena D.T. y C.:300  
 91001  : 86   Leticia            : 86   Nariño             :104  
 (Other):932   (Other)            :932   (Other)            :691  
        attention    gender       infection_type          status    
 Casa        :   0   F:1191   En estudio :2823   Asintomático:   0  
 Fallecido   :3074   M:1883   Importado  :  20   Fallecido   :3074  
 Hospital    :   0            Relacionado: 231   Grave       :   0  
 Hospital UCI:   0                               Leve        :   0  
 Recuperado  :   0                               Moderado    :   0  
                                                                    
                                                                    
           country     recuperation_type department_code  country_code 
 ESTADOS UNIDOS:   5   PCR   :   0       8      :1061    840    :   5  
 BRASIL        :   4   Tiempo:   0       11     : 610    76     :   4  
 ESPAÑA        :   4   NA's  :3074       76     : 393    724    :   4  
 VENEZUELA     :   2                     13     : 340    862    :   2  
 ALEMANIA      :   1                     47     : 109    218    :   1  
 (Other)       :   6                     52     : 104    (Other):   4  
 NA's          :3052                     (Other): 457    NA's   :3054  
          race              tribe      notification_date   
 Otro       :2667   Por definir:  24   Min.   :2020-03-13  
 Negro      : 204   Tikuna     :  19   1st Qu.:2020-05-12  
 Por definir: 101   Cocama     :   3   Median :2020-05-31  
 Indígena   :  30   Paez       :   2   Mean   :2020-05-24  
 Rom        :   5   Pasto      :   2   3rd Qu.:2020-06-11  
 (Other)    :   0   (Other)    :   6   Max.   :2020-06-27  
 NA's       :  67   NA's       :3018                       
 symptoms_date          death_date         diagnosis_date      
 Min.   :2020-03-01   Min.   :2020-03-16   Min.   :2020-03-16  
 1st Qu.:2020-05-07   1st Qu.:2020-05-21   1st Qu.:2020-05-18  
 Median :2020-05-25   Median :2020-06-07   Median :2020-06-06  
 Mean   :2020-05-18   Mean   :2020-05-31   Mean   :2020-05-30  
 3rd Qu.:2020-06-06   3rd Qu.:2020-06-17   3rd Qu.:2020-06-18  
 Max.   :2020-06-23   Max.   :2020-06-28   Max.   :2020-06-28  
                                           NA's   :55          
 recuperation_date  report_date               id             age        
 Min.   :NA        Min.   :2020-03-16   Min.   :  152   Min.   :  0.00  
 1st Qu.:NA        1st Qu.:2020-05-19   1st Qu.:16390   1st Qu.: 58.00  
 Median :NA        Median :2020-06-07   Median :38109   Median : 69.00  
 Mean   :NA        Mean   :2020-05-30   Mean   :38349   Mean   : 67.42  
 3rd Qu.:NA        3rd Qu.:2020-06-18   3rd Qu.:57579   3rd Qu.: 79.00  
 Max.   :NA        Max.   :2020-06-28   Max.   :91418   Max.   :103.00  
 NA's   :3074                                                           
   dead           death_days       age_group  
 Mode:logical   Min.   : 0.00   (70,80] :748  
 TRUE:3074      1st Qu.: 6.00   (60,70] :746  
                Median :11.00   (80,90] :546  
                Mean   :12.77   (50,60] :452  
                3rd Qu.:17.00   (40,50] :270  
                Max.   :98.00   (90,110]:128  
                                (Other) :184  
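
As a minimal illustration of the two derived columns, the sketch below applies the same `difftime()` and `cut()` calls to a few made-up values (the dates and ages are hypothetical, not taken from the data set). It shows that `include.lowest = TRUE` closes the first interval on the left, so an age of 0 falls in `[0,10]`.

```r
# Hypothetical values, for illustration only
onset <- as.Date("2020-05-25")
death <- as.Date("2020-06-07")
days <- as.numeric(difftime(death, onset, units = "days"))
days  # 13

# Same breaks as in the analysis: 10-year bands up to 90, then (90,110]
ages <- c(0, 10, 11, 95)
groups <- cut(ages, breaks = c(seq(0, 90, 10), 110), include.lowest = TRUE)
as.character(groups)  # "[0,10]" "[0,10]" "(10,20]" "(90,110]"
```
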
covid.deaths %>% 
  ggplot(aes(x=age)) + 
  geom_histogram(bins=35, color="black", fill="lightgray") +
  geom_vline(aes(xintercept=median(age)), color="red", size=1) +
  ggtitle("Distribution of Deaths by age", "Median age: 69 years") +
  xlab("Age") +
  ylab("Deaths")

covid.deaths %>% 
  ggplot(aes(x=death_days)) + 
  geom_histogram(bins=50, color="black", fill="lightgray") + 
  geom_vline(aes(xintercept=median(death_days)), color="red", size=1) +
  ggtitle("Distribution of the number of days from onset to death", "Median days: 11") +
  xlab("Days") +
  ylab("Deaths")

Number of deaths per age group

covid.deaths %>% 
  group_by(age_group) %>% 
  tally()
## # A tibble: 10 x 2
##    age_group     n
##    <fct>     <int>
##  1 [0,10]       10
##  2 (10,20]       7
##  3 (20,30]      50
##  4 (30,40]     117
##  5 (40,50]     270
##  6 (50,60]     452
##  7 (60,70]     746
##  8 (70,80]     748
##  9 (80,90]     546
## 10 (90,110]    128

Because the youngest age groups (0-30) contain few cases (67 deaths in total), they are excluded from the rest of the analysis.

covid.deaths <- covid.deaths %>% 
  filter(age>30)
covid.deaths %>% 
  ggplot(aes(x=death_days, y=age_group, fill=age_group)) + 
  geom_violin(scale="count",  draw_quantiles = c(0.25, 0.5, 0.75)) +
  ggtitle("Distribution of the number of days from onset to death in each age group") +
  xlab("Days") +
  ylab("Age Group") +
  labs(fill='Age Group')

covid.deaths %>% 
  ggplot(aes(x=death_days, fill=age_group)) +
  geom_histogram(bins=50) + 
  facet_wrap(~age_group, scales="free_y") +
  ggtitle("Distribution of the number of days from onset to death in each age group") +
  xlab("Days") +
  ylab("Deaths") +
  labs(fill='Age Group')

All the distributions are positively skewed: most deaths occur within a few weeks of symptom onset, with a long right tail of longer durations.

Descriptive Statistics per Age Group

covid.deaths %>% 
  group_by(age_group) %>% 
  summarise(mean=mean(death_days), 
            median=median(death_days), 
            sd=sd(death_days), 
            n=n(), 
            se=sd(death_days)/sqrt(n()))
## # A tibble: 7 x 6
##   age_group  mean median    sd     n    se
##   <fct>     <dbl>  <dbl> <dbl> <int> <dbl>
## 1 (30,40]   12.7      11 10.5    117 0.973
## 2 (40,50]   13.3      11  8.41   270 0.512
## 3 (50,60]   13.6      12  9.11   452 0.429
## 4 (60,70]   13.8      12 10.2    746 0.372
## 5 (70,80]   12.8      11  9.78   748 0.358
## 6 (80,90]   10.9       9  8.15   546 0.349
## 7 (90,110]   8.91      7  7.17   128 0.634

Inferential Analysis

Levene’s test indicates that the variances are heterogeneous, but the differences are not very large: the maximum group variance is 2.15 times the minimum.

leveneTest(death_days ~ age_group, covid.deaths)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    6  4.1048 0.0004145 ***
##       3000                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
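
The variance ratio quoted above can be checked approximately from the standard deviations in the descriptive table. The sketch below uses the rounded values printed there, so it gives about 2.14 rather than exactly 2.15; computing it from the unrounded data would be needed to reproduce the quoted figure.

```r
# Standard deviations of death_days per age group, as printed (rounded)
sds <- c("(30,40]" = 10.5, "(40,50]" = 8.41, "(50,60]" = 9.11,
         "(60,70]" = 10.2, "(70,80]" = 9.78, "(80,90]" = 8.15,
         "(90,110]" = 7.17)
variances <- sds^2
# Ratio of the largest to the smallest group variance
ratio <- max(variances) / min(variances)
round(ratio, 2)  # 2.14
```
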

Because the normality assumption of the classical F test is not met, we use the non-parametric Kruskal-Wallis test to check whether at least one group tends to have different durations than the others.
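
The text does not show how normality was assessed. One common option, sketched here on simulated right-skewed data rather than on `covid.deaths`, is to run a Shapiro-Wilk test within each group. The `sim` data frame, its group labels, and the exponential distributions are assumptions made purely for illustration.

```r
set.seed(1)
# Simulated stand-in: right-skewed durations in two hypothetical groups
sim <- data.frame(
  death_days = c(rexp(200, rate = 1 / 12), rexp(200, rate = 1 / 9)),
  age_group  = rep(c("A", "B"), each = 200)
)
# Shapiro-Wilk p-value per group; small values indicate non-normality
p_values <- tapply(sim$death_days, sim$age_group,
                   function(x) shapiro.test(x)$p.value)
p_values
```

On the real data the same `tapply()` call would take `covid.deaths$death_days` and `covid.deaths$age_group` as its first two arguments.
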

kruskal.test(death_days ~ age_group, covid.deaths)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  death_days by age_group
## Kruskal-Wallis chi-squared = 69.706, df = 6, p-value = 4.698e-13

Since \(p<0.05\), we reject the null hypothesis that the time from onset to death has the same distribution in all age groups.
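
The reported p-value follows from the chi-squared approximation to the Kruskal-Wallis statistic, with degrees of freedom equal to the number of groups minus one (7 - 1 = 6). A quick check using the rounded statistic from the output:

```r
# Upper-tail probability of a chi-squared(6) variable at the observed statistic
p_value <- pchisq(69.706, df = 6, lower.tail = FALSE)
p_value  # approximately 4.7e-13
```
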

We now run pairwise Wilcoxon rank-sum tests with Bonferroni correction to identify which groups differ.

pairwise.wilcox.test(covid.deaths$death_days, covid.deaths$age_group, p.adjust.method = "bonferroni")
## 
##  Pairwise comparisons using Wilcoxon rank sum test with continuity correction 
## 
## data:  covid.deaths$death_days and covid.deaths$age_group 
## 
##          (30,40] (40,50] (50,60] (60,70] (70,80] (80,90]
## (40,50]  1.00000 -       -       -       -       -      
## (50,60]  1.00000 1.00000 -       -       -       -      
## (60,70]  1.00000 1.00000 1.00000 -       -       -      
## (70,80]  1.00000 1.00000 1.00000 0.93533 -       -      
## (80,90]  1.00000 0.00014 7.4e-06 1.2e-06 0.00687 -      
## (90,110] 0.03103 2.3e-07 1.0e-07 8.4e-08 2.0e-05 0.06710
## 
## P value adjustment method: bonferroni
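
With seven age groups there are 21 pairwise comparisons, and the Bonferroni adjustment multiplies each raw p-value by that number, capping the result at 1. This is why so many entries in the table are exactly 1. A small sketch with made-up raw p-values:

```r
n_comparisons <- choose(7, 2)  # 21 pairs among 7 groups
raw_p <- c(0.001, 0.01, 0.2)   # hypothetical raw p-values
# Bonferroni: multiply by the number of comparisons, cap at 1
adjusted <- p.adjust(raw_p, method = "bonferroni", n = n_comparisons)
adjusted  # 0.021 0.210 1.000
```
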

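At the 5% significance level (after Bonferroni correction), the (80,90] group has shorter periods from symptom onset to death than each of the groups between 40 and 80, and the (90,110] group has shorter periods than each of the groups between 30 and 80. The difference between (80,90] and (90,110] is not significant (p = 0.067).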

In other words, although one might expect older people (60+) to die sooner after symptom onset than younger adults (30-60), the time from onset to death is significantly shorter only for people older than 80.