This document evaluates the relationship between the age of deceased COVID-19 patients and the time from symptom onset to death. The hypothesis is that older people (aged 60 and over, the cutoff used by the UN) tend to have a shorter period from disease onset to death. This has direct implications for the average time spent in intensive care, and therefore for the burden on medical facilities.
The analysis uses data from 3,007 reported COVID-19 deaths in Colombia as of June 29th, 2020. This is the number of records that remain out of 3,074 once deaths at age 30 or younger are excluded.
library(ggplot2)
library(dplyr)
library(lubridate)
library(car)
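# covid.deaths is assumed to be already loaded in the session.
# Derive the days from symptom onset to death and ten-year age groups
# (the last group spans ages 90 to 110).
covid.deaths <- covid.deaths %>%
  mutate(
    death_days = as.numeric(
      difftime(death_date, symptoms_date, units = "days")),
    age_group = cut(
      age,
      breaks = c(seq(0, 90, 10), 110),
      include.lowest = TRUE))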
summary(covid.deaths)
 city_code     city                      department
 8001   :635   Barranquilla       :635   Barranquilla D.E.  :635
 11001  :610   Bogotá D.C.        :610   Bogotá D.C.        :610
 13001  :300   Cartagena de Indias:300   Atlántico          :426
 8758   :259   Soledad            :259   Valle del Cauca    :308
 76001  :252   Cali               :252   Cartagena D.T. y C.:300
 91001  : 86   Leticia            : 86   Nariño             :104
 (Other):932   (Other)            :932   (Other)            :691

 attention           gender   infection_type     status
 Casa        :   0   F:1191   En estudio :2823   Asintomático:   0
 Fallecido   :3074   M:1883   Importado  :  20   Fallecido   :3074
 Hospital    :   0            Relacionado: 231   Grave       :   0
 Hospital UCI:   0                               Leve        :   0
 Recuperado  :   0                               Moderado    :   0

 country               recuperation_type   department_code   country_code
 ESTADOS UNIDOS:   5   PCR   :   0         8      :1061      840    :   5
 BRASIL        :   4   Tiempo:   0         11     : 610      76     :   4
 ESPAÑA        :   4   NA's  :3074         76     : 393      724    :   4
 VENEZUELA     :   2                       13     : 340      862    :   2
 ALEMANIA      :   1                       47     : 109      218    :   1
 (Other)       :   6                       52     : 104      (Other):   4
 NA's          :3052                       (Other): 457      NA's   :3054

 race               tribe              notification_date
 Otro       :2667   Por definir:  24   Min.   :2020-03-13
 Negro      : 204   Tikuna     :  19   1st Qu.:2020-05-12
 Por definir: 101   Cocama     :   3   Median :2020-05-31
 Indígena   :  30   Paez       :   2   Mean   :2020-05-24
 Rom        :   5   Pasto      :   2   3rd Qu.:2020-06-11
 (Other)    :   0   (Other)    :   6   Max.   :2020-06-27
 NA's       :  67   NA's       :3018

 symptoms_date        death_date           diagnosis_date
 Min.   :2020-03-01   Min.   :2020-03-16   Min.   :2020-03-16
 1st Qu.:2020-05-07   1st Qu.:2020-05-21   1st Qu.:2020-05-18
 Median :2020-05-25   Median :2020-06-07   Median :2020-06-06
 Mean   :2020-05-18   Mean   :2020-05-31   Mean   :2020-05-30
 3rd Qu.:2020-06-06   3rd Qu.:2020-06-17   3rd Qu.:2020-06-18
 Max.   :2020-06-23   Max.   :2020-06-28   Max.   :2020-06-28
                                           NA's   :55

 recuperation_date   report_date          id              age
 Min.   :NA          Min.   :2020-03-16   Min.   :  152   Min.   :  0.00
 1st Qu.:NA          1st Qu.:2020-05-19   1st Qu.:16390   1st Qu.: 58.00
 Median :NA          Median :2020-06-07   Median :38109   Median : 69.00
 Mean   :NA          Mean   :2020-05-30   Mean   :38349   Mean   : 67.42
 3rd Qu.:NA          3rd Qu.:2020-06-18   3rd Qu.:57579   3rd Qu.: 79.00
 Max.   :NA          Max.   :2020-06-28   Max.   :91418   Max.   :103.00
 NA's   :3074

 dead           death_days      age_group
 Mode:logical   Min.   : 0.00   (70,80] :748
 TRUE:3074      1st Qu.: 6.00   (60,70] :746
                Median :11.00   (80,90] :546
                Mean   :12.77   (50,60] :452
                3rd Qu.:17.00   (40,50] :270
                Max.   :98.00   (90,110]:128
                                (Other) :184
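As a quick sanity check of the two derived columns, the sketch below applies the same `difftime` and `cut` calls to a small made-up data frame. The data frame `toy` and all of its values are hypothetical and are not taken from the Colombian data set.

```r
# Hypothetical toy records; the dates and ages are invented for illustration.
toy <- data.frame(
  age           = c(0, 30, 31, 85, 103),
  symptoms_date = as.Date(c("2020-05-01", "2020-05-03", "2020-05-10",
                            "2020-06-01", "2020-06-10")),
  death_date    = as.Date(c("2020-05-12", "2020-05-03", "2020-05-25",
                            "2020-06-08", "2020-06-12"))
)

# Days from symptom onset to death, as a plain number.
toy$death_days <- as.numeric(
  difftime(toy$death_date, toy$symptoms_date, units = "days"))

# Ten-year bins up to 90, plus one wide bin for ages 90 to 110.
# include.lowest = TRUE closes the first bin on the left, so age 0 is kept.
toy$age_group <- cut(toy$age,
                     breaks = c(seq(0, 90, 10), 110),
                     include.lowest = TRUE)

print(toy[, c("age", "death_days", "age_group")])
```

Because the bins are closed on the right, an age of exactly 30 falls in (20,30] and an age of 31 in (30,40], which matches a later filter on `age > 30`.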
covid.deaths %>%
ggplot(aes(x=age)) +
geom_histogram(bins=35, color="black", fill="lightgray") +
geom_vline(aes(xintercept=median(age)), color="red", size=1) +
ggtitle("Distribution of deaths by age", "Median age: 69 years") +
xlab("Age") +
ylab("Deaths")
covid.deaths %>%
ggplot(aes(x=death_days)) +
geom_histogram(bins=50, color="black", fill="lightgray") +
geom_vline(aes(xintercept=median(death_days)), color="red", size=1) +
ggtitle("Distribution of the number of days from onset to death", "Median days: 11") +
xlab("Days") +
ylab("Deaths")
covid.deaths %>%
group_by(age_group) %>%
tally()
## # A tibble: 10 x 2
##    age_group     n
##    <fct>     <int>
##  1 [0,10]       10
##  2 (10,20]       7
##  3 (20,30]      50
##  4 (30,40]     117
##  5 (40,50]     270
##  6 (50,60]     452
##  7 (60,70]     746
##  8 (70,80]     748
##  9 (80,90]     546
## 10 (90,110]    128
Because the youngest age groups (0 to 30) contain few cases (67 deaths in total), they are excluded from the rest of the analysis.
covid.deaths <- covid.deaths %>%
filter(age>30)
covid.deaths %>%
ggplot(aes(x=death_days, y=age_group, fill=age_group)) +
geom_violin(scale="count", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of the number of days from onset to death in each age group") +
xlab("Days") +
ylab("Age Group") +
labs(fill='Age Group')
covid.deaths %>%
ggplot(aes(x=death_days, fill=age_group)) +
geom_histogram(bins=50) +
facet_wrap(~age_group, scales="free_y") +
ggtitle("Distribution of the number of days from onset to death in each age group") +
xlab("Days") +
ylab("Deaths") +
labs(fill='Age Group')
We can see that all the distributions are positively skewed.
covid.deaths %>%
group_by(age_group) %>%
summarise(mean=mean(death_days),
median=median(death_days),
sd=sd(death_days),
n=n(),
se=sd(death_days)/sqrt(n()))
# A tibble: 7 x 6
  age_group  mean median    sd     n    se
  <fct>     <dbl>  <dbl> <dbl> <int> <dbl>
1 (30,40]   12.7      11 10.5    117 0.973
2 (40,50]   13.3      11  8.41   270 0.512
3 (50,60]   13.6      12  9.11   452 0.429
4 (60,70]   13.8      12 10.2    746 0.372
5 (70,80]   12.8      11  9.78   748 0.358
6 (80,90]   10.9       9  8.15   546 0.349
7 (90,110]   8.91      7  7.17   128 0.634
Levene’s test indicates that the group variances are heterogeneous, but the differences are modest: the largest variance is about 2.15 times the smallest.
leveneTest(death_days ~ age_group, covid.deaths)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 6 4.1048 0.0004145 ***
## 3000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
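The variance ratio quoted above can be roughly reproduced from the standard deviations in the summary table. The sketch below is only an approximate check: it uses the rounded values as printed, so it gives about 2.14 rather than exactly 2.15.

```r
# Standard deviations per age group, copied from the summary table above
# (rounded as printed, so the result is approximate).
sds <- c("(30,40]" = 10.5, "(40,50]" = 8.41, "(50,60]" = 9.11,
         "(60,70]" = 10.2, "(70,80]" = 9.78, "(80,90]" = 8.15,
         "(90,110]" = 7.17)

# Ratio of the largest to the smallest group variance.
variance_ratio <- max(sds)^2 / min(sds)^2

# Which groups have the largest and smallest spread.
print(names(sds)[c(which.max(sds), which.min(sds))])
print(round(variance_ratio, 2))
```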
Because the normality assumption of the classical F test is not met, we use the non-parametric Kruskal-Wallis test to check whether at least one age group has a different distribution of onset-to-death times. The test compares rank distributions rather than means.
kruskal.test(death_days ~ age_group, covid.deaths)
##
## Kruskal-Wallis rank sum test
##
## data: death_days by age_group
## Kruskal-Wallis chi-squared = 69.706, df = 6, p-value = 4.698e-13
Since \(p<0.05\) we reject the null hypothesis that all age groups share the same distribution of onset-to-death times.
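To make the rank-based nature of the test concrete, here is a minimal sketch that computes the Kruskal-Wallis statistic by hand on invented data and compares it with `kruskal.test`. The vectors `days` and `group` are hypothetical, and the hand formula omits the tie correction, so it only matches the built-in result when no values are tied (unlike the real data, where ties are common).

```r
# Hypothetical toy data: three groups with no tied values.
days  <- c(12, 15, 9, 20,   7, 11, 14, 18,   3, 5, 8, 6)
group <- factor(rep(c("A", "B", "C"), each = 4))

# Rank all observations together, ignoring group membership.
r <- rank(days)
N <- length(days)

# H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1),
# where R_i is the rank sum and n_i the size of group i.
rank_sums <- tapply(r, group, sum)
sizes     <- tapply(r, group, length)
H <- 12 / (N * (N + 1)) * sum(rank_sums^2 / sizes) - 3 * (N + 1)

# Under the null hypothesis H is approximately chi-squared with k - 1 df.
p_value <- pchisq(H, df = nlevels(group) - 1, lower.tail = FALSE)

# The built-in test should agree because the toy data have no ties.
built_in <- kruskal.test(days ~ group)
print(c(H = H, p = p_value))
print(built_in)
```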
Next, we run pairwise Wilcoxon rank-sum tests with Bonferroni correction to identify which age groups differ from each other.
pairwise.wilcox.test(covid.deaths$death_days, covid.deaths$age_group, p.adjust.method = "bonferroni")
##
## Pairwise comparisons using Wilcoxon rank sum test with continuity correction
##
## data: covid.deaths$death_days and covid.deaths$age_group
##
## (30,40] (40,50] (50,60] (60,70] (70,80] (80,90]
## (40,50] 1.00000 - - - - -
## (50,60] 1.00000 1.00000 - - - -
## (60,70] 1.00000 1.00000 1.00000 - - -
## (70,80] 1.00000 1.00000 1.00000 0.93533 - -
## (80,90] 1.00000 0.00014 7.4e-06 1.2e-06 0.00687 -
## (90,110] 0.03103 2.3e-07 1.0e-07 8.4e-08 2.0e-05 0.06710
##
## P value adjustment method: bonferroni
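The effect of the Bonferroni correction on this table can be illustrated with a short sketch. It assumes that the adjustment is applied across all 21 pairwise comparisons among the seven age groups, which is how base R's `pairwise.wilcox.test` behaves to the best of my knowledge. The raw p-values below are back-calculated from the adjusted ones and are not part of the original output.

```r
# Seven age groups give choose(7, 2) = 21 pairwise comparisons.
n_tests <- choose(7, 2)

# Adjusted p-values below 1, copied from the table above.
adjusted <- c("(90,110] vs (30,40]" = 0.03103,
              "(90,110] vs (80,90]" = 0.06710,
              "(80,90] vs (70,80]"  = 0.00687)

# Bonferroni multiplies each raw p-value by the number of tests (capped at 1),
# so dividing recovers the unadjusted values for entries below 1.
raw <- adjusted / n_tests

# Round trip with the base R implementation.
readjusted <- p.adjust(raw, method = "bonferroni", n = n_tests)
print(signif(raw, 3))
print(all.equal(readjusted, adjusted))
```

Entries shown as 1.00000 in the table were capped at 1, so their raw p-values cannot be recovered this way.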
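At the 5% significance level (after Bonferroni adjustment), the (80, 90] group has a significantly shorter period from symptom onset to death than each of the groups between 40 and 80, and the (90, 110] group has a significantly shorter period than each of the groups between 30 and 80. The difference between the (80, 90] and (90, 110] groups is not significant (adjusted p = 0.067).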
In other words, although one might expect older people (60+) to die sooner after symptom onset than younger adults (30 to 60), the time from symptom onset to death is significantly shorter only for people older than 80.