Assignment 2

COVID-19 infection age by gender

Ka Sai Chan S3871162, Carsten Litton-Strain S3871589, Marco Lo S3870278

Last updated: 25 October, 2020

Introduction

This report analyses COVID-19 infections by gender to determine whether infection trends more strongly towards males or females. Because COVID-19 is a sudden and pressing issue with so many possible impacts, it can be difficult to track who has been affected most and how best to use the available resources to save lives (Sarkar, Debnath, & Reang, 2020). Gender is a frequent topic of discussion, and it is important to highlight any areas where inequality may be present. In investigating this, we found discussions from early 2020 of higher death rates in males than in females (Huckins, 2020), a difference that is visible in the death rates among Australians.

Deaths in Australia as of 19 April 2020 (Graves, 2020).

This difference in death rate has been confirmed in peer-reviewed articles such as Bwire’s (2020) paper, “Why Men are More Vulnerable to Covid-19 Than Women?”. This investigation will help to show whether the difference may be caused by a greater weakening of men’s immunity with age.

Problem Statement

This presentation examines whether these discussions may have a root in unequal infection rates. It draws on a sample of 1085 global infections collated in early 2020 (COVID-19 [CSEA], 2020) and compares the age of each infected person across gender. Within the sample data, we can clearly see more deaths among men than among women.

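library(dplyr)    # arrange(), filter(), group_by(), summarise()
library(ggplot2)  # ggplot()
library(car)      # leveneTest(), qqPlot()

# COVID19_line_list_data is assumed to have been imported already.
# Keep the gender and age columns (8 and 9) and drop cases missing either.
C9L <- COVID19_line_list_data
C9L <- na.omit(C9L[, c(8, 9)])
C9L$gender <- as.factor(C9L$gender)

# Add the death column (17), sort by it, and keep rows 768 to 825,
# which are intended to be the recorded deaths.
AgenDeath <- na.omit(COVID19_line_list_data[, c(8, 9, 17)])
AgenDeath <- arrange(AgenDeath, death)
AgenDeath <- AgenDeath[768:825, ]

ggplot(AgenDeath, aes(x = age, group = gender, fill = gender)) +
  geom_histogram(position = "dodge", binwidth = 10) + theme_bw() +
  labs(title = "Deaths in sample data grouped by age and gender")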

A two-sided, two-sample t-test was used to determine whether, within this sample, the mean age at infection differs significantly between men and women.

Data

The data we collected included 1085 cases of infection and was downloaded from https://www.kaggle.com/pratik1235/covid19-csea (COVID-19 [CSEA], 2020). It was collated from numerous open-source data references, most commonly nationally reported numbers, with the latest update on 27 March 2020. Because the data was combined from numerous sources, there was some inconsistent formatting and missing values. The data (‘COVID19_line_list_data.csv’) was imported into RStudio for pre-processing and analysis. The imported data frame was reduced to the variables of interest (gender and age) to avoid inconsistencies in other variables. The variable gender was factorised to two levels, male and female. All NA values were assumed to be missing. The variable age was converted from character to numeric. Any cases with missing values in these variables were removed, reducing the considered cases to 825.
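
The pre-processing steps described above can be sketched on a small made-up data frame. This is only an illustration under stated assumptions: the column names `id`, `gender` and `age`, the object names `raw` and `cases`, and all values are hypothetical and are not taken from the Kaggle file.

```r
# Toy stand-in for the imported line list (hypothetical names and values).
raw <- data.frame(
  id     = 1:6,
  gender = c("male", "female", NA, "female", "male", "male"),
  age    = c("61", "0.5", "44", NA, "n/a", "38"),
  stringsAsFactors = FALSE
)

# Keep only the variables of interest.
cases <- raw[, c("gender", "age")]

# Convert age from character to numeric; entries that cannot be parsed become NA.
cases$age <- suppressWarnings(as.numeric(cases$age))

# Factorise gender to two levels.
cases$gender <- factor(cases$gender, levels = c("female", "male"))

# Remove every case with a missing gender or age.
cases <- na.omit(cases)

nrow(cases)  # number of complete cases that remain
```

Converting age before calling `na.omit()` matters here: an entry such as "n/a" only becomes a true missing value once it has been coerced to numeric.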

Descriptive Statistics and Visualisation

There were 825 cases in total, of which 476 were male and 349 were female. Female patients had a larger range of ages than male patients. The median age was higher for females than for males, at 52 compared to 50.5, as indicated on the boxplot. The mean age was slightly lower for females than for males, at 49.63 compared to 49.85. The data was plotted as a boxplot to check for outliers; none were found.

C9L %>%
  group_by(gender) %>%
  summarise(Min = min(age, na.rm = TRUE),
            Q1 = quantile(age, probs = .25, na.rm = TRUE),
            Median = median(age, na.rm = TRUE),
            Q3 = quantile(age, probs = .75, na.rm = TRUE),
            Max = max(age, na.rm = TRUE),
            Mean = round(mean(age, na.rm = TRUE), 2),
            SD = round(sd(age, na.rm = TRUE), 2),
            n = n(),
            Missing = sum(is.na(age)))
boxplot(age ~ gender,data = C9L, main="Age by gender when contracting COVID",ylab="Age", xlab="Gender", col = "skyblue")
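
The outlier check read off the boxplot can also be done numerically. The sketch below uses base R's `boxplot.stats()`, which applies the same 1.5 x IQR whisker rule that `boxplot()` draws by default. The data frame `ages` and its values are made up for illustration and are not the report's data.

```r
# Hypothetical ages for illustration only.
ages <- data.frame(
  gender = rep(c("female", "male"), each = 6),
  age    = c(2, 35, 48, 52, 60, 96, 0.5, 30, 45, 50, 58, 89)
)

# For each gender, count the points that fall beyond the boxplot whiskers.
# boxplot.stats()$out returns the values that would be drawn as outliers.
n_outliers <- sapply(
  split(ages$age, ages$gender),
  function(x) length(boxplot.stats(x)$out)
)

n_outliers
```

A count of zero for a group corresponds to a boxplot with no points drawn beyond the whiskers.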

Hypothesis Testing

Before performing the hypothesis test, we need to check for homogeneity of variance and whether the data is normally distributed.

Homogeneity of Variance

Levene’s test was used to assess the assumption of equal variance. The p-value for age by gender when contracting COVID-19 was p = 0.5467. Since p > .05, we fail to reject the null hypothesis and assume equal variance.

\[H_0: \sigma_1 = \sigma_2 \]

\[H_A: \sigma_1 \ne \sigma_2\]

leveneTest(age ~ gender, data = C9L)
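
As a rough illustration of what this test computes: with its default settings, `leveneTest()` from the car package centres each group on its median, so it is equivalent to a one-way ANOVA on the absolute deviations from the group medians. The base R sketch below reproduces that idea on simulated data; the data frame `toy` and its values are assumptions made purely for illustration.

```r
set.seed(1)

# Simulated ages for illustration only; not the report's data.
toy <- data.frame(
  gender = factor(rep(c("female", "male"), times = c(40, 50))),
  age    = c(rnorm(40, mean = 50, sd = 18), rnorm(50, mean = 50, sd = 18))
)

# Absolute deviation of each age from the median of its own group.
toy$abs_dev <- abs(toy$age - ave(toy$age, toy$gender, FUN = median))

# One-way ANOVA on those deviations; its F test is the Levene-type test.
levene_by_hand <- anova(lm(abs_dev ~ gender, data = toy))

levene_by_hand
```

A large p-value in the `Pr(>F)` column means the spread of ages around the group medians is similar in both groups, which is the condition the equal-variance t-test relies on.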

Hypothesis Testing Cont.

Testing the Assumption of Normality

The sample size for each gender is greater than 30, so by the Central Limit Theorem the sampling distribution of each mean can be treated as approximately normal. To check the shape of the data itself, we also drew normal Q-Q plots. Below are the age distributions for males and females, drawn with the qqPlot() function. The data points in each plot form a rough diagonal line, supporting our assumption that age is approximately normally distributed within each gender for our hypothesis testing.

C9L_m<- C9L %>% filter(gender == "male")

qqPlot(C9L_m$age, dist="norm", main="Q-Q plot, Male", ylab="Age")

## [1] 270 304

C9L_f<- C9L %>% filter(gender == "female")

qqPlot(C9L_f$age, dist="norm", main="Q-Q plot, Female", ylab="Age")

## [1]  36 175

Hypothesis Testing Cont.

Two-sample t-test

A two-sample t-test was used to test for a significant difference between the mean age at COVID-19 infection for males and females.

A two-tailed test with a significance level of 0.05 was used to compare the two population means. We assume that the female and male populations are independent of each other, that both populations have equal variances, and that both are normally distributed.

\[H_0: \mu_m = \mu_f \]

\[H_A: \mu_m \ne \mu_f\]

The two-sample t-test found no statistically significant difference between the mean age of female and male patients who contracted COVID-19 (t = -0.1713, df = 823, p = 0.864 > 0.05), so we fail to reject the null hypothesis.

t.test(age ~ gender, data = C9L,
       var.equal = TRUE, 
       alternative = "two.sided")
## 
##  Two Sample t-test
## 
## data:  age by gender
## t = -0.1713, df = 823, p-value = 0.864
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.707495  2.272862
## sample estimates:
## mean in group female   mean in group male 
##             49.63037             49.84769
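
To make the output above easier to read, the equal-variance t statistic can be computed by hand from the pooled variance and checked against `t.test()`. The sketch below uses simulated ages; the vectors `f` and `m` and their sizes are assumptions for illustration only. In the report's output the degrees of freedom follow the same rule, 349 + 476 - 2 = 823, and the statistic is negative because the first factor level (female) has the slightly lower mean.

```r
set.seed(42)

# Simulated ages for illustration only; not the report's data.
f <- rnorm(30, mean = 50, sd = 18)  # hypothetical female ages
m <- rnorm(40, mean = 50, sd = 18)  # hypothetical male ages

n_f <- length(f)
n_m <- length(m)

# Pooled variance: the two sample variances weighted by their degrees of freedom.
sp2 <- ((n_f - 1) * var(f) + (n_m - 1) * var(m)) / (n_f + n_m - 2)

# t statistic, degrees of freedom, and two-sided p-value.
t_stat <- (mean(f) - mean(m)) / sqrt(sp2 * (1 / n_f + 1 / n_m))
dof    <- n_f + n_m - 2
p_val  <- 2 * pt(-abs(t_stat), dof)

# Cross-check against the built-in equal-variance test.
ref <- t.test(f, m, var.equal = TRUE, alternative = "two.sided")
all.equal(unname(ref$statistic), t_stat)
```

If Levene's test had rejected equal variance, setting `var.equal = FALSE` would switch `t.test()` to the Welch version, which does not pool the variances.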

Discussion

Our data contains 825 COVID-19 cases from early 2020, of which 58% are male and 42% are female. Infected ages range from 0.5 to 96 years, with no outliers present in the dataset. Females have a greater age range (2-96 years old) than males (0.5-89 years old), which could contribute to the median age of females being higher than that of males; nevertheless, the mean ages of the two genders are similar, suggesting that, in its spread, the virus does not discriminate between genders by age.

The strengths of this data include a large sample size, a wide range of countries (based on recorded infections at the time), an approximately normal distribution, and no outliers. The limitations of the data set include its short duration and a disproportionate presence of Chinese infections from the early stages of the outbreak. The first case of COVID-19 was recorded in December 2019, and it was declared a global pandemic on 11 March 2020 (Cucinotta & Vanelli, 2020), so the data covers only the first few months of the outbreak. It would therefore be interesting to observe the impacts with a more recent dataset covering a longer period of time and a wider population, to analyse the differences between gender and age over more cases.

Given this data shows a greater number of male infections than female, it would be interesting to further investigate whether males have a higher chance of picking up the virus regardless of age, and whether the average age of death with COVID-19 is found to be similar between the genders.

The mean age of COVID-19 cases was compared between females and males using a t-test, assuming equal variances and normally distributed data. The result shows no statistically significant difference between the mean age of female and male patients who contracted COVID-19, suggesting that, in this sample, infection is not skewed towards older or younger ages in either gender. Because the age at infection does not differ significantly between genders, we should not assume that more men are dying because they contract COVID-19 at an older age. However, this should be confirmed through further analysis of genetic and immunological differences between gender and age, such as those presented in Bwire’s (2020) article on gender-based vulnerability.

References

Bwire, G. M. (2020). Coronavirus: Why men are more vulnerable to Covid-19 than women? SN Comprehensive Clinical Medicine, 1–3. Advance online publication. https://doi.org/10.1007/s42399-020-00341-w

Coronavirus Disease 2019 (COVID-19). (2020). Retrieved 24 October 2020, from https://www.cdc.gov/coronavirus/2019-ncov/need-extra-precautions/older-adults.html

COVID-19 [CSEA] (2020, March). Kaggle. https://www.kaggle.com/pratik1235/covid19-csea [Accessed 06 October 2020].

Cucinotta, D., & Vanelli, M. (2020, March 19). WHO declares COVID-19 a pandemic. Acta Biomedica, 91(1), 157–160. https://doi.org/10.23750/abm.v91i1.9397. PMID: 32191675; PMCID: PMC7569573.

Graves, J. (2020, April 20). Why do more men die from coronavirus than women? The Conversation. https://theconversation.com/why-do-more-men-die-from-coronavirus-than-women-136038

Huckins, G. (2020). Covid kills more men than women. Experts still can’t explain why. Wired. https://www.wired.com/story/covid-kills-more-men-than-women-experts-still-cant-explain-why/ [Accessed 7 September 2020].

Sarkar, P., Debnath, N., & Reang, D. (2020). Coupled human-environment system amid COVID-19 crisis: A conceptual model to understand the nexus. Science of the Total Environment, 753, 141757. https://doi.org/10.1016/j.scitotenv.2020.141757