Introduction

This is a mini study conducted to examine how the Attack Rate of cholera affects its Death Rate. The idea stemmed from a larger study, conducted in collaboration with the University of Alberta and Youreka Canada, on the Efficacy of the Cholera Vaccine as Tested through HDI, Death Rate, and Attack Rate. This document applies data visualization and statistical analysis in R and was written in R Markdown.

Environment Setup

The ggplot2 package, which is part of the tidyverse, was used for the visualizations in this document. The statistical tests used here (t.test and cor.test) come from the stats package that ships with base R, so they need no additional installation.

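# Uncomment to install on first use; ggplot2 is already included in tidyverse.
#install.packages("tidyverse")
#install.packages("ggplot2")

library(tidyverse)
library(ggplot2)  # already attached by library(tidyverse); kept for explicitness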

We then imported the data. We used cholera data on the Yemeni population from data.world, a well-regarded website that offers many free datasets.
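The import step itself is not shown in this document. The following is a minimal sketch of how it might look, assuming the data were saved locally as a CSV. The file used here is a temporary stand-in with made-up rows, and the definitions of death_rate and attack_rate_fraction are inferred from the summary printed in this document rather than taken from the original code.

```r
# Hypothetical stand-in for the downloaded file: made-up rows that only
# mimic the column names of the real dataset.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "date,cases,deaths,cfr,attack_rate_per_1000",
  "2017-01-01,1000,10,1.00,2.0",
  "2017-02-01,5000,25,0.50,",
  "2017-03-01,20000,60,0.30,30.0"
), csv_path)

yemen_data <- read.csv(csv_path, stringsAsFactors = FALSE)

# Assumed definition: deaths divided by cases, stored as a fraction.
yemen_data$death_rate <- yemen_data$deaths / yemen_data$cases

# Assumed definition: the per-1000 figure rescaled to a fraction.
yemen_data$attack_rate_fraction <- yemen_data$attack_rate_per_1000 / 1000

# Assumption: the summary in this document shows a minimum of 0 for
# attack_rate_fraction while attack_rate_per_1000 has one NA, which
# suggests the missing value was replaced with 0.
yemen_data$attack_rate_fraction[is.na(yemen_data$attack_rate_fraction)] <- 0

names(yemen_data)
```

Replacing a missing value with 0 pulls the mean and minimum down, so dropping the row instead would be a reasonable alternative.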

Here is a quick summary of the data to show what it contains.

##      date               cases             deaths          cfr        
##  Length:147         Min.   :  35471   Min.   : 361   Min.   :0.2100  
##  Class :character   1st Qu.: 353818   1st Qu.:1796   1st Qu.:0.2600  
##  Mode  :character   Median : 602526   Median :2043   Median :0.3400  
##                     Mean   : 599694   Mean   :1899   Mean   :0.3971  
##                     3rd Qu.: 828370   3rd Qu.:2162   3rd Qu.:0.5000  
##                     Max.   :1115378   Max.   :2310   Max.   :1.0000  
##                                                                      
##  attack_rate_per_1000    X             death_rate       attack_rate_fraction
##  Min.   : 1.19        Mode:logical   Min.   :0.002071   Min.   :0.00000     
##  1st Qu.:11.85        NA's:147       1st Qu.:0.002611   1st Qu.:0.01170     
##  Median :21.25                       Median :0.003391   Median :0.02117     
##  Mean   :21.14                       Mean   :0.003981   Mean   :0.02100     
##  3rd Qu.:29.21                       3rd Qu.:0.005076   3rd Qu.:0.02911     
##  Max.   :39.74                       Max.   :0.010177   Max.   :0.03974     
##  NA's   :1

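We can also display a short data frame highlighting key statistics of the two factors being analyzed. Two entries should be read with care: the Attack_Rate median duplicates the Death_Rate median (the summary above gives 0.02117 for attack_rate_fraction), and the Mode row shows "numeric", which is the storage type returned by R's mode() function, not a statistical mode.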

##                           Attack_Rate          Death_Rate
## Minimum                             0      0.002071046766
## Maximum                      0.039744       0.01017732796
## Mean                0.020999612244898 0.00398074302667347
## Median                 0.003390725048      0.003390725048
## Mode                          numeric             numeric
## Standard Deviation 0.0110837997907158 0.00174185653454457

Comparison: Death Rate vs. Attack Rate

The factor tested was the Attack Rate and its impact on the Death Rate. Attack Rate is the proportion of the population that contracts the disease; the dataset reports it per 1,000 people, and it is used here as a fraction (attack_rate_fraction). Death Rate is the proportion of people who died after contracting cholera. For this study, Death Rate appears to have been derived from the integer death and case counts and is likewise stored as a fraction (death_rate), so a value of 0.004 corresponds to 0.4%.

Two-Sample T-Test

A two-sample t-test was conducted to check if there was a significant difference between the means of death rate and attack rate.

t.test(yemen_data$attack_rate_fraction, yemen_data$death_rate, alternative = "two.sided", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  yemen_data$attack_rate_fraction and yemen_data$death_rate
## t = 18.391, df = 292, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01519758 0.01884016
## sample estimates:
##   mean of x   mean of y 
## 0.020999612 0.003980743

As we can see, the p-value was very small. There was therefore a significant difference between the means of the Attack Rate (M = 0.021, SD = 0.011) and the Death Rate (M = 0.004, SD = 0.002). A difference in means does not by itself show that the two variables are related, however, so a t-test alone cannot confirm a relationship between them.
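The call above sets var.equal = TRUE, which pools the two variances, yet the two standard deviations differ by a wide margin. As a hedged illustration of what that choice changes, the sketch below compares the pooled test with Welch's test, which is R's default when var.equal is left out. The vectors x and y are synthetic stand-ins generated with rnorm, not the study data.

```r
set.seed(1)

# Synthetic stand-ins with clearly different spreads (not the study data).
x <- rnorm(147, mean = 0.021, sd = 0.011)
y <- rnorm(147, mean = 0.004, sd = 0.002)

pooled <- t.test(x, y, var.equal = TRUE)  # Student's t-test with pooled variance
welch  <- t.test(x, y)                    # Welch's t-test (var.equal = FALSE)

# Pooled degrees of freedom are n1 + n2 - 2; Welch's are lower when the
# variances differ.
c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter))

# With equal group sizes the t statistic is the same in both tests;
# only the degrees of freedom, and hence the p-value, change.
c(pooled_t = unname(pooled$statistic), welch_t = unname(welch$statistic))
```

Because both columns come from the same rows of the dataset, a paired test (paired = TRUE) would be another option worth considering.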

Pearson’s Correlation Test

After finding a significant difference in the means, the next step was to test whether the two variables were related and, if so, in what way. Pearson's correlation was chosen because it measures the strength and direction of a linear relationship.

cor.test(yemen_data$attack_rate_fraction, yemen_data$death_rate, method="pearson")
## 
##  Pearson's product-moment correlation
## 
## data:  yemen_data$attack_rate_fraction and yemen_data$death_rate
## t = -28.68, df = 145, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.9431377 -0.8935068
## sample estimates:
##        cor 
## -0.9220264

The result shows a very strong negative correlation (r = -0.92). However, we did not want to take this as a final result, as a second, similar test could further support it.
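To make the reported numbers easier to follow, here is a small sketch of how Pearson's r and the accompanying t statistic are computed. The vectors x and y are made-up values, not the study data. The last line plugs the r and df reported above into the same formula.

```r
# Made-up illustrative values, not the study data.
x <- c(0.005, 0.010, 0.020, 0.030, 0.040)
y <- c(0.010, 0.006, 0.004, 0.003, 0.002)

# Pearson's r is the covariance scaled by both standard deviations.
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y, method = "pearson")

# cor.test() turns r into a t statistic with n - 2 degrees of freedom:
# t = r * sqrt(n - 2) / sqrt(1 - r^2)
n <- length(x)
t_manual  <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)
t_builtin <- unname(cor.test(x, y, method = "pearson")$statistic)

# Same formula with the reported r = -0.9220264 and df = 145; this should
# come out close to the reported t = -28.68.
t_reported_check <- -0.9220264 * sqrt(145) / sqrt(1 - 0.9220264^2)
t_reported_check
```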

Spearman’s Correlation Test

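Having found a strong linear correlation, we applied Spearman's Correlation Test to check the relationship without relying on linearity. This test is non-parametric: it works on the ranks of the values, so it measures how consistently one variable decreases as the other increases (a monotonic relationship) and does not assume that the relationship is linear or that the data are normally distributed.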

cor.test(yemen_data$attack_rate_fraction, yemen_data$death_rate, method = "spearman")
## 
##  Spearman's rank correlation rho
## 
## data:  yemen_data$attack_rate_fraction and yemen_data$death_rate
## S = 1055036, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.9929051

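We obtained an even stronger negative coefficient (rho = -0.99). That rho is closer to -1 than Pearson's r suggests the relationship is almost perfectly monotonic but not perfectly linear. This test supported our earlier finding, and we accepted that there was a strong negative correlation between the two factors.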
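The sketch below illustrates why Spearman's rho can be stronger than Pearson's r: rho is Pearson's correlation applied to ranks, so a curve that always falls but flattens out still scores -1. The vectors are made-up values, not the study data. The final line uses the no-ties formula rho = 1 - 6S / (n(n^2 - 1)) with the reported S and n = 147, under the assumption that ties play no meaningful role.

```r
# Made-up values: y always falls as x rises, steeply at first, then flattening.
x <- c(0.005, 0.010, 0.020, 0.030, 0.040)
y <- c(0.0100, 0.0040, 0.0030, 0.0025, 0.0020)

rho_builtin  <- cor(x, y, method = "spearman")
rho_by_ranks <- cor(rank(x), rank(y), method = "pearson")  # same quantity
r_linear     <- cor(x, y, method = "pearson")

# rho is -1 because the ordering is perfectly reversed; r is weaker because
# the points do not lie on a straight line.
c(rho = rho_builtin, r = r_linear)

# Reported S = 1055036 with n = 147, using the no-ties formula; this should
# come out close to the reported rho = -0.9929051.
rho_reported_check <- 1 - 6 * 1055036 / (147 * (147^2 - 1))
rho_reported_check
```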

Linear Regression

Finally, having confirmed that the relationship existed, we presented our findings as a scatter plot with a fitted linear regression line. This form keeps the data easy to read and presentable while remaining functional.

ggplot(data = yemen_data,
       mapping = aes(x = attack_rate_fraction, y = death_rate, color = attack_rate_fraction)) +
  geom_point(position = "jitter") +
  geom_smooth(method = "lm", formula = y ~ x, color = "black") +
  labs(x = "Attack Rate", y = "Death Rate", color = "") +
  ggtitle("Relationship between Death Rate and Attack Rate of Cholera") +
  theme_classic() +
  theme(legend.position = "none",
        plot.title = element_text(size = 15, face = "bold"))
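The plot draws the regression line but does not report its coefficients. As a sketch of how they could be obtained, the same model can be fitted directly with lm(). The data frame demo below is synthetic, generated with an arbitrary negative slope plus noise, and only stands in for yemen_data.

```r
set.seed(42)

# Synthetic stand-in for yemen_data (arbitrary slope and noise level).
demo <- data.frame(attack_rate_fraction = seq(0.001, 0.040, length.out = 50))
demo$death_rate <- 0.007 - 0.12 * demo$attack_rate_fraction +
  rnorm(50, sd = 0.0005)

# Same model that geom_smooth(method = "lm", formula = y ~ x) fits.
fit <- lm(death_rate ~ attack_rate_fraction, data = demo)

coef(fit)               # intercept and slope of the fitted line
summary(fit)$r.squared  # share of the variance in death_rate explained

# For a regression with one predictor, R-squared equals Pearson's r squared.
cor(demo$attack_rate_fraction, demo$death_rate)^2
```

With the real data, replacing demo with yemen_data in the lm() call would give the slope of the line shown in the plot.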

Conclusion

The relationship between Death Rate and Attack Rate is negative: as the Attack Rate increases, the Death Rate decreases.

Some may find this counter-intuitive and expect the opposite, with the Death Rate rising as the Attack Rate rises.

A likely explanation is that the Death Rate of a disease tends to be higher early in an outbreak, when few people have been exposed and the figure can be skewed by factors such as age, improper treatment, and severe cases. As the disease spreads, it reaches a much larger and more varied population, which gives a more representative Death Rate.