This mini study examines how the Attack Rate of Cholera relates to its Death Rate. The idea stemmed from a larger study conducted in collaboration with the University of Alberta and Youreka Canada to find the Efficacy of the Cholera Vaccine as Tested through HDI, Death Rate, and Attack Rate. The document was produced with R Markdown, using R for the data visualization and statistical analysis.
The ggplot2 package, part of the tidyverse, was used for the visualizations in this document. The statistical tests (t.test and cor.test) come from the stats package that ships with base R, so they need no extra installation.
#install.packages("tidyverse")
#install.packages("ggplot2")
library(tidyverse)
library(ggplot2)
We then imported the data: a Cholera dataset on the Yemeni population from data.world, a reputable website that hosts many free datasets.
Here is a quick summary of the data to get a good look at what it contains.
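The import step itself is not shown in this document. A minimal sketch of what it might look like, assuming the dataset was downloaded as a CSV file: the file path and the two rows written below are placeholders (not real observations), and only the object name `yemen_data` comes from the later code.

```r
# Placeholder file: two made-up rows written to a temporary path so the sketch runs.
# With the real download, csv_path would point at the file saved from data.world.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("date,cases,deaths",
             "day-1,1000,10",
             "day-2,1500,12"), csv_path)

# read.csv() returns a data frame; numeric columns are detected automatically.
yemen_data <- read.csv(csv_path)

# summary() prints per-column quartiles, as in the summary shown in this document.
summary(yemen_data)
```
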
## date cases deaths cfr
## Length:147 Min. : 35471 Min. : 361 Min. :0.2100
## Class :character 1st Qu.: 353818 1st Qu.:1796 1st Qu.:0.2600
## Mode :character Median : 602526 Median :2043 Median :0.3400
## Mean : 599694 Mean :1899 Mean :0.3971
## 3rd Qu.: 828370 3rd Qu.:2162 3rd Qu.:0.5000
## Max. :1115378 Max. :2310 Max. :1.0000
##
## attack_rate_per_1000 X death_rate attack_rate_fraction
## Min. : 1.19 Mode:logical Min. :0.002071 Min. :0.00000
## 1st Qu.:11.85 NA's:147 1st Qu.:0.002611 1st Qu.:0.01170
## Median :21.25 Median :0.003391 Median :0.02117
## Mean :21.14 Mean :0.003981 Mean :0.02100
## 3rd Qu.:29.21 3rd Qu.:0.005076 3rd Qu.:0.02911
## Max. :39.74 Max. :0.010177 Max. :0.03974
## NA's :1
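We can also display a short data frame highlighting key statistics of the two factors being analyzed. Note that the Mode row reports the output of R's mode(), which returns the storage type (numeric) rather than the most frequent value.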
## Attack_Rate Death_Rate
## Minimum 0 0.002071046766
## Maximum 0.039744 0.01017732796
## Mean 0.020999612244898 0.00398074302667347
## Median 0.02117 0.003390725048
## Mode numeric numeric
## Standard Deviation 0.0110837997907158 0.00174185653454457
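A table like this one could be assembled as sketched below. The two vectors are made-up stand-ins for the real columns, and `stat_mode` and `describe` are hypothetical helper names. Base R has no built-in function for the statistical mode (`mode()` returns the storage type), so the sketch counts values with `table()` instead.

```r
# Made-up stand-in values; the real columns live in yemen_data.
attack_rate <- c(0.010, 0.020, 0.020, 0.030, 0.040)
death_rate  <- c(0.009, 0.005, 0.005, 0.003, 0.002)

# Most frequent value (first one wins if several values tie).
stat_mode <- function(x) {
  counts <- table(x)
  as.numeric(names(counts)[which.max(counts)])
}

# One named vector of descriptive statistics per column.
describe <- function(x) {
  c(Minimum = min(x), Maximum = max(x), Mean = mean(x),
    Median = median(x), Mode = stat_mode(x),
    `Standard Deviation` = sd(x))
}

# The names of the vectors become the row names of the data frame.
key_stats <- data.frame(
  Attack_Rate = describe(attack_rate),
  Death_Rate  = describe(death_rate)
)
print(key_stats)
```

Computing each statistic per column through one helper keeps the rows of the two columns from being mixed up.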
The factor tested was the Attack Rate and its impact on the Death Rate. Attack Rate is the share of the population that contracted Cholera; Death Rate is the share of people who died after contracting it. Both are expressed here as fractions rather than percentages (an Attack Rate of 0.02 means 2% of the population). For the purpose of this study, Death Rate was converted into this fractional form from the integer death count.
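The derivation of the two rate columns is not shown. The sketch below assumes that `death_rate` is deaths divided by cases and that `attack_rate_fraction` is `attack_rate_per_1000` divided by 1000. The three input values are the minimums from the summary, paired here purely for illustration; they are not confirmed to belong to the same row.

```r
# Illustrative inputs (minimums from the summary; pairing them is an assumption).
cases  <- 35471
deaths <- 361
attack_rate_per_1000 <- 1.19

# Assumed derivations of the two fractional rates.
death_rate <- deaths / cases
attack_rate_fraction <- attack_rate_per_1000 / 1000

print(c(death_rate = death_rate, attack_rate_fraction = attack_rate_fraction))
```

Under these assumptions the death rate comes out near 0.0102, in line with the maximum Death Rate listed in the table.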
A two-sample t-test was conducted to check if there was a significant difference between the means of death rate and attack rate.
t.test(yemen_data$attack_rate_fraction, yemen_data$death_rate, alternative = "two.sided", var.equal = TRUE)
##
## Two Sample t-test
##
## data: yemen_data$attack_rate_fraction and yemen_data$death_rate
## t = 18.391, df = 292, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01519758 0.01884016
## sample estimates:
## mean of x mean of y
## 0.020999612 0.003980743
As we can see, the p-value was very small (t = 18.391, df = 292, p < 2.2e-16). Therefore, there was a significant difference between the means of the Attack Rate (M = 0.021, SD = 0.011) and the Death Rate (M = 0.004, SD = 0.002). A t-test only compares means, however, so it cannot by itself confirm whether the two rates are related.
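The t statistic can be reproduced by hand from the means and standard deviations reported in this document. The sketch assumes n = 147 observations in each column (consistent with df = 292) and uses the pooled-variance formula that `var.equal = TRUE` implies; the variable names are illustrative.

```r
n  <- 147
m1 <- 0.020999612; s1 <- 0.0110837997907158   # Attack Rate mean and SD
m2 <- 0.003980743; s2 <- 0.00174185653454457  # Death Rate mean and SD

# Pooled variance: a weighted average of the two sample variances.
sp2 <- ((n - 1) * s1^2 + (n - 1) * s2^2) / (2 * n - 2)

# Standard error of the difference in means, then the t statistic.
se     <- sqrt(sp2 * (1 / n + 1 / n))
t_stat <- (m1 - m2) / se
df     <- 2 * n - 2

print(c(t = t_stat, df = df))
```

The result should land close to the t and df values printed by `t.test()`.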
After finding a significant difference in the means, the next step was to test whether the two rates were related and, if so, in which direction. Pearson's correlation was chosen because it measures the strength and direction of a linear relationship.
cor.test(yemen_data$attack_rate_fraction, yemen_data$death_rate, method="pearson")
##
## Pearson's product-moment correlation
##
## data: yemen_data$attack_rate_fraction and yemen_data$death_rate
## t = -28.68, df = 145, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.9431377 -0.8935068
## sample estimates:
## cor
## -0.9220264
From what we can see, there is a very strong negative correlation (r = -0.92). However, we did not want to take this as a final result, as we could check it further with another, similar test.
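The t value in the Pearson output follows directly from the correlation coefficient and the degrees of freedom. A short sketch of that arithmetic, using the reported r and df; the variable names are illustrative.

```r
r  <- -0.9220264  # reported correlation coefficient
df <- 145         # reported degrees of freedom (n - 2)

# Test statistic for H0: true correlation equals 0.
t_stat <- r * sqrt(df / (1 - r^2))

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(-abs(t_stat), df)

print(c(t = t_stat, p = p_value))
```

The computed t should agree with the printed value of about -28.68, and the p-value falls far below the 2.2e-16 reporting threshold.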
Once we knew that a relationship existed, we applied Spearman's rank correlation test as a second check. This test is non-parametric: it works on the ranks of the values, so it assumes neither normality nor a linear relationship, only a monotonic one.
cor.test(yemen_data$attack_rate_fraction, yemen_data$death_rate, method = "spearman")
##
## Spearman's rank correlation rho
##
## data: yemen_data$attack_rate_fraction and yemen_data$death_rate
## S = 1055036, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.9929051
This time we obtained an even stronger negative coefficient (rho = -0.99). This test supported our earlier finding that there was a strong negative correlation between the two factors.
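Spearman's rho is Pearson's correlation computed on ranks, and the S statistic in the output is tied to rho by a simple formula. The sketch below illustrates both points; the toy vectors `x` and `y` are made up, while `n` and `rho` are the values reported above (n = 147 is inferred from the degrees of freedom of the earlier tests).

```r
# Made-up toy vectors: y decreases as x increases.
x <- c(0.001, 0.012, 0.021, 0.029, 0.040)
y <- c(0.0100, 0.0051, 0.0034, 0.0026, 0.0021)

# Spearman's rho equals Pearson's r applied to the ranks.
rho_builtin <- cor(x, y, method = "spearman")
rho_ranks   <- cor(rank(x), rank(y))
print(c(builtin = rho_builtin, from_ranks = rho_ranks))

# Relation between S and rho: S = (n^3 - n) / 6 * (1 - rho).
n   <- 147
rho <- -0.9929051
S   <- (n^3 - n) / 6 * (1 - rho)
print(S)
```

Because the toy data are strictly decreasing, both rho values are equal, and the S computed from the reported rho should sit very close to the printed S = 1055036.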
Finally, after we confirmed the relationship that existed, it was time to present our findings, in this case as a scatter plot with a fitted linear regression line. The plot is easy to read and presentable while remaining functional.
ggplot(data = yemen_data,
       mapping = aes(x = attack_rate_fraction, y = death_rate, color = attack_rate_fraction)) +
  geom_point(position = "jitter") +
  geom_smooth(method = lm, formula = y ~ x, color = "black") +
  labs(x = "Attack Rate", y = "Death Rate", color = "") +
  ggtitle("Relationship between Death Rate and Attack Rate of Cholera") +
  theme_classic() +
  theme(legend.position = "none") +
  theme(plot.title = element_text(size = 15, face = "bold"))
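The line drawn by `geom_smooth(method = lm)` is an ordinary least-squares fit, which `lm()` can also report numerically. A minimal sketch on synthetic data: the vectors and the coefficients used to generate them are invented for illustration. With the real data, the model formula would presumably be `death_rate ~ attack_rate_fraction` with `data = yemen_data`.

```r
set.seed(1)

# Synthetic stand-ins: death rate falls roughly linearly as attack rate rises.
attack <- seq(0.001, 0.04, length.out = 50)
death  <- 0.007 - 0.15 * attack + rnorm(50, sd = 0.0003)

# Ordinary least-squares fit of death rate on attack rate.
fit <- lm(death ~ attack)

# Intercept and slope of the fitted line, plus the share of variance explained.
print(coef(fit))
print(summary(fit)$r.squared)
```

A negative slope here corresponds to the downward-sloping line in the plot.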
The relationship between Death Rate and Attack Rate is displayed as follows:
This result may seem counter-intuitive to some, who would expect the relationship to look more like the following:
One plausible explanation is that the Death Rate of a disease tends to be higher early in an outbreak, when exposure is limited and the figure is strongly influenced by factors like age, improper treatment, and severe cases. As the disease spreads, it affects a much larger and more varied population, so the Death Rate settles toward a more representative value.