smallpox <- read.csv("smallpox.csv")
library(dplyr)
library(tidyverse)
Research Question: Is there a significant association between inoculation status and mortality?
Introduction
The data set I’ll be using for this project is named smallpox, a sample of 6,224 individuals from the year 1721. It has two variables: inoculated (yes/no) and result (died/lived). Inoculation was the practice of deliberately infecting someone with material from a mild case of smallpox. It was an early form of immunization and a precursor to vaccination.
I chose this topic because vaccination has come a long way throughout history, yet its effectiveness is still debated to this day. A lot of misinformation is spread about vaccines, so looking at actual data may show how the effectiveness of early immunization methods relates to the credibility of vaccination in the modern era.
Data Set
“Smallpox Vaccine Results.” Data Sets, www.openintro.org/data/index.php?data=smallpox. Accessed 13 Dec. 2025.
Data Analysis
For my data analysis, I’ll start by cleaning the data set to make sure it has no missing values (NA). Next, I’ll summarize the data set. Finally, I’ll create variables to compute the mortality rate of the inoculated group and of the not inoculated group.
smallpox_clean <- smallpox |>
  filter(!is.na(result)) |>
  filter(!is.na(inoculated))
summary(smallpox_clean)
## result inoculated
## Length:6224 Length:6224
## Class :character Class :character
## Mode :character Mode :character
# Inoculated group: total size and number of deaths
inoculated <- smallpox_clean |>
  filter(inoculated == "yes")
total_inoc <- nrow(inoculated)
inoculated_died <- inoculated |>
  filter(result == "died")
died_inoc <- nrow(inoculated_died)
# Not inoculated group: total size and number of deaths
not_inoculated <- smallpox_clean |>
  filter(inoculated == "no")
total_no <- nrow(not_inoculated)
not_inoculated_died <- not_inoculated |>
  filter(result == "died")
died_no <- nrow(not_inoculated_died)
mortality_inoculated <- died_inoc / total_inoc
mortality_inoculated
## [1] 0.02459016
mortality_not_inoculated <- died_no / total_no
mortality_not_inoculated
## [1] 0.1411371
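The same two rates can also be computed in one grouped step. The sketch below is only an illustration: `smallpox_demo` is a hypothetical stand-in for `smallpox_clean`, rebuilt from the four cell counts reported in the Statistical Analysis section so that the block runs without the CSV file, and `mortality_by_group` is a name introduced here.

```r
library(dplyr)

# Hypothetical stand-in for smallpox_clean, rebuilt from the four cell counts
# reported in this report (844, 5136, 6, 238). Not the original CSV.
cell_counts <- c(844, 5136, 6, 238)
smallpox_demo <- data.frame(
  inoculated = rep(c("no", "no", "yes", "yes"), times = cell_counts),
  result     = rep(c("died", "lived", "died", "lived"), times = cell_counts)
)

# One row per group: group size, number of deaths, and mortality rate.
mortality_by_group <- smallpox_demo |>
  group_by(inoculated) |>
  summarise(
    total     = n(),
    died      = sum(result == "died"),
    mortality = died / total
  )

print(mortality_by_group)
```

Grouping keeps both rates in a single table and avoids naming an intermediate data frame for each subset.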
Statistical Analysis (Hypothesis Testing)
For this stage of the project, I chose a Chi-Squared Test of Independence because both variables are categorical. The sample is also large, and the test answers my research question, which is about association rather than prediction.
observed_dataset <- table(smallpox_clean$inoculated, smallpox_clean$result)
observed_dataset
##
##       died lived
##   no   844  5136
##   yes    6   238
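Before running the test, it can help to check the expected counts under independence, since a common rule of thumb asks for every expected count to be at least 5. The sketch below retypes the observed counts from the table above. The names `observed` and `expected` are introduced here for illustration and are not part of the original analysis.

```r
# Observed counts copied from the table above
# (rows: inoculated, columns: result).
observed <- matrix(
  c(844, 5136,
    6,   238),
  nrow = 2, byrow = TRUE,
  dimnames = list(inoculated = c("no", "yes"), result = c("died", "lived"))
)

# Expected count for each cell if the two variables were independent:
# (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
print(round(expected, 1))

# Rule-of-thumb check (an assumption of this sketch, not a step in the
# original analysis): are all expected counts at least 5?
print(all(expected >= 5))
```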
Hypothesis
\(H_0\): Inoculation status and mortality rate are not associated with each other.
\(H_a\): Inoculation status and mortality rate are associated with each other.
chisq.test(observed_dataset)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: observed_dataset
## X-squared = 26.026, df = 1, p-value = 3.369e-07
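To see where these numbers come from, the statistic can be rebuilt by hand. This is a minimal sketch using the observed counts from the table above. The names `x_squared`, `deg_free`, and `p_value` are introduced here, and the simple "subtract 0.5" form of Yates' correction is assumed, which applies here because every observed count differs from its expected count by more than 0.5.

```r
# Observed counts from the report's table (rows: no/yes, columns: died/lived).
observed <- matrix(c(844, 5136, 6, 238), nrow = 2, byrow = TRUE)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Yates' continuity correction: subtract 0.5 from each |observed - expected|
# before squaring, then divide by the expected count and sum over the cells.
x_squared <- sum((abs(observed - expected) - 0.5)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail probability of the chi-squared distribution.
p_value <- pchisq(x_squared, df = deg_free, lower.tail = FALSE)

print(c(x_squared = x_squared, deg_free = deg_free, p_value = p_value))
```

The printed statistic and degrees of freedom should agree with the `chisq.test()` output above.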
Results
The test has 1 degree of freedom, since a 2x2 table gives (2 - 1) x (2 - 1) = 1. A test statistic of 26.026 is large for 1 degree of freedom, which produces a very small p-value of 3.369e-07. Because this p-value is far below the conventional 0.05 significance level, we reject the null hypothesis: there is statistically significant evidence that inoculation status is associated with mortality.
ggplot(smallpox_clean, aes(x = inoculated, fill = result)) +
  geom_bar(position = "fill") +
  labs(
    title = "Proportion of Mortality by Inoculation",
    x = "Inoculated",
    y = "Proportion"
  ) +
  scale_fill_manual(
    values = c("died" = "#EE9572", "lived" = "#E0EEE0")
  ) +
  theme_minimal()
Results
The bar graph shows the proportion of people who died versus lived in the not inoculated and inoculated groups. The bar on the left shows that a larger proportion of the not inoculated group died (about 14.1%) than of the inoculated group on the right (about 2.5%). From our Chi-Squared test, we know there is statistically significant evidence of an association between inoculation and mortality because of the low p-value. The gap between the bars may look small in the graph, but with a sample this large, a difference of that size is very unlikely to be due to chance.
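The size of the gap between the two bars can be put into numbers. The sketch below is an illustration only and was not part of the original analysis. It recomputes the two mortality rates from the counts in the contingency table, and `risk_difference` and `relative_risk` are names introduced here.

```r
# Mortality rates recomputed from the counts in the contingency table.
mortality_not_inoculated <- 844 / (844 + 5136)
mortality_inoculated <- 6 / (6 + 238)

# Absolute gap between the two bars, as a proportion.
risk_difference <- mortality_not_inoculated - mortality_inoculated

# How many times higher mortality was in the not inoculated group.
relative_risk <- mortality_not_inoculated / mortality_inoculated

print(c(risk_difference = risk_difference, relative_risk = relative_risk))
```

These two summaries describe how large the difference is, while the Chi-Squared test only addresses whether a difference this large could plausibly arise by chance.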
Conclusion
In conclusion, there is statistically significant evidence of an association between inoculation and mortality. The p-value from the Chi-Squared Test of Independence is small enough to reject our null hypothesis. From the data analysis stage of this project, we also saw that the mortality rate was higher in the not inoculated group. This answers our research question and suggests that inoculation, and the immunization and vaccination methods that followed it, can help save lives.
Some potential avenues for future research would be collecting newer data that covers immunization and vaccination, or using data sets that track inoculation over the years, since our data set is from the 1700s.
Sources
“Smallpox Vaccine Results.” Data Sets, www.openintro.org/data/index.php?data=smallpox. Accessed 13 Dec. 2025.
“A Brief History of Vaccination.” World Health Organization, World Health Organization, www.who.int/news-room/spotlight/history-of-vaccination/a-brief-history-of-vaccination. Accessed 13 Dec. 2025.