As a graduate student 👩🏻🎓 exploring data fundamentals, I hypothesize
that vaccination rates changed significantly over time, as measured by
cumulative vaccinations (totalPerHundred) and daily vaccination
rates (dailyPerMillion). This project allows me to explore
societal influences on vaccination trends.
This analysis investigates vaccination trends in the United States using data on total and daily vaccination rates. By exploring this dataset, I hope to understand patterns in public health initiatives over time and develop my skills in data wrangling, visualization, and hypothesis testing.
The following steps were performed to prepare the data for analysis:
library(dplyr)
library(lubridate)
vaccination_data <- read.csv("us_vaccination_data.csv")
# Wrangle data: remove missing, zero rows, and ensure proper date formatting
data_clean <- vaccination_data %>%
  filter(total > 0 & daily > 0) %>%
  mutate(
    date_formatted = as.Date(date.1, format = "%m/%d/%y"),
    cumulative_percentage = totalPerHundred,
    daily_rate = dailyPerMillion
  ) %>%
  arrange(date_formatted)
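A minimal sketch of how the two wrangling steps above behave, using a small made-up data frame (the name `toy` and all of its values are illustrative assumptions, not rows from us_vaccination_data.csv). It uses base R only: `as.Date()` with `format = "%m/%d/%y"` returns `NA` for strings that do not match the pattern, and a comparison against `NA` yields `NA`, which `subset()` treats as a non-match in the same way `dplyr::filter()` does.

```r
# Toy data with hypothetical values; column names mirror the report
toy <- data.frame(
  date.1 = c("01/15/21", "01/16/21", "2021-01-17", "01/18/21"),
  total  = c(100, NA, 300, 0),
  daily  = c(10, 20, 30, 40),
  stringsAsFactors = FALSE
)

# Parse month/day/two-digit-year strings; non-matching strings become NA
toy$date_formatted <- as.Date(toy$date.1, format = "%m/%d/%y")

# Keep rows where both counts are positive; rows with NA counts drop out too
toy_clean <- subset(toy, total > 0 & daily > 0)

# Count dates that failed to parse so they can be inspected before plotting
n_bad_dates <- sum(is.na(toy$date_formatted))

print(toy_clean)
print(n_bad_dates)
```

In this toy case the third row survives the filter even though its date failed to parse, so a separate check on `date_formatted` may be worth adding before plotting.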
The bar chart below illustrates cumulative vaccination rates over time. The x-axis shows formatted dates at four-week intervals, and the bars are narrowed (width = 0.4) to leave space between them.
library(ggplot2)
ggplot(data_clean, aes(x = date_formatted, y = cumulative_percentage)) +
  geom_col(fill = "skyblue", width = 0.4) + # Reduced bar width for spacing
  scale_x_date(date_breaks = "4 weeks", date_labels = "%b %d") +
  labs(title = "Cumulative Vaccination Rates Over Time",
       x = "Date",
       y = "Total Vaccinations Per Hundred") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5))
Interpretation: The bar chart shows a steady increase in cumulative vaccination rates (total vaccinations per hundred people) over time.
The line chart displays daily vaccination rates over time, with x-axis breaks every 8 weeks (double the 4-week interval of the bar chart), offering a clear view of fluctuations and trends.
ggplot(data_clean, aes(x = date_formatted, y = daily_rate)) +
  # linewidth replaces the size argument for lines as of ggplot2 3.4.0
  geom_line(color = "gold", linewidth = 1) +
  scale_x_date(date_breaks = "8 weeks", date_labels = "%b %d") +
  labs(title = "Daily Vaccination Rates Over Time",
       x = "Date",
       y = "Daily Vaccinations Per Million") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5))
Interpretation: The line chart reveals fluctuations in daily vaccination rates, highlighting peak periods that may correspond to vaccination campaigns or other interventions.
The table below provides summary statistics for cumulative and daily vaccination rates:
summary_stats <- data_clean %>%
  summarize(
    mean_total = mean(totalPerHundred, na.rm = TRUE),
    sd_total = sd(totalPerHundred, na.rm = TRUE),
    mean_daily = mean(dailyPerMillion, na.rm = TRUE),
    sd_daily = sd(dailyPerMillion, na.rm = TRUE)
  )
print(summary_stats)
## mean_total sd_total mean_daily sd_daily
## 1 141.6196 60.45156 2321.591 2295.694
Interpretation: Across the period, total vaccinations averaged 141.6 per hundred individuals (SD 60.5), while daily rates averaged 2,321.6 per million people (SD 2,295.7). The standard deviation of the daily rate is nearly as large as its mean, which indicates very uneven progress across different days.
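To put the two standard deviations on a comparable scale, one option is the coefficient of variation (standard deviation divided by mean). The sketch below is an illustrative addition rather than part of the original analysis; the four numbers are copied from the printed summary_stats output, and the names `cv_total` and `cv_daily` are introduced here for illustration.

```r
# Values copied from the printed summary_stats output in this report
mean_total <- 141.6196
sd_total   <- 60.45156
mean_daily <- 2321.591
sd_daily   <- 2295.694

# Coefficient of variation: standard deviation as a fraction of the mean
cv_total <- sd_total / mean_total
cv_daily <- sd_daily / mean_daily

print(round(c(cv_total = cv_total, cv_daily = cv_daily), 3))
```

The ratios come out to roughly 0.43 for the cumulative rate and 0.99 for the daily rate, so relative to its mean the daily rate varies more than twice as much.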
To investigate whether daily vaccinations are associated with cumulative progress, I fit a linear model of the daily vaccination count (daily) on totalPerHundred:
model <- lm(daily ~ totalPerHundred, data = data_clean)
summary(model)
##
## Call:
## lm(formula = daily ~ totalPerHundred, data = data_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1922168 -231628 -136215 213265 2087275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1927016.2 49977.6 38.56 <2e-16 ***
## totalPerHundred -8164.0 324.6 -25.15 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 581100 on 876 degrees of freedom
## Multiple R-squared: 0.4193, Adjusted R-squared: 0.4187
## F-statistic: 632.6 on 1 and 876 DF, p-value: < 2.2e-16
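Interpretation: The linear model suggests a negative relationship between cumulative vaccinations and daily vaccinations: each additional vaccination per hundred people is associated with about 8,164 fewer doses per day (slope = -8164.0, p < 2e-16). The association is statistically significant, and the model explains about 42% of the variance in daily counts (R-squared = 0.4193). Note that the response here is the raw daily count (daily), not the per-million rate.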
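As a worked illustration of what the fitted line implies, the sketch below plugs a few values of totalPerHundred into the reported equation. The intercept and slope are copied from the summary(model) output; `predict_daily` is a hypothetical helper written for this example, and the input values 50, 100, and 200 are arbitrary choices.

```r
# Coefficients copied from the summary(model) output in this report
intercept <- 1927016.2
slope     <- -8164.0

# Hypothetical helper: fitted daily count at a given totalPerHundred
predict_daily <- function(total_per_hundred) {
  intercept + slope * total_per_hundred
}

# Fitted daily counts at three illustrative levels of cumulative coverage
print(predict_daily(c(50, 100, 200)))

# Level of totalPerHundred at which the fitted line reaches zero
zero_crossing <- -intercept / slope
print(zero_crossing)
```

The fitted line reaches zero near 236 vaccinations per hundred and would predict negative daily counts beyond that point, a reminder that a straight line is only a rough description of this relationship.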
We test whether daily vaccination rates differ significantly before and after a cutoff date of June 1, 2021.
data_clean <- data_clean %>%
  mutate(period = ifelse(date_formatted < as.Date("2021-06-01"), "Early", "Late"))
t_test <- t.test(daily_rate ~ period, data = data_clean)
print(t_test)
##
## Welch Two Sample t-test
##
## data: daily_rate by period
## t = 17.937, df = 185.13, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Early and group Late is not equal to 0
## 95 percent confidence interval:
## 3540.447 4415.498
## sample estimates:
## mean in group Early mean in group Late
## 5529.341 1551.369
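Interpretation: The t-test reveals that daily vaccination rates are significantly lower in the later period than in the early period (t = 17.94, df = 185.13, p < 2.2e-16). The mean daily rate fell from about 5,529 per million before June 1, 2021 to about 1,551 per million afterward, a difference of roughly 3,978 (95% CI: 3,540 to 4,415). In plain terms, the pace of vaccination slowed substantially in the later period.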
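The pieces of the Welch output can be cross-checked against each other with a little arithmetic. The sketch below is an illustrative addition: every number is copied from the printed t.test output, and the variable names are assumptions made for this example. It recovers the difference in means, confirms that it sits at the center of the confidence interval, and backs out the standard error in two ways.

```r
# Values copied from the printed t.test output in this report
mean_early <- 5529.341
mean_late  <- 1551.369
t_stat     <- 17.937
df_welch   <- 185.13
ci         <- c(3540.447, 4415.498)

# Difference in means (Early minus Late) and the midpoint of the 95% CI
diff_means <- mean_early - mean_late
ci_mid     <- mean(ci)

# Standard error implied by the t statistic: difference / t
se_from_t <- diff_means / t_stat

# Standard error implied by the CI: half-width / critical t value
se_from_ci <- (ci[2] - ci[1]) / 2 / qt(0.975, df_welch)

print(c(diff_means = diff_means, ci_mid = ci_mid))
print(c(se_from_t = se_from_t, se_from_ci = se_from_ci))
```

Both routes give a standard error of about 222 per million, which is small relative to the difference of roughly 3,978 and explains the very large t statistic.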
Through this project, I explored vaccination trends using cumulative and daily rates. My findings suggest that cumulative coverage rose steadily over time while the daily rate was significantly lower after June 1, 2021 than before, with fluctuations in daily rates reflecting possible societal and policy influences. This project allowed me to apply data wrangling, visualization, and statistical methods to better understand how public health trends evolve.
As a graduate student, this exercise deepened my curiosity about societal factors influencing health outcomes and reinforced the value of data-driven research.
What I can and want to improve on is bringing in more data through APIs to delve into how elections and the demographics of political groups affect the vaccination rate!