0.1 Hypothesis

As a graduate student 👩🏻‍🎓 exploring data fundamentals, I hypothesize that U.S. vaccination rates changed significantly over time: cumulative vaccinations (totalPerHundred) rose while the daily pace (dailyPerMillion) differed from one period to the next. This project allows me to explore societal influences on vaccination trends.


0.2 Introduction

This analysis investigates vaccination trends in the United States using data on total and daily vaccination rates. By exploring this dataset, I hope to understand patterns in public health initiatives over time and develop my skills in data wrangling, visualization, and hypothesis testing.


0.3 Data Wrangling

The following steps were performed to prepare the data for analysis:

  • Filtering: Removed rows with zero or missing vaccination data.
  • Mutating: Formatted the date column correctly for chronological order.
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
vaccination_data <- read.csv("us_vaccination_data.csv")

# Wrangle data: remove missing, zero rows, and ensure proper date formatting
data_clean <- vaccination_data %>%
  filter(total > 0 & daily > 0) %>%
  mutate(
    date_formatted = as.Date(date.1, format = "%m/%d/%y"),
    cumulative_percentage = totalPerHundred,
    daily_rate = dailyPerMillion
  ) %>%
  arrange(date_formatted)
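
One thing worth checking after the date conversion above is whether every value actually parsed. The sketch below uses made-up strings as a stand-in for the date.1 column (the real values are not shown in this report, so the third entry is an assumption added to show a failure). It also shows why converting to Date matters for ordering.

```r
# Toy stand-in for the date.1 column; these strings are hypothetical.
raw_dates <- c("01/15/21", "12/31/20", "2021-02-01")

parsed <- as.Date(raw_dates, format = "%m/%d/%y")

# Entries that do not match the format come back as NA instead of raising an error.
n_failed <- sum(is.na(parsed))
print(parsed)
print(n_failed)

# Sorting Date objects is chronological (NA values are dropped by sort()).
print(sort(parsed))

# Sorting the raw strings is alphabetical, which puts January 2021 before December 2020.
print(sort(raw_dates[1:2]))
```

If n_failed is greater than zero on the real data, the format string probably does not match every row, and arrange() would push those rows to the end.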

0.4 Visualizations

0.4.1 Bar Chart

The bar chart below illustrates cumulative vaccination rates over time. The x-axis shows formatted date labels at four-week intervals, and the bar width is reduced to 0.4 to leave space between bars.

library(ggplot2)
ggplot(data_clean, aes(x = date_formatted, y = cumulative_percentage)) +
  geom_col(fill = "skyblue", width = 0.4) + # Reduced bar width for spacing
  scale_x_date(date_breaks = "4 weeks", date_labels = "%b %d") +
  labs(title = "Cumulative Vaccination Rates Over Time",
       x = "Date",
       y = "Total Vaccinations Per Hundred") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5))

Interpretation: The bar chart shows a steady increase in cumulative vaccination rates over time.

0.4.2 Line Chart

The line chart displays daily vaccination rates over time, offering a clear view of fluctuations and trends. Its x-axis labels fall at 8-week intervals, double the interval used in the bar chart.

ggplot(data_clean, aes(x = date_formatted, y = daily_rate)) +
  geom_line(color = "gold", linewidth = 1) + # linewidth replaces size for lines in ggplot2 >= 3.4.0
  scale_x_date(date_breaks = "8 weeks", date_labels = "%b %d") +
  labs(title = "Daily Vaccination Rates Over Time",
       x = "Date",
       y = "Daily Vaccinations Per Million") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 10),
        plot.title = element_text(hjust = 0.5))

Interpretation: The line chart reveals fluctuations in daily vaccination rates, highlighting peak periods that may correspond to vaccination campaigns or other interventions.
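
Day-to-day noise can make peaks harder to see. One option, not part of the analysis above, is to smooth the series with a centered 7-day moving average before plotting. The sketch below uses hypothetical values in place of data_clean$daily_rate.

```r
# Hypothetical daily values standing in for data_clean$daily_rate.
daily_rate <- c(10, 12, 8, 15, 20, 18, 9, 11, 13, 7)

# Centered 7-day moving average. stats::filter is called with its namespace
# because dplyr masks filter() once it is attached.
weights <- rep(1 / 7, 7)
smoothed <- as.numeric(stats::filter(daily_rate, weights, sides = 2))

# The first and last three positions are NA: no full 7-day window exists there.
print(round(smoothed, 2))
```

The smoothed vector could be added to the data frame with mutate() and drawn as a second geom_line() layer, assuming the rows are already sorted by date.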


0.5 Descriptive Statistics

The table below provides summary statistics for cumulative and daily vaccination rates:

summary_stats <- data_clean %>%
  summarize(
    mean_total = mean(totalPerHundred, na.rm = TRUE),
    sd_total = sd(totalPerHundred, na.rm = TRUE),
    mean_daily = mean(dailyPerMillion, na.rm = TRUE),
    sd_daily = sd(dailyPerMillion, na.rm = TRUE)
  )
print(summary_stats)
##   mean_total sd_total mean_daily sd_daily
## 1   141.6196 60.45156   2321.591 2295.694

Interpretation: On average, total vaccination rates reached 141.62 per hundred individuals (SD 60.45), while daily rates averaged 2,321.59 per million people (SD 2,295.69). The standard deviation of the daily rate is nearly as large as its mean, which indicates highly uneven progress across different days.
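
Because the two measures are on different scales, their standard deviations are easier to compare after dividing each by its mean (the coefficient of variation). This is a small worked check using the values printed in the summary output, not an additional analysis of the raw data.

```r
# Values copied from the printed summary_stats output.
mean_total <- 141.6196
sd_total   <- 60.45156
mean_daily <- 2321.591
sd_daily   <- 2295.694

# Coefficient of variation: standard deviation relative to the mean.
cv_total <- sd_total / mean_total
cv_daily <- sd_daily / mean_daily

print(round(c(cv_total = cv_total, cv_daily = cv_daily), 2))
```

The cumulative measure comes out at roughly 0.43 and the daily measure at roughly 0.99, so the daily rate is far more variable relative to its typical level.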


0.6 Linear Model

To investigate whether daily vaccinations are associated with cumulative progress, I regress the raw daily count (daily) on totalPerHundred:

model <- lm(daily ~ totalPerHundred, data = data_clean)
summary(model)
## 
## Call:
## lm(formula = daily ~ totalPerHundred, data = data_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1922168  -231628  -136215   213265  2087275 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1927016.2    49977.6   38.56   <2e-16 ***
## totalPerHundred   -8164.0      324.6  -25.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 581100 on 876 degrees of freedom
## Multiple R-squared:  0.4193, Adjusted R-squared:  0.4187 
## F-statistic: 632.6 on 1 and 876 DF,  p-value: < 2.2e-16

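Interpretation: The linear model suggests a negative relationship between cumulative vaccinations and daily vaccinations: each additional vaccination per hundred is associated with about 8,164 fewer vaccinations per day. The slope is statistically significant (p < 2e-16), and the model explains about 42% of the variance in daily counts (R-squared = 0.4193). The response here is the raw daily count rather than dailyPerMillion, and the result describes an association, not a cause.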
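
To make the coefficients concrete, the fitted line can be evaluated by hand. The sketch below copies the intercept and slope from the summary output; predict_daily is a hypothetical helper introduced only for illustration, and the input values 50, 100, and 150 are arbitrary.

```r
# Coefficients copied from the summary(model) output.
intercept <- 1927016.2
slope     <- -8164.0

# Hypothetical helper: fitted daily count at a given totalPerHundred.
predict_daily <- function(total_per_hundred) {
  intercept + slope * total_per_hundred
}

fitted_values <- predict_daily(c(50, 100, 150))
print(fitted_values)

# totalPerHundred at which the fitted line would reach zero daily vaccinations.
zero_crossing <- -intercept / slope
print(round(zero_crossing, 1))
```

With the model object available, predict(model, newdata = data.frame(totalPerHundred = c(50, 100, 150))) should give the same fitted values up to rounding of the printed coefficients. The zero crossing is an extrapolation of a straight line and is not a forecast.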


0.7 Hypothesis Testing (T-test)

We test whether daily vaccination rates differ significantly before and after a cutoff date of June 1, 2021.

data_clean <- data_clean %>%
  mutate(period = ifelse(date_formatted < as.Date("2021-06-01"), "Early", "Late"))

t_test <- t.test(daily_rate ~ period, data = data_clean)
print(t_test)
## 
##  Welch Two Sample t-test
## 
## data:  daily_rate by period
## t = 17.937, df = 185.13, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Early and group Late is not equal to 0
## 95 percent confidence interval:
##  3540.447 4415.498
## sample estimates:
## mean in group Early  mean in group Late 
##            5529.341            1551.369

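Interpretation: The t-test reveals that daily vaccination rates are significantly lower in the later period compared to the early period (p < 2.2e-16). The mean daily rate fell from about 5,529 per million before June 1, 2021 to about 1,551 per million afterward, a difference of roughly 3,978 (95% CI: 3,540 to 4,415). In plain terms, the daily pace of vaccination after the cutoff was less than a third of what it was before.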
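
The output above is labeled "Welch Two Sample t-test" because t.test() does not assume equal variances by default. As a rough illustration of what that statistic is, the sketch below computes it by hand on hypothetical numbers (not values from the dataset) and compares the result with t.test().

```r
# Hypothetical values, not taken from the dataset.
early <- c(5200, 6100, 4800, 5900, 5500)
late  <- c(1500, 1700, 1400, 1600, 1550, 1450)

# Per-group variance of the mean.
v1 <- var(early) / length(early)
v2 <- var(late) / length(late)

# Welch statistic: difference in means over the unpooled standard error.
t_manual <- (mean(early) - mean(late)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation for the degrees of freedom.
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(early) - 1) + v2^2 / (length(late) - 1))

fit <- t.test(early, late)  # var.equal = FALSE by default, i.e. Welch

print(c(t_manual = t_manual, t_builtin = unname(fit$statistic)))
print(c(df_manual = df_manual, df_builtin = unname(fit$parameter)))
```

The fractional degrees of freedom in the report (185.13) come from this same approximation, which is why they are not a whole number.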


0.8 Conclusion

Through this project, I explored vaccination trends using cumulative and daily rates. My findings suggest that cumulative vaccinations increased over time while daily rates were significantly lower after June 1, 2021, with fluctuations in daily rates reflecting possible societal and policy influences. This project allowed me to apply data wrangling, visualization, and statistical methods to better understand how public health trends evolve.

As a graduate student, this exercise deepened my curiosity about societal factors influencing health outcomes and reinforced the value of data-driven research.

What I can and want to improve on is bringing in more data through APIs to examine how elections and the demographics of political groups affect the vaccination rate!