Final Project

Author

Latifah Traore

Exploring the Spread of Contagious Diseases in the United States

A visual representation of hand-to-hand germ transmission

INTRODUCTION

Contagious diseases remain a significant public health concern, shaping policy and influencing individual behaviors. Understanding how these diseases spread, the states most affected, and the patterns of reporting over time can provide valuable insights for both public health professionals and policymakers. This project explores data on contagious diseases in the United States to gain a deeper understanding of their prevalence and distribution. The dataset for this project is from the Centers for Disease Control and Prevention (CDC) via their public data portal (data.cdc.gov). It includes the following key variables: Disease (the specific illness reported), State (the reporting U.S. state), Year (the year data was recorded), Weeks Reporting (number of weeks data was reported), Count (reported cases), and Population (state population for the given year). These variables offer an opportunity to explore patterns of disease occurrence, identify areas most affected, and investigate the relationship between population size and case counts.

I chose this topic and dataset because I wanted to better understand how contagious diseases spread within the U.S. and the factors influencing reporting trends. As someone interested in public health, this dataset offers an opportunity to explore patterns of disease occurrence and assess how public health data can inform responses to health crises.

In order to understand the broader context of the data, I researched the history and patterns of contagious diseases in the United States. Historical records show that the introduction of vaccines in the mid-20th century led to a significant decline in diseases such as measles and polio. The CDC and other public health organizations have played a crucial role in tracking outbreaks and guiding prevention measures. This dataset reflects the ongoing efforts to monitor and report diseases, contributing to public health strategies.

Unfortunately, the dataset does not include explicit details on the methodology used to collect the data.

This project will delve into questions such as:

Which diseases have the highest reported case counts?
How do trends in disease prevalence vary across states and over time?
Is there a correlation between population size and the number of reported cases?

Through this analysis, I aim to enhance awareness of the dynamics of contagious diseases and contribute to a broader understanding of their impact on public health.

Dataset Source: data.cdc.gov

Background Research Source: History of Vaccines. (n.d.). Vaccines and immunization: History of vaccines. The College of Physicians of Philadelphia. Retrieved December 16, 2024, from https://www.historyofvaccines.org/

# Load necessary libraries
library(tidyverse) # For data manipulation and visualization

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes) # Adds additional themes
library(plotly)   # Enables the creation of interactive plots


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(RColorBrewer) #Supplies color palettes for visualizations

These libraries are loaded to handle data manipulation, create visualizations, add interactivity to plots, and use custom color schemes.

# Load the dataset using read_csv
setwd("C:/Users/akais/OneDrive/Documents/Dataset for final project") 
disease_data <- read_csv("us_contagious_diseases.csv", show_col_types = FALSE)

The dataset is imported from a local directory. The setwd() function specifies the file path, while read_csv() reads the CSV file into a dataframe for analysis. The show_col_types = FALSE argument suppresses column type messages.
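As a hedged sketch of a more portable import, the example below writes two rows copied from the dataset preview to a temporary file and reads them back with explicit column types. The temporary file, the object name example_data, and the col_types specification are assumptions made for illustration; the project itself reads the full CSV from a local folder.

```r
library(readr)

# Illustrative stand-in for the real CSV: two rows copied from the dataset preview
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "disease,state,year,weeks_reporting,count,population",
  "Hepatitis A,Alabama,1966,50,321,3345787",
  "Hepatitis A,Alabama,1967,49,291,3364130"
), tmp)

# Declaring column types up front makes parsing problems surface at import time
example_data <- read_csv(
  tmp,
  col_types = cols(
    disease = col_character(),
    state = col_character(),
    year = col_integer(),
    weeks_reporting = col_integer(),
    count = col_double(),
    population = col_double()
  )
)

example_data
```

Reading from a path relative to the project folder, rather than calling setwd() with an absolute path, is one way to make the script run unchanged on another machine.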

# Display the first few rows to understand the structure of the data
head(disease_data)

# A tibble: 6 × 6
  disease     state    year weeks_reporting count population
  <chr>       <chr>   <dbl>           <dbl> <dbl>      <dbl>
1 Hepatitis A Alabama  1966              50   321    3345787
2 Hepatitis A Alabama  1967              49   291    3364130
3 Hepatitis A Alabama  1968              52   314    3386068
4 Hepatitis A Alabama  1969              49   380    3412450
5 Hepatitis A Alabama  1970              51   413    3444165
6 Hepatitis A Alabama  1971              51   378    3481798

The head() function is used to preview the first few rows of the dataset. This step helps ensure the data is loaded correctly and allows you to familiarize yourself with its structure.

Cleaning and Wrangling the Data

Step 1: Renaming Columns for Clarity

cleaned_disease_data <- disease_data %>%
  rename_with(~ str_to_title(.))

This code capitalizes the first letter of each column name for better readability. The rename_with() function applies str_to_title() to every column name, so weeks_reporting becomes Weeks_reporting.
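To see the renaming in isolation, here is a minimal sketch on a toy data frame. The object names toy and toy_renamed are hypothetical and the values are made up; only the column-name style mirrors the real dataset.

```r
library(dplyr)
library(stringr)

# Toy data frame with the same style of lowercase names as the raw data
toy <- data.frame(disease = "Measles", weeks_reporting = 50)

# Apply str_to_title() to every column name
toy_renamed <- toy %>%
  rename_with(str_to_title)

names(toy_renamed)
```

The underscore does not start a new word, which is why the result is Weeks_reporting rather than Weeks_Reporting, matching the preview of the cleaned data.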

# Display the first few rows of the cleaned data with capitalized column names
head(cleaned_disease_data)

# A tibble: 6 × 6
  Disease     State    Year Weeks_reporting Count Population
  <chr>       <chr>   <dbl>           <dbl> <dbl>      <dbl>
1 Hepatitis A Alabama  1966              50   321    3345787
2 Hepatitis A Alabama  1967              49   291    3364130
3 Hepatitis A Alabama  1968              52   314    3386068
4 Hepatitis A Alabama  1969              49   380    3412450
5 Hepatitis A Alabama  1970              51   413    3444165
6 Hepatitis A Alabama  1971              51   378    3481798

The cleaned dataset is previewed to confirm that every column name now starts with an uppercase letter.

Step 2: Handling Missing Values

# Identify missing values in each column
colSums(is.na(cleaned_disease_data))

        Disease           State            Year Weeks_reporting           Count 
              0               0               0               0               0 
     Population 
            204

This code counts the missing values in each column. The colSums() function, combined with is.na(), totals the missing entries per column. Here, Population is the only column with missing values, with 204 in total.
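The same check can be written in tidyverse style, and a follow-up count shows where the gaps sit. This is a sketch on made-up toy data; the object names toy, na_counts, and na_by_state are hypothetical.

```r
library(dplyr)

# Toy data with two missing population values
toy <- data.frame(
  State = c("Alabama", "Alabama", "Alaska"),
  Count = c(321, 291, 0),
  Population = c(3345787, NA, NA)
)

# Missing values per column, equivalent to colSums(is.na(toy))
na_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# Which states account for the missing population values
na_by_state <- toy %>%
  filter(is.na(Population)) %>%
  count(State)

na_counts
na_by_state
```

Knowing which state-year groups hold the missing values helps when choosing an imputation method, because a group-based fill can only work where the group has at least one known value.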

# Calculate median for each group (State and Year)
group_medians <- tapply(cleaned_disease_data$Population, 
                        list(cleaned_disease_data$State, cleaned_disease_data$Year), 
                        function(x) median(x, na.rm = TRUE))

The tapply() function calculates the median population for each state-year combination and returns a State-by-Year matrix. The matrix serves as a reference only; the imputation step below recomputes the same medians with ave().

# Replace NA values in Population with corresponding group median
cleaned_disease_data$Population <- ifelse(
  is.na(cleaned_disease_data$Population),
  ave(cleaned_disease_data$Population, cleaned_disease_data$State, cleaned_disease_data$Year, FUN = function(x) median(x, na.rm = TRUE)),
  cleaned_disease_data$Population
)

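Missing values in the Population column are replaced with the median population of their state-year group. This only helps when at least one row in the group has a known population: if every row in a group is missing, the group median is itself NA and the gap remains. The regression output in Step 4 reports 152 observations deleted due to missingness, which indicates that some gaps were not filled.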
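A toy illustration of how group-median imputation behaves, using the same ave() and ifelse() pattern as above. The data are made up: one group has a known population and one has none.

```r
# Toy data: Alabama 1966 has one known population value, Alaska 1930 has none
toy <- data.frame(
  State = c("Alabama", "Alabama", "Alaska", "Alaska"),
  Year = c(1966, 1966, 1930, 1930),
  Population = c(3345787, NA, NA, NA)
)

# Median population within each state-year group, ignoring missing values
group_median <- ave(
  toy$Population, toy$State, toy$Year,
  FUN = function(x) median(x, na.rm = TRUE)
)

# Fill a missing value only where the group median is available
toy$Population_filled <- ifelse(is.na(toy$Population), group_median, toy$Population)

toy
sum(is.na(toy$Population_filled))
```

The Alabama gap is filled from the other row in its group, while both Alaska rows stay NA because the group contains no known value to take a median of. Filling such groups would need information from outside the group, for example neighboring years of the same state.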

Step 3: Filtering Relevant Data

# Filter data for diseases with the most reported cases
top_diseases <- cleaned_disease_data %>%
  group_by(Disease) %>%
  summarise(Total_Cases = sum(Count, na.rm = TRUE)) %>%
  arrange(desc(Total_Cases)) %>%
  slice_head(n = 5)

This code identifies the top five diseases with the highest total reported cases by grouping the data by disease, summing the case counts, and sorting them in descending order.

# Filter dataset for these top diseases
filtered_data <- cleaned_disease_data %>%
  filter(Disease %in% top_diseases$Disease)

This filters the dataset to retain only rows corresponding to the top five diseases identified in the previous step.
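The two steps above (rank diseases by total cases, then keep only their rows) can also be sketched more compactly. The example uses made-up counts and keeps the top two diseases instead of five; toy, top_two, and toy_filtered are hypothetical names.

```r
library(dplyr)

# Toy data with made-up case counts
toy <- data.frame(
  Disease = c("Measles", "Measles", "Polio", "Mumps", "Mumps"),
  Count = c(500, 300, 200, 50, 25)
)

# Total cases per disease, keeping the two largest totals
top_two <- toy %>%
  count(Disease, wt = Count, name = "Total_Cases") %>%
  slice_max(Total_Cases, n = 2)

# Keep only the rows whose disease appears in top_two
toy_filtered <- toy %>%
  semi_join(top_two, by = "Disease")

top_two
toy_filtered
```

One difference worth knowing: slice_max() keeps tied rows by default, so it can return more than n rows, whereas arrange() followed by slice_head() always returns exactly n.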

Step 4: Statistical Analysis: Linear Regression

Relationship Between Population and Case Counts

A linear regression model is fitted to explore the relationship between population size and the number of reported cases. The summary() function provides detailed statistical results, including the strength and significance of the relationship.

# Linear regression model
lm_model <- lm(Count ~ Population, data = filtered_data)
summary(lm_model)


Call:
lm(formula = Count ~ Population, data = filtered_data)

Residuals:
   Min     1Q Median     3Q    Max 
 -6268  -1562  -1155   -828 129916 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.000e+03  6.452e+01   15.51   <2e-16 ***
Population  1.447e-04  9.919e-06   14.59   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6006 on 15452 degrees of freedom
  (152 observations deleted due to missingness)
Multiple R-squared:  0.01359,   Adjusted R-squared:  0.01353 
F-statistic:   213 on 1 and 15452 DF,  p-value: < 2.2e-16

For this analysis, I fitted a simple linear regression of reported case counts on population size. The fitted equation is Count = 1000 + 0.0001447 × Population. Both the intercept and the population coefficient are statistically significant (p < 0.001), so larger populations are associated with higher reported counts. However, the adjusted R² of 0.0135 means the model explains only about 1.35% of the variation in disease counts, so it predicts individual counts poorly. The diagnostic plots generated below also show problems, such as uneven spread of the residuals and non-normal patterns, suggesting the model could be improved.
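The fitted equation can be turned into a quick worked example. The sketch below plugs a hypothetical state population of 5 million (an illustrative figure, not a value taken from the dataset) into the coefficients reported in the summary.

```r
# Coefficients as reported in the regression summary
intercept <- 1000
slope <- 1.447e-04

# Hypothetical state population, chosen only for illustration
population <- 5e6

# Predicted yearly case count for one disease in a state of this size
predicted_count <- intercept + slope * population
predicted_count

# Expected change in count per additional one million residents
change_per_million <- slope * 1e6
change_per_million
```

The prediction works out to roughly 1,700 cases, with about 145 extra cases per additional million residents. Set against the residual standard error of 6006 reported in the summary, a prediction of this size is very uncertain, which is consistent with the low R-squared.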

# Diagnostic plots
par(mfrow = c(2, 2))
plot(lm_model)

Diagnostic plots for the regression model are generated to check assumptions like linearity, normality of residuals, and homoscedasticity.

Step 5: Visualizations

Visualization 1: Disease Trends Over Time

# Aggregate cases by disease and year across all states
disease_trends <- filtered_data %>% 
  group_by(Disease, Year) %>% 
  summarise(
    Total_Cases = sum(Count),
    Total_Population = sum(Population),
    Cases_Per_100k = (Total_Cases / Total_Population) * 100000,
    .groups = 'drop'
  )

This code calculates total cases, total population, and cases per 100,000 people for each disease-year combination.
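As a worked example of the rate formula, the sketch below uses the first row of the dataset preview (Hepatitis A, Alabama, 1966: 321 cases, population 3,345,787). It also shows how sum() treats missing values, which matters here because sum(Population) is called without na.rm = TRUE.

```r
# Values from the first row of the dataset preview
cases <- 321
population <- 3345787

# Cases per 100,000 people
rate_per_100k <- cases / population * 100000
rate_per_100k

# sum() returns NA as soon as one input is NA unless na.rm = TRUE is set
total_with_na <- sum(c(3345787, NA))
total_without_na <- sum(c(3345787, NA), na.rm = TRUE)
total_with_na
total_without_na
```

The rate comes to about 9.6 cases per 100,000. For any disease-year group that still contains a missing population, Total_Population and Cases_Per_100k will be NA. Adding na.rm = TRUE would avoid the NA but would divide all of the cases by only part of the population, so neither option is neutral.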

# Create an interactive plot of disease trends over time
interactive_plot <-  ggplot(disease_trends, aes(x = Year, y = Total_Cases, color = Disease)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Total Disease Cases Across All States",
    subtitle = "Yearly Trends of Different Diseases",
    x = "Year",
    y = "Total Number of Cases",
    color = "Disease"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::comma)

# Convert the ggplot to an interactive plot
ggplotly(interactive_plot)

This graph shows the total number of reported cases over time for the top five diseases in the United States: Measles, Polio, Mumps, Pertussis, and Hepatitis A. Each disease is represented by a color. Measles had the highest number of cases, peaking in 1938 with 820,087 cases, but cases dropped sharply after the 1960s due to vaccination. Polio also saw a significant decline after the mid-1950s, when its vaccine was introduced. The other diseases (Mumps, Pertussis, and Hepatitis A) show lower case numbers overall and a steady decline starting in the mid-20th century.

# Calculate percentage change
disease_trends_pct <- disease_trends %>% 
  group_by(Disease) %>% 
  mutate(
    Pct_Change = (Total_Cases - lag(Total_Cases)) / lag(Total_Cases) * 100
  ) %>% 
  filter(!is.na(Pct_Change))

# Create the plot
ggplot(disease_trends_pct, aes(x = Year, y = Pct_Change, color = Disease)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Percentage Change in Disease Cases",
    subtitle = "Year-over-Year Variation",
    x = "Year",
    y = "Percentage Change in Cases",
    color = "Disease"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))

This plot visualizes the percentage change in disease cases over time for each disease. It helps to observe year-over-year variations and trends.
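The percentage-change calculation is easier to follow on a few made-up numbers. The sketch below applies the same lag() formula to a toy table; the explicit arrange() is an added assumption that makes sure each group is in year order before lagging.

```r
library(dplyr)

# Toy yearly totals for two diseases (made-up numbers)
toy <- data.frame(
  Disease = c("Measles", "Measles", "Measles", "Polio", "Polio"),
  Year = c(1950, 1951, 1952, 1950, 1951),
  Total_Cases = c(100, 150, 75, 40, 10)
)

toy_pct <- toy %>%
  group_by(Disease) %>%
  arrange(Year, .by_group = TRUE) %>%   # lag() relies on row order within each group
  mutate(Pct_Change = (Total_Cases - lag(Total_Cases)) / lag(Total_Cases) * 100) %>%
  ungroup()

toy_pct
```

The first year of each disease has no previous value, so its change is NA, which is why the project code filters those rows out. A previous-year total of zero would produce Inf or NaN, and a very small previous-year total produces an extreme percentage, so large spikes in the plot do not necessarily mean large outbreaks.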

Visualization 2: Highest Reported Cases by Top 5 States

# Calculate total cases by state
top_states <- cleaned_disease_data %>%
  group_by(State) %>%
  summarise(Total_Cases = sum(Count, na.rm = TRUE)) %>%
  arrange(desc(Total_Cases)) %>%
  slice_head(n = 5)

# Create the bar plot for top 5 states with highest reported case counts
ggplot(top_states, aes(x = reorder(State, -Total_Cases), y = Total_Cases, fill = State)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +  # Flip the axes for better readability
  labs(
    title = "Top 5 States with Highest Reported Case Counts",
    x = "State",
    y = "Total Case Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::comma)  # Format the y-axis with commas

The bar chart displays the five states with the highest total reported case counts. New York has the highest total, followed by California, Pennsylvania, Texas, and Michigan. Because the axes are flipped, the horizontal axis shows the total case count, ranging from 0 to 2 million, and the vertical axis lists the states. Each state's bar has its own color.
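The order of the bars is controlled by reorder(), which turns State into a factor whose levels are sorted by the second argument. A small base R sketch with made-up totals shows the effect of the minus sign; the names states and totals are hypothetical.

```r
# Made-up totals, used only to show how reorder() sorts factor levels
states <- c("Michigan", "New York", "Texas")
totals <- c(10, 30, 20)

# Levels sorted by decreasing total (as in the plot code above)
levels_desc <- levels(reorder(states, -totals))
levels_desc

# Levels sorted by increasing total
levels_asc <- levels(reorder(states, totals))
levels_asc
```

With coord_flip(), ggplot2 draws the first factor level at the bottom of the axis, so sorting by -Total_Cases places the largest state at the bottom. Dropping the minus sign is a common way to put the longest bar at the top instead.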

CONCLUSION

The visualization provides a detailed view of the total number of reported cases for the top five diseases over time. Measles, Polio, Mumps, Pertussis, and Hepatitis A show distinct trends, with Measles peaking in the late 1930s and Polio declining after the mid-1950s, when its vaccine was introduced. While vaccination campaigns led to a reduction in these diseases, the plot also shows fluctuations in the early 20th century, likely driven by public health responses and changes in reporting practices.

The linear regression analysis shows a statistically significant relationship between population size and reported cases, although the model explains only a small portion of the variation in disease counts. This suggests that while population size has some influence, other factors such as healthcare access, vaccination rates, and social behaviors may also play significant roles.

The analysis offers useful insights, but there were some challenges. One issue was dealing with missing values in the population column. I filled in the missing data using the median population for each state-year group, but this method might not be the best for every case. Additionally, I had some difficulties with the interactive features of the visualizations and wished I could have added more details, like disease-specific heatmaps by state.

Despite these challenges, the visualization and analysis provide valuable information about the spread of contagious diseases in the U.S., helping to highlight historical trends and guide future public health planning.