Exploring the Spread of Contagious Diseases in the United States
A visual representation of hand-to-hand Germ Transmission
INTRODUCTION
Contagious diseases remain a significant public health concern, shaping policy and influencing individual behaviors. Understanding how these diseases spread, the states most affected, and the patterns of reporting over time can provide valuable insights for both public health professionals and policymakers. This project explores data on contagious diseases in the United States to gain a deeper understanding of their prevalence and distribution. The dataset for this project is from the Centers for Disease Control and Prevention (CDC) via their public data portal (data.cdc.gov). It includes the following key variables: Disease (the specific illness reported), State (the reporting U.S. state), Year (the year data was recorded), Weeks Reporting (number of weeks data was reported), Count (reported cases), and Population (state population for the given year). These variables offer an opportunity to explore patterns of disease occurrence, identify areas most affected, and investigate the relationship between population size and case counts.
I chose this topic and dataset because I wanted to better understand how contagious diseases spread within the U.S. and the factors influencing reporting trends. As someone interested in public health, this dataset offers an opportunity to explore patterns of disease occurrence and assess how public health data can inform responses to health crises.
In order to understand the broader context of the data, I researched the history and patterns of contagious diseases in the United States. Historical records show that the introduction of vaccines in the mid-20th century led to a significant decline in diseases such as measles and polio. The CDC and other public health organizations have played a crucial role in tracking outbreaks and guiding prevention measures. This dataset reflects the ongoing efforts to monitor and report diseases, contributing to public health strategies.
Unfortunately, the dataset does not include explicit details on the methodology used to collect the data.
This project will delve into questions such as:
Which diseases have the highest reported case counts?
How do trends in disease prevalence vary across states and over time?
Is there a correlation between population size and the number of reported cases?
Through this analysis, I aim to enhance awareness of the dynamics of contagious diseases and contribute to a broader understanding of their impact on public health.
Background Research Source: History of Vaccines. (n.d.). Vaccines and immunization: History of vaccines. The College of Physicians of Philadelphia. Retrieved December 16, 2024, from https://www.historyofvaccines.org/
# Load necessary libraries
library(tidyverse) # For data manipulation and visualization
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggthemes) # Adds additional themes
library(plotly) # Enables the creation of interactive plots
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(RColorBrewer) # Supplies color palettes for visualizations
These libraries are loaded to handle data manipulation, create visualizations, add interactivity to plots, and use custom color schemes.
# Load the dataset using read_csv
setwd("C:/Users/akais/OneDrive/Documents/Dataset for final project")
disease_data <- read_csv("us_contagious_diseases.csv", show_col_types = FALSE)
The dataset is imported from a local directory. The setwd() function sets the working directory to the folder containing the file, and read_csv() reads the CSV file into a data frame for analysis. The show_col_types = FALSE argument suppresses the column type message.
# Display the first few rows to understand the structure of the data
head(disease_data)
# A tibble: 6 × 6
disease state year weeks_reporting count population
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Hepatitis A Alabama 1966 50 321 3345787
2 Hepatitis A Alabama 1967 49 291 3364130
3 Hepatitis A Alabama 1968 52 314 3386068
4 Hepatitis A Alabama 1969 49 380 3412450
5 Hepatitis A Alabama 1970 51 413 3444165
6 Hepatitis A Alabama 1971 51 378 3481798
The head() function is used to preview the first few rows of the dataset. This step helps ensure the data is loaded correctly and allows you to familiarize yourself with its structure.
This code standardizes column names to title case for better readability. The rename_with() function applies str_to_title() to transform all column names.
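The chunk that performs this renaming is not visible in the rendered output. A minimal sketch of what it might look like is given here, assuming the result is stored as cleaned_disease_data (the name used by the later chunks) and using a two-row stand-in built from the preview above rather than the full file:

```r
library(dplyr)
library(stringr)

# Two-row stand-in with the same column names as the original dataset
disease_data <- data.frame(
  disease = c("Hepatitis A", "Hepatitis A"),
  state = c("Alabama", "Alabama"),
  year = c(1966, 1967),
  weeks_reporting = c(50, 49),
  count = c(321, 291),
  population = c(3345787, 3364130)
)

# Apply str_to_title() to every column name
cleaned_disease_data <- disease_data %>%
  rename_with(str_to_title)

names(cleaned_disease_data)
```

Only the first letter of each name is capitalized, which is why weeks_reporting becomes Weeks_reporting rather than Weeks_Reporting.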
# Display the first few rows of the cleaned data with capitalized column names
head(cleaned_disease_data)
# A tibble: 6 × 6
Disease State Year Weeks_reporting Count Population
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Hepatitis A Alabama 1966 50 321 3345787
2 Hepatitis A Alabama 1967 49 291 3364130
3 Hepatitis A Alabama 1968 52 314 3386068
4 Hepatitis A Alabama 1969 49 380 3412450
5 Hepatitis A Alabama 1970 51 413 3444165
6 Hepatitis A Alabama 1971 51 378 3481798
# Changes column names to start with uppercase letters for better readability
The cleaned dataset is previewed to confirm that column names were successfully modified.
Step 2: Handling Missing Values
# Identify missing values in each column
colSums(is.na(cleaned_disease_data))
Disease State Year Weeks_reporting Count
0 0 0 0 0
Population
204
This code identifies the number of missing values in each column. The colSums() function, combined with is.na(), calculates the total missing entries per column. In this case, the Population column has 204 missing values and the other columns have none.
# Calculate median for each group (State and Year)
group_medians <- tapply(
  cleaned_disease_data$Population,
  list(cleaned_disease_data$State, cleaned_disease_data$Year),
  function(x) median(x, na.rm = TRUE)
)
The tapply() function calculates the median population for each state-year group. This step prepares a reference to fill in missing population data with appropriate values.
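# Replace NA values in Population with corresponding group median
cleaned_disease_data$Population <- ifelse(
  is.na(cleaned_disease_data$Population),
  ave(
    cleaned_disease_data$Population,
    cleaned_disease_data$State,
    cleaned_disease_data$Year,
    FUN = function(x) median(x, na.rm = TRUE)
  ),
  cleaned_disease_data$Population
)
Missing values in the Population column are replaced with the median population of their respective state and year. This only fills a gap when at least one other row for the same state and year has a population value; if every row in a state-year group is missing, the group median is itself NA and the value stays missing. The regression output in Step 4 reports 152 observations deleted due to missingness, which indicates that some missing values remain after this step.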
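To see how group-median imputation behaves, here is a small sketch on made-up data. The states "A" and "B" and all numbers are hypothetical; only the ifelse() and ave() pattern is taken from the code above.

```r
# Hypothetical data: state A has one gap, state B has no population at all
toy <- data.frame(
  State = c("A", "A", "A", "B", "B"),
  Year = c(2000, 2000, 2000, 2000, 2000),
  Population = c(100, NA, 100, NA, NA)
)

# Median population within each State-Year group, ignoring NAs
group_median <- ave(
  toy$Population, toy$State, toy$Year,
  FUN = function(x) median(x, na.rm = TRUE)
)

# Fill only the missing entries with their group's median
toy$Population_filled <- ifelse(
  is.na(toy$Population), group_median, toy$Population
)

toy
```

The gap in state A is filled from the other rows of the same group, while state B stays NA because its group has no observed value to take a median from.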
Step 3: Filtering Relevant Data
# Filter data for diseases with the most reported cases
top_diseases <- cleaned_disease_data %>%
  group_by(Disease) %>%
  summarise(Total_Cases = sum(Count, na.rm = TRUE)) %>%
  arrange(desc(Total_Cases)) %>%
  slice_head(n = 5)
This code identifies the top five diseases with the highest total reported cases by grouping the data by disease, summing the case counts, and sorting them in descending order.
# Filter dataset for these top diseases
filtered_data <- cleaned_disease_data %>%
  filter(Disease %in% top_diseases$Disease)
This filters the dataset to retain only rows corresponding to the top five diseases identified in the previous step.
Step 4: Statistical Analysis: Linear Regression
Relationship Between Population and Case Counts
A linear regression model is fitted to explore the relationship between population size and the number of reported cases. The summary() function provides detailed statistical results, including the strength and significance of the relationship.
# Linear regression model
lm_model <- lm(Count ~ Population, data = filtered_data)
summary(lm_model)
Call:
lm(formula = Count ~ Population, data = filtered_data)
Residuals:
Min 1Q Median 3Q Max
-6268 -1562 -1155 -828 129916
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.000e+03 6.452e+01 15.51 <2e-16 ***
Population 1.447e-04 9.919e-06 14.59 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6006 on 15452 degrees of freedom
(152 observations deleted due to missingness)
Multiple R-squared: 0.01359, Adjusted R-squared: 0.01353
F-statistic: 213 on 1 and 15452 DF, p-value: < 2.2e-16
For this analysis, I performed a simple linear regression to explore the relationship between population size and reported disease counts. The fitted equation is Count = 1000 + 0.0001447 × Population. Both the intercept and the population coefficient are statistically significant (p < 0.001), meaning population size is positively associated with reported counts. However, the adjusted R^2 = 0.0135 shows that the model explains only about 1.35% of the variation in disease counts, so it predicts individual counts poorly. Diagnostic plots show some problems, such as uneven spread of the residuals and non-normal patterns, suggesting the model could be improved.
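As a rough illustration of what the fitted line implies, the sketch below plugs two hypothetical population sizes into the equation and converts R-squared into a correlation coefficient. The coefficients are copied from the summary() output above; the helper predict_count() is introduced here only for illustration.

```r
# Coefficients and R-squared copied from the summary() output
intercept <- 1000
slope <- 1.447e-04
r_squared <- 0.01359

# Predicted count for a given population under the fitted line
predict_count <- function(population) intercept + slope * population

predict_count(1e6) # hypothetical state of 1 million people
predict_count(1e7) # hypothetical state of 10 million people

# In simple linear regression, |r| equals sqrt(R^2) and takes the slope's sign
r <- sign(slope) * sqrt(r_squared)
r
```

The resulting correlation is about 0.12, a weak positive relationship, which fits the low R-squared reported above.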
Diagnostic plots for the regression model are generated to check assumptions like linearity, normality of residuals, and homoscedasticity.
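The diagnostic plots themselves do not appear in the rendered output. One common way to produce them is base R's plot() method for lm objects; the sketch below assumes that approach and runs on simulated data (not the CDC data), with noise that grows with population so the uneven residual spread is visible.

```r
set.seed(42)

# Simulated stand-in data: counts loosely tied to population, with noise
# whose spread grows with population (deliberately heteroscedastic)
sim <- data.frame(Population = runif(200, 5e5, 2e7))
sim$Count <- 1000 + 1.5e-4 * sim$Population +
  rnorm(200, sd = 1e-4 * sim$Population)

sim_model <- lm(Count ~ Population, data = sim)

# Four standard panels: residuals vs fitted, normal Q-Q,
# scale-location, and residuals vs leverage
old_par <- par(mfrow = c(2, 2))
plot(sim_model)
par(old_par)
```

With the real model, the same two plotting lines would be called on lm_model instead of sim_model.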
Step 5: Visualizations
Visualization 1: Disease Trends Over Time
# Aggregate cases by disease and year across all states
disease_trends <- filtered_data %>%
  group_by(Disease, Year) %>%
  summarise(
    Total_Cases = sum(Count),
    Total_Population = sum(Population),
    Cases_Per_100k = (Total_Cases / Total_Population) * 100000,
    .groups = 'drop'
  )
This code calculates total cases, total population, and cases per 100,000 people for each disease-year combination.
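As a worked example of the per-100,000 formula, the sketch below uses the first row of the dataset preview (Hepatitis A, Alabama, 1966). It also shows how sum() treats missing values, which matters for any year in which a state's population is NA.

```r
# First row of the dataset preview: Hepatitis A, Alabama, 1966
cases <- 321
population <- 3345787

# Cases per 100,000 residents
cases_per_100k <- (cases / population) * 100000
round(cases_per_100k, 1)

# sum() returns NA if any input is NA, so a single missing population
# makes the whole rate missing unless na.rm = TRUE is used
sum(c(3345787, NA))
sum(c(3345787, NA), na.rm = TRUE)
```

This works out to roughly 9.6 cases per 100,000 people. Note that using na.rm = TRUE for the population total would leave that state's cases in the numerator while dropping its population from the denominator, so it is a trade-off rather than a clean fix.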
# Create an interactive plot of disease trends over time
interactive_plot <- ggplot(disease_trends, aes(x = Year, y = Total_Cases, color = Disease)) +
  geom_line(linewidth = 1) +
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Total Disease Cases Across All States",
    subtitle = "Yearly Trends of Different Diseases",
    x = "Year",
    y = "Total Number of Cases",
    color = "Disease"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::comma)

# Convert the ggplot to an interactive plot
ggplotly(interactive_plot)
This graph shows the total number of reported cases over time in the United States for the top five diseases: Measles, Polio, Mumps, Pertussis, and Hepatitis A. Each disease is represented by a color. Measles had the highest number of cases, peaking in 1938 with 820,087 cases, but cases dropped sharply after the 1960s due to vaccination. Polio also saw a significant decline after the mid-1950s when its vaccine was introduced. The other diseases, Mumps, Pertussis, and Hepatitis A, show lower case numbers overall and a steady decline starting in the mid-20th century.
# Create the plot
ggplot(disease_trends_pct, aes(x = Year, y = Pct_Change, color = Disease)) +
  geom_line(linewidth = 1) + # linewidth replaces size for lines (ggplot2 >= 3.4.0)
  geom_point(size = 2, alpha = 0.7) +
  labs(
    title = "Percentage Change in Disease Cases",
    subtitle = "Year-over-Year Variation",
    x = "Year",
    y = "Percentage Change in Cases",
    color = "Disease"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5),
    legend.position = "bottom",
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_color_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::percent_format(scale = 1))
This plot visualizes the percentage change in disease cases over time for each disease. It helps to observe year-over-year variations and trends.
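The chunk that builds disease_trends_pct is not shown in the rendered output. A minimal sketch of one way to compute it is given here, assuming Pct_Change is the change from the previous year expressed in percent units; the three yearly totals are hypothetical.

```r
library(dplyr)

# Hypothetical yearly totals for one disease
disease_trends <- data.frame(
  Disease = "Measles",
  Year = c(1950, 1951, 1952),
  Total_Cases = c(100, 150, 75)
)

# Year-over-year percentage change within each disease
disease_trends_pct <- disease_trends %>%
  group_by(Disease) %>%
  arrange(Year, .by_group = TRUE) %>%
  mutate(
    Pct_Change = (Total_Cases - dplyr::lag(Total_Cases)) /
      dplyr::lag(Total_Cases) * 100
  ) %>%
  ungroup()

disease_trends_pct$Pct_Change
```

The first year of each disease has no previous value, so its change is NA. Multiplying by 100 matches the plot's percent_format(scale = 1), which expects values already in percent units. A year that follows a zero-case year would produce Inf or NaN and may need separate handling.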
Visualization 2: Highest Reported Cases by Top 5 States
# Calculate total cases by state
top_states <- cleaned_disease_data %>%
  group_by(State) %>%
  summarise(Total_Cases = sum(Count, na.rm = TRUE)) %>%
  arrange(desc(Total_Cases)) %>%
  slice_head(n = 5)
# Create the bar plot for top 5 states with highest reported case counts
ggplot(top_states, aes(x = reorder(State, -Total_Cases), y = Total_Cases, fill = State)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() + # Flip the axes for better readability
  labs(
    title = "Top 5 States with Highest Reported Case Counts",
    x = "State",
    y = "Total Case Count"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank()
  ) +
  scale_fill_brewer(palette = "Set1") +
  scale_y_continuous(labels = scales::comma) # Format the y-axis with commas
The bar chart displays the top 5 states with the highest reported case counts, with New York having the highest total case count, followed by California, Pennsylvania, Texas, and Michigan, which has the lowest total among the five. The x-axis shows the total case count, ranging from 0 to 2 million, while the y-axis lists the states. Each colored bar represents the case count for each state, with New York having the longest bar and Michigan the shortest.
CONCLUSION
The visualization provides a detailed view of the total number of reported cases for the top five diseases over time. Measles, Polio, Mumps, Pertussis, and Hepatitis A show distinct trends, with Measles peaking in the late 1930s and Polio declining after the mid-1950s following the introduction of its vaccine. Interestingly, the plot reveals that, while vaccination campaigns led to a reduction in these diseases, there are fluctuations in the early 20th century, likely driven by public health responses and changes in reporting practices.
The linear regression analysis shows a statistically significant relationship between population size and reported cases, although the model explains only a small portion of the variation in disease counts. This suggests that while population size has some influence, other factors such as healthcare access, vaccination rates, and social behaviors may also play significant roles.
The analysis offers useful insights, but there were some challenges. One issue was dealing with missing values in the population column. I filled in the missing data using the median population for each state-year group, but this method might not be the best for every case. Additionally, I had some difficulties with the interactive features of the visualizations and wished I could have added more details, like disease-specific heatmaps by state.
Despite these challenges, the visualization and analysis provide valuable information about the spread of contagious diseases in the U.S., helping to highlight historical trends and guide future public health planning.