The US Contagious Diseases data set contains information about reported cases of various contagious diseases in the United States from 1928 to 2011. The variables in the data set include:
Disease: The name of the disease being reported.
State: The state in which the case of the disease was reported.
Year: The year in which the case of the disease was reported.
Weeks Reporting: The number of weeks for which cases were reported.
Count: The number of reported cases.
Population: The population of the state.
In this analysis, I explore trends in reported cases of contagious diseases over time in Maryland and examine how the number of reported cases relates to the disease being reported from 1968 onward. I also make a bar plot to see which disease appears to be the most contagious, judged by total reported cases.
The data set is sourced from the Tycho Project (http://www.tycho.pitt.edu/).
Load the libraries that will be used and read the CSV data set file
## Load the libraries ggplot2, dplyr, and gridExtra
library(ggplot2)
library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(gridExtra)
Warning: package 'gridExtra' was built under R version 4.3.3
Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':
combine
## Read the csv data and name it US_ConDis
US_ConDis <- read.csv("us_contagious_diseases.csv")
Check for missing data and, if any data is missing, clean it out. (DATA CLEANING)
## Checking the data for missing data values
head(US_ConDis)
      disease   state year weeks_reporting count population
1 Hepatitis A Alabama 1966              50   321    3345787
2 Hepatitis A Alabama 1967              49   291    3364130
3 Hepatitis A Alabama 1968              52   314    3386068
4 Hepatitis A Alabama 1969              49   380    3412450
5 Hepatitis A Alabama 1970              51   413    3444165
6 Hepatitis A Alabama 1971              51   378    3481798
## Removing the missing data values and rename the data set
Remove_NA <- na.omit(US_ConDis)
US_ConDis2 <- Remove_NA
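Note that head() only previews the first rows, so it cannot confirm on its own whether values are missing. A minimal sketch of one way to count missing values before dropping them is shown here. The data frame `toy` and its values are made up for illustration and are not part of the Tycho data.

```r
# Toy data frame standing in for US_ConDis (illustrative values only)
toy <- data.frame(
  disease    = c("Measles", "Measles", "Mumps", "Mumps"),
  count      = c(10, 25, 3, 8),
  population = c(1000, NA, 2000, 2100)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Drop every row that has at least one missing value
toy_clean <- na.omit(toy)

# Number of rows that were removed
rows_removed <- nrow(toy) - nrow(toy_clean)
print(rows_removed)
```

Running the same two calls on US_ConDis would show which columns contain missing values and how many rows na.omit() discards.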
Performing Regression Analysis
## Regression analysis using year, weeks_reporting, count, and population
model <- lm(count ~ year + weeks_reporting + population, data = US_ConDis2)
summary(model)
Call:
lm(formula = count ~ year + weeks_reporting + population, data = US_ConDis2)
Residuals:
Min 1Q Median 3Q Max
-6778 -1618 -546 541 126513
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.469e+05 3.596e+03 40.84 <2e-16 ***
year -7.472e+01 1.819e+00 -41.09 <2e-16 ***
weeks_reporting 3.775e+01 1.943e+00 19.43 <2e-16 ***
population 1.841e-04 8.135e-06 22.63 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5169 on 18662 degrees of freedom
Multiple R-squared: 0.1277, Adjusted R-squared: 0.1275
F-statistic: 910.5 on 3 and 18662 DF, p-value: < 2.2e-16
For each additional year, the predicted number of reported cases decreases by 74.72, holding all other variables constant.
For each additional week of reporting cases, the predicted number of reported cases increases by 37.75, holding all other variables constant.
For each additional person in the population, the predicted number of reported cases increases by 0.0001841, holding all other variables constant.
P-values:
The p-values for all coefficients are less than 0.001, suggesting that all variables (Year, Weeks Reporting, Population) are statistically significant in explaining the variation in the number of reported cases.
Adjusted R-squared:
The adjusted R-squared value of 0.1275 indicates that approximately 12.75% of the variation in the number of reported cases can be explained by the linear regression model with Year, Weeks Reporting, and Population as predictors.
Summary:
The model suggests that Year, Weeks Reporting, and Population are significant predictors of the number of reported cases, but the model explains only a relatively small portion of the variation in the data (12.75%).
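The quantities interpreted above (slopes, p-values, adjusted R-squared) can also be pulled out of a fitted model programmatically instead of being read off the printed summary. The sketch below uses R's built-in mtcars data as a stand-in, so the variable names `mpg`, `wt`, and `hp` are not from this analysis.

```r
# Fit a small linear model on a built-in data set (stand-in for US_ConDis2)
fit <- lm(mpg ~ wt + hp, data = mtcars)
fit_summary <- summary(fit)

# Slopes only (drop the intercept): change in the response per unit of each predictor
slopes <- coef(fit)[-1]
print(slopes)

# p-value for each coefficient, taken from the coefficient table
p_values <- fit_summary$coefficients[, "Pr(>|t|)"]
print(p_values)

# Share of variation explained, adjusted for the number of predictors
adj_r2 <- fit_summary$adj.r.squared
print(adj_r2)
```

Applied to `model`, the same calls would return the values quoted in the interpretation.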
Looking for the average cases of diseases being reported each year
## Grouping by year and then summarizing and take the mean of count and that will be the average_cases
average_cases <- US_ConDis2 |>
  group_by(year) |>
  summarize(avg_cases = mean(count))

# Making a line plot with average_cases and year
ggplot(data = average_cases, aes(x = year, y = avg_cases)) +
  geom_line(color = "#795695") +
  labs(title = "Average Reported Cases of Diseases Per Year",
       x = "Year",
       y = "Average Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/)") +
  theme_minimal()
Looking for the average cases of diseases being reported each year by disease
## Grouping by year and disease. Summarizing and take the mean of count and that will be the average_cases
average_cases <- US_ConDis2 |>
  group_by(year, disease) |>
  summarise(avg_cases = mean(count))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Making a line plot using average_cases and year and the color of the line will equal to the disease
ggplot(data = average_cases, aes(x = year, y = avg_cases, color = disease)) +
  geom_line() +
  labs(title = "Average Reported Cases of Diseases Per Year By Disease",
       x = "Year",
       y = "Average Number of Reported Cases",
       color = "Disease",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/)") +
  scale_color_manual(values = c("purple", "pink", "orange", "blue", "green", "red", "cyan")) +
  theme_minimal()
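The `summarise()` message printed above is informational: after grouping by two variables, the result stays grouped by the first one. One way to silence it and get an ungrouped result is the `.groups` argument, sketched here on a made-up data frame (`toy` is illustrative, not the Tycho data).

```r
library(dplyr)

# Toy data standing in for US_ConDis2 (illustrative values only)
toy <- data.frame(
  year    = c(1968, 1968, 1968, 1969),
  disease = c("Measles", "Measles", "Mumps", "Mumps"),
  count   = c(10, 20, 5, 7)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
avg <- toy |>
  group_by(year, disease) |>
  summarise(avg_cases = mean(count), .groups = "drop")

print(avg)
print(is_grouped_df(avg))
```

Whether to drop or keep the grouping depends on what comes next; for plotting, an ungrouped result is usually enough.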
Looking for the total cases of diseases being reported each year in Maryland, overall and by disease
# Filtering the data for the state of Maryland
maryland_data <- US_ConDis2 |>
  filter(state == "Maryland")

# Finding the total cases per year in Maryland
total_cases_maryland <- maryland_data |>
  group_by(year) |>
  summarise(total_cases = sum(count))

# Making a line plot for total cases over time in Maryland
p1 <- ggplot(data = total_cases_maryland, aes(x = year, y = total_cases)) +
  geom_line(color = "skyblue") +
  labs(title = "Total Reported Cases of Contagious Diseases in Maryland Per Year",
       x = "Year",
       y = "Total Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/) ") +
  theme_minimal()

# Finding total cases per year in Maryland by disease
total_cases_maryland_DI <- maryland_data |>
  group_by(year, disease) |>
  summarise(total_cases = sum(count))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Making a scatter plot for total cases vs year in Maryland, colored by disease
p2 <- ggplot(data = total_cases_maryland_DI, aes(x = year, y = total_cases, color = disease)) +
  geom_point(alpha = 0.5) +
  labs(title = "Relationship Between Total Number of Cases and Year in Maryland by Disease",
       x = "Year",
       y = "Total Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/) ") +
  theme_minimal()

# Arranging the plots in a grid
grid.arrange(p1, p2, nrow = 2)
Final Visualization
I chose to focus on the data from 1968 onwards because not all diseases were consistently recorded before that year.
The bar plot below shows the total reported cases of contagious diseases in Maryland from 1968 onwards, categorized by disease.
## Filtering the data for Maryland and the years 1968 and onwards
maryland_data_1968_onwards <- US_ConDis |>
  filter(state == "Maryland", year >= 1968)

## Summarizing total cases by disease
total_cases_by_disease_1968_onwards <- maryland_data_1968_onwards |>
  group_by(disease) |>
  summarize(total_cases = sum(count)) |>
  arrange(desc(total_cases)) ## Arrange in descending order to show the most contagious disease first

## Bar plot for total cases by disease from 1968 onwards
ggplot(data = total_cases_by_disease_1968_onwards, aes(x = reorder(disease, total_cases), y = total_cases, fill = disease)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Reported Cases of Contagious Diseases in Maryland from 1968 Onwards by Disease",
       x = "Disease",
       y = "Total Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/) ") +
  scale_fill_manual(values = c("#85aadf", "#fec8d8", "#b491c8", "#666666", "#d291bc", "#9caabd", "#b39eb5")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) ## Rotating x-axis labels for a better readability
I cleaned the data set by first checking for missing values and then removing them with the na.omit() function, to ensure the accuracy of the data set for the analysis. This step was important to avoid a biased result or an incorrect interpretation due to incomplete data.
The visualization represents the total reported cases of contagious diseases in Maryland from 1968 onwards, categorized by disease. The bar plot shows the total number of reported cases for each disease, sorted in descending order to highlight the most prevalent diseases. The colors differentiate between diseases, making it easier to distinguish them.
One surprising finding was that hepatitis A appeared to be more prevalent than measles from 1968 onwards. I had assumed that measles was more contagious, since the first plots I made all pointed to that. This discovery challenged my initial assumption, highlighting the importance of data analysis in uncovering unexpected insights. The visualization provides a clear representation of how the prevalence of different diseases has evolved over time in Maryland.
I wish I could have included more detailed information about the specific trends and changes in the prevalence of each disease over time. Additionally, it would have been better to explore the factors contributing to the prevalence of hepatitis A compared to other diseases in more depth. However, due to limitations in the data set and the scope of the project, I focused on presenting a general overview of the data instead.
*** ChatGPT was used to look for errors and fix them. *** ChatGPT was also used to make suggestions. *** I remember you said we never use read.csv (I think), but I tried to use read_csv and it said it did not exist, so I had to fall back on read.csv.
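A likely cause of the "does not exist" error is that read_csv() comes from the readr package rather than base R, so it is only available once readr is installed and loaded (or called as readr::read_csv). The sketch below assumes readr is installed. The temporary file and its contents are made up for illustration.

```r
# read_csv() is provided by the readr package, not by base R
if (requireNamespace("readr", quietly = TRUE)) {
  # Write a tiny illustrative CSV to a temporary file
  tmp <- tempfile(fileext = ".csv")
  writeLines(c("disease,count", "Measles,10", "Mumps,3"), tmp)

  # Read it back as a tibble; show_col_types = FALSE hides the column-type message
  toy <- readr::read_csv(tmp, show_col_types = FALSE)
  print(toy)

  # Clean up the temporary file
  unlink(tmp)
}
```

In the report itself, this would amount to adding library(readr) next to the other library() calls before using read_csv("us_contagious_diseases.csv").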