US Contagious Diseases Data Set

Author

Dajana Ramirez

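The US Contagious Diseases data set contains information about reported cases of various contagious diseases in the United States from 1928 to 2011. The variables in the data set are disease, state, year, weeks_reporting (the number of weeks in that year with reported counts), count (the number of reported cases), and population (the state population).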

In this analysis, I will explore the trends in reported cases of contagious diseases over time in Maryland and examine the relationship between the number of cases reported and the disease being reported from 1968 onward. I will also make a bar plot to see which disease seems to be the most contagious, judged by its total reported cases.

The data set is sourced from the Tycho Project (http://www.tycho.pitt.edu/).

Load the libraries that will be used and read the CSV data set file

## Load the libraries ggplot2, dplyr, and gridExtra
library(ggplot2)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(gridExtra)
Warning: package 'gridExtra' was built under R version 4.3.3

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
## Read the csv data and name it US_ConDis
US_ConDis <- read.csv("us_contagious_diseases.csv")

Check for missing data and, if any values are missing, remove them. (DATA CLEANING)

## Checking the data for missing data values
head(US_ConDis)
      disease   state year weeks_reporting count population
1 Hepatitis A Alabama 1966              50   321    3345787
2 Hepatitis A Alabama 1967              49   291    3364130
3 Hepatitis A Alabama 1968              52   314    3386068
4 Hepatitis A Alabama 1969              49   380    3412450
5 Hepatitis A Alabama 1970              51   413    3444165
6 Hepatitis A Alabama 1971              51   378    3481798
tail(US_ConDis)
       disease   state year weeks_reporting count population
18865 Smallpox Wyoming 1948              24     1     280803
18866 Smallpox Wyoming 1949               0     0     285544
18867 Smallpox Wyoming 1950               1     2     290529
18868 Smallpox Wyoming 1951               1     1     295744
18869 Smallpox Wyoming 1952               1     1     301083
18870 Smallpox Wyoming 1953               0     0     306410
sum(is.na(US_ConDis))
[1] 204
## Remove rows with missing values and store the cleaned data as US_ConDis2
US_ConDis2 <- na.omit(US_ConDis)
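
sum(is.na()) returns one grand total, so it can be useful to see which columns hold the missing values before rows are dropped. The following is a minimal sketch on a made-up data frame (the name toy and all of its values are hypothetical, not taken from the real data set):

```r
# Toy data frame standing in for US_ConDis (all values are made up)
toy <- data.frame(
  disease    = c("Measles", "Measles", "Polio", "Polio"),
  year       = c(1950, 1951, 1950, 1951),
  count      = c(10, 12, 3, 1),
  population = c(NA, 500000, NA, 510000)
)

# Missing values per column, rather than one overall total
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# na.omit() drops every row that contains at least one NA
toy_clean <- na.omit(toy)
print(nrow(toy_clean))
```

Because na.omit() removes whole rows, a missing value in one column also discards that row's values in every other column.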

Performing Regression Analysis

## Regression analysis using year, weeks_reporting, count, and population
model <- lm(count ~ year + weeks_reporting + population, data = US_ConDis2)
summary(model)

Call:
lm(formula = count ~ year + weeks_reporting + population, data = US_ConDis2)

Residuals:
   Min     1Q Median     3Q    Max 
 -6778  -1618   -546    541 126513 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      1.469e+05  3.596e+03   40.84   <2e-16 ***
year            -7.472e+01  1.819e+00  -41.09   <2e-16 ***
weeks_reporting  3.775e+01  1.943e+00   19.43   <2e-16 ***
population       1.841e-04  8.135e-06   22.63   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5169 on 18662 degrees of freedom
Multiple R-squared:  0.1277,    Adjusted R-squared:  0.1275 
F-statistic: 910.5 on 3 and 18662 DF,  p-value: < 2.2e-16

Regression Analysis

The model has the equation:

count = 146900 − (74.72)year + (37.75)weeks_reporting + (0.0001841)population

For each additional year, the predicted number of reported cases decreases by 74.72, holding all other variables constant. For each additional week of reporting, the predicted number of reported cases increases by 37.75, holding all other variables constant. For each additional person in the population, the predicted number of reported cases increases by 0.0001841 (about 1.84 cases per 10,000 people), holding all other variables constant.
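
To make the coefficients concrete, the fitted equation can be evaluated by hand. This is only a sketch: the coefficients are the rounded values from the summary output, predict_count is a hypothetical helper, and the inputs are illustrative rather than rows from the data set.

```r
# Coefficients copied (rounded) from the summary(model) output
b_intercept <- 146900
b_year      <- -74.72
b_weeks     <- 37.75
b_pop       <- 0.0001841

# Hypothetical helper: predicted count from the fitted linear equation
predict_count <- function(year, weeks_reporting, population) {
  b_intercept + b_year * year + b_weeks * weeks_reporting + b_pop * population
}

# Illustrative inputs: 50 reporting weeks and a population of 4 million
pred_1970 <- predict_count(1970, 50, 4e6)
pred_2011 <- predict_count(2011, 50, 4e6)
print(c(pred_1970, pred_2011))
```

With these rounded coefficients the 2011 prediction comes out negative, which a real case count cannot be. That is one sign that a straight-line model is only a rough description of this data.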

P-values:

The p-values for all coefficients are less than 0.001, suggesting that all variables (Year, Weeks Reporting, Population) are statistically significant in explaining the variation in the number of reported cases.

Adjusted R-squared:

The adjusted R-squared value of 0.1275 indicates that approximately 12.75% of the variation in the number of reported cases can be explained by the linear regression model with Year, Weeks Reporting, and Population as predictors.
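
Adjusted R-squared is the multiple R-squared with a penalty for the number of predictors: 1 − (1 − R²)(n − 1)/(n − p − 1). As a quick check, it can be recomputed from the numbers printed in the summary. This sketch uses the rounded R-squared from the output, so it matches the reported value only up to rounding.

```r
r_squared <- 0.1277   # Multiple R-squared from the output (rounded)
p         <- 3        # predictors: year, weeks_reporting, population
df_resid  <- 18662    # residual degrees of freedom from the output
n         <- df_resid + p + 1  # number of rows used in the fit

# n - p - 1 equals the residual degrees of freedom
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / df_resid
print(adj_r_squared)
```

With more than 18,000 rows and only three predictors, the penalty is tiny, which is why the two R-squared values are nearly identical.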

Summary:

The model suggests that Year, Weeks Reporting, and Population are significant predictors of the number of reported cases, but the model explains only a relatively small portion of the variation in the data (12.75%).

Looking for the average cases of diseases being reported each year

## Group by year and take the mean of count; the result is stored as avg_cases
average_cases <- US_ConDis2 |>
  group_by(year) |>
  summarize(avg_cases = mean(count))

## Line plot of avg_cases by year
ggplot(data = average_cases, aes(x = year, y = avg_cases)) +
  geom_line(color = "#795695") +
  labs(title = "Average Reported Cases of Diseases Per Year",
       x = "Year",
       y = "Average Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/)") +
  theme_minimal()
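
To show what the group_by() and summarize() step computes, here is a minimal sketch on a made-up data frame (toy and its values are hypothetical). Each yearly average is taken over every row for that year, so in the real data it is an average across all states and diseases.

```r
library(dplyr)

# Toy data: two states reporting in each of two years (values are made up)
toy <- data.frame(
  year  = c(1950, 1950, 1951, 1951),
  state = c("Maryland", "Ohio", "Maryland", "Ohio"),
  count = c(10, 30, 20, 60)
)

# One row per year, holding the mean of count over all rows in that year
toy_avg <- toy |>
  group_by(year) |>
  summarize(avg_cases = mean(count))
print(toy_avg)
```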

Looking for the average cases of diseases being reported each year by disease

## Group by year and disease, then take the mean of count; the result is stored as avg_cases
average_cases <- US_ConDis2 |>
  group_by(year, disease) |>
  summarise(avg_cases = mean(count))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
## Line plot of avg_cases by year, with one colored line per disease
ggplot(data = average_cases, aes(x = year, y = avg_cases, color = disease)) +
  geom_line() +
  labs(title = "Average Reported Cases of Diseases Per Year By Disease",
       x = "Year",
       y = "Average Number of Reported Cases",
       color = "Disease",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/)") +
  scale_color_manual(values = c("purple", "pink", "orange", "blue", "green", "red", "cyan")) +
  theme_minimal()
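
The summarise() message in the output above is informational: after grouping by two variables, the result stays grouped by the first one. One way to silence it and get an ungrouped result is the .groups argument. This is a minimal sketch on made-up data (toy is hypothetical):

```r
library(dplyr)

# Toy data: two diseases in each of two years (values are made up)
toy <- data.frame(
  year    = c(1950, 1950, 1951, 1951),
  disease = c("Measles", "Polio", "Measles", "Polio"),
  count   = c(10, 2, 20, 4)
)

# .groups = "drop" returns an ungrouped tibble and suppresses the message
toy_avg <- toy |>
  group_by(year, disease) |>
  summarise(avg_cases = mean(count), .groups = "drop")

print(is_grouped_df(toy_avg))
```

Dropping the grouping does not change the plotted values; it only avoids carrying a leftover grouping into later steps.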

Looking for the total cases of diseases being reported each year in Maryland, both overall and by disease

# Filtering the data for the state of Maryland
maryland_data <- US_ConDis2 |>
  filter(state == "Maryland")

# Finding the total cases per year in Maryland
total_cases_maryland <- maryland_data |>
  group_by(year) |>
  summarise(total_cases = sum(count))

# Making a line plot for total cases over time in Maryland
p1 <- ggplot(data = total_cases_maryland, aes(x = year, y = total_cases)) +
  geom_line(color = "skyblue") +
  labs(title = "Total Reported Cases of Contagious Diseases in Maryland Per Year",
       x = "Year",
       y = "Total Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/) ") +
  theme_minimal()

# Finding total cases per year in Maryland by disease
total_cases_maryland_DI <- maryland_data |>
  group_by(year, disease) |>
  summarise(total_cases = sum(count))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# Making a scatter plot for total cases vs year in Maryland, colored by disease
p2 <- ggplot(data = total_cases_maryland_DI, aes(x = year, y = total_cases, color = disease)) +
  geom_point(alpha = 0.5) +
  labs(title = "Relationship Between Total Number of Cases and Year in Maryland by Disease",
       x = "Year",
       y = "Total Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/) ") +
  theme_minimal()

# Arranging the plots in a grid
grid.arrange(p1, p2, nrow = 2)

Final Visualization

I chose to focus on the data from 1968 onwards because not all diseases were consistently recorded before that year.
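
One way to check which years each disease covers is to look at the first and last year recorded for it. The sketch below uses a made-up data frame (toy and its year values are hypothetical, for illustration only); on the real data, the same tapply() calls could be run on US_ConDis2.

```r
# Toy data: first and last recorded year for two diseases (values are made up)
toy <- data.frame(
  disease = c("Measles", "Measles", "Hepatitis A", "Hepatitis A"),
  year    = c(1928, 2002, 1966, 2011)
)

# Earliest and latest year recorded for each disease
first_year <- tapply(toy$year, toy$disease, min)
last_year  <- tapply(toy$year, toy$disease, max)
print(cbind(first_year, last_year))
```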

The bar plot below shows the total reported cases of contagious diseases in Maryland from 1968 onwards, categorized by disease.

## Filtering the cleaned data for Maryland and the years 1968 and onwards
maryland_data_1968_onwards <- US_ConDis2 |>
  filter(state == "Maryland", year >= 1968)

## Summarizing total cases by disease
total_cases_by_disease_1968_onwards <- maryland_data_1968_onwards |>
  group_by(disease) |>
  summarize(total_cases = sum(count)) |>
  arrange(desc(total_cases))  ## Arrange in descending order so the disease with the most reported cases comes first

## Bar plot for total cases by disease from 1968 onwards
ggplot(data = total_cases_by_disease_1968_onwards, aes(x = reorder(disease, total_cases), y = total_cases, fill = disease)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Reported Cases of Contagious Diseases in Maryland from 1968 Onwards by Disease",
       x = "Disease",
       y = "Total Number of Reported Cases",
       caption = "Source: Tycho Project (http://www.tycho.pitt.edu/) ") +
  scale_fill_manual(values = c("#85aadf", "#fec8d8", "#b491c8", "#666666", "#d291bc","#9caabd","#b39eb5")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))  ## Rotating x-axis labels for a better readability

I cleaned the data set by first checking for missing values and then removing them with the na.omit() function, so that the analysis would not be affected by incomplete rows. This step was important to avoid a biased result or an incorrect interpretation due to incomplete data.

The visualization represents the total reported cases of contagious diseases in Maryland from 1968 onwards, categorized by disease. The bar plot shows the total number of reported cases for each disease, with the bars ordered by total so that the most prevalent diseases stand out. The colors differentiate between diseases, making it easier to distinguish them.
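
The left-to-right order of the bars follows the factor levels that reorder() produces, and reorder() sorts from smallest to largest by default. Negating the sorting variable flips the order. This is a minimal sketch with made-up totals (the data frame totals is hypothetical):

```r
# Toy totals (values are made up)
totals <- data.frame(
  disease     = c("Measles", "Hepatitis A", "Mumps"),
  total_cases = c(500, 900, 300)
)

# Default: levels run from the smallest total to the largest
ascending <- levels(reorder(totals$disease, totals$total_cases))

# Negating the totals puts the largest total first
descending <- levels(reorder(totals$disease, -totals$total_cases))

print(ascending)
print(descending)
```

In the bar plot, using reorder(disease, -total_cases) in aes() would therefore place the disease with the most reported cases on the left.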

One surprising finding was that hepatitis A appeared to be more prevalent than measles from 1968 onwards. I had assumed that measles would be more contagious, since the first plots I made all pointed to that. This discovery challenged my initial assumption and highlights the importance of data analysis in uncovering unexpected insights. The visualization provides a clear comparison of how prevalent the different diseases were in Maryland over this period.

I wish I could have included more detailed information about the specific trends and changes in the prevalence of each disease over time. Additionally, it would have been better to explore the factors contributing to the prevalence of hepatitis A compared to other diseases in more depth. However, due to limitations in the data set and the scope of the project, I focused on presenting a general overview of the data instead.

*** ChatGPT was used to look for errors and fix them. *** ChatGPT was also used to make suggestions. *** I remember you said we never use read.csv (I think), but I tried to use read_csv, and it said it did not exist, so I had to fall back on read.csv.
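
A likely cause of the read_csv error is that read_csv() belongs to the readr package, which was not among the loaded libraries. The following is a minimal sketch, assuming readr is installed; it writes a small made-up CSV to a temporary file so that it runs on its own.

```r
library(readr)  # read_csv() is defined in readr, not in base R

# Write a tiny made-up CSV to a temporary file
tmp <- tempfile(fileext = ".csv")
writeLines(c("disease,state,year,count",
             "Measles,Maryland,1968,10",
             "Mumps,Maryland,1968,4"), tmp)

# show_col_types = FALSE hides the column specification message
toy <- read_csv(tmp, show_col_types = FALSE)
print(toy)
```

For the real file, the equivalent call would be read_csv("us_contagious_diseases.csv") after library(readr) has been run.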