Project 1 - Contagious Diseases

Author

Renato Chavez

Published

March 14, 2023

Exploring contagious diseases such as Hepatitis A, Polio, and Smallpox in particular states

This dataset comes from the World Bank dataset of infectious diseases. The dataset has six variables: disease (categorical), state (categorical), year (quantitative), weeks reporting (quantitative), count (quantitative), and population (quantitative). I was most interested in the number of cases of particular diseases in different states, so I cleaned the data by filtering by disease, state, and time period. I used different diseases, states, and time periods, as well as different types of graphs, to represent the information in several ways.

First, we will load the libraries and import the data that we will need for this project.

library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0     ✔ purrr   1.0.1
✔ tibble  3.1.8     ✔ dplyr   1.1.0
✔ tidyr   1.3.0     ✔ stringr 1.5.0
✔ readr   2.1.4     ✔ forcats 1.0.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

I will set the working directory to the folder that contains the .csv file about contagious diseases in the US, and then read the file.

setwd("/Users/renatochavez/Documents/Montgomery College/Spring 2023/DATA110/Datasets")
diseases <- read.csv("us_contagious_diseases.csv")

Hepatitis A in California, Maryland, New York, and Florida from 1966 to 2011.

First, I will clean the data so that it contains only the Hepatitis A records for the given states.

hepatitis <- diseases
hepatitis1 <- filter(hepatitis, disease == "Hepatitis A")
hepatitis2 <- filter(hepatitis1, state == "California" | state == "Maryland" | state == "New York" | state == "Florida")
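
The two filtering steps above could also be written as a single call. The sketch below is only an illustration: it uses a small made-up data frame (`toy`) instead of the real file, and the names `toy`, `selected_states`, and `toy_hepatitis` are hypothetical. It relies on two standard features: `filter()` combines comma-separated conditions with AND, and `%in%` tests membership in a vector.

```r
library(dplyr)

# Toy stand-in for the real data; these rows and counts are made up for illustration.
toy <- data.frame(
  disease = c("Hepatitis A", "Hepatitis A", "Polio", "Hepatitis A"),
  state   = c("California", "Texas", "California", "Maryland"),
  year    = c(1966, 1966, 1966, 1967),
  count   = c(10, 20, 30, 40)
)

# States of interest, kept in one vector so the list is easy to change.
selected_states <- c("California", "Maryland", "New York", "Florida")

# Conditions separated by commas inside filter() are combined with AND.
toy_hepatitis <- filter(toy, disease == "Hepatitis A", state %in% selected_states)
toy_hepatitis
```

With the toy rows, only the California and Maryland Hepatitis A records remain. Keeping the states in a vector means the same pattern can be reused for another disease by changing one line.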

Then, I will plot the filtered data, using points and a smoothed curve to show the count of cases in the given states.

ggplot(hepatitis2, aes(year, count, color = state)) + 
  geom_point(aes(size = count), alpha = 1/2) + 
  ggtitle("Hepatitis A in California, Maryland, New York, and Florida") + 
  xlab("Year") + 
  ylab("Number of cases") + 
  geom_smooth() + 
  scale_size_area() 
`geom_smooth()` using method = 'loess' and formula = 'y ~ x'

Polio in Alabama, Arizona, Texas, and Virginia in the 1940s.

For this second graph, I wanted to be more specific about the time period, so I chose a different set of states and restricted the data to a single decade, the 1940s.

I will start by filtering the data for Polio in the mentioned states, but this time I will also filter by year.

polio <- diseases
polio1 <- filter(polio, disease == "Polio")
polio2 <- filter(polio1, state == "Alabama" | state == "Arizona" | state == "Texas" | state == "Virginia")
polio3 <- filter(polio2, year %in% 1940:1949)  # keep only the years 1940 to 1949

Now, I will create a bar graph that represents the data.

plot1 <- polio3 %>%
  ggplot() + 
  geom_bar(aes(x=year, y=count, fill = state), 
           position = "dodge", stat = "identity") + 
  ggtitle("Polio cases in Alabama, Arizona, Texas, and Virginia") + 
  scale_x_continuous(breaks = c(1940, 1942, 1944, 1946, 1948)) +
  ylab("Number of cases") +
  labs(fill = "States")
plot1

Smallpox in Colorado, Connecticut, and Georgia during the 1930s.

For my final graph, I will analyze the smallpox cases in the 1930s for the states of Colorado, Connecticut, and Georgia. I will start by performing the necessary filtering for the disease, then the states, and then the period of time (in that order).

smallpox <- diseases
smallpox1 <- filter(smallpox, disease == "Smallpox")
smallpox2 <- filter(smallpox1, state == "Colorado" | state == "Connecticut" | state == "Georgia")
smallpox3 <- filter(smallpox2, year %in% 1930:1939)  # keep only the years 1930 to 1939

This time I will use a line graph to track the cases in the given states over the given period.

ggplot(smallpox3, aes(x = year, y = count, color = state)) +
  ggtitle("Smallpox in Colorado, Connecticut, and Georgia in the 1930s") +
  xlab("Year") +
  ylab("Number of cases") + 
  theme_minimal(base_size = 14) +
  scale_x_continuous(breaks = c(1930, 1932, 1934, 1936, 1938)) +
  geom_point() + 
  geom_line() + 
  scale_color_brewer(palette = 'Set2')

What do these visualizations represent?

I see a very positive trend among most diseases in the dataset because the number of cases has been decreasing over time. In the first visualization, one can tell that because California and Florida have large populations, they also have more Hepatitis A cases than states like Maryland or even New York. However, I was not expecting the size of the gap that California created by having many more Hepatitis A cases in 1966. It is also surprising how quickly California reduced its number of cases, while the trend was steadier in states like Maryland and New York. I understand that these last two states have smaller populations, but I was still very surprised by the results.

In the second visualization, which shows Polio cases in Alabama, Arizona, Texas, and Virginia in the 1940s, I noticed some interesting patterns. Texas was clearly the state with the most Polio cases due to its large population, but the states behaved very inconsistently throughout the decade: roughly every two years the number of cases would increase, and after another two years it would decrease. Arizona was perhaps the exception, because its number of cases was very steady throughout the decade. It was definitely very interesting to see the number of Polio cases in the other states go up and down in that decade.

Finally, my third visualization uses data on smallpox in Colorado, Connecticut, and Georgia during the 1930s. The number of smallpox cases in these states was not very high, but Colorado had a difficult start to the decade with almost 600 cases, then dropped to fewer than 100, and finally ended the decade with almost 400 cases. This was very surprising behavior considering that Connecticut and Georgia also had their ups and downs, but in proportion to their populations their trends were very steady.

What could have been included?

I am pleased with the progress I have made in this data class; it is surprising that I could do this project on my own just weeks after starting the class. I know that I will continue to make progress, and there are some things I wish I could have included in this project to make it more sophisticated and attractive. For instance, I would have liked to compare the number of cases in states with different population sizes by using percentages. That way, it would not matter that I was comparing California to Hawaii despite the difference in population, because I could use the number of cases per one hundred people or something similar. In other words, my goal for the end of the semester is to use more variables from this dataset and to make a richer comparison of contagious diseases between states.
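
As a rough idea of what that population-adjusted comparison could look like, here is a minimal sketch. It is not part of the analysis above: the data frame `toy` and its numbers are made up, `cases_per_100k` is a hypothetical column name, and scaling to 100,000 people is just one common choice (the scale factor could equally be 100). Only the column names `state`, `year`, `count`, and `population` come from the dataset.

```r
library(dplyr)

# Toy rows with made-up numbers; only the column names match the dataset's variables.
toy <- data.frame(
  state      = c("California", "Hawaii"),
  year       = c(1970, 1970),
  count      = c(2000, 100),
  population = c(20000000, 500000)
)

# Cases per 100,000 people: divide the count by the population, then rescale.
toy_rates <- toy %>%
  mutate(cases_per_100k = count / population * 100000)
toy_rates
```

In these made-up numbers the larger state has many more cases but the lower rate, which is exactly the kind of difference a raw count hides. The new column could then be mapped to the y axis in place of `count`.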