## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)
## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Data Description

This dataset records reported cases of several contagious diseases in the US between 1928 and 2011. It covers every US state, with the District of Columbia counted as a state. It also includes each state's population for each year, although population is missing for some states in some years. For example, Hawaii did not become a state until 1959, so there is no census data for it until after 1959. For each disease, state, and year, the dataset gives the count of reported cases and the number of weeks in which counts were reported.

Data Source Description

According to the dataset's help file, the data are courtesy of the Tycho Project. The Tycho Project includes many other open datasets on different conditions and diseases across the country over different years. This particular dataset covers the following diseases: Hepatitis A, Rubella, Mumps, Pertussis, Measles, Polio, and Smallpox for US states in the years 1928-2011.

Summary table:

| Variable | Description |
|---|---|
| disease | Name of the disease |
| state | Name of the US state (51 including DC) |
| year | Year, from 1928 to 2011 |
| weeks_reporting | Number of weeks reported per year |
| count | Total count of reported cases |
| population | State population, interpolated for non-census years |
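
The variables in the summary table can be checked directly in R. The following is a minimal sketch, assuming the dataset is the `us_contagious_diseases` object shipped with the `dslabs` package (the text does not name the package, so treat that as an assumption) and that `dslabs` is already installed.

```r
# Assumption: us_contagious_diseases comes from the dslabs package.
library(dslabs)
data(us_contagious_diseases)

# Column names and types, to compare against the summary table.
str(us_contagious_diseases)

# Distinct diseases, number of states (DC included), and the span of years.
levels(us_contagious_diseases$disease)
nlevels(us_contagious_diseases$state)
range(us_contagious_diseases$year)

# Number of rows with no population value (states lacking census data).
sum(is.na(us_contagious_diseases$population))
```

Running this before plotting is a quick way to confirm which years and states have missing population values.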

Data visualizations

install.packages("ggplot2")
library(ggplot2)

Plot One

# Scatterplot of count vs. population for the years 1967-1987
# Keep only rows with at most 2000 reported cases
diseases_1967_1987 <- us_contagious_diseases %>%
  filter(year >= 1967, year <= 1987, count <= 2000)
diseases_1967_1987 %>%
  ggplot(aes(x = population, y = count)) +
  geom_point(aes(color = as.factor(disease))) +
  theme(plot.subtitle = element_text(vjust = 1),
        plot.caption = element_text(vjust = 1)) +
  labs(
    x = "Population",
    y = "Count",
    color = "Type of Disease",
    title = "Reported Counts of Each Disease Per Population In The Years 1967-1987"
  )
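
One possible way to ease the crowding in this scatterplot is to draw one panel per disease instead of coloring every disease in a single panel. The sketch below is only an illustration, not part of the original analysis: it assumes the data come from the `dslabs` package, and the name `window_1967_1987` and the plot title are made up for this example.

```r
library(dplyr)
library(ggplot2)
library(dslabs)  # assumption: us_contagious_diseases comes from this package
data(us_contagious_diseases)

# Same year window and count cutoff as Plot One; rows without a
# population value are dropped explicitly so ggplot2 does not warn.
window_1967_1987 <- us_contagious_diseases %>%
  filter(year >= 1967, year <= 1987, count <= 2000, !is.na(population))

# One panel per disease; alpha makes overlapping points easier to read.
p <- ggplot(window_1967_1987, aes(x = population, y = count)) +
  geom_point(alpha = 0.4) +
  facet_wrap(~ disease) +
  labs(
    x = "Population",
    y = "Count",
    title = "Reported Counts by Population, 1967-1987, One Panel per Disease"
  )
p
```

Faceting trades the single-glance comparison across diseases for a clearer view of each disease's own trend.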

Description of Plot One

The plot is crowded because it draws one point for every combination of disease, year, and state (50 states plus DC). It shows the reported count of each disease against state population. I put population on the x-axis rather than year because plotting by year would stack the points into straight vertical lines, one per year, with little visible gradient. To make the data more concentrated, I chose the 20-year span 1967-1987 and limited the counts to 2000 or fewer to keep the graph in reasonable proportion. I also chose this time frame so readers can see that no Smallpox counts appear in it.

As the thinning points suggest, the general trend is that reported counts decrease as population increases. This could be because population growth tracks the passage of time: medical advances over the years reduce illness while the population keeps growing. I also color-coded the points by disease in case someone wants to follow the trend of one disease at a time. For example, rubella is highly concentrated at lower populations, and its counts drop sharply once the population passes 1e+07.

Plot Two

# Bar chart of the mean reported count per disease across all states and years
us_contagious_diseases %>%
  group_by(disease) %>%
  summarise(meanCount = mean(count)) %>%
  ggplot(aes(x = as.factor(disease), y = meanCount, fill = as.factor(disease))) +
  geom_bar(stat = "identity") +
  labs(x = "Type of Disease",
       y = "Mean Count",
       title = "Plot of the Mean Count of Each Disease Over The Years",
       fill = "Coloring Per Disease") +
  theme(plot.subtitle = element_text(vjust = 1),
        plot.caption = element_text(vjust = 1),
        axis.ticks = element_line(colour = "gray2"),
        axis.text = element_text(colour = "gray6"))

Description of Plot Two

This graph shows the mean reported count for each disease, averaged over all states and years. I used the mean so the data across the years can be read at a glance: rather than one value per year (1928-2011), each bar shows a single average. Each disease is color-coded as shown in the Coloring Per Disease legend. On average, measles has clearly been reported more over the years than any other disease. I hope the reader can see the stark difference between the average reported count of measles and those of the 6 other diseases.
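
The bar heights can also be read as numbers. The sketch below is a hedged illustration that reproduces the per-disease mean as a table and adds the row count and year range behind each mean. It assumes the data come from the `dslabs` package; the names `mean_counts`, `n_rows`, `first_year`, and `last_year` are made up for this example.

```r
library(dplyr)
library(dslabs)  # assumption: us_contagious_diseases comes from this package
data(us_contagious_diseases)

mean_counts <- us_contagious_diseases %>%
  group_by(disease) %>%
  summarise(
    meanCount  = mean(count, na.rm = TRUE),  # same statistic as the bar chart
    n_rows     = n(),                        # state-year rows behind the mean
    first_year = min(year),
    last_year  = max(year)
  ) %>%
  arrange(desc(meanCount))                   # largest mean first

mean_counts
```

Listing the year range next to each mean helps show whether the diseases were recorded over the same period, which matters when comparing their averages.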