For my second project, I chose to work with the us_contagious_diseases.csv data set, which includes six variables (disease, state, year, weeks_reporting, count, and population) with a mix of character and integer data types. I found this data set in our class-shared Google Drive and thought the data was extremely relevant and valuable to know as someone who lives in the United States. It contains information dating back to the early 20th century and could be used to create a fantastic visual. Before creating any visual, I needed to clean the data set and make it usable for a map of the United States showing the change in “count,” the total number of cases of a specific disease, and eventually compare all of the diseases. I started by filtering my original data set, cd_us, to remove any NA or missing values. Following that, I created another data set, hp_us, containing only hepatitis A cases, so I could look deeper into that disease.
Next, the us_states data set, which includes the data for mapping, needed to be cleaned to match the rest of my data sets, which list states with the first letter capitalized. I used str_to_title to fix this; for example, “alabama” became “Alabama.” Furthermore, to join my data sets, I needed to assign a code to each state. I created a data frame called codes that assigns each state a number from 1 to 50, then merged this data frame with us_states, hp_us, and cd_us, which gives each state an identical number in all data sets. Subsequently, I used the left_join function to merge cd_us with us_states and hp_us with us_states so I could create maps that show the change in disease count over time. Lastly, to ensure that the joins did not leave any NA values and to remove the unnecessary column “subregion,” I filtered the merged versions of cd_us and hp_us using !is.na and select(-subregion).
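The numeric codes are one way to link the tables. As a minimal sketch, the same link could also be made directly on the state name once the capitalization matches, and anti_join can list any states that would otherwise turn into NA rows. The data frames cases and shapes below are hypothetical toy stand-ins with made-up values, not the real files.

```r
library(dplyr)

# Hypothetical toy stand-ins for the disease table and the map table.
cases <- data.frame(
  state = c("Alabama", "Alaska"),
  year  = c(1966L, 1966L),
  count = c(321L, 10L)
)
shapes <- data.frame(
  region = c("alabama", "alabama"),
  long   = c(-87.46, -87.48),
  lat    = c(30.39, 30.37)
)

# Match the capitalization used in the disease table.
shapes$region <- stringr::str_to_title(shapes$region)

# States with case data but no polygon rows; a left join would fill these with NA.
unmatched <- anti_join(cases, shapes, by = c("state" = "region"))

# inner_join keeps only states present in both tables, so no NA filter is needed afterwards.
merged <- inner_join(cases, shapes, by = c("state" = "region"))

unmatched
merged
```

Depending on the dplyr version, the join may warn that a row matches multiple rows. That is expected here, because each state has many polygon points in the map table.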
My topic was US contagious diseases, including data on Hepatitis A, Measles, Mumps, Pertussis, Polio, Rubella, and Smallpox. What is a contagious disease? According to Merriam-Webster, a contagious disease is “an infectious disease transmitted by contact with an infected individual or infected bodily discharges or fluids.” Throughout the history of the United States, numerous contagious diseases have swept the nation. Smallpox was one of the first to appear in the United States, coming “to North America in the 1600s” (Robinson, 2020) from European settlers. Polio, another contagious disease that “affects the nervous system, causing paralysis,” came later with major outbreaks “in 1916 and 1952.” However, with advancements in medicine and vaccines, we have been able to slow or eliminate the spread of some contagious diseases. For example, smallpox is “gone from the United States after a large vaccination initiative in 1972” (Robinson, 2020). You can even notice this in the visuals, because no more data is available for some diseases after a specific date.
“Contagious Disease Definition & Meaning.” Merriam-Webster, Merriam-Webster, https://www.merriam-webster.com/dictionary/contagious%20disease.
Robinson, Dana. “The Worst Outbreaks in U.S. History.” Healthline, Healthline Media, 24 Mar. 2020, https://www.healthline.com/health/worst-disease-outbreaks-history.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.0 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.1 ✔ tibble 3.1.8
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tmap)
## Warning: package 'tmap' was built under R version 4.2.3
library(tmaptools)
## Warning: package 'tmaptools' was built under R version 4.2.3
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.2.3
library(sf)
## Warning: package 'sf' was built under R version 4.2.3
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(leaflet.extras)
## Warning: package 'leaflet.extras' was built under R version 4.2.3
library(dplyr)
library(rio)
## Warning: package 'rio' was built under R version 4.2.3
library(sp)
## Warning: package 'sp' was built under R version 4.2.3
library(urbnmapr)
library(gganimate)
## Warning: package 'gganimate' was built under R version 4.2.3
library(gifski)
## Warning: package 'gifski' was built under R version 4.2.3
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:rio':
##
## export
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(stringr)
library(DataExplorer)
library(ggplot2)
library(ggfortify)
setwd("C:/Users/jakea/OneDrive/Desktop/MC 2022/DATA-110")
cd_us <- read.csv("us_contagious_diseases.csv")
head(cd_us)
## disease state year weeks_reporting count population
## 1 Hepatitis A Alabama 1966 50 321 3345787
## 2 Hepatitis A Alabama 1967 49 291 3364130
## 3 Hepatitis A Alabama 1968 52 314 3386068
## 4 Hepatitis A Alabama 1969 49 380 3412450
## 5 Hepatitis A Alabama 1970 51 413 3444165
## 6 Hepatitis A Alabama 1971 51 378 3481798
str(cd_us)
## 'data.frame': 18870 obs. of 6 variables:
## $ disease : chr "Hepatitis A" "Hepatitis A" "Hepatitis A" "Hepatitis A" ...
## $ state : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ year : int 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
## $ weeks_reporting: int 50 49 52 49 51 51 45 45 45 46 ...
## $ count : int 321 291 314 380 413 378 342 467 244 286 ...
## $ population : int 3345787 3364130 3386068 3412450 3444165 3481798 3524543 3571209 3620548 3671246 ...
options(scipen = 999)
cd_us <- cd_us %>%
filter(!is.na(disease) & !is.na(year) & !is.na(weeks_reporting) & !is.na(population))
hp_us <- cd_us %>%
filter(!is.na(disease) & !is.na(year) & !is.na(weeks_reporting) & !is.na(population)) %>%
filter(disease %in% c("Hepatitis A"))
us_states <- map_data("state")
head(us_states)
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 alabama <NA>
## 2 -87.48493 30.37249 1 2 alabama <NA>
## 3 -87.52503 30.37249 1 3 alabama <NA>
## 4 -87.53076 30.33239 1 4 alabama <NA>
## 5 -87.57087 30.32665 1 5 alabama <NA>
## 6 -87.58806 30.32665 1 6 alabama <NA>
us_states$region <- str_to_title(tolower(us_states$region))
head(us_states)
## long lat group order region subregion
## 1 -87.46201 30.38968 1 1 Alabama <NA>
## 2 -87.48493 30.37249 1 2 Alabama <NA>
## 3 -87.52503 30.37249 1 3 Alabama <NA>
## 4 -87.53076 30.33239 1 4 Alabama <NA>
## 5 -87.57087 30.32665 1 5 Alabama <NA>
## 6 -87.58806 30.32665 1 6 Alabama <NA>
codes <- list(state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"), code = c(1:50)) # creates a list of state names and assigns them a value 1-50
codes <- data.frame(codes) # convert the list to a data frame
head(codes) # preview that data set
## state code
## 1 Alabama 1
## 2 Alaska 2
## 3 Arizona 3
## 4 Arkansas 4
## 5 California 5
## 6 Colorado 6
us_states <- left_join(us_states, codes, by = c("region"="state"))
head(us_states)
## long lat group order region subregion code
## 1 -87.46201 30.38968 1 1 Alabama <NA> 1
## 2 -87.48493 30.37249 1 2 Alabama <NA> 1
## 3 -87.52503 30.37249 1 3 Alabama <NA> 1
## 4 -87.53076 30.33239 1 4 Alabama <NA> 1
## 5 -87.57087 30.32665 1 5 Alabama <NA> 1
## 6 -87.58806 30.32665 1 6 Alabama <NA> 1
hp_us <- left_join(hp_us, codes, by = "state")
head(hp_us)
## disease state year weeks_reporting count population code
## 1 Hepatitis A Alabama 1966 50 321 3345787 1
## 2 Hepatitis A Alabama 1967 49 291 3364130 1
## 3 Hepatitis A Alabama 1968 52 314 3386068 1
## 4 Hepatitis A Alabama 1969 49 380 3412450 1
## 5 Hepatitis A Alabama 1970 51 413 3444165 1
## 6 Hepatitis A Alabama 1971 51 378 3481798 1
cd_us <- left_join(cd_us, codes, by = "state")
head(cd_us)
## disease state year weeks_reporting count population code
## 1 Hepatitis A Alabama 1966 50 321 3345787 1
## 2 Hepatitis A Alabama 1967 49 291 3364130 1
## 3 Hepatitis A Alabama 1968 52 314 3386068 1
## 4 Hepatitis A Alabama 1969 49 380 3412450 1
## 5 Hepatitis A Alabama 1970 51 413 3444165 1
## 6 Hepatitis A Alabama 1971 51 378 3481798 1
cd_us_merged <- left_join(cd_us, us_states, by = c("code"="code"))
## Warning in left_join(cd_us, us_states, by = c(code = "code")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 1 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
cd_us_merged <- cd_us_merged %>%
filter(!is.na(disease) & !is.na(year) & !is.na(weeks_reporting) & !is.na(count) & !is.na(population) & !is.na(code) & !is.na(long) & !is.na(lat) & !is.na(order) & !is.na(region))
cd_us_merged <- cd_us_merged %>%
select(-subregion)
hp_us_merged <- left_join(hp_us, us_states, by=c("code"="code"))
## Warning in left_join(hp_us, us_states, by = c(code = "code")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 1 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
## warning.
head(hp_us_merged)
## disease state year weeks_reporting count population code long
## 1 Hepatitis A Alabama 1966 50 321 3345787 1 -87.46201
## 2 Hepatitis A Alabama 1966 50 321 3345787 1 -87.48493
## 3 Hepatitis A Alabama 1966 50 321 3345787 1 -87.52503
## 4 Hepatitis A Alabama 1966 50 321 3345787 1 -87.53076
## 5 Hepatitis A Alabama 1966 50 321 3345787 1 -87.57087
## 6 Hepatitis A Alabama 1966 50 321 3345787 1 -87.58806
## lat group order region subregion
## 1 30.38968 1 1 Alabama <NA>
## 2 30.37249 1 2 Alabama <NA>
## 3 30.37249 1 3 Alabama <NA>
## 4 30.33239 1 4 Alabama <NA>
## 5 30.32665 1 5 Alabama <NA>
## 6 30.32665 1 6 Alabama <NA>
hp_us_merged <- hp_us_merged %>%
select(-subregion)
hp_us_merged <- hp_us_merged %>%
filter(!is.na(disease) & !is.na(state)& !is.na(year) & !is.na(weeks_reporting) & !is.na(count) & !is.na(population) & !is.na(code) & !is.na(long) & !is.na(lat) & !is.na(order) & !is.na(region))
hp_us_plot <- ggplot(hp_us, aes(count, weeks_reporting)) +
geom_point(aes(size = count, color = state), alpha = .5) + # alpha is a fixed value, so it belongs outside aes()
guides(color = "none") # Legend removed because it is too cluttered
hp_us_plot <- ggplotly(hp_us_plot)
hp_us_plot
year <- hp_us$year # Need to define year variable
hp_us_plot <- ggplot(hp_us, aes(count, weeks_reporting)) +
geom_point(aes(size = count, color = state), alpha = .5) + # alpha is a fixed value, so it belongs outside aes()
guides(color = "none") + # Removes legend
labs(title = 'Year: {frame_time}', x = 'Number of Infections', y = 'Weeks Reported') +
transition_time(year) +
ease_aes('linear')
hp_us_plot
ggplot(hp_us_merged, aes(x = state, y = count)) +
geom_boxplot()
ggplot(hp_us_merged, aes(x = state, y = count, color = state)) +
geom_boxplot() +
guides(color = "none") +
labs(title = "Distribution of Hepatitis A Case Counts by State", y = "Count", x = "States") +
coord_flip() +
theme(axis.text.y = element_text(size = 7))
Looking at the distribution of Hepatitis A counts over time, we can see which states experienced the highest and lowest numbers of cases. California has the highest number of cases, and Wyoming has the lowest. Additionally, you can make more detailed observations; for example, in Massachusetts, most of the period 1966-2011 had few cases. However, multiple outliers represent certain years with much higher case counts. The outliers could mean that Massachusetts dealt with a high number of cases at the beginning but was able to contain the spread for most of the period after that, or that there were separate periods of large outbreaks. Michigan was almost the opposite story: its wider distribution suggests it experienced a prolonged period of higher case counts.
ggplot(hp_us_merged, aes(x = state, y = count, color = state)) +
geom_boxplot() +
guides(color = "none") +
labs(title = "Distribution of Hepatitis A Case Counts by State (Year: {frame_time})", y = "Count", x = "States") +
coord_flip() +
theme(axis.text.y = element_text(size = 7)) +
transition_time(year) +
ease_aes('linear')
Furthermore, viewing the box plots with animation allows us to observe overall trends in cases over time rather than a static view. Seeing the transition through time is more visually pleasing and easier to understand as well.
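As a possible alternative to transition_time, plotly can animate a ggplot when a frame aesthetic is supplied, and the resulting widget includes a play button and a year slider. The sketch below uses a hypothetical toy data frame with made-up values, not the real hp_us data, and I have not tried it on the full map.

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data; the values are illustrative only.
toy <- data.frame(
  state = rep(c("Alabama", "California"), each = 3),
  year  = rep(c(1966L, 1967L, 1968L), times = 2),
  count = c(321L, 291L, 314L, 2100L, 1900L, 2300L)
)

# `frame` is not a ggplot2 aesthetic, so ggplot2 warns that it is unknown;
# ggplotly() picks it up and builds one animation frame per year.
p <- ggplot(toy, aes(x = state, y = count)) +
  geom_point(aes(frame = year, ids = state), size = 3) +
  labs(x = "State", y = "Count")

interactive_plot <- ggplotly(p)
```

Printing interactive_plot in RStudio or in a knitted HTML document should show the plot with the play button and slider, and hovering over a point shows its values.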
cl_plot <- ggplot(cd_us, aes(x = year, y = count, color = state))+
guides(color = "none", fill = "none") +
geom_point() +
geom_smooth(method='lm',formula=y~x, color = "red") +
labs(title = "Count of Diseases in Each Year") +
xlab("Year") +
ylab ("Count") +
theme_minimal()
cl_plot <- ggplotly(cl_plot)
cl_plot
Viewing the relationship between count and year, there is an apparent declining trend, which reflects the decreasing number of contagious disease cases in the United States. We know from my background research that this is primarily due to the advances in medicine and vaccines around 1960-1980, which can be observed in the plot above.
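Raw counts are hard to compare between states with very different populations, and not every state reported for all 52 weeks. A minimal sketch of one adjustment is shown here, assuming that a rate per 100,000 people scaled to a full 52-week year is a reasonable measure. The column name rate_per_100k and the toy values are my own and are not part of the original data set.

```r
library(dplyr)

# Hypothetical toy rows with the same columns as cd_us; the values are illustrative only.
toy <- data.frame(
  state           = c("Alabama", "California"),
  year            = c(1966L, 1966L),
  weeks_reporting = c(50L, 52L),
  count           = c(321L, 2000L),
  population      = c(3345787, 18000000)
)

rates <- toy %>%
  filter(weeks_reporting > 0) %>%  # avoid dividing by zero weeks
  mutate(
    # cases per 100,000 people, scaled up to a full 52-week year (an assumed definition)
    rate_per_100k = count / population * 100000 * 52 / weeks_reporting
  )

rates
```

Plotting rate_per_100k instead of count on the y axis would show whether the decline still holds after accounting for population growth.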
cl_plot <- ggplot(cd_us, aes(x = year, y = count, color = state))+
guides(color = "none", fill = "none") +
geom_point() +
geom_smooth(method='lm',formula=y~x, color = "red") +
labs(title = "Count of Diseases in Each Year") +
xlab("Year") +
ylab ("Count") +
theme_minimal() +
facet_wrap(~ state)
cl_plot
Using facet_wrap, we can view the same relationship between count and year for each state individually. We once again see which states had the most cases, but we get a better view of how the count changes over time.
us_map <- ggplot(data = us_states, mapping = aes(x = long, y = lat, group = group, fill = region)) +
geom_polygon() +
guides(fill = "none")
us_map <- ggplotly(us_map)
us_map
ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
geom_polygon()
ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
geom_polygon() +
guides(color = "none") +
labs(title = "Count of Hepatitis A in the United States from 1966-2011") +
theme_classic()
ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
geom_polygon() +
guides(color = "none") +
labs(title = "Count of Hepatitis A in the United States from 1966-2011 (Year: {frame_time})") +
theme_classic() +
transition_time(year) +
ease_aes('linear')
ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
geom_polygon() +
scale_fill_gradient(name = "Disease Count", low = "white", high = "red") +
guides(color = "none") +
labs(title = "Count of Hepatitis A in the United States from 1966-2011 (Year: {frame_time})") +
theme_classic() +
transition_time(year) +
ease_aes('linear')
ggplot(data = cd_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
geom_polygon() +
scale_fill_gradient(name = "Disease Count", low = "white", high = "red") +
guides(color = "none") +
labs(title = "Count of Different Contagious Diseases in the United States (Year: {frame_time})") +
theme_classic() +
transition_time(year) +
ease_aes('linear') +
facet_wrap(~ disease, scales = "free")
My various visualizations throughout the markdown file represent the count of different contagious diseases in the United States. I used the transition_time function to show the relationship over a span of years. There were a few patterns I noticed in the visualizations. I first investigated Hepatitis A cases specifically, starting with a box plot of the distribution of the count in each state. California has the widest distribution of cases from 1966 to 2011 (and the most cases). Notably, Texas, Arizona, and Massachusetts also had a large count distribution over time. Next, to see the trend in disease count over time, I created a scatter plot showing the distribution, with a linear regression added using geom_smooth. We see a clear declining trend in cases of all US contagious diseases, with a more significant decline after 1940-1960, consistent with my research that advancements in medicine and vaccines slowed or eliminated the spread of contagious diseases. Furthermore, using facet_wrap, I viewed all 50 states individually and saw different hubs of contagious disease. These were often states with large cities, such as California, New York, and Texas. Surprisingly, there was no strong correlation between any variables; the highest was between year and diseases_smallpox. Using a map as a visual, I confirmed that these states experienced the highest case counts throughout the years. Lastly, another interesting observation from my faceted map was that measles was by far the most widespread disease in terms of count.

Something I wish I had included was the ability to pause the year counter so I could investigate certain dates, or even a slider that lets you move through time. I would also have liked to add interactivity so that hovering over each state shows the count as time passes. Some of this may be possible using Shiny, but my attempts were unsuccessful. Lastly, I could not figure out a way to create a range specific to each disease count for my last visualization using facet_wrap. How red a state appears is based on a single scale of values from 0 to 12,500 shared across all diseases. However, smallpox was not as widespread as measles, so we can barely see its change in count over time. While this is useful for seeing which disease was the most widespread, it also limits how well we can investigate each disease.
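One possible workaround for the shared color range, which I have not tested on the animated map, is to rescale count within each disease before plotting so that every facet runs from 0 to 1. The sketch below uses a hypothetical toy data frame with made-up values, and the column name count_scaled is my own.

```r
library(dplyr)

# Hypothetical toy data; the values are illustrative only.
toy <- data.frame(
  disease = c("Measles", "Measles", "Smallpox", "Smallpox"),
  state   = c("Alabama", "Texas", "Alabama", "Texas"),
  count   = c(12000, 6000, 40, 10)
)

scaled <- toy %>%
  group_by(disease) %>%
  # Divide by the largest count within each disease; guard against an all-zero group.
  mutate(count_scaled = if (max(count) > 0) count / max(count) else 0) %>%
  ungroup()

scaled
```

Using fill = count_scaled in the map would make the smaller diseases visible, at the cost that colors could no longer be compared between facets.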