a. The topic of the data, any variables included, what kind of variables they are, where the data came from, and how you cleaned it up (be detailed and specific, using proper terminology where appropriate).

For my second project, I chose to work on the us_contagious_diseases.csv data set, which includes six variables: disease, state, year, weeks_reporting, count, and population, a mix of character and integer data types. I found this data set in our class-shared Google Drive and thought the data was extremely relevant and valuable to know as someone who lives in the United States. It contains information dating back to the early 20th century and could be used to create a fantastic visual. Before creating any visual, I needed to clean the data set and make it usable for a map of the United States showing the change in “count,” the total number of cases of a specific disease, and eventually compare all of the diseases. I started by filtering my original data set, cd_us, for any NA or missing values. Following that, I created another data set, hp_us, for just hepatitis A cases, so I could look deeper into that disease.

Next, the us_states data set, which includes the data for mapping, needed to be cleaned to match the rest of my data sets, which list states with the first letter capitalized. I used str_to_title to fix this; for example, “alabama” became “Alabama.” Furthermore, to join my data sets, I needed to assign a code to each state. I created a data frame called codes that assigns each state a number from 1 to 50, then merged this data frame with us_states, hp_us, and cd_us, which gives each state an identical number in all data sets. Subsequently, I used the left_join function to merge cd_us with us_states and hp_us with us_states so I could create maps that show the change in disease count over time. Lastly, to ensure that the joins did not introduce any NA values and to remove the unnecessary column subregion, I filtered the merged versions of cd_us and hp_us using !is.na and select(-subregion).

b. Incorporate brief background research about this topic.

My topic was US contagious diseases, including data on Hepatitis A, Measles, Mumps, Pertussis, Polio, Rubella, and Smallpox. What is a contagious disease? According to Merriam-Webster, a contagious disease is “an infectious disease transmitted by contact with an infected individual or infected bodily discharges or fluids.” Throughout the history of the United States, numerous contagious diseases have swept the nation. Smallpox was one of the first to appear, coming “to North America in the 1600s” (Robinson, 2020) with European settlers. Polio, another contagious disease that “affects the nervous system, causing paralysis,” came later with major outbreaks “in 1916 and 1952.” However, with advancements in medicine and vaccines, we have been able to slow or eliminate the spread of some contagious diseases. For example, smallpox is “gone from the United States after a large vaccination initiative in 1972” (Robinson, 2020). You can even notice this in the visuals, because no more data is available for some diseases after a specific date.

Bibliography

“Contagious Disease Definition & Meaning.” Merriam-Webster, Merriam-Webster, https://www.merriam-webster.com/dictionary/contagious%20disease.

Robinson, Dana. “The Worst Outbreaks in U.S. History.” Healthline, Healthline Media, 24 Mar. 2020, https://www.healthline.com/health/worst-disease-outbreaks-history.

Load libraries

library(tidyverse) 
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.0     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.1     ✔ tibble    3.1.8
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tmap) 
## Warning: package 'tmap' was built under R version 4.2.3
library(tmaptools) 
## Warning: package 'tmaptools' was built under R version 4.2.3
library(leaflet) 
## Warning: package 'leaflet' was built under R version 4.2.3
library(sf) 
## Warning: package 'sf' was built under R version 4.2.3
## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE
library(leaflet.extras) 
## Warning: package 'leaflet.extras' was built under R version 4.2.3
library(dplyr) 
library(rio) 
## Warning: package 'rio' was built under R version 4.2.3
library(sp)
## Warning: package 'sp' was built under R version 4.2.3
library(urbnmapr)
library(gganimate)
## Warning: package 'gganimate' was built under R version 4.2.3
library(gifski)
## Warning: package 'gifski' was built under R version 4.2.3
library(plotly)
## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:rio':
## 
##     export
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.2.3
library(stringr)
library(DataExplorer)
library(ggplot2)
library(ggfortify)

Set working directory, load dataset, and preview data set

setwd("C:/Users/jakea/OneDrive/Desktop/MC 2022/DATA-110")
cd_us <- read.csv("us_contagious_diseases.csv")
head(cd_us)
##       disease   state year weeks_reporting count population
## 1 Hepatitis A Alabama 1966              50   321    3345787
## 2 Hepatitis A Alabama 1967              49   291    3364130
## 3 Hepatitis A Alabama 1968              52   314    3386068
## 4 Hepatitis A Alabama 1969              49   380    3412450
## 5 Hepatitis A Alabama 1970              51   413    3444165
## 6 Hepatitis A Alabama 1971              51   378    3481798
str(cd_us)
## 'data.frame':    18870 obs. of  6 variables:
##  $ disease        : chr  "Hepatitis A" "Hepatitis A" "Hepatitis A" "Hepatitis A" ...
##  $ state          : chr  "Alabama" "Alabama" "Alabama" "Alabama" ...
##  $ year           : int  1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 ...
##  $ weeks_reporting: int  50 49 52 49 51 51 45 45 45 46 ...
##  $ count          : int  321 291 314 380 413 378 342 467 244 286 ...
##  $ population     : int  3345787 3364130 3386068 3412450 3444165 3481798 3524543 3571209 3620548 3671246 ...

Adjust for scientific notation

options(scipen = 999)

Clean data for missing values in the original dataset for future use

cd_us <- cd_us %>%
  filter(!is.na(disease) & !is.na(year) & !is.na(weeks_reporting) &  !is.na(population))
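
The chained !is.na() tests above can also be written with tidyr::drop_na(). The following is only a sketch on a made-up data frame (toy is hypothetical and just reuses the column names of cd_us); it shows that both approaches keep the same rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same column names as cd_us
toy <- data.frame(
  disease = c("Measles", "Measles", NA, "Polio"),
  year = c(1950L, 1951L, 1952L, NA),
  weeks_reporting = c(50L, NA, 48L, 52L),
  population = c(1000L, 1100L, 1200L, 1300L)
)

# Original style: one !is.na() test per column
by_filter <- toy %>%
  filter(!is.na(disease) & !is.na(year) & !is.na(weeks_reporting) & !is.na(population))

# Alternative: drop rows that have an NA in any of the listed columns
by_drop_na <- toy %>%
  drop_na(disease, year, weeks_reporting, population)

nrow(by_filter)  # only the first row is complete
nrow(by_drop_na) # same rows kept
```

drop_na() with no column names would check every column, which may or may not be what is wanted; listing the columns keeps the behavior the same as the filter above.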

Create data set for just hepatitis A and clean for missing values

hp_us <- cd_us %>%
  filter(!is.na(disease) & !is.na(year) & !is.na(weeks_reporting) &  !is.na(population)) %>%
  filter(disease %in% c("Hepatitis A"))

Data needed for creating the United States map

us_states <- map_data("state")
head(us_states)
##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 alabama      <NA>
## 2 -87.48493 30.37249     1     2 alabama      <NA>
## 3 -87.52503 30.37249     1     3 alabama      <NA>
## 4 -87.53076 30.33239     1     4 alabama      <NA>
## 5 -87.57087 30.32665     1     5 alabama      <NA>
## 6 -87.58806 30.32665     1     6 alabama      <NA>

Change state names from lowercase to match the rest of the data sets in us_states

us_states$region <- str_to_title(tolower(us_states$region))
head(us_states)
##        long      lat group order  region subregion
## 1 -87.46201 30.38968     1     1 Alabama      <NA>
## 2 -87.48493 30.37249     1     2 Alabama      <NA>
## 3 -87.52503 30.37249     1     3 Alabama      <NA>
## 4 -87.53076 30.33239     1     4 Alabama      <NA>
## 5 -87.57087 30.32665     1     5 Alabama      <NA>
## 6 -87.58806 30.32665     1     6 Alabama      <NA>

Create a list of state names and codes in order to merge data sets

codes <- list(state = c("Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"), code = c(1:50)) # creates a list of state names and assigns them a value 1-50

codes <- data.frame(codes) # convert the list to a data frame
head(codes) # preview that data set
##        state code
## 1    Alabama    1
## 2     Alaska    2
## 3    Arizona    3
## 4   Arkansas    4
## 5 California    5
## 6   Colorado    6

Merge us_states and codes dataset to merge data correctly

us_states <- left_join(us_states, codes, by = c("region"="state"))
head(us_states)
##        long      lat group order  region subregion code
## 1 -87.46201 30.38968     1     1 Alabama      <NA>    1
## 2 -87.48493 30.37249     1     2 Alabama      <NA>    1
## 3 -87.52503 30.37249     1     3 Alabama      <NA>    1
## 4 -87.53076 30.33239     1     4 Alabama      <NA>    1
## 5 -87.57087 30.32665     1     5 Alabama      <NA>    1
## 6 -87.58806 30.32665     1     6 Alabama      <NA>    1
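
A quick way to check whether every region received a code is an anti_join, which returns the rows of the first table that have no match in the second. This is a sketch on hypothetical miniature tables (outline and codes_small are made up for illustration), not the project's actual data.

```r
library(dplyr)
library(stringr)

# Hypothetical miniature outline table; region names start in lowercase
outline <- data.frame(
  region = c("alabama", "alabama", "district of columbia"),
  long = c(-87.46, -87.48, -77.03),
  lat = c(30.38, 30.37, 38.90)
)
outline$region <- str_to_title(outline$region)

# Hypothetical lookup table that only covers one of the regions
codes_small <- data.frame(state = c("Alabama", "Alaska"), code = 1:2)

# Rows of outline whose region has no matching state in codes_small
unmatched <- anti_join(outline, codes_small, by = c("region" = "state"))
unique(unmatched$region)
```

Any region listed in unmatched would end up with an NA code after the left_join, so it would be removed by a later !is.na(code) filter.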

Merge the codes data frame with hp_us to assign each state a code before the next merge

hp_us <- left_join(hp_us, codes, by = "state")
head(hp_us)
##       disease   state year weeks_reporting count population code
## 1 Hepatitis A Alabama 1966              50   321    3345787    1
## 2 Hepatitis A Alabama 1967              49   291    3364130    1
## 3 Hepatitis A Alabama 1968              52   314    3386068    1
## 4 Hepatitis A Alabama 1969              49   380    3412450    1
## 5 Hepatitis A Alabama 1970              51   413    3444165    1
## 6 Hepatitis A Alabama 1971              51   378    3481798    1

Merge the original cd_us data set with the codes data frame for future use

cd_us <- left_join(cd_us, codes, by = "state")
head(cd_us)
##       disease   state year weeks_reporting count population code
## 1 Hepatitis A Alabama 1966              50   321    3345787    1
## 2 Hepatitis A Alabama 1967              49   291    3364130    1
## 3 Hepatitis A Alabama 1968              52   314    3386068    1
## 4 Hepatitis A Alabama 1969              49   380    3412450    1
## 5 Hepatitis A Alabama 1970              51   413    3444165    1
## 6 Hepatitis A Alabama 1971              51   378    3481798    1

Merge the original data set containing all diseases with us_states so the diseases can be compared later

cd_us_merged <- left_join(cd_us, us_states, by = c("code"="code"))
## Warning in left_join(cd_us, us_states, by = c(code = "code")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 1 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
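
The warning appears because each state code occurs once per disease and year in cd_us but many times in us_states (once per outline point), so every case row is repeated for every outline point of its state. A small sketch with hypothetical tables (cases and outline are made up for illustration) shows how the row count grows:

```r
library(dplyr)

# Hypothetical miniature versions of cd_us and us_states
cases <- data.frame(
  code = c(1L, 1L, 2L),
  year = c(1966L, 1967L, 1966L),
  count = c(321L, 291L, 10L)
)
outline <- data.frame(
  code = c(1L, 1L, 1L, 2L, 2L),
  long = c(-87.4, -87.5, -87.6, -150.1, -150.2),
  lat = c(30.3, 30.4, 30.5, 61.0, 61.1)
)

# suppressWarnings() only hides dplyr's note about multiple matches
merged <- suppressWarnings(left_join(cases, outline, by = "code"))

# Two case rows for code 1 times three outline points, plus
# one case row for code 2 times two outline points
nrow(merged) # 2 * 3 + 1 * 2 = 8 rows
```

For a map this duplication is expected, since each polygon point needs the count attached to it.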

Filter the merged data set for NA values as well

cd_us_merged <- cd_us_merged %>%
  filter(!is.na(disease) & !is.na(year) & !is.na(weeks_reporting) & !is.na(count) & !is.na(population) & !is.na(code) & !is.na(long) & !is.na(lat) & !is.na(order) & !is.na(region))

Remove the unnecessary column subregion in the data set

cd_us_merged <- cd_us_merged %>%
  select(-subregion)

Combine the data sets to create a map that shows the change in hepatitis A count

hp_us_merged <- left_join(hp_us, us_states, by=c("code"="code"))
## Warning in left_join(hp_us, us_states, by = c(code = "code")): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 1 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
head(hp_us_merged)
##       disease   state year weeks_reporting count population code      long
## 1 Hepatitis A Alabama 1966              50   321    3345787    1 -87.46201
## 2 Hepatitis A Alabama 1966              50   321    3345787    1 -87.48493
## 3 Hepatitis A Alabama 1966              50   321    3345787    1 -87.52503
## 4 Hepatitis A Alabama 1966              50   321    3345787    1 -87.53076
## 5 Hepatitis A Alabama 1966              50   321    3345787    1 -87.57087
## 6 Hepatitis A Alabama 1966              50   321    3345787    1 -87.58806
##        lat group order  region subregion
## 1 30.38968     1     1 Alabama      <NA>
## 2 30.37249     1     2 Alabama      <NA>
## 3 30.37249     1     3 Alabama      <NA>
## 4 30.33239     1     4 Alabama      <NA>
## 5 30.32665     1     5 Alabama      <NA>
## 6 30.32665     1     6 Alabama      <NA>

Remove the unnecessary column subregion in the data set

hp_us_merged <- hp_us_merged %>%
  select(-subregion)

Check all columns for NA/missing values

hp_us_merged <- hp_us_merged %>%
  filter(!is.na(disease) & !is.na(state)& !is.na(year) & !is.na(weeks_reporting) & !is.na(count) & !is.na(population) & !is.na(code) & !is.na(long) & !is.na(lat) & !is.na(order) & !is.na(region))

Basic start to an animated scatterplot looking at hepatitis A count across the United States

hp_us_plot <- ggplot(hp_us, aes(count, weeks_reporting)) +
  geom_point(aes(size = count, color = state), alpha = .5) + # constant alpha belongs outside aes()
  guides(color = "none") # Legend removed because it was too cluttered
hp_us_plot <- ggplotly(hp_us_plot)
hp_us_plot

Add animation using transition_time(year) and ease_aes('linear') to view the change in hepatitis A count from 1966-2011

year <- hp_us$year # Need to define year variable
hp_us_plot <- ggplot(hp_us, aes(count, weeks_reporting)) +
  geom_point(aes(size = count, color = state), alpha = .5) + # constant alpha belongs outside aes()
  guides(color = "none") + # Removes legend
  labs(title = 'Year: {frame_time}', x = 'Number of Infections', y = 'Weeks Reported') +
  transition_time(year) +
  ease_aes('linear') 
hp_us_plot 
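
As an alternative to transition_time, plotly can animate a ggplot through a frame aesthetic, which produces a play/pause button and a year slider. The sketch below uses a hypothetical toy data frame rather than hp_us, and it assumes the installed plotly version supports the frame and ids aesthetics in ggplotly (ggplot2 itself warns that it ignores them, which is expected).

```r
library(ggplot2)
library(plotly)

# Hypothetical toy data with the same column names as hp_us
toy <- data.frame(
  state = rep(c("Alabama", "Alaska"), each = 3),
  year = rep(1966:1968, times = 2),
  count = c(321, 291, 314, 12, 15, 9),
  weeks_reporting = c(50, 49, 52, 40, 42, 41)
)

# frame = one animation step per year; ids = which points to link between frames
p <- suppressWarnings(
  ggplot(toy, aes(count, weeks_reporting, color = state)) +
    geom_point(aes(frame = year, ids = state))
)

# Converting to plotly adds the slider and play button
widget <- ggplotly(p)
```

Printing widget in an R Markdown document should render the interactive plot; hovering over a point shows its values for the year currently selected.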

Create a box plot to show the distribution of hepatitis A cases in each state of the United States

ggplot(hp_us_merged, aes(x = state, y = count)) +
  geom_boxplot() 

Make it readable by flipping the coordinates, adding a title, and making the text smaller

ggplot(hp_us_merged, aes(x = state, y = count, color = state)) +
  geom_boxplot() +
  guides(color = "none") +
  labs(title = "Distribution of Hepatitis A Case Counts by State", y = "Count", x = "States") +
  coord_flip() +
  theme(axis.text.y = element_text(size = 7))

Looking at the distribution of Hepatitis A count over time, we can see which states experienced the highest and lowest numbers of cases. California has the highest number of cases, and Wyoming has the lowest. Additionally, you can make more detailed observations; for example, Massachusetts had few cases for the majority of the period 1966-2011. However, multiple outliers represent certain years with much higher case counts. The outliers suggest that Massachusetts possibly dealt with a high number of cases at the beginning but was able to contain the spread after that, or that there were periods of large outbreaks. Michigan was almost the opposite story: its wider distribution indicates that it experienced a prolonged period of higher case counts.

Add animation to view data over time

ggplot(hp_us_merged, aes(x = state, y = count, color = state)) +
  geom_boxplot() +
  guides(color = "none") +
  labs(title = "Distribution of Hepatitis A Case Counts by State (Year: {frame_time})", y = "Count", x = "States") +
  coord_flip() +
  theme(axis.text.y = element_text(size = 7)) +
  transition_time(year) +
  ease_aes('linear') 

Furthermore, viewing the box plots with animation allows us to observe overall trends in cases over time rather than a static view. Seeing the transition through time is more visually pleasing and easier to understand as well.

Look at the correlation between year and count, and make it interactive to see each state's name

cl_plot <- ggplot(cd_us, aes(x = year, y = count, color = state))+
  guides(color = "none", fill = "none") +
  geom_point() +
  geom_smooth(method='lm',formula=y~x, color = "red") +
  labs(title = "Count of Diseases in Each Year") +
  xlab("Year") +
  ylab ("Count") +
  theme_minimal() 
cl_plot <- ggplotly(cl_plot)
cl_plot

Viewing the correlation between count and year, there is an apparent declining trend in the number of contagious disease cases in the United States. We know from my background research that this is primarily due to advances in medicine and vaccines around 1960-1980, which can be observed in the plot above.

Use facet_wrap to view the same correlation but for each state individually

cl_plot <- ggplot(cd_us, aes(x = year, y = count, color = state))+
  guides(color = "none", fill = "none") +
  geom_point() +
  geom_smooth(method='lm',formula=y~x, color = "red") +
  labs(title = "Count of Diseases in Each Year") +
  xlab("Year") +
  ylab ("Count") +
  theme_minimal() +
  facet_wrap(~ state)
cl_plot

Using facet_wrap, we can view the same correlation between count and year for each state individually. We once again see which states had the most cases, but with a better view of how the count changes over time.

Create a United States Map outline for new visual using the data = us_states with interactivity

us_map <- ggplot(data = us_states, mapping = aes(x = long, y = lat, group = group, fill = region))  + 
  geom_polygon() +
  guides(fill = "none")
us_map <- ggplotly(us_map)
us_map

Use the merged data set to create a basic map that shows hepatitis A cases in the United States

ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
  geom_polygon() 

Fix legend and add title with classic theme

ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
  geom_polygon() +
  guides(color = "none") +
  labs(title = "Count of Hepatitis A in the United States from 1966-2011") +
  theme_classic()

Add animation by using transition_time(year) and ease_aes('linear') to show the change from 1966-2011

ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
  geom_polygon() +
  guides(color = "none") +
  labs(title = "Count of Hepatitis A in the United States from 1966-2011 (Year: {frame_time})") +
  theme_classic() +
  transition_time(year) +
  ease_aes('linear') 

Different view using scale_fill_gradient, which may make it easier to see the changes in each state

ggplot(data = hp_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
  geom_polygon() +
  scale_fill_gradient(name = "Disease Count", low = "white", high = "red") +
  guides(color = "none") +
  labs(title = "Count of Hepatitis A in the United States from 1966-2011 (Year: {frame_time})") +
  theme_classic() +
  transition_time(year) +
  ease_aes('linear')

Use facet_wrap(~ disease) to compare all the diseases in the United States

ggplot(data = cd_us_merged, mapping = aes(x = long, y = lat, group = code, fill = count, color = region)) +
  geom_polygon() +
  scale_fill_gradient(name = "Disease Count", low = "white", high = "red") +
  guides(color = "none") +
  labs(title = "Count of Different Contagious Diseases in the United States (Year: {frame_time})") +
  theme_classic() +
  transition_time(year) +
  ease_aes('linear') +
  facet_wrap(~ disease, scales = "free")

c. What the visualization represents, any interesting patterns or surprises that arise within the visualization, and anything that could have been shown that you could not get to work or that you wished you could have included.

My visualizations throughout the markdown file represent the count of different contagious diseases in the United States. I used the transition_time function to show how the counts change over a span of years. I noticed a few patterns in the visualizations. I first investigated Hepatitis A cases specifically, starting with a box plot of the distribution of the count in each state. California has the widest distribution of cases from 1966 to 2011 (and the most cases). Notably, Texas, Arizona, and Massachusetts also had large count distributions over time. Next, to see the trend in disease count over time, I created a scatter plot with a linear regression line drawn using geom_smooth. We see a clear declining trend in cases of all US contagious diseases, with a more pronounced decline after 1940-1960, consistent with my research that advancements in medicine and vaccines slowed or eliminated the spread of contagious diseases. Furthermore, using facet_wrap, I viewed all 50 states individually and saw different hubs of contagious disease. These were often states with large cities, such as California, New York, and Texas. Surprisingly, there was no strong correlation between any variables; the highest was between year and diseases_smallpox. Using a map as a visual, I confirmed that these states experienced the highest case counts throughout the years. Lastly, another interesting observation from my faceted map was that measles was by far the most widespread disease in terms of count.

Something I wish I had included was the ability to stop or pause the year counter so I could investigate certain dates, or even a slider that lets you move through time. I also wanted to add interactivity so that hovering over each state shows the count as time passes. Some of this might have been possible using Shiny, but I was unsuccessful. Lastly, I could not figure out a way to create a color range specific to each disease's count for my last visualization using facet_wrap. How red a state appears is based on a single scale of 0-12500 shared by all diseases. Because smallpox was not as widespread as measles, we can barely see its change in count over time. While the shared scale is useful for seeing which disease was the most widespread, it also limits our ability to investigate each disease.
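
One possible way to get a color range specific to each disease is to rescale count within each disease before plotting and map fill to the rescaled value. This is only a sketch on a hypothetical toy data frame; the column name count_scaled is made up for illustration, and the approach shows each state relative to that disease's own maximum rather than the raw count.

```r
library(dplyr)

# Hypothetical toy data with the same column names as cd_us
toy <- data.frame(
  disease = c("Measles", "Measles", "Smallpox", "Smallpox"),
  state = c("Alabama", "Alaska", "Alabama", "Alaska"),
  count = c(12000, 3000, 40, 10)
)

# Divide each count by the largest count observed for the same disease,
# so every disease spans its own 0-1 range
scaled <- toy %>%
  group_by(disease) %>%
  mutate(count_scaled = count / max(count)) %>%
  ungroup()

scaled$count_scaled # 1.00 0.25 1.00 0.25
```

In the faceted map, fill = count_scaled could then replace fill = count, so that smallpox and measles each use the full white-to-red range. A disease whose maximum count is 0 would give NaN here and would need to be handled separately.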