Untitled

Data Cleaning

In this chunk, I loaded the packages that I would need to obtain my data, clean it, and plot my graph.

# load the libraries
library(dslabs)

## Warning: package 'dslabs' was built under R version 4.3.3

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(dplyr)
library(highcharter)

## Warning: package 'highcharter' was built under R version 4.3.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo 
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
## 
## Attaching package: 'highcharter'
## 
## The following object is masked from 'package:dslabs':
## 
##     stars

library(RColorBrewer)

# loads the dataset
data("us_contagious_diseases")

In this chunk, I perform the actual cleaning. I wanted to create a graph that would represent the volume of cases of each disease in the United States per year. The data set reports the cases of each disease every year, but the counts are broken down by state.

To fix this, I had to group the rows by disease and year, then add up the case counts and the populations across states. I also used the argument “na.rm” so that a missing value in any state's row would not turn the summed “count” value for a disease and year into NA.

After the cleaning, I wanted the data set ordered by disease so that I could check that the years were in order when I viewed it. The arrange() call for this is left commented out in the chunk below, since the grouped summary already comes back ordered by disease and year. After checking the data set, I confirmed that the years were indeed in order. Now let's move on to plotting the graph.

us_disease <- us_contagious_diseases %>%
  group_by(disease, year) %>% # groups by the year and disease type
  summarize(total_count = sum(count, na.rm = TRUE), # creates column for the US case for a disease
            total_population = sum(population, na.rm = TRUE)) # creates column for the US population

## `summarise()` has grouped output by 'disease'. You can override using the
## `.groups` argument.

# us_disease <- arrange(us_disease, disease) # arranges data by disease type
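
As a small illustration of why “na.rm” matters in the summary above, here is a sketch on made-up data. The data frame `toy`, its values, and the column `total_without_na_rm` are hypothetical and are not part of the real data set. It also assumes that passing `.groups = "drop"` is an acceptable way to silence the `summarise()` message shown above, which returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data: one disease, two states, two years, one missing count
toy <- data.frame(
  disease    = c("Measles", "Measles", "Measles", "Measles"),
  state      = c("A", "B", "A", "B"),
  year       = c(1928, 1928, 1929, 1929),
  count      = c(10, NA, 5, 7),
  population = c(100, 200, 110, 210)
)

toy_totals <- toy %>%
  group_by(disease, year) %>%                       # one group per disease and year
  summarize(
    total_without_na_rm = sum(count),               # NA whenever any state is missing
    total_count         = sum(count, na.rm = TRUE), # missing states are skipped
    total_population    = sum(population, na.rm = TRUE),
    .groups = "drop"                                # return an ungrouped result
  )

print(toy_totals)
```

In the 1928 row, the total computed without “na.rm” is NA, while `total_count` keeps the one count that was reported.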

Plotting the graph

The first step in plotting the graph was to organize the colors I would use. I created a vector containing the hex values of my color palette and stored it in a variable, which I use later when assembling the graph.

To start off the graph, I used the function “hc_add_series()” to define the x (year) and y (# of cases) variables, separate the data by disease, and set the graph type to streamgraph. I chose a streamgraph so I could visualize the evolution of each disease's case counts and compare the diseases to one another.

After I set up the basics of my graph, I assigned colors to each disease by using the function “hc_colors()” and placed the color vector “cols” into the argument “colors”. Next, I set up the labels for my graph, including the title, using the functions “hc_title()”, “hc_xAxis()”, and “hc_yAxis()”. I also moved the legend to the top left area of my graph using “hc_legend()”. Then I made sure that the data for every disease in a given year is shown when the mouse hovers over that year by using “hc_tooltip()”. To finish off the graphing process, I changed the font to “AvanteGrande” and made it bold by using the “hc_chart()” function with the arguments “fontFamily” and “fontWeight”.

cols <- c("#003F5C", "#58508D", "#BC5090", "#FF6361", "#FFA600", "#FFD380", "#FFE9c0") # creates color scheme for the graph

highchart() |>
  hc_add_series(data = us_disease, # selects dataframe
                type = "streamgraph", # sets the graph type, streamgraph
                hcaes(x = year, y = total_count, group = disease), # sets x, y values and groups by disease
                name = c("Hepatitis A", "Measles", "Mumps", "Pertussis", "Polio", "Rubella", "Smallpox")) |>
  hc_colors(colors = cols) |> # implements color scheme into graph
  hc_title(text = "Contagious Disease Data for the US 1928-2011") |> # sets title
  hc_xAxis(type = "linear", 
           title = list(text="Year")) |> # labels the x-axis
  hc_yAxis(title = list(text = "Volume of Cases")) |> # labels the y-axis
  hc_legend(align = "left", 
            verticalAlign = "top") |> # places the legend on the top left of the graph
  hc_tooltip(shared = TRUE, # shows all seven diseases per year
             borderColor = "black",
             pointFormat = "{series.name}: {point.y:.2f}<br>") |> # shows all y-axis values at the specific x-value
  hc_chart(style = list(fontFamily = "AvanteGrande",
                        fontWeight = "bold")) # changes font
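
Because the colors are handed to the chart as a plain vector, it can help to check which color ends up with which disease. The sketch below pairs the two vectors from the chunk above. It assumes that the series are drawn in the same order as the names passed to hc_add_series(), and the names `diseases` and `palette_map` are made up for this illustration.

```r
# Color palette and series names copied from the plotting chunk
cols <- c("#003F5C", "#58508D", "#BC5090", "#FF6361", "#FFA600", "#FFD380", "#FFE9c0")
diseases <- c("Hepatitis A", "Measles", "Mumps", "Pertussis", "Polio", "Rubella", "Smallpox")

# There should be exactly one color per disease
stopifnot(length(cols) == length(diseases))

# Named vector pairing each disease with its color (assumes matching order)
palette_map <- setNames(cols, diseases)
print(palette_map)
```

If a disease were added or removed, the length check would fail instead of letting the colors shift silently.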

Graph Results

I have used the same dataset used in the Week 8 notes, but the visualization that I have made is significantly different from the graph in the Week 8 notes. The Week 8 graph depicts the evolution of measles in each state through a heatmap. My own graph depicts the evolution of all 7 diseases in the dataset for the US as a whole through a streamgraph.

The Week 8 graph was plotted with ggplot2, while my own graph was plotted with highcharter. Unlike highcharter, ggplot2 on its own produces a static image, so the Week 8 heatmap shows only what is drawn. My streamgraph is interactive: hovering the cursor over a point brings up a tooltip with the number of cases of each disease in that year. Clicking a disease in the legend hides it, so I can isolate a single disease or compare the evolution of any subset of diseases by toggling them on and off.

2024-03-25