This dataset shows the yearly counts of Hepatitis A, Measles, Mumps, Pertussis, Polio, Rubella, and Smallpox for all US states. For this assignment, I’m going to illustrate the relationship between the case count, the disease, the state, and the year.
I begin this assignment by loading the libraries I need: tidyverse, highcharter, and dslabs. Highcharter is a library that adds interactivity through a clean and simple coding process. The dslabs library holds the datasets used for this assignment. The list.files(system.file()) call lists the files in the script directory of the installed dslabs package.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(highcharter)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
library("dslabs")
##
## Attaching package: 'dslabs'
##
## The following object is masked from 'package:highcharter':
##
## stars
data(package = "dslabs")
list.files(system.file("script", package = "dslabs"))
## [1] "make-admissions.R"
## [2] "make-brca.R"
## [3] "make-brexit_polls.R"
## [4] "make-calificaciones.R"
## [5] "make-death_prob.R"
## [6] "make-divorce_margarine.R"
## [7] "make-gapminder-rdas.R"
## [8] "make-greenhouse_gases.R"
## [9] "make-historic_co2.R"
## [10] "make-mice_weights.R"
## [11] "make-mnist_127.R"
## [12] "make-mnist_27.R"
## [13] "make-movielens.R"
## [14] "make-murders-rda.R"
## [15] "make-na_example-rda.R"
## [16] "make-nyc_regents_scores.R"
## [17] "make-olive.R"
## [18] "make-outlier_example.R"
## [19] "make-polls_2008.R"
## [20] "make-polls_us_election_2016.R"
## [21] "make-pr_death_counts.R"
## [22] "make-reported_heights-rda.R"
## [23] "make-research_funding_rates.R"
## [24] "make-stars.R"
## [25] "make-temp_carbon.R"
## [26] "make-tissue-gene-expression.R"
## [27] "make-trump_tweets.R"
## [28] "make-weekly_us_contagious_diseases.R"
## [29] "save-gapminder-example-csv.R"
Let’s begin with a basic highcharter area graph. I call the specific dataset I am going to use for this assignment, US contagious diseases, and begin the coding.
I start by calling the highchart() function and add to it with hc_add_series() to identify the dataset I am using, the type of visualization I’m creating, the variables mapped to the axes, and what we’re grouping by. I use hc_plotOptions() to stack the states on top of one another instead of letting them overlap. Then I label the x and y axes.
data("us_contagious_diseases")
highchart() |>
  hc_add_series(data = us_contagious_diseases,
                type = "area",
                hcaes(x = year, y = count, group = state)) |>
  hc_plotOptions(series = list(stacking = "normal")) |>
  hc_xAxis(title = list(text = "year")) |>
  hc_yAxis(title = list(text = "count"))
Oh my. That’s really ugly. Let’s identify all of the problems with this graph and work on fixing all of them.
First things first, there are simply way too many states included here and we can’t actually grasp any information. The numbers don’t mean anything if everything looks the same. Second, placing the legend under the chart makes the chart look much smaller, and it’s harder to read the graph. Third, the colors are much too similar to one another, so we have to dig through the legend and the graph to see which states share a color and which don’t. Fourth, it’s not even an area plot. It seems to be a bunch of line graphs. Fifth, the states appear in alphabetical order, so we need to manually reorder them by population.
Let’s start with narrowing the list of states. I decided to choose the 6 states with the highest population. These states happen to be California, Texas, Florida, New York, Pennsylvania, and Illinois. I create a vector with these 6 states and use the filter() function to keep only the rows for them.
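A likely reason for the fourth problem is that the dataset has one row per disease, state, and year, so each state's series ends up with several points at the same year. The sketch below uses a made-up toy data frame (not the real data) and base R's aggregate() to show the duplication and one possible way to collapse it to a single total per state and year. Treat it as an illustration under that assumption, not as the fix used in this assignment.

```r
# Toy data (made up): two diseases reported for the same state and year.
toy <- data.frame(
  state   = c("California", "California", "Texas", "Texas"),
  year    = c(1950, 1950, 1950, 1950),
  disease = c("Measles", "Mumps", "Measles", "Mumps"),
  count   = c(100, 40, 60, 10)
)

# More than one row per state-year means several points share one x value.
rows_per_state_year <- table(toy$state, toy$year)

# Summing over diseases leaves exactly one point per state and year.
totals <- aggregate(count ~ state + year, data = toy, FUN = sum)
totals
```

Summing this way drops the disease column, so a tooltip that reports the disease would no longer have that value to show.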
top6 <- c("California", "Texas", "Florida", "New York", "Pennsylvania", "Illinois")
us_contagious_diseases1 <- us_contagious_diseases |>
  filter(state %in% top6)
Now let’s create a color vector so we can use colors that are pleasing to the eyes. I found the hex codes from a color palette on Pinterest.
desiredcolors <- c("#F26749", "#EA9836", "#FCDED6", "#204ECF", "#83A5F2", "#F9A197")
I also want to reorder the list of states so it’s not in alphabetical order, but ordered from largest population to smallest (among the 6 most populous states).
us_contagious_diseases1$state <- factor(us_contagious_diseases1$state, levels = top6)
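A quick way to confirm the reordering is to look at the factor levels before and after. The sketch below uses a small stand-in vector rather than the real state column, so the variable name state here is only illustrative.

```r
top6 <- c("California", "Texas", "Florida", "New York", "Pennsylvania", "Illinois")

# Toy stand-in for the state column: a factor whose levels default to alphabetical order.
state <- factor(c("Texas", "California", "Illinois", "Florida", "New York", "Pennsylvania"))
levels(state)

# Re-declaring the factor with explicit levels sets the order to that of top6.
state <- factor(state, levels = top6)
levels(state)
```

Any value not listed in levels would become NA, which is harmless here because the data was already filtered to these 6 states.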
Now that we have adjusted our external factors needed for the graph, let’s work on adjusting the code in the chunk.
We start with the same thing: identifying the data, the type of graph wanted, identifying the x and y axes, and grouping by state. Then I rename the x-axis to “Year” and rename the y-axis to “Reported Case Count”. Then I add in the title. After adding the title, I use the hc_chart() function to adjust the font of all of the text present in the chart. I’m not sure why, but I just love the font “Spectral”, so that’s what I set it to. I made all of the text bold as well so it looks nice and strong. Then I set the colors of my graph to the manual color vector created above. After the colors, I adjusted the legend to be centered next to the chart. Then I included a chunk of code that allows all of the lines to be stacked above one another. I also set the line width to be a little bit thicker.
I got the chunk of code that follows the hc_plotOptions() function from ChatGPT. When I tried to introduce my fourth variable on my own, I got a different graph altogether. So in the end, I went to ChatGPT for help and received that chunk of code. I also realized later that this meant I had four variables in this graph. Then I just called my chart at the end.
finalplot <- highchart() |>
  hc_add_series(data = us_contagious_diseases1,
                type = "area",
                hcaes(x = year, y = count, group = state)) |>
  hc_xAxis(title = list(text = "Year")) |>
  hc_yAxis(title = list(text = "Reported Case Count")) |>
  hc_title(text = "Total Number of Reported Cases Each Year in the Top 6 Most Populous States") |>
  hc_chart(style = list(fontFamily = "Spectral",
                        fontWeight = "bold")) |>
  hc_colors(desiredcolors) |>
  hc_legend(align = "right",
            verticalAlign = "middle",
            layout = "vertical") |>
  hc_plotOptions(series = list(stacking = "normal",
                               marker = list(enabled = FALSE,
                                             states = list(hover = list(enabled = FALSE))),
                               lineWidth = 3)) |>
  hc_tooltip(formatter = JS("function() {
    return '<b>' + this.x + '</b><br/>' +
      'Count: ' + this.y + '<br/>' +
      'Disease: ' + this.point.disease;}")) # Source: ChatGPT
finalplot
Originally, when I began this homework assignment, I chose to use the stars dataset. However, it gave me many difficulties because somehow, every time I ran all chunks, the variables in the original dataset started to change. I have no idea how that happened and eventually I had to start all over again. While I’m not happy about losing all of my work, I was able to create a much more interesting graph than what I had before.
For this assignment, I experimented with the highcharter package to add interactivity to a visualization. It was extremely entertaining to figure out what each specific line of code does, and although it took some time (and the disappointment of losing my original assignment), I’m happy with the final product. I would like to know, however, what that last chunk of code from ChatGPT actually means. I tried interpreting it and it just wouldn’t make sense.
Overall, my visualization showcases the decline of these diseases from the 1930s to 2010. Reported cases were at an all-time high in all 6 states between 1935 and 1960 (depending on the particular disease).
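One possible explanation, judging only from the startup message printed when dslabs was attached ("The following object is masked from 'package:highcharter': stars"), is that both packages provide an object called stars, so a bare stars can refer to a different dataset depending on the order in which the libraries are loaded. This is a guess, not a confirmed diagnosis. A minimal sketch of pinning the dataset to one package, assuming dslabs is installed (the name stars_dslabs is just illustrative):

```r
# Pull the dataset by namespace so the attach order of packages does not matter.
stars_dslabs <- dslabs::stars

# Inspect the column names to confirm this is the expected version of the data.
names(stars_dslabs)
```

Working on a copy with its own name also keeps later chunks from picking up whichever stars happens to be first on the search path.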
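As far as I can tell, the formatter passed to hc_tooltip() is a small JavaScript function that runs whenever a point is hovered and returns the HTML for the tooltip. Inside it, this.x is the point's x value (the year), this.y is its y value (the count), and this.point.disease reads the disease column that travels with each point when a data frame is handed to hc_add_series(). The sketch below is an R stand-in for that JavaScript, with a hypothetical helper name, just to show the string being assembled.

```r
# Hypothetical helper: an R analogue of the JavaScript tooltip formatter.
# x, y, and disease play the roles of this.x, this.y, and this.point.disease.
format_tooltip <- function(x, y, disease) {
  paste0("<b>", x, "</b><br/>",
         "Count: ", y, "<br/>",
         "Disease: ", disease)
}

# Example with made-up values for one hovered point.
format_tooltip(1950, 100, "Measles")
```

The tags are plain HTML: the year is shown in bold on the first line, and each br tag starts a new line for the count and the disease.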