Use the package DSLabs (Data Science Labs)

There are a number of datasets in this package that you can use to practice creating visualizations.

# install.packages("dslabs")
library("dslabs")
data(package="dslabs")
list.files(system.file("script", package = "dslabs"))
##  [1] "make-admissions.R"                   
##  [2] "make-brca.R"                         
##  [3] "make-brexit_polls.R"                 
##  [4] "make-death_prob.R"                   
##  [5] "make-divorce_margarine.R"            
##  [6] "make-gapminder-rdas.R"               
##  [7] "make-greenhouse_gases.R"             
##  [8] "make-historic_co2.R"                 
##  [9] "make-mnist_27.R"                     
## [10] "make-movielens.R"                    
## [11] "make-murders-rda.R"                  
## [12] "make-na_example-rda.R"               
## [13] "make-nyc_regents_scores.R"           
## [14] "make-olive.R"                        
## [15] "make-outlier_example.R"              
## [16] "make-polls_2008.R"                   
## [17] "make-polls_us_election_2016.R"       
## [18] "make-reported_heights-rda.R"         
## [19] "make-research_funding_rates.R"       
## [20] "make-stars.R"                        
## [21] "make-temp_carbon.R"                  
## [22] "make-tissue-gene-expression.R"       
## [23] "make-trump_tweets.R"                 
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"

Note that the dslabs package also includes some of the scripts used to wrangle the data from their original sources; these are the files listed by the list.files() call above.

US murders

This dataset includes gun murder data for US states in 2010. I use this dataset to introduce the basics of R programming.

data("murders")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggthemes)
library(ggrepel)
str(murders)
## 'data.frame':    51 obs. of  5 variables:
##  $ state     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ abb       : chr  "AL" "AK" "AZ" "AR" ...
##  $ region    : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
##  $ population: num  4779736 710231 6392017 2915918 37253956 ...
##  $ total     : num  135 19 232 93 1257 ...
# save the murders dataset to your folder using the write_csv command
#write_csv(murders, "murders.csv", na="")

Work with the Murders Dataset

Calculate the average murder rate for the country

The national average rate defines a reference line on the plot. Once we determine the per million rate to be r, this line is defined by the formula y = rx, where the axes y and x are total murders and population in millions, respectively.

On the log scale this line becomes log(y) = log(r) + log(x), so in our plot it is a line with slope 1 and intercept log(r). To compute r, we use dplyr:

# The pull command selects a column in a data frame and transforms it into a vector
r <- murders %>% 
  summarize(rate = sum(total) /  sum(population) * 10^6) %>% 
  pull(rate)
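
The slope-1 claim can be checked numerically. The sketch below uses made-up numbers (the `toy` data frame and the names `r_toy`, `slopes`, and `intercepts` are hypothetical, not part of dslabs) and base R only. It builds points on the line y = rx and confirms that, on the log10 scale, consecutive points rise with slope 1 and the offset log10(y) - log10(x) equals log10(r).

```r
# Hypothetical toy data: population (people) and total murders for three places
toy <- data.frame(population = c(1e6, 5e6, 2e7),
                  total      = c(20, 150, 400))

# Rate per million, computed the same way as r in the text
r_toy <- sum(toy$total) / sum(toy$population) * 10^6

# Points on the line y = r * x, with x measured in millions
x <- toy$population / 10^6
y <- r_toy * x

# Slope between consecutive points on the log10 scale
slopes <- diff(log10(y)) / diff(log10(x))

# Vertical offset on the log10 scale, which should equal log10(r)
intercepts <- log10(y) - log10(x)

slopes
intercepts
log10(r_toy)
```

This is why geom_abline() can be given only the intercept: when slope is not supplied, it defaults to 1.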

Create a static graph for which each point is labeled

Use the dslabs theme, set with ds_theme_set(). Put each state's population in millions on the x-axis and its total murders on the y-axis.

Color by region, and add a dashed reference line for the national average rate r computed above. This is not a fitted regression line. Because the slope defaults to 1, we only need the intercept: geom_abline(intercept = log10(r))

Put the x- and y-axes on a log10 scale, and add axis labels and a title.

Use the nudge_x argument if you want to move the text labels slightly to the right or to the left:

ds_theme_set()
murders %>% 
  ggplot(aes(x = population/10^6, y = total, label = abb)) +
  geom_abline(intercept = log10(r), lty=2, col="darkgrey") +
  geom_point(aes(color=region), size = 3) +
  geom_text_repel(nudge_x = 0.005) +
  scale_x_log10("Populations in millions (log scale)") +
  scale_y_log10("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name="Region") +
  # Remove legend title
  theme(legend.title=element_blank())  

Gapminder Dataset

This dataset includes health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, OECD and OPEC, with the names of OECD and OPEC countries from 2016.

Assign each country to one of five groups using the %in% operator inside case_when(): The West, East Asia, Latin America, Sub-Saharan Africa, and Others

data("gapminder")

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% west ~ "The West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
gapminder <- gapminder %>%
  mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))
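
To see how case_when() and %in% interact, here is a minimal sketch on a hypothetical five-row data frame (`toy` and `toy_grouped` are illustrative names; the region and continent labels only mimic gapminder's). Conditions are tried in order, and the final TRUE acts as a catch-all, which is why Egypt (Africa, but Northern Africa) falls through to "Others".

```r
library(dplyr)

# Hypothetical toy rows; the region and continent names mimic gapminder's
toy <- data.frame(
  country   = c("France", "Japan", "Brazil", "Kenya", "Egypt"),
  region    = c("Western Europe", "Eastern Asia", "South America",
                "Eastern Africa", "Northern Africa"),
  continent = c("Europe", "Asia", "Americas", "Africa", "Africa")
)

toy_grouped <- toy %>%
  mutate(group = case_when(
    region %in% c("Western Europe", "Northern Europe") ~ "The West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"   # fallback for rows matching none of the conditions above
  ))

toy_grouped$group
# "The West" "East Asia" "Latin America" "Sub-Saharan Africa" "Others"
```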

Then plot fertility against life expectancy for 1962 and 2013:

  1. Remove all NA values from group, fertility, and life_expectancy using !is.na() inside filter() (this drops incomplete rows, much as na.rm = TRUE drops NAs inside a summary function)

  2. mutate the population to be a value in millions

  3. change the theme of the plot

  4. Use the command geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") to label each of the two panels with its year, at the top right inside the panel.

  5. Shift the legend to go across the top.

gapminder %>% 
  filter(year%in%c(1962, 2013) & !is.na(group) &
         !is.na(fertility) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = population/10^6) %>%
  ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
  geom_point(alpha = 0.8) +
  guides(size = "none") +  # hide the size legend
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(30, 85)) +
  xlab("Fertility rate (births per woman)") +
  ylab("Life Expectancy") +
  geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
  facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

Contagious disease data for US states

The next dataset contains yearly counts for Hepatitis A, measles, mumps, pertussis, polio, rubella, and smallpox for US states. Original data courtesy of Tycho Project. Use it to show ways one can plot more than 2 dimensions.

Focus on just measles:

  1. filter out Alaska and Hawaii

  2. mutate the rate of measles by taking count / population * 10,000 * 52 / weeks_reporting

  3. draw a vertical line for 1963, which is when the measles vaccine was introduced

library(RColorBrewer)
data("us_contagious_diseases")
the_disease <- "Measles"
us_contagious_diseases %>%
  filter(!state%in%c("Hawaii","Alaska") & disease ==  the_disease) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate)) %>%
  ggplot(aes(year, state,  fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() +  theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")
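
The rate formula in the mutate() step is easier to read with numbers plugged in. The values below are made up for illustration: a state with one million people reports 100 cases in a year in which only 26 weeks were reported. Dividing by population and multiplying by 10,000 gives cases per 10,000 people; multiplying by 52 / weeks_reporting scales the partial year up to a full year.

```r
# Made-up values for one state-year
count           <- 100    # reported cases
population      <- 1e6    # state population
weeks_reporting <- 26     # weeks with a report that year

# Cases per 10,000 people, scaled up to a full 52-week year
rate <- count / population * 10000 * 52 / weeks_reporting
rate  # about 2 cases per 10,000 people per year
```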

FiveThirtyEight 2016 Poll Data

This data includes poll results from the US 2016 presidential elections aggregated from HuffPost Pollster, RealClearPolitics, polling firms and news reports. The dataset also includes election results (popular vote) and electoral college votes in results_us_election_2016. Use this dataset to explore inference.

  1. Focus on polls for Clinton and Trump after July 2016

  2. Plot a scatterplot of the poll percentage against the end date (enddate)

  3. Add a loess smoother

data(polls_us_election_2016)
polls_us_election_2016 %>%
  filter(state == "U.S." & enddate>="2016-07-01") %>%
  select(enddate, pollster, rawpoll_clinton, rawpoll_trump) %>%
  rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
  gather(candidate, percentage, -enddate, -pollster) %>% 
  mutate(candidate = factor(candidate, levels = c("Trump","Clinton")))%>%
  group_by(pollster) %>%
  filter(n()>=10) %>%
  ungroup() %>%
  ggplot(aes(enddate, percentage, color = candidate)) +  
  geom_point(show.legend = FALSE, alpha=0.4)  + 
  geom_smooth(method = "loess", span = 0.15) +
  scale_y_continuous(limits = c(30,50))
## Warning: Removed 22 rows containing non-finite values (stat_smooth).
## Warning: Removed 22 rows containing missing values (geom_point).
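
The reshaping and filtering steps in this pipeline can be sketched on a tiny hypothetical data frame (`toy` and `long` are illustrative names, and the threshold of 4 is chosen only to fit the toy data). gather() turns the two candidate columns into one row per poll and candidate, and filter(n() >= k) after group_by() keeps only the pollsters with at least k rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical polls: pollster "A" ran 3 polls, pollster "B" ran 1
toy <- data.frame(
  pollster = c("A", "A", "A", "B"),
  Clinton  = c(45, 46, 44, 47),
  Trump    = c(41, 40, 42, 39)
)

long <- toy %>%
  gather(candidate, percentage, -pollster) %>%  # one row per poll and candidate
  group_by(pollster) %>%
  filter(n() >= 4) %>%                          # keep pollsters with at least 4 rows
  ungroup()

long
```

Because n() is evaluated after gather(), each poll contributes two rows (one per candidate). Read that way, the n() >= 10 condition in the pipeline above keeps pollsters with at least five polls; this is an inference from the code, not something the text states.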

Working with HTML Widgets and Highcharter

Set your working directory to access your files

# load required packages
library(readr)
library(ggplot2)
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
library(dplyr)

Make a range of simple charts using the highcharter package

Highcharter is a package within the htmlwidgets framework that connects R to the Highcharts and Highstock JavaScript visualization libraries. For more information, see https://github.com/jbkunst/highcharter/

Also check out this site: https://cran.r-project.org/web/packages/highcharter/vignettes/charting-data-frames.html

Install and load required packages

Now install and load highcharter, plus RColorBrewer, which will make it possible to use ColorBrewer color palettes.

Also load dplyr and readr for loading and processing data.

# install highcharter, RColorBrewer
# install.packages(c("highcharter", "RColorBrewer"))

# load required packages
library(highcharter)
## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
## Highcharts (www.highcharts.com) is a Highsoft software product which is
## not free for commercial and Governmental use
## 
## Attaching package: 'highcharter'
## The following object is masked from 'package:dslabs':
## 
##     stars
library(RColorBrewer)

Load and process nations data

Load the nations data, and add a column showing GDP in trillions of dollars.

setwd("C:/Users/munis/Documents/Comm in Data Science/Datasets")
nations <- read_csv("nations.csv") %>%
  mutate(gdp_tn = gdp_percap*population/1000000000000)
## Parsed with column specification:
## cols(
##   iso2c = col_character(),
##   iso3c = col_character(),
##   country = col_character(),
##   year = col_double(),
##   gdp_percap = col_double(),
##   population = col_double(),
##   birth_rate = col_double(),
##   neonat_mortal_rate = col_double(),
##   region = col_character(),
##   income = col_character()
## )
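
As a quick check on the units in the mutate() step, here is the same arithmetic with made-up round numbers (not taken from nations.csv): GDP per capita times population gives total GDP in dollars, and dividing by 10^12 expresses it in trillions.

```r
# Made-up values for one country-year
gdp_percap <- 50000     # GDP per capita in dollars
population <- 300e6     # 300 million people

# Total GDP in dollars, divided by 10^12 to express it in trillions
gdp_tn <- gdp_percap * population / 1e12
gdp_tn  # 15, i.e. $15 trillion
```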

Make a version of the “China’s rise” chart from your previous assignment

First, prepare the data using dplyr:

# prepare data
big4 <- nations %>%
  filter(iso3c == "CHN" | iso3c == "DEU" | iso3c == "JPN" | iso3c == "USA") %>%
  arrange(year)

The arrange step is important, because highcharter needs the data in order when drawing a time series - otherwise any line drawn through the data will follow the path of the data order, not the correct time order.
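
A small sketch of what arrange() does here, using a hypothetical three-row data frame (`toy` and `sorted` are illustrative names and the values are made up). The rows start out of time order, as they do in the nations data, and arrange(year) puts them in the order a line chart should follow.

```r
library(dplyr)

# Hypothetical rows, deliberately out of time order
toy <- data.frame(year   = c(2001, 1999, 2000),
                  gdp_tn = c(10.6, 9.6, 10.3))

is.unsorted(toy$year)      # TRUE: a line drawn in row order would zigzag

sorted <- toy %>% arrange(year)
sorted$year                # 1999 2000 2001
```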

Now draw a basic chart with default settings:

# basic symbol-and-line chart, default settings
highchart() %>%
  hc_add_series(data = big4,
                   type = "line", hcaes(x = year,
                   y = gdp_tn, 
                   group = country))
## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.

In the code above, the function highchart() creates a chart.

Clicking on the legend items allows you to remove or add series from the chart.

Highcharts works by adding data “series” to a chart, and from R you can add the variables from a data frame all in one go using the hc_add_series function.

Inside this function we define the data frame to be used, with data, the type of chart, the variables to be mapped to the x and y axes, and the variable to group the data: here this draws a separate line for each country in the data.

Go to the GitHub site provided above for the chart types available in Highcharts.

Now let’s begin customizing the chart.

Use a ColorBrewer palette

Using RColorBrewer, we can set a palette, and then use it in highcharter.

# define color palette
cols <- brewer.pal(4, "Set1")

highchart() %>%
  hc_add_series(data = big4,
                   type = "line", hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols)

The first line of code sets a palette with four colors, using the “Set1” palette from ColorBrewer. This is then fed to the function hc_colors() to use those colors on the chart.
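
It can help to look at what brewer.pal() actually returns: a plain character vector of hex color codes, which is the kind of input hc_colors() is given above. The sketch below only inspects the palette. The names `pal` and `max_set1` are illustrative, and brewer.pal.info is RColorBrewer's lookup table of palette properties.

```r
library(RColorBrewer)

# Four colors from the "Set1" qualitative palette
pal <- brewer.pal(4, "Set1")
pal                                    # a character vector of hex codes

length(pal)                            # 4
all(grepl("^#[0-9A-Fa-f]{6}$", pal))   # TRUE: every entry is a hex color

# Largest number of colors "Set1" can supply
max_set1 <- brewer.pal.info["Set1", "maxcolors"]
max_set1
```

Checking maxcolors is useful before asking for more colors than there are series; if a chart has more groups than the palette can provide, a different palette is needed.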

Add axis labels

highchart() %>%
  hc_add_series(data = big4,
                   type = "line",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)"))

Change the legend position

For this, we use the hc_legend function.

highchart() %>%
  hc_add_series(data = big4,
                   type = "line",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)")) %>%
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
  hc_legend(align = "right", 
            verticalAlign = "top")

Customize the tooltips

By default we have a tooltip for each series, or line, and the numbers run to many decimal places.

We can change to one tooltip for each year with shared = TRUE, and round all the numbers to two decimal places with pointFormat = "{point.country}: {point.gdp_tn:.2f}<br>".

# customize the tooltips

big4_chart <- highchart() %>%
  hc_add_series(data = big4,
                   type = "line",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)")) %>%
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
  hc_legend(align = "right", 
            verticalAlign = "top") %>%
  hc_tooltip(shared = TRUE,
             borderColor = "black",
             pointFormat = "{point.country}: {point.gdp_tn:.2f}<br>")
big4_chart

Prepare the data

First, prepare the data using dplyr, summing GDP by region for each year.

# prepare data
regions <- nations %>%
  group_by(year,region) %>%
  summarize(gdp_tn = sum(gdp_tn, na.rm = TRUE)) %>%
  arrange(year,region)
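
The effect of na.rm = TRUE in this summary is easiest to see on a small hypothetical data frame (`toy` and `toy_regions` are illustrative names, and the regions "X" and "Y" are made up). One country has a missing GDP value; without na.rm = TRUE the whole region-year total would be NA.

```r
library(dplyr)

# Hypothetical data: region "X" has a missing value in 2000
toy <- data.frame(
  year   = c(2000, 2000, 2000, 2001, 2001),
  region = c("X", "X", "Y", "X", "Y"),
  gdp_tn = c(1.0, NA, 2.0, 1.5, 2.5)
)

sum(c(1.0, NA))   # NA: a single missing value poisons the total

toy_regions <- toy %>%
  group_by(year, region) %>%
  summarize(gdp_tn = sum(gdp_tn, na.rm = TRUE)) %>%  # NAs are skipped
  ungroup() %>%
  arrange(year, region)

toy_regions
```

Note that na.rm = TRUE treats a missing country as contributing zero, so regional totals can be understated in years with gaps.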

Make an area chart using default options

# basic area chart, default options
highchart () %>%
  hc_add_series(data = regions,
                   type = "area",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = region))

The following code customizes the chart in other ways. It uses the same ColorBrewer palette, with seven colors, that we used in unit 3.

# set color palette
cols <- brewer.pal(7, "Set2")

# stacked area chart
highchart () %>%
  hc_add_series(data = regions,
                   type = "area",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = region)) %>%
  hc_colors(cols) %>% 
  hc_chart(style = list(fontFamily = "Georgia",
                        fontWeight = "bold")) %>%
  hc_plotOptions(series = list(stacking = "normal",
                               marker = list(enabled = FALSE,
                               states = list(hover = list(enabled = FALSE))),
                               lineWidth = 0.5,
                               lineColor = "white")) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)")) %>%
  hc_legend(align = "right", verticalAlign = "top",
            layout = "vertical") %>%
  hc_tooltip(enabled = FALSE)

We have already encountered the main functions used here. The key changes are in the hc_plotOptions() function:

stacking = "normal" creates the stacked chart. See what happens if you use stacking = "percent".

lineWidth and lineColor set the width and color of the lines.

Under marker = list(), the code states = list(hover = list(enabled = FALSE)) turns off the hovering effect for each marker on the chart, so that the markers no longer reappear when hovered or tapped.

In the hc_legend() function, layout = "vertical" changes the layout so that the legend items appear in a vertical column.
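
One way to try the stacking = "percent" variation is sketched below on a small hypothetical data frame, so it runs without nations.csv (`toy` and `pct_chart` are illustrative names). With percent stacking, each year's areas are rescaled to sum to 100, so the chart shows shares rather than totals.

```r
library(dplyr)        # loaded for the %>% pipe
library(highcharter)

# Hypothetical regional totals for three years
toy <- data.frame(
  year   = rep(2000:2002, times = 2),
  region = rep(c("X", "Y"), each = 3),
  gdp_tn = c(1, 2, 3, 3, 2, 1)
)

pct_chart <- highchart() %>%
  hc_add_series(data = toy,
                type = "area",
                hcaes(x = year, y = gdp_tn, group = region)) %>%
  hc_plotOptions(series = list(stacking = "percent"))  # shares instead of totals

pct_chart
```

In this toy data the two regions always sum to 4, so region X's share grows from 25% to 75% across the three years.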

Food Stamps Data - Combine Two Types

cols <- c("red","black")
food_stamps<- read_csv("food_stamps.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   participants = col_double(),
##   costs = col_double()
## )
highchart() %>%
  hc_yAxis_multiples(
    list(title = list(text = "Participants (millions)")),
    list(title = list(text = "Costs ($ billions)"),
         opposite = TRUE)
  ) %>%
  hc_add_series(data = food_stamps$participants,
                name = "Participants (millions)",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = food_stamps$costs,
                name = "Costs ($ billions)",
                type = "line",
                yAxis = 1) %>%
  hc_xAxis(categories = food_stamps$year,
           tickInterval = 5) %>%
  hc_colors(cols) %>%
  hc_chart(style = list(fontFamily = "Georgia",
                        fontWeight = "bold"))
nations <- read_csv("nations.csv")
## Parsed with column specification:
## cols(
##   iso2c = col_character(),
##   iso3c = col_character(),
##   country = col_character(),
##   year = col_double(),
##   gdp_percap = col_double(),
##   population = col_double(),
##   birth_rate = col_double(),
##   neonat_mortal_rate = col_double(),
##   region = col_character(),
##   income = col_character()
## )
nations <- nations %>%
  filter(country == "United States")
nations
## # A tibble: 25 x 10
##    iso2c iso3c country  year gdp_percap population birth_rate
##    <chr> <chr> <chr>   <dbl>      <dbl>      <dbl>      <dbl>
##  1 US    USA   United~  2001     37274.  284968955       14.1
##  2 US    USA   United~  2008     48401.  304093966       14  
##  3 US    USA   United~  2002     38166.  287625193       14  
##  4 US    USA   United~  1999     34621.  279040000       14.2
##  5 US    USA   United~  2009     47002.  306771529       13.5
##  6 US    USA   United~  2007     48062.  301231207       14.3
##  7 US    USA   United~  2003     39677.  290107933       14.1
##  8 US    USA   United~  2000     36450.  282162411       14.4
##  9 US    USA   United~  1998     32949.  275854000       14.3
## 10 US    USA   United~  1996     30068.  269394000       14.4
## # ... with 15 more rows, and 3 more variables: neonat_mortal_rate <dbl>,
## #   region <chr>, income <chr>
nations <- nations %>%
  arrange(year)
nations
## # A tibble: 25 x 10
##    iso2c iso3c country  year gdp_percap population birth_rate
##    <chr> <chr> <chr>   <dbl>      <dbl>      <dbl>      <dbl>
##  1 US    USA   United~  1990     23954.  249623000       16.7
##  2 US    USA   United~  1991     24405.  252981000       16.2
##  3 US    USA   United~  1992     25493.  256514000       15.8
##  4 US    USA   United~  1993     26465.  259919000       15.4
##  5 US    USA   United~  1994     27777.  263126000       15  
##  6 US    USA   United~  1995     28782.  266278000       14.6
##  7 US    USA   United~  1996     30068.  269394000       14.4
##  8 US    USA   United~  1997     31573.  272657000       14.2
##  9 US    USA   United~  1998     32949.  275854000       14.3
## 10 US    USA   United~  1999     34621.  279040000       14.2
## # ... with 15 more rows, and 3 more variables: neonat_mortal_rate <dbl>,
## #   region <chr>, income <chr>
nations <- nations %>%
  select(country, year, population, birth_rate, neonat_mortal_rate)
nations
## # A tibble: 25 x 5
##    country        year population birth_rate neonat_mortal_rate
##    <chr>         <dbl>      <dbl>      <dbl>              <dbl>
##  1 United States  1990  249623000       16.7                5.8
##  2 United States  1991  252981000       16.2                5.6
##  3 United States  1992  256514000       15.8                5.4
##  4 United States  1993  259919000       15.4                5.2
##  5 United States  1994  263126000       15                  5.1
##  6 United States  1995  266278000       14.6                5  
##  7 United States  1996  269394000       14.4                4.9
##  8 United States  1997  272657000       14.2                4.8
##  9 United States  1998  275854000       14.3                4.7
## 10 United States  1999  279040000       14.2                4.6
## # ... with 15 more rows
cols <- c("grey","green", "red")

highchart() %>%
  hc_title(text = "Population, Birth Rate, and Neonatal Mortality Rate over Time in the United States")%>% 
  hc_yAxis_multiples(
    list(title = list(text = "Population")),
    list(title = list(text = "Birth Rate"),
         opposite = TRUE)
  ) %>%
  hc_add_series(data = nations$population,
                name = "Population",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = nations$birth_rate,
                name = "Birth Rate",
                type = "line",
                yAxis = 1) %>%
  hc_add_series(data = nations$neonat_mortal_rate,
                name = "Neonatal Mortality Rate",
                type = "line",
                yAxis = 1) %>%
  hc_xAxis(categories = nations$year,
           tickInterval = 5) %>%
  hc_colors(cols) %>%
  hc_chart(style = list(fontFamily = "Georgia",
                        fontWeight = "bold"))

I used the nations.csv dataset and tried to create a graph similar to the last graph given in the notes. After importing the dataset, I filtered the data for the United States, sorted it by year, and then created the graph with highcharter. What I changed was that I used different data and added another variable. I decided to look at the change in population, birth rate, and neonatal mortality rate over time in the United States and how they relate to each other. I added the third series by adding another hc_add_series() call to the highcharter code.