Load the libraries

# install.packages("tidyverse")
# install.packages("zoo")
library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.0.2

## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.1     v purrr   0.3.4
## v tibble  3.0.1     v dplyr   1.0.0
## v tidyr   1.1.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0

## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(zoo)  # this package will help us re-format the period to be a useable date.

## Warning: package 'zoo' was built under R version 4.0.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Loading Data from a Working Directory

You may get data from a url, from a pre-built dataset, or you may load data from a working directory. A working directory is simply the file location on your computer.

The easiest way to find your current working directory is to run the command getwd().

getwd()

## [1] "C:/Users/Salifou Sylla/Downloads"

This command shows you (in your console below) the path to your directory. My current path is: [1] “C:/Users/sonko/Desktop/DATA110”

If you want to change the path, there are several ways to do so. I find the easiest way to change it is to click the “Session” tab at the top of R Studio. Select “Set Working Directory”, and then arrow over to “Choose Directory”. At this point, it will take you to your computer folders, and you need to select where your data is held. I suggest you create a folder called “Datasets” and keep all the data you load for this class in that folder.

Notice that down in the console below, it will show the new path you have chosen: setwd(“C:/Users/rsaidi/Dropbox/Rachel/MontColl/Datasets/Datasets”). At this point, I copy that command and put it directly into a new chunk.
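The same steps can also be done entirely in code. The sketch below is only an illustration: it uses R's temporary directory as a stand-in for a “Datasets” folder, and the file name example.csv is made up for the demonstration.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# tempdir() stands in for your own "Datasets" folder (an assumption for this demo)
setwd(tempdir())

# Write a tiny file, then read it back using only its name;
# R resolves the name relative to the working directory
write.csv(data.frame(id = 1:2, value = c(10, 20)), "example.csv", row.names = FALSE)
example <- read.csv("example.csv")
print(example)

# Clean up and return to where we started
file.remove("example.csv")
setwd(old_dir)
```

Because the file is found relative to the working directory, the same read call works on any computer once setwd() points at the right folder.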

Load the data

The following data comes from the New York Fed (https://www.newyorkfed.org/microeconomics/hhdc.html) and covers household debt for housing and non-housing expenses.

Download this dataset, Household_debt, from http://bit.ly/2P3084E and save it in your dataset folder. Change your working directory to load the dataset from YOUR folder. Then run this code.

setwd("C:/Users/Salifou Sylla/Downloads")
household <- read_csv("household_debt.csv")

## Parsed with column specification:
## cols(
##   Period = col_character(),
##   Mortgage = col_double(),
##   `HE Revolving` = col_double(),
##   `Auto Loan` = col_double(),
##   `Credit Card` = col_double(),
##   `Student Loan` = col_double(),
##   Other = col_double(),
##   Total = col_double()
## )

Clean data headings and variable names

Very soon, you will find data from other sources. The data will require some cleaning. Here are some important points to check:

1. Be sure the format is .csv.
2. Be sure there are no spaces within variable names (headers).
3. Set all variable names to lowercase so you do not have to keep track of capitalization.

Here are some useful cleaning commands:

Make all headings (column names) lowercase. Remove all spaces between words in headings and replace them with underscores with the gsub command. Then look at it with “head” and “tail”.

names(household) <- tolower(names(household))
names(household) <- gsub(" ","_",names(household))
head(household)

## # A tibble: 6 x 8
##   period mortgage he_revolving auto_loan credit_card student_loan other total
##   <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
## 1 03:Q1      4.94         0.24      0.64        0.69         0.24  0.48  7.23
## 2 03:Q2      5.08         0.26      0.62        0.69         0.24  0.49  7.38
## 3 03:Q3      5.18         0.27      0.68        0.69         0.25  0.48  7.56
## 4 03:Q4      5.66         0.3       0.7         0.7          0.25  0.45  8.07
## 5 04:Q1      5.84         0.33      0.72        0.7          0.26  0.45  8.29
## 6 04:Q2      5.97         0.37      0.74        0.7          0.26  0.42  8.46

tail(household)

## # A tibble: 6 x 8
##   period mortgage he_revolving auto_loan credit_card student_loan other total
##   <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
## 1 17:Q3      8.74         0.45      1.21        0.81         1.36  0.39  13.0
## 2 17:Q4      8.88         0.44      1.22        0.83         1.38  0.39  13.2
## 3 18:Q1      8.94         0.44      1.23        0.82         1.41  0.39  13.2
## 4 18:Q2      9            0.43      1.24        0.83         1.41  0.39  13.3
## 5 18:Q3      9.14         0.42      1.27        0.84         1.44  0.4   13.5
## 6 18:Q4      9.12         0.41      1.27        0.87         1.46  0.41  13.5
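To see what the two cleaning commands do on their own, here is a minimal sketch that applies them to a plain character vector. The vector raw_names is a hypothetical stand-in that copies a few of the original column headings.

```r
# A few of the original headings, as they appear in the csv file
raw_names <- c("Period", "HE Revolving", "Auto Loan", "Credit Card")

# Step 1: make everything lowercase
clean_names <- tolower(raw_names)

# Step 2: replace every space with an underscore
clean_names <- gsub(" ", "_", clean_names)

print(clean_names)
```

gsub() replaces every match in each string, so a heading with several spaces would get an underscore in place of each one.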

Look at the dimensions and the structure of the data. Note that it will be listed as a tibble (to be discussed later in these notes).

dim(household)

## [1] 64  8

str(household)

## tibble [64 x 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ period      : chr [1:64] "03:Q1" "03:Q2" "03:Q3" "03:Q4" ...
##  $ mortgage    : num [1:64] 4.94 5.08 5.18 5.66 5.84 5.97 6.21 6.36 6.51 6.7 ...
##  $ he_revolving: num [1:64] 0.24 0.26 0.27 0.3 0.33 0.37 0.43 0.47 0.5 0.53 ...
##  $ auto_loan   : num [1:64] 0.64 0.62 0.68 0.7 0.72 0.74 0.75 0.73 0.73 0.77 ...
##  $ credit_card : num [1:64] 0.69 0.69 0.69 0.7 0.7 0.7 0.71 0.72 0.71 0.72 ...
##  $ student_loan: num [1:64] 0.24 0.24 0.25 0.25 0.26 0.26 0.33 0.35 0.36 0.37 ...
##  $ other       : num [1:64] 0.48 0.49 0.48 0.45 0.45 0.42 0.41 0.42 0.39 0.4 ...
##  $ total       : num [1:64] 7.23 7.38 7.56 8.07 8.29 8.46 8.83 9.04 9.21 9.49 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Period = col_character(),
##   ..   Mortgage = col_double(),
##   ..   `HE Revolving` = col_double(),
##   ..   `Auto Loan` = col_double(),
##   ..   `Credit Card` = col_double(),
##   ..   `Student Loan` = col_double(),
##   ..   Other = col_double(),
##   ..   Total = col_double()
##   .. )

Mutate

Mutate is a powerful command in tidyverse. It creates a new variable (column) in your dataset. In our dataset, “period” is stored as text (for example “03:Q1”), which is not useful if we want to plot chronological data. So we will use mutate from “tidyverse” with the package “zoo” to create a usable date format.

Create a new variable to use instead of “period”

You should see that there are 64 observations and 8 variables. All variables are “col_double” (continuous values) except “period”, which is interpreted as characters. We need the “zoo” package to fix the unusual format of “period”. We will use mutate to create a new variable, date.

household_debt_perc <- household %>%
   mutate(date = as.Date(as.yearqtr(period, format = "%y:Q%q")))
household_debt_perc

## # A tibble: 64 x 9
##    period mortgage he_revolving auto_loan credit_card student_loan other total
##    <chr>     <dbl>        <dbl>     <dbl>       <dbl>        <dbl> <dbl> <dbl>
##  1 03:Q1      4.94         0.24      0.64        0.69         0.24  0.48  7.23
##  2 03:Q2      5.08         0.26      0.62        0.69         0.24  0.49  7.38
##  3 03:Q3      5.18         0.27      0.68        0.69         0.25  0.48  7.56
##  4 03:Q4      5.66         0.3       0.7         0.7          0.25  0.45  8.07
##  5 04:Q1      5.84         0.33      0.72        0.7          0.26  0.45  8.29
##  6 04:Q2      5.97         0.37      0.74        0.7          0.26  0.42  8.46
##  7 04:Q3      6.21         0.43      0.75        0.71         0.33  0.41  8.83
##  8 04:Q4      6.36         0.47      0.73        0.72         0.35  0.42  9.04
##  9 05:Q1      6.51         0.5       0.73        0.71         0.36  0.39  9.21
## 10 05:Q2      6.7          0.53      0.77        0.72         0.37  0.4   9.49
## # ... with 54 more rows, and 1 more variable: date <date>
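The conversion happens in two steps, which can be tried on a few values outside the data frame. This is a small sketch, assuming the zoo package is installed; the vector periods is made up to mimic the “period” column.

```r
library(zoo)

# Hypothetical sample values in the same format as the period column
periods <- c("03:Q1", "03:Q4", "18:Q2")

# Step 1: parse the text as year-quarter values
# (%y is a two-digit year, %q is the quarter number)
quarters <- as.yearqtr(periods, format = "%y:Q%q")

# Step 2: convert each quarter to a Date (the first day of that quarter)
dates <- as.Date(quarters)

print(quarters)
print(dates)
```

Once the values are of class Date, ggplot2 can place them on a continuous time axis instead of treating them as text labels.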

Finally plot various loan types

plot1 <- household_debt_perc %>% 
  ggplot(aes(date, mortgage)) +
  geom_point() +
  ggtitle("Mortgage Debt Between 2003 and 2018")
plot1

plot2 <- household_debt_perc %>% 
  ggplot(aes(date, credit_card)) +
  geom_point() + 
  ggtitle("Credit Card Debt Between 2003 and 2018")
plot2

Use “facet_wrap” to show all types of debt together

facet_wrap allows you to plot all variables together for comparison. In order to do this, you have to reshape the data from a wide format to a long format. Finally, I colored each variable with a different color.

# install.packages("reshape2")
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

household_debt_perc2 <- household_debt_perc %>% tidyr::gather("id", "debt", 2:7) %>% 
  ggplot(., aes(date, debt))+
  geom_point()+
  aes(color = as.factor(id)) +
  facet_wrap(~id)
household_debt_perc2
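The reshaping step is easier to follow on a very small table. The sketch below assumes the tidyr package is installed and uses a made-up two-row data frame called wide; it calls gather() the same way as the code above.

```r
library(tidyr)

# A tiny wide table: one column per debt type (values are illustrative)
wide <- data.frame(
  date = as.Date(c("2003-01-01", "2003-04-01")),
  mortgage = c(4.94, 5.08),
  credit_card = c(0.69, 0.69)
)

# Stack columns 2 and 3 into two new columns:
# "id" holds the old column name, "debt" holds the value
long <- gather(wide, "id", "debt", 2:3)

print(long)
```

In the long format every row is one date and debt type pair, which is the shape facet_wrap(~id) needs to draw one panel per debt type.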

DS Labs Datasets

Use the package DSLabs (Data Science Labs)

There are a number of datasets in this package that you can use to practice creating visualizations.

# install.packages("dslabs")  # these are data science labs
library("dslabs")

## Warning: package 'dslabs' was built under R version 4.0.2

data(package="dslabs")
list.files(system.file("script", package = "dslabs"))

##  [1] "make-admissions.R"                   
##  [2] "make-brca.R"                         
##  [3] "make-brexit_polls.R"                 
##  [4] "make-death_prob.R"                   
##  [5] "make-divorce_margarine.R"            
##  [6] "make-gapminder-rdas.R"               
##  [7] "make-greenhouse_gases.R"             
##  [8] "make-historic_co2.R"                 
##  [9] "make-mnist_27.R"                     
## [10] "make-movielens.R"                    
## [11] "make-murders-rda.R"                  
## [12] "make-na_example-rda.R"               
## [13] "make-nyc_regents_scores.R"           
## [14] "make-olive.R"                        
## [15] "make-outlier_example.R"              
## [16] "make-polls_2008.R"                   
## [17] "make-polls_us_election_2016.R"       
## [18] "make-reported_heights-rda.R"         
## [19] "make-research_funding_rates.R"       
## [20] "make-stars.R"                        
## [21] "make-temp_carbon.R"                  
## [22] "make-tissue-gene-expression.R"       
## [23] "make-trump_tweets.R"                 
## [24] "make-weekly_us_contagious_diseases.R"
## [25] "save-gapminder-example-csv.R"

Note that the dslabs package also includes some of the scripts used to wrangle the data from their original sources, as listed above.

US murders

This dataset includes gun murder data for US states in 2010. I use this dataset to introduce the basics of R programming.

data("murders")
library(tidyverse)
library(ggthemes)

## Warning: package 'ggthemes' was built under R version 4.0.2

library(ggrepel)

## Warning: package 'ggrepel' was built under R version 4.0.2

view(murders)
write_csv(murders, "murders.csv", na="")

Work with the Murders Dataset

Calculate the average murder rate for the country

Once we determine the per million rate to be r, this line is defined by the formula: y=rx, with y and x our axes: total murders and population in millions respectively.

In the log-scale this line turns into: log(y)=log(r)+log(x). So in our plot it’s a line with slope 1 and intercept log(r). To compute r, we use dplyr:

r <- murders %>% 
  summarize(rate = sum(total) /  sum(population) * 10^6) %>% 
  pull(rate)
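The relationship between the rate line and its log-scale form can be checked numerically. This is a minimal sketch in base R with three invented states; the data frame toy and its numbers are assumptions made only for the check.

```r
# Three hypothetical states (totals and populations are invented)
toy <- data.frame(
  total = c(10, 40, 50),
  population = c(1e6, 2e6, 7e6)
)

# Overall rate per million: 100 murders / 10 million people * 10^6 = 10
r_toy <- sum(toy$total) / sum(toy$population) * 10^6

# Points on the line y = r * x, with x in millions
x <- toy$population / 10^6
y <- r_toy * x

# On the log scale the same line has slope 1 and intercept log10(r)
print(all.equal(log10(y), log10(r_toy) + log10(x)))
```

Note that r is the total over the whole country divided by the total population, not the average of the state rates, so large states carry more weight.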

Create a static graph for which each point is labeled

Use the data science theme. Plot the murders with the population of each state in millions on the x-axis and the total murders for each state on the y-axis.

Color by region, and add a reference line for the national average rate r calculated above. The line has slope 1 on the log scale, so we only need the intercept: geom_abline(intercept = log10(r))

Put the x- and y-axes on a log10 scale, and add axis labels and a title.

You can use the nudge_x argument if you want to move the text slightly to the right or to the left:

ds_theme_set()
murders %>% ggplot(aes(x = population/10^6, y = total, label = abb)) +
  geom_abline(intercept = log10(r), lty=2, col="darkgrey") +
  geom_point(aes(color=region), size = 3) +
  geom_text_repel(nudge_x = 0.005) +
  scale_x_log10("Populations in millions (log scale)") +
  scale_y_log10("Total number of murders (log scale)") +
  ggtitle("US Gun Murders in 2010") +
  scale_color_discrete(name="Region")

Gapminder Dataset

This dataset includes health and income outcomes for 184 countries from 1960 to 2016. It also includes two character vectors, OECD and OPEC, with the names of OECD and OPEC countries from 2016.

Name the regions: The West, East Asia, Latin America, Sub-Saharan Africa, and Others

data("gapminder")

west <- c("Western Europe","Northern Europe","Southern Europe",
          "Northern America","Australia and New Zealand")

gapminder <- gapminder %>%
  mutate(group = case_when(
    region %in% west ~ "The West",
    region %in% c("Eastern Asia", "South-Eastern Asia") ~ "East Asia",
    region %in% c("Caribbean", "Central America", "South America") ~ "Latin America",
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    TRUE ~ "Others"))
gapminder <- gapminder %>%
  mutate(group = factor(group, levels = rev(c("Others", "Latin America", "East Asia","Sub-Saharan Africa", "The West"))))
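case_when() checks its conditions from top to bottom and uses the first one that is TRUE for each row. The sketch below shows this on a four-row data frame; it assumes dplyr is installed, and the data frame toy is made up for the example.

```r
library(dplyr)

# Four hypothetical rows, one for each branch we want to exercise
toy <- data.frame(
  region = c("Western Europe", "Eastern Asia", "Western Africa", "Northern Africa"),
  continent = c("Europe", "Asia", "Africa", "Africa"),
  stringsAsFactors = FALSE
)

toy <- toy %>%
  mutate(group = case_when(
    region %in% c("Western Europe") ~ "The West",
    region %in% c("Eastern Asia") ~ "East Asia",
    # African regions other than Northern Africa
    continent == "Africa" & region != "Northern Africa" ~ "Sub-Saharan Africa",
    # Fallback for every row that matched nothing above
    TRUE ~ "Others"
  ))

print(toy)
```

Northern Africa fails every specific condition, so it falls through to the final TRUE branch and is labelled “Others”.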

Remove all NA values from “group”, “fertility”, and “life_expectancy” using !is.na, which drops the rows where those values are missing.
Mutate the population to be a value in millions.
Change the theme of the plot.
Use the command geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") to label the two plots at the top right inside the plots (by their years).
Shift the legend to go across the top.

filter(gapminder, year%in%c(1962, 2013) & !is.na(group) &
         !is.na(fertility) & !is.na(life_expectancy)) %>%
  mutate(population_in_millions = population/10^6) %>%
  ggplot( aes(fertility, y=life_expectancy, col = group, size = population_in_millions)) +
  geom_point(alpha = 0.8) +
  guides(size = "none") +
  theme(plot.title = element_blank(), legend.title = element_blank()) +
  coord_cartesian(ylim = c(30, 85)) +
  xlab("Fertility rate (births per woman)") +
  ylab("Life Expectancy") +
  geom_text(aes(x=7, y=82, label=year), cex=12, color="grey") +
  facet_grid(. ~ year) +
  theme(strip.background = element_blank(),
        strip.text.x = element_blank(),
        strip.text.y = element_blank(),
   legend.position = "top")

str(gapminder)

## 'data.frame':    10545 obs. of  10 variables:
##  $ country         : Factor w/ 185 levels "Albania","Algeria",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ year            : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
##  $ infant_mortality: num  115.4 148.2 208 NA 59.9 ...
##  $ life_expectancy : num  62.9 47.5 36 63 65.4 ...
##  $ fertility       : num  6.19 7.65 7.32 4.43 3.11 4.55 4.82 3.45 2.7 5.57 ...
##  $ population      : num  1636054 11124892 5270844 54681 20619075 ...
##  $ gdp             : num  NA 1.38e+10 NA NA 1.08e+11 ...
##  $ continent       : Factor w/ 5 levels "Africa","Americas",..: 4 1 1 2 2 3 2 5 4 3 ...
##  $ region          : Factor w/ 22 levels "Australia and New Zealand",..: 19 11 10 2 15 21 2 1 22 21 ...
##  $ group           : Factor w/ 5 levels "The West","Sub-Saharan Africa",..: 1 5 2 4 4 5 4 1 1 5 ...

Contagious disease data for US states

The next dataset contains yearly counts for Hepatitis A, measles, mumps, pertussis, polio, rubella, and smallpox for US states. Original data courtesy of Tycho Project. Use it to show ways one can plot more than 2 dimensions.

Focus on just measles:

1. Filter out Alaska and Hawaii.
2. Use mutate to compute the rate of measles as count / population * 10,000 * 52 / weeks_reporting, that is, cases per 10,000 people scaled to a full 52-week year.
3. Draw a vertical line at 1963, the year the measles vaccine was introduced.

library(RColorBrewer)
data("us_contagious_diseases")
the_disease <- "Measles"
us_contagious_diseases %>%
  filter(!state%in%c("Hawaii","Alaska") & disease ==  the_disease) %>%
  mutate(rate = count / population * 10000 * 52 / weeks_reporting) %>%
  mutate(state = reorder(state, rate)) %>%
  ggplot(aes(year, state,  fill = rate)) +
  geom_tile(color = "grey50") +
  scale_x_continuous(expand=c(0,0)) +
  scale_fill_gradientn(colors = brewer.pal(9, "Reds"), trans = "sqrt") +
  geom_vline(xintercept=1963, col = "blue") +
  theme_minimal() +  theme(panel.grid = element_blank()) +
  ggtitle(the_disease) +
  ylab("") +
  xlab("")
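As a quick check of the rate formula, here is the same arithmetic with invented numbers (the values below are hypothetical and not taken from the dataset).

```r
# Hypothetical state-year: 120 cases, 2 million people, 26 weeks of reports
count <- 120
population <- 2e6
weeks_reporting <- 26

# Cases per 10,000 people, scaled up to a full 52-week year
rate <- count / population * 10000 * 52 / weeks_reporting

print(rate)
```

Here 120 cases among 2 million people is 0.6 per 10,000. Only half a year was reported, so scaling by 52 / 26 doubles it to 1.2.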

Fivethirtyeight 2016 Poll Data

This data includes poll results from the US 2016 presidential elections aggregated from HuffPost Pollster, RealClearPolitics, polling firms and news reports. The dataset also includes election results (popular vote) and electoral college votes in results_us_election_2016. Use this dataset to explore inference.

Focus on polls for Clinton and Trump after July 2016
Plot a scatterplot of the poll percentage against the enddate
Add a loess smoother

data(polls_us_election_2016)
polls_us_election_2016 %>%
  filter(state == "U.S." & enddate>="2016-07-01") %>%
  select(enddate, pollster, rawpoll_clinton, rawpoll_trump) %>%
  rename(Clinton = rawpoll_clinton, Trump = rawpoll_trump) %>%
  gather(candidate, percentage, -enddate, -pollster) %>% 
  mutate(candidate = factor(candidate, levels = c("Trump","Clinton")))%>%
  group_by(pollster) %>%
  filter(n()>=10) %>%
  ungroup() %>%
  ggplot(aes(enddate, percentage, color = candidate)) +  
  geom_point(show.legend = FALSE, alpha=0.4)  + 
  geom_smooth(method = "loess", span = 0.15) +
  scale_y_continuous(limits = c(30,50))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 22 rows containing non-finite values (stat_smooth).

## Warning: Removed 22 rows containing missing values (geom_point).
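The span argument controls how much of the data each local fit uses. The sketch below illustrates this with base R's loess() on simulated data; the data and the second span value of 0.75 are assumptions chosen for the comparison, not part of the poll analysis.

```r
# Simulated noisy curve (not poll data)
set.seed(1)
x <- 1:100
y <- sin(x / 10) + rnorm(100, sd = 0.2)

# A small span uses only the nearest 15% of points for each local fit
fit_wiggly <- loess(y ~ x, span = 0.15)

# A large span uses 75% of the points, giving a much smoother curve
fit_smooth <- loess(y ~ x, span = 0.75)

# Fitted values at the original x positions
smoothed <- predict(fit_wiggly)

# Compare how closely each curve follows the data
print(sum(residuals(fit_wiggly)^2))
print(sum(residuals(fit_smooth)^2))
```

A small span such as 0.15 lets the curve react to short-term swings in the polls, at the cost of also following some of the noise.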

Working with HTML Widgets and Highcharter

Set your working directory to access your files

# load required packages
library(readr)
library(ggplot2)
library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

library(dplyr)

Make a range of simple charts using the highcharter package

Highcharter is a package within the htmlwidgets framework that connects R to the Highcharts and Highstock JavaScript visualization libraries. For more information, see https://github.com/jbkunst/highcharter/

Also check out this site: https://cran.r-project.org/web/packages/highcharter/vignettes/charting-data-frames.html

Install and load required packages

Now install and load highcharter, plus RColorBrewer, which will make it possible to use ColorBrewer color palettes.

Also load dplyr and readr for loading and processing data.

# install highcharter, RColorBrewer
# install.packages(c("highcharter", "RColorBrewer"))

# load required packages
library(highcharter)

## Warning: package 'highcharter' was built under R version 4.0.2

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Highcharts (www.highcharts.com) is a Highsoft software product which is

## not free for commercial and Governmental use

## 
## Attaching package: 'highcharter'

## The following object is masked from 'package:dslabs':
## 
##     stars

library(RColorBrewer)

Load and process nations data

Load the nations data, and add a column showing GDP in trillions of dollars.

library(dplyr)
library(readr)

nations <- read_csv("nations.csv") %>%
  mutate(gdp_tn = gdp_percap*population/1000000000000)

## Parsed with column specification:
## cols(
##   iso2c = col_character(),
##   iso3c = col_character(),
##   country = col_character(),
##   year = col_double(),
##   gdp_percap = col_double(),
##   population = col_double(),
##   birth_rate = col_double(),
##   neonat_mortal_rate = col_double(),
##   region = col_character(),
##   income = col_character()
## )

Make a version of the “China’s rise” chart from unit 3 assignment

First, prepare the data using dplyr:

# prepare data
big4 <- nations %>%
  filter(iso3c == "CHN" | iso3c == "DEU" | iso3c == "JPN" | iso3c == "USA") %>%
  arrange(year)

The arrange step is important, because highcharter needs the data in order when drawing a time series - otherwise any line drawn through the data will follow the path of the data order, not the correct time order.

Now draw a basic chart with default settings:

# basic symbol-and-line chart, default settings
library(highcharter)
highchart() %>%
  hc_add_series(data = big4,
                   type = "line", hcaes(x = year,
                   y = gdp_tn, 
                   group = country))

## Warning: `parse_quosure()` is deprecated as of rlang 0.2.0.
## Please use `parse_quo()` instead.
## This warning is displayed once per session.

## Warning: `group_by_()` is deprecated as of dplyr 0.7.0.
## Please use `group_by()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `as_data_frame()` is deprecated as of tibble 2.0.0.
## Please use `as_tibble()` instead.
## The signature and semantics have changed, see `?as_tibble`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

## Warning: `rename_()` is deprecated as of dplyr 0.7.0.
## Please use `rename()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.

In the code above, the function highchart() creates a chart.

Clicking on the legend items allows you to remove or add series from the chart.

Highcharts works by adding data “series” to a chart, and from R you can add the variables from a data frame all in one go using the hc_add_series function.

Inside this function we define the data frame to be used, with data, the type of chart, the variables to be mapped to the x and y axes, and the variable to group the data: here this draws a separate line for each country in the data.

Go to the GitHub site provided above for the chart types available in Highcharts.

Now let’s begin customizing the chart.

Use a ColorBrewer palette

Using RColorBrewer, we can set a palette, and then use it in highcharter.

library(RColorBrewer)
# define color palette
cols <- brewer.pal(4, "Set1")

highchart() %>%
  hc_add_series(data = big4,
                   type = "line", hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols)

The first line of code sets a palette with four colors, using the “Set1” palette from ColorBrewer. This is then fed to the function hc_colors() to use those colors on the chart.

Add axis labels

highchart() %>%
  hc_add_series(data = big4,
                   type = "line",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)"))

Change the legend position

For this, we use the hc_legend function.

highchart() %>%
  hc_add_series(data = big4,
                   type = "line",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)")) %>%
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
  hc_legend(align = "right", 
            verticalAlign = "top")

Customize the tooltips

By default we have a tooltip for each series, or line, and the numbers run to many decimal places.

We can change to one tooltip for each year with “shared = TRUE”, and round all the numbers to two decimal places with pointFormat = "{point.country}: {point.gdp_tn:.2f}<br>".

# customize the tooltips

big4_chart <- highchart() %>%
  hc_add_series(data = big4,
                   type = "line",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = country)) %>%
  hc_colors(cols) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)")) %>%
  hc_plotOptions(series = list(marker = list(symbol = "circle"))) %>%
  hc_legend(align = "right", 
            verticalAlign = "top") %>%
  hc_tooltip(shared = TRUE,
             borderColor = "black",
             pointFormat = "{point.country}: {point.gdp_tn:.2f}<br>")
big4_chart

Prepare the data

First, prepare the data using dplyr.

# prepare data
regions <- nations %>%
  group_by(year,region) %>%
  summarize(gdp_tn = sum(gdp_tn, na.rm = TRUE)) %>%
  arrange(year,region)

## `summarise()` regrouping output by 'year' (override with `.groups` argument)
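The group-and-sum step can be seen on a handful of rows. This is a minimal sketch, assuming dplyr 1.0.0 or later is installed; the data frame toy and the region labels "A" and "B" are invented. It also passes .groups = "drop", which the message above mentions, so that summarize() returns an ungrouped result without printing the message.

```r
library(dplyr)

# Hypothetical rows: two countries in region A and one in B for 2000,
# one country in A for 2001; one GDP value is missing
toy <- data.frame(
  year = c(2000, 2000, 2000, 2001),
  region = c("A", "A", "B", "A"),
  gdp_tn = c(1, NA, 2, 3)
)

toy_regions <- toy %>%
  group_by(year, region) %>%
  # na.rm = TRUE skips the missing value instead of returning NA
  summarize(gdp_tn = sum(gdp_tn, na.rm = TRUE), .groups = "drop") %>%
  arrange(year, region)

print(toy_regions)
```

Without na.rm = TRUE, the total for region A in 2000 would be NA, and that year would show a gap in the chart.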

Make an area chart using default options

# basic area chart, default options
highchart () %>%
  hc_add_series(data = regions,
                   type = "area",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = region))

This is an area chart, but the areas are plotted over one another, rather than stacked.

The following code fixes that, and customizes the chart in other ways. It uses the same ColorBrewer palette, with seven colors, that we used in unit 3.

# set color palette
cols <- brewer.pal(7, "Set2")

# stacked area chart
highchart () %>%
  hc_add_series(data = regions,
                   type = "area",
                   hcaes(x = year,
                   y = gdp_tn, 
                   group = region)) %>%
  hc_colors(cols) %>% 
  hc_chart(style = list(fontFamily = "Georgia",
                        fontWeight = "bold")) %>%
  hc_plotOptions(series = list(stacking = "normal",
                               marker = list(enabled = FALSE,
                               states = list(hover = list(enabled = FALSE))),
                               lineWidth = 0.5,
                               lineColor = "white")) %>%
  hc_xAxis(title = list(text="Year")) %>%
  hc_yAxis(title = list(text="GDP ($ trillion)")) %>%
  hc_legend(align = "left",
    verticalAlign = "top",
    layout = "vertical",
    x = 0,
    y = 100) %>%
  hc_tooltip(enabled = FALSE)

We have already encountered the main functions used here. The key changes are in the hc_plotOptions() function:

stacking = “normal” creates the stacked chart. See what happens if you use stacking = “percent”.

lineWidth and lineColor set the width and color for the lines.

Under marker = list(), the code states = list(hover = list(enabled = FALSE)) turns off the hovering effect for each marker on the chart, so that the markers no longer reappear when hovered or tapped.

In the hc_legend() function, layout = “vertical” changes the layout so that the legend items appear in a vertical column.

Food Stamps Data - Combine Two Types

cols <- c("red","black")
food_stamps<- read_csv("food_stamps.csv")

## Parsed with column specification:
## cols(
##   year = col_double(),
##   participants = col_double(),
##   costs = col_double()
## )

highchart() %>%
  hc_yAxis_multiples(
    list(title = list(text = "Participants (millions)")),
    list(title = list(text = "Costs ($ billions)"),
         opposite = TRUE)
  ) %>%
  hc_add_series(data = food_stamps$participants,
                name = "Participants (millions)",
                type = "column",
                yAxis = 0) %>%
  hc_add_series(data = food_stamps$costs,
                name = "Costs ($ billions)",
                type = "line",
                yAxis = 1) %>%
  hc_xAxis(categories = food_stamps$year,
           tickInterval = 5) %>%
  hc_colors(cols) %>%
  hc_chart(style = list(fontFamily = "Georgia",
                        fontWeight = "bold"))

MIDTERM PROJECT

Economic Growth and Life Expectancy – Do Wealthier Countries Live Longer?

LOAD THE LIBRARIES

Both plot_ly() and ggplotly() support key frame animations through the frame argument/aesthetic. They also support an ids argument/aesthetic to ensure smooth transitions between objects with the same id (which helps facilitate object constancy).

library(tidyverse)
library(ggthemes)
library(ggrepel)
library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

LOAD THE DATA

Excerpt from the Gapminder data. The main object in this package is the gapminder data frame or “tibble”. There are other goodies, such as the data in tab delimited form, a larger unfiltered dataset, premade color schemes for the countries and continents, and ISO 3166-1 country codes.

data(gapminder, package = "gapminder")

PLOT THE GRAPH

gg <- ggplot(gapminder, aes(gdpPercap, lifeExp, color = continent)) +
  geom_point(aes(size = pop, frame = year, ids = country)) +
  scale_x_log10() + xlab("GDP Per Capita") + ylab("Life Expectancy") +
  theme(
axis.title.x = element_text(color="blue", size=14, face="bold"),
axis.title.y = element_text(color="#993333", size=14, face="bold"))

## Warning: Ignoring unknown aesthetics: frame, ids

ggplotly(gg)

The graph recreates the animation of how the relationship between GDP per capita and life expectancy evolved over time. As long as a frame variable is provided, an animation is produced with play/pause button(s) and a slider component for controlling the animation. These components can be removed or customized via the animation_button() and animation_slider() functions. Moreover, various animation options, like the amount of time between frames, the smooth transition duration, and the type of transition easing may be altered via the animation_opts() function.

USING HTML Widgets and Highcharter

HTML widgets work just like R plots except they produce interactive web visualizations. A line or two of R code is all it takes to produce a D3 graphic or Leaflet map. HTML widgets can be used at the R console as well as embedded in R Markdown reports and Shiny web applications. hchart is a generic function which takes an object and returns a highcharter object. Several highcharter functions behave like functions in the ggplot2 package:

hchart works like ggplot2’s qplot. hc_add_series works like ggplot2’s geom_* functions. hcaes works like ggplot2’s aes.

Load the libraries

A gapminder data set comes built in with dslabs, but here we load the version from the gapminder package.

# install.packages("dslabs")  # these are data science labs
library("dslabs")

library(tidyverse)
library(gapminder)

## Warning: package 'gapminder' was built under R version 4.0.2

## 
## Attaching package: 'gapminder'

## The following object is masked from 'package:dslabs':
## 
##     gapminder

#library(dslab)
#glimpse(gapminder)

data(gapminder, package = "gapminder")

The function distinct() [dplyr package] can be used to keep only unique/distinct rows from a data frame. The option .keep_all is used to keep all variables in the data.
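As a small illustration of why the rows are sorted before distinct() is called, the sketch below uses a made-up data frame (toy, with countries "A" and "B") and assumes dplyr is installed. distinct() keeps the first row it meets for each country, so sorting by year in descending order means the most recent row is the one kept.

```r
library(dplyr)

# Hypothetical data: two countries, two years each
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year = c(2002, 2007, 2002, 2007),
  pop = c(1, 2, 3, 4)
)

latest <- toy %>%
  arrange(desc(year)) %>%              # newest rows first
  distinct(country, .keep_all = TRUE)  # first row per country, all columns kept

print(latest)
```

Without .keep_all = TRUE the result would contain only the country column.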

gp <- gapminder %>%
  arrange(desc(year)) %>%
  distinct(country, .keep_all = TRUE) 

gp2 <- gapminder %>%
  select(country, year, pop) %>% 
  nest(data = c(year, pop))

gptot <- left_join(gp, gp2, by = "country")

x <- c("Population", "Year", "GDP PerCp", "Life Expectancy", "Continent")
y <- sprintf("{point.%s:.0f}", c("pop", "year", "gdpPercap", "lifeExp", 'continent'))

tltip <- tooltip_table(x, y)

hchart(
  gptot,
  "point",
  hcaes(lifeExp, gdpPercap, name = country, size = pop, group = continent)
  ) %>%
  
#hc_yAxis(type = "logarithmic") %>%
hc_xAxis(title = list(text="Life Expectancy")) %>%
hc_yAxis(title = list(text="GDP Per Capita")) %>%
hc_legend(align = "center",
    verticalAlign = "bottom",
    layout = "horizontal"
    
  ) %>%  
hc_tooltip(
        useHTML = TRUE,
        headerFormat = "<b>{point.key}</b>",
        pointFormat = tltip )

## Warning: `mutate_()` is deprecated as of dplyr 0.7.0.
## Please use `mutate()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
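The tooltip templates above are built with base R's sprintf(). As a small sketch of what that call produces, here it is with only two of the field names (the vector fields is introduced just for this example):

```r
# Two of the column names used in the tooltip
fields <- c("pop", "year")

# %s is replaced by each field name; the rest of the text is kept as written
templates <- sprintf("{point.%s:.0f}", fields)

print(templates)
```

Each resulting string is a Highcharts format placeholder: {point.pop:.0f} is filled in with the pop value of the hovered point, shown with zero decimal places.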

Data 110 Household Debt, DS Labs and Highcharter

Salifou Sylla

6/25/2020
