Preamble: Setting up our data and libraries

knitr::opts_chunk$set(warning = FALSE)
## Vizathon2021_Rvisualizations.R ##
## Written by: Kees Schipper ##
## Created: 2021-01-03 ##
## Last Edited: 2021-01-03 ##

## Data visualizations in R ##
## Table of contents

# 1) Preamble: setting up our data and libraries...plus some useful functions
# 
# 2) Dataviz: the quick and dirty method
# 
# 3) Refining your plots with ggplot2, a layer-based "grammar of graphics"
# 
# 4) Let's add some animation


# Preamble: Setting up our data and libraries -----------------------------

# getting an idea of what our data looks like
# install.packages("tidyverse")
# install.packages("gapminder")
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(gapminder)
library(gganimate)
library(ggdark)

gapminder

names(gapminder) # gives variable names
head(gapminder, n = 10)  # shows first n observations of the data set
tail(gapminder, n = 10)  # shows last n observations of a data set
nrow(gapminder) # number of rows
ncol(gapminder) # number of columns
unique(gapminder$year) # shows the unique values of a variable (here year, then country)
unique(gapminder$country)
n_distinct(gapminder$country) # how many unique values are there?
range(gapminder$lifeExp) # gives min and max of a vector
min(gapminder$lifeExp) # self-explanatory functions
max(gapminder$lifeExp)
mean(gapminder$lifeExp)
median(gapminder$lifeExp)
quantile(gapminder$lifeExp, c(.10, .25, .50, .75, .90))
sum(complete.cases(gapminder)) # how many observations have complete data?
sum(is.na(gapminder$lifeExp)) # how many missing values are there
# is.na() returns a logical TRUE or FALSE for each value, depending on whether it is
# missing. Expressed numerically, TRUE = 1 and FALSE = 0, so summing the result
# gives the number of missing values in a vector.
# complete.cases() works along the same lines, but it tests an entire row of data
# rather than a single value, and returns TRUE when the row has no missing values.
# you can do similar operations with any logical statement (==, >, <, !=...)
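
To make the TRUE = 1 / FALSE = 0 idea concrete, here is a small sketch on made-up data. The vector `x` and the data frame `df` are illustrative only and are not part of gapminder.

```r
# Illustrative data only: 'x' and 'df' are made up for this example
x <- c(4, NA, 7, NA, 10)

is.na(x)        # logical vector: FALSE TRUE FALSE TRUE FALSE
sum(is.na(x))   # number of missing values
mean(is.na(x))  # proportion of missing values

# The same trick works with any logical comparison
sum(x > 5, na.rm = TRUE)  # how many non-missing values exceed 5?

# complete.cases() tests whole rows: TRUE only when no column is missing
df <- data.frame(a = c(1, NA, 3), b = c("u", "v", NA))
complete.cases(df)        # TRUE FALSE FALSE
sum(complete.cases(df))   # number of fully observed rows
```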

summary(gapminder) # provides basic summary statistics of all variables in your dataset

gapminder %>% count(continent, sort = TRUE) # useful function to count frequencies of a value
# equivalent to
count(gapminder, continent, sort = TRUE)

# The Tidyverse uses a pipe operator (%>%), which makes data cleaning code easier to read.
# The pipe passes the object on its left as the first argument of the function on its
# right

# for example, f(x, y) = x %>% f(y)
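
The payoff of the pipe shows up once several steps are chained. Below is a rough sketch comparing a nested call with its piped equivalent; the object names `nested` and `piped` are just for this illustration.

```r
library(dplyr)      # provides %>%, filter() and count()
library(gapminder)

# Nested call: has to be read from the inside out
nested <- head(count(filter(gapminder, year == 2007), continent, sort = TRUE), 3)

# Piped call: reads top to bottom, each result feeds the next function
piped <- gapminder %>%
  filter(year == 2007) %>%
  count(continent, sort = TRUE) %>%
  head(3)

# Check that both forms build the same table
all.equal(nested, piped)
```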

# if you don't know what a function does, use `?` to look up the documentation
# ?names
# ?head
# ?tail
# ?unique
# ?summary
# ?count
# ?nrow
# ?ncol

Dataviz: the quick and dirty method

If you simply want to visualize data for your own understanding, the plot function is perfectly fine!

plot(gapminder$year, gapminder$lifeExp, main = "Life Expectancy by Year")

Fun trick: you can plot a scatterplot matrix of your variables with the following code:

plot(gapminder[4:6]) # the [4:6] specifies that we only want to plot the 4th-6th columns
                     # useful for visualizing the relationship between variables in your
                     # data
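
Column positions are easy to get wrong, so here is a small sketch that picks the same three columns by name instead. The object name `num_vars` is my own; pairs() is the base R function that plot() calls when it is given a data frame.

```r
library(gapminder)

# Select the same three columns by name instead of by position
num_vars <- gapminder[c("lifeExp", "pop", "gdpPercap")]

# pairs() draws the scatterplot matrix that plot() produces for a data frame
pairs(num_vars, main = "Scatterplot matrix of gapminder variables")
```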

to demonstrate some plotting options:

China <- gapminder %>%
  filter(country == "China")

Poland <- gapminder %>%
  filter(country == "Poland")

Morocco <- gapminder %>%
  filter(country == "Morocco")

Brazil <- gapminder %>%
  filter(country == "Brazil")

# this function sets the parameters of our graphing device. If we want to have 4
# graphs in our device at once, we can specify a 2x2 plotting structure using the
# mfrow argument below.
par(mfrow = c(2, 2))

# going through the options..."type" determines how your points are drawn ("p" = points,
# "l" = lines, "o" = both overplotted), col = color (which can be given as a name or as an
# R color code), main is the title of your plot, sub = subtitle of your plot, xlab = x label,
# ylab = y label, lty = line type (try options 0-6). Experiment with the settings of your
# plots to see what you like!
plot(China$year, China$lifeExp, type = "o", col = "green", 
     main = "China's Life Expectancy by year", sub = "Gapminder data set",
     xlab = "Year", ylab = "Life Expectancy (yrs)")

plot(Poland$year, Poland$lifeExp, type = "l", lty = 2, col = "red",
     main = "Poland's Life Expectancy by year")

plot(Morocco$year, Morocco$lifeExp, col = "blue",
     main = "Morocco Life Expectancy")

plot(Brazil$year, Brazil$lifeExp, type = "o", col = "orange",
     main = "Brazil Life Expectancy")

# resets your graphing device to only plot one at a time
# dev.off()
par(mfrow = c(1, 1))

Base R also has some easy plot functions for histograms and boxplots

hist(gapminder$gdpPercap, breaks = 40, main = "Histogram of GDP per capita",
     xlab = "gdp per capita")

dens <- density(gapminder$gdpPercap) # density() computes a kernel density estimate of your variable

# plot your density as a line graph
plot(dens, type = "l", frame = F, col = "firebrick3", main = "density plot of gdp per cap")
polygon(dens, col = "firebrick3") # fills in your density plot

boxplot() creates a boxplot. The ~ creates a formula. In the code below, it means that gdp per capita is plotted separately for each continent. You can also flip the plot to be horizontal by setting horizontal = T.

boxplot(gapminder$gdpPercap ~ gapminder$continent, main = "GDP per capita by continent",
        xlab = "continent", ylab = "GDP per capita", horizontal = F)
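
The same formula can be written without repeating gapminder$ by passing the data frame through the data argument. A minimal sketch of that variant, this time flipped to be horizontal (the object name `bp` is my own):

```r
library(gapminder)

# Formula interface: response ~ grouping variable, both looked up in 'data'
bp <- boxplot(gdpPercap ~ continent, data = gapminder,
              main = "GDP per capita by continent",
              xlab = "GDP per capita", ylab = "continent",
              horizontal = TRUE)  # flip the boxes to run left to right

# boxplot() invisibly returns the statistics it drew, including the group names
bp$names
```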

Say you make a model, and you want to check your model assumptions/check for outliers

# first, build your model
model <- lm(gdpPercap ~ continent + pop + lifeExp, data = gapminder)

# Then, to check model assumptions and screen outliers, use the plot function:
plot(model)

# gives us residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
# (with Cook's distance contours)
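
Since plot(model) steps through several plots one after another, it can help to see them side by side. A small sketch, reusing the par(mfrow = ...) trick from earlier; `old_par` is just a helper name of mine.

```r
library(gapminder)

model <- lm(gdpPercap ~ continent + pop + lifeExp, data = gapminder)

# Show the default diagnostic plots together in a 2x2 grid
old_par <- par(mfrow = c(2, 2))
plot(model)
par(old_par)  # restore the previous layout

# Or draw a single diagnostic; which = 2 is the normal Q-Q plot
plot(model, which = 2)
```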

# you can also access data from your model to plot what your estimates might look like

plot(model$fitted.values)

plot(model$residuals, main = "Model Residuals")

# residuals(model) # functions to get residuals and fitted values from your model
# fitted(model)

abline(h = 0, col = "red") # plots a straight line in your graph. Here we plotted a
                           # red horizontal line to check for residuals around 0

# we should also check how our residuals^2 look, to see if there is any departure from
# homoskedasticity
plot(model$residuals^2, main = "Model Residuals^2") # exponents can be expressed as either x^2 or x**2. Both will work
abline(h = 0, col = "red")

hist(model$residuals, breaks = 50, main = "Distribution of model residuals")

# can plot a simple histogram of the residuals from your model.
# we can see that the residuals are mostly normal, but have some
# outliers that we must examine (some other time)

I’ve only touched on a little bit of plotting in base R, but if you are interested in more, check out the following link: http://www.sthda.com/english/wiki/r-base-graphs

In addition, here’s a quick cheat sheet on plotting in base R: http://publish.illinois.edu/johnrgallagher/files/2015/10/BaseGraphicsCheatsheet.pdf

Refining your plots with ggplot2, a layer-based “grammar of graphics”

ggplot2 is an R package in the tidyverse, which we loaded at the beginning of this script. Each ggplot has 3 main elements: a base (ggplot()), one or more geoms (like geom_boxplot, geom_point, geom_smooth, or geom_area), and a theme (theme_minimal…). You can also add text, adjust labels, and much more. I’ll only be going through a couple of these capabilities.

Side note: pieces of a ggplot are connected by “+”. This reflects that ggplot is a layered system, built by adding pieces on top of one another. Later layers are drawn on top of earlier ones, which you can use to show some pieces of a visualization more prominently than others.

simple scatterplot of gdp per cap by year

ggplot(data = gapminder, aes(x = year, y = gdpPercap)) +
  geom_point(aes(col = continent)) +
  labs(title = "GDP Per Capita By Year", caption = "gapminder project") +
  ylab("GDP per Capita") +
  theme_minimal()

Notes on what I just did above: we start with a ggplot object, specifying our dataset. As long as we keep adding to that original ggplot object, our plot knows that we are pulling from that dataset. This makes it easier to refer to variables, as you don’t have to type out gapminder$continent. Instead, you can just type continent.

In addition, the aes function in ggplot maps aesthetics of the graph to variables in your dataset. We set year and gdpPercap to be the global x and y values. We then created a point geom and, using the aes() function, specified that we want to color our points by continent (you can customize your colors, but ggplot2 also picks a default palette for you).
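
One thing that trips people up is the difference between mapping a color to a variable (inside aes()) and setting one fixed color (outside aes()). A quick sketch of both; the object names `mapped` and `fixed` and the color "steelblue" are my own choices.

```r
library(ggplot2)
library(gapminder)

# Mapped: color depends on a variable, so it goes inside aes() and gets a legend
mapped <- ggplot(gapminder, aes(x = year, y = gdpPercap)) +
  geom_point(aes(col = continent))

# Set: one fixed color for every point, so it goes outside aes() and has no legend
fixed <- ggplot(gapminder, aes(x = year, y = gdpPercap)) +
  geom_point(col = "steelblue")

mapped
fixed
```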

labs specifies your title and caption, while ylab and xlab, when used, specify the y and x axis labels. The final piece is a theme. There are many themes, such as theme_minimal(), theme_dark(), theme_classic(), etc…

you can store ggplot objects as a variable so you don’t have to keep specifying global options

base <- ggplot(data = gapminder, aes(x = year, y = gdpPercap))

## can do the same with a smoother plot
base +
  geom_smooth(aes(col = continent), se = F, method = "loess", size = 2, span = 0.3) +
  labs(title = "GDP Per Capita by Year", caption = "gapminder project") +
  ylab("GDP per Capita") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

geom_smooth fits a smoothed curve to the data using a given method and, for loess, a window (span). The three most used methods are loess, lm (linear model), and gam (generalized additive model). Loess is very useful in itself; as the span gets smaller, the line gets more jagged.
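
To see the effect of span and method for yourself, here is a rough sketch on life expectancy by year. The object names and the particular span values are my own picks for illustration; loess may print warnings here because year only takes a handful of distinct values.

```r
library(ggplot2)
library(gapminder)

smooth_base <- ggplot(data = gapminder, aes(x = year, y = lifeExp))

# Small span: each local fit uses fewer points, so the curve follows the data closely
wiggly <- smooth_base + geom_smooth(method = "loess", span = 0.3, se = FALSE)

# Large span: each local fit uses most of the data, so the curve is smoother
smooth <- smooth_base + geom_smooth(method = "loess", span = 1, se = FALSE)

# Straight-line fit, as in a simple linear regression of lifeExp on year
linear <- smooth_base + geom_smooth(method = "lm", se = FALSE)

linear
```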

as mentioned, ggplot2 is a layer-based graphics package. Because of this, you can use multiple geoms in one plot

base + 
  geom_point(aes(col = continent)) +
  geom_smooth(aes(col = continent), se = F, method = "loess", span = 0.3) +
  labs(title = "Scatterplot and smoothers of gdp per capita by year",
       caption = "split by continent") +
  ylab("gdp per capita")

## `geom_smooth()` using formula 'y ~ x'

# se is a standard error option. See for yourself what happens when you switch the 
# "se" option to T or TRUE
# also, try using different smoothing methods. lm is great for seeing what your linear
# model might look like in a simple linear regression

## just to show off some other geoms you could use

ggplot(data = gapminder) +
  geom_boxplot(aes(x = continent, y = gdpPercap), fill = "black", color = "blue") +
  geom_jitter(aes(x = continent, y = gdpPercap), color = "red", alpha = 0.10) +
  labs(title = "Boxplot with Jitters")

ggplot(data = gapminder) +
  geom_violin(aes(x = continent, y = gdpPercap, fill = continent), scale = "width") +
  labs(title = "Violin plot showing distributions within a group")

# gapminder %>% filter(gdpPercap < 40000) %>%
ggplot(data = gapminder) +
  geom_histogram(aes(x = gdpPercap, fill = continent), bins = 10, position = "stack") +
  labs(title = "histograms split by continent", 
       subtitle = "split by facet but can also be stacked or fill") +
  facet_wrap(~continent) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90)) +
  scale_fill_brewer(palette="Set1")

# check out http://www.sthda.com/english/wiki/ggplot2-colors-how-to-change-colors-automatically-and-manually
# for how to set your own color palette in a ggplot.


ggplot(data = gapminder) +
  geom_freqpoly(aes(x = gdpPercap, col = continent), bins = 50, size = 2) +
  labs(title = "frequency polygons split by continent", 
       subtitle = "much easier to compare distributions") +
  theme_minimal() +
  scale_color_manual(values = colors()[c(464, 419, 20, 153, 26)]) # pick colors by their position in colors()

find some colors that you like here: https://www.statmethods.net/advgraphs/parameters.html
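
A small sketch of turning numbers from a color chart into color names, assuming the numeric codes are positions in R's built-in colors() vector. The name `my_cols` is my own.

```r
# colors() returns the names of R's built-in colors as a character vector
length(colors())   # 657 names are available
head(colors(), 5)  # peek at the first few names

# Pick colors by position, giving a character vector of color names
my_cols <- colors()[c(464, 419, 20, 153, 26)]
my_cols

# Quick preview of the chosen colors as labelled bars
barplot(rep(1, 5), col = my_cols, names.arg = my_cols, las = 2, yaxt = "n")
```

A character vector like my_cols can be passed to the values argument of scale_color_manual().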

Other interesting geometries: geom_area(), geom_density(), geom_bar(), geom_bin2d(), geom_errorbar(), and geom_map(). Yes, you can even make maps with ggplot, though I have not gone into depth on that topic.

Here’s a cheat sheet showing all of the most frequently used geoms, and some syntax on how to use geoms: https://rstudio.com/wp-content/uploads/2016/11/ggplot2-cheatsheet-2.1.pdf

Let’s add some animation!

Animation uses the gganimate package, which we loaded earlier. gganimate is an extension of ggplot in which you use an extra variable as your “frame” variable. The frame variable is passed to transition_time().

x <- ggplot(data = gapminder) +
  geom_point(aes(x = gdpPercap, y = lifeExp, size = pop, col = continent), alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  xlim(0, 42000) +
  dark_mode() +
  transition_time(year) +
  labs(subtitle = 'Year: {frame_time}', x = "GDP Per Capita", y = "life expectancy") +
  ease_aes('linear') + # or 'cubic-in-out' 'elastic' 'quartic' 'quintic'...
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

## Inverted geom defaults of fill and color/colour.
## To change them back, use invert_geom_defaults().

x
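
Printing the plot renders the animation with default settings. Below is a hedged sketch of controlling the rendering and saving the result. It assumes the gifski package is installed so that a GIF can be written; the object names and the file name are my own examples.

```r
library(ggplot2)
library(gganimate)
library(gapminder)

# A pared-down version of the animated plot
anim_plot <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, col = continent), alpha = 0.5) +
  transition_time(year) +
  labs(subtitle = "Year: {frame_time}")

# animate() controls the rendering; fewer frames render faster
anim <- animate(anim_plot, nframes = 50, fps = 10)

# Write the rendered animation to disk (example file name)
anim_save("gapminder_animation.gif", animation = anim)
```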

ggplot(data = gapminder) +
  geom_point(aes(x = gdpPercap, y = lifeExp, size = pop, col = continent), alpha = 0.5) +
  scale_size_continuous(range = c(1, 10)) +
  xlim(0, 42000) +
  dark_mode() +
  transition_time(year) +
  labs(subtitle = 'Year: {frame_time}', x = "GDP Per Capita", y = "life expectancy") +
  ease_aes('linear') +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~continent)

Note: these animations come from the R graph gallery. I merely want to give a feel for these tools and show that they are available in R.

for a basic vignette of gganimate, follow this link: https://cran.r-project.org/web/packages/gganimate/vignettes/gganimate.html

for more intense (and honestly better) graphs, check out the R graph gallery! https://www.r-graph-gallery.com/

If you’re just looking for data to play around with

library(tidytuesdayR)

coffee <- tidytuesdayR::tt_load(2020, week = 28)[[1]]

## --- Compiling #TidyTuesday Information for 2020-07-07 ----

## --- There is 1 file available ---

## --- Starting Download ---

## 
##  Downloading file 1 of 1: `coffee_ratings.csv`

## --- Download complete ---

head(coffee)

## # A tibble: 6 x 43
##   total_cup_points species owner country_of_orig~ farm_name lot_number mill 
##              <dbl> <chr>   <chr> <chr>            <chr>     <chr>      <chr>
## 1             90.6 Arabica meta~ Ethiopia         "metad p~ <NA>       meta~
## 2             89.9 Arabica meta~ Ethiopia         "metad p~ <NA>       meta~
## 3             89.8 Arabica grou~ Guatemala        "san mar~ <NA>       <NA> 
## 4             89   Arabica yidn~ Ethiopia         "yidneka~ <NA>       wole~
## 5             88.8 Arabica meta~ Ethiopia         "metad p~ <NA>       meta~
## 6             88.8 Arabica ji-a~ Brazil            <NA>     <NA>       <NA> 
## # ... with 36 more variables: ico_number <chr>, company <chr>, altitude <chr>,
## #   region <chr>, producer <chr>, number_of_bags <dbl>, bag_weight <chr>,
## #   in_country_partner <chr>, harvest_year <chr>, grading_date <chr>,
## #   owner_1 <chr>, variety <chr>, processing_method <chr>, aroma <dbl>,
## #   flavor <dbl>, aftertaste <dbl>, acidity <dbl>, body <dbl>, balance <dbl>,
## #   uniformity <dbl>, clean_cup <dbl>, sweetness <dbl>, cupper_points <dbl>,
## #   moisture <dbl>, category_one_defects <dbl>, quakers <dbl>, color <chr>,
## #   category_two_defects <dbl>, expiration <chr>, certification_body <chr>,
## #   certification_address <chr>, certification_contact <chr>,
## #   unit_of_measurement <chr>, altitude_low_meters <dbl>,
## #   altitude_high_meters <dbl>, altitude_mean_meters <dbl>

## to explore other datasets, use the tt_load function, with the first argument
## being the year you want to look (2018 onwards I think), and the second argument being
## the week of the dataset you want to look at. For 2020, there should be a dataset
## for every week, so explore weeks 1-52!

big_mac <- tt_load(2020, week = 52)[[1]]

## --- Compiling #TidyTuesday Information for 2020-12-22 ----

## --- There is 1 file available ---

## --- Starting Download ---

## 
##  Downloading file 1 of 1: `big-mac.csv`

## --- Download complete ---

head(big_mac)

## # A tibble: 6 x 19
##   date       iso_a3 currency_code name  local_price dollar_ex dollar_price
##   <date>     <chr>  <chr>         <chr>       <dbl>     <dbl>        <dbl>
## 1 2000-04-01 ARG    ARS           Arge~        2.5       1            2.5 
## 2 2000-04-01 AUS    AUD           Aust~        2.59      1.68         1.54
## 3 2000-04-01 BRA    BRL           Braz~        2.95      1.79         1.65
## 4 2000-04-01 CAN    CAD           Cana~        2.85      1.47         1.94
## 5 2000-04-01 CHE    CHF           Swit~        5.9       1.7          3.47
## 6 2000-04-01 CHL    CLP           Chile     1260       514            2.45
## # ... with 12 more variables: usd_raw <dbl>, eur_raw <dbl>, gbp_raw <dbl>,
## #   jpy_raw <dbl>, cny_raw <dbl>, gdp_dollar <dbl>, adj_price <dbl>,
## #   usd_adjusted <dbl>, eur_adjusted <dbl>, gbp_adjusted <dbl>,
## #   jpy_adjusted <dbl>, cny_adjusted <dbl>

# There is so much more to R data visualizations. You can plot in 3d, make maps, make 
# your visualizations interactive (Highcharter, leaflet, and plotly). The geoms that I showed
# here are only some of the ggplot tools that are available. There are also many excellent
# viz packages outside of ggplot. The R graph gallery is the best place to find inspiration
# for telling your data story!

# https://www.r-graph-gallery.com/interactive-charts.html

# if you want to check out some interactive graphs from highcharter, look here:
# https://jkunst.com/highcharter/

If you want more pointers on programming, there are two excellent books freely available online:

R for Data Science

https://r4ds.had.co.nz/

Advanced R (for those willing to go deeper into R)

https://adv-r.hadley.nz/

There is another book entirely about using ggplot:

ggplot2

https://ggplot2-book.org/

All of these books are written or co-written by Hadley Wickham, who is very prominent in the R field

Vizathon 2021 Intro to Data Viz with R

Kees Schipper

1/5/2021