Introduction

Course notes from the Data Visualization with ggplot2 (Part 3) course on DataCamp

What's Covered

Part A:

Statistical plots
- Aesthetics review
- box plots, density plots
- multiple groups/variables
Plots for specific data types (Part 1)
- graphics of large data
- Ternary plots
- Network plots
- Diagnostic plots

Part B:

Plots for specific data types (Part 2)
- choropleths
- cartographic maps
- animations

Part C:

ggplot2 internals
- grid graphics, grid graphics in ggplot2
- ggplot objects
- gridExtra
Data Munging and Visualization Case Study
- Bag plot case study, weather case study

Libraries and Data

source("create_datasets.R")
load('data/test_datasets.RData')

library(readr)
library(dplyr)
library(ggplot2)
library(purrr)

library(ggplot2movies)
library(viridis)
library(GGally)
library(ggtern)
library(ggthemes)
library(geomnet)
library(ggmap)
library(ggfortify)

Statistical plots

Introduction

– Refresher (1)

# Create movies_small
# library(ggplot2movies)
set.seed(123)
movies_small <- movies[sample(nrow(movies), 1000), ]
movies_small$rating <- factor(round(movies_small$rating))

# Explore movies_small with str()
str(movies_small)

## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  24 variables:
##  $ title      : chr  "Fair and Worm-er" "Shelf Life" "House: After Five Years of Living" "Three Long Years" ...
##  $ year       : int  1946 2000 1955 2003 1963 1992 1999 1972 1994 1985 ...
##  $ length     : int  7 4 11 76 103 107 87 84 127 94 ...
##  $ budget     : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ rating     : Factor w/ 10 levels "1","2","3","4",..: 7 7 6 8 8 5 4 8 5 5 ...
##  $ votes      : int  16 11 15 11 103 28 105 9 37 28 ...
##  $ r1         : num  0 0 14.5 4.5 4.5 4.5 14.5 0 4.5 4.5 ...
##  $ r2         : num  0 0 0 0 4.5 0 4.5 0 4.5 0 ...
##  $ r3         : num  0 0 4.5 4.5 0 4.5 4.5 0 14.5 4.5 ...
##  $ r4         : num  0 0 4.5 0 4.5 4.5 4.5 0 4.5 14.5 ...
##  $ r5         : num  4.5 4.5 0 0 4.5 0 4.5 14.5 24.5 4.5 ...
##  $ r6         : num  4.5 24.5 34.5 4.5 4.5 0 14.5 0 4.5 14.5 ...
##  $ r7         : num  64.5 4.5 24.5 0 14.5 4.5 14.5 14.5 14.5 14.5 ...
##  $ r8         : num  14.5 24.5 4.5 4.5 14.5 24.5 14.5 24.5 14.5 14.5 ...
##  $ r9         : num  0 0 0 14.5 14.5 24.5 14.5 14.5 4.5 4.5 ...
##  $ r10        : num  14.5 24.5 14.5 44.5 44.5 24.5 14.5 44.5 4.5 24.5 ...
##  $ mpaa       : chr  "" "" "" "" ...
##  $ Action     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Animation  : int  1 0 0 0 0 0 0 0 0 0 ...
##  $ Comedy     : int  1 0 0 1 0 1 1 1 0 0 ...
##  $ Drama      : int  0 0 0 0 1 0 0 0 1 1 ...
##  $ Documentary: int  0 0 1 0 0 0 0 0 0 0 ...
##  $ Romance    : int  0 0 0 0 0 0 1 0 0 0 ...
##  $ Short      : int  1 1 1 0 0 0 0 0 0 0 ...

# Build a scatter plot with mean and 95% CI
ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red") +
  scale_y_log10()

– Refresher (2)

str(diamonds)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

# Reproduce the plot
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
  geom_point(alpha = 0.5, size = 0.5, shape = 16) +
  scale_x_log10(expression(log[10](Carat)), limits = c(0.1, 10)) +
  scale_y_log10(expression(log[10](Price)), limits = c(100, 100000)) +
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic()

– Refresher (3)

# Add smooth layer and facet the plot
ggplot(diamonds, aes(x = carat, y = price, col = color)) +
  stat_smooth(method = "lm") + 
  scale_x_log10(expression(log[10](Carat)), limits = c(0.1,10)) +
  scale_y_log10(expression(log[10](Price)), limits = c(100,100000)) +
  scale_color_brewer(palette = "YlOrRd") +
  coord_equal() +
  theme_classic()

Box Plots

– Transformations

# movies_small is available

# Add a boxplot geom
d <- ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  geom_boxplot() +
  stat_summary(fun.data = "mean_cl_normal",
               geom = "crossbar",
               width = 0.2,
               col = "red")

# Untransformed plot
d

# Transform the scale
d + scale_y_log10()

# Transform the coordinates
## coord_trans is different to scale transformations in that it occurs after statistical transformation and will affect the visual appearance of geoms - there is no guarantee that straight lines will continue to be straight.

## This does not work in my case.
## It is likely caused by the statistics having a zero value.
## d + coord_trans(y = "log10")

# It works fine without the stats layer
ggplot(movies_small, aes(x = rating, y = votes)) +
  geom_point() +
  geom_boxplot() +
  coord_trans(y = "log10")

This is the example from the documentation, which actually works:

# Three ways of doing transformation in ggplot:

#  * by transforming the data
ggplot(diamonds, aes(log10(carat), log10(price))) +
  geom_point()

#  * by transforming the scales
ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

#  * by transforming the coordinate system:
ggplot(diamonds, aes(carat, price)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10")

# The difference between transforming the scales and
# transforming the coordinate system is that scale
# transformation occurs BEFORE statistics, and coordinate
# transformation afterwards.  Coordinate transformation also
# changes the shape of geoms:

d <- subset(diamonds, carat > 0.5)

ggplot(d, aes(carat, price)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_x_log10() +
  scale_y_log10()

ggplot(d, aes(carat, price)) +
  geom_point() +
  geom_smooth(method = "lm") +
  coord_trans(x = "log10", y = "log10")
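A guess at why the crossbar layer failed under coord_trans() earlier in this section, not a confirmed diagnosis: with a coordinate transformation the statistics are computed on the raw values, so the lower confidence limit can land at or below zero, where log10 has no finite value. The sketch below uses a made-up votes vector (not taken from movies_small) and base R's t.test() as a stand-in for the t-based interval that mean_cl_normal is understood to compute.

```r
# Hypothetical, strongly right-skewed vote counts (an assumption for illustration)
votes <- c(5, 8, 12, 20, 3500)

# Scale transformation: values are logged first, then summarised
mean_of_logs <- mean(log10(votes))

# Coordinate transformation: values are summarised first, then the result is logged
log_of_mean <- log10(mean(votes))

# t-based 95% confidence interval on the raw scale
ci <- t.test(votes)$conf.int

mean_of_logs
log_of_mean
ci

# The lower limit is negative here, so it has no finite log10 position
suppressWarnings(log10(ci[1]))
```

With scale_y_log10() the interval would be computed on the logged values instead, which is consistent with that version of the plot drawing without trouble.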

– Cut it up!

# Plot object p
p <- ggplot(diamonds, aes(x = carat, y = price))

# Use cut_interval
p + geom_boxplot(aes(group = cut_interval(carat, n = 10)))

# Use cut_number
p + geom_boxplot(aes(group = cut_number(carat, n = 10)))

# Use cut_width
p + geom_boxplot(aes(group = cut_width(carat, width = 0.25)))

If you only have continuous variables, you can convert them into ordinal variables using any of the following functions:
- cut_interval(x, n) makes n groups from vector x with equal range.
- cut_number(x, n) makes n groups from vector x with (approximately) equal numbers of observations.
- cut_width(x, width) makes groups of width width from vector x.
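
To see how the three functions differ, it helps to tabulate the groups they produce. This is a small sketch on a made-up skewed vector (the seed, the rexp() sample and the choice of four groups are my assumptions, not part of the exercise).

```r
library(ggplot2)

set.seed(123)
x <- rexp(1000)  # hypothetical right-skewed continuous variable

by_range <- cut_interval(x, n = 4)   # 4 bins covering equal ranges
by_count <- cut_number(x, n = 4)     # 4 bins with (about) equal counts
by_width <- cut_width(x, width = 1)  # bins 1 unit wide, as many as needed

table(by_range)
table(by_count)
table(by_width)
```

On skewed data, cut_interval() leaves the upper bins nearly empty, while cut_number() keeps the counts balanced at the cost of unequal bin widths.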

– Understanding quartiles

Notice how the IQR becomes more consistent across methods as the sample size increases.

plot_quart <- function(n) {
  set.seed(123)
  playData <- data.frame(raw.values = rnorm(n, 1, 6))

  quan.summary <- data.frame(t(sapply(1:9, function(x) quantile(playData$raw.values, type = x))))
  names(quan.summary) <- c("Min", "Q1", "Median", "Q3", "Max")
  quan.summary$Type <- as.factor(1:9)

  library(reshape2)
  quan.summary <- melt(quan.summary, id = "Type")
  quan.summary <- list(quartiles = quan.summary, values = playData)

  ggplot(quan.summary$quartiles, aes(x = Type, y = value, col = variable)) +
    geom_point() +
    geom_rug(data = quan.summary$values, aes(y = raw.values), sides = "l", inherit.aes = F)
}

plot_quart(4)

plot_quart(10)

plot_quart(50)

plot_quart(100)
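
The same convergence can be checked numerically. This is a rough sketch that reuses the distribution from plot_quart(); the helper names iqr_by_type() and iqr_spread() are mine, and the extra sample size of 1000 is just there to push the trend further.

```r
# IQR under each of the nine quantile() definitions, for a sample of size n
iqr_by_type <- function(n) {
  set.seed(123)
  x <- rnorm(n, 1, 6)  # same distribution as in plot_quart()
  sapply(1:9, function(type) IQR(x, type = type))
}

# Gap between the largest and smallest IQR across the nine types
iqr_spread <- function(n) diff(range(iqr_by_type(n)))

sapply(c(4, 10, 50, 100, 1000), iqr_spread)
```

A smaller gap means the nine definitions agree more closely on where Q1 and Q3 sit.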

Density Plots

– geom_density()

# test_datasets.RData has been loaded

str(ch1_test_data)

## 'data.frame':    200 obs. of  3 variables:
##  $ norm   : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
##  $ bimodal: num  0.199 -0.688 -2.265 -1.457 -2.414 ...
##  $ uniform: num  -0.117 -0.537 -1.515 -1.812 -0.949 ...

# Calculating density: d
d <- density(ch1_test_data$norm, bw = "nrd0", kernel = "gaussian")

# Use which.max() to calculate mode
mode <- d$x[which.max(d$y)]

# Finish the ggplot call
ggplot(ch1_test_data, aes(x = norm)) +
  geom_density() +
  geom_rug() +
  geom_vline(xintercept = mode, col = "red")

– Combine density plots and histogram

# ch1_test_data is available

# Arguments you'll need later on
fun_args <- list(mean = mean(ch1_test_data$norm), sd = sd(ch1_test_data$norm))

# Finish the ggplot
ggplot(ch1_test_data, aes(x = norm)) + 
  geom_histogram(aes(y = ..density..)) + 
  geom_density(col = "red") + 
  stat_function(
    fun = dnorm, 
    args = fun_args, 
    col = "blue")

– Adjusting density plots

There are three parameters that you may be tempted to adjust in a density plot:

bw - the smoothing bandwidth to be used, see ?density for details
adjust - adjustment of the bandwidth (a multiplier applied to bw), see ?density for details
kernel - kernel used for density estimation, defined as
- “g” = gaussian
- “r” = rectangular
- “t” = triangular
- “e” = epanechnikov
- “b” = biweight
- “c” = cosine
- “o” = optcosine

# small_data is available
small_data <- structure(list(x = c(-3.5, 0, 0.5, 6)), .Names = "x", row.names = c(NA, 
-4L), class = "data.frame")

# Get the bandwidth
get_bw <- density(small_data$x)$bw

# Basic plotting object
p <- ggplot(small_data, aes(x = x)) +
  geom_rug() +
  coord_cartesian(ylim = c(0,0.5))

# Create three plots
p + geom_density()

p + geom_density(adjust = 0.25)

p + geom_density(bw = 0.25 * get_bw)

# Create two plots
## rectangular kernel
p + geom_density(kernel = "r")

## epanechnikov kernel
p + geom_density(kernel = "e")
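
The adjust = 0.25 and bw = 0.25 * get_bw plots above should look the same, because adjust is just a multiplier on the bandwidth. A quick base R check, using the same four values as small_data (the variable names are mine):

```r
x <- c(-3.5, 0, 0.5, 6)  # same values as small_data$x

default_bw <- density(x)$bw                        # bandwidth chosen by "nrd0"
via_adjust <- density(x, adjust = 0.25)$bw         # default bandwidth times 0.25
via_bw     <- density(x, bw = 0.25 * default_bw)$bw

c(default = default_bw, adjust = via_adjust, bw = via_bw)
all.equal(via_adjust, via_bw)
```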

Multiple Groups/Variables

– Box plots with varying width

One way to represent the sample size, n, is to use variable widths for the boxes.

# Finish the plot
ggplot(diamonds, aes(x = cut, y = price, col = color)) + 
  geom_boxplot(varwidth = T) + 
  facet_grid(. ~ color)
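
The ggplot2 help for geom_boxplot() describes varwidth = TRUE as drawing boxes with widths proportional to the square root of the group size. As a rough sketch of what that means for the diamonds data, ignoring the facets (the rel_width name and the scaling to a maximum of 1 are my choices):

```r
library(ggplot2)

# Number of diamonds in each cut category
n_per_cut <- table(diamonds$cut)

# Width proportional to sqrt(n), scaled so the largest group has width 1
rel_width <- sqrt(n_per_cut) / max(sqrt(n_per_cut))

round(rel_width, 2)
```

In the faceted plot the counts are taken within each color panel, so the actual widths will differ from these overall figures.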

– Multiple density plots

# ch1_test_data and ch1_test_data2 are available
str(ch1_test_data)

## 'data.frame':    200 obs. of  3 variables:
##  $ norm   : num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...
##  $ bimodal: num  0.199 -0.688 -2.265 -1.457 -2.414 ...
##  $ uniform: num  -0.117 -0.537 -1.515 -1.812 -0.949 ...

str(ch1_test_data2)

## 'data.frame':    400 obs. of  2 variables:
##  $ dist : Factor w/ 2 levels "norm","bimodal": 1 1 1 1 1 1 1 1 1 1 ...
##  $ value: num  -0.5605 -0.2302 1.5587 0.0705 0.1293 ...

# Plot with ch1_test_data
ggplot(ch1_test_data, aes(x = norm)) +
  geom_rug() + 
  geom_density()

# Plot two distributions with ch1_test_data2
ggplot(ch1_test_data2, aes(x = value, fill = dist, col = dist)) +
  geom_rug(alpha = 0.6) + 
  geom_density(alpha = 0.6)

– Multiple density plots (2)

# Individual densities
ggplot(mammals[mammals$vore == "Insectivore", ], 
    aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# With faceting
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3)) +
  facet_wrap( ~ vore, nrow = 2)

# Note that by default, the x ranges fill the scale
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Trim each density plot individually
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35, trim = T) +
  scale_x_continuous(limits=c(0,24)) +
  coord_cartesian(ylim = c(0, 0.3))

– Weighted density plots

When plotting a single variable, the density plots (and their bandwidths) are calculated separately for each variable (see the plot from the previous exercise, provided).
However, when you compare several variables (such as eating habits) it’s useful to see the density of each subset in relation to the whole data set.

# Unweighted density plot from before
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24)) +
  coord_cartesian(ylim = c(0, 0.3))

# Unweighted violin plot
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin()

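# Calculate weighting measure
library(dplyr)
# Add each group's share of all rows as n; the later plots read it from mammals,
# so the result is stored under both names
mammals2 <- mammals %>%
  group_by(vore) %>%
  mutate(n = n() / nrow(mammals))
mammals <- mammals2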

str(mammals2, give.attr = F)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  76 obs. of  3 variables:
##  $ vore       : Factor w/ 4 levels "Carnivore","Herbivore",..: 1 4 2 4 2 2 1 1 2 2 ...
##  $ sleep_total: num  12.1 17 14.4 14.9 4 14.4 8.7 10.1 3 5.3 ...
##  $ n          : num  0.25 0.263 0.421 0.263 0.421 ...

str(mammals, give.attr = F)

## Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame':  76 obs. of  3 variables:
##  $ vore       : Factor w/ 4 levels "Carnivore","Herbivore",..: 1 4 2 4 2 2 1 1 2 2 ...
##  $ sleep_total: num  12.1 17 14.4 14.9 4 14.4 8.7 10.1 3 5.3 ...
##  $ n          : num  0.25 0.263 0.421 0.263 0.421 ...

# Weighted density plot
## I remove the ylim because the y scale changes here
ggplot(mammals, aes(x = sleep_total, fill = vore)) +
  geom_density(aes(weight = n), col = NA, alpha = 0.35) +
  scale_x_continuous(limits = c(0, 24))

# Weighted violin plot
ggplot(mammals, aes(x = vore, y = sleep_total, fill = vore)) +
  geom_violin(aes(weight = n), col = NA)
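
To make the idea of "density in relation to the whole data set" concrete, here is a base R illustration on made-up data. It is not a description of how geom_density() handles the weight aesthetic internally; the data frame, the group sizes and the area() helper are all assumptions for the sketch.

```r
set.seed(123)
# Hypothetical two-group data with unequal group sizes
dat <- data.frame(
  group = rep(c("a", "b"), times = c(30, 70)),
  value = c(rnorm(30, 8, 2), rnorm(70, 14, 3))
)

groups <- split(dat$value, dat$group)
sizes  <- sapply(groups, length)
share  <- sizes / sum(sizes)  # each group's fraction of all rows

# Approximate the area under a density curve with the trapezoid rule
area <- function(d) sum(diff(d$x) * (head(d$y, -1) + tail(d$y, -1)) / 2)

# Unscaled: each group's curve has area close to 1, whatever its size
unscaled <- sapply(groups, function(v) area(density(v)))

# Scaled by group share: areas now reflect group size and sum to about 1
scaled <- unscaled * share

round(unscaled, 3)
round(scaled, 3)
```

Scaling each curve by its group's share is what lets a small group look small next to a large one.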

– 2D density plots (1)

# Base layers
p <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0, 0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0, 0)) +
  coord_fixed(60 / 4.5)

# 1 - Use geom_density_2d()
p + geom_density_2d()

# 2 - Use stat_density_2d() with arguments
p + stat_density_2d(aes(col = ..level..), h = c(5, 0.5))

– 2D density plots (2)

# Load in the viridis package
library(viridis)

# Add viridis color scale
ggplot(faithful, aes(x = waiting, y = eruptions)) +
  scale_y_continuous(limits = c(1, 5.5), expand = c(0,0)) +
  scale_x_continuous(limits = c(40, 100), expand = c(0,0)) +
  coord_fixed(60/4.5) +
  stat_density_2d(geom = "tile", aes(fill = ..density..), h=c(5,.5), contour = FALSE) +
  scale_fill_viridis()

Plots for specific data types (Part 1)

Graphics of Large Data

– Pair plots and correlation matrices

# pairs
pairs(iris[1:4])

# chart.Correlation
library(PerformanceAnalytics)

chart.Correlation(iris[1:4])

# ggpairs
# library(GGally)

mtcars_fact <- mtcars %>%
  mutate(
    cyl = as.factor(cyl),
    vs = as.factor(vs),
    am = as.factor(am),
    gear = as.factor(gear),
    carb = as.factor(carb)
  )

ggpairs(mtcars_fact[1:3])

– Create a correlation matrix in ggplot2

library(ggplot2)
library(reshape2)

cor_list <- function(x) {
  L <- M <- cor(x)
  
  M[lower.tri(M, diag = TRUE)] <- NA
  M <- melt(M)
  names(M)[3] <- "points"
  
  L[upper.tri(L, diag = TRUE)] <- NA
  L <- melt(L)
  names(L)[3] <- "labels"
  
  merge(M, L)
}

# Calculate xx with cor_list
library(dplyr)

xx <- iris %>%
  group_by(Species) %>%
  do(cor_list(.[1:4])) 

# Finish the plot
ggplot(xx, aes(x = Var1, y = Var2)) +
  geom_point(
    aes(col = points, size = abs(points)), 
    shape = 16
    ) +
  geom_text(
    aes(col = labels, size = abs(labels), label = round(labels, 2))
    ) +
  scale_size(range = c(0, 6)) +
  scale_color_gradient2("r", limits = c(-1, 1)) +
  scale_y_discrete("", limits = rev(levels(xx$Var1))) +
  scale_x_discrete("") +
  guides(size = FALSE) +
  geom_abline(slope = -1, intercept = nlevels(xx$Var1) + 1) +
  coord_fixed() +
  facet_grid(. ~ Species) +
  theme(axis.text.y = element_text(angle = 45, hjust = 1),
        axis.text.x = element_text(angle = 45, hjust = 1),
        strip.background = element_blank())

Ternary Plots

– Proportional/stacked bar plots

# Explore africa
load('data/africa.RData')
africa_sample <- sample_n(africa, 50)
str(africa_sample)

## 'data.frame':    50 obs. of  3 variables:
##  $ Sand: num  42 28 83 40 35 24 29 30 75 92 ...
##  $ Silt: num  15 29 6 14 10 22 20 4 10 6 ...
##  $ Clay: num  43 43 11 46 55 54 51 66 15 2 ...

head(africa_sample)

##       Sand Silt Clay
## 34893   42   15   43
## 85261   28   29   43
## 64201   83    6   11
## 52595   40   14   46
## 44793   35   10   55
## 79175   24   22   54

# Add an ID column from the row.names
africa_sample$ID <- row.names(africa_sample)

# Gather africa_sample
library(tidyr)
africa_sample_tidy <- gather(africa_sample, key, value, -ID)
head(africa_sample_tidy)

##      ID  key value
## 1 34893 Sand    42
## 2 85261 Sand    28
## 3 64201 Sand    83
## 4 52595 Sand    40
## 5 44793 Sand    35
## 6 79175 Sand    24

# Finish the ggplot command
ggplot(africa_sample_tidy, aes(x = factor(ID), y = value, fill = key)) +
  geom_col() +
  coord_flip()

– Producing ternary plots

# The ggtern library is loaded

# Build ternary plot
str(africa)

## 'data.frame':    40093 obs. of  3 variables:
##  $ Sand: num  24 36 56 52 65 43 42 47 57 51 ...
##  $ Silt: num  12 14 18 21 3 14 22 19 15 14 ...
##  $ Clay: num  64 50 26 27 32 43 36 34 28 35 ...

ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_point(shape = 16, alpha = 0.2)

– Adjusting ternary plots

# ggtern and ggplot2 are loaded

# Plot 1
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_density_tern()

# Plot 2
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  stat_density_tern(geom = 'polygon', aes(fill = ..level.., alpha = ..level..)) +
  guides(fill = FALSE)

Just playing around and trying to make my own pretty plot
- I want something like the faithful density but on the tern
- This is a start

## I want to see all the points on there
ggtern(africa, aes(x = Sand, y = Silt, z = Clay)) +
  geom_point(alpha = 0.1, color = "navyblue", size = .5) + 
  stat_density_tern(
    geom = 'polygon', 
    aes(fill = ..level.., alpha = ..level..),
    bins = 100
    ) +
  guides(alpha = FALSE) + 
  scale_fill_viridis()

Network Plots

– Build the network (1)

# Load geomnet & examine structure of madmen
# The geomnet library is loaded
str(madmen)

## List of 2
##  $ edges   :'data.frame':    39 obs. of  2 variables:
##   ..$ Name1: Factor w/ 9 levels "Betty Draper",..: 1 1 2 2 2 2 2 2 2 2 ...
##   ..$ Name2: Factor w/ 39 levels "Abe Drexler",..: 15 31 2 4 5 6 8 9 11 21 ...
##  $ vertices:'data.frame':    45 obs. of  2 variables:
##   ..$ label : Factor w/ 45 levels "Abe Drexler",..: 5 9 16 23 26 32 33 38 39 17 ...
##   ..$ Gender: Factor w/ 2 levels "female","male": 1 2 2 1 2 1 2 2 2 2 ...

## This is a much better way to see what's in each list. Love it. 
## library(purrr)
madmen %>% purrr::map(head)

## $edges
##          Name1            Name2
## 1 Betty Draper    Henry Francis
## 2 Betty Draper       Random guy
## 3   Don Draper          Allison
## 4   Don Draper Bethany Van Nuys
## 5   Don Draper     Betty Draper
## 6   Don Draper   Bobbie Barrett
## 
## $vertices
##           label Gender
## 1  Betty Draper female
## 2    Don Draper   male
## 3   Harry Crane   male
## 4 Joan Holloway female
## 5    Lane Pryce   male
## 6   Peggy Olson female

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Examine structure of mmnet
head(mmnet)

##          Name1            Name2 Gender
## 1 Betty Draper    Henry Francis female
## 2 Betty Draper       Random guy female
## 3   Don Draper          Allison   male
## 4   Don Draper Bethany Van Nuys   male
## 5   Don Draper     Betty Draper   male
## 6   Don Draper   Bobbie Barrett   male

str(mmnet)

## 'data.frame':    75 obs. of  3 variables:
##  $ Name1 : Factor w/ 45 levels "Betty Draper",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ Name2 : Factor w/ 39 levels "Abe Drexler",..: 15 31 2 4 5 6 8 9 11 21 ...
##  $ Gender: Factor w/ 2 levels "female","male": 1 1 2 2 2 2 2 2 2 2 ...

– Build the network (2)

# geomnet is pre-loaded

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

head(mmnet)

##          Name1            Name2 Gender
## 1 Betty Draper    Henry Francis female
## 2 Betty Draper       Random guy female
## 3   Don Draper          Allison   male
## 4   Don Draper Bethany Van Nuys   male
## 5   Don Draper     Betty Draper   male
## 6   Don Draper   Bobbie Barrett   male

# Finish the ggplot command
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
  geom_net(
    aes(col = Gender),
    size = 6,
    linewidth = 1,
    labelon = T,
    fontsize = 3,
    labelcolour = "black")

– Adjusting the network

# geomnet is pre-loaded
# ggmap is already loaded

# Merge edges and vertices
mmnet <- merge(madmen$edges, madmen$vertices,
               by.x = "Name1", by.y = "label",
               all = TRUE)

# Tweak the network plot
ggplot(data = mmnet, aes(from_id = Name1, to_id = Name2)) +
  geom_net(
    aes(col = Gender),
    size = 6,
    linewidth = 1,
    labelon = TRUE,
    fontsize = 3,
    labelcolour = "black",
    directed = T) +
  scale_color_manual(values = c("#FF69B4", "#0099ff")) +
  xlim(c(-0.05, 1.05)) +
  theme_nothing() +
  theme(legend.key = element_blank())

Diagnostic Plots

– Autoplot on linear models

# Create linear model: res
res <- lm(Volume ~ Girth, data = trees)

# Plot res
plot(res)

# Import ggfortify and use autoplot()
# library(ggfortify)
autoplot(res, ncol = 2)

– ggfortify - time series

# ggfortify and Canada are available

# Inspect structure of Canada
str(Canada)

##  Time-Series [1:84, 1:4] from 1980 to 2001: 930 930 930 931 933 ...
##  - attr(*, "dimnames")=List of 2
##   ..$ : NULL
##   ..$ : chr [1:4] "e" "prod" "rw" "U"

head(Canada)

## [1] 929.6105 929.8040 930.3184 931.4277 932.6620 933.5509

# Call plot() on Canada
plot(Canada)

# Call autoplot() on Canada
# autoplot is from the ggfortify library
autoplot(Canada)

– Distance matrices and Multi-Dimensional Scaling (MDS)

The cmdscale() function from the stats package performs Classical Multi-Dimensional Scaling and returns point coordinates as a matrix.
Although autoplot() will work on this object, it will produce a heatmap, and not a scatter plot.
However, if either eig = TRUE, add = TRUE or x.ret = TRUE is specified, cmdscale() will return a list instead of a matrix.
In these cases, the list method for autoplot() in the ggfortify package can deal with the output.
Specifics on multi-dimensional scaling are beyond the scope of this course; details on the method and these arguments can be found in the help page ?cmdscale.

# ggfortify and eurodist are available
str(eurodist)

## Class 'dist'  atomic [1:210] 3313 2963 3175 3339 2762 ...
##   ..- attr(*, "Size")= num 21
##   ..- attr(*, "Labels")= chr [1:21] "Athens" "Barcelona" "Brussels" "Calais" ...

# Autoplot + ggplot2 tweaking
autoplot(eurodist) + 
  coord_fixed()

# Autoplot of MDS
autoplot(cmdscale(eurodist, eig = TRUE), 
         label = TRUE, 
         label.size = 3, 
         size = 0)
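
The matrix-versus-list behaviour described above is easy to confirm in base R before handing anything to autoplot(). A minimal check (the variable names are mine):

```r
mds_matrix <- cmdscale(eurodist)              # default: a plain coordinate matrix
mds_list   <- cmdscale(eurodist, eig = TRUE)  # with eig = TRUE: a list

is.matrix(mds_matrix)
dim(mds_matrix)    # one row per city, k = 2 columns by default
names(mds_list)    # the coordinates are in the "points" element

# The coordinates themselves are the same in both cases
all.equal(mds_matrix, mds_list$points)
```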

– Plotting K-means clustering

You must explicitly pass the original data to the autoplot function via the data argument, since kmeans objects don’t contain the original data.
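This k-means clustering looks wrong: it is very different from what I saw in the DataCamp exercise. I'm not sure why, but kmeans() starts from randomly chosen centers and no seed is set here, so the result can change from run to run.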

# Perform clustering
iris_k <- kmeans(iris[-5], 3)

# Autoplot: color according to cluster
autoplot(iris_k, data = iris, frame = T)

# Autoplot: above, plus shape according to species
autoplot(iris_k, data = iris, frame = T, shape = 'Species')
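
Two things from the notes above can be checked directly: the kmeans object does not carry the original data, and the clustering depends on the random starting centers. This is a sketch only; the seed value and nstart = 20 are my assumptions, not settings from the exercise.

```r
# The fitted object stores clusters and centers, but not the input data
fit <- kmeans(iris[-5], centers = 3)
names(fit)

# Fix the seed for repeatable results; nstart keeps the best of
# several random starts
set.seed(123)
fit_a <- kmeans(iris[-5], centers = 3, nstart = 20)
set.seed(123)
fit_b <- kmeans(iris[-5], centers = 3, nstart = 20)
identical(fit_a$cluster, fit_b$cluster)

# Cross-tabulate clusters against the known species
table(cluster = fit_a$cluster, species = iris$Species)
```

Setting a seed before kmeans() would not necessarily reproduce the DataCamp plot, but it would at least make the result stable between runs.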

Data Visualization with ggplot2 (Part 3-A)

William Surles

2017-08-02
