Part 3 - Visualization with ggplot2

October 13, 2018

Questions from last week

Scientific notation is annoying. How do you get rid of it?
- options(scipen=999)
Is there a shortcut for clearing the output in R Notebook?
- I didn’t find one, but let me know if you do!
What do the # values do in your code?
- Use the '#' to comment code in R
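
As a quick illustration of the scientific notation answer, the sketch below formats one large number before and after changing scipen. The value 1e10 is an arbitrary example, and the original setting is restored at the end.

```r
# Remember the current setting so it can be restored afterwards
old <- options(scipen = 0)   # 0 is R's default penalty

sci <- format(1e10)          # scientific notation is shorter, so R uses it

options(scipen = 999)        # heavily penalise scientific notation
fixed <- format(1e10)        # the same number, now written out in full

print(sci)
print(fixed)

options(old)                 # put the original setting back
```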

ggplot2

The standard tool for plotting data in R
Part of the tidyverse
Virtually no limits to plot types
Add-on packages available
Plotting resources

Plot types

What we will learn

Requests to learn another plot type?

Plot types

1 discrete variable (plus other optional discrete and/or continuous variables)

1 continuous variable

Plot types

2 continuous variables

1 continuous variable + date variable

Let’s make sure you’re able to plot your data

library(tidyverse)
mpg %>% ggplot() + geom_point(aes(displ, hwy))

Let’s make sure you’re able to plot your data

If you see a plot, you’re ready to go!
If you do not, reinstall tidyverse and re-run the test code

install.packages('tidyverse')
library(tidyverse)
mpg %>% ggplot() + geom_point(aes(displ, hwy))

If it still didn’t work, install ggplot2 and re-run the test code

install.packages('ggplot2')
library(ggplot2)
mpg %>% ggplot() + geom_point(aes(displ, hwy))

And, read your data

library(tidyverse)
crime <- read_csv('https://goo.gl/FHW2Ni') %>% as.data.frame()
candi <- read_csv('https://goo.gl/GTRqZs') %>% as.data.frame()

How to plot data

Pick your data
Pick your chart type (Geom)
Set chart titles, axis names, axis types, and legends (Scales)
Make it pretty (Themes)
Keep tuning and reuse code
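
The steps above can be sketched end to end on the built-in mpg dataset (the same data used in the test code). The labels and the choice of theme_minimal() are only illustrative, not a recommendation.

```r
library(ggplot2)

# 1. Pick your data: mpg ships with ggplot2
# 2. Pick your chart type: a scatter plot via geom_point()
p <- ggplot(mpg, aes(displ, hwy)) +
  geom_point() +
  # 3. Chart title and axis names
  labs(
    title = 'Highway mileage by engine size'
    , x = 'Engine displacement (litres)'
    , y = 'Highway mileage (mpg)'
  ) +
  # 4. Make it pretty with a pre-built theme
  theme_minimal()

# 5. Keep tuning: the plot is stored in p, so more layers can be added later
print(p)
```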

ggplot basics

Geoms

Every plot needs a ggplot() function and a geom layer
Separate layers and other plotting functions with +

ggplot() + geom_bar() # create bar and stacked bar plots
ggplot() + geom_histogram() # create histograms
ggplot() + geom_point() # create scatter plots
ggplot() + geom_line() # create line plots

Your data is an argument in the ggplot() function

candi %>% ggplot() + geom_bar()
ggplot(candi) + geom_bar()

Geoms

Map your data to the plot with aes()
- aes() indicates the variables that affect the chart aesthetics
- aes() can be an argument in ggplot() or the geom
- Many arguments in the aes() function
- Not necessary to name x and y in aes()

candi %>% ggplot() + geom_bar(aes(primary_general))
candi %>% ggplot(aes(primary_general)) + geom_bar()

aes(x = NULL, y = NULL, color = NULL
  , fill = NULL, alpha = NULL, label = NULL
  , shape = NULL, size = NULL, group = NULL
  , linetype = NULL
  )
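
To check that the two aes() placements are interchangeable for a single geom, this sketch builds the same bar chart both ways and compares the bar heights. It uses the built-in mpg data, with class standing in for primary_general.

```r
library(ggplot2)

p_global <- ggplot(mpg, aes(class)) + geom_bar()   # aes() in ggplot()
p_layer  <- ggplot(mpg) + geom_bar(aes(class))     # aes() in the geom

# layer_data() returns the data ggplot2 computed for a layer,
# here one row per bar with its height in the count column
same_heights <- identical(
  layer_data(p_global)$count
  , layer_data(p_layer)$count
)
print(same_heights)
```

The placement starts to matter once a plot has several geoms: a mapping given in ggplot() is inherited by every layer, while a mapping given in a geom applies to that layer only.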

Geoms

candi %>% ggplot(aes(primary_general)) + geom_bar()
candi %>% ggplot() + geom_bar(aes(primary_general))

Geoms

candi %>% ggplot(aes(primary_general, fill = primary_general)) + geom_bar()

Geoms

candi %>% ggplot(aes(primary_general, fill = type)) + geom_bar()

Exercise - 7 minutes

Build a bar chart with crime data that shows how many incidents occurred in each sector
Only include these sector values: c('B', 'E', 'D', 'R', 'O', 'C', 'K')
Color the bars using crime_subcategory
Only include these crime_subcategory values: c('ROBBERY-STREET', 'THEFT-BICYCLE', 'AGGRAVATED ASSAULT', 'TRESPASS')

Hint

# Use to include/exclude values
filter()
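
A minimal sketch of including and excluding values with filter() and %in%, shown on the built-in mpg data so it does not give away the exercise. The column values used here come from mpg, not from crime.

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg example data

kept <- mpg %>%
  filter(
    class %in% c('compact', 'suv') &   # keep rows whose class is in this list
    ! drv %in% c('r')                  # drop rows whose drv is in this list
  )

table(kept$class)

# The same pattern removes missing values
x <- c('a', NA, 'b')
not_missing <- ! x %in% NA             # equivalent to !is.na(x)
print(not_missing)
```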

Exercise - 7 minutes

Build a bar chart with crime data that shows how many incidents occurred in each sector
Only include these sector values: c('B', 'E', 'D', 'R', 'O', 'C', 'K')
Color the bars using crime_subcategory
Only include these crime_subcategory values: c('ROBBERY-STREET', 'THEFT-BICYCLE', 'AGGRAVATED ASSAULT', 'TRESPASS')

Hint

crime %>%
  filter() %>%
  ggplot()

Exercise - 7 minutes

crime %>% 
  filter(
    sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K') &
    crime_subcategory %in% c('ROBBERY-STREET', 'THEFT-BICYCLE'
                             , 'AGGRAVATED ASSAULT', 'TRESPASS')
    ) %>%
  ggplot(aes(sector, fill = crime_subcategory)) + 
  geom_bar()

Scales

Axis, legend, and chart titles

Axis names

Keep names short-ish
Consider including units in axis name
It is possible you don’t need an axis name

2 options

# Option 1
labs(x = 'X Axis Title', y = 'Y Axis Title')

# Option 2
xlab('X Axis Title')
ylab('Y Axis Title')

Axis, legend, and chart titles

candi %>% 
  filter(! primary_general %in% NA) %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(x = 'Election Type', y = 'Donations (#)')

Axis, legend, and chart titles

Try adding axis names to the sector plot you made during the exercise

crime %>% 
  filter(
    sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K') &
    crime_subcategory %in% c('ROBBERY-STREET', 'THEFT-BICYCLE'
                             , 'AGGRAVATED ASSAULT', 'TRESPASS')
    ) %>%
  ggplot(aes(sector, fill = crime_subcategory)) + geom_bar() + 
  labs(x = 'District Sectors', y = 'Incidences by Event\nClearance Group (#)')

Axis, legend, and chart titles

Legend names

Keep names short
Be mindful of what the legend is for
- Color? Fill? Size? etc.
Consider hiding the legend title for a clean look

labs(fill = '')
labs(fill = element_blank())
labs(fill = NULL)
labs(colour = 'Check out these colors')

Axis, legend, and chart titles

crime %>% 
  filter(
    sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K') &
    crime_subcategory %in% c('ROBBERY-STREET', 'THEFT-BICYCLE'
                             , 'AGGRAVATED ASSAULT', 'TRESPASS')
    ) %>%
  ggplot(aes(sector, fill = crime_subcategory)) + geom_bar() + 
  labs(x = 'District Sectors', y = 'Incidences by Event\nClearance Group (#)', fill = 'Check out these colors')

Axis, legend, and chart titles

Try removing the legend name from your sector plot

crime %>% 
  filter(
    sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K') &
    crime_subcategory %in% c('ROBBERY-STREET', 'THEFT-BICYCLE'
                             , 'AGGRAVATED ASSAULT', 'TRESPASS')
    ) %>%
  ggplot(aes(sector, fill = crime_subcategory)) + geom_bar() + 
  labs(x = 'District Sectors', y = 'Incidences by Event\nClearance Group (#)'
       , fill = element_blank())

Axis, legend, and chart titles

Chart titles and subtitles

Consider making titles (or subtitles) descriptive
Don’t overuse subtitles

# Option 1
labs(title = NULL, subtitle = NULL)

# Option 2
ggtitle(label = NULL, subtitle = NULL) # the first argument of ggtitle() is named label, not title
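
A small sketch of the two title options side by side, using the built-in mpg data. The title and subtitle strings are placeholders.

```r
library(ggplot2)

base <- ggplot(mpg, aes(class)) + geom_bar()

# Option 1: labs() sets the title together with any other labels
p_labs <- base +
  labs(title = 'Cars by class', subtitle = 'Built-in mpg dataset')

# Option 2: ggtitle() takes the title as its first argument
p_ggtitle <- base +
  ggtitle('Cars by class', subtitle = 'Built-in mpg dataset')

print(p_labs)
print(p_ggtitle)
```

Option 1 is convenient when axis and legend names are set in the same call.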

Axis, legend, and chart titles

candi %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(
    x = 'Election Type'
    , y = 'Donations (#)'
    , fill = element_blank()
    , title = 'Number of Donations by Election Type'
    , subtitle = 'Over 6,000 full election cycle donations'
    )

Axis, legend, and chart titles

Try adding a title to the sector plot

Axis units

Change axis units
This is a good idea and you should do it when it’s relevant
Good for x and y axes

install.packages('scales')
library(scales)

scale_y_continuous(labels = percent) # Multiply by 100 and add a percentage sign (expects proportions such as 0.25)
scale_y_continuous(labels = dollar) # Add a dollar sign to numbers on axis
scale_y_continuous(labels = comma) # Add a comma to numbers on axis
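
The label functions can also be called directly on numbers, which is a handy way to preview what an axis will show. The input values below are arbitrary examples.

```r
library(scales)

with_comma   <- comma(1234567)               # thousands separators
with_dollar  <- dollar(1500)                 # dollar sign plus separators
with_percent <- percent(0.25, accuracy = 1)  # multiplies by 100, then adds the sign

print(c(with_comma, with_dollar, with_percent))
```

Because percent() multiplies by 100, it suits proportions (0.25) rather than values that are already percentages (25).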

Axis units

candi %>% 
  filter(! primary_general %in% NA) %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(
    x = 'Election Type'
    , y = 'Donations (#)'
    , fill = element_blank()
    , title = 'Number of Donations by Election Type'
    , subtitle = 'Over 6,000 full election cycle donations'
    ) + 
  scale_y_continuous(labels = comma)

Axis units

Themes

Change layer and background colors
Change fonts
Change plot borders/boundaries and ticks
Pre-built themes vs. custom themes
- ggplot2 themes
- ggthemes themes

install.packages('ggthemes')
library(ggthemes)

Pre-built themes

Selected ggplot2 themes

theme_classic()
theme_minimal()
theme_dark()

Selected ggthemes themes

theme_stata() + scale_colour_stata() # scale_fill_stata()
theme_economist() + scale_colour_economist() # scale_fill_economist()
theme_fivethirtyeight() + scale_color_fivethirtyeight() # scale_fill_fivethirtyeight()
theme_wsj() + scale_colour_wsj() # scale_fill_wsj()
theme_pander() + scale_colour_pander() # scale_fill_pander()
theme_hc(bgcolor = "darkunica") + scale_colour_hc("darkunica") # scale_fill_hc("darkunica")

Pre-built themes

candi %>% 
  filter(! primary_general %in% NA) %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(
    x = 'Election Type'
    , y = 'Donations (#)'
    , fill = element_blank()
    , title = 'Number of Donations by Election Type'
    , subtitle = 'Over 6,000 full election cycle donations'
    ) + 
  scale_y_continuous(labels = comma) +
  theme_economist() + 
  scale_fill_economist()

Pre-built themes

Economist theme

Pre-built themes

Try adding a theme to the sector plot

Custom themes

You can create custom themes with theme()
Greater control over chart aesthetics
Use custom themes in place of pre-built themes if pre-built themes don’t cut it
Also, use custom themes to augment pre-built themes

Custom themes

t <- theme(
  plot.title = element_text(size=14, face="bold", vjust=1)
  , plot.background = element_blank()
  , panel.grid.major = element_blank()
  , panel.grid.minor = element_blank()
  , panel.border = element_blank()
  , panel.background = element_blank()
  , axis.ticks = element_blank()
  , axis.text = element_text(colour="black", size=12)
  , axis.text.x = element_text(angle=45, hjust=1)
  , legend.title = element_blank()
  , legend.position = "none"
  , legend.text = element_text(size=12)
  )
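
A minimal sketch of augmenting a pre-built theme with a custom one, using the built-in mpg data. The object name my_theme and its settings are illustrative choices.

```r
library(ggplot2)

# A small custom theme stored in an object so it can be reused
my_theme <- theme(
  plot.title = element_text(size = 14, face = 'bold')
  , panel.grid.minor = element_blank()
  , legend.position = 'bottom'
)

p <- ggplot(mpg, aes(class, fill = drv)) +
  geom_bar() +
  labs(title = 'Cars by class and drive train') +
  theme_classic() +   # pre-built theme first
  my_theme            # custom tweaks last

print(p)
```

Order matters here: a complete theme such as theme_classic() replaces any theme() settings added before it, so the custom tweaks go after it.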

Circling back to the Tidyverse to aggregate data with group_by() and summarise()

What you can do with group_by() and summarise()

Count, sum, average, and identify max and min values
Example questions you could answer
- In candi, which type received the most money in campaign donations from 'NON PARTISAN'?
- In candi, which five contributor_state values saw the largest number of cash donations, excluding 'WA' and NA values?
- In candi, what is the largest average donation amount for contributor_state?
- In crime, how many crimes were reported (reported_year) one and two years after the year in which they occurred (occurred_year)?
- In crime, of 'THEFT-ALL OTHER' values in crime_subcategory which primary_offense_description value has the fewest incidences?

How you use group_by() and summarise() to aggregate data

Aggregating changes what one row represents (for example, from one donation to one group of donations)
In group_by(), list the variables by which you want to aggregate data
In summarise(), create a variable and define the variable with an aggregation function
Use %>% to ‘link’ group_by() and summarise()
Aggregation functions
- n(), n_distinct(), sum(), mean(), max(), min(), etc.
Perform multiple aggregations in a single summarise() function
- summarise(total_amount = sum(amount), n = n())
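
The pattern above can be tried on the built-in mpg data before moving to candi. The summary column names (n, avg_hwy, max_hwy) are made up for this sketch.

```r
library(dplyr)
library(ggplot2)  # only needed for the mpg example data

by_class <- mpg %>%
  group_by(class) %>%                    # one group per vehicle class
  summarise(
    n = n()                              # rows in each group
    , avg_hwy = mean(hwy, na.rm = TRUE)  # average highway mileage
    , max_hwy = max(hwy, na.rm = TRUE)   # best highway mileage
  )

print(by_class)
```

The result has one row per class instead of one row per car, which is the change in observation type described above.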

Example
In candi, which type received the most money in donations from 'NON PARTISAN'?

candi %>% group_by(type, party) %>% summarise(dollars = sum(amount, na.rm = TRUE))

How you use group_by() and summarise() to aggregate data

Example
In candi, which type received the most money in donations from 'NON PARTISAN'?

candi %>% 
  group_by(type, party) %>% 
  summarise(dollars = sum(amount, na.rm = TRUE)) %>%
  filter(party %in% 'NON PARTISAN')
## # A tibble: 3 x 3
## # Groups:   type [3]
##   type                party         dollars
##   <chr>               <chr>           <dbl>
## 1 Candidate           NON PARTISAN 1803901.
## 2 Political Committee NON PARTISAN     170.
## 3 <NA>                NON PARTISAN   64202.

How you use group_by() and summarise() to aggregate data

Example
In candi, which five contributor_state values saw the largest number of cash donations (see cash_or_in_kind), excluding 'WA' and NA values?

candi %>% 
  group_by(contributor_state, cash_or_in_kind) %>% 
  summarise(n = n()) %>% 
  filter(cash_or_in_kind %in% 'Cash' & ! contributor_state %in% c('WA', NA)) %>% 
  arrange(desc(n)) %>% # arrange() is a new function you haven't seen yet
  head(5)
## # A tibble: 5 x 3
## # Groups:   contributor_state [5]
##   contributor_state cash_or_in_kind     n
##   <chr>             <chr>           <int>
## 1 CA                Cash              917
## 2 OR                Cash              637
## 3 ID                Cash              296
## 4 NY                Cash              269
## 5 TX                Cash              269

How you use group_by() and summarise()

How would you turn the tabular output data into a chart?

## # A tibble: 5 x 3
## # Groups:   contributor_state [5]
##   contributor_state cash_or_in_kind     n
##   <chr>             <chr>           <int>
## 1 CA                Cash              917
## 2 OR                Cash              637
## 3 ID                Cash              296
## 4 NY                Cash              269
## 5 TX                Cash              269

How you use group_by() and summarise()

How would you turn the tabular output data into a chart?

candi %>% 
  group_by(contributor_state, cash_or_in_kind) %>% 
  summarise(n = n()) %>% 
  filter(cash_or_in_kind %in% 'Cash' & ! contributor_state %in% c('WA', NA)) %>% 
  arrange(desc(n)) %>% 
  head(5) %>% 
  ggplot(aes(reorder(contributor_state, n), n, fill = contributor_state)) + # reorder() is a new function
  geom_bar(stat = "identity") + 
  coord_flip() # coord_flip() is a new function

How you use group_by() and summarise()

How would you turn the tabular output data into a chart?

candi %>% 
  group_by(contributor_state, cash_or_in_kind) %>% 
  summarise(n = n()) %>% 
  filter(cash_or_in_kind %in% 'Cash' & ! contributor_state %in% c('WA', NA)) %>% 
  arrange(desc(n)) %>% 
  head(5) %>% 
  ggplot(aes(reorder(contributor_state, n), n, fill = contributor_state)) + 
  geom_bar(stat = "identity") + 
  coord_flip()

Necessary when data is aggregated

geom_bar(stat = 'identity') # use stat when working with aggregated data
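
As a side note, ggplot2 also provides geom_col(), which draws bars from values already in the data. The sketch below, on aggregated mpg data, checks that it produces the same bar heights as geom_bar(stat = 'identity').

```r
library(dplyr)
library(ggplot2)

# Aggregate first: one row per class with its count
counts <- mpg %>%
  group_by(class) %>%
  summarise(n = n())

p_bar <- ggplot(counts, aes(class, n)) + geom_bar(stat = 'identity')
p_col <- ggplot(counts, aes(class, n)) + geom_col()

# Both layers use the n column as the bar height
same_heights <- identical(layer_data(p_bar)$y, layer_data(p_col)$y)
print(same_heights)
```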

How you use group_by() and summarise()

Example
In candi, which type received the most money in donations in the receipt_year 2015?

candi %>% 
  group_by(receipt_year, type) %>% 
  summarise(total_amount = sum(amount, na.rm = TRUE)) %>%
  filter(receipt_year %in% 2015)
## # A tibble: 3 x 3
## # Groups:   receipt_year [1]
##   receipt_year type                total_amount
##          <int> <chr>                      <dbl>
## 1         2015 Candidate                572504.
## 2         2015 Political Committee      883725.
## 3         2015 <NA>                      59190.

How you use group_by() and summarise()

How would you turn the tabular output data into a chart?

## # A tibble: 3 x 3
## # Groups:   receipt_year [1]
##   receipt_year type                total_amount
##          <int> <chr>                      <dbl>
## 1         2015 Candidate                572504.
## 2         2015 Political Committee      883725.
## 3         2015 <NA>                      59190.

How you use group_by() and summarise()

How would you turn the tabular output data into a chart?

candi %>% 
  group_by(receipt_year, type) %>% 
  summarise(total_amount = sum(amount, na.rm = TRUE)) %>%
  filter(receipt_year %in% 2015) %>%
  ggplot(aes(type, total_amount, fill = type)) +
  geom_bar(stat = 'identity') + 
  theme_classic()

How you use group_by() and summarise()

How would you turn the tabular output data into a chart?

candi %>% 
  group_by(receipt_year, type) %>% 
  summarise(total_amount = sum(amount, na.rm = TRUE)) %>%
  filter(receipt_year %in% 2015 & ! type %in% NA) %>%
  ggplot(aes(type, total_amount, fill = type)) +
  geom_bar(stat = 'identity') + 
  theme_classic()

Exercise - 15 minutes

From the crime dataset, report the mean difference in reported_year and occurred_year values by sector as a bar chart
- The variable that tells us the difference between reported_year and occurred_year should be called year_dif (created with transmute())
- The variable that tells us the mean difference should be called avg_year_dif (created with summarise())
- In your output of mean values by sector, only report sectors that have more than 1000 rows in the dataset
- Also, exclude null sector values in your output
- You’ll use transmute() and filter() as well as group_by() and summarise() to solve this problem

Hint

crime %>% 
  transmute(
    sector
    , year_dif = reported_year - occurred_year
  ) # next, pipe this into group_by() and summarise()

Exercise - 15 minutes

From the crime dataset, report the mean difference in reported_year and occurred_year values by sector as a bar chart
- The variable that tells us the difference between reported_year and occurred_year should be called year_dif (created with transmute())
- The variable that tells us the mean difference should be called avg_year_dif (created with summarise())
- In your output of mean values by sector, only report sectors that have more than 1000 rows in the dataset
- Also, exclude null sector values in your output
- You’ll use transmute() and filter() as well as group_by() and summarise() to solve this problem

crime %>% 
  transmute(
    sector
    , year_dif = reported_year - occurred_year
  ) %>% 
  group_by(sector) %>% 
  summarise(
    avg_year_dif = mean(year_dif, na.rm = TRUE)
    , n = n()
  ) # next, pipe this into filter()

Exercise - 15 minutes

From the crime dataset, report the mean difference in reported_year and occurred_year values by sector as a bar chart
- The variable that tells us the difference between reported_year and occurred_year should be called year_dif (created with transmute())
- The variable that tells us the mean difference should be called avg_year_dif (created with summarise())
- In your output of mean values by sector, only report sectors that have more than 1000 rows in the dataset
- Also, exclude null sector values in your output

crime %>% 
  transmute(
    sector
    , year_dif = reported_year - occurred_year
  ) %>% 
  group_by(sector) %>% 
  summarise(
    avg_year_dif = mean(year_dif, na.rm = TRUE)
    , n = n()
  ) %>% 
  filter(n > 1000 & ! sector %in% NA) # next, pipe this into ggplot()

Exercise - 15 minutes

crime %>% 
  transmute(
    sector
    , year_dif = reported_year - occurred_year
  ) %>% 
  group_by(sector) %>% 
  summarise(
    avg_year_dif = mean(year_dif, na.rm = TRUE)
    , n = n()
  ) %>% 
  filter(n > 1000 & ! sector %in% NA) %>% 
  ggplot(aes(reorder(sector, -avg_year_dif), avg_year_dif)) +
  geom_bar(stat = 'identity') + 
  theme_classic()

Exercise - 15 minutes

Histograms, scatterplots, and line plots

Histograms: 1 continuous variable

candi %>% 
  filter(amount > 0 & amount < 1000) %>%
  ggplot(aes(amount)) + geom_histogram() # with no 'bins' argument, geom_histogram() defaults to 30 bins

Histograms: 1 continuous variable

candi %>% 
  filter(amount > 0 & amount < 1000) %>%
  ggplot(aes(amount)) + geom_histogram(bins = 10) # 'bins' is the argument you use to indicate the number of columns in your histogram

Histograms: 1 continuous variable

candi %>% 
  filter(amount > 0 & amount < 1000) %>%
  ggplot(aes(amount)) + geom_density() # works like geom_histogram(); continuous curve
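
To see what the bins argument does, this sketch builds two histograms of the built-in mpg data and counts the bars in each. The variable hwy stands in for amount.

```r
library(ggplot2)

p_ten    <- ggplot(mpg, aes(hwy)) + geom_histogram(bins = 10)
p_thirty <- ggplot(mpg, aes(hwy)) + geom_histogram(bins = 30)

# layer_data() has one row per bar, including empty ones
n_ten    <- nrow(layer_data(p_ten))
n_thirty <- nrow(layer_data(p_thirty))
print(c(n_ten, n_thirty))

# Every observation lands in exactly one bar
total_ten <- sum(layer_data(p_ten)$count)
print(total_ten)
```

Fewer bins give a smoother but coarser picture; more bins show detail but can look noisy.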

Scatterplots: 2 continuous variables

candi %>% 
  filter(amount > 0 & amount < 1000000) %>%
  ggplot(aes(receipt_date, amount)) + 
  geom_point()

Scatterplots: 2 continuous variables

candi %>% 
  ggplot(aes(receipt_date, amount)) + 
  geom_point()

geom_count() # counts overlapping points and maps the count to point size
geom_jitter() # adds a small amount of random noise to each point to reduce overplotting
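
Both geoms address overplotting, where several observations share the same x and y values. The sketch below shows each on the built-in mpg data; the jitter amounts are arbitrary.

```r
library(ggplot2)

# geom_count(): one point per unique location, sized by how many rows sit there
p_count <- ggplot(mpg, aes(cty, hwy)) + geom_count()

# geom_jitter(): one point per row, each nudged by a little random noise
p_jitter <- ggplot(mpg, aes(cty, hwy)) + geom_jitter(width = 0.3, height = 0.3)

counted <- layer_data(p_count)
print(nrow(counted))    # unique locations, fewer than the rows in mpg
print(sum(counted$n))   # the counts add back up to the rows in mpg
```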

Line plot: 1 continuous variable + date variable

candi %>% 
  filter(amount > 0) %>%
  group_by(receipt_date) %>%
  summarise(amount = sum(amount)) %>% 
  ggplot(aes(x = receipt_date, y = cumsum(amount))) + # cumsum() is a new function!
  geom_line()
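
A toy example of what cumsum() does to daily totals. The donations data frame below is made up for illustration and is not part of candi.

```r
library(dplyr)

# Hypothetical donations: two on the first day, one on each of the next two
donations <- data.frame(
  receipt_date = as.Date(c('2018-01-01', '2018-01-01', '2018-01-02', '2018-01-03'))
  , amount = c(10, 5, 20, 15)
)

running <- donations %>%
  group_by(receipt_date) %>%
  summarise(amount = sum(amount)) %>%      # daily totals: 15, 20, 15
  mutate(running_total = cumsum(amount))   # running totals: 15, 35, 50

print(running)
```

cumsum() depends on row order, so it only makes sense once the rows are sorted by date. Here summarise() returns the groups in date order.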

Exercise - 5 minutes

With either dataset, candi or crime, create a scatterplot or line plot
Include one aes() argument in addition to x and y
- For example, size, shape, color or alpha
Use group_by() and summarise() if you would like

crime %>% 
  filter(! precinct %in% NA) %>%
  ggplot(aes(reported_date, occurred_date, color = precinct)) + 
  geom_point() + 
  theme_classic()

Exercise - 15 minutes

Create a chart from the crime dataset
Create a new variable with mutate() called occurred_time_ampm that tells whether an incident occurred in the AM or PM.
Make sure your axes are labeled correctly
Give your chart a title
Use occurred_time_ampm in the color or fill arguments in aes()
Use a pre-built theme

Exercise - 15 minutes

crime %>% 
  mutate(occurred_time_ampm = ifelse(occurred_time >= 1200, 'PM', 'AM')) %>% 
  filter(
    ! occurred_time_ampm %in% NA & 
    ! crime_subcategory %in% NA
    ) %>%
  group_by(crime_subcategory, occurred_time_ampm) %>%
  summarise(n = n()) %>%
  ungroup() %>%
  group_by(crime_subcategory) %>%
  mutate(total = sum(n, na.rm = TRUE)) %>%
  filter(total >= 100) %>%
  ggplot(aes(reorder(crime_subcategory, n), n, fill = occurred_time_ampm)) + 
  geom_bar(stat = 'identity') + 
  coord_flip() + 
  theme_wsj() + 
  scale_fill_wsj() + # the bars use the fill aesthetic, so the fill scale is the one that applies
  labs(title = 'Most incidents occur\nin the AM', fill = 'Time of Day')

Exercise - 10 minutes

Create a chart from the candi dataset
Make a comparison between groups
Give your chart a title
Use a pre-built theme

Exercise - 10 minutes

candi %>% 
  filter(
    ! party %in% c(NA, 'OTHER', 'NONE') & 
    amount > 0 & 
    receipt_date >= '2008-01-01' & 
    receipt_date <= '2020-12-31'
    ) %>% 
  group_by(party, receipt_date) %>%
  summarise(amount = sum(amount)) %>%
  mutate(amount = cumsum(amount)) %>% # You can see that I use cumsum() in mutate()
  ggplot(aes(x = receipt_date, y = amount, color = party)) + 
  geom_line(size = 1, stat = 'identity') + 
  theme_wsj() + 
  scale_y_continuous(labels = dollar) +
  scale_colour_wsj() + 
  labs(
    title = 'Democrats receive nearly 1\nmillion more in donations than\nnext closest party',
    colour = element_blank()
    )

How to create a heatmap

Heatmap using crime data

crime %>% 
  group_by(neighborhood, crime_subcategory) %>% 
  summarise(count = n()) %>%
  filter(
    neighborhood %in% c('DOWNTOWN COMMERCIAL', 'CAPITOL HILL', 'NORTHGATE', 'UNIVERSITY', 'BALLARD SOUTH', 'BELLTOWN') & 
    crime_subcategory %in% c('CAR PROWL', 'THEFT-ALL OTHER', 'THEFT-SHOPLIFT', 'BURGLARY-RESIDENTIAL', 'MOTOR VEHICLE THEFT')
    ) %>%
  ggplot(aes(x = neighborhood, y = crime_subcategory, fill = count)) +
  geom_tile() + 
  theme_classic() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
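
The same heatmap recipe (count by two discrete variables, then fill tiles by the count) can be tried on the built-in mpg data. The variables class and drv are stand-ins for neighborhood and crime_subcategory.

```r
library(dplyr)
library(ggplot2)

# One row per class and drive train combination, with its count
tiles <- mpg %>%
  group_by(class, drv) %>%
  summarise(count = n()) %>%
  ungroup()

p <- ggplot(tiles, aes(x = class, y = drv, fill = count)) +
  geom_tile() +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p)
```

Combinations that never occur in the data have no row in tiles, so they show up as empty cells in the heatmap.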