January 24, 2017

ggplot2

Plot types

What we will learn

Requests to learn another plot type?

Plot types

1 discrete variable

1 continuous variable

Plot types

2 continuous variables

1 continuous variable + date variable

Let’s make sure you’re able to plot your data

library(tidyverse)
mpg %>% ggplot() + geom_point(aes(displ, hwy))

Let’s make sure you’re able to plot your data

  • If you see a plot, you’re ready to go!
  • If you do not, reinstall tidyverse and re-run the test code
install.packages('tidyverse')
library(tidyverse)
mpg %>% ggplot() + geom_point(aes(displ, hwy))
  • If it still didn’t work, install ggplot2 and re-run the test code
install.packages('ggplot2')
library(ggplot2)
mpg %>% ggplot() + geom_point(aes(displ, hwy))

And, read your data…the easy way

library(tidyverse)
donor <- read.csv('https://goo.gl/tm9JQ5')
police <- read.csv('https://goo.gl/nNAuDy')

How to plot data

  1. Pick your data
  2. Pick your chart type (Geom)
  3. Program stuff like chart titles, axis names, axis types, and legends (Scales)
  4. Make it pretty (Themes)
  5. Keep tuning and reuse code
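
The five steps above can be sketched in one short pipeline. This is only an illustration: it uses the mpg dataset that ships with ggplot2 rather than the workshop data, and the labels and theme are arbitrary choices.

```r
library(ggplot2)

# 1. Pick your data: mpg ships with ggplot2
p <- ggplot(mpg, aes(class, fill = drv)) +
  # 2. Pick your chart type (geom)
  geom_bar() +
  # 3. Program titles, axis names, and the legend name
  labs(x = 'Vehicle Class', y = 'Cars (#)', fill = 'Drive', title = 'Cars by Class') +
  # 4. Make it pretty (theme)
  theme_minimal()

# 5. Keep tuning: p is a regular object, so you can add layers to it and reuse it
p
```

Storing the plot in an object such as p is optional, but it makes step 5 easier because later tweaks are added to p instead of retyping the whole pipeline.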

ggplot basics

Geoms

  • Every plot needs a ggplot() function and a geom layer
  • Separate layers and other plotting functions with +
ggplot() + geom_bar() # create bar and stacked bar plots
ggplot() + geom_histogram() # create histograms
ggplot() + geom_point() # create scatter plots
ggplot() + geom_line() # create line plots
  • Your data is an argument in the ggplot() function
donor %>% ggplot() + geom_bar()
ggplot(donor) + geom_bar()

Geoms

  • Map your data to the plot with aes()
    • aes() can be an argument in ggplot() or the geom
    • Many arguments in the aes() function
    • Not necessary to name x and y in aes()
aes(x = NULL, y = NULL, color = NULL
  , fill = NULL, alpha = NULL, label = NULL
  , shape = NULL, size = NULL, group = NULL
  , linetype = NULL
  )
donor %>% ggplot() + geom_bar(aes(primary_general))
donor %>% ggplot(aes(primary_general)) + geom_bar()

Geoms

donor %>% ggplot() + geom_bar(aes(primary_general))
donor %>% ggplot(aes(primary_general)) + geom_bar()

Geoms

donor %>% ggplot(aes(primary_general, fill = primary_general)) + geom_bar()

Exercise - 7 minutes

  • Build a bar chart with police data that shows how many incidents occurred in each district_sector
  • Only include these district_sector values: c('B', 'E', 'D', 'R', 'O', 'C', 'K')
  • Color the bars using event_clearance_group
  • Only include these event_clearance_group values: c('TRAFFIC RELATED CALLS', 'FRAUD CALLS', 'BURGLARY', 'BIKE')

Hint

# Use to include/exclude values
filter()
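
As a rough illustration of the hint, filter() combined with %in% keeps only the rows whose value appears in a vector you supply. The sketch below uses the built-in mpg dataset and made-up selections (keep_classes and the cylinder values) so that it does not give away the exercise.

```r
library(dplyr)
library(ggplot2)

# Values to keep (an arbitrary choice for this illustration)
keep_classes <- c('compact', 'suv')

# Keep rows that match both conditions
small <- mpg %>%
  filter(class %in% keep_classes & cyl %in% c(4, 6))

# Bar chart of the filtered rows, colored by number of cylinders
small %>%
  ggplot(aes(class, fill = factor(cyl))) + geom_bar()
```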

Exercise - 7 minutes

  • Build a bar chart with police data that shows how many incidents occurred in each district_sector
  • Only include these district_sector values: c('B', 'E', 'D', 'R', 'O', 'C', 'K')
  • Color the bars using event_clearance_group
  • Only include these event_clearance_group values: c('TRAFFIC RELATED CALLS', 'FRAUD CALLS', 'BURGLARY', 'BIKE')

Hint

police %>%
  filter() %>%
  ggplot()

Exercise - 7 minutes

police %>% 
  filter(
    district_sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K') &
    event_clearance_group %in% c('TRAFFIC RELATED CALLS', 'FRAUD CALLS', 'BURGLARY', 'BIKE')
    ) %>%
  ggplot(aes(district_sector, fill = event_clearance_group)) + geom_bar()

Scales

Axis, legend, and chart titles

Axis names

  • Keep names short-ish
  • Consider including units in axis name
  • It is possible you don’t need an axis name

2 options

# Option 1
labs(x = 'X Axis Title', y = 'Y Axis Title')

# Option 2
xlab('X Axis Title')
ylab('Y Axis Title')
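
A small sketch showing that the two options give the same axis names. It assumes the built-in mpg dataset, and the label text is invented for the example.

```r
library(ggplot2)

base <- ggplot(mpg, aes(displ, hwy)) + geom_point()

# Option 1: name both axes in one call
p1 <- base + labs(x = 'Engine Displacement (L)', y = 'Highway MPG')

# Option 2: name each axis with its own function
p2 <- base + xlab('Engine Displacement (L)') + ylab('Highway MPG')
```

Option 1 tends to be handier once you also set legend names and titles, because labs() accepts all of them in the same call.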

Axis, legend, and chart titles

donor %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(x = 'Election Type', y = 'Donations (#)')

Axis, legend, and chart titles

Try adding axis names to the district_sector plot you made during the exercise

police %>% 
  filter(
    district_sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K') &
    event_clearance_group %in% c('TRAFFIC RELATED CALLS', 'FRAUD CALLS', 'BURGLARY', 'BIKE')
    ) %>%
  ggplot(aes(district_sector, fill = event_clearance_group)) + geom_bar() + 
  labs(x = 'District Sectors', y = 'Incidents by Event\nClearance Group (#)')

Axis, legend, and chart titles

Legend names

  • Keep names short
  • Be mindful of what the legend is for
    • Color? Fill? Size? etc.
  • Consider hiding the legend title for a clean look
labs(fill = '')
labs(fill = element_blank())
labs(fill = NULL)
labs(colour = 'Check out these colors')

Axis, legend, and chart titles

police %>% 
  filter(
    district_sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K') &
    event_clearance_group %in% c('TRAFFIC RELATED CALLS', 'FRAUD CALLS', 'BURGLARY', 'BIKE')
    ) %>%
  ggplot(aes(district_sector, fill = event_clearance_group)) + geom_bar() + 
  labs(x = 'District Sectors', y = 'Incidents by Event\nClearance Group (#)', fill = element_blank())

Axis, legend, and chart titles

Try removing the legend name from your district_sector plot

police %>% 
  filter(district_sector %in% c('B', 'E', 'D', 'R', 'O', 'C', 'K')) %>%
  ggplot(aes(district_sector, fill = event_clearance_group)) + geom_bar() + 
  labs(x = 'District Sectors', y = 'Incidents by Event\nClearance Group (#)'
       , fill = element_blank())

Axis, legend, and chart titles

Chart titles and subtitles

  • Consider making titles (or subtitles) descriptive
  • Don’t overuse subtitles
# Option 1
labs(title = NULL, subtitle = NULL)

# Option 2
ggtitle(label = NULL, subtitle = NULL)
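
A minimal sketch of both options on the built-in mpg dataset, with placeholder title text. One detail worth knowing: ggtitle() calls its first argument label, while labs() calls it title.

```r
library(ggplot2)

base <- ggplot(mpg, aes(class)) + geom_bar()

# Option 1: labs() takes title and subtitle by name
p1 <- base + labs(title = 'Cars by Class', subtitle = 'mpg dataset')

# Option 2: ggtitle() takes the title as its first argument (label)
p2 <- base + ggtitle('Cars by Class', subtitle = 'mpg dataset')
```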

Axis, legend, and chart titles

donor %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(
    x = 'Election Type'
    , y = 'Donations (#)'
    , fill = element_blank()
    , title = 'Number of Donations by Election Type'
    , subtitle = 'Over 6,000 full election cycle donations'
    )

Axis, legend, and chart titles

Try adding a title to the district_sector plot

Axis units

  • Change axis units
  • This is a good idea and you should do it when it’s relevant
  • Good for x and y axes
install.packages('scales')
library(scales)

scale_y_continuous(labels = percent) # Format proportions as percentages (values are multiplied by 100)
scale_y_continuous(labels = dollar) # Add a dollar sign to numbers on axis
scale_y_continuous(labels = comma) # Add a comma to numbers on axis
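
To see what these label functions do, you can call them directly on a few numbers before using them on an axis. The numbers below are arbitrary, and the plot uses the built-in mpg dataset as a stand-in.

```r
library(ggplot2)
library(scales)

# The label functions are ordinary functions that turn numbers into text
comma(1234567)  # "1,234,567"
dollar(1500)    # "$1,500"
percent(0.25)   # the proportion 0.25 formatted as a percentage

# Passing one of them to labels applies it to every axis break
p <- ggplot(mpg, aes(class)) +
  geom_bar() +
  scale_y_continuous(labels = comma)
```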

Axis units

donor %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(
    x = 'Election Type'
    , y = 'Donations (#)'
    , fill = element_blank()
    , title = 'Number of Donations by Election Type'
    , subtitle = 'Over 6,000 full election cycle donations'
    ) + 
  scale_y_continuous(labels = comma) 

Themes

Themes

  • Change layer and background colors
  • Change fonts
  • Change plot borders/boundaries and ticks
  • Pre-built themes vs. custom themes
    • ggplot2 themes
    • ggthemes themes
install.packages('ggthemes')
library(ggthemes)

Pre-built themes

Selected ggplot2 themes

theme_classic()
theme_minimal()
theme_dark()

Selected ggthemes themes

theme_stata() + scale_colour_stata() # scale_fill_stata()
theme_economist() + scale_colour_economist() # scale_fill_economist()
theme_fivethirtyeight() + scale_color_fivethirtyeight() # scale_fill_fivethirtyeight()
theme_wsj() + scale_colour_wsj() # scale_fill_wsj()
theme_pander() + scale_colour_pander() # scale_fill_pander()
theme_hc(bgcolor = "darkunica") + scale_colour_hc("darkunica") # scale_fill_hc("darkunica")

Pre-built themes

donor %>% 
  ggplot(aes(primary_general, fill = primary_general)) + 
  geom_bar() + 
  labs(
    x = 'Election Type'
    , y = 'Donations (#)'
    , fill = element_blank()
    , title = 'Number of Donations by Election Type'
    , subtitle = 'Over 6,000 full election cycle donations'
    ) + 
  scale_y_continuous(labels = comma) +
  theme_economist() + 
  scale_fill_economist()

Pre-built themes

Economist theme

Pre-built themes

Try adding a theme to the district_sector plot

Selected ggplot2 themes

theme_classic()
theme_minimal()
theme_dark()

Selected ggthemes themes

theme_stata() + scale_colour_stata() # scale_fill_stata()
theme_economist() + scale_colour_economist() # scale_fill_economist()
theme_fivethirtyeight() + scale_color_fivethirtyeight() # scale_fill_fivethirtyeight()
theme_wsj() + scale_colour_wsj() # scale_fill_wsj()
theme_pander() + scale_colour_pander() # scale_fill_pander()
theme_hc(bgcolor = "darkunica") + scale_colour_hc("darkunica") # scale_fill_hc("darkunica")

Custom themes

  • You can create custom themes with theme()
  • Greater control over chart aesthetics
  • Use custom themes if pre-built themes don’t cut it

Custom themes

t <- theme(
  plot.title = element_text(size=14, face="bold", vjust=1)
  , plot.background = element_blank()
  , panel.grid.major = element_blank()
  , panel.grid.minor = element_blank()
  , panel.border = element_blank()
  , panel.background = element_blank()
  , axis.ticks = element_blank()
  , axis.text = element_text(colour="black", size=12)
  , axis.text.x = element_text(angle=45, hjust=1)
  , legend.title = element_blank()
  , legend.position = "none"
  , legend.text = element_text(size=12)
  )
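
A custom theme is a regular object, so it is added to a plot with + like any other layer. The sketch below is an assumption-laden example: my_theme and its settings are invented for illustration, and the plot uses the built-in mpg dataset.

```r
library(ggplot2)

# A small custom theme (settings chosen only for illustration)
my_theme <- theme(
  panel.grid.minor = element_blank()
  , axis.text.x = element_text(angle = 45, hjust = 1)
  , legend.position = 'bottom'
  )

# Add the saved theme to any plot with +
p <- ggplot(mpg, aes(class, fill = drv)) +
  geom_bar() +
  my_theme
```

Saving the theme under a descriptive name lets you reuse it across charts. A name like my_theme is also easier to spot in code than a single letter.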

Histograms, scatterplots, and line plots

Histograms: 1 continuous variable

donor %>% 
  filter(amount < 1000) %>%
  ggplot(aes(amount)) + geom_histogram()

Histograms: 1 continuous variable

donor %>% 
  filter(amount < 1000) %>%
  ggplot(aes(amount)) + geom_histogram()
geom_density() # use in place of geom_histogram() to draw a smoothed density curve
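
By default geom_histogram() uses 30 bins and prints a message suggesting you pick a better value with binwidth. A short sketch on the built-in mpg dataset, where the binwidth of 2 is an arbitrary choice:

```r
library(ggplot2)

# Histogram with an explicit bin width (2 miles per gallon per bar)
p_hist <- ggplot(mpg, aes(hwy)) + geom_histogram(binwidth = 2)

# Same variable drawn as a smoothed density curve
p_dens <- ggplot(mpg, aes(hwy)) + geom_density()
```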

Scatterplots: 2 continuous variables

donor %>% 
  ggplot(aes(receipt_year, election_year)) + geom_point()

Scatterplots: 2 continuous variables

donor %>% 
  ggplot(aes(receipt_year, election_year)) + geom_point()
geom_count() # counts overlapping points and maps the count to point size
geom_jitter() # adds a small random offset to each point to reduce overplotting
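
When many rows share the same x and y values, a plain scatterplot hides how many points sit on top of each other. The sketch below compares the three geoms on the built-in mpg dataset. The jitter amounts are arbitrary.

```r
library(ggplot2)

base <- ggplot(mpg, aes(cty, hwy))

p_point  <- base + geom_point()   # overlapping points look like one point
p_count  <- base + geom_count()   # bigger point = more rows at that location
p_jitter <- base + geom_jitter(width = 0.3, height = 0.3)  # nudge points apart
```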

Line plot: 1 continuous variable + date variable

donor %>% 
  ggplot(aes(receipt_year, amount)) + geom_line()
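
Line plots tend to read best when there is one y value per date. As an illustration with a true date column, the sketch below uses the economics dataset that ships with ggplot2 (an assumption, since it is not part of the workshop data) and formats the x axis to show years.

```r
library(ggplot2)

# economics has one row per month: date is a Date, unemploy is a count
p <- ggplot(economics, aes(date, unemploy)) +
  geom_line() +
  scale_x_date(date_labels = '%Y')
```

With data such as donor, where each year has many rows, aggregating to one value per year first usually gives a cleaner line.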

Exercise - 5 minutes

  • With either dataset, donor or police, create a scatterplot
  • Include two aes() arguments in addition to x and y
    • For example, size, shape, or alpha
  • Start by identifying numeric variables in your dataset

Exercise - 5 minutes

police %>% 
  ggplot(aes(longitude, latitude, color = district_sector, alpha = district_sector)) + 
  geom_point() + 
  theme_classic()

Aggregate data for chart creation with group_by() and summarise()

What you can do with group_by() and summarise()

  • Count, sum, average, and identify max and min values
  • This is like COUNTIFS and SUMIFS formulas in Excel and LOD expressions in Tableau
  • Example questions you could answer
    • In donor, which type received the most money in donations from 'REPUBLICAN'?
    • In donor, which type received the most money in donations in the receipt_year 2015?
    • In donor, which contributor_state has the largest average donation amount?

How you use group_by() and summarise() to aggregate data

  • In group_by(), list the variables by which you want to aggregate data
  • In summarise(), create a variable and define the variable with an aggregation function
  • Use %>% to ‘link’ group_by() and summarise()
  • Aggregation functions
    • n(), n_distinct(), sum(), mean(), max(), min(), etc.

Example
In donor, which type received the most money in donations from 'REPUBLICAN'?

donor %>% group_by(type, party) %>% summarise(dollars = sum(amount, na.rm = TRUE))
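
To check your understanding of the pattern, it can help to run it on a tiny table where you can add up the answer by hand. The toy data below is made up for this purpose.

```r
library(dplyr)

# Made-up data: four donations, one with a missing amount
toy <- tibble(
  type = c('Candidate', 'Candidate', 'Committee', 'Committee')
  , party = c('A', 'A', 'B', 'B')
  , amount = c(100, NA, 50, 25)
  )

totals <- toy %>%
  group_by(type, party) %>%
  summarise(
    dollars = sum(amount, na.rm = TRUE)  # NA is skipped: 100 and 75
    , donations = n()                    # rows per group: 2 and 2
    )
```

One row comes back per combination of the group_by() variables, with one column for each variable created in summarise().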

How you use group_by() and summarise() to aggregate data

Example
In donor, which type received the most money in donations from 'REPUBLICAN'?

donor %>% group_by(type, party) %>% summarise(dollars = sum(amount, na.rm = TRUE))
## # A tibble: 7 x 3
## # Groups: type [?]
##   type                party        dollars
##   <fctr>              <fctr>         <dbl>
## 1 Candidate           DEMOCRAT      311690
## 2 Candidate           INDEPENDENT      467
## 3 Candidate           NON PARTISAN  206196
## 4 Candidate           NONE            6465
## 5 Candidate           OTHER          25859
## 6 Candidate           REPUBLICAN    313540
## 7 Political Committee <NA>         1563098

How you use group_by() and summarise() to aggregate data

How would you turn the tabular output data into a chart?

donor %>% group_by(type, party) %>% summarise(dollars = sum(amount, na.rm = TRUE))
## # A tibble: 7 x 3
## # Groups: type [?]
##   type                party        dollars
##   <fctr>              <fctr>         <dbl>
## 1 Candidate           DEMOCRAT      311690
## 2 Candidate           INDEPENDENT      467
## 3 Candidate           NON PARTISAN  206196
## 4 Candidate           NONE            6465
## 5 Candidate           OTHER          25859
## 6 Candidate           REPUBLICAN    313540
## 7 Political Committee <NA>         1563098

How you use group_by() and summarise() to aggregate data

How would you turn the tabular output data into a chart?

donor %>% 
  group_by(type, party) %>% 
  summarise(dollars = sum(amount, na.rm = TRUE)) %>%
  ggplot(aes(type, dollars, fill = party)) + geom_bar(stat = "identity") + 
  coord_flip() + 
  scale_y_continuous(labels = dollar) + theme_classic()

How you use group_by() and summarise() to aggregate data

How would you turn the tabular output data into a chart?

donor %>% 
  group_by(type, party) %>% 
  summarise(dollars = sum(amount, na.rm = TRUE)) %>%
  ggplot(aes(type, dollars, fill = party)) + geom_bar(stat = "identity") + 
  scale_y_continuous(labels = dollar) + theme_classic()

Necessary for bar charts when data is aggregated

geom_bar(stat = 'identity') # bar height comes from your y values instead of row counts
geom_line(stat = 'identity') # optional: geom_line() already defaults to stat = 'identity'
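
ggplot2 also provides geom_col(), which is a shortcut for geom_bar(stat = 'identity'). A quick sketch on the built-in mpg dataset (not the workshop data) showing that the two draw the same bars:

```r
library(dplyr)
library(ggplot2)

# Aggregate first: one row per class
by_class <- mpg %>%
  group_by(class) %>%
  summarise(avg_hwy = mean(hwy))

# Both plots use avg_hwy as the bar height
p_bar <- by_class %>% ggplot(aes(class, avg_hwy)) + geom_bar(stat = 'identity')
p_col <- by_class %>% ggplot(aes(class, avg_hwy)) + geom_col()
```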

How you use group_by() and summarise() to aggregate data

Example
In donor, which type received the most money in donations in the receipt_year 2015?

donor %>% 
  group_by(receipt_year, type) %>% 
  summarise(total_amount = sum(amount, na.rm = TRUE)) %>%
  filter(receipt_year %in% 2015)
## # A tibble: 2 x 3
## # Groups: receipt_year [1]
##   receipt_year type                total_amount
##          <int> <fctr>                     <dbl>
## 1         2015 Candidate                  55729
## 2         2015 Political Committee        66821

How you use group_by() and summarise() to aggregate data

How would you turn the tabular output data into a chart?

## # A tibble: 5 x 3
##   receipt_year type                total_amount
##          <int> <fctr>                     <dbl>
## 1         2013 Political Committee       118804
## 2         2008 Candidate                 139377
## 3         2007 Political Committee       162470
## 4         2017 Political Committee        44280
## 5         2013 Candidate                  46316

How you use group_by() and summarise() to aggregate data

How would you turn the tabular output data into a chart?

donor %>% 
  group_by(receipt_year, type) %>% 
  summarise(total_amount = sum(amount, na.rm = TRUE)) %>%
  ggplot(aes(receipt_year, total_amount, color = type)) +
  geom_line(stat = 'identity') + 
  scale_y_continuous(labels = dollar) + theme_classic()

Exercise - 10 minutes

  • In donor, determine which contributor_state has the largest average donation amount
  • Use mean() in the summarise function
  • Exclude NA values with filter()
  • Show your findings in a chart
donor %>%
  group_by() %>%
  summarise() %>%
  filter()

Exercise - 10 minutes

  • In donor, determine which contributor_state has the largest average donation amount
  • Use mean() in the summarise function
  • Exclude NA values with filter()
  • Show your findings in a chart
donor %>%
  group_by() %>%
  summarise() %>%
  filter() %>%
  ggplot() +
  geom_bar()

Exercise - 10 minutes

  • In donor, determine which contributor_state has the largest average donation amount
  • Use mean() in the summarise function
  • Exclude NA values with filter()
  • Show your findings in a chart
donor %>%
  group_by() %>%
  summarise() %>%
  filter() %>%
  ggplot() +
  geom_bar(stat = 'identity')

Exercise - 10 minutes

donor %>% 
  group_by(contributor_state) %>% 
  summarise(avg_amount = mean(amount, na.rm = TRUE)) %>%
  filter(!is.na(contributor_state)) %>%
  ggplot(aes(reorder(contributor_state, -avg_amount), avg_amount)) +
  geom_bar(stat = 'identity') + 
  scale_y_continuous(labels = dollar) + theme_classic()
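
The reorder() call is what sorts the bars from largest to smallest: it orders the categories by a second variable, and the minus sign flips the order to descending. A tiny sketch with made-up values:

```r
library(ggplot2)

# Made-up averages for three states
avg <- data.frame(
  state = c('WA', 'OR', 'CA')
  , avg_amount = c(120, 300, 45)
  )

# Levels sorted by descending avg_amount
state_order <- levels(reorder(avg$state, -avg$avg_amount))
state_order  # "OR" "WA" "CA"

# The same reorder() inside aes() sorts the bars in the chart
p <- ggplot(avg, aes(reorder(state, -avg_amount), avg_amount)) +
  geom_bar(stat = 'identity')
```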

Exercise - 15 minutes

  • Create a chart from the police dataset
  • Make sure your axes are labeled correctly
  • Give your chart a title
  • Use color in aes()
  • Use a pre-built theme

Exercise - 15 minutes

police %>% 
  filter(
    ! event_clearance_ampm %in% NA & 
    ! event_clearance_group %in% NA
    ) %>%
  group_by(event_clearance_group, event_clearance_ampm) %>%
  summarise(n = n()) %>%
  ungroup %>%
  group_by(event_clearance_group) %>%
  mutate(total = sum(n, na.rm = TRUE)) %>%
  filter(total >= 100) %>%
  ggplot(aes(reorder(event_clearance_group, n), n, fill = event_clearance_ampm)) + 
  geom_bar(stat = 'identity') + 
  coord_flip() + 
  theme_wsj() + 
  scale_fill_wsj() + 
  labs(title = 'Most incidents occur\nin the AM', fill = 'Time of Day')

Exercise - 15 minutes

police %>% 
  filter(
    ! event_clearance_ampm %in% NA & 
    ! event_clearance_group %in% NA
    ) %>%
  group_by(event_clearance_group, event_clearance_ampm) %>%
  summarise(n = n()) %>% 
  ggplot(aes(n, color = event_clearance_ampm)) + 
  geom_density(size = 1) + 
  theme_wsj() + 
  scale_colour_wsj() + 
  labs(title = 'Most incidences have\noccurred fewer than 25 times', colour = 'Time of Day')

Exercise - 10 minutes

  • Create a chart from the donor dataset
  • Make some comparison between values in cash_or_in_kind
  • Give your chart a title
  • Use a pre-built theme

Exercise - 10 minutes

donor %>% 
  ggplot(aes(cash_or_in_kind, fill = primary_general)) + 
  geom_bar() + 
  theme_wsj() + 
  scale_fill_wsj() + 
  labs(title = 'So many more\ncash donations!!!', fill = 'Election Type')