October 15, 2016

Welcome to R-Ladies RTP!

Outline

  • Why ggplot2?
  • Conceptual ggplot2
  • Basic ggplot2 syntax
  • Common pitfalls
  • The power of layers
  • The power of groups
  • The power of facets
  • Scales and legends
  • Themes and refinements
  • Extensions

Materials

Datasets

Dataset Descriptions

  • Transit ride data
  • rides_df: one row per ride
  • daily_df: daily summary of rides
  • hourly_df: hourly summary of rides

  • Durham registered voter data
  • durham_voters_df: one row per voter

Why ggplot2?

Exploratory Power

  • explore data very quickly, and don't screw up
  • graphics "pit of success"
  • refine plots to be publication quality

Conceptually Cohesive

  • based on the grammar of graphics
  • principles and patterns
  • very complex plots from simple building blocks

Dominance

  • the most prominent static graphics package in R
  • help is easy to find
  • developers are contibuting extensions

Conceptual ggplot2

The basics

  • map variables to aethestics
  • add "geoms" for visual representation layers
  • scales can be independently managed
  • legends are automatically created
  • statistics are sometimes calculated by geoms…

Learn more

  • ggplot2 book

Basic ggplot2 syntax

The required elements

  • DATA
  • MAPPING
  • GEOM

Our first hands-on example …

download.file(
'https://www.dropbox.com/s/zhmn02ti0ggxdj7/rladies_ggplot2_datasets.rda?dl=1',
        'rladies_ggplot2_datasets.rda')
        
attach('rladies_ggplot2_datasets.rda')

library(tidyverse)

Basic plot

ggplot(data = daily_df) +
geom_point(mapping = aes(x = ride_date, y = n_rides))

Mappings can be general

ggplot(data = daily_df, mapping = aes(x = ride_date, y = n_rides)) +
geom_point()

Parameters can be unnamed

ggplot(daily_df, aes(x = ride_date, y = n_rides)) +
geom_point()

Data can be passed in

daily_df %>%
ggplot(aes(x = ride_date, y = n_rides)) +
geom_point()

Data can be passed in

daily_df %>%
ggplot(aes(x = ride_date, y = n_rides)) +
geom_point()

Mapping is more than x and y

daily_df %>%
ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
geom_point()

Mapping is more than x and y

daily_df %>%
ggplot(aes(x = ride_date, y = n_rides, size = n_riders)) +
geom_point()

Variable creation on the fly…

daily_df %>%
ggplot(aes(x = ride_date, y = n_rides, color = day_of_week %in% c('Sat', 'Sun'))) +
geom_point()

… or passed in

daily_df %>%
  mutate(day_type = if_else(day_of_week %in% c('Sat', 'Sun'),
                            'Weekend',
                            'Weekday')) %>%
ggplot(aes(x = ride_date, y = n_rides, color = day_type)) +
geom_point()

Common early pitfalls

Mappings that aren't

daily_df %>%
ggplot() +
geom_point(aes(x = ride_date, y = n_rides, color = 'blue'))

Mappings that aren't

daily_df %>%
ggplot() +
geom_point(aes(x = ride_date, y = n_rides), color = 'blue')

+ and %>%

daily_df %>% mutate(day_type = if_else(day_of_week %in% c('Sat', 'Sun'), 'Weekend', 'Weekday')) %>% ggplot(aes(x = ride_date, y = n_rides, color = day_type)) %>% geom_point()

Basic plot exercise

Plot the number of unique routes per day over time, colored by day of week. (n_unique_routes)

Basic plot exercise

daily_df %>%
  ggplot(aes(x = ride_date, y = n_unique_routes, color = day_of_week)) +
  geom_point() 

The power of layers

Basic plot

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides)) +
  geom_point() 

Two layers!

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides)) +
  geom_point() +
  geom_line()

Iterate on layers

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides)) +
  geom_point() + 
  geom_smooth(span = .1) # try changing span

The power of groups

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point() + 
  geom_line()

Now we've got it

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_smooth(span = .2, se = FALSE)

Control data by layer

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point(data = filter(daily_df, !(day_of_week %in% c('Sat', 'Sun')) & n_rides < 200),
             size = 5, color = 'gray') +
  geom_point()

Control data by layer

low_weekdays_df <- daily_df %>%
  filter(!(day_of_week %in% c('Sat', 'Sun')) & n_rides < 100)

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, 
             color = day_of_week, label = ride_date)) +
  geom_point(data = low_weekdays_df, size = 5, color = 'gray') +
  geom_text(data = low_weekdays_df, aes(y = n_rides + 15), 
            size = 2, color = 'black') +
  geom_point() 

The power of facets

facet_wrap

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides)) +
  geom_point() + 
  geom_line() +
  facet_wrap( ~ day_of_week)

facet_grid

durham_voters_df %>%
  group_by(race_code, gender_code, age) %>%
  summarize(n_voters = n(),
            n_rep = sum(party == 'REP')) %>%
  filter(gender_code %in% c('F','M') & 
           race_code %in% c('W', 'B', 'A') &
           age != 'Age < 18 Or Invalid Birth Date') %>%
  ggplot(aes(x = age, y = n_voters)) +
  geom_bar(stat = 'identity') +
  facet_grid(race_code ~ gender_code)

Free scales

durham_voters_df %>%
  group_by(race_code, gender_code, age) %>%
  summarize(n_voters = n(),
            n_rep = sum(party == 'REP')) %>%
  filter(gender_code %in% c('F','M') & 
           race_code %in% c('W', 'B', 'A') &
           age != 'Age < 18 Or Invalid Birth Date') %>%
  ggplot(aes(x = age, y = n_voters)) +
  geom_bar(stat = 'identity') +
  facet_grid(race_code ~ gender_code, scales = 'free_y')

Facets + layers

Facets + layers

Note: better to use gather

durham_voters_df %>%
  group_by(race_code, gender_code, age) %>%
  summarize(n_voters = n(),
            n_rep = sum(party == 'REP')) %>%
  filter(gender_code %in% c('F','M') & 
           race_code %in% c('W', 'B', 'A') &
           age != 'Age < 18 Or Invalid Birth Date') %>%
  mutate(age_cat = as.numeric(as.factor(age))) %>%
  ggplot(aes(x = age, y = n_voters)) +
  geom_point() +
  geom_line(aes(x = age_cat)) +
  geom_line(aes(x = age_cat, y = n_rep), color = 'red') +
  geom_point(aes(y = n_rep), color = 'red') +
  facet_grid(race_code ~ gender_code, scales = 'free_y') +
  expand_limits(y = 0) 

Everyday conveniences

Labels and titles

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_smooth(span = .2, se = FALSE) +
  xlab('') +
  ylab('# of Transit Rides') +
  ggtitle('Transit Rides over time by Day of Week') +
  scale_color_discrete('Day of Week')

Scales and legends

Scale transformation

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point() +
  scale_y_reverse()

Scale transformation

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point() +
  scale_y_sqrt()

Scale details

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point() +
  scale_y_continuous(breaks = c(0, 200, 500))

Themes and refinements

Overall Themes

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point() +
  theme_bw()

Overall Themes

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point() +
  theme_dark()

Changing Details

daily_df %>%
  ggplot(aes(x = ride_date, y = n_rides, color = day_of_week)) +
  geom_point() +
  theme(axis.text.x = element_text(angle = 90))

Themes Vignette

Documentation

Online Docs

Extensions

Contributed extensions

Questions?

Next Meetup

Monday, November 14th, 6pm

Location:

  • TransLoc, near RTP
  • 4505 Emperor Blvd, Durham
  • Parking is easy!

Topic:

  • Getting data into R

Topic/format suggestions?