Data Visualization Using R


Fall 2015 - Data @ Reed Research Skills Workshop Series


Chester Ismay

Office: ETC 223

cismay@reed.edu

http://blogs.reed.edu/datablog

Who am I?

  • Grew up in South Dakota (town of 112 people)
  • BS in Math, minor in Computer Science from SDSM&T
  • MS in Statistics from Northern Arizona University
  • Worked as an actuary before obtaining a PhD in Statistics from Arizona State University
  • Was Assistant Professor of Statistics and Data Science at Ripon College for the last two years
  • Moved to Portland area this summer
  • Started working at Reed on August 11th

What can I help you with?

  • Data analysis
  • Data wrangling/cleaning
  • Data visualization
  • Data tidying/manipulating
  • Reproducible research

When am I available?

  • Email me at cismay@reed.edu or chester.ismay@reed.edu
  • Office (ETC 223) hours
    • Mondays (11 AM to noon)
    • Tuesdays and Thursdays (2 PM to 3 PM)
  • Sometimes available for virtual office hours via Google Hangouts (email me for details)

Basic research process

Research Process

Further support

data@reed.edu
http://www.reed.edu/data-at-reed

What is data visualization?

Data Cycle

What are the properties of good visualizations?

Tufte’s thoughts

Great data visualizations almost always

  • include comparisons of multiple variables
  • make large data sets coherent
  • reveal the data at several levels of detail
  • encourage eyes to compare data
  • serve a clear purpose
  • … show the data!
  • … don’t try to mislead the viewer!

Tukey’s quotes

“The simple graph has brought more information to the data analyst’s mind than any other device.”

“A picture is not merely worth a thousand words, it is much more likely to be scrutinized than words are to be read.”

What makes for bad visualizations?

Baby Boomer - BAD

Hospitals - BAD

Retaining Information - BAD

Other bad examples

WTF Visualizations : http://viz.wtf/

Phones - BAD

Examples using ggplot2 in R

Portland 2014 Departing Flights

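library(dplyr)    # provides %>%, filter(), select(), mutate(), summarize()
library(ggplot2)  # provides ggplot() and the geom_*() functions used below
library(pnwflights14)
data("flights", package = "pnwflights14")
pdx_flights <- flights %>% filter(origin == "PDX") %>% 
  select(-year, -origin)
pdx_flights %>% str()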
## Classes 'tbl_df', 'tbl' and 'data.frame':    53335 obs. of  14 variables:
##  $ month    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ day      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ dep_time : int  1 8 28 526 541 549 559 602 606 618 ...
##  $ dep_delay: num  96 13 -2 -4 1 24 -1 -3 6 -2 ...
##  $ arr_time : int  235 548 800 1148 911 907 916 1204 746 1135 ...
##  $ arr_delay: num  70 -4 -23 15 4 12 -9 7 3 -30 ...
##  $ carrier  : chr  "AS" "UA" "US" "UA" ...
##  $ tailnum  : chr  "N508AS" "N37422" "N547UW" "N813UA" ...
##  $ flight   : int  145 1609 466 229 1569 649 796 1573 406 1650 ...
##  $ dest     : chr  "ANC" "IAH" "CLT" "IAH" ...
##  $ air_time : num  194 201 251 217 130 122 125 203 87 184 ...
##  $ distance : num  1542 1825 2282 1825 991 ...
##  $ hour     : num  0 0 0 5 5 5 5 6 6 6 ...
##  $ minute   : num  1 8 28 26 41 49 59 2 6 18 ...

Question 1

Are flights to Hawaii (from PDX) more likely than flights elsewhere to arrive more than 30 minutes early?


Sub-question

What are some of the properties of these early flights?

How do these “arriving early” flights vary by flight distance?

pdx_early_flights <- pdx_flights %>% na.omit() %>%
  filter(arr_delay < -30)
pdx_early_flights %>% ggplot(aes(x = distance, y = arr_delay)) +
  geom_point()

pdx_early_flights %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"), 
                              "Hawaiian", "Not Hawaiian")) %>%
  ggplot(aes(x = distance, y = arr_delay)) +
  geom_point(aes(color = hawaii_dest))

pdx_early_flights %>% 
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"), 
                              "Hawaiian", "Not Hawaiian")) %>%
  ggplot(aes(x = hawaii_dest , y = arr_delay)) +
  geom_boxplot()

What about delays throughout the year?

Getting the date in a nice format

date_string <- paste0("2014-", 
                      pdx_early_flights$month, "-", 
                      pdx_early_flights$day)
pdx_early_flights <- pdx_early_flights %>% 
  mutate(day_of_year = lubridate::ymd(date_string))
pdx_early_flights %>% ggplot(aes(x = day_of_year, y = arr_delay)) +
  geom_point()

pdx_early_flights %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"), 
                              "Hawaiian", "Not Hawaiian")) %>%
  ggplot(aes(x = day_of_year, y = arr_delay)) +
  geom_point(aes(color = hawaii_dest))
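
The date construction above pastes the pieces into a string and parses it with lubridate::ymd(). As a hedged alternative sketch, lubridate::make_date() builds the same dates directly from numeric parts. The toy data frame below is hypothetical and stands in for the month and day columns of the flights data.

```r
library(lubridate)

# Hypothetical stand-in for the month/day columns of the flights data.
toy <- data.frame(month = c(1L, 7L, 12L), day = c(5L, 14L, 31L))

# Approach used in the slides: build a string, then parse it.
from_string <- ymd(paste0("2014-", toy$month, "-", toy$day))

# Alternative: assemble the date from its numeric components.
from_parts <- make_date(year = 2014L, month = toy$month, day = toy$day)

# Both approaches should agree element by element.
all(from_string == from_parts)

# yday() gives the day number within the year, if that is what you need.
day_number <- yday(from_parts)
```

Either version yields a Date column that ggplot2 can place on a continuous time axis.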

A non-graphical answer

# Note: this summary uses all of flights (every origin in the data),
# not only pdx_flights
flights %>% na.omit() %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"), 
                              "Hawaiian", "Not Hawaiian")) %>%
  group_by(hawaii_dest) %>%
  summarize(perc_early = sum(arr_delay < -30) / n() * 100)
## Source: local data frame [2 x 2]
## 
##    hawaii_dest perc_early
##          (chr)      (dbl)
## 1     Hawaiian   9.959855
## 2 Not Hawaiian   1.366410
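
The percentage above is the count of qualifying flights divided by the group size. A small sketch of the same calculation on made-up data shows that taking the mean of the logical condition gives the same answer, since TRUE counts as 1 and FALSE as 0. The toy_flights data frame and its values are hypothetical.

```r
library(dplyr)

# Hypothetical flights: three to Hawaii, five elsewhere (values made up).
toy_flights <- data.frame(
  dest      = c("HNL", "HNL", "OGG", "SFO", "SFO", "DEN", "DEN", "DEN"),
  arr_delay = c(-45, -10, -35, -31, 5, 12, -2, 0)
)

toy_summary <- toy_flights %>%
  mutate(hawaii_dest = ifelse(dest %in% c("OGG", "KOA", "HNL", "LIH"),
                              "Hawaiian", "Not Hawaiian")) %>%
  group_by(hawaii_dest) %>%
  summarize(
    perc_early_sum  = sum(arr_delay < -30) / n() * 100,  # form used in the slides
    perc_early_mean = mean(arr_delay < -30) * 100        # equivalent shorthand
  )

print(toy_summary)
```

Here two of the three Hawaiian flights and one of the five other flights arrive more than 30 minutes early, so both columns report about 66.7 and 20.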

Question 2

Which carrier (departing Seattle) has the worst average dep_delay?

summary_sea_flights <- flights %>% na.omit() %>%
  filter(origin == "SEA") %>%
  group_by(carrier) %>%
  summarize(mean_dep_delay = mean(dep_delay))
summary_sea_flights %>% ggplot(aes(x = carrier, y = mean_dep_delay)) +
  geom_bar()

What happened?

summary_sea_flights %>% ggplot(aes(x = carrier, y = mean_dep_delay)) +
  geom_bar(stat = "identity")
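
A likely explanation for the failed first attempt: by default geom_bar() counts the rows in each x group, so it does not expect a y aesthetic. Setting stat = "identity" tells it to use the y values as given. The sketch below illustrates this on a hypothetical pre-summarized table (values made up), and assumes ggplot2 2.2.0 or later, where geom_col() is available as shorthand for geom_bar(stat = "identity").

```r
library(ggplot2)

# Hypothetical pre-summarized data: one row per carrier (values made up).
toy_summary <- data.frame(
  carrier        = c("AA", "DL", "UA"),
  mean_dep_delay = c(5.0, 2.5, 9.0)
)

# Bar heights taken directly from the y column.
p_identity <- ggplot(toy_summary, aes(x = carrier, y = mean_dep_delay)) +
  geom_bar(stat = "identity")

# Shorthand for the same plot.
p_col <- ggplot(toy_summary, aes(x = carrier, y = mean_dep_delay)) +
  geom_col()

# Default geom_bar(): no y aesthetic; bar height is the number of rows
# per carrier, which is 1 for every carrier in this table.
p_count <- ggplot(toy_summary, aes(x = carrier)) +
  geom_bar()

print(p_col)
```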

Stylizing

summary_sea_flights %>% ggplot(aes(x = reorder(carrier, mean_dep_delay), 
                                   y = mean_dep_delay)) +
  geom_bar(stat = "identity", colour = "red")
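
The reorder() call above is what sorts the bars: it turns carrier into a factor whose levels are ordered by mean_dep_delay instead of alphabetically. A minimal base R sketch with made-up values:

```r
# Hypothetical carriers and mean delays (values made up).
carrier        <- c("AA", "DL", "UA")
mean_dep_delay <- c(5.0, 2.5, 9.0)

# Levels are sorted by increasing mean_dep_delay.
ordered_carrier <- reorder(carrier, mean_dep_delay)
levels(ordered_carrier)
# "DL" "AA" "UA"

# Negating the values sorts from largest to smallest instead.
reversed_carrier <- reorder(carrier, -mean_dep_delay)
levels(reversed_carrier)
# "UA" "AA" "DL"
```

ggplot2 draws a discrete axis in factor-level order, so reordering the levels reorders the bars.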

Stylizing (Part Deux)

summary_sea_flights %>% ggplot(aes(x = reorder(carrier, mean_dep_delay), 
                                   y = mean_dep_delay)) +
  geom_bar(stat = "identity", fill = "red") +
  xlab("Airline Carrier") +
  ylab("Mean Departure Delay") +
  ggtitle("Seattle Departure Delays for 2014")

Question 3

How does the distribution of arrivals that are more than 30 minutes early vary across (meteorological) seasons?

flights %>% filter(arr_delay < -30) %>% 
  ggplot(aes(x = arr_delay)) +
  geom_histogram(stat = "bin", binwidth = 1, colour = "blue")

flights_seasons <- flights %>% filter(arr_delay < -30) %>% 
  na.omit() %>%
  mutate(season = ifelse(month %in% 3:5, "spring",
         ifelse(month %in% 6:8, "summer",
         ifelse(month %in% 9:11, "autumn", 
                "winter"))))

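The nested ifelse() calls above work, but they get hard to read as the number of categories grows. As an alternative sketch, assuming dplyr 0.5.0 or later, case_when() expresses the same month-to-season rule as a flat list of conditions. The toy_months data frame is hypothetical.

```r
library(dplyr)

# Hypothetical months standing in for the month column of flights.
toy_months <- data.frame(month = c(1, 4, 7, 10, 12))

toy_seasons <- toy_months %>%
  mutate(season = case_when(
    month %in% 3:5  ~ "spring",
    month %in% 6:8  ~ "summer",
    month %in% 9:11 ~ "autumn",
    TRUE            ~ "winter"   # everything else: December to February
  ))

toy_seasons$season
# "winter" "spring" "summer" "autumn" "winter"
```

Conditions are checked top to bottom, and the first one that matches wins, which mirrors the order of the nested ifelse() version.
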
Ugly first try

flights_seasons %>%
  ggplot(aes(x = arr_delay, fill = season)) +
  geom_histogram(stat = "bin", binwidth = 1)

flights_seasons %>% ggplot(aes(x = arr_delay)) +
  geom_histogram(stat = "bin", binwidth = 1) +
  facet_grid(. ~ season)

flights_seasons %>% ggplot(aes(x = arr_delay)) +
  geom_histogram(stat = "bin", binwidth = 1) +
  facet_grid(season ~ .)

flights_seasons %>% ggplot(aes(x = season, y = arr_delay)) +
  geom_boxplot()

Formalizing Good Graphics

The Grammar of Graphics

In brief, the grammar tells us that a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars).
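
As a minimal sketch of that mapping, the example below uses R's built-in mtcars data (an assumption for illustration; it is not the workshop's flight data). Three variables are mapped to three aesthetic attributes of one geometric object, the point.

```r
library(ggplot2)

# mtcars ships with R and stands in for any data frame here.
p <- ggplot(data = mtcars,
            mapping = aes(x = wt,                # weight -> horizontal position
                          y = mpg,               # fuel economy -> vertical position
                          color = factor(cyl))) + # cylinders -> point color
  geom_point()                                   # geometric object: points

print(p)
```

Swapping geom_point() for another geom changes the geometric object while the data-to-aesthetic mapping stays the same.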

Tidy data!

  • It will be helpful (and good practice) to get your table into a data.frame in R whenever possible.
  • Additionally, data should be tidy.
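
In the usual sense of "tidy", each variable is a column and each observation is a row. The sketch below, assuming tidyr 1.0.0 or later for pivot_longer(), reshapes a hypothetical "wide" table (values made up) into that form so a variable hidden in the column names becomes something ggplot2 can map to an aesthetic.

```r
library(tidyr)

# Hypothetical wide table: one column per carrier, so the carrier
# variable is hidden in the column names.
wide <- data.frame(
  month = c(1, 2),
  AS    = c(4.2, 3.1),
  UA    = c(7.5, 6.0)
)

# Tidy form: one row per month-carrier observation.
tidy <- pivot_longer(wide,
                     cols      = c(AS, UA),
                     names_to  = "carrier",
                     values_to = "mean_dep_delay")

print(tidy)
```

With the tidy version, aes(x = month, y = mean_dep_delay, color = carrier) works directly.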

Tidy

The lingo

  • aes: mappings of the elements in the data to aesthetics we can see on the graphic.
  • geom: geometric objects (points, lines, bars, etc.)
  • facet: faceting describes breaking the data into subsets and displaying those as “small multiples”

More lingo

  • stat: statistical transformations that summarize data (e.g., grouping the data into bins)
  • scale: draws a legend and/or axes, specifies at which scaling to view the plot
  • coord: almost always the Cartesian coordinate system
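
To tie the lingo together, here is a hedged sketch that names each of the six pieces in a single plot. It uses the built-in mtcars data as a stand-in, and the axis label is only illustrative.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg)) +                  # aes: mpg mapped to the x position
  geom_histogram(stat = "bin", binwidth = 2) +       # geom: bars; stat: counts per bin
  facet_grid(cyl ~ .) +                              # facet: one panel per cylinder count
  scale_x_continuous(name = "Miles per gallon") +    # scale: controls the x-axis
  coord_cartesian()                                  # coord: Cartesian (the default)

print(p)
```

The stat and coord lines only spell out defaults; removing them should give the same picture.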

RStudio + R Markdown

Tools to make working with R friendly

  • RStudio is a powerful user interface that helps you get better control of your analysis.
  • Like R, it is also completely free.
  • You can write your entire paper/report (text, code, analysis, graphics, etc.) all in R Markdown.
  • When you change any of your code and re-knit, R Markdown regenerates your plots and analysis output and creates an updated PDF file.
  • No more copy-and-paste!
  • It’s my job!
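
A minimal sketch of that workflow, assuming the rmarkdown package and pandoc are installed. The file name and report contents are hypothetical, and HTML output is used instead of PDF only because PDF additionally needs a LaTeX installation.

```r
# Build the three-backtick chunk delimiter programmatically so this
# example can itself sit inside a code block.
fence <- strrep("`", 3)

# A tiny R Markdown document: header, one sentence with inline R code,
# and one code chunk that draws a plot.
rmd_lines <- c(
  "---",
  "title: \"Hypothetical report\"",
  "output: html_document",
  "---",
  "",
  "The mean mpg in mtcars is `r round(mean(mtcars$mpg), 1)`.",
  "",
  paste0(fence, "{r}"),
  "plot(mtcars$wt, mtcars$mpg)",
  fence
)

rmd_file <- file.path(tempdir(), "hypothetical_report.Rmd")
writeLines(rmd_lines, rmd_file)

# Knit only if rmarkdown and pandoc are available on this machine.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_file, quiet = TRUE)
}
```

Editing the chunk and rendering again regenerates both the number in the sentence and the plot, with no copy-and-paste. In RStudio, the Knit button runs the same rendering step.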

Time to try it for yourself!

ggplot2 documentation

R Graphics Cookbook

ggplot2 Cheat Sheet

Your assignment : Right click on me and Save

Solutions (Rmd) : Right click and Save

Solutions (HTML) : Click away…after you’ve tried!

Data @ Reed Research Skills Workshops for Fall 2015

All workshops in ETC 211 from 4 - 5 PM

  • September 16 - Data analysis with Stata
  • September 23 - Data analysis with R
  • September 30 - Data visualization using R
  • October 7 - Maps and more: spatial data
  • October 14 - Reproducible research

Thanks!

cismay@reed.edu



Slides available at http://rpubs.com/cismay/dvur_workshop_2015