Today’s focus

What data should we use in introductory statistics and data science courses?

Ideally data that’s:

  1. Rich enough to answer meaningful questions with
  2. Real enough to ensure that there is context
  3. Realistic enough to convey the reality of much of the world’s data

One goal

On the one hand, Cobb (2015) argues that we should

  1. “Teach through research”
  2. “Minimize prerequisites to research”

Another goal

On the other hand, from the New York Times:

Drawing

Analogy for second goal

Two conflicting goals

  • On the one hand: Minimize prerequisites to research
  • On the other: Do not betray reality of data as it exists in much of the world

Back to analogy

In other words, a balancing act is required between:

  • Data with no prerequisites needed
  • Data as it exists “in the wild”

Data “taming”

Data “taming” sets out to balance:

  • On the one hand: Performing enough pre-processing so that data is accessible to R novices
  • On the other: Not performing so much pre-processing as to betray the reality of data as it exists “in the wild”

“Tame” data principles

We propose the following “tame” data principles to remove the biggest hurdles R novices face:

  1. Clean variable names
  2. Identification variables in left-hand columns
  3. Clean dates
  4. Logically ordered categorical variables
  5. Consistent “tidy” format

fivethirtyeight package

In the fivethirtyeight R package, Chester Ismay, Jennifer Chunn, and I:

  • Take FiveThirtyEight’s raw article data from GitHub
  • Pre-process the raw data so that it follows “tame” data principles
  • Make the tame data, documentation, and original article easily accessible via an R package

Examples

The following examples involve code, so I suggest you follow along in the HTML version of the slides:

  1. In your browser, go to bit.ly/causeweb_tame
  2. In the left-hand menu, click on “Principle 1: Clean variable names”

Principle 1: Clean variable names

a) Comparing raw and tamed data

library(readr)
library(fivethirtyeight)

# Raw data: variable names are unwieldy & have spaces
flying_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv")
colnames(flying_raw)[c(5, 19)]
## [1] "Do you have any children under 18?"               
## [2] "In general, is itrude to bring a baby on a plane?"
# Tamed data: corresponding variable names are cleaner
colnames(flying)[c(5, 18)]
## [1] "children_under_18" "baby"

b) Why should we care?

Working with variable names that are long, unwieldy, and contain spaces is tricky: each one must be wrapped in backticks and typed out exactly.

mosaicplot(~ `Do you have any children under 18?` + `In general, is itrude to bring a baby on a plane?`, 
           data = flying_raw,  main = "Raw data",
           xlab = "Have a baby?", ylab = "Is it rude?")
mosaicplot(~ children_under_18 + baby,
           data = flying,  main = "Tamed data",
           xlab = "Have a baby?", ylab = "Is it rude?")
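
As a rough illustration of what taming names involves, here is a minimal base R sketch. The data frame `survey_raw`, its values, and the hand-picked short names are made up for illustration; this is not the package's actual processing script.

```r
# Hypothetical toy data with long, space-containing names like the raw survey
survey_raw <- data.frame(
  "Do you have any children under 18?" = c("No", "Yes", "No"),
  "In general, is it rude to bring a baby on a plane?" = c("No", "Somewhat", "Very"),
  check.names = FALSE,       # keep the unwieldy names exactly as typed
  stringsAsFactors = FALSE
)

# Tame by assigning short snake_case names (chosen by hand here)
survey_tame <- survey_raw
colnames(survey_tame) <- c("children_under_18", "baby")

# Tamed names need no backticks with `$` or in formulas
table(survey_tame$children_under_18, survey_tame$baby)
```

With the tamed names, novices can refer to columns the same way they refer to any other R object, without learning backtick quoting first.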

Principle 2: ID variables

This principle is more organizational. Any identification variables that uniquely identify the observations/rows should be placed in the left-hand columns, since they are of highest prominence. Such variables serve as keys when joining/merging datasets.

library(fivethirtyeight)

# Both title and IMDb site tag uniquely identify movies. Show only the first 8
# columns and first 3 rows of the dataset:
biopics[1:3, 1:8]
title                site       country  year_release  box_office  director           number_of_subjects  subject
10 Rillington Place  tt0066730  UK       1971          NA          Richard Fleischer  1                   John Christie
12 Years a Slave     tt2024544  US/UK    2013          56700000    Steve McQueen      1                   Solomon Northup
127 Hours            tt1542344  US/UK    2010          18300000    Danny Boyle        1                   Aron Ralston
# episode variable uniquely identifies episodes of "The Joy of Painting". Show
# only the first 8 columns and 3 randomly chosen rows, using the dplyr package
library(dplyr)
bob_ross %>% 
  select(1:8) %>% 
  sample_n(3)
episode  season  episode_num  title                      apple_frame  aurora_borealis  barn  beach
S20E05   20      5            DIVINE ELEGANCE            0            0                0     0
S23E11   23      11           FROZEN BEAUTY IN VIGNETTE  0            0                0     0
S14E04   14      4            SNOWY SOLITUDE             0            0                0     0
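
To make the link between ID variables and joins concrete, here is a small base R sketch. The data frames `movies` and `watchlist`, and the `watched` column, are hypothetical; only the titles, years, and site tags are taken from the output above.

```r
# Hypothetical toy data: the ID variables are not in the left-hand columns
movies <- data.frame(
  year_release = c(1971, 2013, 2010),
  title = c("10 Rillington Place", "12 Years a Slave", "127 Hours"),
  site = c("tt0066730", "tt2024544", "tt1542344"),
  stringsAsFactors = FALSE
)

# Move the ID variables to the left-hand columns
id_vars <- c("title", "site")
movies <- movies[, c(id_vars, setdiff(names(movies), id_vars))]

# A second made-up table keyed on the same ID variable
watchlist <- data.frame(
  site = c("tt2024544", "tt1542344"),
  watched = c(TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Left join on the ID variable; movies absent from `watchlist` get NA
movies_joined <- merge(movies, watchlist, by = "site", all.x = TRUE)
movies_joined
```

Because `site` uniquely identifies each row, the join cannot duplicate movies, which is the property that makes ID variables worth surfacing.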

Principle 3: Dates

a) Comparing raw and tamed data

library(readr)
library(dplyr)
library(fivethirtyeight)

# Raw data: year, month, day are separate variables
US_births_1994_2003_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv")
head(US_births_1994_2003_raw)
year  month  date_of_month  day_of_week  births
1994  1      1              6            8096
1994  1      2              7            7772
1994  1      3              1            10142
1994  1      4              2            11248
1994  1      5              3            11053
1994  1      6              4            11406
# Tamed data: variable date of type "date" included
head(US_births_1994_2003)
year  month  date_of_month  date        day_of_week  births
1994  1      1              1994-01-01  Sat          8096
1994  1      2              1994-01-02  Sun          7772
1994  1      3              1994-01-03  Mon          10142
1994  1      4              1994-01-04  Tues         11248
1994  1      5              1994-01-05  Wed          11053
1994  1      6              1994-01-06  Thurs        11406

b) Why should we care?

Without a variable of type date, making time series plots is difficult.

# Use filter command from dplyr package for data wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

# Plot time series via base R:
plot(x = US_births_1999$date, y = US_births_1999$births, type = "l", 
     xlab = "Date", ylab = "Number of births", main = "1999 US Births")
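
For reference, a date variable like this can be built from the separate year, month, and day columns. The following is a minimal base R sketch on the first three raw rows shown earlier; the name `births_toy` is made up, and the package's own processing code may do this differently.

```r
# Toy rows mirroring the raw layout: year, month, day in separate columns
births_toy <- data.frame(
  year = c(1994, 1994, 1994),
  month = c(1, 1, 1),
  date_of_month = c(1, 2, 3),
  births = c(8096, 7772, 10142)
)

# Paste the pieces together and parse them into a single Date variable
births_toy$date <- as.Date(
  paste(births_toy$year, births_toy$month, births_toy$date_of_month, sep = "-"),
  format = "%Y-%m-%d"
)

class(births_toy$date)
births_toy
```

Once `date` has class `Date`, plotting functions place observations correctly along a time axis without any further work from the student.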

Principle 4: Categorical variables

a) Comparing raw and tamed data

library(readr)
library(ggplot2)
library(fivethirtyeight)
bechdel_raw <- read_csv("https://raw.githubusercontent.com/rudeboybert/fivethirtyeight/master/data-raw/bechdel/movies.csv")

# Raw data: categorical variable clean_test is saved as characters/strings
bechdel_raw$clean_test[1:5]
## [1] "notalk" "ok"     "notalk" "notalk" "men"
# Tamed data: clean_test is saved as factor
bechdel$clean_test[1:5]
## [1] notalk ok     notalk notalk men   
## Levels: nowomen < notalk < men < dubious < ok

b) Why should we care?

By default, R plots character values in alphabetical order, whereas factors let us set the order of the levels. Reordering a categorical variable/factor in R is tough, especially for new R users. With the tamed data, the bars follow the hierarchical ordering of the Bechdel test outcomes:

# Using raw data:
ggplot(bechdel_raw, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Raw data")

# Using tamed data:
ggplot(bechdel, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Tamed data")
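
The conversion behind the tamed variable can be sketched in base R as follows. The vector reuses the five raw values printed earlier and the level order shown in the tamed output; the object names are made up for illustration.

```r
# Character vector as in the raw data (first five values)
clean_test_raw <- c("notalk", "ok", "notalk", "notalk", "men")

# Convert to an ordered factor, listing the levels in their hierarchical order
clean_test <- factor(
  clean_test_raw,
  levels = c("nowomen", "notalk", "men", "dubious", "ok"),
  ordered = TRUE
)

levels(clean_test)   # order used by plots and tables
table(clean_test)    # counts reported in level order, including empty levels
```

Plotting functions such as `geom_bar()` use the level order of a factor, so setting it once in the data spares novices from reordering inside every plot.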

Principle 5: “Tidy” data format

“Tidy” data is in narrow/long format, as opposed to wide format. Many R packages in the tidyverse (ggplot2, dplyr, etc.) standardize on this format for their input and output data frames. Three interrelated rules make a dataset “tidy”:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

a) Comparing raw and tamed data

library(dplyr)
library(ggplot2)
library(fivethirtyeight)

# In fivethirtyeight package drinks data is kept in original non-tidy (wide) format
head(drinks)
country            beer_servings  spirit_servings  wine_servings  total_litres_of_pure_alcohol
Afghanistan        0              0                0              0.0
Albania            89             132              54             4.9
Algeria            25             0                14             0.7
Andorra            245            138              312            12.4
Angola             217            57               45             5.9
Antigua & Barbuda  102            128              45             4.9
# tidyr::gather() code to convert to tidy format in help file: ?drinks
library(tidyr)
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>% 
  arrange(country)
head(drinks_tidy)
country      total_litres_of_pure_alcohol  type             servings
Afghanistan  0.0                           beer_servings    0
Afghanistan  0.0                           spirit_servings  0
Afghanistan  0.0                           wine_servings    0
Albania      4.9                           beer_servings    89
Albania      4.9                           spirit_servings  132
Albania      4.9                           wine_servings    54
ggplot(drinks_tidy, aes(x = type, y = servings)) + 
  geom_boxplot() +
  labs(x = "Alcohol type", y = "Number of servings", title = "Worldwide alcohol consumption")
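
In more recent versions of tidyr (1.0.0 and later), `gather()` is superseded by `pivot_longer()`. As a hedged alternative, the same reshaping might look like the sketch below, run on a two-row toy copy of the wide data; `drinks_wide` and `drinks_long` are names made up for illustration.

```r
library(tidyr)  # assumes tidyr >= 1.0.0 is installed

# First two rows of the wide drinks data
drinks_wide <- data.frame(
  country = c("Afghanistan", "Albania"),
  beer_servings = c(0, 89),
  spirit_servings = c(0, 132),
  wine_servings = c(0, 54),
  total_litres_of_pure_alcohol = c(0.0, 4.9),
  stringsAsFactors = FALSE
)

# Stack the three servings columns into a type/servings pair of columns
drinks_long <- pivot_longer(
  drinks_wide,
  cols = c(beer_servings, spirit_servings, wine_servings),
  names_to = "type",
  values_to = "servings"
)

drinks_long
```

Each country now occupies three rows, one per alcohol type, which is the shape `ggplot()` expects when `type` is mapped to an aesthetic.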

Advanced example

a) Comparing raw and tamed data

In the tamed pres_2016_trail data frame we:

  1. Ensured lat and lng were in numerical (decimal degree) format, not in degree/minute/second, North/South, and East/West format (a variation on Principle 3: Dates)
  2. Combined both CSVs into one and added the variable candidate (Principle 5: Tidy data format)

library(dplyr)
library(fivethirtyeight)

# Tamed data: 
pres_2016_trail %>% 
  arrange(date) %>% 
  head()
candidate  date        location            lat       lng
Trump      2016-09-01  Wilmington, OH      39.44534  -83.82854
Trump      2016-09-03  Detroit, MI         42.33143  -83.04575
Clinton    2016-09-05  Cleveland, Ohio     41.49932  -81.69436
Clinton    2016-09-05  Hampton, Illinois   41.55587  -90.40930
Clinton    2016-09-06  Tampa, Florida      27.95058  -82.45718
Trump      2016-09-06  Virginia Beach, VA  36.85293  -75.97799
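
The two taming steps can be sketched in base R as follows. The helper `dms_to_decimal()`, its argument names, the input coordinates, and the one-row data frames are all hypothetical; the raw files' exact layout is not shown here, so treat this as an illustration of the idea rather than the package's code.

```r
# Hypothetical helper: degrees/minutes/seconds plus hemisphere -> decimal degrees
dms_to_decimal <- function(degrees, minutes, seconds, hemisphere) {
  decimal <- degrees + minutes / 60 + seconds / 3600
  # South latitudes and West longitudes are negative by convention
  ifelse(hemisphere %in% c("S", "W"), -decimal, decimal)
}

lat_example <- dms_to_decimal(39, 30, 0, "N")
lng_example <- dms_to_decimal(83, 45, 0, "W")

# Stack one made-up table per candidate, adding a `candidate` variable first
clinton <- data.frame(location = "Cleveland, Ohio", stringsAsFactors = FALSE)
trump <- data.frame(location = "Detroit, MI", stringsAsFactors = FALSE)
clinton$candidate <- "Clinton"
trump$candidate <- "Trump"
trail <- rbind(clinton, trump)

trail
```

Keeping both candidates in one data frame with a `candidate` column is what later allows a single `facet_wrap(~candidate)` call instead of two separate plots.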

b) Why should we care?

So we can easily create a faceted map!

library(ggplot2)
library(maps)
ggplot(data = pres_2016_trail, aes(x = lng, y = lat)) +
  facet_wrap(~candidate) +
  geom_point(col = "black", size = 2) + 
  coord_map() + 
  # Override data & aes()thetic mapping set above to trace path of state outlines:
  geom_path(data = map_data("state"), aes(x = long, y = lat, group = group), size = 0.1)

Comments

  • Analogy I heard that I like: fivethirtyeight is like a data petting zoo
  • There is no “universal” balance between the two goals: it will vary depending on your students’ experience, requirements, and needs
  • Tame data principles and fivethirtyeight can be used in other contexts: 1) intermediate-level data science courses and 2) advanced projects

Used in data science courses

  1. Recruited STAT231 Data Science students to “tame” datasets STAT135 Intro students found for their final projects
  2. Available on GitHub: the package authors’ data wrangling source code for converting 538 raw CSV data to “tamed” format: process_data_sets_albert.R, process_data_sets_chester.R, process_data_sets_jen.R

Used for advanced projects

Other resources