Tame data principles and the fivethirtyeight R package

Today’s focus

What data to use in introductory statistics and data science courses?

Ideally data that’s:

Rich enough to answer meaningful questions with

Real enough to ensure that there is context

Realistic enough to convey the reality of much of the world’s data

One goal

On the one hand, Cobb (2015) argues that we should

“Teach through research”

“Minimize prerequisites to research”

Another goal

On the other hand, from the New York Times:

Analogy for second goal

love @JennyBryan's analogy of classroom data as teddybears & real data like a grizzly bear with salmon blood dripping out its mouth #jsm2015
— sandy griffith (@sgrifter) August 11, 2015

Two conflicting goals

On the one hand: Minimize prerequisites to research
On the other: Do not betray reality of data as it exists in much of the world

Back to analogy

In other words, a balancing act is required between:

Data with no prerequisites needed	Data as it exists “in the wild”

Data “taming”

Data “taming” sets out to balance:

On the one hand: Performing enough pre-processing so that data is accessible to R novices
On the other: Not performing so much pre-processing as to betray the reality of data as it exists “in the wild”

“Tame” data principles

We propose the following “tame” data principles to remove the biggest hurdles R novices face:

Clean variable names

Identification variables in left-hand columns

Clean dates

Logically ordered categorical variables

Consistent “tidy” format

fivethirtyeight package

In the fivethirtyeight R package, Chester Ismay, Jennifer Chunn, and I:

Take FiveThirtyEight’s raw article data from GitHub

Pre-process the raw data so that it follows “tame” data principles

Make the tame data, documentation, and original article easily accessible via an R package
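
As a rough sketch of what that access might look like in practice, one could list the bundled datasets and peek at one of them. This assumes the package has been installed from CRAN; the guard below simply skips the example otherwise.

```r
# Assumption: fivethirtyeight is installed, e.g. via install.packages("fivethirtyeight").
# requireNamespace() returns FALSE instead of erroring if it is not.
has_pkg <- requireNamespace("fivethirtyeight", quietly = TRUE)

if (has_pkg) {
  library(fivethirtyeight)

  # List the names and titles of all datasets bundled with the package
  print(data(package = "fivethirtyeight")$results[, c("Item", "Title")])

  # Datasets are available by name once the package is attached
  print(head(bechdel))

  # The documentation for a dataset can be opened with, for example:
  # help("bechdel", package = "fivethirtyeight")
}
```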

Examples

The following examples involve code, so I suggest you follow along in the HTML version of the slides:

In your browser, go to bit.ly/causeweb_tame
In the left-hand menu, click on “Principle 1: Clean variable names”

Principle 1: Clean variable names

a) Comparing raw and tamed data

Original article: 41 Percent Of Fliers Think You’re Rude If You Recline Your Seat
Raw CSV data: flying-etiquette.csv

library(readr)
library(fivethirtyeight)

# Raw data: variable names are unwieldy & have spaces
flying_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv")
colnames(flying_raw)[c(5, 19)]

## [1] "Do you have any children under 18?"               
## [2] "In general, is itrude to bring a baby on a plane?"

# Tamed data: corresponding variable names are cleaner
colnames(flying)[c(5, 18)]

## [1] "children_under_18" "baby"

b) Why should we care?

Working with variable names that are long, unwieldy, and contain spaces is tricky: each one must be wrapped in backticks and typed out exactly.

mosaicplot(~ `Do you have any children under 18?` + `In general, is itrude to bring a baby on a plane?`, 
           data = flying_raw,  main = "Raw data",
           xlab = "Have a baby?", ylab = "Is it rude?")
mosaicplot(~ children_under_18 + baby,
           data = flying,  main = "Tamed data",
           xlab = "Have a baby?", ylab = "Is it rude?")
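
One possible way to produce cleaner names automatically is sketched below. The helper clean_names_sketch() is hypothetical and written for illustration only; it is not the code used to build the package, where short names such as children_under_18 and baby were presumably chosen by hand.

```r
# Hypothetical helper: lower-case a name, replace every run of characters
# that are not letters or digits with a single underscore, then trim
# underscores left at either end.
clean_names_sketch <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  x <- gsub("^_+|_+$", "", x)
  x
}

# The two raw column names shown above
raw_names <- c(
  "Do you have any children under 18?",
  "In general, is itrude to bring a baby on a plane?"
)

cleaned_names <- clean_names_sketch(raw_names)
print(cleaned_names)
```

Automatic cleaning removes the spaces and punctuation, but the results are still long, which is one reason hand-picked short names can be preferable for novices.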

Principle 2: ID variables

More organizational. Any identification variables that uniquely identify the observations/rows should be placed in the left-hand columns, since they are of highest prominence. Such variables serve as the keys when joining/merging datasets.

Original articles:
1. ‘Straight Outta Compton’ Is The Rare Biopic Not About White Dudes
2. A Statistical Analysis of the Work of Bob Ross
Raw CSV data:
1. biopics.csv
2. elements-by-episode.csv

library(fivethirtyeight)

# Both title and IMDb site tag uniquely identify movies. Show only the first 8
# columns and first 3 rows of the dataset:
biopics[1:3, 1:8]

title	site	country	year_release	box_office	director	number_of_subjects	subject
10 Rillington Place	tt0066730	UK	1971	NA	Richard Fleischer	1	John Christie
12 Years a Slave	tt2024544	US/UK	2013	56700000	Steve McQueen	1	Solomon Northup
127 Hours	tt1542344	US/UK	2010	18300000	Danny Boyle	1	Aron Ralston

# episode variable uniquely identifies episodes of "The Joy of Painting". Show
# only the first 8 columns and 3 randomly chosen rows of the dataset using the dplyr package
library(dplyr)
bob_ross %>% 
  select(1:8) %>% 
  sample_n(3)

episode	season	episode_num	title
S20E05	20	5	DIVINE ELEGANCE
S23E11	23	11	FROZEN BEAUTY IN VIGNETTE
S14E04	14	4	SNOWY SOLITUDE
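
To make the "left-hand columns" and "key for joins" ideas concrete, here is a minimal base R sketch. The episodes data frame reuses two rows from the output above with the columns deliberately shuffled; the notes data frame and its contents are made up purely for illustration.

```r
# Two rows from the output above, with the ID variable deliberately last
episodes <- data.frame(
  title       = c("DIVINE ELEGANCE", "SNOWY SOLITUDE"),
  season      = c(20, 14),
  episode_num = c(5, 4),
  episode     = c("S20E05", "S14E04"),
  stringsAsFactors = FALSE
)

# Move the identification variable to the left-most column
id_var <- "episode"
episodes <- episodes[, c(id_var, setdiff(names(episodes), id_var))]

# Hypothetical second table keyed by the same ID (contents are made up)
notes <- data.frame(
  episode = c("S14E04", "S20E05"),
  note    = c("note A", "note B"),
  stringsAsFactors = FALSE
)

# The ID variable is the key for the join
joined <- merge(episodes, notes, by = "episode")
print(joined)
```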

Principle 3: Dates

a) Comparing raw and tamed data

Original article: Some People Are Too Superstitious To Have A Baby On Friday The 13th
Raw CSV data: US_births_1994-2003_CDC_NCHS.csv

library(readr)
library(dplyr)
library(fivethirtyeight)

# Raw data: year, month, day are separate variables
US_births_1994_2003_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv")
head(US_births_1994_2003_raw)

year	month	date_of_month	day_of_week	births
1994	1	1	6	8096
1994	1	2	7	7772
1994	1	3	1	10142
1994	1	4	2	11248
1994	1	5	3	11053
1994	1	6	4	11406

# Tamed data: variable date of type "date" included
head(US_births_1994_2003)

year	month	date_of_month	date	day_of_week	births
1994	1	1	1994-01-01	Sat	8096
1994	1	2	1994-01-02	Sun	7772
1994	1	3	1994-01-03	Mon	10142
1994	1	4	1994-01-04	Tues	11248
1994	1	5	1994-01-05	Wed	11053
1994	1	6	1994-01-06	Thurs	11406

b) Why should we care?

Without a variable of type date, making time series plots is difficult.

# Use filter command from dplyr package for data wrangling
US_births_1999 <- US_births_1994_2003 %>%
  filter(year == 1999)

# Plot time series via base R:
plot(x = US_births_1999$date, y = US_births_1999$births, type = "l", 
     xlab = "Date", ylab = "Number of births", main = "1999 US Births")
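
For reference, a date variable like the one in the tamed data can be built from separate year, month, and day columns. The following is a minimal base R sketch using the first three raw rows shown above; it is one possible approach and not necessarily how the package authors did it.

```r
# First three rows of the raw data shown above
births_raw <- data.frame(
  year          = rep(1994L, 3),
  month         = rep(1L, 3),
  date_of_month = 1:3,
  births        = c(8096, 7772, 10142)
)

# Paste the pieces into ISO 8601 strings (YYYY-MM-DD), then convert to Date
births_raw$date <- as.Date(
  sprintf("%d-%02d-%02d",
          births_raw$year, births_raw$month, births_raw$date_of_month)
)

# The new column is a proper Date, so date arithmetic and time series plots work
print(class(births_raw$date))
print(diff(births_raw$date))
```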

Principle 4: Categorical variables

a) Comparing raw and tamed data

Original article: The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women
Raw CSV data: movies.csv

library(readr)
library(ggplot2)
library(fivethirtyeight)
bechdel_raw <- read_csv("https://raw.githubusercontent.com/rudeboybert/fivethirtyeight/master/data-raw/bechdel/movies.csv")

# Raw data: categorical variable clean_test is saved as characters/strings
bechdel_raw$clean_test[1:5]

## [1] "notalk" "ok"     "notalk" "notalk" "men"

# Tamed data: clean_test is saved as factor
bechdel$clean_test[1:5]

## [1] notalk ok     notalk notalk men   
## Levels: nowomen < notalk < men < dubious < ok

b) Why should we care?

By default, R plots character variables in alphabetical order, whereas with factors we can set the order of the levels. Reordering a categorical variable/factor in R is tough, especially for new R users. In this case, we can have the bars ordered along the hierarchical nature of the Bechdel test:

# Using raw data:
ggplot(bechdel_raw, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Raw data")

# Using tamed data:
ggplot(bechdel, aes(x = clean_test)) +
  geom_bar() +
  labs(x = "Bechdel test outcome", y = "count", title = "Tamed data")
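
The conversion from character to ordered factor can be sketched in base R as follows. The values and the level order are taken from the output above; the code itself is an illustration and not necessarily the package's own pre-processing code.

```r
# The five raw values shown above, stored as plain character strings
clean_test_raw <- c("notalk", "ok", "notalk", "notalk", "men")

# Level order taken from the tamed data: nowomen < notalk < men < dubious < ok
bechdel_levels <- c("nowomen", "notalk", "men", "dubious", "ok")

# Convert to an ordered factor with explicitly specified levels
clean_test <- factor(clean_test_raw, levels = bechdel_levels, ordered = TRUE)

# Counts now follow the level order (including levels with zero
# observations in this small sample) rather than alphabetical order
print(table(clean_test))

# Ordered factors also support comparisons
print(clean_test[1] < clean_test[2])
```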

Principle 5: “Tidy” data format

“Tidy” data format is narrow/long format, as opposed to wide. This format is chosen for input/output data frame standardization across many R packages in the tidyverse: ggplot2, dplyr, etc. There are three interrelated rules which make a dataset “tidy”:

Each variable must have its own column.
Each observation must have its own row.
Each value must have its own cell.

a) Comparing raw and tamed data

Original article: Dear Mona Followup: Where Do People Drink The Most Beer, Wine And Spirits?
Raw CSV data: drinks.csv

library(dplyr)
library(ggplot2)
library(fivethirtyeight)

# In fivethirtyeight package drinks data is kept in original non-tidy (wide) format
head(drinks)

country	beer_servings	spirit_servings	wine_servings	total_litres_of_pure_alcohol
Afghanistan	0	0	0	0.0
Albania	89	132	54	4.9
Algeria	25	0	14	0.7
Andorra	245	138	312	12.4
Angola	217	57	45	5.9
Antigua & Barbuda	102	128	45	4.9

# tidyr::gather() code to convert to tidy format in help file: ?drinks
library(tidyr)
drinks_tidy <- drinks %>%
  gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>% 
  arrange(country)
head(drinks_tidy)

country	total_litres_of_pure_alcohol	type	servings
Afghanistan	0.0	beer_servings	0
Afghanistan	0.0	spirit_servings	0
Afghanistan	0.0	wine_servings	0
Albania	4.9	beer_servings	89
Albania	4.9	spirit_servings	132
Albania	4.9	wine_servings	54

ggplot(drinks_tidy, aes(x = type, y = servings)) + 
  geom_boxplot() +
  labs(x = "Alcohol type", y = "Number of servings", title = "Worldwide alcohol consumption")
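
tidyr::gather() still works but has since been superseded by tidyr::pivot_longer(). Below is a minimal sketch of the same conversion, assuming tidyr 1.0.0 or later is installed and using a two-row toy copy of the data shown above rather than the full drinks data frame.

```r
library(tidyr)

# Two rows copied from the wide-format output above
drinks_wide <- data.frame(
  country                      = c("Albania", "Algeria"),
  beer_servings                = c(89, 25),
  spirit_servings              = c(132, 0),
  wine_servings                = c(54, 14),
  total_litres_of_pure_alcohol = c(4.9, 0.7),
  stringsAsFactors = FALSE
)

# Stack the three *_servings columns into a type/servings pair of columns,
# leaving country and total_litres_of_pure_alcohol as they are
drinks_long <- pivot_longer(
  drinks_wide,
  cols      = c(beer_servings, spirit_servings, wine_servings),
  names_to  = "type",
  values_to = "servings"
)

print(drinks_long)
```

Each country now occupies three rows, one per alcohol type, which is the shape that ggplot2 expects when type is mapped to an aesthetic.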

Advanced example

a) Comparing raw and tamed data

Original article: The Last 10 Weeks Of 2016 Campaign Stops In One Handy Gif
Raw CSV data were in two separate CSVs
- clinton.csv
- trump.csv

In the tamed pres_2016_trail data frame we:

Ensured lat and lng were in numerical format, not in degree/minute/second, North/South, and East/West format (A variation on Principle 3: Dates)
Combined both CSVs into one and added the variable candidate (Principle 5: Tidy data format)
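
The two steps above could be sketched in base R as follows. The helper dms_to_decimal() is hypothetical, and the one-row clinton and trump data frames are toy stand-ins built from rows of the tamed output; the actual pre-processing code in the package may differ.

```r
# Hypothetical helper: convert degrees/minutes/seconds plus a hemisphere
# letter into signed decimal degrees (South and West are negative)
dms_to_decimal <- function(deg, min, sec, hemisphere) {
  sign <- ifelse(hemisphere %in% c("S", "W"), -1, 1)
  sign * (deg + min / 60 + sec / 3600)
}

print(dms_to_decimal(30, 30, 0, "S"))

# Toy stand-ins for the two separate raw tables
clinton <- data.frame(
  date = as.Date("2016-09-05"), location = "Cleveland, Ohio",
  lat = 41.49932, lng = -81.69436, stringsAsFactors = FALSE
)
trump <- data.frame(
  date = as.Date("2016-09-01"), location = "Wilmington, OH",
  lat = 39.44534, lng = -83.82854, stringsAsFactors = FALSE
)

# Record which table each row came from, then stack the tables
clinton$candidate <- "Clinton"
trump$candidate <- "Trump"
trail <- rbind(clinton, trump)

# Put the new candidate variable in the left-most column
trail <- trail[, c("candidate", setdiff(names(trail), "candidate"))]
print(trail)
```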

library(dplyr)
library(fivethirtyeight)

# Tamed data: 
pres_2016_trail %>% 
  arrange(date) %>% 
  head()

candidate	date	location	lat	lng
Trump	2016-09-01	Wilmington, OH	39.44534	-83.82854
Trump	2016-09-03	Detroit, MI	42.33143	-83.04575
Clinton	2016-09-05	Cleveland, Ohio	41.49932	-81.69436
Clinton	2016-09-05	Hampton, Illinois	41.55587	-90.40930
Clinton	2016-09-06	Tampa, Florida	27.95058	-82.45718
Trump	2016-09-06	Virginia Beach, VA	36.85293	-75.97799

b) Why should we care?

So we can easily create a faceted map!

library(ggplot2)
library(maps)
ggplot(data = pres_2016_trail, aes(x = lng, y = lat)) +
  facet_wrap(~candidate) +
  geom_point(col = "black", size = 2) + 
  coord_map() + 
  # Override data & aes()thetic mapping set above to trace path of state outlines:
  geom_path(data = map_data("state"), aes(x = long, y = lat, group = group), size = 0.1)

Comments

Analogy I heard that I like: fivethirtyeight is like a data petting zoo

There is no “universal” balance between the two goals: it will vary depending on your students’ experience, requirements, and needs

Tame data principles and fivethirtyeight can be used in other contexts: 1) intermediate-level data science courses and 2) advanced projects

Used in data science courses

Recruited STAT231 Data Science students to “tame” datasets STAT135 Intro students found for their final projects

Available on GitHub: data wrangling source code written by the package authors to convert 538 raw CSV data to “tamed” format: process_data_sets_albert.R, process_data_sets_chester.R, process_data_sets_jen.R

Used for advanced projects

fivethirtyeight package is in maintenance mode: no new development, only need to add new datasets

Get student interns to do it instead!

Internship model of learning/development: learning R package construction, GitHub, communication and project management skills, etc. RStudio’s 2018 broom package summer internship follows a similar model.

Data wrangling source code written by undergraduate students to convert 538 raw CSV data to “tamed” format: process_data_sets_maggie.R, process_data_sets_meredith.R

Other resources

Complete TISE article (HTML, PDF)
Package homepage including list of all datasets
Link to this presentation bit.ly/causeweb_tame

“Tame” data principles and the
fivethirtyeight R package

Albert Y. Kim - Amherst College -> Smith College (July 2018)

Tuesday June 12, 2018
