What data to use in introductory statistics and data science courses?
Ideally data that’s:
- Rich enough to answer meaningful questions with
- Real enough to ensure that there is context
- Realistic enough to convey the reality of much of the world’s data
On the one hand, Cobb (2015) argues that we should
- “Teach through research”
- “Minimize prerequisites to research”
love @JennyBryan's analogy of classroom data as teddybears & real data like a grizzly bear with salmon blood dripping out its mouth #jsm2015
— sandy griffith (@sgrifter) August 11, 2015
In other words, a balancing act is required between:
| Data with no prerequisites needed | Data as it exists “in the wild” |
|---|---|
Data “taming” sets out to balance:
We propose the following “tame” data principles to remove the biggest hurdles that R novices face:
- Clean variable names
- Identification variables in left-hand columns
- Clean dates
- Logically ordered categorical variables
- Consistent “tidy” format
In the fivethirtyeight R package, Chester Ismay, Jennifer Chunn, and I:
- Take FiveThirtyEight’s raw article data from GitHub
- Pre-process the raw data so that it follows “tame” data principles
- Make the tame data, documentation, and original article easily accessible via an R package
The following examples involve code, so I suggest you follow along in the HTML version of the slides.
flying-etiquette.csv
library(readr)
library(fivethirtyeight)
# Raw data: variable names are unwieldy & have spaces
flying_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/flying-etiquette-survey/flying-etiquette.csv")
colnames(flying_raw)[c(5, 19)]
## [1] "Do you have any children under 18?"
## [2] "In general, is itrude to bring a baby on a plane?"
# Tamed data: corresponding variable names are cleaner
colnames(flying)[c(5, 18)]
## [1] "children_under_18" "baby"
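As a rough illustration of how unwieldy names like these could be shortened by hand, here is a minimal base R sketch. The toy data frame `survey_raw`, its values, and the object names are assumptions made for illustration; this is not the code the package authors used.

```r
# Toy data frame (hypothetical values) with long survey-question column names
survey_raw <- data.frame(
  "Do you have any children under 18?" = c("Yes", "No", "No"),
  "In general, is itrude to bring a baby on a plane?" = c("No", "Somewhat", "Very"),
  check.names = FALSE  # keep the spaces and punctuation as-is
)

# Columns whose names contain spaces must be wrapped in backticks
table(survey_raw$`Do you have any children under 18?`)

# Rename by hand to short snake_case names, as in the tamed `flying` data
survey_tame <- survey_raw
names(survey_tame) <- c("children_under_18", "baby")

# The short names can be used directly, with no backticks
table(survey_tame$children_under_18)
```

Renaming by position is fragile if the column order changes, so in practice one might match on the old names instead.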
Working with variable names that are long, unwieldy, and contain spaces is tricky.
mosaicplot(~ `Do you have any children under 18?` + `In general, is itrude to bring a baby on a plane?`,
data = flying_raw, main = "Raw data",
xlab = "Have a baby?", ylab = "Is it rude?")
mosaicplot(~ children_under_18 + baby,
data = flying, main = "Tamed data",
xlab = "Have a baby?", ylab = "Is it rude?")
This principle is more organizational. Any identification variables that uniquely identify the observations/rows should be placed in the left-hand columns, since they are of highest prominence. Such variables are used as keys when joining/merging datasets.
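To illustrate how an identification variable keys a join, here is a small base R sketch. Both toy data frames are constructed for illustration: the `title` and `site` values come from the `biopics` data, while the `watchlist` table and its `watched` column are entirely hypothetical.

```r
# Toy data: `site` (the IMDb tag) uniquely identifies each movie
movies <- data.frame(
  site  = c("tt0066730", "tt2024544", "tt1542344"),
  title = c("10 Rillington Place", "12 Years a Slave", "127 Hours")
)

# A second, hypothetical table keyed by the same identification variable
watchlist <- data.frame(
  site    = c("tt2024544", "tt1542344"),
  watched = c(TRUE, FALSE)
)

# Join on the key; all.x = TRUE keeps movies with no match (a left join)
joined <- merge(movies, watchlist, by = "site", all.x = TRUE)
joined
```

Movies without a match in `watchlist` get `NA` in the `watched` column rather than being dropped.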
library(fivethirtyeight)
# Both title and imdb site tag uniquely identify movies. Show only 8 first
# columns and 3 first rows of dataset:
biopics[1:3, 1:8]

| title | site | country | year_release | box_office | director | number_of_subjects | subject |
|---|---|---|---|---|---|---|---|
| 10 Rillington Place | tt0066730 | UK | 1971 | NA | Richard Fleischer | 1 | John Christie |
| 12 Years a Slave | tt2024544 | US/UK | 2013 | 56700000 | Steve McQueen | 1 | Solomon Northup |
| 127 Hours | tt1542344 | US/UK | 2010 | 18300000 | Danny Boyle | 1 | Aron Ralston |
# episode variable uniquely identifies episodes of "The Joy of Painting". Show
# only 8 first columns and 3 randomly chosen rows of dataset using dplyr package
library(dplyr)
bob_ross %>%
select(1:8) %>%
sample_n(3)

| episode | season | episode_num | title | apple_frame | aurora_borealis | barn | beach |
|---|---|---|---|---|---|---|---|
| S20E05 | 20 | 5 | DIVINE ELEGANCE | 0 | 0 | 0 | 0 |
| S23E11 | 23 | 11 | FROZEN BEAUTY IN VIGNETTE | 0 | 0 | 0 | 0 |
| S14E04 | 14 | 4 | SNOWY SOLITUDE | 0 | 0 | 0 | 0 |
US_births_1994-2003_CDC_NCHS.csv
library(readr)
library(dplyr)
library(fivethirtyeight)
# Raw data: year, month, day are separate variables
US_births_1994_2003_raw <- read_csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv")
head(US_births_1994_2003_raw)

| year | month | date_of_month | day_of_week | births |
|---|---|---|---|---|
| 1994 | 1 | 1 | 6 | 8096 |
| 1994 | 1 | 2 | 7 | 7772 |
| 1994 | 1 | 3 | 1 | 10142 |
| 1994 | 1 | 4 | 2 | 11248 |
| 1994 | 1 | 5 | 3 | 11053 |
| 1994 | 1 | 6 | 4 | 11406 |
# Tamed data: variable date of type "date" included
head(US_births_1994_2003)

| year | month | date_of_month | date | day_of_week | births |
|---|---|---|---|---|---|
| 1994 | 1 | 1 | 1994-01-01 | Sat | 8096 |
| 1994 | 1 | 2 | 1994-01-02 | Sun | 7772 |
| 1994 | 1 | 3 | 1994-01-03 | Mon | 10142 |
| 1994 | 1 | 4 | 1994-01-04 | Tues | 11248 |
| 1994 | 1 | 5 | 1994-01-05 | Wed | 11053 |
| 1994 | 1 | 6 | 1994-01-06 | Thurs | 11406 |
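A `date` column like the one above could be built from the separate year, month, and day columns along these lines. This is a base R sketch on the first three rows of the raw table, with `births_raw` as an illustrative name; it is not necessarily how the package constructs the variable.

```r
# First three rows of the raw births data
births_raw <- data.frame(
  year = c(1994, 1994, 1994),
  month = c(1, 1, 1),
  date_of_month = c(1, 2, 3),
  births = c(8096, 7772, 10142)
)

# Paste the pieces into "YYYY-M-D" strings and parse them as Date objects
births_raw$date <- as.Date(
  paste(births_raw$year, births_raw$month, births_raw$date_of_month, sep = "-"),
  format = "%Y-%m-%d"
)

class(births_raw$date)  # "Date"

# Date columns support arithmetic and sort chronologically
diff(births_raw$date)
```

Because the result is a genuine `Date` vector, plotting functions can place it on a continuous time axis instead of treating it as text.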
Without a variable of type date, making time series plots is difficult.
# Use filter command from dplyr package for data wrangling
US_births_1999 <- US_births_1994_2003 %>%
filter(year == 1999)
# Plot time series via base R:
plot(x = US_births_1999$date, y = US_births_1999$births, type = "l",
xlab = "Date", ylab = "Number of births", main = "1999 US Births")
movies.csv
library(readr)
library(ggplot2)
library(fivethirtyeight)
bechdel_raw <- read_csv("https://raw.githubusercontent.com/rudeboybert/fivethirtyeight/master/data-raw/bechdel/movies.csv")
# Raw data: categorical variable clean_test is saved as characters/strings
bechdel_raw$clean_test[1:5]
## [1] "notalk" "ok" "notalk" "notalk" "men"
# Tamed data: clean_test is saved as factor
bechdel$clean_test[1:5]
## [1] notalk ok notalk notalk men
## Levels: nowomen < notalk < men < dubious < ok
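An ordered factor like this could be created from the raw character vector roughly as follows. The level order is the one printed above; the sketch uses base R's `factor()` with illustrative object names and may differ from the package's actual preprocessing code.

```r
# Raw values as character strings (the first five entries of clean_test)
clean_test_raw <- c("notalk", "ok", "notalk", "notalk", "men")

# Character vectors sort alphabetically
sort(unique(clean_test_raw))

# Specify the level order explicitly instead of relying on the alphabet
clean_test <- factor(
  clean_test_raw,
  levels = c("nowomen", "notalk", "men", "dubious", "ok"),
  ordered = TRUE
)

levels(clean_test)

# Tabulation (and plotting) now follows the level order, including empty levels
table(clean_test)
```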
By default, R plots character variables in alphabetical order, whereas with factors we can set the order of the levels. Reordering a categorical variable/factor in R is tough, especially for new R users. In this case, we can order the bars along the hierarchy of Bechdel test outcomes:
# Using raw data:
ggplot(bechdel_raw, aes(x = clean_test)) +
geom_bar() +
labs(x = "Bechdel test outcome", y = "count", title = "Raw data")
# Using tamed data:
ggplot(bechdel, aes(x = clean_test)) +
geom_bar() +
labs(x = "Bechdel test outcome", y = "count", title = "Tamed data")
“Tidy” data format is narrow/long format, as opposed to wide. This format is the standard for input/output data frames across many R packages in the tidyverse, such as ggplot2 and dplyr. There are three interrelated rules which make a dataset “tidy”:
- Each variable must have its own column
- Each observation must have its own row
- Each value must have its own cell
drinks.csv
library(dplyr)
library(ggplot2)
library(fivethirtyeight)
# In fivethirtyeight package drinks data is kept in original non-tidy (wide) format
head(drinks)

| country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol |
|---|---|---|---|---|
| Afghanistan | 0 | 0 | 0 | 0.0 |
| Albania | 89 | 132 | 54 | 4.9 |
| Algeria | 25 | 0 | 14 | 0.7 |
| Andorra | 245 | 138 | 312 | 12.4 |
| Angola | 217 | 57 | 45 | 5.9 |
| Antigua & Barbuda | 102 | 128 | 45 | 4.9 |
# tidyr::gather() code to convert to tidy format in help file: ?drinks
library(tidyr)
drinks_tidy <- drinks %>%
gather(type, servings, -c(country, total_litres_of_pure_alcohol)) %>%
arrange(country)
head(drinks_tidy)

| country | total_litres_of_pure_alcohol | type | servings |
|---|---|---|---|
| Afghanistan | 0.0 | beer_servings | 0 |
| Afghanistan | 0.0 | spirit_servings | 0 |
| Afghanistan | 0.0 | wine_servings | 0 |
| Albania | 4.9 | beer_servings | 89 |
| Albania | 4.9 | spirit_servings | 132 |
| Albania | 4.9 | wine_servings | 54 |
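In more recent versions of tidyr, `gather()` has been superseded by `pivot_longer()`. The following sketch performs the same wide-to-long reshaping on the first three rows of the wide `drinks` table; the object names `drinks_wide` and `drinks_long` are chosen for illustration.

```r
library(tidyr)

# First three rows of the wide drinks data
drinks_wide <- data.frame(
  country = c("Afghanistan", "Albania", "Algeria"),
  beer_servings = c(0, 89, 25),
  spirit_servings = c(0, 132, 0),
  wine_servings = c(0, 54, 14),
  total_litres_of_pure_alcohol = c(0.0, 4.9, 0.7)
)

# Stack the three *_servings columns into type/servings pairs
drinks_long <- pivot_longer(
  drinks_wide,
  cols = c(beer_servings, spirit_servings, wine_servings),
  names_to = "type",
  values_to = "servings"
)

drinks_long
```

Each country now occupies three rows, one per alcohol type, which is the shape that a `ggplot2` call with `aes(x = type, y = servings)` expects.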
ggplot(drinks_tidy, aes(x = type, y = servings)) +
geom_boxplot() +
labs(x = "Alcohol type", y = "Number of servings", title = "Worldwide alcohol consumption")
clinton.csv, trump.csv
In the tamed pres_2016_trail data frame we:
- Ensured lat and lng were in numerical format, not in degree/minute/second, North/South, and East/West format (a variation on Principle 3: Dates)
- Combined both files into one data frame, distinguished by the variable candidate (Principle 5: Tidy data format)
library(dplyr)
library(fivethirtyeight)
# Tamed data:
pres_2016_trail %>%
arrange(date) %>%
head()

| candidate | date | location | lat | lng |
|---|---|---|---|---|
| Trump | 2016-09-01 | Wilmington, OH | 39.44534 | -83.82854 |
| Trump | 2016-09-03 | Detroit, MI | 42.33143 | -83.04575 |
| Clinton | 2016-09-05 | Cleveland, Ohio | 41.49932 | -81.69436 |
| Clinton | 2016-09-05 | Hampton, Illinois | 41.55587 | -90.40930 |
| Clinton | 2016-09-06 | Tampa, Florida | 27.95058 | -82.45718 |
| Trump | 2016-09-06 | Virginia Beach, VA | 36.85293 | -75.97799 |
So we can easily create a faceted map!
library(ggplot2)
library(maps)
ggplot(data = pres_2016_trail, aes(x = lng, y = lat)) +
facet_wrap(~candidate) +
geom_point(col = "black", size = 2) +
coord_map() +
# Override data & aes()thetic mapping set above to trace path of state outlines:
geom_path(data = map_data("state"), aes(x = long, y = lat, group = group), size = 0.1)
- Recruited STAT231 Data Science students to “tame” datasets that STAT135 Intro students found for their final projects
- Available on GitHub: data wrangling source code by package authors to convert 538 raw CSV data to “tamed” format
process_data_sets_albert.R, process_data_sets_chester.R, process_data_sets_jen.R
- fivethirtyeight package is in maintenance mode: no new development, only new datasets need to be added
- Get student interns to do it instead!
- Internship model of learning/development: learning R package construction, GitHub, communication and project management skills, etc. RStudio’s 2018 broom package summer internship follows a similar model.
- Undergraduate student-written data wrangling source code to convert 538 raw CSV data to “tamed” format
process_data_sets_maggie.R, process_data_sets_meredith.R