May 29, 2015

Class Description

HUM 110

All Reed freshmen are required to take Humanities 110.

MATH241: Data Science

Only prereq was intro stats. No serious programming experience necessary.

Classroom: ETC 208

Demographics

18 students, mostly juniors and seniors.

Major                                                          Count
Mathematics                                                    4
Biological Science: Biology & Biochem and Molecular Biology    4
Other Science: Chemistry, Environmental Studies, Physics       4
Social Science: Political Science, Sociology                   2
Economics                                                      2
Misc: Psychology, Linguistics                                  2

In Practice

  • Real data, i.e. messy and in need of cleaning, from potentially disparate sources
  • Bottom-up: Let questions/data motivate the statistical methodology, rather than vice versa
  • Discussions in class
  • Lean heavily on the statistical programming language R

Tools

Environment: RStudio

How to get students to use R?

  • Key: Forget base R
  • How? The Hadleyverse of packages by Hadley Wickham.
  • In particular
    • dplyr package for data wrangling/manipulation
    • ggplot2 package for data visualization

The Grammar of Graphics

     

A statistical graphic consists of a mapping of data variables to aesthetic attributes of geometric objects that we can observe.

The Grammar of Graphics

Minard's map of Napoleon's Russian campaign of 1812:

The Grammar of Graphics

Data (Variable)    Geometric Object    Aesthetic Attribute of Geo Obj
longitude          points              x position
latitude           points              y position
army size          bars                width
army direction     bars                color
date               text                (x,y) position
temperature        lines               (x,y) position
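
As a rough illustration of how such a mapping is written in ggplot2, here is a minimal sketch. The `troops` data frame and its values are made up for illustration and are not Minard's actual figures, and point size stands in for bar width to keep the example short. Each entry inside `aes()` pairs one data variable with one aesthetic attribute of the geometric object.

```r
library(ggplot2)

# Hypothetical toy data (NOT Minard's real numbers), one row per location
troops <- data.frame(
  longitude = c(24.0, 28.5, 32.0, 37.6),
  latitude  = c(54.9, 54.6, 54.8, 55.8),
  army_size = c(340000, 175000, 145000, 100000),
  direction = c("advance", "advance", "advance", "advance")
)

# Map data variables to aesthetic attributes of a geometric object (points):
#   longitude -> x position, latitude -> y position,
#   army_size -> size,       direction -> color
p <- ggplot(troops, aes(x = longitude, y = latitude)) +
  geom_point(aes(size = army_size, color = direction))

# Printing the object draws the plot
print(p)
```

Swapping `geom_point()` for another geom changes the geometric object while the variable-to-aesthetic mapping stays the same, which is the core idea of the grammar.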

Results

Delayed Flights

Age of Airplanes

Dataset: OkCupid Data

  • Sample of 10% of San Francisco OkCupid users in June 2012 (\(n=5995\))
  • 40.2% of the sample was female
  • Use logistic regression to predict gender
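
A minimal sketch of what such a model could look like in base R follows. The data here are simulated rather than taken from the OkCupid sample, and the single predictor `height` and all coefficients are assumptions chosen purely for illustration.

```r
set.seed(1)

# Simulated stand-in for profile data (hypothetical predictor: height in inches)
n <- 500
height <- rnorm(n, mean = 68, sd = 4)

# Assumed relationship for the simulation: taller profiles are less likely female
p_female  <- plogis(30 - 0.45 * height)
is_female <- rbinom(n, size = 1, prob = p_female)
profiles  <- data.frame(height = height, is_female = is_female)

# Logistic regression: binary outcome modeled with a binomial GLM
fit <- glm(is_female ~ height, data = profiles, family = binomial)

# Predicted probabilities, thresholded at 0.5 to obtain a predicted class
pred_prob   <- predict(fit, type = "response")
pred_female <- pred_prob > 0.5
accuracy    <- mean(pred_female == (profiles$is_female == 1))
```

With real profile data, the formula would list whichever profile variables are used as predictors, and accuracy would ideally be checked on held-out profiles rather than on the training data.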

Listed Job

Self-Referenced Body Type

Dataset: Reed Jukebox

All 222,540 songs played on the Reed pool hall jukebox from 2003-2009 c/o Noah Pepper '09

Dataset: Reed Jukebox

date_time artist album track
Sun Dec 7 05:12:57 2003 Tom Petty and the Heartbreakers Into the Great Wide Open
Sun Dec 7 05:15:56 2003 Jefferson Airplane Somebody To Love
Sun Dec 7 05:23:04 2003 Led Zeppelin Led Zeppelin IV 08 When The Levee Breaks

Artist Popularity

Time Series

Maps

Interactive Shiny Apps

The Future

Statistics' Image Problem

  • I hear this a lot:
    • Me: Hi, I'm Albert. I'm a statistician.
    • Them: Statistics? I hated that class!

Solution: Data Visualization

  • Data visualization is a backdoor way to get students interested in statistics.
  • Prez from Season 4 of "The Wire":

Conclusions

Take Home Messages

Resources

Extras

dplyr Package

Features

  • Data manipulation is performed using verbs
  • The pipe operator %>%, pronounced "THEN"
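
To make the two features concrete, here is a small sketch using R's built-in `mtcars` dataset (chosen only for convenience; it is not part of the course material). It computes the same summary twice, once with nested function calls and once with verbs chained by the pipe.

```r
library(dplyr)

# Nested calls: must be read from the inside out
nested <- arrange(
  summarise(group_by(mtcars, cyl), avg_mpg = mean(mpg)),
  desc(avg_mpg)
)

# Piped verbs: take mtcars, THEN group, THEN summarise, THEN sort
piped <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg)) %>%
  arrange(desc(avg_mpg))
```

Both objects hold the same result; the piped version reads top to bottom in the order the operations are applied.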

Example: Houston Flights Dataset

Info on all domestic flights leaving Houston (IAH) in 2011:

  • flights: info on 227,496 flights
  • planes: info on 2,853 airplanes

What are the top 5 carriers using the oldest planes (averaged over all flights)?

Flights

The flights dataset:

date          dep     arr     carrier    flight    dest    plane
2011-01-01    1400    1500    AA         428       DFW     N576AA
2011-01-02    1401    1501    AA         428       DFW     N557AA
2011-01-03    1352    1502    AA         428       DFW     N541AA
2011-01-04    1403    1513    AA         428       DFW     N403AA
2011-01-05    1405    1507    AA         428       DFW     N492AA

Planes

The planes dataset:

plane     year    model             mfr                  no.seats
N576AA    1991    DC-9-82(MD-82)    MCDONNELL DOUGLAS    172
N557AA    1993    KITFOX IV         MARZ BARRY           2
N403AA    1974    S55A              RAVEN                1
N492AA    1989    DC-9-82(MD-82)    MCDONNELL DOUGLAS    172
N262AA    1985    DC-9-82(MD-82)    MCDONNELL DOUGLAS    172

Example: Age of Planes

The following sequence of verbs wrangles the data:

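library(dplyr)

left_join(flights, planes, by = "plane") %>%   # attach plane info to each flight
  select(carrier, plane, year) %>%             # keep only the columns needed
  mutate(age = 2011 - year) %>%                # plane age in 2011
  group_by(carrier) %>%
  summarise(avg_age = mean(age)) %>%           # average over all of a carrier's flights
  arrange(desc(avg_age)) %>%
  top_n(5, avg_age)                            # five carriers with the oldest planes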

Example: Age of Planes

carrier avg_age
MQ 29.421
AA 24.325
DL 20.760
US 19.078
UA 14.635
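
One detail worth checking when adapting this pipeline: `left_join()` keeps flights whose plane has no match in `planes`, leaving `year` as `NA`, and `mean()` returns `NA` if any input is missing. The sketch below rebuilds only the rows shown in the two tables above (so it is a toy sample, not the full dataset) and compares the plain average with one that uses `na.rm = TRUE`. Whether the full data contain unmatched planes is not stated in the slides.

```r
library(dplyr)

# Toy sample: only the rows shown in the flights and planes tables above
flights <- data.frame(
  carrier = rep("AA", 5),
  plane   = c("N576AA", "N557AA", "N541AA", "N403AA", "N492AA"),
  stringsAsFactors = FALSE
)
planes <- data.frame(
  plane = c("N576AA", "N557AA", "N403AA", "N492AA", "N262AA"),
  year  = c(1991, 1993, 1974, 1989, 1985),
  stringsAsFactors = FALSE
)

# N541AA has no entry in planes, so its year (and age) is NA after the join
joined <- left_join(flights, planes, by = "plane") %>%
  mutate(age = 2011 - year)

# Plain mean propagates the NA; na.rm = TRUE averages the matched flights only
naive  <- joined %>% group_by(carrier) %>% summarise(avg_age = mean(age))
robust <- joined %>% group_by(carrier) %>% summarise(avg_age = mean(age, na.rm = TRUE))
```

In this toy sample `naive$avg_age` is `NA`, while `robust$avg_age` averages the four matched flights (ages 20, 18, 37 and 22).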