May 27, 2015

Background

Reed College

  • Small (1,400 students), undergraduate-only liberal arts college
    • Smaller class sizes
    • Very motivated and quirky student body
    • Socially progressive but academically conservative
  • Established in 1908 because the founder:
    • Didn't care for sports
    • Despised social clubs
  • High degree of interaction between departments

Curriculum

  • Not a vocational school
  • High proportion of students go on to pursue a PhD
  • All students must:
    • take a junior qualifying exam
    • write a senior thesis
    • take the freshman humanities class, HUM 110

HUM 110

Statistics at Reed

  • Statistics within the math department
  • Classes:
    • Intro stats
    • Year-long junior-level probability and mathematical statistics sequence
    • Applied/methods classes in other departments
  • New class: MATH241 Case Studies in Statistical Analysis

The Bigger Picture

Similar Class

Former Googler Rachel Schutt taught a similar data science course at Columbia University.

Class Description

Class Structure

Prereqs

Only intro stats and some exposure to R

Syllabus

  • 5 biweekly mini-reports
    • submitted in R Markdown: reproducible research
    • large amount of feedback from me
  • Term project: both report and 20 min oral
  • In-class participation

Classroom

Demographics

18 students, mostly juniors and seniors.

Major | Count
Mathematics | 4
Biological Science: Biology & Biochem and Molecular Biology | 4
Other Science: Chemistry, Environmental Studies, Physics | 4
Social Science: Political Science, Sociology | 2
Economics | 2
Misc: Psychology, Linguistics | 2

Principles

ASA's GAISE Reports

  • Use real data.
  • Stress conceptual understanding, rather than mere knowledge of procedures.
  • Foster active learning in the classroom.
  • Use technology for developing conceptual understanding and analyzing data.

In Practice

  • Messy data, from potentially disparate sources
  • Bottom-up: Let questions/data motivate the statistical methodology, rather than vice-versa
  • Discussions in class
  • Lean on R heavily
  • Focus on the entire analysis pipeline: article in Nature

Tools

Environment: RStudio

How to get students to use R?

  • Key: Forget Base R
  • How? The Hadleyverse.
  • In particular
    • dplyr package for data wrangling/manipulation
    • ggplot2 package for data visualization

dplyr Verbs

Data manipulation via the following verbs on tidy data:

  1. filter: keep observations matching criteria
  2. summarise: reduce many values to one
  3. mutate: create new variables from existing ones
  4. arrange: reorder rows
  5. select: pick columns by name
  6. join: join two data sets
  7. group_by: group subsets of observations together
  • Moral: No matrix indexing or for loops
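
The verbs above can be chained into a single pipeline. The following is a minimal sketch, assuming the dplyr package is installed; the `flights` and `carriers` data frames are small made-up stand-ins invented for illustration, not the course data.

```r
library(dplyr)

# Hypothetical toy data, invented for illustration only
flights <- data.frame(
  carrier   = c("AA", "AA", "UA", "UA", "WN"),
  dep_delay = c(5, 42, -3, 60, 15)   # departure delay in minutes
)
carriers <- data.frame(
  carrier = c("AA", "UA", "WN"),
  name    = c("American", "United", "Southwest")
)

delay_summary <- flights %>%
  filter(dep_delay > 0) %>%                       # keep delayed flights only
  mutate(delay_hours = dep_delay / 60) %>%        # new variable from an existing one
  group_by(carrier) %>%                           # one group per carrier
  summarise(n_delayed = n(),                      # reduce each group to one row
            mean_delay_hours = mean(delay_hours)) %>%
  inner_join(carriers, by = "carrier") %>%        # attach carrier names
  arrange(desc(mean_delay_hours)) %>%             # reorder rows, longest delay first
  select(name, n_delayed, mean_delay_hours)       # pick columns by name

print(delay_summary)
```

Each step takes a data frame and returns a data frame, so the pipeline reads top to bottom as a sequence of verbs, with no matrix indexing or explicit loops.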

ggplot2: the Grammar of Graphics

A statistical graphic consists of a mapping of data variables to aesthetic attributes of geometric objects that we can observe.

ggplot2 allows us to construct graphics in a modular fashion by specifying these components.

Data (Variable) | Aesthetic | Geometric Object
longitude | x position | points
latitude | y position | points
army size | size = width | bars
army direction | color = brown or black | bars
date | (x,y) position | text
temperature | (x,y) position | lines
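
The mapping of variables to aesthetics of geometric objects can be written out directly in ggplot2. The following is a minimal sketch, assuming the ggplot2 package is installed; the `troops` data frame and all of its values are made up for illustration and are not the historical figures, and points are used throughout to keep the example short.

```r
library(ggplot2)

# Hypothetical toy data: positions, army size, and direction (values invented)
troops <- data.frame(
  long      = c(24, 28, 32, 32, 28, 24),
  lat       = c(55.0, 54.5, 55.2, 54.8, 54.2, 54.6),
  army_size = c(300, 200, 100, 80, 40, 10),
  direction = c("advance", "advance", "advance",
                "retreat", "retreat", "retreat")
)

# Component 1: data plus the position mappings
base <- ggplot(data = troops, mapping = aes(x = long, y = lat))

# Component 2: a geometric object, with army size and direction
# mapped to the size and colour aesthetics
p <- base +
  geom_point(aes(size = army_size, colour = direction)) +
  scale_colour_manual(values = c(advance = "brown", retreat = "black"))

print(p)
```

Because the plot is built from separate components, swapping the geometric object or adding another layer changes one line rather than the whole call.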

Results

Dataset: Houston Flights

Domestic flights leaving Houston airport (IAH) in 2011. Four data sets:

  • flights: info on all 227,496 flights
  • weather: hourly weather info
  • planes: information on all 2,853 airplanes
  • airports: information on all 3,376 destination airports

Delayed Flights

Age of Airplanes

Dataset: OkCupid Data

  • Sample of 10% of San Francisco OkCupid users in June 2012 (\(n=5995\))
  • 40.2% of the sample was female
  • Use logistic regression to predict gender
  • Overfitting, out-of-sample prediction, cross-validation
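
Out-of-sample prediction with cross-validation can be sketched in base R as follows. This is an illustration under stated assumptions, not the analysis from the class: the data are simulated (not the OkCupid sample), height is the only predictor, and the group means, standard deviation, seed, number of folds, and 0.5 cutoff are all arbitrary choices.

```r
set.seed(241)  # arbitrary seed, for reproducibility

# Simulated stand-in data (NOT the OkCupid sample); all parameters are assumptions
n      <- 500
female <- rbinom(n, size = 1, prob = 0.4)
height <- rnorm(n, mean = ifelse(female == 1, 64, 70), sd = 3)  # inches
profiles <- data.frame(female = female, height = height)

# Assign each observation to one of k folds at random
k    <- 5
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other k - 1 folds, predict on the held-out fold
accuracy <- sapply(1:k, function(i) {
  train <- profiles[fold != i, ]
  test  <- profiles[fold == i, ]
  fit   <- glm(female ~ height, data = train, family = binomial)
  p_hat <- predict(fit, newdata = test, type = "response")
  mean((p_hat > 0.5) == (test$female == 1))   # share classified correctly
})

mean(accuracy)  # cross-validated estimate of out-of-sample accuracy
```

Because every prediction is made on observations the model did not see during fitting, the averaged accuracy is a fairer guide to performance on new data than accuracy computed on the training set, which is where overfitting hides.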

Height

Self-Referenced Body Type

The best predictors are variables whose values differ markedly between genders across large segments of the population.

Dataset: Reed Jukebox

All 222,540 songs played on the Reed pool hall jukebox from 2003 to 2009.

date_time artist album track
Sun Dec 7 05:12:57 2003 Tom Petty and the Heartbreakers Into the Great Wide Open
Sun Dec 7 05:15:56 2003 Jefferson Airplane Somebody To Love
Sun Dec 7 05:23:04 2003 Led Zeppelin Led Zeppelin IV 08 When The Levee Breaks

Importance of EDA

Artist Popularity

Time Series

Maps

Interactive Shiny Apps

Student Comments

  • Econ junior: STATA user, now building a Shiny app for his senior thesis
  • Seniors: incorporated these tools in their theses
  • Biology major: had she taken this class earlier, it would have convinced her to be a math major

The Future

Areas for Improvement

  • Statistical topics: More on dependent data, missing data, causal inference, and some machine learning
  • Databases/SQL
  • Ask better questions
  • Flipped classroom
    • Lab exercises at home
    • Problem solving/debugging and discussion in class

Statistics' Image Problem

  • You hear this a lot:
    • Statistician: Hi, I'm a statistician.
    • Non-statistician: Statistics? I hated that class.
  • You'll never hear this:
    • Statistician: Hi, my work involves a lot of data visualization.
    • Non-statistician: Data visualization? I hate that stuff.

Solution: Data Visualization

  • Data visualization is a backdoor way to get students interested in statistics.
  • Prez from Season 4 of "The Wire":
  • Students loved ggplot, maps, and Shiny apps

Impact on my Intro Stats Classes

Issue: Programming

  • Point-and-click vs command line.
  • Thinking algorithmically
  • Debugging: help files and Google
  • Not easy: like learning a language

Conclusions

Take Home Messages

  • A statistics class focused on the data first, methods second
  • Rich Majerus wrote "Why should students at a small liberal arts college learn R?"
    • Learned R using dplyr and ggplot, not base R.
    • New tools like DataCamp are increasing the ratio: \[\frac{\mbox{Payoff from learning R}}{\mbox{Startup costs}}\]
  • Data visualization as a "gateway drug" for statistics
  • Developing skills and intuition takes time. At Reed, classes are small, which allows for individual attention and feedback
  • Interactivity boosts student interest

Google Wisdom Imparted to Students

Presentation on 2011/06/27 given by Deirdre and Amir:

  • Look at your data ASAP.
  • Don't thrash
    • Do your due diligence, but don't overdo it.
    • Seek expert advice.
  • Do the most braindead thing first, take it end to end, then iterate/improve.
  • "You actually don’t know what you are doing until after you have done it."
  • Think of the marginal return of your efforts.

Conclusion