Pop quiz!

Sampling from a bowl of red and white balls!

About me

  • PhD trained statistician
  • Worked at Google in AdWords Division
  • Taught at small colleges: Reed, Middlebury, Amherst, and now Smith College
  • Awash in “statistics vs data science” debates

Awash in Venn Diagrams…

[Figures: Drew Conway's data science Venn diagram and the "Drew Conway 2.0" version]

Awash in quotes…

Some more problematic than others:

  • A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.
  • A data scientist is a statistician who lives in San Francisco.
  • A data scientist is a statistician who is wearing a bow tie.
  • From my time at Google: An engineer knows what an average is, but an analyst knows what a standard deviation is.

rudeboybert says

What can statistics bring to the data science table? Among other things, how about:

A statistician is a data scientist who understands what a standard error is.

ModernDive

Drawing


A few pedagogical principles

  1. Have intro students “play the whole game”
  2. Use data science, not probability/mathematical formulas, to motivate statistical inference
  3. Coding as a basic skill

Principle 1: “Play the whole game”

In the context of statistics/data analysis, by “play the whole game” we mean Wickham’s data science pipeline:

Drawing

Tee-ball

Think how children learn “tee-ball” and play a simplified version of the “whole game” first…

Drawing

Softball & baseball

… and then eventually graduate to softball/baseball.

Drawing Drawing

Example “whole game”

  • Load the Seattle house prices dataset from Kaggle, saved in moderndive::house_prices
  • Model \(y\), the sale price of a house, as a function of two explanatory/predictor variables:
    1. \(x_1\): size (sqft_living, in square feet)
    2. \(x_2\): condition (categorical, with 1 = lowest and 5 = best)
  • Communicate the results to a realtor

1. Load packages and data

Load subset of variables:

library(ggplot2)
library(dplyr)
library(moderndive)
library(patchwork)
house_prices %>% 
  select(id, date, price, sqft_living, condition) %>% 
  head()
id          date          price  sqft_living  condition
7129300520  2014-10-13   221900         1180          3
6414100192  2014-12-09   538000         2570          3
5631500400  2015-02-25   180000          770          3
2487200875  2014-12-09   604000         1960          5
1954400510  2015-02-18   510000         1680          3
7237550310  2014-05-12  1225000         5420          3

2. Exploratory data analysis

Variables price and sqft_living are right-skewed:

p1 <- ggplot(house_prices, aes(x = price)) +
  geom_histogram() +
  labs(x = "price", title = "House prices in Seattle")
p2 <- ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x = "square feet", title = "Size of houses in Seattle")
p1 + p2

Apply a log base 10 transformation:

house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_sqft_living = log10(sqft_living)
    )

p1 <- ggplot(house_prices, aes(x = log10_price)) +
  geom_histogram() +
  labs(x = "log10 price", title = "House prices in Seattle")
p2 <- ggplot(house_prices, aes(x = log10_sqft_living)) +
  geom_histogram() +
  labs(x = "log10 square feet", title = "Size of houses in Seattle")
p1 + p2

3. Eyeball the relationship

Visualize the relationship between the variables using facets…

ggplot(house_prices, aes(x = log10_sqft_living, y = log10_price)) +
  geom_point(alpha = 0.5) +
  labs(y = "log10 price", x = "log10 square footage", title = "House prices in Seattle") +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~condition)

… or colors

ggplot(house_prices, aes(x = log10_sqft_living, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  labs(y = "log10 price", x = "log10 square footage", title = "House prices in Seattle") +
  geom_smooth(method = "lm", se = FALSE)

4. Quantify the relationship

  • Fit an interaction model which allows for a unique regression line for each condition value
  • Output the regression table along with confidence intervals, not just the p-values.

model_price <- lm(log10_price ~ log10_sqft_living * condition, data = house_prices)
get_regression_table(model_price)
term                           estimate  std_error  statistic  p_value  conf_low  conf_high
intercept                         3.330      0.451      7.380    0.000     2.446      4.215
log10_sqft_living                 0.690      0.148      4.652    0.000     0.399      0.980
condition2                        0.047      0.498      0.094    0.925    -0.930      1.024
condition3                       -0.367      0.452     -0.812    0.417    -1.253      0.519
condition4                       -0.398      0.453     -0.879    0.380    -1.286      0.490
condition5                       -0.883      0.457     -1.931    0.053    -1.779      0.013
log10_sqft_living:condition2     -0.024      0.163     -0.148    0.882    -0.344      0.295
log10_sqft_living:condition3      0.133      0.148      0.893    0.372    -0.158      0.424
log10_sqft_living:condition4      0.146      0.149      0.979    0.328    -0.146      0.437
log10_sqft_living:condition5      0.310      0.150      2.067    0.039     0.016      0.604
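
To communicate these results to a realtor, the table can be turned into one intercept and slope per condition level. The following is a minimal sketch in base R that uses only the rounded estimates printed above, so its results are approximate. The object names and the 2,000 square foot example house are illustrative assumptions, not part of the original analysis.

```r
# Rounded estimates copied from the regression table above.
# Condition 1 is the baseline level, so its offsets are zero.
base_intercept <- 3.330
base_slope     <- 0.690
intercept_offsets <- c(cond1 = 0, cond2 = 0.047, cond3 = -0.367,
                       cond4 = -0.398, cond5 = -0.883)
slope_offsets     <- c(cond1 = 0, cond2 = -0.024, cond3 = 0.133,
                       cond4 = 0.146, cond5 = 0.310)

# One regression line per condition: baseline plus that level's offset
intercepts <- base_intercept + intercept_offsets
slopes     <- base_slope + slope_offsets

# Predicted price of a hypothetical 2,000 sq ft house in condition 5,
# back-transformed from the log10 scale to dollars
log10_price_hat <- intercepts[["cond5"]] + slopes[["cond5"]] * log10(2000)
price_hat <- 10^log10_price_hat

print(round(slopes, 3))
print(round(price_hat))
```

Because both variables are on the log10 scale, each slope reads as an elasticity: a slope near 1 for condition 5 means a 1% increase in living space goes with roughly a 1% higher predicted price. Most of the offset confidence intervals in the table contain zero, so the differences between the lines should be described to the realtor with that uncertainty in mind.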

Objective

End goal: understand and interpret the inference for regression, which requires a lot of skills/knowledge. For example:

  1. What’s the difference between R & RStudio? What’s an R package?
  2. How do I effectively visualize data?
  3. How can I clean data as it “exists in the wild”?
  4. How do I model the relationship between variables?
  5. What is the error/uncertainty of our results?

Means to the end

  1. Analogies: R vs RStudio and R packages
  2. Visualization via ggplot2: Grammar of Graphics and limit scope to “Five Named Graphs”
  3. dplyr “Five Main Verbs” for data wrangling/transformation
  4. Descriptive regression modeling with emphasis on exploratory data analysis (See Figure 7.4)
  5. How do we teach ideas of representative sampling, sampling distributions, and standard errors? A work in progress…
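
As an illustration of item 3, here is a minimal sketch that chains five dplyr verbs on the built-in mtcars data frame. Which five verbs the list has in mind is an assumption here (filter, group_by, summarize, mutate, arrange), and the object name hp_summary and the horsepower-to-kilowatt column are made up for the example.

```r
library(dplyr)

# mtcars ships with R; it stands in for any tidy data frame
hp_summary <- mtcars %>%
  filter(mpg > 20) %>%                    # keep rows matching a condition
  group_by(cyl) %>%                       # define groups
  summarize(mean_hp = mean(hp)) %>%       # one summary row per group
  mutate(mean_kw = mean_hp * 0.7457) %>%  # add a derived column
  arrange(desc(mean_hp))                  # sort rows

print(hp_summary)
```

Each line does one thing and reads top to bottom, which is what makes the "copy, paste, and tweak" approach workable for beginners.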

Principle 2: Inference via data science

Simulations not probability/formulas

[Figures: “More of this” vs. “Less of this”]

First: Tactile simulations

Second: Virtual simulations

  • Take a virtual bowl
  • Extract a virtual sample using a virtual shovel
  • Construct the sampling distribution by repeating the above 1000 times.
  • Plot!

library(dplyr)
library(ggplot2)
library(moderndive)

# Take 1000 virtual samples of size n = 50 from bowl
virtual_samples <- bowl %>%
  rep_sample_n(size = 50, reps = 1000)

# Compute 1000 simulated p-hats based on these 1000 virtual samples
virtual_p_hats <- virtual_samples %>% 
  group_by(replicate) %>% 
  summarize(p_hat = mean(color == "red"))

# Plot sampling distribution
ggplot(virtual_p_hats, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.05) +
  labs(title = "Sampling distribution of p_hat based on samples of n = 50")
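
The spread of that histogram is the standard error of p_hat. The sketch below computes it by simulation and compares it with the textbook formula \(\sqrt{p(1-p)/n}\). It deliberately swaps in base R (sample() and replicate()) for rep_sample_n() so that it runs without any packages, and the bowl composition (2400 balls, 900 red) is an assumed stand-in rather than a value taken from these slides.

```r
set.seed(76)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in bowl: 2400 balls, 900 of them red (assumed values)
virtual_bowl <- rep(c("red", "white"), times = c(900, 1500))
p <- mean(virtual_bowl == "red")
n <- 50

# 1000 virtual shovels of size n, drawn without replacement,
# each reduced to its proportion of red balls
p_hats <- replicate(1000, mean(sample(virtual_bowl, size = n) == "red"))

# Standard error two ways: by simulation and by formula
se_simulated <- sd(p_hats)
se_formula   <- sqrt(p * (1 - p) / n)

print(c(simulated = se_simulated, formula = se_formula))
```

The two numbers should land close to each other, which lets students meet the formula as a shortcut for something they have already simulated rather than as a starting point.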

Simulations

  1. Do tactile simulations first; otherwise there are too many layers of abstraction
  2. Then draw the link between the tactile (actual bowl and shovel) and the virtual (data frames and functions in R)
  3. To perform and deconstruct the latter, students need to be equipped with a data science toolbox: data visualization and basic data wrangling.
  4. Example involving sample means: Average year of minting of pennies using a virtual sack of \(N=800\) pennies
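
The sample-means example in item 4 can be sketched the same way. The code below uses a made-up sack of 800 minting years, not the real penny data, and a hypothetical helper se_of_mean() to show how the standard error of the sample mean shrinks as the sample size grows.

```r
set.seed(2011)  # arbitrary seed so the sketch is reproducible

# Made-up stand-in for the sack: 800 minting years (not the real data)
virtual_sack <- sample(1970:2011, size = 800, replace = TRUE)

# Hypothetical helper: standard deviation of `reps` simulated sample means,
# each based on n pennies drawn without replacement
se_of_mean <- function(n, reps = 1000) {
  x_bars <- replicate(reps, mean(sample(virtual_sack, size = n)))
  sd(x_bars)
}

sample_sizes <- c(25, 50, 100)
se_by_n <- sapply(sample_sizes, se_of_mean)
names(se_by_n) <- paste0("n = ", sample_sizes)

print(round(se_by_n, 2))
```

Putting the three sample sizes side by side makes the point that larger samples give less variable estimates without needing the formula for the standard error of a mean.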

State of affairs

  • Beta version of above has been implemented in development version of ModernDive Chapter 8: Sampling
  • Under construction: Chapters 9 thru 11 on confidence intervals, hypothesis testing, and inference for regression
  • Pending developments on infer package https://infer.netlify.com/

infer package for tidy statistical inference

Drawing

Principle 3: Coding as a basic skill

  • Battle is more psychological than anything else.
  • I’m constantly saying: “Don’t code from scratch. Rather copy, paste, and tweak!”
  • ModernDive Chapters 3 thru 5 on data visualization with ggplot2, tidy data, and data wrangling with dplyr align near perfectly with DataCamp “Introduction to the Tidyverse” so we can outsource less sexy aspects of teaching coding for data science to beginners.

Drawing

Thanks!

Chester Ismay and Albert Y. Kim

Resources

