Pop quiz!
Sampling from a bowl of red and white balls!
About me
- PhD trained statistician
- Worked at Google in AdWords Division
- Small colleges: Reed, Middlebury, Amherst, and now Smith College
- Awash in “statistics vs data science” debates
Awash in quotes…
Some more problematic than others:
- A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.
- A data scientist is a statistician who lives in San Francisco
- A data scientist is a statistician who is wearing a bow tie.
- From my time at Google: An engineer knows what an average is, but an analyst knows what a standard deviation is
rudeboybert says
What can statistics bring to the data science table? Among other things, how about:
A statistician is a data scientist who understands what a standard error is.
ModernDive
A few pedagogical principles
- Have intro students “play the whole game”
- Use data science, not probability/mathematical formulas, to motivate statistical inference
- Coding as a basic skill
Principle 1: “Play the whole game”
In the context of statistics/data analysis, by “play the whole game” we mean Wickham’s data science pipeline:
Tee-ball
Think how children learn “tee-ball” and play a simplified version of the “whole game” first…

Softball & baseball
… and then eventually graduate to softball/baseball.

Example “whole game”
- Load the Seattle house prices dataset from Kaggle saved in moderndive::house_prices
- Model \(y\), the sale price of a house, as a function of two explanatory/predictor variables:
- \(x_1\): size (sqft_living, in square feet)
- \(x_2\): condition (categorical, with 1 = lowest, 5 = best)
- Communicate the results to a realtor
1. Load packages and data
Load subset of variables:
library(ggplot2)
library(dplyr)
library(moderndive)
library(patchwork)
house_prices %>%
  select(id, date, price, sqft_living, condition) %>%
  head()
| id | date | price | sqft_living | condition |
|---|---|---|---|---|
| 7129300520 | 2014-10-13 | 221900 | 1180 | 3 |
| 6414100192 | 2014-12-09 | 538000 | 2570 | 3 |
| 5631500400 | 2015-02-25 | 180000 | 770 | 3 |
| 2487200875 | 2014-12-09 | 604000 | 1960 | 5 |
| 1954400510 | 2015-02-18 | 510000 | 1680 | 3 |
| 7237550310 | 2014-05-12 | 1225000 | 5420 | 3 |
2. Exploratory data analysis
Variables price and sqft_living are right-skewed:
p1 <- ggplot(house_prices, aes(x = price)) +
  geom_histogram() +
  labs(x = "price", title = "House prices in Seattle")
p2 <- ggplot(house_prices, aes(x = sqft_living)) +
  geom_histogram() +
  labs(x = "square feet", title = "Size of houses in Seattle")
p1 + p2

Apply a log base 10 transformation:
house_prices <- house_prices %>%
  mutate(
    log10_price = log10(price),
    log10_sqft_living = log10(sqft_living)
  )
p1 <- ggplot(house_prices, aes(x = log10_price)) +
  geom_histogram() +
  labs(x = "log10 price", title = "House prices in Seattle")
p2 <- ggplot(house_prices, aes(x = log10_sqft_living)) +
  geom_histogram() +
  labs(x = "log10 square feet", title = "Size of houses in Seattle")
p1 + p2
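To make the log10 scale concrete, here is a minimal base-R sketch (an illustration, not part of the original slides) using the six sale prices printed by head() earlier. The variable names `price`, `log10_price`, and `back_transformed` are chosen here for illustration only.

```r
# Sale prices of the first six houses shown in the head() output earlier
price <- c(221900, 538000, 180000, 604000, 510000, 1225000)

# Same transformation as in the mutate() call
log10_price <- log10(price)

# On this scale 5 corresponds to $100,000 and 6 to $1,000,000,
# so a one-unit increase means a tenfold increase in price
round(log10_price, 2)

# Raising 10 to the transformed value recovers the original dollars
back_transformed <- 10^log10_price
all.equal(back_transformed, price)
```

Reading plots and regression output on this scale therefore means thinking in multiplicative rather than additive changes.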

3. Eyeball the relationship
Visualize the relationship between the variables using facets…
ggplot(house_prices, aes(x = log10_sqft_living, y = log10_price)) +
  geom_point(alpha = 0.5) +
  labs(y = "log10 price", x = "log10 square footage", title = "House prices in Seattle") +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~condition)

… or colors
ggplot(house_prices, aes(x = log10_sqft_living, y = log10_price, col = condition)) +
  geom_point(alpha = 0.1) +
  labs(y = "log10 price", x = "log10 square footage", title = "House prices in Seattle") +
  geom_smooth(method = "lm", se = FALSE)

4. Quantify the relationship
- Fit an interaction model which allows for a unique regression line for each condition value
- Output the regression table along with confidence intervals, not just the p-values.
model_price <- lm(log10_price ~ log10_sqft_living * condition, data = house_prices)
get_regression_table(model_price)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 3.330 | 0.451 | 7.380 | 0.000 | 2.446 | 4.215 |
| log10_sqft_living | 0.690 | 0.148 | 4.652 | 0.000 | 0.399 | 0.980 |
| condition2 | 0.047 | 0.498 | 0.094 | 0.925 | -0.930 | 1.024 |
| condition3 | -0.367 | 0.452 | -0.812 | 0.417 | -1.253 | 0.519 |
| condition4 | -0.398 | 0.453 | -0.879 | 0.380 | -1.286 | 0.490 |
| condition5 | -0.883 | 0.457 | -1.931 | 0.053 | -1.779 | 0.013 |
| log10_sqft_living:condition2 | -0.024 | 0.163 | -0.148 | 0.882 | -0.344 | 0.295 |
| log10_sqft_living:condition3 | 0.133 | 0.148 | 0.893 | 0.372 | -0.158 | 0.424 |
| log10_sqft_living:condition4 | 0.146 | 0.149 | 0.979 | 0.328 | -0.146 | 0.437 |
| log10_sqft_living:condition5 | 0.310 | 0.150 | 2.067 | 0.039 | 0.016 | 0.604 |
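One way to read this output, sketched here in base R rather than taken from the slides: in an interaction model, each condition's regression line is the baseline (condition 1) intercept and slope plus that condition's offsets. The point estimates below are copied by hand from the regression table, and the names `baseline_intercept`, `intercept_offsets`, and `lines_by_condition` are illustrative, not part of moderndive.

```r
# Point estimates copied from the regression table (rounded to 3 decimals)
baseline_intercept <- 3.330   # intercept for condition 1
baseline_slope     <- 0.690   # slope of log10_sqft_living for condition 1

# Offsets relative to condition 1 (condition 1 itself has offset 0)
intercept_offsets <- c(0, 0.047, -0.367, -0.398, -0.883)
slope_offsets     <- c(0, -0.024, 0.133, 0.146, 0.310)

# One fitted line per condition: intercept and slope on the log10-log10 scale
lines_by_condition <- data.frame(
  condition = 1:5,
  intercept = baseline_intercept + intercept_offsets,
  slope     = baseline_slope + slope_offsets
)
print(lines_by_condition)
```

For condition 5, for instance, the slope works out to 0.690 + 0.310 = 1.000. These are rounded point estimates only; the confidence intervals in the table show how much uncertainty surrounds each offset.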
Objective
End goal: understand and interpret the inference for regression, which requires a lot of skills/knowledge. For example:
- What’s the difference between R & RStudio? What’s an R package?
- How do I effectively visualize data?
- How can I clean data as it “exists in the wild”?
- How do I model the relationship between variables?
- What is the error/uncertainty of our results?
Principle 2: Inference via data science
First: Tactile simulations
Second: Virtual simulations
- Take a virtual bowl
- Extract a virtual sample using a virtual shovel
- Construct the sampling distribution by repeating the above 1000 times.
- Plot!
library(dplyr)
library(ggplot2)
library(moderndive)
# Take 1000 virtual samples of size n = 50 from bowl
virtual_samples <- bowl %>%
  rep_sample_n(size = 50, reps = 1000)
# Compute 1000 simulated p-hats based on these 1000 virtual samples
virtual_p_hats <- virtual_samples %>%
  group_by(replicate) %>%
  summarize(p_hat = mean(color == "red"))
# Plot sampling distribution
ggplot(virtual_p_hats, aes(x = p_hat)) +
  geom_histogram(binwidth = 0.05) +
  labs(title = "Sampling distribution of p_hat based on samples of n = 50")
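The spread of this sampling distribution is the standard error. The following base-R sketch, which is an illustration and not the ModernDive code, assumes a hypothetical bowl of 2400 balls of which 900 are red (counts chosen for illustration) and compares the simulated standard error with the usual formula-based approximation \(\sqrt{p(1-p)/n}\).

```r
set.seed(2018)  # arbitrary seed so the sketch is reproducible

# Hypothetical bowl: 900 red and 1500 white balls (illustrative counts)
bowl_colors <- rep(c("red", "white"), times = c(900, 1500))
n <- 50

# 1000 virtual shovels of size n, each summarized by its proportion red
p_hats <- replicate(1000, mean(sample(bowl_colors, size = n) == "red"))

# Standard error = standard deviation of the sampling distribution
se_simulated <- sd(p_hats)

# Formula-based approximation using the bowl's true proportion red
p <- mean(bowl_colors == "red")
se_formula <- sqrt(p * (1 - p) / n)

c(simulated = se_simulated, formula = se_formula)
```

The two numbers should land close to each other, which is the point of motivating the formula through simulation rather than the other way around.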

Simulations
- Need to do tactile simulations first; too many layers of abstraction otherwise
- Then draw the link between tactile (actual bowl and shovel) and virtual (data frames and functions in R) simulations
- To perform and deconstruct the latter, students need to be equipped with a data science toolbox: data visualization and basic data wrangling.
- Example involving sample means: average year of minting of pennies using a virtual sack of \(N=800\) pennies
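The sample-means example can be sketched the same way in base R. Everything below is an assumption made for illustration: the minting years are randomly generated rather than real penny data, and the sample size of 50 and the names `sack_years` and `x_bars` are hypothetical.

```r
set.seed(2018)  # arbitrary seed for reproducibility

# Hypothetical sack of N = 800 pennies; minting years are made up for illustration
sack_years <- sample(1960:2017, size = 800, replace = TRUE)

# 1000 virtual samples of 50 pennies each, summarized by the sample mean year
x_bars <- replicate(1000, mean(sample(sack_years, size = 50)))

# The sample means cluster around the sack's true average year
c(population_mean = mean(sack_years), mean_of_x_bars = mean(x_bars))

# Their spread is the standard error of the sample mean
sd(x_bars)
```

Swapping the proportion for a mean changes only the summary function; the sample, summarize, repeat structure stays the same.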
State of affairs
- Beta version of above has been implemented in development version of ModernDive Chapter 8: Sampling
- Under construction: Chapters 9 thru 11 on confidence intervals, hypothesis testing, and inference for regression
- Pending developments on the infer package for tidy statistical inference: https://infer.netlify.com/
Principle 3: Coding as a basic skill
- Battle is more psychological than anything else.
- I’m constantly saying: “Don’t code from scratch. Rather copy, paste, and tweak!”
- ModernDive Chapters 3 thru 5 on data visualization with ggplot2, tidy data, and data wrangling with dplyr align nearly perfectly with DataCamp “Introduction to the Tidyverse”, so we can outsource the less sexy aspects of teaching coding for data science to beginners.

Thanks!
- Chester Ismay - Senior Curriculum Lead, DataCamp
- Albert Y. Kim - Asst. Prof. of Statistical & Data Sciences, Smith College
