We compare simple random sampling with stratified random sampling in two cases:
1. Simple random sampling: sample 20% of the observations from a data set, where each observation has an equal chance of being selected.
2. Stratified random sampling: split the data set into segments (or strata), then sample 20% of the observations from each segment.
The second case is useful when a segment is rare and might be under-represented, or missed entirely, under case 1.
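To put a rough number on the risk of missing a rare segment, here is a minimal sketch using base R's hypergeometric density. The population size, segment size, and sample size below are made-up values for illustration, not taken from the Titanic data.

```r
# Hypothetical numbers: a population of 1000 with a rare segment of 5 rows,
# from which we draw a simple random sample of 200 rows (20%).
n_population <- 1000
n_rare       <- 5
n_sample     <- 200

# Probability that the simple random sample contains zero rows from the
# rare segment: dhyper(x, m, n, k) with x = 0 "successes" drawn.
p_missed <- dhyper(0, m = n_rare, n = n_population - n_rare, k = n_sample)
p_missed
```

Under stratified sampling with the same 20% rate, a segment of 5 rows would contribute exactly 1 row, so it could not be missed.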
library(tidyverse)
library(earth)
data(etitanic)
dim(etitanic)
## [1] 1046 6
names(etitanic)
## [1] "pclass" "survived" "sex" "age" "sibsp" "parch"
mean(etitanic$survived) ## overall survival rate = 40.82218%
## [1] 0.4082218
Randomly sample 20% of the observations from the entire data set.
set.seed(1)
simple_sample <- etitanic %>%
  slice_sample(prop = 0.2)
dim(simple_sample)
## [1] 209 6
mean(simple_sample$survived) ## sample survival rate = 39.23445%
## [1] 0.3923445
Segment the data set by passenger class and gender, then sample 20% of the observations from each segment.
set.seed(1)
strata_sample <- etitanic %>%
  group_by(pclass, sex) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()
dim(strata_sample)
## [1] 206 6
mean(strata_sample$survived) ## sample survival rate = 39.80583%
## [1] 0.3980583
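The stratified sample has 206 rows rather than 209 because slice_sample() rounds the 20% down to a whole number of rows within each segment. The segment definition can be expanded by adding more columns inside the group_by() call.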
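As a sketch of what adding a column to the segment definition might look like, the example below also stratifies by an age group. The age_group column and its cutoff of 18 are assumptions introduced here for illustration; they are not part of etitanic.

```r
library(tidyverse)
library(earth)
data(etitanic)

set.seed(1)
strata_sample_age <- etitanic %>%
  # age_group is a hypothetical column; the cutoff of 18 is arbitrary
  mutate(age_group = if_else(age < 18, "child", "adult")) %>%
  group_by(pclass, sex, age_group) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()

# Number of sampled rows in each segment
count(strata_sample_age, pclass, sex, age_group)
```

Finer segments mean smaller groups, and a group with fewer than 5 rows would contribute nothing at a 20% rate, so it is worth checking the per-segment counts.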
I wonder whether parameter estimates (e.g., sample survival rates) from a stratified random sample have lower variance than those from a simple random sample.
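One way to explore this question is a small simulation: draw many samples with each method and compare the spread of the resulting survival rates. The sketch below assumes 1000 repetitions is enough to see a difference; simple_rate() and strata_rate() are hypothetical helper names, not functions from any package.

```r
library(tidyverse)
library(earth)
data(etitanic)

# Hypothetical helper: survival rate from one simple random sample
simple_rate <- function(data, prop = 0.2) {
  data %>%
    slice_sample(prop = prop) %>%
    summarise(rate = mean(survived)) %>%
    pull(rate)
}

# Hypothetical helper: survival rate from one sample stratified by
# passenger class and sex
strata_rate <- function(data, prop = 0.2) {
  data %>%
    group_by(pclass, sex) %>%
    slice_sample(prop = prop) %>%
    ungroup() %>%
    summarise(rate = mean(survived)) %>%
    pull(rate)
}

set.seed(1)
n_reps <- 1000  # number of repeated samples; an arbitrary choice

simple_rates <- replicate(n_reps, simple_rate(etitanic))
strata_rates <- replicate(n_reps, strata_rate(etitanic))

# Compare the variance of the estimates under each sampling scheme
var(simple_rates)
var(strata_rates)
```

Sampling theory suggests stratification with proportional allocation should help most when the strata differ in their mean outcome, so the size of any gap here would depend on how strongly class and sex relate to survival.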