Sampling using Tidyverse

This post compares simple random sampling with stratified random sampling, using two cases:

  1. Sample 20% of the observations from a data set – each observation has an equal chance of being selected

  2. Stratify a data set into segments (or strata) – for each segment, sample 20% of the observations

The second case is useful when a segment is rare and could be under-represented, or missed entirely, under case 1.

Titanic

library(tidyverse)
library(earth)

data(etitanic)
dim(etitanic)
## [1] 1046    6
names(etitanic)
## [1] "pclass"   "survived" "sex"      "age"      "sibsp"    "parch"
mean(etitanic$survived) ## overall survival rate = 40.82218%
## [1] 0.4082218

Simple Random Sampling

Randomly sample 20% of the observations from the entire data set.

set.seed(1)

simple_sample <- etitanic %>% 
  slice_sample(prop=0.2)

dim(simple_sample)
## [1] 209   6
mean(simple_sample$survived) ## sample survival rate = 39.23445%
## [1] 0.3923445

Stratified Random Sampling

Segment the data set by passenger class (pclass) and sex, then sample 20% of the observations from each segment.

set.seed(1)

strata_sample <- etitanic %>% 
  group_by(pclass, sex) %>% 
  slice_sample(prop=0.2) %>% 
  ungroup()

dim(strata_sample)
## [1] 206   6
mean(strata_sample$survived) ## sample survival rate = 39.80583%
## [1] 0.3980583
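
The stratified sample has 206 rows rather than 209. A likely reason is that slice_sample() rounds the requested proportion down to a whole number of rows within each group, so every stratum can lose a fraction of a row. The sketch below checks this by comparing the rows drawn per stratum with floor(0.2 * stratum size). The names stratum_sizes, stratum_drawn, n_total, n_expected and n_drawn are introduced here for illustration only.

```r
library(tidyverse)
library(earth)

data(etitanic)

# Rows per stratum, and the number of rows we expect slice_sample(prop = 0.2)
# to keep, assuming the proportion is rounded down within each group
stratum_sizes <- etitanic %>%
  count(pclass, sex, name = "n_total") %>%
  mutate(n_expected = floor(n_total * 0.2))

set.seed(1)

# Rows actually drawn per stratum
stratum_drawn <- etitanic %>%
  group_by(pclass, sex) %>%
  slice_sample(prop = 0.2) %>%
  ungroup() %>%
  count(pclass, sex, name = "n_drawn")

# Side-by-side comparison of expected and drawn counts
left_join(stratum_sizes, stratum_drawn, by = c("pclass", "sex"))
```

If the two count columns agree, the shortfall relative to the simple random sample comes from rounding within strata, not from the sampling itself.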

The segment definition can be refined by adding more columns to the group_by() call.
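
As a rough illustration, the sketch below adds survived as a third stratification column. The choice of survived and the name strata_sample_3 are assumptions made for this example, not part of the analysis above. Finer strata are smaller, so more rows may be lost to rounding within each group.

```r
library(tidyverse)
library(earth)

data(etitanic)

set.seed(1)

# Stratify by passenger class, sex and survival status,
# then sample 20% of the observations within each stratum
strata_sample_3 <- etitanic %>%
  group_by(pclass, sex, survived) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()

# Number of sampled rows in each stratum
strata_sample_3 %>%
  count(pclass, sex, survived)
```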

Conclusion

I wonder whether parameter estimates (e.g., sample survival rates) from a stratified random sample have lower variance than those from a simple random sample.
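
One way to explore this question is a small simulation: draw many samples under each scheme and compare the spread of the resulting survival rates. The sketch below is a minimal version under stated assumptions. The number of repetitions (n_reps = 500) is arbitrary, and simple_rate(), strata_rate() and sim are helper names made up for this example. Sampling theory suggests the stratified estimate should typically vary less when survival rates differ across strata, but the simulation is the check here, not a proof.

```r
library(tidyverse)
library(earth)

data(etitanic)

set.seed(1)
n_reps <- 500  # number of repeated samples per scheme (arbitrary choice)

# Survival rate from one simple random sample of 20%
simple_rate <- function() {
  etitanic %>%
    slice_sample(prop = 0.2) %>%
    summarise(rate = mean(survived)) %>%
    pull(rate)
}

# Survival rate from one 20% sample stratified by passenger class and sex
strata_rate <- function() {
  etitanic %>%
    group_by(pclass, sex) %>%
    slice_sample(prop = 0.2) %>%
    ungroup() %>%
    summarise(rate = mean(survived)) %>%
    pull(rate)
}

# Repeat each scheme n_reps times and collect the estimates
sim <- tibble(
  simple = replicate(n_reps, simple_rate()),
  strata = replicate(n_reps, strata_rate())
)

# Mean and variance of the estimates under each scheme
sim %>%
  summarise(across(everything(), list(mean = mean, var = var)))
```

Comparing simple_var with strata_var gives a first indication. Because both samples use the same unweighted mean, any difference reflects the sampling scheme alone.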