Stratified Sampling Using Tidyverse

Sampling using Tidyverse

This post compares simple random sampling with stratified random sampling, using two cases:

1. Sample 20% of the observations from a data set, so that each observation has an equal chance of being selected.
2. Stratify a data set into segments (or strata), then sample 20% of the observations from each segment.

The second case is useful when a segment is rare and might be under-represented, or missed entirely, under case 1.
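To illustrate the rare-segment problem, here is a minimal sketch on a made-up data set. The `toy` tibble, its `segment` column, and the 95/5 split are assumptions chosen for illustration; they are not part of the Titanic data used below.

```r
library(dplyr)

# Hypothetical toy data: 95 rows in a "common" segment and 5 in a "rare" one
toy <- tibble(
  id = 1:100,
  segment = rep(c("common", "rare"), times = c(95, 5))
)

# Number of "rare" rows in one simple random sample of 20%
count_rare_simple <- function() {
  toy %>%
    slice_sample(prop = 0.2) %>%
    filter(segment == "rare") %>%
    nrow()
}

# Number of "rare" rows in one stratified sample of 20% per segment
count_rare_strata <- function() {
  toy %>%
    group_by(segment) %>%
    slice_sample(prop = 0.2) %>%
    ungroup() %>%
    filter(segment == "rare") %>%
    nrow()
}

set.seed(1)
simple_counts <- replicate(200, count_rare_simple())
strata_counts <- replicate(200, count_rare_strata())

mean(simple_counts == 0)  # share of simple samples with no "rare" rows
table(strata_counts)      # every stratified sample holds exactly 1 "rare" row

# Exact probability that a simple sample of 20 out of 100 misses all 5 rare rows
choose(95, 20) / choose(100, 20)
```

In this toy setting the exact probability of missing the rare segment under simple random sampling is about 0.32, whereas the stratified sample always includes one rare row.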

library(tidyverse)
library(earth)  # provides the etitanic data set

data(etitanic)
dim(etitanic)

## [1] 1046    6

names(etitanic)

## [1] "pclass"   "survived" "sex"      "age"      "sibsp"    "parch"

mean(etitanic$survived) ## overall survival rate = 40.82218%

## [1] 0.4082218

Randomly sample 20% of the observations from the entire data set.

set.seed(1)

simple_sample <- etitanic %>% 
  slice_sample(prop=0.2)

dim(simple_sample)

## [1] 209   6

mean(simple_sample$survived) ## sample survival rate = 39.23445%

## [1] 0.3923445

Segment the data set by passenger class (pclass) and gender (sex), then sample 20% of the observations from each segment.

set.seed(1)

strata_sample <- etitanic %>% 
  group_by(pclass, sex) %>% 
  slice_sample(prop=0.2) %>% 
  ungroup()

dim(strata_sample)

## [1] 206   6

mean(strata_sample$survived) ## sample survival rate = 39.80583%

## [1] 0.3980583
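The stratified sample has 206 rows rather than 209 because slice_sample() rounds the 20% down within each segment. As a rough check, the sketch below compares segment sizes in the full data and in the sample. The names `full_counts`, `sample_counts`, and `comparison` are introduced here for illustration only.

```r
library(dplyr)
library(earth)

data(etitanic)

set.seed(1)
strata_sample <- etitanic %>%
  group_by(pclass, sex) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()

# Segment sizes in the full data and in the stratified sample
full_counts   <- etitanic %>% count(pclass, sex, name = "n_full")
sample_counts <- strata_sample %>% count(pclass, sex, name = "n_sample")

comparison <- full_counts %>%
  left_join(sample_counts, by = c("pclass", "sex")) %>%
  mutate(
    n_expected   = floor(0.2 * n_full),  # 20% of the segment, rounded down
    sample_share = n_sample / n_full     # realised sampling fraction
  )

comparison
```

Each segment's sampling fraction should sit at or just below 20%, so the sample keeps roughly the same class and gender mix as the full data.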

The segment definition can be extended by adding more columns to the group_by() call.
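As a sketch of such an extension, the code below adds a third grouping variable. The `is_child` flag and its cut-off at age 18 are assumptions made for this example, not part of the original analysis.

```r
library(dplyr)
library(earth)

data(etitanic)

set.seed(1)

# Stratify by passenger class, gender, and a hypothetical child/adult flag
strata_sample_age <- etitanic %>%
  group_by(pclass, sex, is_child = age < 18) %>%
  slice_sample(prop = 0.2) %>%
  ungroup()

dim(strata_sample_age)

# Segment sizes and the number of rows each segment contributes at 20%
segment_sizes <- etitanic %>%
  count(pclass, sex, is_child = age < 18) %>%
  mutate(n_sampled = floor(0.2 * n))

segment_sizes
```

Finer segments are smaller, and because the 20% is rounded down, any segment with fewer than 5 rows would contribute no rows at all. It may be worth checking the segment sizes before adding many grouping columns.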

I wonder whether parameter estimates (e.g., sample survival rates) from a stratified random sample have lower variance than those from a simple random sample.
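One way to probe this question empirically is a small simulation that repeats both sampling schemes many times and compares the spread of the resulting survival rates. The sketch below assumes 300 repetitions; the helper functions `simple_rate()` and `strata_rate()` are hypothetical names introduced here.

```r
library(dplyr)
library(earth)

data(etitanic)

set.seed(1)
n_reps <- 300  # assumed number of repetitions; increase for more stable results

# Survival rate from one simple random sample of 20%
simple_rate <- function() {
  etitanic %>%
    slice_sample(prop = 0.2) %>%
    summarise(rate = mean(survived)) %>%
    pull(rate)
}

# Survival rate from one stratified sample (20% within each pclass x sex segment)
strata_rate <- function() {
  etitanic %>%
    group_by(pclass, sex) %>%
    slice_sample(prop = 0.2) %>%
    ungroup() %>%
    summarise(rate = mean(survived)) %>%
    pull(rate)
}

simple_rates <- replicate(n_reps, simple_rate())
strata_rates <- replicate(n_reps, strata_rate())

# Compare the spread of the two sets of estimates
var(simple_rates)
var(strata_rates)
```

Sampling theory suggests that stratifying with the same sampling fraction in every segment tends to reduce variance when the segments differ in the outcome, as passenger class and gender plausibly do for survival. The simulation only gives an approximate answer for this particular data set.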