DATA429/599: Stratified Random Sample Estimators

Learning Objectives

In this lesson students will learn to:

Construct estimates and confidence intervals for stratified random samples from finite populations

Examples from “Sampling: Design and Analysis” by Sharon Lohr.

Warm-up / Review

Data from an SRS of 120 golf courses, selected from a list of 14,938 golf courses in the United States, are in the file golfsrs.csv.

golf<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/golfsrs.csv", 
               header=TRUE)

Visualize

Display the data in a histogram for the weekday greens fees for nine holes of golf (variable wkday9). How would you describe the shape of the data?

library(tidyverse)

## HISTOGRAM
ggplot(data=golf, aes(x=wkday9))+
  geom_histogram()+
  theme_bw()+
  ggtitle("Distribution of Weekday Greens Fees for 9 Holes")

Average and SRS Standard Error

Find the average weekday greens fee to play nine holes of golf and give the SE for the estimate.

## SAMPLE STATS
y_bar<-mean(golf$wkday9)
y_bar

## [1] 20.15333

sampVar<-var(golf$wkday9)
sampVar

## [1] 321.357

## SAMPLE SIZE (n) AND POPULATION SIZE (N)
n<-120
N<-14938

## FPC
thisFPC<-1-(n/N)
thisFPC

## [1] 0.9919668

## STANDARD ERROR
standErr<-sqrt(thisFPC*(sampVar/n))
standErr

## [1] 1.629866

Confidence Interval

Use what you found in Part B to construct a 95% confidence interval.

## CONFIDENCE INTERVAL
## Z CI
zCritVal<-qnorm(0.975)
y_bar+c(-1,1)*zCritVal*standErr

## [1] 16.95886 23.34781

## NEW T CI
tCritVal<-qt(0.975, df=n-1)
y_bar+c(-1,1)*tCritVal*standErr

## [1] 16.92604 23.38063
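
The warm-up steps can be bundled into one small helper. The sketch below is illustrative and not part of the course code: the function name srs_mean_ci and its arguments are assumptions. It takes the summary statistics printed above (sample mean, sample variance, n, and N) so that it runs without downloading the data.

```r
## Hypothetical helper (not from the course materials):
## FPC-adjusted standard error and t-based confidence interval for an SRS mean
srs_mean_ci <- function(y_bar, s2, n, N, conf = 0.95) {
  fpc <- 1 - n / N                              # finite population correction
  se <- sqrt(fpc * s2 / n)                      # standard error of the sample mean
  t_crit <- qt(1 - (1 - conf) / 2, df = n - 1)  # t critical value with n - 1 df
  list(fpc = fpc, se = se, ci = y_bar + c(-1, 1) * t_crit * se)
}

## Summary statistics copied from the golf output above
golf_ci <- srs_mean_ci(y_bar = 20.15333, s2 = 321.357, n = 120, N = 14938)
golf_ci$se
golf_ci$ci
```

With the raw data loaded, the same call could be written as srs_mean_ci(mean(golf$wkday9), var(golf$wkday9), nrow(golf), 14938), assuming wkday9 has no missing values.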

EX 1: Publications

In the previous example from Class 7B, which used an SRS, not all departments were represented. The following data are from a stratified sample that uses academic division as the stratification variable. The population contains 102 faculty in Biological Sciences, 310 in Physical Sciences, 217 in Social Sciences, and 178 in Humanities. Each row of the file gives a publication count and, for each division, the number of sampled faculty with that count.

pubStr<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/pubStrat.csv", 
                 header=TRUE)

head(pubStr)

##   Publications Biological Physical Social Humanities
## 1            0          1       10      9          8
## 2            1          2        2      0          2
## 3            2          0        0      1          0
## 4            3          1        1      0          1
## 5            4          0        2      2          0
## 6            5          2        1      0          0

Stratified Average

Estimate the average number of refereed publications by faculty members in the college, and give the standard error. Use these to construct a 95% confidence interval.

### BIO
n1<-sum(pubStr$Biological)
ybar1<-sum(pubStr$Publications*pubStr$Biological)/n1
ybar1

## [1] 3.142857

s2_1<-(1/(n1-1))*sum(pubStr$Biological*(pubStr$Publications-ybar1)^2)
s2_1

## [1] 6.809524

sqrt(s2_1)

## [1] 2.609506

### PHYSICAL
n2<-sum(pubStr$Physical)
ybar2<-sum(pubStr$Publications*pubStr$Physical)/n2
ybar2

## [1] 2.105263

s2_2<-(1/(n2-1))*sum(pubStr$Physical*(pubStr$Publications-ybar2)^2)
s2_2

## [1] 8.210526

sqrt(s2_2)

## [1] 2.865402

### SOCIAL
n3<-sum(pubStr$Social)
ybar3<-sum(pubStr$Publications*pubStr$Social)/n3
ybar3

## [1] 1.230769

s2_3<-(1/(n3-1))*sum(pubStr$Social*(pubStr$Publications-ybar3)^2)
s2_3

## [1] 4.358974

sqrt(s2_3)

## [1] 2.087816

### HUMANITIES
n4<-sum(pubStr$Humanities)
ybar4<-sum(pubStr$Publications*pubStr$Humanities)/n4
ybar4

## [1] 0.4545455

s2_4<-(1/(n4-1))*sum(pubStr$Humanities*(pubStr$Publications-ybar4)^2)
s2_4

## [1] 0.8727273

sqrt(s2_4)

## [1] 0.9341987
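
The four per-stratum blocks above repeat the same frequency-weighted calculation. As a minimal sketch, that calculation could be wrapped in one function; the name freq_stats is hypothetical, and the small frequency table is made up for illustration rather than taken from the faculty data.

```r
## Hypothetical helper: sample size, mean, and sample variance
## computed from a frequency table (value, count) instead of raw rows
freq_stats <- function(values, counts) {
  n <- sum(counts)                                  # number of sampled units
  y_bar <- sum(values * counts) / n                 # frequency-weighted mean
  s2 <- sum(counts * (values - y_bar)^2) / (n - 1)  # sample variance
  c(n = n, mean = y_bar, var = s2)
}

## Made-up frequency table (illustration only)
values <- 0:3
counts <- c(4, 3, 2, 1)
freq_stats(values, counts)

## Cross-check: expand the table to one entry per unit and use mean() and var()
expanded <- rep(values, times = counts)
c(n = length(expanded), mean = mean(expanded), var = var(expanded))
```

Applied to the data above, freq_stats(pubStr$Publications, pubStr$Biological) should reproduce n1, ybar1, and s2_1, and sapply(pubStr[, -1], function(cnt) freq_stats(pubStr$Publications, cnt)) would handle all four divisions at once.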

### COMBINE
## POPULATION SIZES BY STRATUM (BIO, PHYSICAL, SOCIAL, HUMANITIES)
Nh<-c(102, 310, 217, 178)
N<-sum(Nh)  ## 807 FACULTY IN TOTAL
nh<-c(n1, n2, n3, n4)
ybars<-c(ybar1, ybar2, ybar3, ybar4)
s2s<-c(s2_1, s2_2, s2_3, s2_4)

## Y BAR STRAT
y_bar_strat<-sum((Nh/N)*ybars)
y_bar_strat

## [1] 1.637161

## VAR Y BAR STRAT
var_ybar_strat<-sum((1-(nh/Nh))*(Nh/N)^2*(s2s/nh))
var_ybar_strat

## [1] 0.1007461

## STD ERR
stdErr_ybar_strat<-sqrt(var_ybar_strat)
stdErr_ybar_strat

## [1] 0.3174052

## CONF INT (T WITH n - H = 50 - 4 = 46 DF)
y_bar_strat+c(-1,1)*qt(.975, df=sum(nh)-length(nh))*stdErr_ybar_strat

## [1] 0.9982575 2.2760647
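
The combine step can be sketched as a single function of the stratum summaries. The function name strat_mean_ci is hypothetical. The stratum sample sizes 7, 19, 13, and 11 are not printed in the output above; they are inferred from the stratum means and the total sample size of 50, so check them against nh before relying on them.

```r
## Hypothetical helper: stratified mean, its estimated variance,
## standard error, and a t interval with n - H degrees of freedom
strat_mean_ci <- function(Nh, nh, ybars, s2s, conf = 0.95) {
  N <- sum(Nh)
  Wh <- Nh / N                               # stratum weights N_h / N
  est <- sum(Wh * ybars)                     # stratified mean
  v <- sum((1 - nh / Nh) * Wh^2 * s2s / nh)  # FPC-adjusted variance
  se <- sqrt(v)
  df <- sum(nh) - length(nh)                 # n - H
  t_crit <- qt(1 - (1 - conf) / 2, df)
  list(estimate = est, var = v, se = se, ci = est + c(-1, 1) * t_crit * se)
}

## Stratum summaries copied (rounded) from the output above;
## nh is inferred, not printed (assumption)
pub <- strat_mean_ci(
  Nh = c(102, 310, 217, 178),
  nh = c(7, 19, 13, 11),
  ybars = c(3.142857, 2.105263, 1.230769, 0.4545455),
  s2s = c(6.809524, 8.210526, 4.358974, 0.8727273)
)
pub$estimate
pub$se
pub$ci
```

Keeping the inputs as vectors of stratum summaries means the same function works for any number of strata.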

Think about it!

How does your result from Part A compare to what you found using the simple random sample?
Did stratification increase precision in this sample? Explain why you think it did or did not.

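## DESIGN EFFECT: VAR(STRATIFIED MEAN) / VAR(SRS MEAN)
## (0.135 IS ASSUMED TO BE THE SRS VARIANCE FROM THE CLASS 7B EXAMPLE)
.101/0.135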

## [1] 0.7481481

EX 2: Public Health Research

Probability sampling (random selection) allows researchers to generalize from the sample to the population, and random assignment in experiments allows researchers to infer causation, but how often are these methods employed in public health research? Hayat and Knapp (2017) drew a stratified random sample of 198 articles from the 547 research articles published in 2013 by three leading public health journals. For each article, they determined the number of authors, the type of statistical inference used (confidence intervals, hypothesis tests, both, or neither), and whether random selection or random assignment was used. The data are in file healthjournals.csv.

#### PUBLIC HEALTH
ph<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/healthjournals.csv", 
             header=TRUE)

str(ph)

## 'data.frame':    198 obs. of  7 variables:
##  $ Journal   : chr  "AJPH" "AJPH" "AJPH" "AJPH" ...
##  $ NumAuthors: int  4 9 6 5 7 4 3 4 2 4 ...
##  $ RandomSel : chr  "No" "No" "No" "No" ...
##  $ RandomAssn: chr  "No" "No" "No" "No" ...
##  $ ConfInt   : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ HypTest   : chr  "No" "Yes" "Yes" "Yes" ...
##  $ Asterisks : chr  "No" "No" "No" "No" ...

The population sizes of journals are as follows:

American Journal of Public Health (AJPH): 280
American Journal of Preventive Medicine (AJPM): 103
Preventive Medicine (PM): 164

Confidence Interval for Proportion

Estimate the percentage of the articles that use random selection. Calculate the point estimate, standard error, and a 95% confidence interval.

STEP 1: Summary Data

ph%>%
  group_by(Journal, RandomSel)%>%
  summarise(n=n())

## `summarise()` has grouped output by 'Journal'. You can override using the
## `.groups` argument.

## # A tibble: 6 × 3
## # Groups:   Journal [3]
##   Journal RandomSel     n
##   <chr>   <chr>     <int>
## 1 AJPH    No           71
## 2 AJPH    Yes          29
## 3 AJPM    No           24
## 4 AJPM    Yes          14
## 5 PM      No           45
## 6 PM      Yes          15

p_hats<-c(29/(29+71), 14/(14+24), 15/(15+45))
p_hats

## [1] 0.2900000 0.3684211 0.2500000

Nh<-c(280, 103, 164)
nh<-c(100, 38, 60)
N<-sum(Nh)

Alternatively

### WE CAN MAKE THIS BINARY
ph01<-ph%>%
  mutate(randSel01=(RandomSel=="Yes"))%>%
  group_by(Journal)%>%
  summarise(n=n(), 
            prop=mean(randSel01, na.rm=TRUE))

ph01

## # A tibble: 3 × 3
##   Journal     n  prop
##   <chr>   <int> <dbl>
## 1 AJPH      100 0.29 
## 2 AJPM       38 0.368
## 3 PM         60 0.25

STEP 2: Estimates

## P HAT STRAT
pHat_strat<-sum((Nh/N)*p_hats)
pHat_strat

## [1] 0.292774

## SE
## LET'S LOOK AT EACH COMPONENT
(1-(nh/Nh))*(Nh/N)^2*((p_hats*(1-p_hats))/(nh-1))

## [1] 0.0003503298 0.0001407169 0.0001811556

## ESTIMATED VARIANCE OF P HAT STRAT
varPHat_str<-sum((1-(nh/Nh))*(Nh/N)^2*((p_hats*(1-p_hats))/(nh-1)))
varPHat_str

## [1] 0.0006722023

## STANDARD ERROR
sqrt(varPHat_str)

## [1] 0.02592686

STEP 3: Confidence Interval

### CONF INT
pHat_strat+c(-1,1)*qnorm(0.975)*sqrt(varPHat_str)

## [1] 0.2419583 0.3435897
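
The three steps for a stratified proportion can be collected into one function, sketched below. The name strat_prop_ci is hypothetical. The divisor nh - 1 appears because, for 0/1 data, the stratum sample variance is s_h^2 = nh/(nh - 1) * p_h * (1 - p_h), so s_h^2/nh simplifies to p_h * (1 - p_h)/(nh - 1).

```r
## Hypothetical helper: stratified proportion, standard error, and z interval
strat_prop_ci <- function(Nh, nh, yes, conf = 0.95) {
  N <- sum(Nh)
  p_h <- yes / nh                            # within-stratum sample proportions
  est <- sum((Nh / N) * p_h)                 # stratified proportion
  v <- sum((1 - nh / Nh) * (Nh / N)^2 * p_h * (1 - p_h) / (nh - 1))
  se <- sqrt(v)
  z <- qnorm(1 - (1 - conf) / 2)
  list(estimate = est, se = se, ci = est + c(-1, 1) * z * se)
}

## Counts taken from the summary table above (order: AJPH, AJPM, PM)
rs <- strat_prop_ci(Nh = c(280, 103, 164),
                    nh = c(100, 38, 60),
                    yes = c(29, 14, 15))
rs$estimate
rs$se
rs$ci
```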

Confidence Interval for Total

Estimate the total number of articles (from the 547) that use random selection. Calculate the point estimate, standard error, and a 95% confidence interval.

STEP 1: Total Estimate

### ESTIMATE TOTAL FOR EACH STRATUM
Nh*p_hats

## [1] 81.20000 37.94737 41.00000

### STRATIFIED TOTAL
tot_str<-sum(Nh*p_hats)
tot_str

## [1] 160.1474

STEP 2: Variance Estimate

## VAR OF TOTAL = N^2 * VAR(P HAT STRAT)
N^2*varPHat_str

## [1] 201.129

## SE
sqrt(N^2*varPHat_str)

## [1] 14.18199

STEP 3: Confidence Interval

## CONF INT
tot_str+c(-1,1)*qnorm(0.975)*sqrt(N^2*varPHat_str)

## [1] 132.3512 187.9436
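
As a cross-check, the variance of the estimated total can also be built stratum by stratum instead of scaling the variance of the proportion by N^2. The sketch below is an illustration with hypothetical variable names; the counts come from the summary table earlier in this example.

```r
## Stratum sizes and sample counts (order: AJPH, AJPM, PM)
Nh <- c(280, 103, 164)
nh <- c(100, 38, 60)
p_hats <- c(29, 14, 15) / nh

## Per-stratum estimated totals and their estimated variances
t_h <- Nh * p_hats
var_t_h <- Nh^2 * (1 - nh / Nh) * p_hats * (1 - p_hats) / (nh - 1)

## Stratified total, standard error, and 95% z interval
t_hat <- sum(t_h)
se_t <- sqrt(sum(var_t_h))
ci_t <- t_hat + c(-1, 1) * qnorm(0.975) * se_t

## Same variance obtained by scaling the variance of the proportion by N^2
N <- sum(Nh)
var_p <- sum((1 - nh / Nh) * (Nh / N)^2 * p_hats * (1 - p_hats) / (nh - 1))
all.equal(sum(var_t_h), N^2 * var_p)

t_hat
se_t
ci_t
```

The two routes agree because (Nh/N)^2 multiplied by N^2 is Nh^2 in every stratum.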