In this lesson students will learn to:
Examples from “Sampling: Design and Analysis” by Sharon Lohr.
Data from an SRS of 120 golf courses, selected from a list of 14,938 golf courses in the United States, are in the file golfsrs.csv.
golf<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/golfsrs.csv",
header=TRUE)
Display the data in a histogram for the weekday greens fees for nine holes of golf (variable wkday9). How would you describe the shape of the data?
library(tidyverse)
## HISTOGRAM
ggplot(data=golf, aes(x=wkday9))+
geom_histogram()+
theme_bw()+
ggtitle("Distribution of Weekday Greens Fees for 9 Holes")
Find the average weekday greens fee to play nine holes of golf and give the SE for the estimate.
## SAMPLE STATS
y_bar<-mean(golf$wkday9)
y_bar
## [1] 20.15333
sampVar<-var(golf$wkday9)
sampVar
## [1] 321.357
## STANDARD ERR
n<-120
N<-14938
## FPC
thisFPC<-1-(n/N)
thisFPC
## [1] 0.9919668
## STANDARD ERROR
standErr<-sqrt(thisFPC*(sampVar/n))
standErr
## [1] 1.629866
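The steps above apply the usual SRS formula with the finite population correction, SE = sqrt((1 - n/N) * s^2 / n). As a minimal sketch, the same calculation can be wrapped in a small helper. The function name `srs_se` is hypothetical (it is not part of the lesson), and the inputs are the rounded values printed above.

```r
# Hypothetical helper (not from the original lesson): standard error of the
# sample mean under simple random sampling without replacement.
srs_se <- function(s2, n, N) {
  fpc <- 1 - n / N          # finite population correction
  sqrt(fpc * s2 / n)
}

# Rounded values from the golf example: s^2 = 321.357, n = 120, N = 14938
se_golf <- srs_se(s2 = 321.357, n = 120, N = 14938)
se_golf

# Without the FPC the standard error is only slightly larger, because the
# sampling fraction 120/14938 is under 1%
se_no_fpc <- sqrt(321.357 / 120)
se_no_fpc
```

Because the sampling fraction is so small here, dropping the FPC would change the standard error only in the third decimal place.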
Use what you found in Part B to construct a 95% confidence interval.
## CONFIDENCE INTERVAL
## Z CI
zCritVal<-qnorm(0.975)
y_bar+c(-1,1)*zCritVal*standErr
## [1] 16.95886 23.34781
## NEW T CI
tCritVal<-qt(0.975, df=n-1)
y_bar+c(-1,1)*tCritVal*standErr
## [1] 16.92604 23.38063
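The two intervals above differ only in the critical value: with 119 degrees of freedom the t critical value is about 1.98, compared with about 1.96 for the normal. A minimal sketch of a helper that makes this choice explicit follows; `mean_ci` is a hypothetical name, and the estimate and standard error are the rounded values printed above.

```r
# Hypothetical helper: confidence interval for a mean from an estimate and
# its standard error. df = NULL uses the normal critical value; otherwise
# a t critical value with the given degrees of freedom is used.
mean_ci <- function(est, se, level = 0.95, df = NULL) {
  alpha <- 1 - level
  crit <- if (is.null(df)) qnorm(1 - alpha / 2) else qt(1 - alpha / 2, df = df)
  est + c(-1, 1) * crit * se
}

z_ci <- mean_ci(20.15333, 1.629866)                # normal-based interval
t_ci <- mean_ci(20.15333, 1.629866, df = 120 - 1)  # t-based interval, n - 1 df
z_ci
t_ci
```

The t interval is always a little wider than the z interval, and the gap shrinks as the sample size grows.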
In the SRS from the previous example (Class 7B), not all departments were represented. The following data are from a stratified sample that uses academic division as the stratification variable. The population contains 102 faculty in Biological Sciences, 310 in Physical Sciences, 217 in Social Sciences, and 178 in Humanities.
pubStr<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/pubStrat.csv",
header=TRUE)
head(pubStr)
## Publications Biological Physical Social Humanities
## 1 0 1 10 9 8
## 2 1 2 2 0 2
## 3 2 0 0 1 0
## 4 3 1 1 0 1
## 5 4 0 2 2 0
## 6 5 2 1 0 0
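The printed rows suggest that pubStr is stored as a frequency table: each row appears to give a number of publications and, for each division, how many sampled faculty had that many publications. Under that reading, stratum means and variances are computed with the counts as weights. The sketch below uses made-up counts (not the pubStrat.csv data) to show that the frequency-weighted formulas agree with `mean()` and `var()` applied to the expanded data.

```r
# Made-up frequency table for illustration only (not the pubStrat.csv data)
pubs   <- c(0, 1, 2, 3)   # number of publications
counts <- c(2, 1, 0, 1)   # sampled faculty with that many publications

# Frequency-weighted sample size, mean, and variance
n_h    <- sum(counts)
ybar_h <- sum(pubs * counts) / n_h
s2_h   <- sum(counts * (pubs - ybar_h)^2) / (n_h - 1)

# Same quantities after expanding the table to one value per faculty member
expanded <- rep(pubs, times = counts)   # 0 0 1 3
c(ybar_h, mean(expanded))
c(s2_h, var(expanded))
```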
Estimate the average number of refereed publications by faculty members in the college, and give the standard error. Use these to construct a 95% confidence interval.
### BIO
n1<-sum(pubStr$Biological)
ybar1<-sum(pubStr$Publications*pubStr$Biological)/n1
ybar1
## [1] 3.142857
s2_1<-(1/(n1-1))*sum(pubStr$Biological*(pubStr$Publications-ybar1)^2)
s2_1
## [1] 6.809524
sqrt(s2_1)
## [1] 2.609506
### PHYSICAL
n2<-sum(pubStr$Physical)
ybar2<-sum(pubStr$Publications*pubStr$Physical)/n2
ybar2
## [1] 2.105263
s2_2<-(1/(n2-1))*sum(pubStr$Physical*(pubStr$Publications-ybar2)^2)
s2_2
## [1] 8.210526
sqrt(s2_2)
## [1] 2.865402
### SOCIAL
n3<-sum(pubStr$Social)
ybar3<-sum(pubStr$Publications*pubStr$Social)/n3
ybar3
## [1] 1.230769
s2_3<-(1/(n3-1))*sum(pubStr$Social*(pubStr$Publications-ybar3)^2)
s2_3
## [1] 4.358974
sqrt(s2_3)
## [1] 2.087816
### HUMANITIES
n4<-sum(pubStr$Humanities)
ybar4<-sum(pubStr$Publications*pubStr$Humanities)/n4
ybar4
## [1] 0.4545455
s2_4<-(1/(n4-1))*sum(pubStr$Humanities*(pubStr$Publications-ybar4)^2)
s2_4
## [1] 0.8727273
sqrt(s2_4)
## [1] 0.9341987
### COMBINE
N<-807
Nh<-c(102, 310, 217, 178)
nh<-c(n1, n2, n3, n4)
ybars<-c(ybar1, ybar2, ybar3, ybar4)
s2s<-c(s2_1, s2_2, s2_3, s2_4)
## Y BAR STRAT
y_bar_strat<-sum((Nh/N)*ybars)
y_bar_strat
## [1] 1.637161
## VAR Y BAR STRAT
var_ybar_strat<-sum((1-(nh/Nh))*(Nh/N)^2*(s2s/nh))
var_ybar_strat
## [1] 0.1007461
## STD ERR
stdErr_ybar_strat<-sqrt(var_ybar_strat)
stdErr_ybar_strat
## [1] 0.3174052
## CONF INT (t with n - H = 50 - 4 = 46 degrees of freedom)
y_bar_strat+c(-1,1)*qt(.975, df=sum(nh)-length(nh))*stdErr_ybar_strat
## [1] 0.9982575 2.2760647
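The stratum-by-stratum work above can be collected into one function that takes per-stratum summaries. This is a sketch under stated assumptions, not the lesson's own code: `strat_mean` is a hypothetical name, and the stratum sample sizes 7, 19, 13, and 11 are not printed in the output but are inferred from the printed means (for example, 3.142857 = 22/7); they sum to 50, which matches the degrees of freedom used above.

```r
# Hypothetical helper: stratified estimate of a population mean from
# per-stratum summaries (population sizes, sample sizes, means, variances).
strat_mean <- function(Nh, nh, ybars, s2s, level = 0.95) {
  W   <- Nh / sum(Nh)                          # stratum weights N_h / N
  est <- sum(W * ybars)                        # stratified mean
  v   <- sum((1 - nh / Nh) * W^2 * s2s / nh)   # estimated variance
  se  <- sqrt(v)
  df  <- sum(nh) - length(nh)                  # n - H degrees of freedom
  ci  <- est + c(-1, 1) * qt(1 - (1 - level) / 2, df) * se
  list(estimate = est, se = se, df = df, ci = ci)
}

res <- strat_mean(
  Nh    = c(102, 310, 217, 178),
  nh    = c(7, 19, 13, 11),   # assumed: inferred from the printed stratum means
  ybars = c(3.142857, 2.105263, 1.230769, 0.4545455),
  s2s   = c(6.809524, 8.210526, 4.358974, 0.8727273)
)
res
```

Keeping the weights, the variance, and the degrees of freedom in one place makes it easier to see that each stratum contributes its own FPC and its own s_h^2/n_h term.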
How does your result from Part A compare to what you found using the simple random sample?
Did stratification increase precision in this sample? Explain why you think it did or did not.
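## DESIGN EFFECT: estimated variance of the stratified mean (about 0.101)
## divided by the estimated variance of the mean from the SRS (0.135)
0.101/0.135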
## [1] 0.7481481
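The ratio above compares the estimated variance of the stratified mean with the estimated variance from the simple random sample. The sketch below repeats it with the unrounded stratified variance printed earlier; the value 0.135 is taken as given from the calculation above, and the variable names are hypothetical.

```r
var_strat <- 0.1007461   # estimated variance of the stratified mean (printed earlier)
var_srs   <- 0.135       # SRS variance, taken as given from the ratio above

# Ratio of estimated variances (design effect)
deff <- var_strat / var_srs
deff

# A ratio below 1 means the stratified estimate has the smaller estimated
# variance. Relative reduction in estimated variance:
1 - deff
```

Both variances are estimates from single samples, so the ratio is itself subject to sampling variability.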
Probability sampling (random selection) allows researchers to generalize from the sample to the population, and random assignment in experiments allows researchers to infer causation, but how often are these methods employed in public health research? Hayat and Knapp (2017) drew a stratified random sample of 198 articles from the 547 research articles published in 2013 by three leading public health journals. For each article, they determined the number of authors, the type of statistical inference used (confidence intervals, hypothesis tests, both, or neither), and whether random selection or random assignment was used. The data are in file healthjournals.csv.
#### PUBLIC HEALTH
ph<-read.csv("https://raw.githubusercontent.com/kitadasmalley/Teaching/refs/heads/main/DATA429_599/CODE/healthjournals.csv",
header=TRUE)
str(ph)
## 'data.frame': 198 obs. of 7 variables:
## $ Journal : chr "AJPH" "AJPH" "AJPH" "AJPH" ...
## $ NumAuthors: int 4 9 6 5 7 4 3 4 2 4 ...
## $ RandomSel : chr "No" "No" "No" "No" ...
## $ RandomAssn: chr "No" "No" "No" "No" ...
## $ ConfInt : chr "Yes" "Yes" "Yes" "Yes" ...
## $ HypTest : chr "No" "Yes" "Yes" "Yes" ...
## $ Asterisks : chr "No" "No" "No" "No" ...
The population sizes (number of research articles published in 2013) are 280 for AJPH, 103 for AJPM, and 164 for PM, for a total of 547.
Estimate the percentage of the articles that use random selection. Calculate the point estimate, standard error, and a 95% confidence interval.
ph%>%
group_by(Journal, RandomSel)%>%
summarise(n=n())
## `summarise()` has grouped output by 'Journal'. You can override using the
## `.groups` argument.
## # A tibble: 6 × 3
## # Groups: Journal [3]
## Journal RandomSel n
## <chr> <chr> <int>
## 1 AJPH No 71
## 2 AJPH Yes 29
## 3 AJPM No 24
## 4 AJPM Yes 14
## 5 PM No 45
## 6 PM Yes 15
p_hats<-c(29/(29+71), 14/(14+24), 15/(15+45))
p_hats
## [1] 0.2900000 0.3684211 0.2500000
Nh<-c(280, 103, 164)
nh<-c(100, 38, 60)
N<-sum(Nh)
Alternatively, the stratum proportions can be computed directly by recoding RandomSel as a logical variable:
### WE CAN MAKE THIS BINARY
ph01<-ph%>%
mutate(randSel01=(RandomSel=="Yes"))%>%
group_by(Journal)%>%
summarise(n=n(),
prop=mean(randSel01, na.rm=TRUE))
ph01
## # A tibble: 3 × 3
## Journal n prop
## <chr> <int> <dbl>
## 1 AJPH 100 0.29
## 2 AJPM 38 0.368
## 3 PM 60 0.25
## P HAT STRAT
pHat_strat<-sum((Nh/N)*p_hats)
pHat_strat
## [1] 0.292774
## SE
## LET'S LOOK AT EACH COMPONENT
(1-(nh/Nh))*(Nh/N)^2*((p_hats*(1-p_hats))/(nh-1))
## [1] 0.0003503298 0.0001407169 0.0001811556
## ESTIMATED VARIANCE OF P HAT STRAT
varPHat_str<-sum((1-(nh/Nh))*(Nh/N)^2*((p_hats*(1-p_hats))/(nh-1)))
varPHat_str
## [1] 0.0006722023
## STANDARD ERROR
sqrt(varPHat_str)
## [1] 0.02592686
### CONF INT
pHat_strat+c(-1,1)*qnorm(0.975)*sqrt(varPHat_str)
## [1] 0.2419583 0.3435897
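The divisor n_h - 1 in the variance above comes from treating each article as a 0/1 response: for a binary variable the sample variance is s_h^2 = n_h/(n_h - 1) * p_h(1 - p_h), so s_h^2/n_h = p_h(1 - p_h)/(n_h - 1). A minimal sketch that packages the whole calculation is given below; `strat_prop` is a hypothetical name, and the counts are the ones shown in the summary table above.

```r
# Hypothetical helper: stratified estimate of a population proportion from
# stratum population sizes, sample sizes, and counts of "Yes" responses.
strat_prop <- function(Nh, nh, yes, level = 0.95) {
  p   <- yes / nh                                   # stratum proportions
  W   <- Nh / sum(Nh)                               # stratum weights N_h / N
  est <- sum(W * p)
  v   <- sum((1 - nh / Nh) * W^2 * p * (1 - p) / (nh - 1))
  se  <- sqrt(v)
  ci  <- est + c(-1, 1) * qnorm(1 - (1 - level) / 2) * se
  list(estimate = est, se = se, ci = ci)
}

# Strata in the order AJPH, AJPM, PM
res_p <- strat_prop(Nh = c(280, 103, 164),
                    nh = c(100, 38, 60),
                    yes = c(29, 14, 15))
res_p
```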
Estimate the total number of articles (from the 547) that use random selection. Calculate the point estimate, standard error, and a 95% confidence interval.
### ESTIMATE TOTAL FOR EACH STRATUM
Nh*p_hats
## [1] 81.20000 37.94737 41.00000
### STRATIFIED TOTAL
tot_str<-sum(Nh*p_hats)
tot_str
## [1] 160.1474
## VAR
N^2*varPHat_str
## [1] 201.129
## SE
sqrt(N^2*varPHat_str)
## [1] 14.18199
## CONF INT
tot_str+c(-1,1)*qnorm(0.975)*sqrt(N^2*varPHat_str)
## [1] 132.3512 187.9436
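Since the estimated total is N times the estimated proportion, its variance is N^2 times the variance of the proportion. The same variance can also be built directly from stratum totals, with each stratum contributing (1 - n_h/N_h) * N_h^2 * p_h(1 - p_h)/(n_h - 1). The sketch below checks that the two routes agree; it recomputes everything from the counts shown earlier so that it runs on its own, and the variable names are hypothetical.

```r
# Strata in the order AJPH, AJPM, PM; counts taken from the summary table
Nh     <- c(280, 103, 164)
nh     <- c(100, 38, 60)
p_hats <- c(29, 14, 15) / nh
N      <- sum(Nh)

# Route 1: scale the variance of the stratified proportion by N^2
var_p     <- sum((1 - nh / Nh) * (Nh / N)^2 * p_hats * (1 - p_hats) / (nh - 1))
var_tot_1 <- N^2 * var_p

# Route 2: sum the variances of the estimated stratum totals
var_tot_2 <- sum((1 - nh / Nh) * Nh^2 * p_hats * (1 - p_hats) / (nh - 1))

all.equal(var_tot_1, var_tot_2)

# Point estimate, standard error, and 95% confidence interval for the total
tot    <- sum(Nh * p_hats)
se_tot <- sqrt(var_tot_2)
tot + c(-1, 1) * qnorm(0.975) * se_tot
```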