First, I load my packages for this exercise, as well as the data that we’ll be using:
library(plyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
Here, we are going to calculate the standard error for mean household sizes, by the birthplace of the head-of-household (US Born or Foreign Born):
ipums %>%
filter(relate==1) %>%
mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
group_by(birthplace) %>%
summarise(mean_family_size=mean(famsize),sd=sd(famsize),standard_error=sd(famsize)/sqrt(length(famsize)))
## # A tibble: 2 x 4
## birthplace mean_family_size sd standard_error
## <chr> <dbl> <dbl> <dbl>
## 1 FOREIGN_BORN 2.934445 1.683525 0.013353324
## 2 US_BORN 2.290301 1.332221 0.004183421
Notice there is a higher standard error for FOREIGN_BORN than for US_BORN, probably reflecting the smaller sample-size of FOREIGN_BORN.
First, we load the function to calculate a normal confidence interval:
norm.interval = function(data, conf.level = 0.95)
{z = qnorm((1 - conf.level)/2, lower.tail = FALSE)
variance = var(data, na.rm=T)
xbar = mean(data, na.rm=T)
sdx = sqrt(variance/length(data))
c(xbar - z * sdx, xbar + z * sdx) }
Then, we calculate for mean family size of US Born and Foreign Born heads-of-households:
us_born<-ipums %>%
filter(relate==1) %>%
mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
filter(birthplace=="US_BORN")
foreign_born<-ipums %>%
filter(relate==1) %>%
mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
filter(birthplace=="FOREIGN_BORN")
norm.interval(us_born$famsize)
## [1] 2.282102 2.298500
norm.interval(foreign_born$famsize)
## [1] 2.908273 2.960617
As we can see, the confidence intervals for US Born (2.282 - 2.298) and Foreign Born (2.90 - 2.96) do not overlap.
Here, I’ll be using the “Bootstrap method” to calculate confidence intervals for mean household sizes, by birthplace of head-of-household, starting with US_BORN head-of-households.
n.sim<-1000
mus<-numeric(n.sim)
vars<-numeric(n.sim)
for (i in 1:n.sim){
dat<-sample(us_born$famsize,size=length(us_born$famsize), replace=T)
mus[i]<-mean(dat, na.rm=T)
vars[i]<-var(dat, na.rm=T)
}
par(mfrow=c(1,2))
hist(mus,freq=F, main="B.strap. US_BORN famsize means")
abline(v=mean(us_born$famsize, na.rm=T), col=2, lwd=3)
hist(vars, freq=F,main="B.strap. US_BORN famsize var")
abline(v=var(us_born$famsize, na.rm=T), col=2, lwd=3)
Here is the bootstrap distribution of US BORN means, in terms of the percentile method:
quantile(mus,p=c(.025, .975))
## 2.5% 97.5%
## 2.282420 2.297914
As you can see, this is very close to the normal confidence interval provided by “norm.interval”
Now, I’ll be using the “Bootstrap method” to calculate confidence intervals for mean household sizes, by birthplace of head-of-household, but this time with FOREIGN_BORN head-of-households.
n.sim<-1000
foreign_mus<-numeric(n.sim)
foreign_vars<-numeric(n.sim)
for (i in 1:n.sim){
dat<-sample(foreign_born$famsize,size=length(foreign_born$famsize), replace=T)
foreign_mus[i]<-mean(dat, na.rm=T)
foreign_vars[i]<-var(dat, na.rm=T)
}
par(mfrow=c(1,2))
hist(foreign_mus,freq=F, main="B.strap. FOREIGN famsize means")
abline(v=mean(foreign_born$famsize, na.rm=T), col=2, lwd=3)
hist(foreign_vars, freq=F,main="B.strap. FOREIGN famsize var")
abline(v=var(foreign_born$famsize, na.rm=T), col=2, lwd=3)
Here is the bootstrap distribution of FOREIGN BORN family-size means, in terms of the percentile method:
quantile(foreign_mus,p=c(.025, .975))
## 2.5% 97.5%
## 2.909843 2.959495
As you can see, these two bootstrapped confidence intervals are extremely close to the original confidence intervals derived from Dr. Spark’s “norm.interval” function.