Loading Packages

First, I load my packages for this exercise, as well as the data that we’ll be using:

library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(haven)

ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

Standard error for mean household sizes, by birthplace of head-of-household

Here, we are going to calculate the standard error for mean household sizes, by the birthplace of the head-of-household (US Born or Foreign Born):

ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
  group_by(birthplace) %>%
  summarise(mean_family_size=mean(famsize),sd=sd(famsize),standard_error=sd(famsize)/sqrt(length(famsize)))
## # A tibble: 2 x 4
##     birthplace mean_family_size       sd standard_error
##          <chr>            <dbl>    <dbl>          <dbl>
## 1 FOREIGN_BORN         2.934445 1.683525    0.013353324
## 2      US_BORN         2.290301 1.332221    0.004183421

Notice there is a higher standard error for FOREIGN_BORN than for US_BORN, probably reflecting the smaller sample-size of FOREIGN_BORN.

Confidence intervals for mean of household sizes, by birthplace of head-of-household, using the normal approximation

First, we load the function to calculate a normal confidence interval:

norm.interval = function(data, conf.level = 0.95) 
{z = qnorm((1 - conf.level)/2, lower.tail = FALSE)

 variance = var(data, na.rm=T)
 xbar = mean(data, na.rm=T)
 sdx = sqrt(variance/length(data))
 c(xbar - z * sdx, xbar + z * sdx) }

Then, we calculate for mean family size of US Born and Foreign Born heads-of-households:

us_born<-ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
  filter(birthplace=="US_BORN")

foreign_born<-ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
  filter(birthplace=="FOREIGN_BORN")

norm.interval(us_born$famsize)
## [1] 2.282102 2.298500
norm.interval(foreign_born$famsize)
## [1] 2.908273 2.960617

As we can see, the confidence intervals for US Born (2.282 - 2.298) and Foreign Born (2.90 - 2.96) do not overlap.

Confidence intervals for mean of household sizes, by birthplace of head-of-household, using Bootstrap

Here, I’ll be using the “Bootstrap method” to calculate confidence intervals for mean household sizes, by birthplace of head-of-household, starting with US_BORN head-of-households.

n.sim<-1000

mus<-numeric(n.sim)
vars<-numeric(n.sim)
for (i in 1:n.sim){  
  dat<-sample(us_born$famsize,size=length(us_born$famsize), replace=T)
  mus[i]<-mean(dat, na.rm=T)
  vars[i]<-var(dat, na.rm=T)
}

par(mfrow=c(1,2))
hist(mus,freq=F, main="B.strap. US_BORN famsize means")
abline(v=mean(us_born$famsize, na.rm=T), col=2, lwd=3)

hist(vars, freq=F,main="B.strap. US_BORN famsize var")
abline(v=var(us_born$famsize, na.rm=T), col=2, lwd=3)

Here is the bootstrap distribution of US BORN means, in terms of the percentile method:

quantile(mus,p=c(.025, .975))
##     2.5%    97.5% 
## 2.282420 2.297914

As you can see, this is very close to the normal confidence interval provided by “norm.interval”

Now, I’ll be using the “Bootstrap method” to calculate confidence intervals for mean household sizes, by birthplace of head-of-household, but this time with FOREIGN_BORN head-of-households.

n.sim<-1000

foreign_mus<-numeric(n.sim)
foreign_vars<-numeric(n.sim)
for (i in 1:n.sim){  
  dat<-sample(foreign_born$famsize,size=length(foreign_born$famsize), replace=T)
  foreign_mus[i]<-mean(dat, na.rm=T)
  foreign_vars[i]<-var(dat, na.rm=T)
}

par(mfrow=c(1,2))
hist(foreign_mus,freq=F, main="B.strap. FOREIGN famsize means")
abline(v=mean(foreign_born$famsize, na.rm=T), col=2, lwd=3)

hist(foreign_vars, freq=F,main="B.strap. FOREIGN famsize var")
abline(v=var(foreign_born$famsize, na.rm=T), col=2, lwd=3)

Here is the bootstrap distribution of FOREIGN BORN family-size means, in terms of the percentile method:

quantile(foreign_mus,p=c(.025, .975))
##     2.5%    97.5% 
## 2.909843 2.959495

As you can see, these two bootstrapped confidence intervals are extremely close to the original confidence intervals derived from Dr. Spark’s “norm.interval” function.