Descriptive Statistics for US-vs-Foreign-born Head-of-Household Family Size

Here, I’m going to provide descriptive statistics for family size, comparing US-born and foreign-born heads of household. First, I need to load my libraries and the data itself.

library(plyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(haven)
library(broom)
library(ggplot2)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

Now, I just need the code to produce descriptive statistics for Head-of-Household family size, by place of birth.

ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
  group_by(birthplace) %>%
  summarise(mean_family_size=mean(famsize),sd=sd(famsize),sample_size=n())
## # A tibble: 2 x 4
##     birthplace mean_family_size       sd sample_size
##          <chr>            <dbl>    <dbl>       <int>
## 1 FOREIGN_BORN         2.934445 1.683525       15895
## 2      US_BORN         2.290301 1.332221      101412

We can see that the mean family size appears to be quite different depending on the place-of-birth for the head-of-household. Let’s examine this relationship with a graph:

ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
  group_by(birthplace) %>%
  ggplot()+
  geom_boxplot(aes(x=birthplace,y=famsize))+
  ggtitle(label="Family Size by Place-of-Birth for Head-of-Household")+
  xlab("Place of Birth for Head-of-Household")+
  ylab("Family Size")
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.

Here is a plot of the two group means with standard-error bars, which shows the difference between Foreign Born and US Born family sizes more directly:

mean_plot<-ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN")) %>%
  group_by(birthplace) %>%
  summarise(mean_family_size=mean(famsize),sd=sd(famsize),sample_size=n(),standard_error=sd(famsize)/sqrt(length(famsize)))

ggplot(mean_plot, aes(x=birthplace,y=mean_family_size,group=1)) +
  geom_line()+
  geom_errorbar(width=.1, aes(ymin=mean_family_size-standard_error,ymax=mean_family_size+standard_error))+
  geom_point(shape=21,size=3,fill="white") +
  ggtitle(label="Differences of Mean Family Size in Foreign Born vs. US Born Head-of-Household")+
  labs(subtitle="With Error Bars")

Here, we can see that the mean family sizes for foreign-born and US-born heads of household appear to be very different. The error bars also reveal a larger standard error for FOREIGN_BORN, which is to be expected given its smaller sample size and larger standard deviation. Still, the two means are clearly far apart relative to either standard error.
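The standard errors behind the error bars can be recomputed by hand from the summary table above. This is a minimal sketch that copies the reported means, standard deviations, and sample sizes into plain vectors (the vector names are my own), so it runs without downloading the data:

```r
# Values copied from the summary table above.
group_mean <- c(FOREIGN_BORN = 2.934445, US_BORN = 2.290301)
group_sd   <- c(FOREIGN_BORN = 1.683525, US_BORN = 1.332221)
group_n    <- c(FOREIGN_BORN = 15895,    US_BORN = 101412)

# Standard error of each mean: sd / sqrt(n).
standard_error <- group_sd / sqrt(group_n)

# Bounds drawn by geom_errorbar() in the plot: mean +/- one standard error.
error_bars <- cbind(
  lower = group_mean - standard_error,
  upper = group_mean + standard_error
)

print(round(standard_error, 4))
print(round(error_bars, 3))
```

The FOREIGN_BORN standard error comes out roughly three times the US_BORN one, and the two sets of error bars do not come close to overlapping.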

Testing for Equality of Mean Family Size using the Linear Model

Now, I am going to test for equality of the FOREIGN_BORN and US_BORN family size means, using the linear model as taught in class:

new_ipums<-ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US_BORN","FOREIGN_BORN"))

family_mean_fit<-lm(famsize~birthplace,data=new_ipums)

tidy(family_mean_fit)
##                term   estimate  std.error statistic p.value
## 1       (Intercept)  2.9344448 0.01098588 267.11061       0
## 2 birthplaceUS_BORN -0.6441438 0.01181550 -54.51685       0

The model confirms that the means are very different. FOREIGN_BORN (the Intercept) has a mean of 2.93, while US_BORN has a mean of 2.934 - 0.644 = 2.29, since the birthplaceUS_BORN coefficient is the difference from the intercept. This is consistent with the exploratory statistics from above. The p-value prints as “0” because it is too small to display: with a t-statistic of -54.5, a difference this large would be extremely unlikely if the two population means were actually equal, so we reject the hypothesis of equal means.
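A linear model with a single two-level predictor is the same test as the classical two-sample t-test with pooled variance. The sketch below illustrates this on simulated data; the data frame `toy`, its group sizes, and the Poisson rates are made up for illustration and are not the IPUMS values. It also runs Welch's t-test, which may be worth a look because the two groups above have different standard deviations.

```r
# Simulated data only: group sizes and rates here are invented.
set.seed(42)
toy <- data.frame(
  birthplace = rep(c("FOREIGN_BORN", "US_BORN"), times = c(200, 800)),
  famsize    = c(rpois(200, 1.9) + 1, rpois(800, 1.3) + 1)
)

# Linear model: the intercept is the FOREIGN_BORN mean (first level
# alphabetically); the slope is the US_BORN mean minus that intercept.
toy_fit <- lm(famsize ~ birthplace, data = toy)
lm_row  <- coef(summary(toy_fit))["birthplaceUS_BORN", ]

# Classical two-sample t-test with a pooled variance estimate.
pooled <- t.test(famsize ~ birthplace, data = toy, var.equal = TRUE)

# Welch's t-test, which does not assume equal variances.
welch <- t.test(famsize ~ birthplace, data = toy)

# Same magnitude, opposite sign: t.test() reports first group minus second,
# while the lm() slope is second group minus first.
print(c(lm = unname(lm_row["t value"]), pooled = unname(pooled$statistic)))
print(c(pooled_p = pooled$p.value, welch_p = welch$p.value))
```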

Normality of Errors

Now we will check the residuals of this linear model for normality, using a q-q plot:

qqnorm(rstudent(family_mean_fit), main="Q-Q Plot for Model Residuals")

It appears that the model's residuals show some degree of non-normality (that is, the points in the Q-Q plot depart noticeably from a straight line).
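Adding a reference line with `qqline()` can make the departure from normality easier to judge. The sketch below uses simulated count data (the data frame `toy` and its Poisson rates are assumptions for illustration, not the IPUMS data). Because the outcome is a small whole number and there are only two groups, the residuals can only take a handful of distinct values, which is one reason a Q-Q plot of this kind of model tends to look like a staircase.

```r
# Simulated count outcome in two groups; values are invented.
set.seed(1)
toy <- data.frame(
  group = rep(c("A", "B"), each = 500),
  size  = c(rpois(500, 2) + 1, rpois(500, 1.3) + 1)
)

toy_fit <- lm(size ~ group, data = toy)
res     <- rstudent(toy_fit)

# Q-Q plot with a reference line through the first and third quartiles.
qqnorm(res, main = "Q-Q Plot for Toy Model Residuals")
qqline(res, col = "red")

# Count how many distinct residual values the model produces.
n_distinct_residuals <- length(unique(round(res, 6)))
print(n_distinct_residuals)
```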

Transformations

We can try transforming the outcome in various ways to see whether the residuals look closer to normal on another scale.

Here is a log transformation of the outcome in the linear model:

family_mean_fit_log<-lm(log(famsize)~birthplace,data=new_ipums)
qqnorm(rstudent(family_mean_fit_log), main="Q-Q Plot for Model Residuals")

We can also try a square-root transformation:

family_mean_fit_sqrt<-lm(sqrt(famsize)~birthplace,data=new_ipums)
qqnorm(rstudent(family_mean_fit_sqrt), main="Q-Q Plot for Model Residuals")

Or, we can try a reciprocal transformation:

family_mean_fit_recip<-lm(I(1/famsize)~birthplace,data=new_ipums)
qqnorm(rstudent(family_mean_fit_recip), main="Q-Q Plot for Model Residuals")

A cursory glance at these Q-Q plots reveals that the points still depart from a straight line under every transformation, suggesting that the errors remain somewhat far from normal.
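To compare the transformations with a number rather than by eye, one option is the correlation between the theoretical and sample quantiles of each Q-Q plot (values nearer 1 mean a straighter plot). This is only a rough sketch on simulated count data; the data frame `toy`, the helper `qq_cor()`, and the Poisson rates are all assumptions for illustration, and the same helper could be applied to the fitted models above.

```r
# Simulated count outcome in two groups; values are invented.
set.seed(2)
toy <- data.frame(
  group = rep(c("A", "B"), each = 500),
  size  = c(rpois(500, 2) + 1, rpois(500, 1.3) + 1)
)

# Hypothetical helper: correlation between theoretical and sample
# quantiles of the studentized residuals (no plot is drawn).
qq_cor <- function(fit) {
  qq <- qqnorm(rstudent(fit), plot.it = FALSE)
  cor(qq$x, qq$y)
}

# The same four models as in the text, fitted to the toy data.
fits <- list(
  raw        = lm(size ~ group, data = toy),
  log        = lm(log(size) ~ group, data = toy),
  sqrt       = lm(sqrt(size) ~ group, data = toy),
  reciprocal = lm(I(1 / size) ~ group, data = toy)
)

qq_cors <- sapply(fits, qq_cor)
print(round(qq_cors, 3))
```

A single summary number like this hides where the departure occurs (tails versus center), so it complements the plots rather than replacing them.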