Using IPUMS Data
library(readr)
library(dplyr) #to manipulate data
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2) #to visualize data
library(broom) #to make results printable
## Loading Data
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
1. Descriptive Statistics
ipums %>%
filter(relate==1) %>% #to only have household heads
mutate(BirthPlace=ifelse(bpl<=120,"US Born","Foreign Born")) %>% #applying the conditions
group_by(BirthPlace) %>%
summarise(mean_familysize=mean(famsize), sd=sd(famsize), n())
## # A tibble: 2 x 4
## BirthPlace mean_familysize sd `n()`
## <chr> <dbl> <dbl> <int>
## 1 Foreign Born 2.934445 1.683525 15895
## 2 US Born 2.290301 1.332221 101412
Mean family size differs between households with Foreign Born and US Born heads: 2.93 and 2.29, respectively. The standard deviations also differ, at 1.68 and 1.33. Note that the group sizes are very unequal: 15,895 Foreign Born versus 101,412 US Born household heads.
Graphical Representation: Box Plot
ipums %>%
filter(relate==1) %>%
mutate(BirthPlace=case_when(bpl %in% 121:998~"Foreign Born",
bpl %in% 1:120~"US Born")) %>%
ggplot()+
geom_boxplot(aes(x=BirthPlace, y=famsize, fill=BirthPlace))+
ggtitle(label="Family Size by BirthPlace of Household Heads") +
coord_flip()+
xlab("Birthplace of Household Heads")+
ylab("Family Size")
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.

The box plots above are consistent with the descriptive statistics: family size differs between the two groups.
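The scale warning printed with the plot arises because read_dta() keeps Stata value labels, so columns such as famsize arrive as labelled vectors rather than plain numerics. A minimal sketch of one way to avoid it, assuming a recent haven version (where the class is named haven_labelled) and using a made-up toy vector rather than the IPUMS column:

```r
library(haven)

# Toy stand-in for a labelled column; the values and the label text are invented
famsize_toy <- labelled(c(1, 2, 4, 3), labels = c("1 family member present" = 1))

class(famsize_toy)   # a labelled class, not plain numeric

# zap_labels() drops the value labels and returns the underlying numeric vector
famsize_num <- as.numeric(zap_labels(famsize_toy))
class(famsize_num)
```

Applying the same conversion inside mutate() before calling ggplot() should give the y axis an ordinary continuous variable.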
2. Testing for Equality of the Family Size in Foreign Born and US Born Households
Using Linear Model
newpums<-ipums %>%
filter(relate==1) %>%
mutate(birthplace=ifelse(bpl<=120,"US Born","Foreign Born"))
fam_mean<-lm(famsize~birthplace,data=newpums) #Linear Model
tidy(fam_mean)
## term estimate std.error statistic p.value
## 1 (Intercept) 2.9344448 0.01098588 267.11061 0
## 2 birthplaceUS Born -0.6441438 0.01181550 -54.51685 0
The estimates show that the two group means differ. The intercept, 2.93, is the mean family size for the reference group (Foreign Born household heads). Adding the birthplaceUS Born estimate to the intercept gives the mean for US Born household heads: 2.93 + (-0.64) = 2.29. The p-value is printed as 0 because it is too small to display, not because it is exactly zero. With a test statistic of -54.5, we reject the null hypothesis that the two groups have the same mean family size.
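The link between the regression coefficients and the group means can be checked directly. The sketch below uses simulated data, not the IPUMS extract: the group sizes, means and standard deviations are invented for illustration. It also compares the model's t statistic with a pooled two-sample t-test, and runs Welch's test as one possible alternative when the group variances differ.

```r
# Simulated stand-in data; all numbers below are assumptions for illustration
set.seed(1)
toy <- data.frame(
  birthplace = rep(c("Foreign Born", "US Born"), times = c(200, 800)),
  famsize = c(rnorm(200, mean = 2.9, sd = 1.7), rnorm(800, mean = 2.3, sd = 1.3))
)

fit <- lm(famsize ~ birthplace, data = toy)
group_means <- tapply(toy$famsize, toy$birthplace, mean)

# Intercept equals the mean of the reference level ("Foreign Born")
coef(fit)[["(Intercept)"]]
group_means[["Foreign Born"]]

# Intercept plus the dummy coefficient equals the mean of "US Born"
coef(fit)[["(Intercept)"]] + coef(fit)[["birthplaceUS Born"]]
group_means[["US Born"]]

# The coefficient's t statistic matches a pooled-variance t-test, up to sign
t_lm <- summary(fit)$coefficients["birthplaceUS Born", "t value"]
t_pooled <- t.test(famsize ~ birthplace, data = toy, var.equal = TRUE)$statistic

# Welch's test (the t.test default) does not assume equal variances
t.test(famsize ~ birthplace, data = toy)
```

The signs of the two t statistics differ only because lm() reports US Born minus Foreign Born, while t.test() reports the first group minus the second.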
3. Examining the Model for Normality of Errors
Q-Q Plot for Model Residuals
qqnorm(rstudent(fam_mean), main="Q-Q Plot for Model Residuals")

The points in the Q-Q plot depart clearly from a straight line, which indicates that the model residuals are not normally distributed.
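Adding a reference line makes the departure from normality easier to judge. The sketch below is an illustration on simulated data: the outcome is an invented right-skewed count, chosen on the assumption that family size behaves like a positive integer count, and it is not the IPUMS variable.

```r
# Simulated right-skewed count outcome; not the IPUMS data
set.seed(2)
toy <- data.frame(
  group = rep(c("a", "b"), each = 500),
  y = 1 + rpois(1000, lambda = 1.5)
)

fit <- lm(y ~ group, data = toy)
res <- rstudent(fit)   # studentized residuals, as in the plot above

# qqnorm() draws the plot and invisibly returns the plotted quantiles
qq <- qqnorm(res, main = "Q-Q Plot for Model Residuals")

# Reference line through the first and third quartiles
qqline(res, col = "red")
```

Residuals from a well-behaved normal model would fall close to the red line; systematic curvature or stair-step patterns suggest skewness or a discrete outcome.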