Using IPUMS Data

library(readr)
library(dplyr) #to manipulate data
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) #to visualize data
library(broom) #to make results printable

## Loading Data
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

1. Descriptive Statistics

ipums %>%
  filter(relate==1) %>% #to only have household heads
  mutate(BirthPlace=ifelse(bpl<=120,"US Born","Foreign Born")) %>% #applying the conditions
  group_by(BirthPlace) %>%
  summarise(mean_familysize=mean(famsize), sd=sd(famsize), n())
## # A tibble: 2 x 4
##     BirthPlace mean_familysize       sd  `n()`
##          <chr>           <dbl>    <dbl>  <int>
## 1 Foreign Born        2.934445 1.683525  15895
## 2      US Born        2.290301 1.332221 101412

The mean family size differs between households headed by Foreign Born and US Born persons: 2.93 and 2.29, respectively. The standard deviations also differ, at 1.68 and 1.33. Note that the group sizes are very unequal: 15,895 Foreign Born household heads versus 101,412 US Born household heads.
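To see what the unequal group sizes mean for precision, the summary values in the table above can be turned into standard errors. The following is a minimal sketch that uses only the numbers printed in the table; the object names (`grp`, `se_diff`) are illustrative, and the unpooled standard error of the difference is one common choice, not necessarily the one used later in this analysis.

```r
# Summary values copied from the table above
grp <- data.frame(
  BirthPlace = c("Foreign Born", "US Born"),
  mean       = c(2.934445, 2.290301),
  sd         = c(1.683525, 1.332221),
  n          = c(15895, 101412)
)

# Standard error of each group mean: sd / sqrt(n)
grp$se <- grp$sd / sqrt(grp$n)

# Difference in means (Foreign Born minus US Born) and its
# unpooled standard error (assumption: the two groups are independent)
diff_means <- grp$mean[1] - grp$mean[2]
se_diff    <- sqrt(sum(grp$se^2))

print(grp)
c(difference = diff_means, se = se_diff)
```

The difference of about 0.64 persons is large relative to either standard error, which is what the formal test in the next section examines.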

Graphical Representation: Box Plot

ipums %>%
  filter(relate==1) %>%
  mutate(BirthPlace=case_when(bpl %in% 121:998~"Foreign Born",
                              bpl %in% 1:120~"US Born")) %>%
  ggplot()+
  geom_boxplot(aes(x=BirthPlace, y=famsize, fill=BirthPlace))+
  ggtitle(label="Family Size by BirthPlace of Household Heads") +
  coord_flip()+
  xlab("Birthplace of Household Heads")+
  ylab("Family Size")
## Don't know how to automatically pick scale for object of type labelled. Defaulting to continuous.

The box plots support the descriptive statistics above: the distribution of family size differs between the two groups.

2. Testing for Equality of the Family Size in Foreign Born and US Born Households

Using Linear Model

newpums<-ipums %>%
  filter(relate==1) %>%
  mutate(birthplace=ifelse(bpl<=120,"US Born","Foreign Born"))

fam_mean<-lm(famsize~birthplace,data=newpums) #Linear Model

tidy(fam_mean)
##                term   estimate  std.error statistic p.value
## 1       (Intercept)  2.9344448 0.01098588 267.11061       0
## 2 birthplaceUS Born -0.6441438 0.01181550 -54.51685       0

The estimates confirm that the group means differ. The intercept (2.93) is the mean family size for Foreign Born household heads, the reference group. Adding the birthplaceUS Born coefficient (-0.64) to the intercept gives the mean for US Born household heads: 2.93 - 0.64 = 2.29. The p-value is printed as 0 only because it is too small to display, not because it is exactly zero. With a p-value this small, we reject the null hypothesis that the two groups have the same mean family size.
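The relationship between the regression coefficients and the group means can be checked directly. The sketch below uses simulated counts as a hypothetical stand-in for `newpums` (the data frame `sim` and its values are invented for illustration, not IPUMS data). It shows that the intercept and slope reproduce the two group means, and that the t statistic for the slope matches a pooled-variance two-sample t-test up to sign.

```r
set.seed(1)

# Hypothetical stand-in for newpums: simulated family sizes, not IPUMS data
sim <- data.frame(
  birthplace = rep(c("Foreign Born", "US Born"), times = c(200, 800))
)
sim$famsize <- 1 + rpois(nrow(sim),
                         lambda = ifelse(sim$birthplace == "US Born", 1.3, 1.9))

fit <- lm(famsize ~ birthplace, data = sim)

# Group means computed directly
group_means <- tapply(sim$famsize, sim$birthplace, mean)

# Intercept = mean of the reference level ("Foreign Born");
# intercept + slope = mean of "US Born"
coef_means <- c(coef(fit)[1], coef(fit)[1] + coef(fit)[2])
all.equal(as.numeric(coef_means), as.numeric(group_means))

# The slope's t statistic equals the pooled-variance t-test statistic up to sign
tt   <- t.test(famsize ~ birthplace, data = sim, var.equal = TRUE)
lm_t <- summary(fit)$coefficients["birthplaceUS Born", "t value"]
all.equal(abs(as.numeric(tt$statistic)), abs(lm_t))
```

Both comparisons should return `TRUE`, which is why a linear model with a single two-level predictor can stand in for an equal-variance t-test.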

3. Examining the Model for Normality of Errors

Q-Q Plot for Model Residuals

qqnorm(rstudent(fam_mean), main="Q-Q Plot for Model Residuals")

The Q-Q plot shows that the residuals are not normally distributed: the points depart clearly from a straight line.
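Departures from a straight line are easier to judge against a reference line. The sketch below adds one with `qqline()` and also computes the correlation between theoretical and sample quantiles as a rough numeric summary. It runs on simulated counts (the data frame `sim` is hypothetical, not the IPUMS extract), so the plot it draws is illustrative only.

```r
set.seed(2)

# Hypothetical stand-in data: simulated counts, not the IPUMS extract
sim <- data.frame(
  birthplace = rep(c("Foreign Born", "US Born"), times = c(200, 800))
)
sim$famsize <- 1 + rpois(nrow(sim), lambda = 1.5)

fit <- lm(famsize ~ birthplace, data = sim)
res <- rstudent(fit)

# qqnorm() returns the plotted coordinates invisibly
qq <- qqnorm(res, main = "Q-Q Plot for Model Residuals")
qqline(res, col = "red")  # reference line through the first and third quartiles

# Correlation of theoretical and sample quantiles; the further below 1,
# the stronger the departure from a straight line
r <- cor(qq$x, qq$y)
r
```

How far below 1 the correlation must fall before non-normality matters is a judgment call; the plot remains the primary diagnostic.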

Transformation to Address Non-Normality

Log, Square-Root & Inverse Transformation

fam_mean1<-lm(log(famsize)~birthplace, data=newpums) #Log Transformation

fam_mean2<-lm(sqrt(famsize)~birthplace, data=newpums) #Square-Root Transformation

fam_mean3<-lm(I(1/famsize)~birthplace, data=newpums) #Inverse Transformation


qqnorm(rstudent(fam_mean1), main="Q-Q Plot for Model Residuals: Log Transformation")

qqnorm(rstudent(fam_mean2), main="Q-Q Plot for Model Residuals: Square-Root Transformation")

qqnorm(rstudent(fam_mean3), main="Q-Q Plot for Model Residuals: Inverse Transformation")

None of the three transformations noticeably straightens the Q-Q plot, so they do not resolve the non-normality of the residuals.
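One possible reason the transformations have so little effect: family size is an integer count, so it takes only a handful of distinct values, and a one-to-one transformation such as log, square root, or inverse cannot change how many distinct values there are. In a model with a single two-level predictor, each residual is just the (transformed) response minus its group mean, so the residuals stay equally discrete. The sketch below illustrates this on simulated counts (hypothetical values, not IPUMS data).

```r
set.seed(3)

# Simulated integer counts standing in for famsize (not IPUMS data)
famsize <- 1 + rpois(1000, lambda = 1.5)

# Number of distinct values before and after each transformation
counts <- sapply(
  list(raw     = famsize,
       log     = log(famsize),
       sqrt    = sqrt(famsize),
       inverse = 1 / famsize),
  function(v) length(unique(v))
)
counts
```

All four counts are identical, so the step pattern that a discrete outcome produces in a Q-Q plot remains after transformation.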