I am interested in looking at the relationship of family sizes with respect to the number of own children in the household. I am hoping to see that, if the number of children goes up, the size of the family increases as well. However, is it the case for the United States? This is tested below:

Loading Packages

library(readr)
library(dplyr) #to manipulate data
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2) #to visualize data
library(broom) #to make results printable
library(tidyr)

Using ACS PUMS

#Using IPUMS Data
library(haven)
ipums<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")

1 & 2) Constructing Outcome and Predictor Variables

newpums<-ipums%>%
  filter(relate==3)%>% #Only including the child's relationship to the household head
 mutate(n_child=as.numeric(nchild))%>% #Predictor Variable: Number of Own Children in the Household
 mutate(fam_size=as.numeric(famsize)) #Outcome Variable: Number of own family members in household

3) Estimating the OLS Regression Model for the Outcome variable

fit<-lm(fam_size~n_child, data=newpums)
coef(fit)
## (Intercept)     n_child 
##    4.239698    0.655249

The regression line is then: 4.239 + 0.655 * number of children, which means, for every additional number of children in the household, the family size increases by 0.655.

We observe a positive realtionship between the Family Size and Number of Children.

Plotting the Linear Function

 ggplot(newpums, aes(x=n_child, y=fam_size))+geom_point()+geom_smooth(method = "lm", se = FALSE)+
 ggtitle("Family Size by the Number of Children")+
 xlab("Number of Children")+
 ylab("Family Size")

It can be seen that:

. The number of children impacts the family size in a linear manner, meaning that the relationship between number of children and the family size is constant for all values of number of children.

. So this model says that we can predict the family size using the number of children and a random error term.

4) Evaluating the assumptions of the OLS model

Getting the summary estimates

summary(fit)
## 
## Call:
## lm(formula = fam_size ~ n_child, data = newpums)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2397 -1.2397 -0.2397  0.7603 15.7603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.239698   0.005343  793.44   <2e-16 ***
## n_child     0.655249   0.014702   44.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.465 on 77332 degrees of freedom
## Multiple R-squared:  0.02504,    Adjusted R-squared:  0.02503 
## F-statistic:  1986 on 1 and 77332 DF,  p-value: < 2.2e-16

We see the estimates of the parameters, their associated standard errors, and the t-statistics for each, along with the calculated p-value of the test.

The t statistics is different than 0, which means there is an association.

The Multiple R-sqaured value shows that, number of children explains 2% variability in the family size, which means that the relationship is really weak.

Confidence Interval

confint(fit)
##                2.5 %   97.5 %
## (Intercept) 4.229225 4.250171
## n_child     0.626434 0.684064

The estimate (0.655249) of the predictor variable falls within the confidence intervals.

Checking non-constant error variance

plot(fit, which=1)

In this plot, the residulas are not constantly varying with respect to the fitted values. The red line begins to indicate a positive trend in the residuals for higher values of the fitted values, suggesting non-constant variation.

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

Breusch-Pagan test

bptest(fit)
## 
##  studentized Breusch-Pagan test
## 
## data:  fit
## BP = 1.8033, df = 1, p-value = 0.1793

The p-value is small enough to support our assertion that the variance was non-constant.

Normality for residuals

plot(fit, which=2)

The Q-Q plot for residuals does not show a promising association.

Kolmogorov-Smirnov test

ks.test(resid(fit), y = pnorm)
## Warning in ks.test(resid(fit), y = pnorm): ties should not be present for
## the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(fit)
## D = 0.23224, p-value < 2.2e-16
## alternative hypothesis: two-sided

The K-S test says that, with a tiny p-value, there is a 23% difference between the two empirical distribution, which means that the cumulative distribution of the residuals does not differ from a normal.

5) Results of the Model