Linear regression
library(haven)
acs<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
Foreign born household heads with at least one year in the United States.
library(dplyr)
acslm<-acs%>%
filter(bpl>120, yrsusa1>0, bpl<950, relate==1)%>%
mutate(age=as.numeric(age))
Linear regression model by ordinary least squares
lm1<-lm(yrsusa1~age, data = acslm)
coef(lm1)
## (Intercept) age
## -7.4214140 0.6851689
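With a single regressor, the two numbers above have a closed form: the slope is the sample covariance of the regressor and the outcome divided by the variance of the regressor, and the intercept makes the line pass through the two means. A minimal sketch on simulated data (the data frame `sim`, its columns, and the values used to generate it are hypothetical stand-ins, not the ACS extract):

```r
# Simulated stand-in data; values are made up for illustration only.
set.seed(1)
sim <- data.frame(age = runif(500, 20, 80))
sim$yrs <- -7 + 0.7 * sim$age + rnorm(500, sd = 12)

# Closed-form OLS estimates for one regressor.
slope     <- cov(sim$age, sim$yrs) / var(sim$age)
intercept <- mean(sim$yrs) - slope * mean(sim$age)
by_hand   <- c(intercept, slope)

# Same estimates from lm(); unname() drops the coefficient labels.
from_lm <- unname(coef(lm(yrs ~ age, data = sim)))
print(rbind(by_hand, from_lm))
```

The two rows should agree up to floating-point error, which is all `lm()` is doing in the one-regressor case.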
Plot
library(ggplot2)
ggplot(acslm, aes(x=age, y=yrsusa1))+geom_point()+geom_smooth(method = "lm", se=FALSE)+ggtitle("Foreign born household heads", "Data from IPUMS")+xlab("Age")+ylab("Years in the US")

summary(lm1)
##
## Call:
## lm(formula = yrsusa1 ~ age, data = acslm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.503 -7.671 -0.300 7.682 37.441
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.421414 0.326496 -22.73 <2e-16 ***
## age 0.685169 0.006164 111.16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.17 on 15724 degrees of freedom
## Multiple R-squared: 0.44, Adjusted R-squared: 0.44
## F-statistic: 1.236e+04 on 1 and 15724 DF, p-value: < 2.2e-16
Confidence intervals (t-based; practically the normal approximation at this sample size)
confint(lm1)
## 2.5 % 97.5 %
## (Intercept) -8.0613829 -6.7814450
## age 0.6730871 0.6972507
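The interval for age can be rebuilt by hand from the summary table: estimate plus or minus a quantile times the standard error. For an lm object, confint() takes that quantile from the t distribution with the residual degrees of freedom. A small sketch using only the numbers reported above (the variable names are mine):

```r
# Estimate, standard error and residual df copied from the summary(lm1) output.
est <- 0.685169
se  <- 0.006164
df  <- 15724

# 95% interval: estimate +/- t quantile * standard error.
t_crit <- qt(0.975, df)
ci_t   <- est + c(-1, 1) * t_crit * se

# Same interval with a normal quantile, for comparison.
ci_z <- est + c(-1, 1) * qnorm(0.975) * se

print(round(rbind(ci_t, ci_z), 5))
```

With this many degrees of freedom the t and normal quantiles are nearly identical, so the two intervals coincide to several decimals.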
Evaluating the model assumptions
Plot of residuals against fitted values (looks like a fish)
plot(lm1, which=1)

library(lmtest)
bptest(lm1)
##
## studentized Breusch-Pagan test
##
## data: lm1
## BP = 1382.9, df = 1, p-value < 2.2e-16
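To see what this test is measuring, here is a rough base R sketch of the studentized (Koenker) version as I understand it: regress the squared residuals on the regressor and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The data are simulated with an error spread that grows with x; all names and values are hypothetical, and this is not the lmtest implementation itself.

```r
# Simulated data whose error spread grows with x (made up for illustration).
set.seed(2)
n <- 1000
x <- runif(n, 20, 80)
y <- -7 + 0.7 * x + rnorm(n, sd = 0.2 * x)

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same regressor.
e2  <- resid(fit)^2
aux <- lm(e2 ~ x)

# Studentized statistic: n times the auxiliary R-squared,
# compared with a chi-squared distribution with one df per regressor.
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)
print(c(BP = bp_stat, p.value = bp_p))
```

A large statistic means the squared residuals vary systematically with the regressor, which is what a fan or fish shape in the residual plot shows.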
Normality of residuals
plot(lm1, which=2)

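# With no mean or sd supplied, pnorm is the standard normal N(0, 1); the raw residuals (residual standard error 12.17) are not on that scale
ks.test(resid(lm1), y=pnorm)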
## Warning in ks.test(resid(lm1), y = pnorm): ties should not be present for
## the Kolmogorov-Smirnov test
##
## One-sample Kolmogorov-Smirnov test
##
## data: resid(lm1)
## D = 0.41954, p-value < 2.2e-16
## alternative hypothesis: two-sided
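One caveat about the call above: without a mean and sd, the test compares the residuals with a standard normal, so a residual spread far from 1 inflates D on its own. A hedged sketch on simulated data with normal errors (all names and values are hypothetical) shows the difference between testing raw and standardized residuals:

```r
# Simulated regression with normal errors of SD 12 (made up for illustration).
set.seed(3)
n <- 2000
x <- runif(n, 20, 80)
y <- -7 + 0.7 * x + rnorm(n, sd = 12)
fit <- lm(y ~ x)

# Raw residuals against N(0, 1): the scale mismatch alone inflates D.
d_raw <- unname(ks.test(resid(fit), "pnorm")$statistic)

# Standardized residuals against N(0, 1): both are on a common scale.
d_std <- unname(ks.test(rstandard(fit), "pnorm")$statistic)

print(c(raw = d_raw, standardized = d_std))
```

Even here the p-value is only approximate, because the residuals are standardized with quantities estimated from the same data; the Q-Q plot remains the more informative check.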
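This is about the simplest linear model there is. It tries to “explain” the number of years a foreign born household head has been in the United States with a single regressor, age. The model gives a coefficient for age of 0.6851689, which can be interpreted as follows: for each additional year of age of the household head, the expected time in the United States increases by about 0.69 years, a little over 8 months. From the summary table, the coefficient is statistically significant at any conventional level (p < 2e-16), and this single variable “explains” 44% of the variance in the outcome (adjusted R-squared of 0.44). The assumptions hold up less well. The variance of the residuals is not constant: the plot of residuals looks like a weird fish, and the Breusch-Pagan test rejects homoskedasticity (BP = 1382.9, p < 2.2e-16). The Kolmogorov-Smirnov test also rejects normality, although it compares unscaled residuals with a standard normal, so the large D statistic partly reflects the residuals' scale and not only their shape.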
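The arithmetic behind that interpretation is short enough to check directly. The sketch below uses the coefficients reported by coef(lm1); the age of 40 is just an illustrative value I picked, not something from the data.

```r
# Coefficients as reported by coef(lm1).
b0 <- -7.4214140
b1 <-  0.6851689

# Slope expressed in months per additional year of age.
months_per_year <- b1 * 12

# Fitted years in the US for a hypothetical 40-year-old household head.
yrs_at_40 <- b0 + b1 * 40

print(c(months_per_year = months_per_year, yrs_at_40 = yrs_at_40))
```

The slope works out to roughly 8.2 months per year of age, and the fitted value at age 40 is close to 20 years in the United States.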