Linear regression

library(haven)
acs<-read_dta("https://github.com/coreysparks/data/blob/master/usa_00045.dta?raw=true")
Keep foreign-born household heads with at least one year in the United States.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.4.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
acslm<-acs%>%
  filter(bpl>120, yrsusa1>0, bpl<950, relate==1)%>%
  mutate(age=as.numeric(age))
Linear regression model by ordinary least squares
lm1<-lm(yrsusa1~age, data = acslm)
coef(lm1)
## (Intercept)         age 
##  -7.4214140   0.6851689
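To make the fitted line concrete, the coefficients can be plugged in by hand. The sketch below hard-codes the two estimates printed by coef(lm1) so that it runs without the data. The names b0, b1, predict_years and the example age of 40 are illustrative and not part of the original analysis.

```r
# Estimates copied from the coef(lm1) output above
b0 <- -7.4214140   # intercept
b1 <-  0.6851689   # slope for age

# Hypothetical helper: fitted years in the US for a given age
predict_years <- function(age) b0 + b1 * age

# Fitted value for an illustrative 40-year-old household head
yhat_40 <- predict_years(40)

# The slope expressed in months per additional year of age
months_per_year <- b1 * 12

print(yhat_40)
print(months_per_year)
```

The fitted value comes out at roughly 20 years, and the slope at about 8.2 months per year of age. With the model object in memory, predict(lm1, newdata = data.frame(age = 40)) should return the same fitted value.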
Plot
library(ggplot2) 
ggplot(acslm, aes(x=age, y=yrsusa1))+
  geom_point()+
  geom_smooth(method = "lm", se=FALSE)+
  ggtitle("Foreign-born household heads", "Data from IPUMS")+
  xlab("Age")+ylab("Years in the US")

summary(lm1)
## 
## Call:
## lm(formula = yrsusa1 ~ age, data = acslm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -48.503  -7.671  -0.300   7.682  37.441 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.421414   0.326496  -22.73   <2e-16 ***
## age          0.685169   0.006164  111.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.17 on 15724 degrees of freedom
## Multiple R-squared:   0.44,  Adjusted R-squared:   0.44 
## F-statistic: 1.236e+04 on 1 and 15724 DF,  p-value: < 2.2e-16
Confidence intervals (confint() uses t quantiles; with 15,724 degrees of freedom they are practically identical to the normal approximation)
confint(lm1)
##                  2.5 %     97.5 %
## (Intercept) -8.0613829 -6.7814450
## age          0.6730871  0.6972507
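The interval for age can be reproduced by hand from the estimate and standard error in the summary table. This sketch hard-codes those printed (rounded) values and computes both a t-based interval and a normal-approximation one. With this many degrees of freedom the two should agree to several decimal places. Variable names are illustrative.

```r
# Values copied from the summary(lm1) table above
est <- 0.685169    # estimate for age
se  <- 0.006164    # its standard error
df  <- 15724       # residual degrees of freedom

# t-based interval, which is what confint() computes for lm objects
t_crit <- qt(0.975, df)
ci_t <- est + c(-1, 1) * t_crit * se

# Normal-approximation interval for comparison
z_crit <- qnorm(0.975)
ci_z <- est + c(-1, 1) * z_crit * se

print(ci_t)
print(ci_z)
```

Because the inputs are rounded, the result matches the confint(lm1) row for age only up to the last printed digits.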
Evaluating the model assumptions
Residuals vs. fitted values (the point cloud looks like a fish)
plot(lm1, which=1)

library(lmtest)
## Warning: package 'lmtest' was built under R version 3.4.2
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 3.4.2
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(lm1)
## 
##  studentized Breusch-Pagan test
## 
## data:  lm1
## BP = 1382.9, df = 1, p-value < 2.2e-16
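To see what the studentized Breusch-Pagan statistic measures, it can be computed by hand as n times the R-squared of an auxiliary regression of the squared residuals on the regressor. The sketch below does this on simulated data, not the ACS extract. The data-generating values and all variable names are assumptions chosen only to produce non-constant variance.

```r
set.seed(42)

# Simulated (not the ACS) data: the error spread grows with x
n <- 2000
x <- runif(n, 20, 80)
y <- -7 + 0.7 * x + rnorm(n, sd = 0.2 * x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the regressor
aux <- lm(resid(fit)^2 ~ x)

# Studentized (Koenker) statistic: n times the auxiliary R-squared
bp_stat <- n * summary(aux)$r.squared

# Under constant variance, approximately chi-squared with 1 df
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(BP = bp_stat, p.value = bp_p))
```

If lmtest is installed, bptest(fit) with its default settings should report the same statistic for this simulated fit.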
Normality of residuals
plot(lm1, which=2)

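# Note: y=pnorm with no further arguments compares against a standard normal N(0, 1); the residuals are not standardized
ks.test(resid(lm1), y=pnorm)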
## Warning in ks.test(resid(lm1), y = pnorm): ties should not be present for
## the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  resid(lm1)
## D = 0.41954, p-value < 2.2e-16
## alternative hypothesis: two-sided
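One caveat when reading the D statistic: ks.test() with pnorm and no extra arguments compares against a normal distribution with mean 0 and standard deviation 1, while the raw residuals here have a standard error of about 12. The sketch below uses simulated data with exactly normal errors (not the ACS extract; all names and parameter values are assumptions) to show how much of D can come from the scale mismatch alone, and how standardized residuals from rstandard() change the picture.

```r
set.seed(42)

# Simulated (not the ACS) data with normal errors, sd = 12
n <- 2000
x <- runif(n, 20, 80)
y <- -7 + 0.7 * x + rnorm(n, sd = 12)
fit <- lm(y ~ x)

# Raw residuals against N(0, 1): the scale mismatch dominates the statistic
ks_raw <- ks.test(resid(fit), "pnorm")

# Standardized residuals against N(0, 1): comparable scales
ks_std <- ks.test(rstandard(fit), "pnorm")

print(c(raw = unname(ks_raw$statistic),
        standardized = unname(ks_std$statistic)))
```

The same idea would apply to the model above as ks.test(rstandard(lm1), "pnorm"). The p-value is then only approximate, because the scale is estimated from the same data.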
This is about as simple as a linear model gets. It tries to “explain” the years a foreign-born household head has been in the United States with a single regressor, age. The model gives a coefficient for age of 0.6851689, which can be interpreted as follows: a household head who is one year older is expected to have spent about 0.69 more years, a little over 8 months, in the United States. From the summary table we know that the coefficient is statistically significant at conventional levels (p < 2e-16) and that this single variable “explains” 44% of the variance of the outcome (adjusted R-squared of 0.44). The variance of the residuals is not constant, however: the Breusch-Pagan test rejects homoskedasticity (BP = 1382.9, p < 2.2e-16), and the residual plot looks like a weird fish.