Multiple regression

Labor Training Data

The DAAG package in R includes the nswpsid1 dataset, which was collected to study how earnings changed between 1974-1975 and 1978 in the absence of training. It has the following columns:

trt: Training (0=Control 1=Training)
age: Age (years)
educ: Years of education
black: 0=not black 1=black
hisp: 0=not hispanic 1=hispanic
marr: 0=not married 1=married
nodeg: 0=completed HS 1=dropout
re74: real earnings in 1974
re75: real earnings in 1975
re78: real earnings in 1978

Training vs 1978 Earnings

Let’s see how the training relates to the 1978 real earnings.

library(DAAG)     # nswpsid1 dataset
library(psych)    # describeBy()
library(StMoSim)  # qqnormSim()
boxplot(re78 ~ trt, data = nswpsid1,
        main = "Training vs Earnings", xlab = "Training", ylab = "Earnings",
        names = c("Control", "Trained"))

describeBy(nswpsid1$re78, group = nswpsid1$trt, mat = TRUE, type = 3, digits = 2)
##     item group1 vars    n     mean       sd   median  trimmed      mad min
## X11    1      0    1 2490 21553.92 15555.35 20688.17 20356.86 13145.26   0
## X12    2      1    1  297  5976.35  6923.80  4232.31  4910.59  6274.82   0
##           max     range skew kurtosis     se
## X11 121173.58 121173.58 1.24     3.88 311.73
## X12  60307.93  60307.93 2.63    13.62 401.76

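Looking at the data, we see that far more people went without training (2,490) than received it (297). The untrained group also has much higher mean 1978 real earnings (21,553.92 versus 5,976.35). This is a raw comparison that does not adjust for any other factor.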
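
The group sizes and means quoted from the describeBy() output can be cross-checked with base R. The following is a minimal sketch: the names group_n and group_mean are illustrative, and the Welch t-test at the end is an addition that the original analysis does not use.

```r
library(DAAG)  # provides the nswpsid1 dataset

# Number of observations in each group (0 = control, 1 = training)
group_n <- table(nswpsid1$trt)

# Mean 1978 real earnings in each group
group_mean <- tapply(nswpsid1$re78, nswpsid1$trt, mean, na.rm = TRUE)

print(group_n)
print(round(group_mean, 2))

# Welch two-sample t-test of the unadjusted difference in means
# (an illustrative extra check, not part of the original analysis)
print(t.test(re78 ~ trt, data = nswpsid1))
```

Like the boxplot, the t-test compares the two groups without controlling for age, education, or prior earnings.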

1978 Earnings vs Other Factors

Now let’s see how the other factors relate to 1978 real earnings. First we’ll fit a model and evaluate the p-values to see which factors are statistically significant.

nswpsid1.lm <- lm(re78~trt+age+educ+black+hisp+marr+nodeg+re74+re75, data=nswpsid1)
summary(nswpsid1.lm)
## 
## Call:
## lm(formula = re78 ~ trt + age + educ + black + hisp + marr + 
##     nodeg + re74 + re75, data = nswpsid1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64870  -4302   -435   3786 110412 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -129.74276 1688.51706  -0.077   0.9388    
## trt          751.94643  915.25723   0.822   0.4114    
## age          -83.56559   20.81380  -4.015 6.11e-05 ***
## educ         592.61020  103.30278   5.737 1.07e-08 ***
## black       -570.92797  495.17772  -1.153   0.2490    
## hisp        2163.28118 1092.29036   1.981   0.0478 *  
## marr        1240.51952  586.25391   2.116   0.0344 *  
## nodeg        590.46695  646.78417   0.913   0.3614    
## re74           0.27812    0.02792   9.960  < 2e-16 ***
## re75           0.56809    0.02756  20.613  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10070 on 2665 degrees of freedom
##   (112 observations deleted due to missingness)
## Multiple R-squared:  0.5864, Adjusted R-squared:  0.585 
## F-statistic: 419.8 on 9 and 2665 DF,  p-value: < 2.2e-16

Our results show that, among the predictors, training (trt) and lacking a HS degree (nodeg) have the highest p-values (0.41 and 0.36), meaning they are the least statistically significant. We’ll remove them and fit another model.
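
Rather than reading the p-values off the printed summary, they can be pulled out of the fitted model and sorted. This is a sketch under the assumption that the same full model is refit; the names full_fit, coef_table, p_values, and p_sorted are illustrative.

```r
library(DAAG)  # provides the nswpsid1 dataset

# Refit the full model from the text
full_fit <- lm(re78 ~ trt + age + educ + black + hisp + marr + nodeg + re74 + re75,
               data = nswpsid1)

# summary() stores the coefficient table as a matrix;
# the "Pr(>|t|)" column holds the p-values
coef_table <- coef(summary(full_fit))
p_values <- coef_table[, "Pr(>|t|)"]

# Drop the intercept and sort predictors from largest to smallest p-value
p_sorted <- sort(p_values[names(p_values) != "(Intercept)"], decreasing = TRUE)
print(signif(p_sorted, 3))
```

The predictors at the top of this list are the candidates for removal.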

nswpsid12.lm <- lm(re78~age+educ+black+hisp+marr+re74+re75, data=nswpsid1)
summary(nswpsid12.lm)
## 
## Call:
## lm(formula = re78 ~ age + educ + black + hisp + marr + re74 + 
##     re75, data = nswpsid1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64905  -4266   -460   3769 110560 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1078.64658 1327.94000   0.812   0.4167    
## age          -83.08699   20.62900  -4.028 5.79e-05 ***
## educ         526.48693   75.25393   6.996 3.31e-12 ***
## black       -450.41194  484.48849  -0.930   0.3526    
## hisp        2233.97761 1089.84293   2.050   0.0405 *  
## marr        1030.47695  550.48155   1.872   0.0613 .  
## re74           0.27681    0.02790   9.922  < 2e-16 ***
## re75           0.56671    0.02752  20.594  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10070 on 2667 degrees of freedom
##   (112 observations deleted due to missingness)
## Multiple R-squared:  0.5861, Adjusted R-squared:  0.585 
## F-statistic: 539.5 on 7 and 2667 DF,  p-value: < 2.2e-16

Removing those factors barely changed the R-squared (0.5864 to 0.5861, with the adjusted R-squared staying at 0.585). In both models, 1974 and 1975 real earnings are by far the strongest predictors of 1978 earnings.
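
The comparison of the two models can be made more formal with a partial F-test on the nested fits. The sketch below is an addition to the original analysis. It first restricts the data to complete rows with na.omit() so that both models are guaranteed to use the same observations; the names complete_rows, full_fit, reduced_fit, and adj_r2 are illustrative.

```r
library(DAAG)  # provides the nswpsid1 dataset

# Keep only rows with no missing values so both fits share the same data
complete_rows <- na.omit(nswpsid1)

full_fit <- lm(re78 ~ trt + age + educ + black + hisp + marr + nodeg + re74 + re75,
               data = complete_rows)
reduced_fit <- lm(re78 ~ age + educ + black + hisp + marr + re74 + re75,
                  data = complete_rows)

# Adjusted R-squared of each model, side by side
adj_r2 <- c(full = summary(full_fit)$adj.r.squared,
            reduced = summary(reduced_fit)$adj.r.squared)
print(round(adj_r2, 4))

# Partial F-test: do trt and nodeg jointly improve on the reduced model?
print(anova(reduced_fit, full_fit))
```

A large p-value in the anova() table would support dropping trt and nodeg together, which is a stronger justification than inspecting their individual p-values one at a time.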

Residuals

qqnormSim(nswpsid12.lm$residuals)

plot(abs(nswpsid12.lm$residuals) ~ nswpsid12.lm$fitted.values)

Looking at the residual plots, we can see that the residuals are not normally distributed: there are many outliers, with residuals ranging from about -64,900 to 110,600.
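
The visual impression can be backed up with numbers. The sketch below is an addition to the original analysis: it counts standardized residuals beyond three standard deviations and runs a Shapiro-Wilk normality test. The 3 SD cutoff is a common rule of thumb rather than anything the original specifies, and the names reduced_fit, std_resid, n_large, and sw are illustrative.

```r
library(DAAG)  # provides the nswpsid1 dataset

# Refit the reduced model from the text
reduced_fit <- lm(re78 ~ age + educ + black + hisp + marr + re74 + re75,
                  data = nswpsid1)

# Standardized residuals; values far beyond +/-3 are unusual under normality
std_resid <- rstandard(reduced_fit)
n_large <- sum(abs(std_resid) > 3, na.rm = TRUE)
cat("Residuals beyond 3 SD:", n_large, "of", length(std_resid), "\n")

# Shapiro-Wilk test of normality (accepts between 3 and 5000 values)
sw <- shapiro.test(residuals(reduced_fit))
print(sw)
```

A small Shapiro-Wilk p-value indicates a departure from normality. With a sample this large the test is sensitive to even mild deviations, so it is best read alongside the Q-Q plot rather than in place of it.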