The DAAG library in R includes the nswpsid1 dataset, which looks at the way earnings changed between 1974-1975 and 1978 in the absence of training. It has the following columns:
trt: Training (0=Control 1=Training)
age: Age (years)
educ: Years of education
black: 0=not black 1=black
hisp: 0=not hispanic 1=hispanic
marr: 0=not married 1=married
nodeg: 0=completed HS 1=dropout
re74: real earnings in 1974
re75: real earnings in 1975
re78: real earnings in 1978
Let’s see how the training relates to the 1978 real earnings.
library(DAAG)
library(psych)
library(StMoSim)
boxplot(re78~trt, data=nswpsid1,
main="Training vs Earnings", xlab = "Training", ylab = "Earnings", names = c("Control","Trained"))
describeBy(nswpsid1$re78, group=nswpsid1$trt,mat=T,type=3,digits=2)
##     item group1 vars    n     mean       sd   median  trimmed      mad min
## X11    1      0    1 2490 21553.92 15555.35 20688.17 20356.86 13145.26   0
## X12    2      1    1  297  5976.35  6923.80  4232.31  4910.59  6274.82   0
##           max     range skew kurtosis     se
## X11 121173.58 121173.58 1.24     3.88 311.73
## X12  60307.93  60307.93 2.63    13.62 401.76
Looking at the data, we see that far more people went without training (2490) than received it (297). The no-training group also has higher average 1978 real earnings (about 21,554 versus 5,976).
Now let’s see how the other factors relate to the 1978 real earnings. First we’ll build a model and evaluate the p-values to see which factors are relevant.
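The group comparison above can also be reproduced with base R alone. The sketch below assumes only that the DAAG package is installed; it tabulates the group sizes, computes mean 1978 earnings per group, and runs a Welch t-test on the raw difference. The names group_n, group_mean, raw_diff, and welch are illustrative and not part of the original analysis. The comparison is unadjusted, so on its own it does not show that training lowers earnings.

```r
library(DAAG)  # provides the nswpsid1 data frame

# Group sizes: 0 = control, 1 = trained
group_n <- table(nswpsid1$trt)

# Mean 1978 real earnings within each group
group_mean <- tapply(nswpsid1$re78, nswpsid1$trt, mean)

# Raw (unadjusted) difference in means, trained minus control
raw_diff <- unname(group_mean["1"] - group_mean["0"])

# Welch two-sample t-test on the unadjusted difference
welch <- t.test(re78 ~ trt, data = nswpsid1)

print(group_n)
print(round(group_mean, 2))
print(raw_diff)
print(welch)
```

A regression that includes the other columns, as in the next step, is one way to check whether this gap survives once prior earnings and demographics are taken into account.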
nswpsid1.lm <- lm(re78~trt+age+educ+black+hisp+marr+nodeg+re74+re75, data=nswpsid1)
summary(nswpsid1.lm)
##
## Call:
## lm(formula = re78 ~ trt + age + educ + black + hisp + marr +
## nodeg + re74 + re75, data = nswpsid1)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -64870  -4302   -435   3786 110412
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -129.74276 1688.51706  -0.077   0.9388
## trt          751.94643  915.25723   0.822   0.4114
## age          -83.56559   20.81380  -4.015 6.11e-05 ***
## educ         592.61020  103.30278   5.737 1.07e-08 ***
## black       -570.92797  495.17772  -1.153   0.2490
## hisp        2163.28118 1092.29036   1.981   0.0478 *
## marr        1240.51952  586.25391   2.116   0.0344 *
## nodeg        590.46695  646.78417   0.913   0.3614
## re74           0.27812    0.02792   9.960  < 2e-16 ***
## re75           0.56809    0.02756  20.613  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10070 on 2665 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.5864, Adjusted R-squared: 0.585
## F-statistic: 419.8 on 9 and 2665 DF, p-value: < 2.2e-16
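Rather than reading the p-values off the printed summary, they can be pulled out of the fitted model and sorted. The sketch below assumes the DAAG package is installed and refits the same model under the illustrative name fit; coefs, pvals, and pvals_sorted are likewise names chosen for this example. It also counts missing values per column, which is one way to see where the 112 dropped observations come from.

```r
library(DAAG)  # provides the nswpsid1 data frame

# Same model as above, refit here so the block runs on its own
fit <- lm(re78 ~ trt + age + educ + black + hisp + marr + nodeg + re74 + re75,
          data = nswpsid1)

# Coefficient matrix with columns Estimate, Std. Error, t value, Pr(>|t|)
coefs <- coef(summary(fit))

# Drop the intercept and order the predictors from largest to smallest p-value
pvals <- coefs[-1, "Pr(>|t|)"]
pvals_sorted <- sort(pvals, decreasing = TRUE)
print(signif(pvals_sorted, 3))

# Missing values per column; rows with an NA in any model variable are dropped by lm()
print(colSums(is.na(nswpsid1)))

# Number of rows actually used in the fit
print(nobs(fit))
```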
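Among the predictors, training (trt) and the no-degree indicator (nodeg) have the highest p-values (0.41 and 0.36), so neither shows a statistically significant association with 1978 earnings once the other variables are in the model. We’ll remove them and fit another model.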
nswpsid12.lm <- lm(re78~age+educ+black+hisp+marr+re74+re75, data=nswpsid1)
summary(nswpsid12.lm)
##
## Call:
## lm(formula = re78 ~ age + educ + black + hisp + marr + re74 +
## re75, data = nswpsid1)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -64905  -4266   -460   3769 110560
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1078.64658 1327.94000   0.812   0.4167
## age          -83.08699   20.62900  -4.028 5.79e-05 ***
## educ         526.48693   75.25393   6.996 3.31e-12 ***
## black       -450.41194  484.48849  -0.930   0.3526
## hisp        2233.97761 1089.84293   2.050   0.0405 *
## marr        1030.47695  550.48155   1.872   0.0613 .
## re74           0.27681    0.02790   9.922  < 2e-16 ***
## re75           0.56671    0.02752  20.594  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10070 on 2667 degrees of freedom
## (112 observations deleted due to missingness)
## Multiple R-squared: 0.5861, Adjusted R-squared: 0.585
## F-statistic: 539.5 on 7 and 2667 DF, p-value: < 2.2e-16
Removing those two variables barely changed the fit: multiple R-squared moved from 0.5864 to 0.5861 and adjusted R-squared stayed at 0.585. In both models, real earnings in 1974 and 1975 are by far the strongest predictors of 1978 earnings.
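Comparing R-squared values by eye can be backed up with a formal test of the two nested models. The sketch below assumes the DAAG package is installed and refits both models under the illustrative names full and reduced; it then runs a partial F-test with anova() and prints adjusted R-squared and AIC for each. This is one common way to check whether dropped variables mattered, not necessarily the approach the original analysis had in mind.

```r
library(DAAG)  # provides the nswpsid1 data frame

# Full model with all nine predictors
full <- lm(re78 ~ trt + age + educ + black + hisp + marr + nodeg + re74 + re75,
           data = nswpsid1)

# Reduced model without trt and nodeg
reduced <- lm(re78 ~ age + educ + black + hisp + marr + re74 + re75,
              data = nswpsid1)

# The comparison is only valid if both fits used the same rows
stopifnot(nobs(full) == nobs(reduced))

# Partial F-test: do trt and nodeg jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared for each model
adj_r2 <- c(full = summary(full)$adj.r.squared,
            reduced = summary(reduced)$adj.r.squared)
print(round(adj_r2, 4))

# AIC for each model (lower is better)
print(AIC(reduced, full))
```

A large p-value in the anova() table would be consistent with the two dropped variables adding little once the other predictors are included.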
qqnormSim(nswpsid12.lm$residuals)
plot(abs(nswpsid12.lm$residuals) ~ nswpsid12.lm$fitted.values)
Looking at the residual plots, we can see that the residuals are not normally distributed, and there are plenty of outliers.
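The visual impression of outliers can be given a rough number using standardized residuals. The sketch below assumes the DAAG package is installed and refits the reduced model under the illustrative name fit; std_res, n_extreme, and the cutoff of 3 are choices made for this example (a common rule of thumb, not a formal test). It also draws a base R normal Q-Q plot and a spread-versus-fitted plot as alternatives to the plots above.

```r
library(DAAG)  # provides the nswpsid1 data frame

# Reduced model, refit here so the block runs on its own
fit <- lm(re78 ~ age + educ + black + hisp + marr + re74 + re75,
          data = nswpsid1)

# Standardized residuals put every observation on a comparable scale
std_res <- rstandard(fit)

# Count observations more than 3 standard deviations from the fitted value
n_extreme <- sum(abs(std_res) > 3)
print(n_extreme)

# Observed share of such points versus the share expected under normality
print(mean(abs(std_res) > 3))
print(2 * pnorm(-3))

# Normal Q-Q plot with a reference line
qqnorm(std_res, main = "Normal Q-Q plot of standardized residuals")
qqline(std_res)

# Spread of residuals against fitted values, to look for non-constant variance
plot(fitted(fit), sqrt(abs(std_res)),
     xlab = "Fitted values", ylab = "sqrt(|standardized residual|)")
```

If the observed share of extreme residuals is well above the normal-theory share, that supports the reading of the Q-Q plot that the tails are heavier than a normal distribution would give.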