'data.frame': 29 obs. of 5 variables:
$ YEAR : int 1 2 3 4 5 6 7 8 9 10 ...
$ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
$ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
$ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
$ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...
Make scatterplots
Enrollment against the other variables
plot(enroll$UNEM, enroll$ROLL, xlab = "Unemployment Rate (%)", ylab = "Fall Enrollment", main = "Enrollment vs Unemployment")
plot(enroll$HGRAD, enroll$ROLL, xlab = "Spring High School Graduates", ylab = "Fall Enrollment", main = "Enrollment vs Spring High School Graduates")
plot(enroll$INC, enroll$ROLL, xlab = "Per Capita Income", ylab = "Fall Enrollment", main = "Enrollment vs Income")
The scatterplots show a positive relationship between enrollment and each of the three predictors, and the relationship looks progressively more linear moving from unemployment to spring graduates to income.
Build a linear model
Use the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e., ROLL ~ UNEM + HGRAD
fit1 = lm(ROLL ~ UNEM + HGRAD, data = enroll)
Use the summary() and anova() functions to investigate the model
summary(fit1)
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enroll)
Residuals:
    Min      1Q  Median      3Q     Max
-2102.2  -861.6  -349.4   374.5  3603.5
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 **
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373
F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11
anova(fit1)
Analysis of Variance Table
Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)
UNEM       1  45407767  45407767  26.349 2.366e-05 ***
HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which variable is the most closely related to enrollment?
The ANOVA table shows HGRAD has a much larger F statistic (119.701 vs. 26.349) and a smaller p-value than UNEM, indicating that enrollment is more closely related to the number of graduates than to unemployment. The residual sum of squares (44,805,568) shows there is still variation that isn’t explained by either variable.
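The F values can be reproduced by hand from the table, which makes it clearer what is being compared. For a single lm fit, anova() reports sequential sums of squares, so each term's mean square is divided by the residual mean square. The same table also gives the share of variation left unexplained; the total sum of squares below is simply the sum of the three Sum Sq entries.

```latex
\begin{align*}
F_{\text{UNEM}}  &= \frac{45407767}{1723291} \approx 26.35, \\
F_{\text{HGRAD}} &= \frac{206279143}{1723291} \approx 119.70, \\
R^2 &= 1 - \frac{SS_{\text{resid}}}{SS_{\text{total}}}
     = 1 - \frac{44805568}{45407767 + 206279143 + 44805568}
     = 1 - \frac{44805568}{296492478} \approx 0.8489.
\end{align*}
```

Because the sums of squares are sequential, the HGRAD row measures what graduates add once UNEM is already in the model. About 15% of the variation in enrollment remains in the residuals, which matches the Multiple R-squared of 0.8489 reported by summary(fit1).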
Make a residual plot
Check for any bias in the model (the residuals vs. fitted plot is which = 1)
plot(fit1, which = 1)
The plot shows clustering around zero and curvature in the smoothing line. There are also points with strong influence (see the Cook’s Distance plot below). Overall, I don’t think this model shows the full picture of the variables that drive enrollment.
Cook’s Distance plot
plot(fit1, which = 4)
Use the predict() function
Estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students. Note: The column names in the new data frame must match the predictor names used in the model.
est = data.frame(UNEM = 9, HGRAD = 25000)
predict(fit1, est)
Expected enrollment is 21,586 students when unemployment is 9% and the graduating high school class is 25,000.
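As a check, the same figure can be worked out by hand from the coefficients in summary(fit1). The calculation below uses the rounded estimates printed there, so it may differ from predict() in the last digit or so.

```latex
\begin{align*}
\widehat{\text{ROLL}}
  &= -8256 + 698.3 \times \text{UNEM} + 0.9423 \times \text{HGRAD} \\
  &= -8256 + 698.3 \times 9 + 0.9423 \times 25000 \\
  &= -8256 + 6284.7 + 23557.5 \\
  &\approx 21586
\end{align*}
```

Most of the prediction comes from the HGRAD term: each additional graduate is associated with roughly 0.94 more enrolled students, while each percentage point of unemployment is associated with roughly 698 more.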
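Add per capita income (INC) as a third predictor, i.e., ROLL ~ UNEM + HGRAD + INC
fit2 = lm(ROLL ~ UNEM + HGRAD + INC, data = enroll)
summary(fit2)
Call: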
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enroll)
Residuals:
     Min       1Q   Median       3Q      Max
-1148.84  -489.71    -1.88   387.40  1425.75
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
anova(fit1, fit2)
Analysis of Variance Table
Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
  Res.Df      RSS Df Sum of Sq     F    Pr(>F)
1     26 44805568
2     25 11237313  1  33568255 74.68 5.594e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Does including this variable improve the model?
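Yes. The adjusted R-squared rises from about 0.84 for model 1 to about 0.96 for model 2, and the residual standard error falls from 1313 to 670.4. The anova(fit1, fit2) comparison tests the added term directly: F = 74.68 on 1 and 25 degrees of freedom with a p-value of 5.594e-09, so income explains a significant share of the variation that model 1 leaves in its residuals. The overall F-test p-value also drops, from 2.144e-11 for model 1 to less than 2.2e-16 for model 2. This is consistent with the linear relationships visible in the three scatterplots above.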
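The F statistic in the model comparison can be reconstructed from the two residual sums of squares, and the adjusted R-squared values from the reported R-squared values. The worked figures below use only numbers printed in the output above, with n = 29 observations and p predictors.

```latex
\begin{align*}
F &= \frac{(RSS_1 - RSS_2)/(df_1 - df_2)}{RSS_2/df_2}
   = \frac{(44805568 - 11237313)/(26 - 25)}{11237313/25}
   = \frac{33568255}{449492.52} \approx 74.68, \\[4pt]
R^2_{\text{adj},1} &= 1 - (1 - 0.8489)\,\frac{n - 1}{n - p - 1}
   = 1 - 0.1511 \times \frac{28}{26} \approx 0.8373, \\
R^2_{\text{adj},2} &= 1 - (1 - 0.9621) \times \frac{28}{25} \approx 0.9576.
\end{align*}
```

Since only one term is added, this F is the square of the t value for INC in summary(fit2): 8.642 squared is about 74.68, and the two p-values agree (5.59e-09). Adjusted R-squared is the fairer comparison here because it penalizes model 2 for its extra predictor.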