enrollment.forecast = read.csv("enrollmentForecast.csv")
library(ggplot2)Module 10 Assignment
Read in and prepare data
View data structure
str(enrollment.forecast)'data.frame': 29 obs. of 5 variables:
$ YEAR : int 1 2 3 4 5 6 7 8 9 10 ...
$ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
$ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
$ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
$ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...
summary(enrollment.forecast) YEAR ROLL UNEM HGRAD INC
Min. : 1 Min. : 5501 Min. : 5.700 Min. : 9552 Min. :1923
1st Qu.: 8 1st Qu.:10167 1st Qu.: 7.000 1st Qu.:15723 1st Qu.:2351
Median :15 Median :14395 Median : 7.500 Median :17203 Median :2863
Mean :15 Mean :12707 Mean : 7.717 Mean :16528 Mean :2729
3rd Qu.:22 3rd Qu.:14969 3rd Qu.: 8.200 3rd Qu.:18266 3rd Qu.:3127
Max. :29 Max. :16081 Max. :10.100 Max. :19800 Max. :3345
Make scatterplots of ROLL against other variables
ggplot(enrollment.forecast, aes(x=ROLL, y=UNEM, col=YEAR)) + geom_point()ggplot(enrollment.forecast, aes(x=ROLL, y=HGRAD, col=YEAR)) + geom_point()ggplot(enrollment.forecast, aes(x=ROLL, y=INC, col=YEAR)) + geom_point()Build linear model using unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRAD
lm(ROLL ~ UNEM + HGRAD, data = enrollment.forecast)
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enrollment.forecast)
Coefficients:
(Intercept) UNEM HGRAD
-8255.7511 698.2681 0.9423
enrollment.forecast$UNEM.cen = enrollment.forecast$UNEM - mean(enrollment.forecast$UNEM)
enrollment.forecast$HGRAD.cen = enrollment.forecast$HGRAD - mean(enrollment.forecast$HGRAD)
fit1 = lm(ROLL ~ UNEM.cen + HGRAD.cen, data = enrollment.forecast)Make residual plot and check for any bias in the model
hist(residuals(fit1))plot(fit1, which = 1)plot(fit1, which = 4)Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students
fit2 = lm(ROLL ~ UNEM.cen, data = enrollment.forecast)
summary(fit2)
Call:
lm(formula = ROLL ~ UNEM.cen, data = enrollment.forecast)
Residuals:
Min 1Q Median 3Q Max
-7640.0 -1046.5 602.8 1934.3 4187.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12707.0 566.3 22.44 <2e-16 ***
UNEM.cen 1133.8 513.1 2.21 0.0358 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared: 0.1531, Adjusted R-squared: 0.1218
F-statistic: 4.883 on 1 and 27 DF, p-value: 0.03579
expect.unemploy = 9 - mean(enrollment.forecast$UNEM)
12707 + (1133.8 * expect.unemploy)[1] 14161.39
predict(fit2, UNEM.cen = expect.unemploy) 1 2 3 4 5 6 7 8
13141.02 11893.81 12233.96 12460.72 11893.81 11213.51 11326.90 11213.51
9 10 11 12 13 14 15 16
11100.13 12687.49 13254.40 12460.72 12347.34 13254.40 15408.67 14388.22
17 18 19 20 21 22 23 24
12687.49 10419.83 11326.90 12460.72 12233.96 14388.22 15408.67 12460.72
25 26 27 28 29
13934.69 14274.84 13934.69 12800.87 11893.81
new.unemploy = data.frame(UNEM.cen = expect.unemploy)
predict(fit2, new.unemploy, interval = "prediction") fit lwr upr
1 14161.46 7655.725 20667.19
fit3 = lm(ROLL ~ HGRAD.cen, data = enrollment.forecast)
summary(fit3)
Call:
lm(formula = ROLL ~ HGRAD.cen, data = enrollment.forecast)
Residuals:
Min 1Q Median 3Q Max
-2156.8 -1175.5 -301.0 649.8 3089.0
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.271e+04 2.802e+02 45.35 < 2e-16 ***
HGRAD.cen 9.898e-01 9.743e-02 10.16 1.01e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1509 on 27 degrees of freedom
Multiple R-squared: 0.7926, Adjusted R-squared: 0.7849
F-statistic: 103.2 on 1 and 27 DF, p-value: 1.012e-10
expect.hgraduate = 25000 - mean(enrollment.forecast$HGRAD)
1.271e+04 + (9.898e-01 * expect.hgraduate)[1] 21095.45
predict(fit3, HGRAD.cen = expect.hgraduate) 1 2 3 4 5 6 7 8
5802.023 5928.718 5979.198 7894.469 10872.790 11456.775 11673.542 11910.105
9 10 11 12 13 14 15 16
12680.173 13065.207 13375.016 13873.877 14270.789 14427.178 15458.554 14385.606
17 18 19 20 21 22 23 24
15150.725 15653.546 15945.538 15694.128 15269.501 14929.999 13978.797 13474.986
25 26 27 28 29
12932.573 12925.645 13099.850 13402.730 12991.962
new.hgraduate = data.frame(HGRAD.cen = expect.hgraduate)
predict(fit3, new.hgraduate, interval = "prediction") fit lwr upr
1 21092.52 17516.71 24668.33
plot(ROLL ~ UNEM.cen + HGRAD.cen, data = enrollment.forecast)Build a second model which includes per capita income (INC)
fit4 = lm(ROLL ~ UNEM + HGRAD + INC, data = enrollment.forecast)
summary(fit4)
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enrollment.forecast)
Residuals:
Min 1Q Median 3Q Max
-1148.84 -489.71 -1.88 387.40 1425.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 ***
UNEM 4.501e+02 1.182e+02 3.809 0.000807 ***
HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 ***
INC 4.275e+00 4.947e-01 8.642 5.59e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
Compare the two models with anova(). Does including this variable improve the model?
summary(fit1)
Call:
lm(formula = ROLL ~ UNEM.cen + HGRAD.cen, data = enrollment.forecast)
Residuals:
Min 1Q Median 3Q Max
-2102.2 -861.6 -349.4 374.5 3603.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.271e+04 2.438e+02 52.127 < 2e-16 ***
UNEM.cen 6.983e+02 2.244e+02 3.111 0.00449 **
HGRAD.cen 9.423e-01 8.613e-02 10.941 3.16e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373
F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11
anova(fit1)Analysis of Variance Table
Response: ROLL
Df Sum Sq Mean Sq F value Pr(>F)
UNEM.cen 1 45407767 45407767 26.349 2.366e-05 ***
HGRAD.cen 1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26 44805568 1723291
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(fit4)
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enrollment.forecast)
Residuals:
Min 1Q Median 3Q Max
-1148.84 -489.71 -1.88 387.40 1425.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 ***
UNEM 4.501e+02 1.182e+02 3.809 0.000807 ***
HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 ***
INC 4.275e+00 4.947e-01 8.642 5.59e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
anova(fit4)Analysis of Variance Table
Response: ROLL
Df Sum Sq Mean Sq F value Pr(>F)
UNEM 1 45407767 45407767 101.02 2.894e-10 ***
HGRAD 1 206279143 206279143 458.92 < 2.2e-16 ***
INC 1 33568255 33568255 74.68 5.594e-09 ***
Residuals 25 11237313 449493
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As the R-squared value increased in fit4 compared to fit1, yes including INC does improve the model