library(ggplot2)GEOG 5680 Module 10: Statistical Modeling in R
Introduction
This exercise builds linear models to predict fall undergraduate enrollment (ROLL) at the University of New Mexico using unemployment rate (UNEM), spring high school graduates (HGRAD), and per capita income (INC).
Read In The Data
enroll <- read.csv("C:/Users/yolom/OneDrive/Desktop/GEOG5680/Module 10/enrollmentForecast.csv")Data Structure
str(enroll)'data.frame': 29 obs. of 5 variables:
$ YEAR : int 1 2 3 4 5 6 7 8 9 10 ...
$ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
$ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
$ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
$ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...
summary(enroll) YEAR ROLL UNEM HGRAD INC
Min. : 1 Min. : 5501 Min. : 5.700 Min. : 9552 Min. :1923
1st Qu.: 8 1st Qu.:10167 1st Qu.: 7.000 1st Qu.:15723 1st Qu.:2351
Median :15 Median :14395 Median : 7.500 Median :17203 Median :2863
Mean :15 Mean :12707 Mean : 7.717 Mean :16528 Mean :2729
3rd Qu.:22 3rd Qu.:14969 3rd Qu.: 8.200 3rd Qu.:18266 3rd Qu.:3127
Max. :29 Max. :16081 Max. :10.100 Max. :19800 Max. :3345
Scatterplots
ggplot(enroll, aes(x = UNEM, y = ROLL)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
x = "Unemployment rate (%)",
y = "Fall undergraduate enrollment"
)ggplot(enroll, aes(x = HGRAD, y = ROLL)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
x = "Spring high school graduates",
y = "Fall undergraduate enrollment"
)ggplot(enroll, aes(x = INC, y = ROLL)) +
geom_point() +
geom_smooth(method = "lm") +
labs(
x = "Per capita income (1961 dollars)",
y = "Fall undergraduate enrollment"
)The correlations between enrollment and the predictor variables are:
cor(enroll[, c("ROLL", "UNEM", "HGRAD", "INC")]) ROLL UNEM HGRAD INC
ROLL 1.0000000 0.3913436 0.8902939 0.9498757
UNEM 0.3913436 1.0000000 0.1773760 0.2823104
HGRAD 0.8902939 0.1773760 1.0000000 0.8200894
INC 0.9498757 0.2823104 0.8200894 1.0000000
First Linear Model
The first model uses unemployment rate and the number of spring high school graduates to predict fall enrollment:
fit1 <- lm(ROLL ~ UNEM + HGRAD, data = enroll)
summary(fit1)
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enroll)
Residuals:
Min 1Q Median 3Q Max
-2102.2 -861.6 -349.4 374.5 3603.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.256e+03 2.052e+03 -4.023 0.00044 ***
UNEM 6.983e+02 2.244e+02 3.111 0.00449 **
HGRAD 9.423e-01 8.613e-02 10.941 3.16e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373
F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11
anova(fit1)Analysis of Variance Table
Response: ROLL
Df Sum Sq Mean Sq F value Pr(>F)
UNEM 1 45407767 45407767 26.349 2.366e-05 ***
HGRAD 1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26 44805568 1723291
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The fitted model is approximately:
[ ROLL = -8255.75 + 698.27(UNEM) + 0.94(HGRAD) ]
Both UNEM and HGRAD are significant predictors in this model. The R-squared value is about 0.849, meaning the model explains about 84.9% of the variation in fall undergraduate enrollment.
Residual Plots
plot(fit1, which = 1)hist(
residuals(fit1),
main = "Histogram of residuals for fit1",
xlab = "Residuals"
)ggplot(enroll, aes(x = YEAR, y = residuals(fit1))) +
geom_point() +
geom_hline(yintercept = 0, linetype = 2) +
geom_smooth() +
labs(
x = "Year",
y = "Residuals"
)The residual plot should be checked for patterns around the zero line. There is some evidence of bias over time: the first model tends to under-predict enrollment in the later years, suggesting that another time-related variable may be missing.
Prediction
Use the model to estimate expected fall enrollment when the unemployment rate is 9% and the spring high school graduating class is 25,000 students.
newdata1 <- data.frame(UNEM = 9, HGRAD = 25000)
predict(fit1, newdata1) 1
21585.58
predict(fit1, newdata1, interval = "prediction") fit lwr upr
1 21585.58 18452.36 24718.8
The expected fall enrollment is about 21,586 students. The 95% prediction interval is approximately 18,452 to 24,719 students.
Second Linear Model
The second model adds per capita income (INC) as a predictor.
fit2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = enroll)
summary(fit2)
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enroll)
Residuals:
Min 1Q Median 3Q Max
-1148.84 -489.71 -1.88 387.40 1425.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 ***
UNEM 4.501e+02 1.182e+02 3.809 0.000807 ***
HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 ***
INC 4.275e+00 4.947e-01 8.642 5.59e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
anova(fit2)Analysis of Variance Table
Response: ROLL
Df Sum Sq Mean Sq F value Pr(>F)
UNEM 1 45407767 45407767 101.02 2.894e-10 ***
HGRAD 1 206279143 206279143 458.92 < 2.2e-16 ***
INC 1 33568255 33568255 74.68 5.594e-09 ***
Residuals 25 11237313 449493
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Model Comparison
anova(fit1, fit2)Analysis of Variance Table
Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
Res.Df RSS Df Sum of Sq F Pr(>F)
1 26 44805568
2 25 11237313 1 33568255 74.68 5.594e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding INC improves the model. The comparison of fit1 and fit2 gives a very small p value, showing that INC significantly reduces residual error. The R-squared value increases from about 0.849 in the first model to about 0.962 in the second model.