# Read in the dataenrollment =read.csv("./data/enrollmentForecast.csv")# Look at the data structurestr(enrollment)
'data.frame': 29 obs. of 5 variables:
$ YEAR : int 1 2 3 4 5 6 7 8 9 10 ...
$ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
$ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
$ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
$ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...
1. Scatterplots
# Make scatterplots of ROLL against the other variableslibrary(ggplot2)# ROLL vs UNEMggplot(enrollment, aes(x = UNEM, y = ROLL)) +geom_point() +labs(title ="Enrollment vs Unemployment Rate")
# ROLL vs HGRADggplot(enrollment, aes(x = HGRAD, y = ROLL)) +geom_point() +labs(title ="Enrollment vs High School Graduates")
# ROLL vs INCggplot(enrollment, aes(x = INC, y = ROLL)) +geom_point() +labs(title ="Enrollment vs Income")
2. Model Building & Diagnostics
# Build a linear model using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRADfit_enroll1 =lm(ROLL ~ UNEM + HGRAD, data = enrollment)# Use the summary() and anova() functions to investigate the modelsummary(fit_enroll1)
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enrollment)
Residuals:
Min 1Q Median 3Q Max
-2102.2 -861.6 -349.4 374.5 3603.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.256e+03 2.052e+03 -4.023 0.00044 ***
UNEM 6.983e+02 2.244e+02 3.111 0.00449 **
HGRAD 9.423e-01 8.613e-02 10.941 3.16e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373
F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11
anova(fit_enroll1)
Analysis of Variance Table
Response: ROLL
Df Sum Sq Mean Sq F value Pr(>F)
UNEM 1 45407767 45407767 26.349 2.366e-05 ***
HGRAD 1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26 44805568 1723291
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
3. Which variable is the most closely related to enrollment?
HGRAD (High School Graduates)
Because HGRAD has the highest absolute t-value and the most significant p-value, flagged by three stars (***) in the summary(fit_enroll1) output. This indicates it is the dominant statistical driver of enrollment.
4. Residual Plot and Bias Check
# Residuals vs Fitted values plotplot(fit_enroll1, which =1)
No Pattern: Points randomly scattered around the 0 line indicate no systemic bias (good fit).
Pattern Present: Any curve, trend, or shape implies omitted non-linear information or missed patterns (poor fit).
5. Estimate Fall Enrollment (Prediction)
# Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students# Define the new observationnew_obs <-data.frame(UNEM =9, HGRAD =25000)# Calculate the predictionpred_result <-predict(fit_enroll1, newdata = new_obs, interval ="prediction")print(pred_result)
fit lwr upr
1 21585.58 18452.36 24718.8
The expected fall enrollment point estimate is found under the fit column of the printed output. There is a 95% probability that the actual enrollment value will fall within the lower (lwr) and upper (upr) bounds.
6. Build a Second Model and Compare
# Build a second model which includes per capita income (INC).fit_enroll2 <-lm(ROLL ~ UNEM + HGRAD + INC, data = enrollment)summary(fit_enroll2)
Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enrollment)
Residuals:
Min 1Q Median 3Q Max
-1148.84 -489.71 -1.88 387.40 1425.75
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 ***
UNEM 4.501e+02 1.182e+02 3.809 0.000807 ***
HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 ***
INC 4.275e+00 4.947e-01 8.642 5.59e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
# Compare the two models with anova().anova(fit_enroll1, fit_enroll2)
Analysis of Variance Table
Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
Res.Df RSS Df Sum of Sq F Pr(>F)
1 26 44805568
2 25 11237313 1 33568255 74.68 5.594e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
7. Does including this variable improve the model?
Yes
The comparison yields an extremely significant p-value in the second row of the anova(fit_enroll1, fit_enroll2) table. Adding INC dramatically reduces the Residual Sum of Squares (RSS), meaning it significantly improves the model’s overall fit.