ef <- read.csv("../Data/enrollmentForecast.csv")
library(ggplot2)Warning: package 'ggplot2' was built under R version 4.4.3
ef <- read.csv("../Data/enrollmentForecast.csv")
library(ggplot2)Warning: package 'ggplot2' was built under R version 4.4.3
str(ef)'data.frame': 29 obs. of 5 variables:
$ YEAR : int 1 2 3 4 5 6 7 8 9 10 ...
$ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
$ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
$ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
$ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...
summary(ef) YEAR ROLL UNEM HGRAD INC
Min. : 1 Min. : 5501 Min. : 5.700 Min. : 9552 Min. :1923
1st Qu.: 8 1st Qu.:10167 1st Qu.: 7.000 1st Qu.:15723 1st Qu.:2351
Median :15 Median :14395 Median : 7.500 Median :17203 Median :2863
Mean :15 Mean :12707 Mean : 7.717 Mean :16528 Mean :2729
3rd Qu.:22 3rd Qu.:14969 3rd Qu.: 8.200 3rd Qu.:18266 3rd Qu.:3127
Max. :29 Max. :16081 Max. :10.100 Max. :19800 Max. :3345
ROLL against the other variables# 1. ROLL vs YEAR
ggplot(ef, aes(x = YEAR, y = ROLL)) +
geom_point() +
labs(title = "Undergraduate Enrollment by Year",
x = "Year (1 = 1961)",
y = "Fall Undergraduate Enrollment (ROLL)")# 2. ROLL vs UNEM (Unemployment)
ggplot(ef, aes(x = UNEM, y = ROLL)) +
geom_point() +
labs(title = "Undergraduate Enrollment vs Unemployment Rate",
x = "January Unemployment Rate (%) (UNEM)",
y = "Fall Undergraduate Enrollment (ROLL)")# 3. ROLL vs HGRAD (High School Graduates)
ggplot(ef, aes(x = HGRAD, y = ROLL)) +
geom_point() +
labs(title = "Undergraduate Enrollment vs High School Graduates",
x = "Spring High School Graduates (HGRAD)",
y = "Fall Undergraduate Enrollment (ROLL)")# 4. ROLL vs INC (Income)
ggplot(ef, aes(x = INC, y = ROLL)) +
geom_point() +
labs(title = "Undergraduate Enrollment vs Per Capita Income",
x = "Per Capita Income (1961 dollars) (INC)",
y = "Fall Undergraduate Enrollment (ROLL)")UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRADenroll_model <- lm(ROLL ~ UNEM + HGRAD, data = ef)summary() and anova() functions to investigate the modelsummary(enroll_model)
Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = ef)
Residuals:
Min 1Q Median 3Q Max
-2102.2 -861.6 -349.4 374.5 3603.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -8.256e+03 2.052e+03 -4.023 0.00044 ***
UNEM 6.983e+02 2.244e+02 3.111 0.00449 **
HGRAD 9.423e-01 8.613e-02 10.941 3.16e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373
F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11
anova(enroll_model)Analysis of Variance Table
Response: ROLL
Df Sum Sq Mean Sq F value Pr(>F)
UNEM 1 45407767 45407767 26.349 2.366e-05 ***
HGRAD 1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26 44805568 1723291
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Which variable is the most closely related to enrollment? High school grads is the most closely related to enrollment. We know this because it has a much smaller p-value and the sum of squares is much larger
Make a residual plot and check for any bias in the model
hist(residuals(enroll_model),
main = "Histogram of Residuals",
xlab = "Residuals")plot(enroll_model, which = 1)predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 studentsfuture_data <- data.frame(UNEM = 9.0, HGRAD = 25000)
predict(enroll_model, newdata = future_data) 1
21585.58
INC)enroll_model_2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = ef)anova(). Does including this variable improve the model?anova(enroll_model, enroll_model_2)Analysis of Variance Table
Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
Res.Df RSS Df Sum of Sq F Pr(>F)
1 26 44805568
2 25 11237313 1 33568255 74.68 5.594e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The small p-value indicates that including INC does significantly improve the model