exercise10

Author

Hangu Lee

0. Setting

# Read in the data
enrollment = read.csv("./data/enrollmentForecast.csv")

# Look at the data structure
str(enrollment)
'data.frame':   29 obs. of  5 variables:
 $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
 $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
 $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
 $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

1. Scatterplots

# Make scatterplots of ROLL against the other variables
library(ggplot2)

# ROLL vs UNEM
ggplot(enrollment, aes(x = UNEM, y = ROLL)) + 
  geom_point() + 
  labs(title = "Enrollment vs Unemployment Rate")

# ROLL vs HGRAD
ggplot(enrollment, aes(x = HGRAD, y = ROLL)) + 
  geom_point() + 
  labs(title = "Enrollment vs High School Graduates")

# ROLL vs INC
ggplot(enrollment, aes(x = INC, y = ROLL)) + 
  geom_point() + 
  labs(title = "Enrollment vs Income")

2. Model Building & Diagnostics

# Build a linear model using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRAD

fit_enroll1 = lm(ROLL ~ UNEM + HGRAD, data = enrollment)

# Use the summary() and anova() functions to investigate the model

summary(fit_enroll1)

Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enrollment)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,    Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11
anova(fit_enroll1)
Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM       1  45407767  45407767  26.349 2.366e-05 ***
HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

4. Residual Plot and Bias Check

# Residuals vs Fitted values plot
plot(fit_enroll1, which = 1)

No Pattern: Points randomly scattered around the 0 line indicate no systemic bias (good fit).

Pattern Present: Any curve, trend, or shape implies omitted non-linear information or missed patterns (poor fit).

5. Estimate Fall Enrollment (Prediction)

# Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students

# Define the new observation
new_obs <- data.frame(UNEM = 9, HGRAD = 25000)

# Calculate the prediction
pred_result <- predict(fit_enroll1, newdata = new_obs, interval = "prediction")
print(pred_result)
       fit      lwr     upr
1 21585.58 18452.36 24718.8

The expected fall enrollment point estimate is found under the fit column of the printed output. There is a 95% probability that the actual enrollment value will fall within the lower (lwr) and upper (upr) bounds.

6. Build a Second Model and Compare

# Build a second model which includes per capita income (INC).
fit_enroll2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = enrollment)
summary(fit_enroll2)

Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enrollment)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,    Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
# Compare the two models with anova().
anova(fit_enroll1, fit_enroll2)
Analysis of Variance Table

Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
  Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
1     26 44805568                                 
2     25 11237313  1  33568255 74.68 5.594e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7. Does including this variable improve the model?

Yes

The comparison yields an extremely significant p-value in the second row of the anova(fit_enroll1, fit_enroll2) table. Adding INC dramatically reduces the Residual Sum of Squares (RSS), meaning it significantly improves the model’s overall fit.