GEOG 5680 Module 10: Statistical Modeling in R

Author

Zach Grube

Introduction

This exercise builds linear models to predict fall undergraduate enrollment (ROLL) at the University of New Mexico using unemployment rate (UNEM), spring high school graduates (HGRAD), and per capita income (INC).

library(ggplot2)

Read In The Data

enroll <- read.csv("C:/Users/yolom/OneDrive/Desktop/GEOG5680/Module 10/enrollmentForecast.csv")

Data Structure

str(enroll)

'data.frame':   29 obs. of  5 variables:
 $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
 $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
 $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
 $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

summary(enroll)

      YEAR         ROLL            UNEM            HGRAD            INC      
 Min.   : 1   Min.   : 5501   Min.   : 5.700   Min.   : 9552   Min.   :1923  
 1st Qu.: 8   1st Qu.:10167   1st Qu.: 7.000   1st Qu.:15723   1st Qu.:2351  
 Median :15   Median :14395   Median : 7.500   Median :17203   Median :2863  
 Mean   :15   Mean   :12707   Mean   : 7.717   Mean   :16528   Mean   :2729  
 3rd Qu.:22   3rd Qu.:14969   3rd Qu.: 8.200   3rd Qu.:18266   3rd Qu.:3127  
 Max.   :29   Max.   :16081   Max.   :10.100   Max.   :19800   Max.   :3345

Scatterplots

ggplot(enroll, aes(x = UNEM, y = ROLL)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Unemployment rate (%)",
    y = "Fall undergraduate enrollment"
  )

ggplot(enroll, aes(x = HGRAD, y = ROLL)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Spring high school graduates",
    y = "Fall undergraduate enrollment"
  )

ggplot(enroll, aes(x = INC, y = ROLL)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(
    x = "Per capita income (1961 dollars)",
    y = "Fall undergraduate enrollment"
  )

The correlations between enrollment and the predictor variables are:

cor(enroll[, c("ROLL", "UNEM", "HGRAD", "INC")])

           ROLL      UNEM     HGRAD       INC
ROLL  1.0000000 0.3913436 0.8902939 0.9498757
UNEM  0.3913436 1.0000000 0.1773760 0.2823104
HGRAD 0.8902939 0.1773760 1.0000000 0.8200894
INC   0.9498757 0.2823104 0.8200894 1.0000000

First Linear Model

The first model uses unemployment rate and the number of spring high school graduates to predict fall enrollment:

fit1 <- lm(ROLL ~ UNEM + HGRAD, data = enroll)
summary(fit1)


Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enroll)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,    Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11

anova(fit1)

Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM       1  45407767  45407767  26.349 2.366e-05 ***
HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The fitted model is approximately:

[ ROLL = -8255.75 + 698.27(UNEM) + 0.94(HGRAD) ]

Both UNEM and HGRAD are significant predictors in this model. The R-squared value is about 0.849, meaning the model explains about 84.9% of the variation in fall undergraduate enrollment.

Residual Plots

plot(fit1, which = 1)

hist(
  residuals(fit1),
  main = "Histogram of residuals for fit1",
  xlab = "Residuals"
)

ggplot(enroll, aes(x = YEAR, y = residuals(fit1))) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = 2) +
  geom_smooth() +
  labs(
    x = "Year",
    y = "Residuals"
  )

The residual plot should be checked for patterns around the zero line. There is some evidence of bias over time: the first model tends to under-predict enrollment in the later years, suggesting that another time-related variable may be missing.

Prediction

Use the model to estimate expected fall enrollment when the unemployment rate is 9% and the spring high school graduating class is 25,000 students.

newdata1 <- data.frame(UNEM = 9, HGRAD = 25000)
predict(fit1, newdata1)

       1 
21585.58

predict(fit1, newdata1, interval = "prediction")

       fit      lwr     upr
1 21585.58 18452.36 24718.8

The expected fall enrollment is about 21,586 students. The 95% prediction interval is approximately 18,452 to 24,719 students.

Second Linear Model

The second model adds per capita income (INC) as a predictor.

fit2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = enroll)
summary(fit2)


Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enroll)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,    Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16

anova(fit2)

Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM       1  45407767  45407767  101.02 2.894e-10 ***
HGRAD      1 206279143 206279143  458.92 < 2.2e-16 ***
INC        1  33568255  33568255   74.68 5.594e-09 ***
Residuals 25  11237313    449493                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Comparison

anova(fit1, fit2)

Analysis of Variance Table

Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
  Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
1     26 44805568                                 
2     25 11237313  1  33568255 74.68 5.594e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding INC improves the model. The comparison of fit1 and fit2 gives a very small p value, showing that INC significantly reduces residual error. The R-squared value increases from about 0.849 in the first model to about 0.962 in the second model.