Module 10: Statistical Modeling

Building a linear model of student enrollment.

roll = read.csv("enrollmentForecast.csv")
str(roll)
## 'data.frame':    29 obs. of  5 variables:
##  $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
##  $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
##  $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
##  $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

Scatterplots of ROLL against the other variables:

library(ggplot2)
qplot(YEAR, ROLL, data = roll)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

qplot(UNEM, ROLL, data = roll)

qplot(HGRAD, ROLL, data = roll)

qplot(INC, ROLL, data = roll)

Build a linear model using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL):

lm(ROLL ~ UNEM + HGRAD, data = roll)
## 
## Call:
## lm(formula = ROLL ~ UNEM + HGRAD, data = roll)
## 
## Coefficients:
## (Intercept)         UNEM        HGRAD  
##  -8255.7511     698.2681       0.9423
roll_1 = lm(ROLL ~ UNEM + HGRAD, data = roll)
class(roll_1)
## [1] "lm"

Use the summary() and anova() functions to investigate the model:

summary(roll_1)
## 
## Call:
## lm(formula = ROLL ~ UNEM + HGRAD, data = roll)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2102.2  -861.6  -349.4   374.5  3603.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
## UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
## HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1313 on 26 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8373 
## F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11
anova(roll_1)
## Analysis of Variance Table
## 
## Response: ROLL
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## UNEM       1  45407767  45407767  26.349 2.366e-05 ***
## HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
## Residuals 26  44805568   1723291                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which variable is the most closely related to enrollment?

The HGRAD variable has a higher F-statistic and a lower p-value in the ANOVA, as well as a higher t-value in the summary. I am concluding from this that HGRAD is more closely related to enrollment. This also makes sense looking at the scatterplots.

Make a residual plot and check for any bias in the model:

hist(residuals(roll_1), breaks = 20)

This doesn’t really appear to be a normal distribution, and may show that the model is overestimating enrollment.

Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students:

newdata <- data.frame(UNEM = 9, HGRAD = 25000)
predict(roll_1, newdata = newdata, interval = "prediction")
##        fit      lwr     upr
## 1 21585.58 18452.36 24718.8

Build a second model which includes per capita income (INC):

lm(ROLL ~ UNEM + HGRAD + INC, data = roll)
## 
## Call:
## lm(formula = ROLL ~ UNEM + HGRAD + INC, data = roll)
## 
## Coefficients:
## (Intercept)         UNEM        HGRAD          INC  
##  -9153.2545     450.1245       0.4065       4.2749
roll_inc = lm(ROLL ~ UNEM + HGRAD + INC, data = roll)
class(roll_inc)
## [1] "lm"

Compare the two models with anova(). Does including this variable improve the model?

anova(roll_1)
## Analysis of Variance Table
## 
## Response: ROLL
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## UNEM       1  45407767  45407767  26.349 2.366e-05 ***
## HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
## Residuals 26  44805568   1723291                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(roll_inc)
## Analysis of Variance Table
## 
## Response: ROLL
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## UNEM       1  45407767  45407767  101.02 2.894e-10 ***
## HGRAD      1 206279143 206279143  458.92 < 2.2e-16 ***
## INC        1  33568255  33568255   74.68 5.594e-09 ***
## Residuals 25  11237313    449493                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model that includes INC has a lower RSS value, and the large F-statistic and low p-value indicate that adding INC significantly improved the model.