Module 10 Exercise

Author

u1535008

Module 10 Exercise Forecasting Enrollment

Read in the data and look at its structure.

enroll <- read.csv("enrollmentForecast.csv")
str(enroll)

'data.frame':   29 obs. of  5 variables:
 $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
 $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
 $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
 $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

Make scatterplots of ROLL against the other variables.

library(ggplot2)
ggplot(enroll, aes(x = UNEM, y = ROLL)) + geom_point()

ggplot(enroll, aes(x = HGRAD, y = ROLL)) + geom_point()

ggplot(enroll, aes(x = INC, y = ROLL)) + geom_point()

Build a linear model using the unemployment rate and number of spring high school graduates to predict the fall enrollment.

fit1 <- lm(ROLL ~ UNEM + HGRAD, data = enroll)

Use the summary() and anova() functions to investigate the model.

summary(fit1)


Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enroll)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,    Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11

anova(fit1)

Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM       1  45407767  45407767  26.349 2.366e-05 ***
HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

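Which variable is the most closely related to enrollment? The number of spring high school graduates (HGRAD) is most closely related to fall enrollment. Both predictors are significant at the 0.01 level, so significance alone does not separate them, but HGRAD has the largest t value (10.941) and the smallest p-value (3.16e-11) in the summary, and it accounts for the largest sum of squares in the ANOVA table (206,279,143 versus 45,407,767 for UNEM).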
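As a rough check on that comparison, the sums of squares printed by anova(fit1) can be turned into shares of the total. This is only a sketch using the numbers copied from the table above; the name `share` is introduced here for illustration. Note that anova() reports sequential sums of squares, so the share credited to UNEM depends on it being entered first.

```r
# Sums of squares copied from the anova(fit1) table above
ss <- c(UNEM = 45407767, HGRAD = 206279143, Residuals = 44805568)

# Share of the total sum of squares attributed to each row
share <- ss / sum(ss)
round(share, 3)

# The two model terms together should reproduce Multiple R-squared
r_squared <- 1 - share[["Residuals"]]
round(r_squared, 4)
```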

Make a residual plot and check for any bias in the model.

hist(residuals(fit1))
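The histogram shows how the residuals are distributed. A plot of residuals against fitted values is a common complement, because a curve or funnel shape there can point to bias that a histogram hides. The helper below is a hypothetical sketch in base graphics (the name `plot_resid_vs_fitted` is not part of the exercise) and works on any fitted lm object.

```r
# Hypothetical helper (not part of the original exercise): plots residuals
# against fitted values for a fitted lm object.
plot_resid_vs_fitted <- function(model) {
  d <- data.frame(fitted = fitted(model), resid = residuals(model))
  plot(d$fitted, d$resid,
       xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs fitted")
  abline(h = 0, lty = 2)  # reference line at zero
  invisible(d)            # return the plotted values without printing them
}
```

Calling plot_resid_vs_fitted(fit1) would draw the plot for the first model; points scattered evenly around the dashed line would suggest no obvious bias.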

Estimate the expected fall enrollment if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students.

given <- data.frame(UNEM = 9, HGRAD = 25000)
predict(fit1, given)

       1 
21585.58
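The prediction can be checked by hand from the fitted equation. The sketch below uses the rounded coefficients printed in summary(fit1), so it should agree with predict() only up to rounding; the names `b`, `x`, and `by_hand` are introduced here for illustration.

```r
# Coefficients as printed (rounded) in summary(fit1)
b <- c(intercept = -8256, UNEM = 698.3, HGRAD = 0.9423)

# Predictor values from the question, with 1 for the intercept
x <- c(1, 9, 25000)

# Intercept + UNEM coefficient * 9 + HGRAD coefficient * 25000
by_hand <- sum(b * x)
by_hand  # close to the predict() result of 21585.58, up to rounding
```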

Build a second model which includes per capita income.

fit2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = enroll)
summary(fit2)


Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enroll)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,    Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
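Because fit2 has one more predictor than fit1, adjusted R-squared is the fairer comparison between the two summaries. The sketch below recomputes it from the printed Multiple R-squared values with the standard formula, assuming n = 29 observations as reported by str(); `adj_r2` is a hypothetical helper, not something from the exercise.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# n observations and p predictors (intercept not counted)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj1 <- adj_r2(0.8489, n = 29, p = 2)  # fit1: UNEM + HGRAD
adj2 <- adj_r2(0.9621, n = 29, p = 3)  # fit2: UNEM + HGRAD + INC

# Should match the Adjusted R-squared lines in the two summaries
round(c(fit1 = adj1, fit2 = adj2), 4)
```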

Compare the two models with anova().

anova(fit1, fit2)

Analysis of Variance Table

Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
  Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
1     26 44805568                                 
2     25 11237313  1  33568255 74.68 5.594e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

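Does including the INC variable improve the model? Yes. The F test comparing the two models gives F = 74.68 on 1 and 25 degrees of freedom (p = 5.594e-09), so the drop in residual sum of squares from 44,805,568 to 11,237,313 is far larger than would be expected by chance from adding one variable. Adjusted R-squared also rises from 0.8373 to 0.9576, and the residual standard error falls from 1313 to 670.4.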
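The F statistic in the comparison table can be reproduced from the two residual sums of squares. This is a sketch using the values printed by anova(fit1, fit2); the variable names are introduced here for illustration.

```r
# Residual sums of squares and residual degrees of freedom from the table
rss1 <- 44805568; df1 <- 26  # Model 1: ROLL ~ UNEM + HGRAD
rss2 <- 11237313; df2 <- 25  # Model 2: ROLL ~ UNEM + HGRAD + INC

# Extra sum of squares per added term, relative to Model 2's residual mean square
f_stat <- ((rss1 - rss2) / (df1 - df2)) / (rss2 / df2)

# Upper-tail probability under the F distribution
p_value <- pf(f_stat, df1 - df2, df2, lower.tail = FALSE)

round(f_stat, 2)
p_value
```

With a single added term, this F statistic equals the square of the t value for INC in summary(fit2), which is why the two p-values agree.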