Module10_Gordon

Author

Mason Gordon

Module 10

Read in the data

ef <- read.csv("../Data/enrollmentForecast.csv")
library(ggplot2)

Warning: package 'ggplot2' was built under R version 4.4.3

Look at the data structure

str(ef)

'data.frame':   29 obs. of  5 variables:
 $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
 $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
 $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
 $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

summary(ef)

      YEAR         ROLL            UNEM            HGRAD            INC      
 Min.   : 1   Min.   : 5501   Min.   : 5.700   Min.   : 9552   Min.   :1923  
 1st Qu.: 8   1st Qu.:10167   1st Qu.: 7.000   1st Qu.:15723   1st Qu.:2351  
 Median :15   Median :14395   Median : 7.500   Median :17203   Median :2863  
 Mean   :15   Mean   :12707   Mean   : 7.717   Mean   :16528   Mean   :2729  
 3rd Qu.:22   3rd Qu.:14969   3rd Qu.: 8.200   3rd Qu.:18266   3rd Qu.:3127  
 Max.   :29   Max.   :16081   Max.   :10.100   Max.   :19800   Max.   :3345

Make scatterplots of ROLL against the other variables

# 1. ROLL vs YEAR
ggplot(ef, aes(x = YEAR, y = ROLL)) + 
  geom_point() +
  labs(title = "Undergraduate Enrollment by Year",
       x = "Year (1 = 1961)",
       y = "Fall Undergraduate Enrollment (ROLL)")

# 2. ROLL vs UNEM (Unemployment)
ggplot(ef, aes(x = UNEM, y = ROLL)) + 
  geom_point() +
  labs(title = "Undergraduate Enrollment vs Unemployment Rate",
       x = "January Unemployment Rate (%) (UNEM)",
       y = "Fall Undergraduate Enrollment (ROLL)")

# 3. ROLL vs HGRAD (High School Graduates)
ggplot(ef, aes(x = HGRAD, y = ROLL)) + 
  geom_point() +
  labs(title = "Undergraduate Enrollment vs High School Graduates",
       x = "Spring High School Graduates (HGRAD)",
       y = "Fall Undergraduate Enrollment (ROLL)")

# 4. ROLL vs INC (Income)
ggplot(ef, aes(x = INC, y = ROLL)) + 
  geom_point() +
  labs(title = "Undergraduate Enrollment vs Per Capita Income",
       x = "Per Capita Income (1961 dollars) (INC)",
       y = "Fall Undergraduate Enrollment (ROLL)")

Build a linear model using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRAD

enroll_model <- lm(ROLL ~ UNEM + HGRAD, data = ef)

Use the summary() and anova() functions to investigate the model

summary(enroll_model)


Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = ef)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,    Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11

anova(enroll_model)

Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM       1  45407767  45407767  26.349 2.366e-05 ***
HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which variable is the most closely related to enrollment? High school grads is the most closely related to enrollment. We know this because it has a much smaller p-value and the sum of squares is much larger
Make a residual plot and check for any bias in the model

hist(residuals(enroll_model), 
     main = "Histogram of Residuals", 
     xlab = "Residuals")

plot(enroll_model, which = 1)

Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students

future_data <- data.frame(UNEM = 9.0, HGRAD = 25000)

predict(enroll_model, newdata = future_data)

       1 
21585.58

Build a second model which includes per capita income (INC)

enroll_model_2 <- lm(ROLL ~ UNEM + HGRAD + INC, data = ef)

Compare the two models with anova(). Does including this variable improve the model?

anova(enroll_model, enroll_model_2)

Analysis of Variance Table

Model 1: ROLL ~ UNEM + HGRAD
Model 2: ROLL ~ UNEM + HGRAD + INC
  Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
1     26 44805568                                 
2     25 11237313  1  33568255 74.68 5.594e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The small p-value indicates that including INC does significantly improve the model