m10_Quarto_2_Strange

Module 10 Exercise

Set working directory and load data (setwd("~/Desktop/geog5680/Module Deliverables/Module 10"))

Read in the data

ef = read.csv("enrollmentForecast.csv")
library(ggplot2)

Look at the data structure

str(ef)
'data.frame':   29 obs. of  5 variables:
 $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
 $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
 $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
 $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

1. Make scatterplots of ROLL against the other variables

(ROLL x UNEM)

Fall Undergraduate Enrollment and January Unemployment for New Mexico

ggplot(ef, aes(x=UNEM, y=ROLL)) +
  geom_point(size=3, alpha=.5) +
  labs(x="January unemployment rate (%) for New Mexico (UNEM)",
       y="Fall undergraduate enrollment (ROLL)",
       title="Fall Undergraduate Enrollment and January Unemployment for New Mexico")

(ROLL x HGRAD)

Fall Undergraduate Enrollment and Spring High School Graduates in New Mexico

ggplot(ef, aes(x=HGRAD, y=ROLL)) +
  geom_point(size=3, alpha=.5) +
  labs(x="Spring high school graduates in New Mexico (HGRAD)",
       y="Fall undergraduate enrollment (ROLL)",
       title="Fall Undergraduate Enrollment and Spring High School Graduates in New Mexico")

(ROLL x INC)

Undergraduate Enrollment and Per-Capita Income in Albuquerque

ggplot(ef, aes(x=INC, y=ROLL)) +
  geom_point(size=3, alpha=.5) +
  labs(x="Per capita income in Albuquerque (1961 dollars) (INC)",
       y="Fall undergraduate enrollment (ROLL)",
       title="Undergraduate Enrollment and Per-Capita Income in Albuquerque")
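
The visual impression from the three scatterplots can be backed up with correlation coefficients. The sketch below is illustrative only: `ef10` is a hypothetical name for a data frame holding just the first ten rows, copied from the str(ef) output above, so the values it prints will differ from those for the full 29-row data set.

```r
# First ten observations, copied from the str(ef) output (not the full data set)
ef10 <- data.frame(
  ROLL  = c(5501, 5945, 6629, 7556, 8716, 9369, 9920, 10167, 11084, 12504),
  UNEM  = c(8.1, 7, 7.3, 7.5, 7, 6.4, 6.5, 6.4, 6.3, 7.7),
  HGRAD = c(9552, 9680, 9731, 11666, 14675, 15265, 15484, 15723, 16501, 16890),
  INC   = c(1923, 1961, 1979, 2030, 2112, 2192, 2235, 2351, 2411, 2475)
)

# Pearson correlation of ROLL with each candidate predictor
roll_cor <- cor(ef10)["ROLL", c("UNEM", "HGRAD", "INC")]
print(round(roll_cor, 3))
```

With the real data, the same call on ef[, c("ROLL", "UNEM", "HGRAD", "INC")] would give the corresponding values for all 29 years.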

2. Build a linear model using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRAD

ef$UNEM.cen = ef$UNEM-mean(ef$UNEM)
ef$HGRAD.cen = ef$HGRAD-mean(ef$HGRAD)
lm(ROLL~UNEM.cen+HGRAD.cen, data = ef)

Call:
lm(formula = ROLL ~ UNEM.cen + HGRAD.cen, data = ef)

Coefficients:
(Intercept)     UNEM.cen    HGRAD.cen  
  1.271e+04    6.983e+02    9.423e-01  
future_enroll_predict = lm(ROLL~UNEM.cen+HGRAD.cen, data = ef)
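
Centering the predictors changes only the intercept, which becomes the mean of ROLL (hence the 1.271e+04 above); the slopes are the same as in the uncentered model ROLL ~ UNEM + HGRAD. A minimal sketch of that equivalence, assuming a hypothetical ten-row data frame `ef10` built from the values shown in the str(ef) output (so its coefficients will not match the full-data fit):

```r
# First ten observations from the str(ef) output; illustrative subset only
ef10 <- data.frame(
  ROLL  = c(5501, 5945, 6629, 7556, 8716, 9369, 9920, 10167, 11084, 12504),
  UNEM  = c(8.1, 7, 7.3, 7.5, 7, 6.4, 6.5, 6.4, 6.3, 7.7),
  HGRAD = c(9552, 9680, 9731, 11666, 14675, 15265, 15484, 15723, 16501, 16890)
)
ef10$UNEM.cen  <- ef10$UNEM  - mean(ef10$UNEM)
ef10$HGRAD.cen <- ef10$HGRAD - mean(ef10$HGRAD)

raw_fit <- lm(ROLL ~ UNEM + HGRAD, data = ef10)          # uncentered
cen_fit <- lm(ROLL ~ UNEM.cen + HGRAD.cen, data = ef10)  # centered

# Slopes agree between the two parameterizations
same_slopes <- all.equal(unname(coef(raw_fit)[-1]), unname(coef(cen_fit)[-1]))
# Centered intercept equals the mean response
same_intercept <- all.equal(unname(coef(cen_fit)[1]), mean(ef10$ROLL))
print(c(same_slopes = same_slopes, same_intercept = same_intercept))
```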

3. Use the summary() and anova() functions to investigate the model

summary(future_enroll_predict)

Call:
lm(formula = ROLL ~ UNEM.cen + HGRAD.cen, data = ef)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.271e+04  2.438e+02  52.127  < 2e-16 ***
UNEM.cen    6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD.cen   9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,    Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11
anova(future_enroll_predict)
Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM.cen   1  45407767  45407767  26.349 2.366e-05 ***
HGRAD.cen  1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
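
Note that anova() on a single lm fit reports sequential sums of squares: each term is tested after the terms listed before it. The order of the predictors therefore matters for every row except the last, whose F value equals the square of that term's t value in summary() (here 10.941^2 is about 119.7). A sketch of both points, assuming a hypothetical ten-row subset `ef10` taken from the str(ef) output, so the numbers differ from the table above:

```r
# First ten observations from the str(ef) output; illustrative subset only
ef10 <- data.frame(
  ROLL  = c(5501, 5945, 6629, 7556, 8716, 9369, 9920, 10167, 11084, 12504),
  UNEM  = c(8.1, 7, 7.3, 7.5, 7, 6.4, 6.5, 6.4, 6.3, 7.7),
  HGRAD = c(9552, 9680, 9731, 11666, 14675, 15265, 15484, 15723, 16501, 16890)
)

fit_a <- lm(ROLL ~ UNEM + HGRAD, data = ef10)  # UNEM entered first
fit_b <- lm(ROLL ~ HGRAD + UNEM, data = ef10)  # HGRAD entered first

# Sum of squares credited to UNEM depends on where it enters the model
ss_unem_first <- anova(fit_a)["UNEM", "Sum Sq"]
ss_unem_last  <- anova(fit_b)["UNEM", "Sum Sq"]
print(c(first = ss_unem_first, last = ss_unem_last))

# For the last term, the sequential F equals the squared t statistic
t_last <- summary(fit_a)$coefficients["HGRAD", "t value"]
f_last <- anova(fit_a)["HGRAD", "F value"]
print(all.equal(t_last^2, f_last))
```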

4. Make a residual plot and check for any bias in the model

hist(residuals(future_enroll_predict)) 

plot(future_enroll_predict, which = 1)
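
When reading these plots, it helps to remember that the residuals of a least-squares fit with an intercept average to zero and are uncorrelated with the fitted values by construction, so neither summary can reveal bias. What matters is any pattern (curvature, funnel shape) in the residuals-versus-fitted plot. A minimal sketch, assuming a hypothetical ten-row subset `ef10` copied from the str(ef) output rather than the full data:

```r
# First ten observations from the str(ef) output; illustrative subset only
ef10 <- data.frame(
  ROLL  = c(5501, 5945, 6629, 7556, 8716, 9369, 9920, 10167, 11084, 12504),
  UNEM  = c(8.1, 7, 7.3, 7.5, 7, 6.4, 6.5, 6.4, 6.3, 7.7),
  HGRAD = c(9552, 9680, 9731, 11666, 14675, 15265, 15484, 15723, 16501, 16890)
)
fit <- lm(ROLL ~ UNEM + HGRAD, data = ef10)

res      <- residuals(fit)
fit_vals <- fitted(fit)

# Both are zero up to floating-point error, whatever the model's quality
res_mean <- mean(res)
res_cor  <- cor(res, fit_vals)
print(c(mean = res_mean, cor_with_fitted = res_cor))

# Residuals vs fitted with a zero reference line and a lowess smoother;
# a smoother that bends away from zero suggests a missing term or nonlinearity
plot(fit_vals, res, xlab = "Fitted ROLL", ylab = "Residual")
abline(h = 0, lty = 2)
lines(lowess(fit_vals, res), col = "red")
```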

5. Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students.

unem_circumstance.cen = 9-mean(ef$UNEM)
hgrad_circumstance.cen = 25000-mean(ef$HGRAD)
circumstance_new_data_frame = data.frame(UNEM.cen = unem_circumstance.cen, HGRAD.cen=hgrad_circumstance.cen)
predict(future_enroll_predict, circumstance_new_data_frame)
       1 
21585.58 
predicted_fall_enrollment_noINC = predict(future_enroll_predict, circumstance_new_data_frame)

If the current year’s unemployment rate is 9% and the spring high school graduating class has 25,000 students, the model predicts a fall enrollment of about 21,586 students (21,585.58).
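
The value returned by predict() is just the fitted equation evaluated at the new point: intercept plus each slope times the corresponding centered value. The sketch below checks this by hand and also requests a prediction interval. It assumes a hypothetical ten-row subset `ef10` from the str(ef) output, so the predicted value is not comparable to the 21,585.58 above (and 25,000 graduates lies well outside the subset's range).

```r
# First ten observations from the str(ef) output; illustrative subset only
ef10 <- data.frame(
  ROLL  = c(5501, 5945, 6629, 7556, 8716, 9369, 9920, 10167, 11084, 12504),
  UNEM  = c(8.1, 7, 7.3, 7.5, 7, 6.4, 6.5, 6.4, 6.3, 7.7),
  HGRAD = c(9552, 9680, 9731, 11666, 14675, 15265, 15484, 15723, 16501, 16890)
)
ef10$UNEM.cen  <- ef10$UNEM  - mean(ef10$UNEM)
ef10$HGRAD.cen <- ef10$HGRAD - mean(ef10$HGRAD)
fit <- lm(ROLL ~ UNEM.cen + HGRAD.cen, data = ef10)

# New data must be centered with the same means used to fit the model
new_point <- data.frame(UNEM.cen  = 9 - mean(ef10$UNEM),
                        HGRAD.cen = 25000 - mean(ef10$HGRAD))

by_predict <- unname(predict(fit, new_point))
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] +
                  b["UNEM.cen"]  * new_point$UNEM.cen +
                  b["HGRAD.cen"] * new_point$HGRAD.cen)
print(all.equal(by_predict, by_hand))

# Point estimate with a 95% prediction interval (columns fit, lwr, upr)
pred_int <- predict(fit, new_point, interval = "prediction")
print(pred_int)
```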

6. Build a second model which includes per capita income (INC).

ef$INC.cen = ef$INC-mean(ef$INC)
exp_fall_enroll_w_inc = lm(ROLL~UNEM.cen+HGRAD.cen+INC.cen, data = ef)
summary(exp_fall_enroll_w_inc)

Call:
lm(formula = ROLL ~ UNEM.cen + HGRAD.cen + INC.cen, data = ef)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.271e+04  1.245e+02 102.066  < 2e-16 ***
UNEM.cen    4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD.cen   4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC.cen     4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,    Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16
q6_data_frame = data.frame(UNEM.cen = 9-mean(ef$UNEM), HGRAD.cen = 25000-mean(ef$HGRAD), INC.cen = 0)
predict(exp_fall_enroll_w_inc, q6_data_frame)
       1 
16728.11 
predicted_fall_enrollment_withINC = predict(exp_fall_enroll_w_inc, q6_data_frame)

If the current year’s unemployment rate is 9%, the spring high school graduating class has 25,000 students, and per capita income is at its sample mean (INC.cen = 0, an assumption, since the exercise gives no income value), the model predicts a fall enrollment of about 16,728 students (16,728.11).
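
Setting INC.cen = 0 in the new data frame is the same as asking the uncentered model for a prediction at INC = mean(INC). A sketch of that equivalence, assuming a hypothetical ten-row subset `ef10` from the str(ef) output (the predicted value itself will not match the full-data result):

```r
# First ten observations from the str(ef) output; illustrative subset only
ef10 <- data.frame(
  ROLL  = c(5501, 5945, 6629, 7556, 8716, 9369, 9920, 10167, 11084, 12504),
  UNEM  = c(8.1, 7, 7.3, 7.5, 7, 6.4, 6.5, 6.4, 6.3, 7.7),
  HGRAD = c(9552, 9680, 9731, 11666, 14675, 15265, 15484, 15723, 16501, 16890),
  INC   = c(1923, 1961, 1979, 2030, 2112, 2192, 2235, 2351, 2411, 2475)
)
ef10$UNEM.cen  <- ef10$UNEM  - mean(ef10$UNEM)
ef10$HGRAD.cen <- ef10$HGRAD - mean(ef10$HGRAD)
ef10$INC.cen   <- ef10$INC   - mean(ef10$INC)

cen_fit <- lm(ROLL ~ UNEM.cen + HGRAD.cen + INC.cen, data = ef10)
raw_fit <- lm(ROLL ~ UNEM + HGRAD + INC, data = ef10)

# Centered model: income held at its mean means INC.cen = 0
pred_cen <- unname(predict(cen_fit, data.frame(
  UNEM.cen  = 9 - mean(ef10$UNEM),
  HGRAD.cen = 25000 - mean(ef10$HGRAD),
  INC.cen   = 0)))

# Uncentered model: the same scenario on the original scales
pred_raw <- unname(predict(raw_fit, data.frame(
  UNEM = 9, HGRAD = 25000, INC = mean(ef10$INC))))

print(all.equal(pred_cen, pred_raw, tolerance = 1e-6))
```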

7. Compare the two models with anova().

anova(future_enroll_predict, exp_fall_enroll_w_inc)
Analysis of Variance Table

Model 1: ROLL ~ UNEM.cen + HGRAD.cen
Model 2: ROLL ~ UNEM.cen + HGRAD.cen + INC.cen
  Res.Df      RSS Df Sum of Sq     F    Pr(>F)    
1     26 44805568                                 
2     25 11237313  1  33568255 74.68 5.594e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
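
The F statistic in this table can be reproduced by hand from the two residual sums of squares: the drop in RSS per added parameter, divided by the residual mean square of the larger model. The sketch below uses only the numbers printed in the table; the variable names are illustrative.

```r
# Values copied from the anova() comparison table
rss_small <- 44805568  # Model 1: ROLL ~ UNEM.cen + HGRAD.cen
df_small  <- 26
rss_large <- 11237313  # Model 2: adds INC.cen
df_large  <- 25

df_added <- df_small - df_large
f_stat <- ((rss_small - rss_large) / df_added) / (rss_large / df_large)
p_val  <- pf(f_stat, df_added, df_large, lower.tail = FALSE)

print(f_stat)  # about 74.68, matching the F column
print(p_val)   # upper-tail p-value, matching Pr(>F)
```

Because only one term was added, this F also equals the square of the INC.cen t value in the second model's summary (8.642^2 is about 74.68).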

Does including this variable improve the model?

Yes, including per capita income in Albuquerque (INC) improves the model. The ANOVA comparison of the two nested models gives F = 74.68 with a p-value of 5.594e-09, far below the assumed significance level of 0.05 (and below 0.001 as well). Adding INC also cuts the residual standard error from 1313 to 670.4 and raises the adjusted R-squared from 0.8373 to 0.9576, so the larger model fits the observed fall enrollment substantially better.