Module 10 Assignment

Author

MJ Kemp

Read in and prepare data

enrollment.forecast = read.csv("enrollmentForecast.csv")
library(ggplot2)

View data structure

str(enrollment.forecast)

'data.frame':   29 obs. of  5 variables:
 $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
 $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
 $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
 $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

summary(enrollment.forecast)

      YEAR         ROLL            UNEM            HGRAD            INC      
 Min.   : 1   Min.   : 5501   Min.   : 5.700   Min.   : 9552   Min.   :1923  
 1st Qu.: 8   1st Qu.:10167   1st Qu.: 7.000   1st Qu.:15723   1st Qu.:2351  
 Median :15   Median :14395   Median : 7.500   Median :17203   Median :2863  
 Mean   :15   Mean   :12707   Mean   : 7.717   Mean   :16528   Mean   :2729  
 3rd Qu.:22   3rd Qu.:14969   3rd Qu.: 8.200   3rd Qu.:18266   3rd Qu.:3127  
 Max.   :29   Max.   :16081   Max.   :10.100   Max.   :19800   Max.   :3345

Make scatterplots of ROLL against other variables

ggplot(enrollment.forecast, aes(x=ROLL, y=UNEM, col=YEAR)) + geom_point()

ggplot(enrollment.forecast, aes(x=ROLL, y=HGRAD, col=YEAR)) + geom_point()

ggplot(enrollment.forecast, aes(x=ROLL, y=INC, col=YEAR)) + geom_point()

Build linear model using unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRAD

lm(ROLL ~ UNEM + HGRAD, data = enrollment.forecast)


Call:
lm(formula = ROLL ~ UNEM + HGRAD, data = enrollment.forecast)

Coefficients:
(Intercept)         UNEM        HGRAD  
 -8255.7511     698.2681       0.9423

enrollment.forecast$UNEM.cen = enrollment.forecast$UNEM - mean(enrollment.forecast$UNEM)
enrollment.forecast$HGRAD.cen = enrollment.forecast$HGRAD - mean(enrollment.forecast$HGRAD)
fit1 = lm(ROLL ~ UNEM.cen + HGRAD.cen, data = enrollment.forecast)

Use the summary() and anova() functions to investigate the model. Find variable most closely related to enrollment.

summary(fit1)


Call:
lm(formula = ROLL ~ UNEM.cen + HGRAD.cen, data = enrollment.forecast)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.271e+04  2.438e+02  52.127  < 2e-16 ***
UNEM.cen    6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD.cen   9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,    Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11

cor.test(enrollment.forecast$ROLL, enrollment.forecast$UNEM.cen)


    Pearson's product-moment correlation

data:  enrollment.forecast$ROLL and enrollment.forecast$UNEM.cen
t = 2.2097, df = 27, p-value = 0.03579
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.02899694 0.66278610
sample estimates:
      cor 
0.3913436

cor.test(enrollment.forecast$ROLL, enrollment.forecast$HGRAD.cen)


    Pearson's product-moment correlation

data:  enrollment.forecast$ROLL and enrollment.forecast$HGRAD.cen
t = 10.159, df = 27, p-value = 1.012e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7774773 0.9475998
sample estimates:
      cor 
0.8902939

anova(fit1)

Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM.cen   1  45407767  45407767  26.349 2.366e-05 ***
HGRAD.cen  1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on these functions, HGRAD is the most related variable - as is seen thorugh having the largest F-stat and smallest p-value.

Make residual plot and check for any bias in the model

hist(residuals(fit1))

plot(fit1, which = 1)

plot(fit1, which = 4)

Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students

fit2 = lm(ROLL ~ UNEM.cen, data = enrollment.forecast)
summary(fit2)


Call:
lm(formula = ROLL ~ UNEM.cen, data = enrollment.forecast)

Residuals:
    Min      1Q  Median      3Q     Max 
-7640.0 -1046.5   602.8  1934.3  4187.2 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  12707.0      566.3   22.44   <2e-16 ***
UNEM.cen      1133.8      513.1    2.21   0.0358 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3049 on 27 degrees of freedom
Multiple R-squared:  0.1531,    Adjusted R-squared:  0.1218 
F-statistic: 4.883 on 1 and 27 DF,  p-value: 0.03579

expect.unemploy = 9 - mean(enrollment.forecast$UNEM)
12707 + (1133.8 * expect.unemploy)

[1] 14161.39

predict(fit2, UNEM.cen = expect.unemploy)

       1        2        3        4        5        6        7        8 
13141.02 11893.81 12233.96 12460.72 11893.81 11213.51 11326.90 11213.51 
       9       10       11       12       13       14       15       16 
11100.13 12687.49 13254.40 12460.72 12347.34 13254.40 15408.67 14388.22 
      17       18       19       20       21       22       23       24 
12687.49 10419.83 11326.90 12460.72 12233.96 14388.22 15408.67 12460.72 
      25       26       27       28       29 
13934.69 14274.84 13934.69 12800.87 11893.81

new.unemploy = data.frame(UNEM.cen = expect.unemploy)
predict(fit2, new.unemploy, interval = "prediction")

       fit      lwr      upr
1 14161.46 7655.725 20667.19

fit3 = lm(ROLL ~ HGRAD.cen, data = enrollment.forecast)
summary(fit3)


Call:
lm(formula = ROLL ~ HGRAD.cen, data = enrollment.forecast)

Residuals:
    Min      1Q  Median      3Q     Max 
-2156.8 -1175.5  -301.0   649.8  3089.0 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.271e+04  2.802e+02   45.35  < 2e-16 ***
HGRAD.cen   9.898e-01  9.743e-02   10.16 1.01e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1509 on 27 degrees of freedom
Multiple R-squared:  0.7926,    Adjusted R-squared:  0.7849 
F-statistic: 103.2 on 1 and 27 DF,  p-value: 1.012e-10

expect.hgraduate = 25000 - mean(enrollment.forecast$HGRAD)
1.271e+04 + (9.898e-01 * expect.hgraduate)

[1] 21095.45

predict(fit3, HGRAD.cen = expect.hgraduate)

        1         2         3         4         5         6         7         8 
 5802.023  5928.718  5979.198  7894.469 10872.790 11456.775 11673.542 11910.105 
        9        10        11        12        13        14        15        16 
12680.173 13065.207 13375.016 13873.877 14270.789 14427.178 15458.554 14385.606 
       17        18        19        20        21        22        23        24 
15150.725 15653.546 15945.538 15694.128 15269.501 14929.999 13978.797 13474.986 
       25        26        27        28        29 
12932.573 12925.645 13099.850 13402.730 12991.962

new.hgraduate = data.frame(HGRAD.cen = expect.hgraduate)
predict(fit3, new.hgraduate, interval = "prediction")

       fit      lwr      upr
1 21092.52 17516.71 24668.33

plot(ROLL ~ UNEM.cen + HGRAD.cen, data = enrollment.forecast)

Build a second model which includes per capita income (INC)

fit4 = lm(ROLL ~ UNEM + HGRAD + INC, data = enrollment.forecast)
summary(fit4)


Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enrollment.forecast)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,    Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16

Compare the two models with anova(). Does including this variable improve the model?

summary(fit1)


Call:
lm(formula = ROLL ~ UNEM.cen + HGRAD.cen, data = enrollment.forecast)

Residuals:
    Min      1Q  Median      3Q     Max 
-2102.2  -861.6  -349.4   374.5  3603.5 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1.271e+04  2.438e+02  52.127  < 2e-16 ***
UNEM.cen    6.983e+02  2.244e+02   3.111  0.00449 ** 
HGRAD.cen   9.423e-01  8.613e-02  10.941 3.16e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1313 on 26 degrees of freedom
Multiple R-squared:  0.8489,    Adjusted R-squared:  0.8373 
F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11

anova(fit1)

Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM.cen   1  45407767  45407767  26.349 2.366e-05 ***
HGRAD.cen  1 206279143 206279143 119.701 3.157e-11 ***
Residuals 26  44805568   1723291                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary(fit4)


Call:
lm(formula = ROLL ~ UNEM + HGRAD + INC, data = enrollment.forecast)

Residuals:
     Min       1Q   Median       3Q      Max 
-1148.84  -489.71    -1.88   387.40  1425.75 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 670.4 on 25 degrees of freedom
Multiple R-squared:  0.9621,    Adjusted R-squared:  0.9576 
F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16

anova(fit4)

Analysis of Variance Table

Response: ROLL
          Df    Sum Sq   Mean Sq F value    Pr(>F)    
UNEM       1  45407767  45407767  101.02 2.894e-10 ***
HGRAD      1 206279143 206279143  458.92 < 2.2e-16 ***
INC        1  33568255  33568255   74.68 5.594e-09 ***
Residuals 25  11237313    449493                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As the R-squared value increased in fit4 compared to fit1, yes including INC does improve the model