Module 10

Exercises

Read in the enrollment data. Look at the data structure

library(ggplot2)
unm <- read.csv("enrollmentForecast.csv")
str(unm)

## 'data.frame':    29 obs. of  5 variables:
##  $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
##  $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
##  $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
##  $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

There are five columns: YEAR (starting at 1961), ROLL (undergraduate enrollment), UNEM (NM unemployment), HGRAD (NM high school graduates), and INC (per capita income in Albuquerque).

Make scatterplots of ROLL against the other variables

ggplot(unm, aes(x=YEAR, y=ROLL)) + geom_point() + ggtitle("Roll ~ Year") +
  xlab("Year (1961-1989)") + ylab("UNM Fall Undergraduate Enrollment") + theme_bw()

The relationship between YEAR and ROLL is increasing, and somewhat linear, but there is a curve as the years progress and more students enroll. It appears that the rate of increase in enrollment declines after 1972. After this point the enrollment continues to increase, but at a lower rate.

ggplot(unm, aes(x=UNEM, y=ROLL)) + geom_point()+ ggtitle("Roll ~ Unemployment") +
  xlab("January Unemployment rate (%)") + 
  ylab("UNM Fall Undergraduate Enrollment") + theme_bw()

The data points are somewhat randomly distributed. There does not appear to be an obvious relationship between enrollment and the unemployment rate.

ggplot(unm, aes(x=HGRAD, y=ROLL)) + geom_point() + 
  ggtitle("Roll ~ High School Graduates") +
  xlab("Spring NM High School Graduates") + 
  ylab("UNM Fall Undergraduate Enrollment") + theme_bw()

There appears to be an increasing relationship between the number of high school graduates and enrollment. It is sort of linear, with a cluster of data at higher values of graduates and enrollment.

ggplot(unm, aes(x=INC, y=ROLL)) + geom_point()+ ggtitle("Roll ~ Income") +
  xlab("Per Capita income in Albuquerque") + 
  ylab("UNM Fall Undergraduate Enrollment") + theme_bw()

There is an increasing relationship between per capita income and enrollment. The relationship appears to be somewhat linear. There is a decrease in the slope for higher incomes.

Build a linear model using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRAD

model_unem_hgrad = lm(ROLL ~ UNEM + HGRAD, unm)

Use the summary() and anova() functions to investigate the model

summary(model_unem_hgrad)

## 
## Call:
## lm(formula = ROLL ~ UNEM + HGRAD, data = unm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2102.2  -861.6  -349.4   374.5  3603.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.256e+03  2.052e+03  -4.023  0.00044 ***
## UNEM         6.983e+02  2.244e+02   3.111  0.00449 ** 
## HGRAD        9.423e-01  8.613e-02  10.941 3.16e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1313 on 26 degrees of freedom
## Multiple R-squared:  0.8489, Adjusted R-squared:  0.8373 
## F-statistic: 73.03 on 2 and 26 DF,  p-value: 2.144e-11

The adjusted R-squared value is 0.84, with an F-statistic of 73.03 and a corresponding p-value of 2.14e-11. Both UNEM and HGRAD have significant p-values indicating that both variables are useful in the model. These are all good results. However the median residual is not close to zero and the model intercept is -8256. Centering the data may help the model be more meaningful.

anova(model_unem_hgrad)

## Analysis of Variance Table
## 
## Response: ROLL
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## UNEM       1  45407767  45407767  26.349 2.366e-05 ***
## HGRAD      1 206279143 206279143 119.701 3.157e-11 ***
## Residuals 26  44805568   1723291                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA gives an F-statistic of 26 for unemployment and 120 for high school graduates. Both variables appear to be significant because the p-values for both are very close to zero.

Which variable is the most closely related to enrollment?

The number of spring high school graduates is the more significant variable because of the higher F-statistic value and lower p-value than unemployment.

Make a residual plot and check for any bias in the model

hist(residuals(model_unem_hgrad))

The distribution of the residuals is centered around -1000 and not zero, nor are they normally distributed. They are positively skewed.

plot(model_unem_hgrad, which = 1)

Plotting the residuals against the fitted values, we can see there is a cyclical relationship instead of no trend.

plot(model_unem_hgrad, which = 4)

Looking at the Cook’s distance, there are x values that may be biasing the model.

There appears to be bias in the model because the residuals are not normally distributed and centered about zero.

Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students

newgrads = data.frame(HGRAD = 25000, UNEM = 9.0)
newdata = predict(model_unem_hgrad, newgrads, interval = "prediction")
newdata

##        fit      lwr     upr
## 1 21585.58 18452.36 24718.8

For a high school graduating class of 25,000 and and 9% unemployment. The model predicts an fall enrollment of 21,586 students with a 95% confidence interval of (18,452 and 24,719 students).

Build a second model which includes per capita income (INC).

lm2 = lm(ROLL ~ UNEM + HGRAD + INC, unm)

Compare the two models with anova().

anova(lm2)

## Analysis of Variance Table
## 
## Response: ROLL
##           Df    Sum Sq   Mean Sq F value    Pr(>F)    
## UNEM       1  45407767  45407767  101.02 2.894e-10 ***
## HGRAD      1 206279143 206279143  458.92 < 2.2e-16 ***
## INC        1  33568255  33568255   74.68 5.594e-09 ***
## Residuals 25  11237313    449493                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the income data results in a model that has high F-statistics for each variable. Surprisingly, income appears to be less significant than the unemployment rate, although both have high F-statistics values and very low p-values.

Does including this variable improve the model?

summary(lm2)

## 
## Call:
## lm(formula = ROLL ~ UNEM + HGRAD + INC, data = unm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1148.84  -489.71    -1.88   387.40  1425.75 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -9.153e+03  1.053e+03  -8.691 5.02e-09 ***
## UNEM         4.501e+02  1.182e+02   3.809 0.000807 ***
## HGRAD        4.065e-01  7.602e-02   5.347 1.52e-05 ***
## INC          4.275e+00  4.947e-01   8.642 5.59e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 670.4 on 25 degrees of freedom
## Multiple R-squared:  0.9621, Adjusted R-squared:  0.9576 
## F-statistic: 211.5 on 3 and 25 DF,  p-value: < 2.2e-16

Adding the income data increases the adjusted R-squared value from 0.84 to 0.96 and increases the F-statistic from 73 to 211. The summary() function shows that income has a higher influence than the unemployment rate on the model, although both are statistically significant.

These results show that including the income data does improve the model.

Module 10

Sherlyn Hilton

2023-06-09

Exercises