Read in the enrollment data. Look at the data structure
library(ggplot2)
unm <- read.csv("enrollmentForecast.csv")
str(unm)
## 'data.frame': 29 obs. of 5 variables:
## $ YEAR : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
## $ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
## $ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
## $ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...
There are five columns: YEAR (starting at 1961), ROLL (undergraduate enrollment), UNEM (NM unemployment), HGRAD (NM high school graduates), and INC (per capita income in Albuquerque).
Make scatterplots of ROLL against the other variables
ggplot(unm, aes(x=YEAR, y=ROLL)) + geom_point() + ggtitle("Roll ~ Year") +
xlab("Year (1961-1989)") + ylab("UNM Fall Undergraduate Enrollment") + theme_bw()
The relationship between YEAR and ROLL is increasing, and somewhat linear, but there is a curve as the years progress and more students enroll. It appears that the rate of increase in enrollment declines after 1972. After this point the enrollment continues to increase, but at a lower rate.
ggplot(unm, aes(x=UNEM, y=ROLL)) + geom_point()+ ggtitle("Roll ~ Unemployment") +
xlab("January Unemployment rate (%)") +
ylab("UNM Fall Undergraduate Enrollment") + theme_bw()
The data points are somewhat randomly distributed. There does not appear to be an obvious relationship between enrollment and the unemployment rate.
ggplot(unm, aes(x=HGRAD, y=ROLL)) + geom_point() +
ggtitle("Roll ~ High School Graduates") +
xlab("Spring NM High School Graduates") +
ylab("UNM Fall Undergraduate Enrollment") + theme_bw()
There appears to be an increasing relationship between the number of high school graduates and enrollment. It is sort of linear, with a cluster of data at higher values of graduates and enrollment.
ggplot(unm, aes(x=INC, y=ROLL)) + geom_point()+ ggtitle("Roll ~ Income") +
xlab("Per Capita income in Albuquerque") +
ylab("UNM Fall Undergraduate Enrollment") + theme_bw()
There is an increasing relationship between per capita income and enrollment. The relationship appears to be somewhat linear. There is a decrease in the slope for higher incomes.
Build a linear model using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD) to predict the fall enrollment (ROLL), i.e. ROLL ~ UNEM + HGRAD
model_unem_hgrad = lm(ROLL ~ UNEM + HGRAD, unm)
Use the summary() and anova() functions to investigate the model
summary(model_unem_hgrad)
##
## Call:
## lm(formula = ROLL ~ UNEM + HGRAD, data = unm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2102.2 -861.6 -349.4 374.5 3603.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.256e+03 2.052e+03 -4.023 0.00044 ***
## UNEM 6.983e+02 2.244e+02 3.111 0.00449 **
## HGRAD 9.423e-01 8.613e-02 10.941 3.16e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1313 on 26 degrees of freedom
## Multiple R-squared: 0.8489, Adjusted R-squared: 0.8373
## F-statistic: 73.03 on 2 and 26 DF, p-value: 2.144e-11
The adjusted R-squared value is 0.84, with an F-statistic of 73.03 and a corresponding p-value of 2.14e-11. Both UNEM and HGRAD have significant p-values indicating that both variables are useful in the model. These are all good results. However the median residual is not close to zero and the model intercept is -8256. Centering the data may help the model be more meaningful.
anova(model_unem_hgrad)
## Analysis of Variance Table
##
## Response: ROLL
## Df Sum Sq Mean Sq F value Pr(>F)
## UNEM 1 45407767 45407767 26.349 2.366e-05 ***
## HGRAD 1 206279143 206279143 119.701 3.157e-11 ***
## Residuals 26 44805568 1723291
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA gives an F-statistic of 26 for unemployment and 120 for high school graduates. Both variables appear to be significant because the p-values for both are very close to zero.
Which variable is the most closely related to enrollment?
The number of spring high school graduates is the more significant variable because of the higher F-statistic value and lower p-value than unemployment.
Make a residual plot and check for any bias in the model
hist(residuals(model_unem_hgrad))
The distribution of the residuals is centered around -1000 and not zero, nor are they normally distributed. They are positively skewed.
plot(model_unem_hgrad, which = 1)
Plotting the residuals against the fitted values, we can see there is a cyclical relationship instead of no trend.
plot(model_unem_hgrad, which = 4)
Looking at the Cook’s distance, there are x values that may be biasing the model.
There appears to be bias in the model because the residuals are not normally distributed and centered about zero.
Use the predict() function to estimate the expected fall enrollment, if the current year’s unemployment rate is 9% and the size of the spring high school graduating class is 25,000 students
newgrads = data.frame(HGRAD = 25000, UNEM = 9.0)
newdata = predict(model_unem_hgrad, newgrads, interval = "prediction")
newdata
## fit lwr upr
## 1 21585.58 18452.36 24718.8
For a high school graduating class of 25,000 and and 9% unemployment. The model predicts an fall enrollment of 21,586 students with a 95% confidence interval of (18,452 and 24,719 students).
Build a second model which includes per capita income (INC).
lm2 = lm(ROLL ~ UNEM + HGRAD + INC, unm)
Compare the two models with anova().
anova(lm2)
## Analysis of Variance Table
##
## Response: ROLL
## Df Sum Sq Mean Sq F value Pr(>F)
## UNEM 1 45407767 45407767 101.02 2.894e-10 ***
## HGRAD 1 206279143 206279143 458.92 < 2.2e-16 ***
## INC 1 33568255 33568255 74.68 5.594e-09 ***
## Residuals 25 11237313 449493
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Adding the income data results in a model that has high F-statistics for each variable. Surprisingly, income appears to be less significant than the unemployment rate, although both have high F-statistics values and very low p-values.
Does including this variable improve the model?
summary(lm2)
##
## Call:
## lm(formula = ROLL ~ UNEM + HGRAD + INC, data = unm)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1148.84 -489.71 -1.88 387.40 1425.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.153e+03 1.053e+03 -8.691 5.02e-09 ***
## UNEM 4.501e+02 1.182e+02 3.809 0.000807 ***
## HGRAD 4.065e-01 7.602e-02 5.347 1.52e-05 ***
## INC 4.275e+00 4.947e-01 8.642 5.59e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 670.4 on 25 degrees of freedom
## Multiple R-squared: 0.9621, Adjusted R-squared: 0.9576
## F-statistic: 211.5 on 3 and 25 DF, p-value: < 2.2e-16
Adding the income data increases the adjusted R-squared value from 0.84 to 0.96 and increases the F-statistic from 73 to 211. The summary() function shows that income has a higher influence than the unemployment rate on the model, although both are statistically significant.
These results show that including the income data does improve the model.