This report covers basic statistical modelling alongside ggplot2, using R, RStudio and knitr. It loads the enrollment data, plots enrollment against candidate predictors, fits linear models, makes a prediction, and compares two models.
enrollment = read.csv("enrollmentForecast.csv")
str(enrollment)
## 'data.frame': 29 obs. of 5 variables:
## $ YEAR : int 1 2 3 4 5 6 7 8 9 10 ...
## $ ROLL : int 5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
## $ UNEM : num 8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
## $ HGRAD: int 9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
## $ INC : int 1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.2.3
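# Scatter plots with enrollment (ROLL, the response) on the y axis and each candidate predictor on the x axis
plotYEAR = ggplot(enrollment, aes(x = YEAR, y = ROLL)) + geom_point()
plotYEAR
plotUNEM = ggplot(enrollment, aes(x = UNEM, y = ROLL)) + geom_point()
plotUNEM
plotHGRAD = ggplot(enrollment, aes(x = HGRAD, y = ROLL)) + geom_point()
plotHGRAD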
fit1=lm(ROLL~ HGRAD*UNEM,enrollment)
fit1
##
## Call:
## lm(formula = ROLL ~ HGRAD * UNEM, data = enrollment)
##
## Coefficients:
## (Intercept)        HGRAD         UNEM   HGRAD:UNEM
##  -3.463e+04    2.429e+00    4.251e+03   -1.999e-01
summary(fit1)
##
## Call:
## lm(formula = ROLL ~ HGRAD * UNEM, data = enrollment)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2033.3  -578.8  -335.4   794.2  3644.1
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.463e+04  1.679e+04  -2.063   0.0496 *
## HGRAD        2.429e+00  9.431e-01   2.575   0.0163 *
## UNEM         4.251e+03  2.255e+03   1.885   0.0712 .
## HGRAD:UNEM  -1.999e-01  1.263e-01  -1.582   0.1261
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1276 on 25 degrees of freedom
## Multiple R-squared: 0.8626, Adjusted R-squared: 0.8462
## F-statistic: 52.33 on 3 and 25 DF, p-value: 6.417e-11
anova(fit1)
## Analysis of Variance Table
##
## Response: ROLL
##            Df    Sum Sq   Mean Sq  F value    Pr(>F)
## HGRAD       1 235006809 235006809 144.2588 7.046e-12 ***
## UNEM        1  16680100  16680100  10.2391  0.003717 **
## HGRAD:UNEM  1   4078981   4078981   2.5039  0.126136
## Residuals  25  40726586   1629063
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
plot(fit1, which=1)
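The number of high school graduates (HGRAD) is the variable most closely related to enrollment: in the ANOVA table it accounts for by far the largest sum of squares (F = 144.26, p = 7.05e-12), and it is the only predictor significant at the 0.05 level in the model summary. Unemployment (UNEM) contributes less, and the HGRAD:UNEM interaction is not significant (p = 0.126).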
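As a rough check on which term carries the most explanatory weight, the sequential sums of squares from the ANOVA table can be expressed as shares of the total. The sketch below simply copies the printed values into a vector; the names `ss`, `share`, and `f_hgrad` are introduced here for illustration and are not part of the original analysis. Because these are sequential sums of squares, the shares depend on the order in which the terms enter the model.

```r
# Sequential sums of squares copied from the anova(fit1) output above.
# The vector name `ss` is illustrative, not part of the original analysis.
ss = c(HGRAD = 235006809, UNEM = 16680100, "HGRAD:UNEM" = 4078981,
       Residuals = 40726586)

# Share of the total sum of squares attributed to each row of the table
share = ss / sum(ss)
print(round(share, 3))

# The F value for a term is its mean square divided by the residual mean
# square. Each model term has 1 degree of freedom; the residuals have 25.
f_hgrad = ss[["HGRAD"]] / (ss[["Residuals"]] / 25)
print(f_hgrad)
```

Here HGRAD accounts for roughly 79% of the total sum of squares, and the recomputed F value agrees with the 144.2588 shown in the table.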
fit1.df = data.frame(HGRAD = 25000, UNEM = 9)
predict(fit1,fit1.df)
## 1
## 19368.54
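The prediction above is the fitted equation evaluated at the new values. A minimal sketch of that arithmetic follows, using the rounded coefficients printed earlier for fit1, so the result differs slightly from the value returned by predict(), which works at full precision. The names `b`, `hgrad`, `unem`, and `pred_by_hand` are introduced here for illustration.

```r
# Rounded coefficients copied from the printed fit1 output above.
# All names in this block are illustrative.
b = c(intercept = -3.463e+04, HGRAD = 2.429e+00,
      UNEM = 4.251e+03, HGRAD_UNEM = -1.999e-01)

# New observation, matching fit1.df
hgrad = 25000
unem = 9

# ROLL = b0 + b1*HGRAD + b2*UNEM + b3*(HGRAD*UNEM)
pred_by_hand = b[["intercept"]] +
  b[["HGRAD"]] * hgrad +
  b[["UNEM"]] * unem +
  b[["HGRAD_UNEM"]] * hgrad * unem
print(pred_by_hand)
```

With these rounded coefficients the hand calculation gives about 19376.5, within ten students of the 19368.54 reported by predict().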
fit2=lm(ROLL~HGRAD+UNEM+INC,enrollment)
fit2
##
## Call:
## lm(formula = ROLL ~ HGRAD + UNEM + INC, data = enrollment)
##
## Coefficients:
## (Intercept)        HGRAD         UNEM          INC
##  -9153.2545       0.4065     450.1245       4.2749
anova(fit1, fit2)
## Analysis of Variance Table
##
## Model 1: ROLL ~ HGRAD * UNEM
## Model 2: ROLL ~ HGRAD + UNEM + INC
##   Res.Df      RSS Df Sum of Sq F Pr(>F)
## 1     25 40726586
## 2     25 11237313  0  29489274
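Both models have 25 residual degrees of freedom, and the RSS falls from 40,726,586 for the first model to 11,237,313 for the second, so replacing the interaction term with the income variable gives a much closer fit to the data. Because the two models are not nested (neither is a special case of the other), anova() cannot compute an F statistic or p-value here, so this comparison is descriptive rather than a formal test.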
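Since both models leave 25 residual degrees of freedom, their fit can also be compared through quantities derived directly from the RSS. The sketch below copies the two RSS values from the table above and takes the total sum of squares as the sum of the rows in the earlier ANOVA table for fit1. All variable names are illustrative, and the figures for fit2 are derived here rather than taken from a summary(fit2) call.

```r
# RSS values copied from the anova(fit1, fit2) table above.
rss = c(fit1 = 40726586, fit2 = 11237313)
df_resid = 25          # residual degrees of freedom, the same for both models
n = 29                 # number of observations, from str(enrollment)

# Total sum of squares: all rows of the anova(fit1) table added together
tss = 235006809 + 16680100 + 4078981 + 40726586

rse = sqrt(rss / df_resid)                          # residual standard error
r2 = 1 - rss / tss                                  # multiple R-squared
adj_r2 = 1 - (rss / df_resid) / (tss / (n - 1))     # adjusted R-squared

print(round(rse, 1))
print(round(r2, 4))
print(round(adj_r2, 4))
```

For fit1 this reproduces the residual standard error (1276) and the R-squared values (0.8626 and 0.8462) reported by summary(fit1), which gives some confidence that the same formulas describe fit2 correctly.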