Module 10

Introduction

This report practices basic statistical modelling in R together with ggplot2, using RStudio and knitr. The goals of this report are to:

Practice statistical modelling
Continue practicing previous concepts

Read in the data and libraries and look at data structure

enrollment = read.csv("enrollmentForecast.csv")
str(enrollment)

## 'data.frame':    29 obs. of  5 variables:
##  $ YEAR : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ ROLL : int  5501 5945 6629 7556 8716 9369 9920 10167 11084 12504 ...
##  $ UNEM : num  8.1 7 7.3 7.5 7 6.4 6.5 6.4 6.3 7.7 ...
##  $ HGRAD: int  9552 9680 9731 11666 14675 15265 15484 15723 16501 16890 ...
##  $ INC  : int  1923 1961 1979 2030 2112 2192 2235 2351 2411 2475 ...

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.2.3

Create scatterplots of ROLL against all other variables

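# ROLL is the response, so it goes on the y-axis of each plot
plotYEAR = ggplot(enrollment, aes(x = YEAR, y = ROLL)) + geom_point()
plotYEAR

plotUNEM = ggplot(enrollment, aes(x = UNEM, y = ROLL)) + geom_point()
plotUNEM

plotHGRAD = ggplot(enrollment, aes(x = HGRAD, y = ROLL)) + geom_point()
plotHGRAD

plotINC = ggplot(enrollment, aes(x = INC, y = ROLL)) + geom_point()
plotINC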

Build a linear model using UNEM, HGRAD, and their interaction to predict ROLL

fit1=lm(ROLL~ HGRAD*UNEM,enrollment)
fit1

## 
## Call:
## lm(formula = ROLL ~ HGRAD * UNEM, data = enrollment)
## 
## Coefficients:
## (Intercept)        HGRAD         UNEM   HGRAD:UNEM  
##  -3.463e+04    2.429e+00    4.251e+03   -1.999e-01
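
In R's formula syntax, `HGRAD*UNEM` is shorthand for both main effects plus their interaction, which is why the fitted model reports an `HGRAD:UNEM` coefficient. A minimal sketch of how to check that expansion without fitting anything (the name `model_terms` is illustrative and not part of the original report):

```r
# Expand the formula used for fit1 and list the terms R will estimate.
# terms() only parses the formula, so no data is needed here.
model_terms <- attr(terms(ROLL ~ HGRAD * UNEM), "term.labels")
print(model_terms)
```

A purely additive model without the interaction would be written `ROLL ~ HGRAD + UNEM`.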

Investigate the above model

summary(fit1)

## 
## Call:
## lm(formula = ROLL ~ HGRAD * UNEM, data = enrollment)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2033.3  -578.8  -335.4   794.2  3644.1 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -3.463e+04  1.679e+04  -2.063   0.0496 *
## HGRAD        2.429e+00  9.431e-01   2.575   0.0163 *
## UNEM         4.251e+03  2.255e+03   1.885   0.0712 .
## HGRAD:UNEM  -1.999e-01  1.263e-01  -1.582   0.1261  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1276 on 25 degrees of freedom
## Multiple R-squared:  0.8626, Adjusted R-squared:  0.8462 
## F-statistic: 52.33 on 3 and 25 DF,  p-value: 6.417e-11

anova(fit1)

## Analysis of Variance Table
## 
## Response: ROLL
##            Df    Sum Sq   Mean Sq  F value    Pr(>F)    
## HGRAD       1 235006809 235006809 144.2588 7.046e-12 ***
## UNEM        1  16680100  16680100  10.2391  0.003717 ** 
## HGRAD:UNEM  1   4078981   4078981   2.5039  0.126136    
## Residuals  25  40726586   1629063                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

plot(fit1, which=1)

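Of the predictors, HGRAD is most closely related to enrollment: it accounts for the largest sum of squares in the ANOVA table (about 79% of the total) and has the smallest p-value in the coefficient summary (0.0163). UNEM adds a smaller but still significant contribution in the ANOVA table, while the HGRAD:UNEM interaction is not significant (p = 0.126).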
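
The sums of squares printed by `anova(fit1)` can be converted into shares of the total variation in ROLL. The sketch below re-enters the printed values by hand (the names `ss`, `share`, and `r_squared` are illustrative). Because `anova()` on a single model reports sequential sums of squares, these shares depend on the order in which the terms enter the formula.

```r
# Sums of squares copied from the anova(fit1) table above
ss <- c("HGRAD" = 235006809, "UNEM" = 16680100,
        "HGRAD:UNEM" = 4078981, "Residuals" = 40726586)

# Share of the total sum of squares attributed to each row
share <- ss / sum(ss)
print(round(share, 3))

# One minus the residual share should agree with the
# Multiple R-squared reported by summary(fit1), up to rounding
r_squared <- 1 - share[["Residuals"]]
print(r_squared)
```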

Estimate the fall enrollment for 9% UNEM and 25,000 HGRAD

newdata = data.frame(HGRAD = 25000, UNEM = 9)
predict(fit1, newdata)

##        1 
## 19368.54
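
The prediction is the fitted equation evaluated at the new values, including the interaction term. As a rough check, the sketch below uses the rounded coefficients printed for `fit1`, so the result differs slightly from `predict()`, which works with full-precision coefficients. The names `b`, `hgrad`, `unem`, and `manual` are illustrative.

```r
# Rounded coefficients copied from the printed fit1 output
b <- c(intercept = -3.463e+04, hgrad = 2.429e+00,
       unem = 4.251e+03, interaction = -1.999e-01)

hgrad <- 25000  # HGRAD value from the question
unem  <- 9      # UNEM value from the question (9%)

# Intercept + main effects + interaction (HGRAD * UNEM)
manual <- b[["intercept"]] +
  b[["hgrad"]] * hgrad +
  b[["unem"]] * unem +
  b[["interaction"]] * hgrad * unem
print(manual)
```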

Second Model with INC

fit2=lm(ROLL~HGRAD+UNEM+INC,enrollment)
fit2

## 
## Call:
## lm(formula = ROLL ~ HGRAD + UNEM + INC, data = enrollment)
## 
## Coefficients:
## (Intercept)        HGRAD         UNEM          INC  
##  -9153.2545       0.4065     450.1245       4.2749

anova(fit1, fit2)

## Analysis of Variance Table
## 
## Model 1: ROLL ~ HGRAD * UNEM
## Model 2: ROLL ~ HGRAD + UNEM + INC
##   Res.Df      RSS Df Sum of Sq F Pr(>F)
## 1     25 40726586                      
## 2     25 11237313  0  29489274

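The two models are not nested and have the same residual degrees of freedom (25), so anova() cannot compute an F statistic or p-value here and reports only the residual sums of squares. The RSS falls from 40,726,586 for the first model to 11,237,313 for the second. Since both models estimate the same number of coefficients, the lower RSS indicates that the model including the income variable fits the data better, although this comparison is not a formal significance test.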
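
One way to test the contribution of INC formally is to compare nested models, where the larger model adds INC to an otherwise identical smaller one. The sketch below is self-contained, so it uses only the first ten rows visible in the `str()` output at the top of this report; its numbers will not match the full 29-row data set. The names `sample10`, `fit_base`, `fit_inc`, `cmp`, and `aic` are illustrative.

```r
# First ten observations, copied from the str(enrollment) output
sample10 <- data.frame(
  ROLL  = c(5501, 5945, 6629, 7556, 8716, 9369, 9920, 10167, 11084, 12504),
  UNEM  = c(8.1, 7, 7.3, 7.5, 7, 6.4, 6.5, 6.4, 6.3, 7.7),
  HGRAD = c(9552, 9680, 9731, 11666, 14675, 15265, 15484, 15723, 16501, 16890),
  INC   = c(1923, 1961, 1979, 2030, 2112, 2192, 2235, 2351, 2411, 2475)
)

# Nested pair: the second model adds INC to the first
fit_base <- lm(ROLL ~ HGRAD + UNEM, data = sample10)
fit_inc  <- lm(ROLL ~ HGRAD + UNEM + INC, data = sample10)

# With nested models, anova() reports an F test for the added term
cmp <- anova(fit_base, fit_inc)
print(cmp)

# AIC does not require the models to be nested
aic <- AIC(fit_base, fit_inc)
print(aic)
```

On the full data, the same pattern would be applied with `data = enrollment` in place of `sample10`.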