Chapter 1 - Introduction


Topic

In this report I evaluate the influence of race, gender, and other variables on the average SAT math scores of New York City schools in 2012. The SAT plays an important role in shaping the academic and career paths of many students, and mathematical achievement seems particularly important in the current environment. Technology, engineering, finance, and science have become some of the most lucrative sectors in which to work, and a strong background in math is necessary to pursue these fields.

Data Source

I gathered the SAT and school data from NYC Open Data, a repository for public data generated by various New York City agencies. I linked the two datasets by a common identifier, DBN (District Borough Number), which is unique to each school.

The file 2012 SAT Results can be found at:

2012 SAT Results

The file 2006-2012 School Demographics and Accountability Snapshot can be found at:

2006-2012 Demographics
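
The join itself is straightforward; below is a minimal sketch of how the two files could be combined on DBN. The file names, the schoolyear column, and its coding are assumptions, since the actual cleaning steps are not shown in this report.

scores <- read.csv("2012_SAT_Results.csv", stringsAsFactors = FALSE)        # hypothetical file name
demo   <- read.csv("2006-2012_Demographics.csv", stringsAsFactors = FALSE)  # hypothetical file name
demo2012 <- demo[demo$schoolyear == "20112012", ]  # keep only the 2011-2012 rows (assumed coding)
SAT_raw  <- merge(scores, demo2012, by = "DBN")    # join the two tables on the school identifier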

Variables

I have evaluated data from 412 NYC schools with the following variables:

Variable Definition (all data is for the year 2012)
DBN District Borough Number - school identifier
Borough New York City borough name
NumTestTakers Number of SAT test takers
ReadingAvgScore Average SAT reading score
MathAvgScore Average SAT math score
WritingAvgScore Average SAT writing score
AsianPct Percentage of school students who were Asian
BlackPct Percentage of school students who were Black
HispanicPct Percentage of school students who were Hispanic
WhitePct Percentage of school students who were White
MalePct Percentage of school students who were Male
FemalePct Percentage of school students who were Female

Data View

Below is a small subset of the data:

library(knitr)
library(kableExtra)
SAT<- read.csv("~/Baruch/Baruch Class Projects/9700 Regression/SAT Data.csv")
colnames(SAT)[1]="DBN"
kable(SAT[1:10, c(1, 4:13)]) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
DBN NumTestTakers ReadingAvgScore MathAvgScore WritingAvgScore AsianPct BlackPct HispanicPct WhitePct MalePct FemalePct
01M292 29 355 404 363 14.0 29.1 53.8 1.7 61.4 38.6
01M448 91 383 423 366 29.2 22.6 45.9 2.3 57.4 42.6
01M450 70 377 402 370 9.7 23.9 55.4 10.4 54.7 45.3
01M458 7 414 401 359 2.2 34.4 59.4 3.6 43.3 56.7
01M509 44 390 433 384 9.3 31.6 56.9 1.6 46.3 53.7
01M515 112 332 557 316 84.7 5.2 8.9 0.9 53.7 46.3
01M539 159 522 574 525 27.8 11.7 14.2 44.9 49.2 50.8
01M650 18 417 418 411 0.5 45.4 49.5 4.1 39.9 60.1
01M696 130 624 604 628 15.1 15.1 18.2 49.8 31.3 68.7
02M047 16 395 400 387 1.7 32.2 59.2 6.3 42.5 57.5

Chapter 2 - A Simple Regression Model


I chose to take a closer look at how the percentage of Asian students relates to the average SAT math score, which we can visualize with a scatterplot.

library(tidyverse)
library(ggthemes)
ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=MathAvgScore), color="blue", size=2)+
  labs(x="Percentage of Asian Students", y="Average SAT Math Score")+theme_light()

Analysis of Scatterplot

Heteroscedasticity - We can see that there is a moderate level of heteroscedasticity present. There is more variability in the average SAT math score as we move away from the cluster of values on the left.

Curvature - Fortunately there is no evidence of curvature in the scatterplot.

Leverage Points - The points on the right are far from \(\bar{x}\) and the cluster of values on the left, so they would be considered leverage points.

Outliers - A few points have a high average SAT math score (near 700) or a low one (near 300), but they do not look extreme relative to the rest of the data, so I would not consider them outliers.

Influential Points - Although there are leverage points they would not be considered influential. Their removal would not dramatically alter the slope of the \(\hat{y}\) line.
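
These visual judgments can be checked numerically. The sketch below (which refits the simple model ahead of the formal fit in the next section) uses base R's hatvalues() and cooks.distance() to look at leverage and influence.

fit <- lm(MathAvgScore ~ AsianPct, data = SAT)     # simple model, refit here for illustration

# Leverage: hat values well above the average p/n = 2/412 flag leverage points
head(sort(hatvalues(fit), decreasing = TRUE))

# Influence: Cook's distances near or above 1 would signal influential points
head(sort(cooks.distance(fit), decreasing = TRUE))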


Although sometimes transformations can help with heteroscedasticity, I did not find an adequate solution in this case. Some examples are presented below.

LOG Transformations

library(gridExtra)

ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=log(MathAvgScore)), color="purple", size=2)+
  labs(x="LOG Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light() ->p1

ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=MathAvgScore), color="purple", size=2)+
  labs(x="Log Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2

grid.arrange(p1, p2, ncol = 2)

ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=log(MathAvgScore)), color="purple", size=2)+
  labs(x="Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light()


We can see that LOG transformations of the x values, the y values, or both did not solve the problem of heteroscedasticity, and in some cases made it worse.
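
As an additional check (not part of the original analysis), the Breusch-Pagan test from the lmtest package can quantify the heteroscedasticity before and after a log transformation of the response; a small p-value indicates non-constant error variance.

library(lmtest)
fit <- lm(MathAvgScore ~ AsianPct, data = SAT)
bptest(fit)                                           # original scale
bptest(lm(log(MathAvgScore) ~ AsianPct, data = SAT))  # log-transformed response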

Square Root Transformations

ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=sqrt(MathAvgScore)), color="green4", size=2)+
  labs(x="SQRT Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light() ->p1

ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=MathAvgScore), color="green4", size=2)+
  labs(x="SQRT Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2

grid.arrange(p1, p2, ncol = 2)

ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=sqrt(MathAvgScore)), color="green4", size=2)+
  labs(x="Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light()


Similarly, square root transformations were not helpful.

The Linear Regression Model

\[Y_i = \beta_0+\beta_1x_i+\epsilon\]

  1. The model does not view \(Y_i\) as a datapoint. The model views \(Y_i\) as a sub-population of values that share the same x value. For this to be possible the model assumes that we could get many fresh samples. In this regression \(Y_i\) represents a sub-population of average SAT math scores for a given percentage of Asian students, \(x_i\) .

  2. Like any sub-population, \(Y_i\) has a population average and a population variance. The first component, \(\beta_0+\beta_1x_i\), supplies the sub-population average. \(\beta_0\) represents the model intercept and \(\beta_1\) represents the model slope.

  3. The second component, \(\epsilon\), supplies the variability. We can think of \(\epsilon\) as the name of a bell-shaped population of deviations that happens to be centered at zero. The model assumes that this bell curve is added to each sub-population average and that every bell curve has the same standard deviation.
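
The three points above can be made concrete with a small simulation (purely illustrative; the parameter values are arbitrary, chosen near the fitted values reported below).

set.seed(1)
beta0 <- 385; beta1 <- 3; sigma <- 47        # arbitrary "true" parameters for illustration
x <- rep(c(10, 40, 70), each = 1000)         # three sub-populations of Y, one per x value
y <- beta0 + beta1 * x + rnorm(length(x), mean = 0, sd = sigma)

tapply(y, x, mean)   # each sub-population is centered near beta0 + beta1*x: ~415, ~505, ~595
tapply(y, x, sd)     # and every sub-population has roughly the same spread: ~47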

R Output for the fitted model

lm.fit = lm(MathAvgScore~AsianPct, data=SAT)
summary(lm.fit)

Call:
lm(formula = MathAvgScore ~ AsianPct, data = SAT)

Residuals:
     Min       1Q   Median       3Q      Max 
-164.163  -25.101   -6.255   19.754  212.155 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 385.0746     2.7485  140.10   <2e-16 ***
AsianPct      3.0401     0.1565   19.43   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47 on 410 degrees of freedom
Multiple R-squared:  0.4793,    Adjusted R-squared:  0.478 
F-statistic: 377.3 on 1 and 410 DF,  p-value: < 2.2e-16
temp_var  <- predict(lm.fit, interval = "prediction")
new_df <-cbind(SAT, temp_var)
ggplot(new_df, aes(AsianPct,MathAvgScore))+geom_point()+
  geom_line(aes(y=lwr),color="red",linetype="dashed")+
  geom_line(aes(y=upr),color="red",linetype="dashed")+
  geom_smooth(method = lm, se=TRUE)+
  ggtitle("Fit Plot.  Blue Band = 95% Confidence Interval.  Red Lines = 95% Prediction Interval")

If we examine the regression plot above we can see that several lines are presented. The \(\hat{y}\) line in the center provides the mean value of y for a given x. Surrounding the \(\hat{y}\) line is a shaded band representing the 95% confidence interval; its upper and lower values give a range in which we are 95% confident that the sub-population mean of y lies for that x value. The plot's outer dashed lines represent the 95% prediction interval, which is useful if we want to predict a new observation with the same x value. It allows us to say that we are 95% confident that the value of y itself (not the mean value of y) will fall within that wider range.
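
The distinction between the two intervals is easiest to see at a single x value; a short sketch (the choice of x = 50 is arbitrary):

# Interval for the mean math score of schools with 50% Asian students
predict(lm.fit, newdata = data.frame(AsianPct = 50), interval = "confidence")

# Interval for a single new school's average math score at the same x value
predict(lm.fit, newdata = data.frame(AsianPct = 50), interval = "prediction")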

Analysis of Output

The t-tests

In the regression model we can use a t-test to evaluate the predictive value of \(\beta_1\).

\(H_0:\beta_1=0\)

\(H_a:\beta_1\neq0\)

To evaluate the null hypothesis we calculate the t-statistic and compare it to the critical points from a t-density function with n-2 degrees of freedom.

\[t=\frac{\hat\beta_1-0}{SE(\hat\beta_1)}=\frac{3.0401}{0.1565}=19.43\] The t-critical value with 410 degrees of freedom and \(\alpha=0.05\) is 1.966

We reject the null hypothesis if the |t-stat| is greater than the t-critical value. Therefore, since the t-stat of 19.43 is greater than 1.966, we can reject the null hypothesis.
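
The same calculation can be reproduced in R from the fitted model object:

b1    <- coef(summary(lm.fit))["AsianPct", "Estimate"]    # 3.0401
se1   <- coef(summary(lm.fit))["AsianPct", "Std. Error"]  # 0.1565
tstat <- b1 / se1                                         # 19.43
tcrit <- qt(0.975, df = 410)                              # 1.966, the two-sided critical value
pval  <- 2 * pt(abs(tstat), df = 410, lower.tail = FALSE)
c(t = tstat, critical = tcrit, p.value = pval)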

We are evaluating the value of \(\beta_1\), but we only have a value for \(b_1\). However, \(b_1\) is an unbiased estimator of \(\beta_1\).

Imagine we were evaluating the average SAT math score for the population of all schools in the US (let's assume there are 100,000 schools). If we were to take a sample of 100 schools we could calculate a single sample slope. Now imagine we took all possible samples from the population. Without replacement that would result in \(\binom{100{,}000}{100}\) samples in our sample space, which is an extremely large number. We could then calculate the slope for each sample in the sample space, resulting in a daughter population of sample slopes. The grand average of that daughter population would equal \(\beta_1\), so we can say that \(b_1\) is an unbiased estimator. It is unbiased because the methodology is unbiased, not because every \(b_1\) value would equal \(\beta_1\). The method is “on target” because the daughter population average is centered at \(\beta_1\).
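
We cannot draw fresh samples from the true population, but resampling our own 412 schools illustrates the mechanics (a sketch; because we resample our data, the slopes center near our full-sample \(b_1\) of about 3.04 rather than the unknown \(\beta_1\)).

set.seed(1234)
slopes <- replicate(2000, {
  idx <- sample(nrow(SAT), 100)                                    # draw 100 schools without replacement
  coef(lm(MathAvgScore ~ AsianPct, data = SAT[idx, ]))["AsianPct"] # record that sample's slope
})
mean(slopes)                                        # close to 3.04
hist(slopes, main = "Daughter population of sample slopes", xlab = "b1")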

The y-hat equation

The equation for the fitted model is as follows:

\[\hat{y_i} = 385.075 + 3.04(x_i)\]

Similar to how \(b_1\) is an unbiased estimator of \(\beta_1\), \(\hat{y_i}\) is an unbiased estimator of the sub-population mean of \(Y_i\), namely \(\beta_0+\beta_1x_i\). If we evaluated all possible samples (of the same size) from the population, we could calculate a daughter population of sample slopes and \(\hat{y_i}\) values. The grand average of the daughter population of \(\hat{y_i}\) values would equal \(\beta_0+\beta_1x_i\). We therefore say that it is an unbiased estimator.

Chapter 3 - Matrix Methods

Simple Linear Regression in Matrix Terms

We can use matrix algebra to derive the regression equation, and some of the calculations are included below. The calculations are based on the following subset of data:

SATMat <- read.csv("~/Baruch/Baruch Class Projects/9700 Regression/SATMat.csv")
kable(SATMat)%>%kable_styling(position="left",full_width = F)
ReadingAvgScore WritingAvgScore MathAvgScore
679 682 735
632 649 688
635 636 682
612 596 660
587 587 659
605 588 654
621 638 651
636 636 648

The average SAT reading score supplies the second column of the X matrix (the first column of ones corresponds to the intercept):

Xmat=cbind(rep(1,8),SATMat[,1])
Xmatd=as.data.frame(Xmat)
kable(Xmatd, col.names = NULL)%>%kable_styling(full_width = F, position="left")
1 679
1 632
1 635
1 612
1 587
1 605
1 621
1 636

The Average SAT Math Score represents the Y vector:

Y=SATMat[,3]
Y=as.matrix(Y)
kable(SATMat[,3], col.names = NULL)%>%kable_styling(full_width = F, position="left")
735
688
682
660
659
654
651
648

Matrix operations:

\(X^{'}X\)

ab = as.data.frame(t(Xmat)%*%Xmat)
kable(ab, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
  column_spec(1:2,background="yellow")
8 5007
5007 3138965

\((X^{'}X)^{-1}\)

ac = as.data.frame(solve(t(Xmat)%*%Xmat))
kable(ac, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
  column_spec(1:2,background="orange")
75.3273260 -0.1201555
-0.1201555 0.0001920

\(X^{'}\)

ad=as.data.frame(t(Xmat))
kable(ad, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
  column_spec(1:8,background="orange")
1 1 1 1 1 1 1 1
679 632 635 612 587 605 621 636

\(X(X^{'}X)^{-1}X^{'}\)

ae=as.data.frame(Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))
kable(ae, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
  column_spec(1:8,background="orchid")
0.6668187 0.1874685 0.2180653 -0.0165103 -0.2714838 -0.0879029 0.0752802 0.2282643
0.1874685 0.1322023 0.1357299 0.1086847 0.0792878 0.1004536 0.1192676 0.1369058
0.2180653 0.1357299 0.1409853 0.1006935 0.0568981 0.0884308 0.1164599 0.1427372
-0.0165103 0.1086847 0.1006935 0.1619592 0.2285522 0.1806052 0.1379856 0.0980298
-0.2714838 0.0792878 0.0568981 0.2285522 0.4151328 0.2807948 0.1613832 0.0494349
-0.0879029 0.1004536 0.0884308 0.1806052 0.2807948 0.2086583 0.1445370 0.0844232
0.0752802 0.1192676 0.1164599 0.1379856 0.1613832 0.1445370 0.1295625 0.1155240
0.2282643 0.1369058 0.1427372 0.0980298 0.0494349 0.0844232 0.1155240 0.1446810

\(\hat{y} = X(X^{'}X)^{-1}X^{'}y\)

af=as.data.frame((Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
kable(af, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
  column_spec(1,background="lightblue")
717.4402
677.3496
679.9085
660.2897
638.9650
654.3188
667.9667
680.7615

The b vector \(= (X^{'}X)^{-1}X^{'}y\)

ag=as.data.frame((solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
kable(ag, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
  column_spec(1,background="yellow")
138.2590771
0.8529913

The matrix calculations match the regression output

lm(MathAvgScore ~ ReadingAvgScore, data = SATMat)

Call:
lm(formula = MathAvgScore ~ ReadingAvgScore, data = SATMat)

Coefficients:
    (Intercept)  ReadingAvgScore  
        138.259            0.853  

We can also compute an individual element of the hat matrix, \(h_{2,3}\), by hand, using the intermediate values below and the following formula:

x-value \((x-\bar{x})^2\)
679.00 2,822.27
632.00 37.52
635.00 83.27
612.00 192.52
587.00 1,511.27
605.00 435.77
621.00 23.77
636.00 102.52

\(\bar{x}=625.88\)
\(ss_x=5208.88\)
\(h_{2,3}=\left(\frac{1}{n}\right)+(x_2-\bar{x})(x_3-\bar{x})/ss_x\)
\(h_{2,3}=\left(\frac{1}{8}\right)+(632-625.88)(635-625.88)/5208.88 = 0.1357\)

The calculation matches the table.
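
The same element can also be pulled directly from the hat matrix computed above, as a quick check:

H <- Xmat %*% solve(t(Xmat) %*% Xmat) %*% t(Xmat)   # the hat matrix X(X'X)^{-1}X'
H[2, 3]                                             # 0.1357299, matching the hand calculation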

Multiple Linear Regression in Matrix Terms

Using two regressors:

Xmat=cbind(rep(1,8),SATMat[,1],SATMat[,2])
Xmate=as.data.frame(cbind(rep(1,8),SATMat[,1],SATMat[,2]))
kable(Xmate, col.names = NULL)%>%kable_styling(full_width = F, position="left")
1 679 682
1 632 649
1 635 636
1 612 596
1 587 587
1 605 588
1 621 638
1 636 636

Using R Matrix operations to fit the model:

Rhat = as.data.frame(solve(t(Xmat)%*%Xmat)%*%t(Xmat)%*%Y)
kable(Rhat, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
  column_spec(1,background="yellow")
134.4315052
0.9009849
-0.0418363
lm(MathAvgScore~ReadingAvgScore+WritingAvgScore, data=SATMat)

Call:
lm(formula = MathAvgScore ~ ReadingAvgScore + WritingAvgScore, 
    data = SATMat)

Coefficients:
    (Intercept)  ReadingAvgScore  WritingAvgScore  
      134.43151          0.90098         -0.04184  

The regression output matches the matrix operations.

Chapter 4 - Model Selection

To select the ideal regression model it is important to visualize the relationships between the dependent variable and regressors, which can be accomplished with a scatterplot matrix.

Before performing model selection I renamed the variables for ease of use (a sketch of the renaming appears after the table below).

Old Variable New Variable Description
MathAvgScore MS 2012 Average SAT math score for NYC schools
ReadingAvgScore RS 2012 Average SAT reading score for NYC schools
WritingAvgScore WS 2012 Average SAT writing score for NYC schools
NumTestTakers NT Number of test takers
AsianPct AP Percentage of Asian students in the school
BlackPct BP Percentage of Black students in the school
HispanicPct HP Percentage of Hispanic students in the school
WhitePct WP Percentage of White students in the school
MalePct MP Percentage of Male students in the school
FemalePct FP Percentage of Female students in the school
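The renaming code is not shown in the report; a minimal sketch of how the renamed data frame SAT2 could be built from SAT (assuming the column names listed in Chapter 1):

library(dplyr)
SAT2 <- SAT %>%
  rename(NT = NumTestTakers, RS = ReadingAvgScore, MS = MathAvgScore,
         WS = WritingAvgScore, AP = AsianPct, BP = BlackPct,
         HP = HispanicPct, WP = WhitePct, MP = MalePct, FP = FemalePct)
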
library(GGally)
ggpairs(SAT2[,4:13])

We can see a very strong linear relationship between the average reading, writing, and math SAT scores, as expected.

There also appears to be a positive linear relationship between the percentage of Asian and White students and the average math SAT score, although there is a modest level of heteroscedasticity given the concentration of datapoints on the left. We can also see a similar trend in the number of test takers.

We can visualize a negative relationship between the percentage of Black and Hispanic students and the average math SAT score, but there is a greater level of heteroscedasticity present.

Stepwise Regression

Next, I used stepwise regression, which iteratively adds and removes predictors, keeping the change that most improves the AIC at each step, until no addition or removal improves the model further.

library(leaps)
nullmod = lm(MS~1, data=SAT2) # Intercept only model
fullmod = lm(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data=SAT2) # all parameters
stepwisereg = step(nullmod, scope=list(lower=nullmod, upper=fullmod), direction='both')
Start:  AIC=3441.29
MS ~ 1

       Df Sum of Sq     RSS    AIC
+ WS    1   1394931  344054 2775.7
+ RS    1   1353411  385574 2822.7
+ AP    1    833423  905562 3174.5
+ WP    1    672844 1066140 3241.7
+ NT    1    486543 1252442 3308.1
+ BP    1    261348 1477637 3376.2
+ HP    1    213384 1525601 3389.4
<none>              1738985 3441.3
+ FP    1      1116 1737869 3443.0
+ MP    1      1111 1737874 3443.0

Step:  AIC=2775.74
MS ~ WS

       Df Sum of Sq     RSS    AIC
+ AP    1    179901  164152 2472.9
+ NT    1     37466  306588 2730.2
+ BP    1     35839  308215 2732.4
+ MP    1     25604  318450 2745.9
+ FP    1     25600  318454 2745.9
+ WP    1      4395  339658 2772.4
+ HP    1      3888  340166 2773.1
+ RS    1      3768  340286 2773.2
<none>               344054 2775.7
- WS    1   1394931 1738985 3441.3

Step:  AIC=2472.86
MS ~ WS + AP

       Df Sum of Sq    RSS    AIC
+ MP    1     11190 152963 2445.8
+ FP    1     11189 152964 2445.8
+ RS    1      9315 154837 2450.8
+ WP    1      1836 162316 2470.2
+ BP    1      1799 162353 2470.3
+ NT    1      1429 162723 2471.3
<none>              164152 2472.9
+ HP    1       647 163505 2473.2
- AP    1    179901 344054 2775.7
- WS    1    741410 905562 3174.5

Step:  AIC=2445.77
MS ~ WS + AP + MP

       Df Sum of Sq    RSS    AIC
+ RS    1      6431 146531 2430.1
+ BP    1      1854 151108 2442.7
+ WP    1      1476 151487 2443.8
+ NT    1       976 151987 2445.1
+ HP    1       791 152172 2445.6
<none>              152963 2445.8
+ FP    1        33 152930 2447.7
- MP    1     11190 164152 2472.9
- AP    1    165487 318450 2745.9
- WS    1    747578 900540 3174.2

Step:  AIC=2430.08
MS ~ WS + AP + MP + RS

       Df Sum of Sq    RSS    AIC
+ BP    1      3517 143014 2422.1
+ HP    1      1869 144663 2426.8
+ WP    1      1789 144743 2427.0
+ NT    1      1035 145496 2429.2
<none>              146531 2430.1
+ FP    1        69 146463 2431.9
- RS    1      6431 152963 2445.8
- MP    1      8306 154837 2450.8
- WS    1     20049 166581 2480.9
- AP    1    170512 317044 2746.1

Step:  AIC=2422.07
MS ~ WS + AP + MP + RS + BP

       Df Sum of Sq    RSS    AIC
+ NT    1       787 142228 2421.8
<none>              143014 2422.1
+ WP    1       602 142412 2422.3
+ HP    1       523 142491 2422.6
+ FP    1        88 142926 2423.8
- BP    1      3517 146531 2430.1
- MP    1      8027 151041 2442.6
- RS    1      8094 151108 2442.8
- WS    1     15913 158927 2463.5
- AP    1    135819 278834 2695.2

Step:  AIC=2421.79
MS ~ WS + AP + MP + RS + BP + NT

       Df Sum of Sq    RSS    AIC
<none>              142228 2421.8
- NT    1       787 143014 2422.1
+ WP    1       432 141796 2422.5
+ HP    1       382 141845 2422.7
+ FP    1        92 142136 2423.5
- BP    1      3269 145496 2429.2
- MP    1      7688 149915 2441.5
- RS    1      8084 150312 2442.6
- WS    1     15337 157565 2462.0
- AP    1    115280 257508 2664.4
summary(stepwisereg)

Call:
lm(formula = MS ~ WS + AP + MP + RS + BP + NT, data = SAT2)

Residuals:
    Min      1Q  Median      3Q     Max 
-65.209  -9.906  -0.890  10.668 106.919 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 57.161880   8.876950   6.439 3.40e-10 ***
WS           0.480869   0.072765   6.609 1.23e-10 ***
AP           1.444475   0.079726  18.118  < 2e-16 ***
MP           0.339625   0.072588   4.679 3.94e-06 ***
RS           0.349090   0.072760   4.798 2.26e-06 ***
BP          -0.122270   0.040078  -3.051  0.00243 ** 
NT           0.010680   0.007136   1.497  0.13527    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.74 on 405 degrees of freedom
Multiple R-squared:  0.9182,    Adjusted R-squared:  0.917 
F-statistic: 757.8 on 6 and 405 DF,  p-value: < 2.2e-16

The resulting best model had an AIC value of 2421.79 and an adjusted R-squared value of 0.917.

par(mfrow=c(2,2))
plot(stepwisereg)  # diagnostic plots for the model chosen by stepwise selection

We can see that there are some outliers and high leverage points, but the number is not excessive.

There is some heteroscedasticity of the residuals, mostly because the data is clustered around the averages at the left of the plots.

res = residuals(stepwisereg)
plot(res) # plot residuals

#display the corresponding rows where the absolute value of residuals is >50
kable(SAT2[abs(res)>50, 4:13]) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
NT RS MS WS AP BP HP WP MP FP
6 112 332 557 316 84.7 5.2 8.9 0.9 53.7 46.3
23 79 319 512 357 62.4 7.2 26.4 4.0 51.2 48.8
58 121 345 517 343 35.3 31.0 28.3 5.0 56.1 43.9
71 12 416 403 381 48.7 33.2 16.6 1.5 47.7 52.3
214 9 439 374 418 2.7 88.9 7.1 0.4 99.6 0.4
228 20 441 499 413 7.5 58.2 26.3 4.5 79.4 20.6
392 143 323 475 329 48.3 5.6 43.5 2.6 53.2 46.8

When we examine the most extreme residuals it is usually possible to pinpoint the cause. For example, in row 6, the Asian percentage of 84.7% is unusually high.

Variance Inflation

The Variance Inflation Factor (VIF) is used to detect collinearity among the regressors. It is calculated by regressing each regressor on the other regressors in the model, computing the \(R^2\) from that regression, subtracting it from 1, and taking the reciprocal: \(VIF_k=(1-R^2_k)^{-1}\)

A high level of collinearity can make the model unreliable. It inflates the standard errors of the estimated coefficients, making the t-statistics unreliable, and it can cause the individual regressor coefficients to change erratically when variables are added or removed.
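
For example, the VIF for WS can be reproduced by hand from its \(R^2\) against the other regressors in the model (a quick check against the vif() output below):

r2_ws <- summary(lm(WS ~ AP + MP + RS + BP + NT, data = SAT2))$r.squared
1 / (1 - r2_ws)   # approximately 21.4, matching vif() for WS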

library(car)
vif(stepwisereg) # calculate variance inflation factor
       WS        AP        MP        RS        BP        NT 
21.445989  1.632100  1.078509 20.155587  1.290958  1.460455 

Most regressors show little variance inflation. RS and WS have the highest values (above 20, beyond the common rule-of-thumb cutoff of 10), which is unsurprising because the average reading and writing scores are so strongly correlated with each other.

Cook’s D

Cook’s distance measures the influence of an observation on the regression model. An observation with a high Cook’s distance would meaningfully change the fitted coefficients if it were removed, so Cook’s distance is used to detect influential points (observations that combine high leverage with a large residual).

It can be computed by:

\[D_i=\frac{ \sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2 }{p(MSE)}\]

\(D_i\) is Cook’s distance for the \(i^{th}\) observation
\(\hat{y}_j\) is the full regression model’s fitted value for the \(j^{th}\) observation
\(\hat{y}_{j(i)}\) is the fitted value for the \(j^{th}\) observation from the model refit without the \(i^{th}\) observation
\(p\) is the number of model parameters (including the intercept)
\(MSE\) is the mean squared error of the full regression model
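
The definition can be verified directly for a single observation and compared to R's built-in cooks.distance() (a sketch using the first school; the two numbers should agree up to rounding):

i      <- 1
p      <- length(coef(stepwisereg))                      # number of parameters, including the intercept
mse    <- sum(residuals(stepwisereg)^2) / df.residual(stepwisereg)
fit_i  <- lm(formula(stepwisereg), data = SAT2[-i, ])    # refit the model without observation i
yhat   <- predict(stepwisereg, newdata = SAT2)           # full-model fitted values
yhat_i <- predict(fit_i, newdata = SAT2)                 # fitted values from the model without i
sum((yhat - yhat_i)^2) / (p * mse)                       # Cook's D from the definition
cooks.distance(stepwisereg)[i]                           # R's built-in value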

cooks=cooks.distance(stepwisereg)
plot(cooks) # plot cook's D

#display the corresponding rows where the value of cook's D is >0.05
kable(SAT2[cooks>.05, 4:13]) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)
NT RS MS WS AP BP HP WP MP FP
6 112 332 557 316 84.7 5.2 8.9 0.9 53.7 46.3
23 79 319 512 357 62.4 7.2 26.4 4.0 51.2 48.8
58 121 345 517 343 35.3 31.0 28.3 5.0 56.1 43.9
71 12 416 403 381 48.7 33.2 16.6 1.5 47.7 52.3
214 9 439 374 418 2.7 88.9 7.1 0.4 99.6 0.4
392 143 323 475 329 48.3 5.6 43.5 2.6 53.2 46.8

Most of the schools with large Cook’s D values have average math scores that are low relative to what their high percentage of Asian students would predict.

Chapter 5 - Cross Validation

Validation Set

The regression model provides us with a \(\hat{y}\) equation that is designed to be optimal for our dataset, but the true value lies in how well it will work with new data. Since fresh data is often unavailable we can simply split our existing dataset into two pieces. One piece will be used to calculate the regression equation, and the other piece will provide us with our out-of-sample values.

We start by randomizing the data and selecting our “training” sample, which is used to calculate the \(\hat{y}\) equation. The remaining rows form our “validation” (test) sample. The RMSE on the validation sample gives us a sense of how well the model might work with out-of-sample data; when comparing two models, the one that produces the lower validation RMSE is the preferred model.

library(caret)
set.seed(1234)
# Split the data into training and test set.  70% Train, 30% Test
training.samples <- SAT2$MS %>% createDataPartition(p = 0.7, list = FALSE)
train.data  <- SAT2[training.samples, ]
test.data <- SAT2[-training.samples, ]
# Build the model on the training data
model <- lm(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$MS),
            RMSE = RMSE(predictions, test.data$MS),
            MAE = MAE(predictions, test.data$MS))
         R2     RMSE      MAE
1 0.9355193 14.56597 11.25401

Leave One Out Cross Validation - LOOCV

  • Leave one datapoint out
  • Build the model on the remaining datapoints
  • Calculate and record the error associated with the datapoint that was excluded
  • Repeat the process for all data points
  • Calculate the average of all the test errors (a manual sketch of this loop follows the caret output below)
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = SAT2, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
Linear Regression 

412 samples
  9 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 411, 411, 411, 411, 411, 411, ... 
Resampling results:

  RMSE      Rsquared  MAE     
  19.19874  0.912689  13.89474

Tuning parameter 'intercept' was held constant at a value of TRUE
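
The bulleted procedure corresponds to the following manual loop, which caret automates (a sketch; it should reproduce the RMSE reported above):

errors <- sapply(seq_len(nrow(SAT2)), function(i) {
  fit <- lm(MS ~ NT + RS + WS + AP + BP + HP + WP + MP + FP, data = SAT2[-i, ])  # leave school i out
  SAT2$MS[i] - predict(fit, newdata = SAT2[i, , drop = FALSE])                   # error for school i
})
sqrt(mean(errors^2))   # LOOCV RMSE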

K-fold cross-validation

This is similar to leave-one-out, except the data is split into k subsets. One subset is used for testing and the remaining subsets are used for training. The process is repeated until every subset has served as the test set.

# Define training control
set.seed(1234) 
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = SAT2, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
Linear Regression 

412 samples
  9 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 372, 372, 372, 369, 372, 370, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  18.80215  0.9065453  13.94881

Tuning parameter 'intercept' was held constant at a value of TRUE

Repeated K-fold cross-validation

We can also repeat k-fold cross-validation multiple times and average the results across the repeats.

# Define training control
set.seed(1234)
train.control <- trainControl(method = "repeatedcv", 
                              number = 10, repeats = 3)
# Train the model
model <- train(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = SAT2, method = "lm",
               trControl = train.control)
# Summarize the results
print(model)
Linear Regression 

412 samples
  9 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 372, 372, 372, 369, 372, 370, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  18.82588  0.9035846  13.91531

Tuning parameter 'intercept' was held constant at a value of TRUE