In this report I evaluate the influence of race, gender, and other variables on NYC’s SAT math scores from 2012. The SAT plays an important role in shaping the academic and career paths of many students, and mathematical achievement seems particularly important in the current environment. Technology, engineering, finance, and science have become some of the most lucrative sectors in which to work, and a strong background in math is necessary to pursue these fields.
I gathered the SAT and school data from NYC Open Data, a repository for public data generated by various New York City agencies. I linked the two datasets by a common identifier, DBN (District Borough Number), which is unique to each school.
The file 2012 SAT Results can be found at:
The file 2006-2012 School Demographics and Accountability Snapshot can be found at:
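As a sketch of the linking step (hypothetical file and object names; any additional cleaning, such as restricting the demographic snapshot to the 2012 school year, is omitted), the two files can be joined on DBN as follows:
# Hypothetical sketch: merge the SAT results and the demographic snapshot on DBN
sat_results <- read.csv("2012_SAT_Results.csv")
demographics <- read.csv("2006-2012_School_Demographics.csv")
combined <- merge(sat_results, demographics, by = "DBN")   # keeps schools present in both files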
I have evaluated data from 412 NYC schools with the following variables:
| Variable | Definition (all data is for the year 2012) |
|---|---|
| DBN | District Borough Number - school identifier |
| Borough | New York City borough name |
| NumTestTakers | Number of SAT test takers |
| ReadingAvgScore | Average SAT reading score |
| MathAvgScore | Average SAT math score |
| WritingAvgScore | Average SAT writing score |
| AsianPct | Percentage of school students who were Asian |
| BlackPct | Percentage of school students who were Black |
| HispanicPct | Percentage of school students who were Hispanic |
| WhitePct | Percentage of school students who were White |
| MalePct | Percentage of school students who were Male |
| FemalePct | Percentage of school students who were Female |
Below is a small subset of the data:
library(knitr)
library(kableExtra)
# Read the combined SAT and demographics data; the first column is the DBN identifier
SAT <- read.csv("~/Baruch/Baruch Class Projects/9700 Regression/SAT Data.csv")
colnames(SAT)[1] = "DBN"
# Display the first 10 rows of the variables of interest
kable(SAT[1:10, c(1, 4:13)]) %>%
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = F)
| DBN | NumTestTakers | ReadingAvgScore | MathAvgScore | WritingAvgScore | AsianPct | BlackPct | HispanicPct | WhitePct | MalePct | FemalePct |
|---|---|---|---|---|---|---|---|---|---|---|
| 01M292 | 29 | 355 | 404 | 363 | 14.0 | 29.1 | 53.8 | 1.7 | 61.4 | 38.6 |
| 01M448 | 91 | 383 | 423 | 366 | 29.2 | 22.6 | 45.9 | 2.3 | 57.4 | 42.6 |
| 01M450 | 70 | 377 | 402 | 370 | 9.7 | 23.9 | 55.4 | 10.4 | 54.7 | 45.3 |
| 01M458 | 7 | 414 | 401 | 359 | 2.2 | 34.4 | 59.4 | 3.6 | 43.3 | 56.7 |
| 01M509 | 44 | 390 | 433 | 384 | 9.3 | 31.6 | 56.9 | 1.6 | 46.3 | 53.7 |
| 01M515 | 112 | 332 | 557 | 316 | 84.7 | 5.2 | 8.9 | 0.9 | 53.7 | 46.3 |
| 01M539 | 159 | 522 | 574 | 525 | 27.8 | 11.7 | 14.2 | 44.9 | 49.2 | 50.8 |
| 01M650 | 18 | 417 | 418 | 411 | 0.5 | 45.4 | 49.5 | 4.1 | 39.9 | 60.1 |
| 01M696 | 130 | 624 | 604 | 628 | 15.1 | 15.1 | 18.2 | 49.8 | 31.3 | 68.7 |
| 02M047 | 16 | 395 | 400 | 387 | 1.7 | 32.2 | 59.2 | 6.3 | 42.5 | 57.5 |
I chose to take a closer look at how the percentage of Asian students relates to the average SAT math score, which we can visualize with a scatterplot.
library(tidyverse)
library(ggthemes)
ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=MathAvgScore), color="blue", size=2)+
labs(x="Percentage of Asian Students", y="Average SAT Math Score")+theme_light()
Heteroscedasticity - We can see that there is a moderate level of heteroscedasticity present. There is more variability in the average SAT math score as we move away from the cluster of values on the left.
Curvature - Fortunately there is no evidence of curvature in the scatterplot.
Leverage Points - The points on the far right lie well away from \(\bar{x}\), the center of the cluster of x values on the left, and would be considered leverage points (a sketch flagging them with hat values follows this list).
Outliers - A few points represent either a large average SAT math score (near 700) or a small average SAT math score (near 300), but I don’t think they would be considered outliers; they are not extreme relative to the other data points.
Influential Points - Although there are leverage points, they would not be considered influential. Their removal would not dramatically alter the slope of the fitted \(\hat{y}\) line.
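As a quick check of the leverage claim, a minimal sketch (assuming the SAT data frame read in above) flags the high-leverage schools using hat values and a common 2p/n rule of thumb:
# Flag high-leverage points for the simple fit of math score on Asian percentage
fit <- lm(MathAvgScore ~ AsianPct, data = SAT)
h <- hatvalues(fit)                    # leverage of each school
high_lev <- which(h > 2 * mean(h))     # rule of thumb: leverage above 2p/n
SAT[high_lev, c("DBN", "AsianPct", "MathAvgScore")]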
Although sometimes transformations can help with heteroscedasticity, I did not find an adequate solution in this case. Some examples are presented below.
LOG Transformations
library(gridExtra)
ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=log(MathAvgScore)), color="purple", size=2)+
labs(x="LOG Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light() ->p1
ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=MathAvgScore), color="purple", size=2)+
labs(x="Log Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2
grid.arrange(p1, p2, ncol = 2)
ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=log(MathAvgScore)), color="purple", size=2)+
labs(x="Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light()
We can see that LOG transformations of the x values, the y values, or both did not solve the problem of heteroscedasticity, and in some cases made it worse.
Square Root Transformations
ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=sqrt(MathAvgScore)), color="green4", size=2)+
labs(x="SQRT Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light() ->p1
ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=MathAvgScore), color="green4", size=2)+
labs(x="SQRT Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2
grid.arrange(p1, p2, ncol = 2)
ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=sqrt(MathAvgScore)), color="green4", size=2)+
labs(x="Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light()
Similarly, square root transformations were not helpful.
\[Y_i = \beta_0+\beta_1x_i+\epsilon\]
The model does not view \(Y_i\) as a datapoint. The model views \(Y_i\) as a sub-population of values that share the same x value. For this to be possible the model assumes that we could get many fresh samples. In this regression \(Y_i\) represents a sub-population of average SAT math scores for a given percentage of Asian students, \(x_i\) .
Like any sub-population, \(Y_i\) has a population average and a population variance. The first component, \(\beta_0+\beta_1x_i\), supplies the sub-population average: \(\beta_0\) represents the model intercept and \(\beta_1\) represents the model slope.
The second component, \(\epsilon\), supplies the variability. We can think of \(\epsilon\) as the name of a bell-shaped population of error values that happens to be centered at zero. The model assumes that this bell curve is added to each sub-population average and that every bell curve has the same standard deviation.
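In symbols, these assumptions about each sub-population can be summarized as:
\[E(Y_i)=\beta_0+\beta_1x_i, \qquad Var(Y_i)=\sigma^2, \qquad \epsilon \sim N(0,\sigma^2)\]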
lm.fit = lm(MathAvgScore~AsianPct, data=SAT)
summary(lm.fit)
Call:
lm(formula = MathAvgScore ~ AsianPct, data = SAT)
Residuals:
Min 1Q Median 3Q Max
-164.163 -25.101 -6.255 19.754 212.155
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 385.0746 2.7485 140.10 <2e-16 ***
AsianPct 3.0401 0.1565 19.43 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47 on 410 degrees of freedom
Multiple R-squared: 0.4793, Adjusted R-squared: 0.478
F-statistic: 377.3 on 1 and 410 DF, p-value: < 2.2e-16
temp_var <- predict(lm.fit, interval = "prediction")
new_df <-cbind(SAT, temp_var)
ggplot(new_df, aes(AsianPct,MathAvgScore))+geom_point()+
geom_line(aes(y=lwr),color="red",linetype="dashed")+
geom_line(aes(y=upr),color="red",linetype="dashed")+
geom_smooth(method = lm, se=TRUE)+
ggtitle("Fit Plot. Blue Band = 95% Confidence Interval. Red Lines = 95% Prediction Interval")
If we examine the regression plot above, we can see several lines. The \(\hat{y}\) line in the center gives the mean value of y for a given x. Surrounding the \(\hat{y}\) line is a shaded band representing the 95% confidence interval: for a given x value, we are 95% confident that the population mean of y falls within that range. The plot’s outer dashed lines represent the 95% prediction interval, which is useful if we wanted to predict a new observation at the same x value. It allows us to say that we are 95% confident that an individual y value (not the mean value of y) will fall within that wider range.
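For a concrete comparison, a short sketch (using a hypothetical value of 50% Asian students) shows how the two intervals differ at the same x value:
# Confidence interval for the mean response vs. prediction interval for a new school
newx <- data.frame(AsianPct = 50)
predict(lm.fit, newdata = newx, interval = "confidence", level = 0.95)
predict(lm.fit, newdata = newx, interval = "prediction", level = 0.95)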
The t-tests
In the regression model we can use a t-test to evaluate the predictive value of \(\beta_1\).
\(H_0:\beta_1=0\)
\(H_a:\beta_1\neq0\)
To evaluate the null hypothesis we calculate the t-statistic and compare it to the critical points from a t-density function with n-2 degrees of freedom.
\[t=\frac{\hat\beta_1-0}{SE(\hat\beta_1)}=\frac{3.0401}{0.1565}=19.43\] The t-critical value with 410 degrees of freedom and \(\alpha=0.05\) is 1.966
We reject the null hypothesis if the |t-stat| is greater than the t-critical value. Therefore, since the t-stat of 19.43 is greater than 1.966, we can reject the null hypothesis.
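These values can be checked directly in R:
qt(0.975, df = 410)        # t critical value, approximately 1.966
2 * pt(-19.43, df = 410)   # two-sided p-value, effectively zero, consistent with the <2e-16 reported above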
We are evaluating the value of \(\beta_1\), but we only have a value for \(b_1\). However, \(b_1\) is an unbiased estimator of \(\beta_1\).
Imagine we were evaluating the average SAT math score for the population of all schools in the US (let’s assume that there are 100,000 schools). If we were to take a sample of 100 schools we could calculate a single sample slope. Now imagine we took all possible samples from the population. Without replacement that would result in \(\binom{100{,}000}{100}\) sample points in our sample space, which is an extremely large number. We could then calculate the slope for each sample in the sample space, resulting in a daughter population of sample slopes. The grand average of that daughter population would equal \(\beta_1\), so we can say that \(b_1\) is an unbiased estimator. It is unbiased because the methodology is unbiased, not because every \(b_1\) value would equal \(\beta_1\). The method is “on target” because the daughter population average is centered at \(\beta_1\).
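A small simulation sketch (entirely hypothetical data, with a true slope and noise level loosely based on the fitted model above) illustrates the idea: the average of many sample slopes lands on the true slope even though individual slopes vary.
# Simulate many samples from a known model and average the estimated slopes
set.seed(1)
beta0 <- 385; beta1 <- 3; sigma <- 47           # hypothetical "true" values
slopes <- replicate(5000, {
  x <- runif(100, 0, 90)                        # hypothetical Asian percentages
  y <- beta0 + beta1 * x + rnorm(100, 0, sigma)
  coef(lm(y ~ x))[2]
})
mean(slopes)   # close to the true slope of 3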
The y-hat equation
The equation for the fitted model is as follows:
\[\hat{y_i} = 385.075 + 3.04(x_i)\]
Similar to how \(b_1\) is an unbiased estimator of \(\beta_1\), \(\hat{y_i}\) is an unbiased estimator of the sub-population mean of \(Y_i\). If we evaluated all possible samples (of the same size) from the population, we could calculate a daughter population of sample slopes and \(\hat{y_i}\) values. The grand average of the daughter population of \(\hat{y_i}\) values would equal \(E(Y_i)=\beta_0+\beta_1x_i\). We therefore say that it is an unbiased estimator.
We can use Matrix algebra to derive the regression equation, and some of the calculations are included below. The calculations are based on the following subset of data:
SATMat <- read.csv("~/Baruch/Baruch Class Projects/9700 Regression/SATMat.csv")
kable(SATMat)%>%kable_styling(position="left",full_width = F)
| ReadingAvgScore | WritingAvgScore | MathAvgScore |
|---|---|---|
| 679 | 682 | 735 |
| 632 | 649 | 688 |
| 635 | 636 | 682 |
| 612 | 596 | 660 |
| 587 | 587 | 659 |
| 605 | 588 | 654 |
| 621 | 638 | 651 |
| 636 | 636 | 648 |
The Average SAT Reading score will be used as the X matrix:
Xmat=cbind(rep(1,8),SATMat[,1])
Xmatd=as.data.frame(Xmat)
kable(Xmatd, col.names = NULL)%>%kable_styling(full_width = F, position="left")
| 1 | 679 |
| 1 | 632 |
| 1 | 635 |
| 1 | 612 |
| 1 | 587 |
| 1 | 605 |
| 1 | 621 |
| 1 | 636 |
The Average SAT Math Score represents the Y vector:
Y=SATMat[,3]
Y=as.matrix(Y)
kable(SATMat[,3], col.names = NULL)%>%kable_styling(full_width = F, position="left")
| 735 |
| 688 |
| 682 |
| 660 |
| 659 |
| 654 |
| 651 |
| 648 |
Matrix operations:
\(X^{'}X\)
ab = as.data.frame(t(Xmat)%*%Xmat)
kable(ab, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
column_spec(1:2,background="yellow")
| 8 | 5007 |
| 5007 | 3138965 |
\((X^{'}X)^{-1}\)
ac = as.data.frame(solve(t(Xmat)%*%Xmat))
kable(ac, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
column_spec(1:2,background="orange")
| 75.3273260 | -0.1201555 |
| -0.1201555 | 0.0001920 |
\(X^{'}\)
ad=as.data.frame(t(Xmat))
kable(ad, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
column_spec(1:8,background="orange")
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 679 | 632 | 635 | 612 | 587 | 605 | 621 | 636 |
\(X(X^{'}X)^{-1}X^{'}\)
ae=as.data.frame(Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))
kable(ae, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
column_spec(1:8,background="orchid")
| 0.6668187 | 0.1874685 | 0.2180653 | -0.0165103 | -0.2714838 | -0.0879029 | 0.0752802 | 0.2282643 |
| 0.1874685 | 0.1322023 | 0.1357299 | 0.1086847 | 0.0792878 | 0.1004536 | 0.1192676 | 0.1369058 |
| 0.2180653 | 0.1357299 | 0.1409853 | 0.1006935 | 0.0568981 | 0.0884308 | 0.1164599 | 0.1427372 |
| -0.0165103 | 0.1086847 | 0.1006935 | 0.1619592 | 0.2285522 | 0.1806052 | 0.1379856 | 0.0980298 |
| -0.2714838 | 0.0792878 | 0.0568981 | 0.2285522 | 0.4151328 | 0.2807948 | 0.1613832 | 0.0494349 |
| -0.0879029 | 0.1004536 | 0.0884308 | 0.1806052 | 0.2807948 | 0.2086583 | 0.1445370 | 0.0844232 |
| 0.0752802 | 0.1192676 | 0.1164599 | 0.1379856 | 0.1613832 | 0.1445370 | 0.1295625 | 0.1155240 |
| 0.2282643 | 0.1369058 | 0.1427372 | 0.0980298 | 0.0494349 | 0.0844232 | 0.1155240 | 0.1446810 |
\(\hat{y} = X(X^{'}X)^{-1}X^{'}y\)
af=as.data.frame((Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
kable(af, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
column_spec(1,background="lightblue")
| 717.4402 |
| 677.3496 |
| 679.9085 |
| 660.2897 |
| 638.9650 |
| 654.3188 |
| 667.9667 |
| 680.7615 |
\(b = (X^{'}X)^{-1}X^{'}y\)
ag=as.data.frame((solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
kable(ag, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
column_spec(1,background="yellow")
| 138.2590771 |
| 0.8529913 |
The matrix calculations match the regression output:
lm(MathAvgScore ~ ReadingAvgScore, data = SATMat)
Call:
lm(formula = MathAvgScore ~ ReadingAvgScore, data = SATMat)
Coefficients:
(Intercept) ReadingAvgScore
138.259 0.853
We can also compute the value of \(h_{2,3}\) in the hat matrix by hand, using the quantities in the table below and the formula that follows it:
| x-value | \((x-\bar{x})^2\) |
|---|---|
| 679.00 | 2,822.27 |
| 632.00 | 37.52 |
| 635.00 | 83.27 |
| 612.00 | 192.52 |
| 587.00 | 1,511.27 |
| 605.00 | 435.77 |
| 621.00 | 23.77 |
| 636.00 | 102.52 |
\(\bar{x}=625.88\)
\(ss_x=5208.88\)
\(h_{2,3}=\left(\frac{1}{n}\right)+(x_2-\bar{x})(x_3-\bar{x})/ss_x\)
\(h_{2,3}=\left(\frac{1}{8}\right)+(632-625.88)(635-625.88)/5208.88 = 0.1357\)
The calculation matches the table.
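The same entry can also be read directly from the full hat matrix computed with the design matrix above:
H <- Xmat %*% solve(t(Xmat) %*% Xmat) %*% t(Xmat)   # hat matrix for the one-regressor model
H[2, 3]   # 0.1357299, matching the hand calculation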
Using two regressors:
Xmat=cbind(rep(1,8),SATMat[,1],SATMat[,2])
Xmate=as.data.frame(cbind(rep(1,8),SATMat[,1],SATMat[,2]))
kable(Xmate, col.names = NULL)%>%kable_styling(full_width = F, position="left")
| 1 | 679 | 682 |
| 1 | 632 | 649 |
| 1 | 635 | 636 |
| 1 | 612 | 596 |
| 1 | 587 | 587 |
| 1 | 605 | 588 |
| 1 | 621 | 638 |
| 1 | 636 | 636 |
Using R Matrix operations to fit the model:
Rhat = as.data.frame(solve(t(Xmat)%*%Xmat)%*%t(Xmat)%*%Y)
kable(Rhat, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
column_spec(1,background="yellow")
| 134.4315052 |
| 0.9009849 |
| -0.0418363 |
lm(MathAvgScore~ReadingAvgScore+WritingAvgScore, data=SATMat)
Call:
lm(formula = MathAvgScore ~ ReadingAvgScore + WritingAvgScore,
data = SATMat)
Coefficients:
(Intercept) ReadingAvgScore WritingAvgScore
134.43151 0.90098 -0.04184
The regression output matches the matrix operations.
To select the ideal regression model it is important to visualize the relationships between the dependent variable and regressors, which can be accomplished with a scatterplot matrix.
Before performing model selection I renamed the variables for ease of use; a sketch of the renaming step follows the table below.
| Old Variable | New Variable | Description |
|---|---|---|
| MathAvgScore | MS | 2012 Average SAT math score for NYC schools |
| ReadingAvgScore | RS | 2012 Average SAT reading score for NYC schools |
| WritingAvgScore | WS | 2012 Average SAT writing score for NYC schools |
| NumTestTakers | NT | Number of test takers |
| AsianPct | AP | Percentage of Asian students in the school |
| BlackPct | BP | Percentage of Black students in the school |
| HispanicPct | HP | Percentage of Hispanic students in the school |
| WhitePct | WP | Percentage of White students in the school |
| MalePct | MP | Percentage of Male students in the school |
| FemalePct | FP | Percentage of Female students in the school |
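A renaming step along these lines (a sketch, assuming the SAT data frame read in earlier) produces the SAT2 data frame used in the remainder of the analysis:
library(dplyr)
# Sketch: SAT2 is assumed to keep the same columns and column order as SAT, only with shorter names
SAT2 <- SAT %>%
  rename(NT = NumTestTakers, RS = ReadingAvgScore, MS = MathAvgScore,
         WS = WritingAvgScore, AP = AsianPct, BP = BlackPct, HP = HispanicPct,
         WP = WhitePct, MP = MalePct, FP = FemalePct)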
library(GGally)
ggpairs(SAT2[,4:13])
We can see a very strong linear relationship between the average reading, writing, and math SAT scores, as expected.
There also appears to be a positive linear relationship between the percentage of Asian and White students and the average math SAT score, although there is a modest level of heteroscedasticity given the concentration of data points on the left. We can see a similar positive trend for the number of test takers.
We can visualize a negative relationship between the percentage of Black and Hispanic students and the average math SAT score, but there is a greater level of heteroscedasticity present.
Next, I used stepwise regression, iteratively adding and removing predictors until no single addition or removal improves the AIC.
library(leaps)
nullmod = lm(MS~1, data=SAT2) # Intercept only model
fullmod = lm(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data=SAT2) # all parameters
stepwisereg = step(nullmod, scope=list(lower=nullmod, upper=fullmod), direction='both')
Start: AIC=3441.29
MS ~ 1
Df Sum of Sq RSS AIC
+ WS 1 1394931 344054 2775.7
+ RS 1 1353411 385574 2822.7
+ AP 1 833423 905562 3174.5
+ WP 1 672844 1066140 3241.7
+ NT 1 486543 1252442 3308.1
+ BP 1 261348 1477637 3376.2
+ HP 1 213384 1525601 3389.4
<none> 1738985 3441.3
+ FP 1 1116 1737869 3443.0
+ MP 1 1111 1737874 3443.0
Step: AIC=2775.74
MS ~ WS
Df Sum of Sq RSS AIC
+ AP 1 179901 164152 2472.9
+ NT 1 37466 306588 2730.2
+ BP 1 35839 308215 2732.4
+ MP 1 25604 318450 2745.9
+ FP 1 25600 318454 2745.9
+ WP 1 4395 339658 2772.4
+ HP 1 3888 340166 2773.1
+ RS 1 3768 340286 2773.2
<none> 344054 2775.7
- WS 1 1394931 1738985 3441.3
Step: AIC=2472.86
MS ~ WS + AP
Df Sum of Sq RSS AIC
+ MP 1 11190 152963 2445.8
+ FP 1 11189 152964 2445.8
+ RS 1 9315 154837 2450.8
+ WP 1 1836 162316 2470.2
+ BP 1 1799 162353 2470.3
+ NT 1 1429 162723 2471.3
<none> 164152 2472.9
+ HP 1 647 163505 2473.2
- AP 1 179901 344054 2775.7
- WS 1 741410 905562 3174.5
Step: AIC=2445.77
MS ~ WS + AP + MP
Df Sum of Sq RSS AIC
+ RS 1 6431 146531 2430.1
+ BP 1 1854 151108 2442.7
+ WP 1 1476 151487 2443.8
+ NT 1 976 151987 2445.1
+ HP 1 791 152172 2445.6
<none> 152963 2445.8
+ FP 1 33 152930 2447.7
- MP 1 11190 164152 2472.9
- AP 1 165487 318450 2745.9
- WS 1 747578 900540 3174.2
Step: AIC=2430.08
MS ~ WS + AP + MP + RS
Df Sum of Sq RSS AIC
+ BP 1 3517 143014 2422.1
+ HP 1 1869 144663 2426.8
+ WP 1 1789 144743 2427.0
+ NT 1 1035 145496 2429.2
<none> 146531 2430.1
+ FP 1 69 146463 2431.9
- RS 1 6431 152963 2445.8
- MP 1 8306 154837 2450.8
- WS 1 20049 166581 2480.9
- AP 1 170512 317044 2746.1
Step: AIC=2422.07
MS ~ WS + AP + MP + RS + BP
Df Sum of Sq RSS AIC
+ NT 1 787 142228 2421.8
<none> 143014 2422.1
+ WP 1 602 142412 2422.3
+ HP 1 523 142491 2422.6
+ FP 1 88 142926 2423.8
- BP 1 3517 146531 2430.1
- MP 1 8027 151041 2442.6
- RS 1 8094 151108 2442.8
- WS 1 15913 158927 2463.5
- AP 1 135819 278834 2695.2
Step: AIC=2421.79
MS ~ WS + AP + MP + RS + BP + NT
Df Sum of Sq RSS AIC
<none> 142228 2421.8
- NT 1 787 143014 2422.1
+ WP 1 432 141796 2422.5
+ HP 1 382 141845 2422.7
+ FP 1 92 142136 2423.5
- BP 1 3269 145496 2429.2
- MP 1 7688 149915 2441.5
- RS 1 8084 150312 2442.6
- WS 1 15337 157565 2462.0
- AP 1 115280 257508 2664.4
summary(stepwisereg)
Call:
lm(formula = MS ~ WS + AP + MP + RS + BP + NT, data = SAT2)
Residuals:
Min 1Q Median 3Q Max
-65.209 -9.906 -0.890 10.668 106.919
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.161880 8.876950 6.439 3.40e-10 ***
WS 0.480869 0.072765 6.609 1.23e-10 ***
AP 1.444475 0.079726 18.118 < 2e-16 ***
MP 0.339625 0.072588 4.679 3.94e-06 ***
RS 0.349090 0.072760 4.798 2.26e-06 ***
BP -0.122270 0.040078 -3.051 0.00243 **
NT 0.010680 0.007136 1.497 0.13527
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.74 on 405 degrees of freedom
Multiple R-squared: 0.9182, Adjusted R-squared: 0.917
F-statistic: 757.8 on 6 and 405 DF, p-value: < 2.2e-16
The resulting best model had an AIC value of 2421.79 and an adjusted R-squared value of 0.917.
par(mfrow=c(2,2))
plot(stepwisereg)   # diagnostic plots for the selected stepwise model
We can see that there are some outliers and high leverage points, but the number is not excessive.
There is some heteroscedasticity of the residuals, mostly because the data is clustered around the averages at the left of the plots.
res = residuals(stepwisereg)
plot(res) # plot residuals
#display the corresponding rows where the absolute value of residuals is >50
kable(SAT2[abs(res)>50,4:13])%>%kable_styling(bootstrap_options=c("striped","condensed", full_width=F))
| | NT | RS | MS | WS | AP | BP | HP | WP | MP | FP |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 112 | 332 | 557 | 316 | 84.7 | 5.2 | 8.9 | 0.9 | 53.7 | 46.3 |
| 23 | 79 | 319 | 512 | 357 | 62.4 | 7.2 | 26.4 | 4.0 | 51.2 | 48.8 |
| 58 | 121 | 345 | 517 | 343 | 35.3 | 31.0 | 28.3 | 5.0 | 56.1 | 43.9 |
| 71 | 12 | 416 | 403 | 381 | 48.7 | 33.2 | 16.6 | 1.5 | 47.7 | 52.3 |
| 214 | 9 | 439 | 374 | 418 | 2.7 | 88.9 | 7.1 | 0.4 | 99.6 | 0.4 |
| 228 | 20 | 441 | 499 | 413 | 7.5 | 58.2 | 26.3 | 4.5 | 79.4 | 20.6 |
| 392 | 143 | 323 | 475 | 329 | 48.3 | 5.6 | 43.5 | 2.6 | 53.2 | 46.8 |
When we examine the most extreme residuals it is usually possible to pinpoint the cause. For example, in row 6 the Asian percentage of 84.7% is unusually high.
The Variance Inflation Factor (VIF) is a measurement used to detect collinearity among the regressors. It is calculated by regressing each regressor against the other regressors in the model, computing the \(R^2\), subtracting the resulting \(R^2\) from 1, and taking the reciprocal: \(VIF_K=(1-R^2_K)^{-1}\)
A high level of collinearity can make the model unreliable. It inflates the standard errors of the estimated coefficients, making the t-statistics unreliable, and it can cause the individual regressor coefficients to change erratically.
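As a sanity check on the definition, the VIF for one regressor can be computed by hand and compared with the vif() output below:
# Manual VIF for WS: regress WS on the other regressors in the stepwise model
r2_ws <- summary(lm(WS ~ AP + MP + RS + BP + NT, data = SAT2))$r.squared
1 / (1 - r2_ws)   # should match the WS entry reported by vif() below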
library(car)
vif(stepwisereg) # calculate variance inflation factor
WS AP MP RS BP NT
21.445989 1.632100 1.078509 20.155587 1.290958 1.460455
Most of the VIFs are low. RS and WS have the highest values, above the common rule-of-thumb cutoff of 10, likely because the reading and writing scores have such a strong linear relationship with each other.
Cook’s distance measures the influence of an observation on the regression model. An observation with a high Cook’s distance would meaningfully change the fitted regression if it were removed, so the measure can be used to detect influential points.
It can be computed by:
\[D_i=\frac{ \sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2 }{p(MSE)}\]
\(D_i\) is Cook’s distance for the \(i^{th}\) observation
\(\hat{y}_j\) is the full regression model’s estimate for the \(j^{th}\) observation
\(\hat{y}_{j(i)}\) is the estimate for the \(j^{th}\) observation from the model fitted without the \(i^{th}\) observation
\(p\) is the number of model parameters (the regressors plus the intercept)
\(MSE\) is the mean squared error of the full regression model
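An equivalent form writes \(D_i\) in terms of the standardized residual and leverage of observation i; a short sketch using this form should reproduce R’s built-in cooks.distance() for the stepwise model:
# Cook's distance via standardized residuals and hat values
p <- length(coef(stepwisereg))   # number of model parameters, including the intercept
d_manual <- rstandard(stepwisereg)^2 * hatvalues(stepwisereg) /
  (p * (1 - hatvalues(stepwisereg)))
all.equal(unname(d_manual), unname(cooks.distance(stepwisereg)))   # TRUE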
cooks=cooks.distance(stepwisereg)
plot(cooks) # plot cook's D
#display the corresponding rows where the value of cook's D is >0.05
kable(SAT2[cooks>.05,4:13])%>%kable_styling(bootstrap_options=c("striped","condensed", full_width=F))
| | NT | RS | MS | WS | AP | BP | HP | WP | MP | FP |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 112 | 332 | 557 | 316 | 84.7 | 5.2 | 8.9 | 0.9 | 53.7 | 46.3 |
| 23 | 79 | 319 | 512 | 357 | 62.4 | 7.2 | 26.4 | 4.0 | 51.2 | 48.8 |
| 58 | 121 | 345 | 517 | 343 | 35.3 | 31.0 | 28.3 | 5.0 | 56.1 | 43.9 |
| 71 | 12 | 416 | 403 | 381 | 48.7 | 33.2 | 16.6 | 1.5 | 47.7 | 52.3 |
| 214 | 9 | 439 | 374 | 418 | 2.7 | 88.9 | 7.1 | 0.4 | 99.6 | 0.4 |
| 392 | 143 | 323 | 475 | 329 | 48.3 | 5.6 | 43.5 | 2.6 | 53.2 | 46.8 |
Most large Cook’s D values appear to have low math scores relative to a high percentage of Asian students.
The regression model provides us with a \(\hat{y}\) equation that is designed to be optimal for our dataset, but the true value lies in how well it will work with new data. Since fresh data is often unavailable we can simply split our existing dataset into two pieces. One piece will be used to calculate the regression equation, and the other piece will provide us with our out-of-sample values.
We start by randomizing the data and selecting our “training sample,” which is used to calculate the \(\hat{y}\) equation. We then use the remaining rows as our “validation” sample. We can evaluate the RMSE value to get a sense of how well the model might work with out-of-sample data. When comparing two models, the one that produces the lowest test sample RMSE is the preferred model.
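For a validation sample with \(n_{test}\) observations, the RMSE is:
\[RMSE=\sqrt{\frac{1}{n_{test}}\sum_{i=1}^{n_{test}}\left(y_i-\hat{y}_i\right)^2}\]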
library(caret)
set.seed(1234)
# Split the data into training and test set. 70% Train, 30% Test
training.samples <- SAT2$MS %>% createDataPartition(p = 0.7, list = FALSE)
train.data <- SAT2[training.samples, ]
test.data <- SAT2[-training.samples, ]
# Build the model on the training data
model <- lm(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = train.data)
# Make predictions and compute the R2, RMSE and MAE
predictions <- model %>% predict(test.data)
data.frame( R2 = R2(predictions, test.data$MS),
RMSE = RMSE(predictions, test.data$MS),
MAE = MAE(predictions, test.data$MS))
R2 RMSE MAE
1 0.9355193 14.56597 11.25401
Another approach is leave-one-out cross-validation (LOOCV), where the model is fit repeatedly, each time holding out a single observation and predicting it.
# Define training control
train.control <- trainControl(method = "LOOCV")
# Train the model
model <- train(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = SAT2, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
Linear Regression
412 samples
9 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 411, 411, 411, 411, 411, 411, ...
Resampling results:
RMSE Rsquared MAE
19.19874 0.912689 13.89474
Tuning parameter 'intercept' was held constant at a value of TRUE
This is k-fold cross-validation, which is similar to leave-one-out except that the data is split into k subsets: one subset is used for testing and the remainder are used for training, and the process is repeated until every subset has been used for testing.
# Define training control
set.seed(1234)
train.control <- trainControl(method = "cv", number = 10)
# Train the model
model <- train(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = SAT2, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
Linear Regression
412 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 372, 372, 372, 369, 372, 370, ...
Resampling results:
RMSE Rsquared MAE
18.80215 0.9065453 13.94881
Tuning parameter 'intercept' was held constant at a value of TRUE
We can also repeat k-fold cross-validation multiple times:
# Define training control
set.seed(1234)
train.control <- trainControl(method = "repeatedcv",
number = 10, repeats = 3)
# Train the model
model <- train(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data = SAT2, method = "lm",
trControl = train.control)
# Summarize the results
print(model)
Linear Regression
412 samples
9 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 372, 372, 372, 369, 372, 370, ...
Resampling results:
RMSE Rsquared MAE
18.82588 0.9035846 13.91531
Tuning parameter 'intercept' was held constant at a value of TRUE