This is an update from a previous report:

https://rpubs.com/pjozefek/574750

Chapters 6 and 7 were added, formatting was changed, and some errors were corrected.

Chapter 1 - Introduction


Topic

In this report I will be evaluating the influence of race, gender, and other variables on NYC’s SAT math scores from 2012. The SAT plays an important role in determining the academic and career path for many students, and mathematical achievement seems particularly important in the current environment. Technology, engineering, finance, and science have all become the most lucrative sectors in which to work, and a strong background in math is necessary to pursue these fields.

Data Source

I gathered the SAT and school data from NYC Open Data, which is a repository for public data generated by various New York City agencies . I was able to link two datasets by a common identifier, DBN (District Borough Number), which is unique to each school.

The file 2012 SAT Results can be found at:

https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4

The file 2006-2012 School Demographics and Accountability Snapshot can be found at:

https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j

Data Wrangling

I began by importing the 2012 SAT results.

> SATraw<- read.csv("2012_SAT_results.csv")
> # Python
+ import pandas as pd
+ import numpy as np
+ r.SATraw.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 6 columns):
DBN                                478 non-null object
SCHOOL.NAME                        478 non-null object
Num.of.SAT.Test.Takers             478 non-null object
SAT.Critical.Reading.Avg..Score    478 non-null object
SAT.Math.Avg..Score                478 non-null object
SAT.Writing.Avg..Score             478 non-null object
dtypes: object(6)
memory usage: 22.5+ KB

There are 478 rows and 6 columns.

> #Python
+ SATdesc = r.SATraw.describe()
> library(knitr)
> library(kableExtra)
> kable(py$SATdesc)%>%kable_styling(bootstrap_options = c("striped","condensed"),
+     full_width = F, font_size = 10,position="left")%>%
+     row_spec(0,background="lightblue")
DBN SCHOOL.NAME Num.of.SAT.Test.Takers SAT.Critical.Reading.Avg..Score SAT.Math.Avg..Score SAT.Writing.Avg..Score
count 478 478 478 478 478 478
unique 478 478 175 164 173 163
top 15K464 MOTT HAVEN VILLAGE PREPARATORY HIGH SCHOOL s s s s
freq 1 1 57 57 57 57

We can see from the table that there are 57 instances of “s”, which represent missing values. I will eliminate them before moving forward.

> library(tidyverse)
> SATtib <- as_tibble(SATraw)
> SATtib <- SATtib %>% filter(Num.of.SAT.Test.Takers != "s")
> SATtib %>% glimpse()
Rows: 421
Columns: 6
$ DBN                             <chr> "01M292", "01M448", "01M450", "01M4...
$ SCHOOL.NAME                     <chr> "HENRY STREET SCHOOL FOR INTERNATIO...
$ Num.of.SAT.Test.Takers          <chr> "29", "91", "70", "7", "44", "112",...
$ SAT.Critical.Reading.Avg..Score <chr> "355", "383", "377", "414", "390", ...
$ SAT.Math.Avg..Score             <chr> "404", "423", "402", "401", "433", ...
$ SAT.Writing.Avg..Score          <chr> "363", "366", "370", "359", "384", ...

Unfortunately the “s” caused the numeric values to be parsed as characters. I converted them to integers and simplified the their variable names. I also eliminated the school name, which is already identified by DBN.

> SATtib <- SATtib %>% mutate(
+ NumTestTakers=as.integer(Num.of.SAT.Test.Takers),
+ ReadingAvgScore=as.integer(SAT.Critical.Reading.Avg..Score),
+ MathAvgScore=as.integer(SAT.Math.Avg..Score),
+ WritingAvgScore=as.integer(SAT.Writing.Avg..Score)) 
> 
> SATtib <- SATtib %>% select(-(2:6))
> SATtib %>% glimpse()
Rows: 421
Columns: 5
$ DBN             <chr> "01M292", "01M448", "01M450", "01M458", "01M509", "...
$ NumTestTakers   <int> 29, 91, 70, 7, 44, 112, 159, 18, 130, 16, 62, 53, 5...
$ ReadingAvgScore <int> 355, 383, 377, 414, 390, 332, 522, 417, 624, 395, 4...
$ MathAvgScore    <int> 404, 423, 402, 401, 433, 557, 574, 418, 604, 400, 3...
$ WritingAvgScore <int> 363, 366, 370, 359, 384, 316, 525, 411, 628, 387, 3...

The DBN value contains a letter in the third position. This letter represents the borough. I extracted it and created a new column.

> SATtib <- SATtib %>% mutate(Borough=str_extract(DBN,"[MXKQR]"))
> 
> SATtib[SATtib$Borough == "M","Borough"] <- "Manhattan"
> SATtib[SATtib$Borough == "X","Borough"] <- "Bronx"
> SATtib[SATtib$Borough == "K","Borough"] <- "Brooklyn"
> SATtib[SATtib$Borough == "Q","Borough"] <- "Queens"
> SATtib[SATtib$Borough == "R","Borough"] <- "StatenIsland"
> 
> SATtib  %>% group_by(Borough) %>% 
+   summarize(count=n()) %>% arrange(desc(count)) 
# A tibble: 5 x 2
  Borough      count
  <chr>        <int>
1 Brooklyn       123
2 Bronx          110
3 Manhattan      106
4 Queens          70
5 StatenIsland    12

Next, I imported the 2006-2012 School Demographics and Accountability Snapshot.

> Demographics<- read.csv("2006_-_2012_School_Demographics_and_Accountability_Snapshot.csv")
> Demotib <- as_tibble(Demographics)
> Demotib %>% glimpse
Rows: 10,075
Columns: 38
$ DBN               <chr> "01M015", "01M015", "01M015", "01M015", "01M015",...
$ Name              <chr> "P.S. 015 ROBERTO CLEMENTE", "P.S. 015 ROBERTO CL...
$ schoolyear        <int> 20052006, 20062007, 20072008, 20082009, 20092010,...
$ fl_percent        <chr> "89.4", "89.4", "89.4", "89.4", "   ", "   ", "",...
$ frl_percent       <dbl> NA, NA, NA, NA, 96.5, 96.5, 89.4, NA, NA, NA, NA,...
$ total_enrollment  <int> 281, 243, 261, 252, 208, 203, 189, 402, 312, 338,...
$ prek              <int> 15, 15, 18, 17, 16, 13, 13, 15, 13, 28, 33, 35, 3...
$ k                 <int> 36, 29, 43, 37, 40, 37, 31, 43, 37, 48, 44, 48, 5...
$ grade1            <int> 40, 39, 39, 44, 28, 35, 35, 55, 45, 46, 56, 49, 5...
$ grade2            <int> 33, 38, 36, 32, 32, 33, 28, 53, 52, 47, 44, 56, 4...
$ grade3            <int> 38, 34, 38, 34, 30, 30, 25, 68, 47, 53, 53, 39, 4...
$ grade4            <int> 52, 42, 47, 39, 24, 30, 28, 59, 61, 48, 48, 50, 4...
$ grade5            <int> 29, 46, 40, 49, 38, 25, 29, 64, 57, 68, 47, 44, 4...
$ grade6            <int> 38, NA, NA, NA, NA, NA, NA, 45, NA, NA, NA, NA, N...
$ grade7            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade8            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade9            <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade10           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade11           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade12           <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ ell_num           <int> 36, 38, 52, 48, 40, 30, 20, 37, 30, 40, 27, 35, 3...
$ ell_percent       <dbl> 12.8, 15.6, 19.9, 19.0, 19.2, 14.8, 10.6, 9.2, 9....
$ sped_num          <int> 57, 55, 60, 62, 46, 46, 40, 93, 72, 75, 70, 71, 6...
$ sped_percent      <dbl> 20.3, 22.6, 23.0, 24.6, 22.1, 22.7, 21.2, 23.1, 2...
$ ctt_num           <int> 25, 19, 20, 21, 14, 21, 23, 7, 13, 12, 15, 18, 17...
$ selfcontained_num <int> 9, 15, 14, 17, 14, 9, 7, 37, 22, 19, 19, 22, 17, ...
$ asian_num         <int> 10, 18, 16, 16, 16, 13, 12, 40, 30, 42, 39, 43, 4...
$ asian_per         <dbl> 3.6, 7.4, 6.1, 6.3, 7.7, 6.4, 6.3, 10.0, 9.6, 12....
$ black_num         <int> 74, 68, 77, 75, 67, 75, 63, 103, 70, 72, 83, 87, ...
$ black_per         <dbl> 26.3, 28.0, 29.5, 29.8, 32.2, 36.9, 33.3, 25.6, 2...
$ hispanic_num      <int> 189, 153, 157, 149, 118, 110, 109, 207, 172, 186,...
$ hispanic_per      <dbl> 67.3, 63.0, 60.2, 59.1, 56.7, 54.2, 57.7, 51.5, 5...
$ white_num         <int> 5, 4, 7, 7, 6, 4, 4, 39, 19, 22, 32, 35, 31, 28, ...
$ white_per         <dbl> 1.8, 1.6, 2.7, 2.8, 2.9, 2.0, 2.1, 9.7, 6.1, 6.5,...
$ male_num          <int> 158, 140, 143, 149, 124, 113, 97, 214, 157, 162, ...
$ male_per          <dbl> 56.2, 57.6, 54.8, 59.1, 59.6, 55.7, 51.3, 53.2, 5...
$ female_num        <int> 123, 103, 118, 103, 84, 90, 92, 188, 155, 176, 16...
$ female_per        <dbl> 43.8, 42.4, 45.2, 40.9, 40.4, 44.3, 48.7, 46.8, 4...

We can see that there are 10075 rows. Each school has 7 entries (schoolyear 20052006 to 20112012) and it contains all grade levels. We are only interested in schoolyear 20112012 and in the demographics percentages.

> Demotib <- Demotib %>% filter(schoolyear == 20112012) %>% 
+   select(DBN,
+          AsianPct = asian_per,
+          BlackPct = black_per,
+          HispanicPct = hispanic_per,
+          WhitePct = white_per,
+          MalePct = male_per,
+          FemalePct = female_per)
> 
> Demotib %>% glimpse
Rows: 1,509
Columns: 7
$ DBN         <chr> "01M015", "01M019", "01M020", "01M034", "01M063", "01M0...
$ AsianPct    <dbl> 6.3, 15.5, 30.4, 5.5, 5.1, 8.0, 12.5, 28.2, 18.3, 4.2, ...
$ BlackPct    <dbl> 33.3, 24.7, 8.8, 22.4, 23.3, 23.5, 13.2, 20.1, 24.8, 16...
$ HispanicPct <dbl> 57.7, 48.2, 57.0, 68.6, 62.5, 59.6, 44.1, 49.1, 53.5, 7...
$ WhitePct    <dbl> 2.1, 8.5, 2.6, 2.0, 8.5, 7.4, 28.2, 2.2, 3.0, 1.7, 2.8,...
$ MalePct     <dbl> 51.3, 44.8, 52.7, 50.9, 55.1, 56.8, 49.8, 48.0, 48.3, 5...
$ FemalePct   <dbl> 48.7, 55.2, 47.3, 49.1, 44.9, 43.2, 50.2, 52.0, 51.7, 4...

I then joined the two data sets by DBN value and removed 9 NA values.

> SAT <- SATtib %>% left_join(Demotib, by="DBN") %>% 
+   relocate(Borough, .after = DBN) %>% na.omit()
> 
> SAT %>% glimpse()
Rows: 412
Columns: 12
$ DBN             <chr> "01M292", "01M448", "01M450", "01M458", "01M509", "...
$ Borough         <chr> "Manhattan", "Manhattan", "Manhattan", "Manhattan",...
$ NumTestTakers   <int> 29, 91, 70, 7, 44, 112, 159, 18, 130, 16, 62, 53, 5...
$ ReadingAvgScore <int> 355, 383, 377, 414, 390, 332, 522, 417, 624, 395, 4...
$ MathAvgScore    <int> 404, 423, 402, 401, 433, 557, 574, 418, 604, 400, 3...
$ WritingAvgScore <int> 363, 366, 370, 359, 384, 316, 525, 411, 628, 387, 3...
$ AsianPct        <dbl> 14.0, 29.2, 9.7, 2.2, 9.3, 84.7, 27.8, 0.5, 15.1, 1...
$ BlackPct        <dbl> 29.1, 22.6, 23.9, 34.4, 31.6, 5.2, 11.7, 45.4, 15.1...
$ HispanicPct     <dbl> 53.8, 45.9, 55.4, 59.4, 56.9, 8.9, 14.2, 49.5, 18.2...
$ WhitePct        <dbl> 1.7, 2.3, 10.4, 3.6, 1.6, 0.9, 44.9, 4.1, 49.8, 6.3...
$ MalePct         <dbl> 61.4, 57.4, 54.7, 43.3, 46.3, 53.7, 49.2, 39.9, 31....
$ FemalePct       <dbl> 38.6, 42.6, 45.3, 56.7, 53.7, 46.3, 50.8, 60.1, 68....

Variables

The data set contains 412 NYC schools with the following variables:

Variable Definition (all data is for the year 2012)
DBN District Borough Number - school identifier
Borough New York City borough name
NumTestTakers Number of SAT test takers
ReadingAvgScore Average SAT reading score
MathAvgScore Average SAT math score
WritingAvgScore Average SAT writing score
AsianPct Percentage of school students who were Asian
BlackPct Percentage of school students who were Black
HispanicPct Percentage of school students who were Hispanic
WhitePct Percentage of school students who were White
MalePct Percentage of school students who were Male
FemalePct Percentage of school students who were Female

Data View

Below is a small subset of the data:

> kable(SAT[1:10,c(1,3:12)])%>%kable_styling(bootstrap_options = c("striped","condensed"),
+         full_width = F, font_size = 12,position="left")%>%
+                 row_spec(0,background="lightblue")
DBN NumTestTakers ReadingAvgScore MathAvgScore WritingAvgScore AsianPct BlackPct HispanicPct WhitePct MalePct FemalePct
01M292 29 355 404 363 14.0 29.1 53.8 1.7 61.4 38.6
01M448 91 383 423 366 29.2 22.6 45.9 2.3 57.4 42.6
01M450 70 377 402 370 9.7 23.9 55.4 10.4 54.7 45.3
01M458 7 414 401 359 2.2 34.4 59.4 3.6 43.3 56.7
01M509 44 390 433 384 9.3 31.6 56.9 1.6 46.3 53.7
01M515 112 332 557 316 84.7 5.2 8.9 0.9 53.7 46.3
01M539 159 522 574 525 27.8 11.7 14.2 44.9 49.2 50.8
01M650 18 417 418 411 0.5 45.4 49.5 4.1 39.9 60.1
01M696 130 624 604 628 15.1 15.1 18.2 49.8 31.3 68.7
02M047 16 395 400 387 1.7 32.2 59.2 6.3 42.5 57.5

Data Exploration

> #Python
+ SATpy = r.SAT
+ 
+ SATpy.iloc[:,2:6].describe()
       NumTestTakers  ReadingAvgScore  MathAvgScore  WritingAvgScore
count     412.000000       412.000000    412.000000       412.000000
mean      112.466019       400.968447    413.842233       394.521845
std       156.535939        57.036135     65.046956        58.829309
min         6.000000       279.000000    312.000000       286.000000
25%        43.000000       368.000000    371.750000       360.750000
50%        63.000000       391.000000    395.000000       382.000000
75%        98.250000       416.250000    437.000000       411.000000
max      1277.000000       679.000000    735.000000       682.000000
> SATpy.iloc[:,6:].describe()
         AsianPct    BlackPct  HispanicPct   WhitePct     MalePct   FemalePct
count  412.000000  412.000000   412.000000  412.00000  412.000000  412.000000
mean     9.462621   39.139320    42.876456    7.71335   49.786165   50.212864
std     14.812181   26.205454    24.525417   13.38467   13.224766   13.224115
min      0.000000    0.000000     2.400000    0.00000    0.000000    0.000000
25%      1.000000   20.050000    18.825000    0.90000   44.075000   45.100000
50%      2.950000   33.150000    43.650000    1.80000   49.800000   50.200000
75%      9.225000   56.725000    62.500000    7.42500   54.900000   55.925000
max     89.500000   95.700000   100.000000   82.10000  100.000000  100.000000
  • The test takers are almost evenly split between male and female.

  • The average scores are all around 400.

  • The number of test takers has the greatest range and standard deviation.

  • The race categories also vary widely and there is a surprisingly low white and asian percentage.

> SAT %>% mutate(RacePct = rowSums(.[,7:10])) %>% 
+     ggplot(mapping=aes(x=RacePct)) + geom_area(aes(y=..count..,),stat="bin", fill="cyan3")+
+ geom_vline(aes(xintercept=mean(RacePct)),linetype="dashed", size=1)+
+ scale_x_continuous(breaks=seq(from=88, to=101, by=1))+  
+ scale_y_continuous(breaks=seq(from=0, to=160, by=20))+ 
+ labs(x="Total Race Percentage")

We can see that the total race percentage does not always equal 100%. Values range from a little over 88% to a little over 100%, with a mean just over 99%. That could be the result of recording errors or from the difficulty in gathering the data, especially when many individuals belong to more than one category.

Math and Scores

> #Python
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ sns.pairplot(SATpy.iloc[:,[1,4,3,5]], 
+              hue='Borough', palette='hot');
+ plt.show()

We can see a very strong linear relationship between the average reading, writing, and math SAT scores, as expected. The score distributions are all right skewed.

Math and Races

> sns.pairplot(SATpy.iloc[:,[1,4,6,7,8,9]], 
+              hue='Borough', palette='hot');
+ plt.show()

There appears to be a positive linear relationship between the percentage of Asian and White students and the average math SAT score, although there’s a modest level of heteroskedasticity given the concentration of datapoints on the left.

We can also visualize a negative relationship between the percentage of Black and Hispanic students and the average math SAT score, but there is a greater level of heteroskedasticity present and the trend is not as clear.

Math and Gender

> sns.pairplot(SATpy.iloc[:,[1,4,10,11]], 
+              hue='Borough', palette='hot');
+ plt.show()

The relationship between gender and math scores does not appear to be linear.

Chapter 2 - A Simple Regression Model


I chose to take a closer look at how the percentage of Asian students relates to the average SAT math score, which we can visualize with a scatterplot.

> library(tidyverse)
> library(ggthemes)
> ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=MathAvgScore), 
+   color="blue", size=2)+
+   scale_x_continuous(breaks=seq(from=0, to=100, by=10))+
+   labs(x="Percentage of Asian Students", y="Average SAT Math Score")+theme_light()

Analysis of Scatterplot

Heteroscedasticity - We can see that there is a moderate level of heteroscedasticity present. There is more variability in the average SAT math score as we move away from the cluster of values on the left.

Curvature - Fortunately there is no evidence of curvature in the scatterplot.

Leverage Points - The points on the right are all far from the clustered values on the left \((\bar{x})\) and would be considered leverage points.

Outliers - A few points represent either a large avg. SAT math score (near 700) or a small avg. SAT math score (near 300), but I don’t think they would be considered outliers. They are not extreme relative to other datapoints.

Influential Points - Although there are leverage points they would not be considered influential. Their removal would not dramatically alter the slope of the \(\hat{y}\) line.

Transformations

Although sometimes transformations can help with heteroscedasticity, I did not find an adequate solution in this case. Some examples are presented below.

LOG Transformations

> library(gridExtra)
> 
> ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=log(MathAvgScore)), color="purple", size=2)+
+   labs(x="LOG Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light() ->p1
> 
> ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=MathAvgScore), color="purple", size=2)+
+   labs(x="Log Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2
> 
> grid.arrange(p1, p2, ncol = 2)

> ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=log(MathAvgScore)), color="purple", size=2)+
+   labs(x="Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light()


We can see that LOG transformations of the x values, the y values, or both did not solve the problem of heteroscedasticity, and in some cases made it worse.













Square Root Transformations

> ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=sqrt(MathAvgScore)), color="green4", size=2)+
+   labs(x="SQRT Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light() ->p1
> 
> ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=MathAvgScore), color="green4", size=2)+
+   labs(x="SQRT Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2
> 
> grid.arrange(p1, p2, ncol = 2)

> ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=sqrt(MathAvgScore)), color="green4", size=2)+
+   labs(x="Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light()


Similarly, square root transformations were not helpful.
















The Linear Regression Model

\[Y_i = \beta_0+\beta_1x_i+\epsilon\]

  1. The model does not view \(Y_i\) as a datapoint. The model views \(Y_i\) as a sub-population of values that share the same x value. For this to be possible the model assumes that we could get many fresh samples. In this regression \(Y_i\) represents a sub-population of average SAT math scores for a given percentage of Asian students, \(x_i\) .

  2. Like any sub-population, \(Y_i\) has a population average and a population variance. The first component, \(\beta_0+\beta_1x_1\), supplies the sub-population average. \(\beta_0\) represents the model intercept and \(\beta_1\) represents the model slope.

  3. The second component, \(\epsilon\), supplies the variability. We can think of \(\epsilon\) as the name of a bell curved population of \(Y\) values that happens to be centered at zero. The model assumes that this bell curve is added to each sub-population average and that every bell curve has the same standard deviation.

R Output for the fitted model

> lm.fit = lm(MathAvgScore~AsianPct, data=SAT)
> summary(lm.fit)

Call:
lm(formula = MathAvgScore ~ AsianPct, data = SAT)

Residuals:
     Min       1Q   Median       3Q      Max 
-164.163  -25.101   -6.255   19.754  212.155 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 385.0746     2.7485  140.10   <2e-16 ***
AsianPct      3.0401     0.1565   19.43   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47 on 410 degrees of freedom
Multiple R-squared:  0.4793,    Adjusted R-squared:  0.478 
F-statistic: 377.3 on 1 and 410 DF,  p-value: < 2.2e-16
> temp_var  <- predict(lm.fit, interval = "prediction")
> new_df <-cbind(SAT, temp_var)
> ggplot(new_df, aes(AsianPct,MathAvgScore))+geom_point()+
+   geom_line(aes(y=lwr),color="red",linetype="dashed", size=1)+
+   geom_line(aes(y=upr),color="red",linetype="dashed", size=1)+
+   geom_smooth(method = lm, se=TRUE)+
+   ggtitle("Fit Plot \n Blue Band = 95% Confidence Interval \n  Red Lines = 95% Prediction Interval")+
+   theme(plot.title = element_text(hjust = 0.5))

If we examine the regression plot above we can see that there are several linear lines presented. The \(\hat{y}\) line in the center provides us with a mean value of y given x. Surrounding the \(\hat{y}\) line we can see a shaded band, which represents the 95% confidence interval. The upper and lower values give us a range in which we are 95% confident that the population mean will fall for the same x value. The plot’s outer dotted lines represent the prediction interval, which is useful if we wanted to evaluate another sample with the same x value. It allows us to say that we are 95% confident that the value of y (not the mean value of y) will fall within that wider range.

Analysis of Output

The t-tests

In the regression model we can use a t-test to evaluate the predictive value of \(\beta_1\).

\(H_0:\beta_1=0\)

\(H_0:\beta_1\neq0\)

To evaluate the null hypothesis we calculate the t-statistic and compare it to the critical points from a t-density function with n-2 degrees of freedom.

\[t=\frac{\hat\beta_1-0}{SE(\hat\beta_1)}=\frac{3.0401}{0.1565}=19.43\] The t-critical value with 410 degrees of freedom and \(\alpha=0.05\) is 1.966

We reject the null hypothesis if the |t-stat| is greater than the t-critical value. Therefore, since the t-stat of 19.43 is greater than 1.966, we can reject the null hypothesis.

We are evaluating the value of \(\beta_1\), but we only have a value for \(b_1\). However, \(b_1\) is an unbiased estimator of \(\beta_1\).

Imagine we were evaluating the average SAT math score for the population of all schools in the US (let’s assume that there are 100,000 schools). If we were to take a sample of 100 schools we could calculate a single sample slope. Now imagine we took all possible samples from the population. Without replacement that would result in 100,000 C100 sample points in our sample space, which is an extremely large number. We could then calculate the slope for each sample in the sample space, resulting in a daughter population of sample slopes. The grand average of that daughter population would equal \(\beta_1\), so we can say that is an unbiased estimator. It is unbiased because the methodology is unbiased, not because ever b??? value would equal \(\beta_1\). The method is “on target” because the daughter population average is centered at \(\beta_1\).

The y-hat equation

The equation for the fitted model is as follows:

\[\hat{y_i} = 385.075 + 3.04(x_i)\]

Similar to how \(b_1\) is an unbiased estimator of \(\beta_1\), \(\hat{y_i}\) is an unbiased estimator of \(Y_i\). If we evaluated all possible samples (of the same size) from the population, we could calculate a daughter population of sample slopes and \(\hat{y_i}\) values. The grand average of the daughter population of \(\hat{y_i}\) values would equal \(Y_i\). We therefore say that it is an unbiased estimator.

Chapter 3 - Matrix Methods


Simple Linear Regression in Matrix Terms

We can use Matrix algebra to derive the regression equation, and some of the calculations are included below. The calculations are based on the following subset of data:

> SATMat <- SAT[c(42,174,405,375,218,96,347,180),c(4,6,5)]
> SATMat <- as.data.frame(SATMat)
> kable(SATMat)%>%kable_styling(position="left",full_width = F)
ReadingAvgScore WritingAvgScore MathAvgScore
679 682 735
632 649 688
635 636 682
612 596 660
587 587 659
605 588 654
621 638 651
636 636 648

The Average SAT Reading score will be used as the X matrix:

> Xmat=cbind(rep(1,8),SATMat[,1])
> Xmatd=as.data.frame(Xmat)
> kable(Xmatd, col.names = NULL)%>%kable_styling(full_width = F, position="left")
1 679
1 632
1 635
1 612
1 587
1 605
1 621
1 636

The Average SAT Math Score represents the Y vector:

> Y=SATMat[,3]
> Y=as.matrix(Y)
> kable(SATMat[,3], col.names = NULL)%>%kable_styling(full_width = F, position="left")
735
688
682
660
659
654
651
648

Matrix operations:

\(X^{'}X\)

> ab = as.data.frame(t(Xmat)%*%Xmat)
> kable(ab, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+   column_spec(1:2,background="yellow")
8 5007
5007 3138965

\((X^{'}X)^{-1}\)

> ac = as.data.frame(solve(t(Xmat)%*%Xmat))
> kable(ac, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+   column_spec(1:2,background="orange")
75.3273260 -0.1201555
-0.1201555 0.0001920

\(X^{'}\)

> ad=as.data.frame(t(Xmat))
> kable(ad, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+   column_spec(1:8,background="orange")
1 1 1 1 1 1 1 1
679 632 635 612 587 605 621 636

\(X(X^{'}X)^{-1}X^{'}\)

> ae=as.data.frame(Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))
> kable(ae, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+   column_spec(1:8,background="orchid")
0.6668187 0.1874685 0.2180653 -0.0165103 -0.2714838 -0.0879029 0.0752802 0.2282643
0.1874685 0.1322023 0.1357299 0.1086847 0.0792878 0.1004536 0.1192676 0.1369058
0.2180653 0.1357299 0.1409853 0.1006935 0.0568981 0.0884308 0.1164599 0.1427372
-0.0165103 0.1086847 0.1006935 0.1619592 0.2285522 0.1806052 0.1379856 0.0980298
-0.2714838 0.0792878 0.0568981 0.2285522 0.4151328 0.2807948 0.1613832 0.0494349
-0.0879029 0.1004536 0.0884308 0.1806052 0.2807948 0.2086583 0.1445370 0.0844232
0.0752802 0.1192676 0.1164599 0.1379856 0.1613832 0.1445370 0.1295625 0.1155240
0.2282643 0.1369058 0.1427372 0.0980298 0.0494349 0.0844232 0.1155240 0.1446810

\(\hat{y} = X(X^{'}X)^{-1}X^{'}y\)

> af=as.data.frame((Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
> kable(af, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+   column_spec(1,background="lightblue")
717.4402
677.3496
679.9085
660.2897
638.9650
654.3188
667.9667
680.7615

b-vectors \(= (X^{'}X)^{-1}X^{'}y\)

> ag=as.data.frame((solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
> kable(ag, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+   column_spec(1,background="yellow")
138.2590771
0.8529913

The matrix calculations match the regression output

> lm(MathAvgScore ~ ReadingAvgScore, data = SATMat)

Call:
lm(formula = MathAvgScore ~ ReadingAvgScore, data = SATMat)

Coefficients:
    (Intercept)  ReadingAvgScore  
        138.259            0.853  

We can compute the value of \(h_{2,3}\) in the hat matrix by the following formula:

x-value \((x-\bar{x})^2\)
679.00 2,822.27
632.00 37.52
635.00 83.27
612.00 192.52
587.00 1,511.27
605.00 435.77
621.00 23.77
636.00 102.52

\(\bar{x}=625.88\)
\(ss_x=5208.88\)
\(h_{2,3}=\left(\frac{1}{n}\right)+(x_2-\bar{x})(x_3-\bar{x})/ss_x\)
\(h_{2,3}=\left(\frac{1}{8}\right)+(632-625.88)(635-625.88)/5208.88 = 0.1357\)

The calculation matches the table.

Multiple Linear Regression in Matrix Terms

Using two regressors:

> Xmat=cbind(rep(1,8),SATMat[,1],SATMat[,2])
> Xmate=as.data.frame(cbind(rep(1,8),SATMat[,1],SATMat[,2]))
> kable(Xmate, col.names = NULL)%>%kable_styling(full_width = F, position="left")
1 679 682
1 632 649
1 635 636
1 612 596
1 587 587
1 605 588
1 621 638
1 636 636

Using R Matrix operations to fit the model:

> Rhat = as.data.frame(solve(t(Xmat)%*%Xmat)%*%t(Xmat)%*%Y)
> kable(Rhat, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+   column_spec(1,background="yellow")
134.4315052
0.9009849
-0.0418363
> lm(MathAvgScore~ReadingAvgScore+WritingAvgScore, data=SATMat)

Call:
lm(formula = MathAvgScore ~ ReadingAvgScore + WritingAvgScore, 
    data = SATMat)

Coefficients:
    (Intercept)  ReadingAvgScore  WritingAvgScore  
      134.43151          0.90098         -0.04184  

The regression output matches the matrix operations.

Chapter 4 - Model Selection


Before performing model selection I have renamed the variables for ease of use.

> SAT2 <- SAT %>% rename(NT = NumTestTakers,
+                        RS = ReadingAvgScore,
+                        MS = MathAvgScore,
+                        WS = WritingAvgScore,
+                        AP = AsianPct,
+                        BP = BlackPct,
+                        HP = HispanicPct,
+                        WP = WhitePct,
+                        MP = MalePct,
+                        FP = FemalePct)
Old Variable New Variable Description
MathAvgScore MS 2012 Average SAT math score for NYC schools
ReadingAvgScore RS 2012 Average SAT reading score for NYC schools
WritingAvgScore WS 2012 Average SAT writing score for NYC schools
NumTestTakers NT Number of test takers
AsianPct AP Percentage of Asian students in the school
BlackPct BP Percentage of Black students in the school
HispanicPct HP Percentage of Hispanic students in the school
WhitePct WP Percentage of White students in the school
MalePct MP Percentage of Male students in the school
FemalePct FP Percentage of Female students in the school

Stepwise Regression

Next, I used stepwise regression (iteratively adding and removing predictors until the best model is found).

> library(leaps)
> nullmod = lm(MS~1, data=SAT2) # Intercept only model
> fullmod = lm(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data=SAT2) # all parameters
> stepwisereg = step(nullmod, scope=list(lower=nullmod, upper=fullmod), direction='both')
Start:  AIC=3441.29
MS ~ 1

       Df Sum of Sq     RSS    AIC
+ WS    1   1394931  344054 2775.7
+ RS    1   1353411  385574 2822.7
+ AP    1    833423  905562 3174.5
+ WP    1    672844 1066140 3241.7
+ NT    1    486543 1252442 3308.1
+ BP    1    261348 1477637 3376.2
+ HP    1    213384 1525601 3389.4
<none>              1738985 3441.3
+ FP    1      1116 1737869 3443.0
+ MP    1      1111 1737874 3443.0

Step:  AIC=2775.74
MS ~ WS

       Df Sum of Sq     RSS    AIC
+ AP    1    179901  164152 2472.9
+ NT    1     37466  306588 2730.2
+ BP    1     35839  308215 2732.4
+ MP    1     25604  318450 2745.9
+ FP    1     25600  318454 2745.9
+ WP    1      4395  339658 2772.4
+ HP    1      3888  340166 2773.1
+ RS    1      3768  340286 2773.2
<none>               344054 2775.7
- WS    1   1394931 1738985 3441.3

Step:  AIC=2472.86
MS ~ WS + AP

       Df Sum of Sq    RSS    AIC
+ MP    1     11190 152963 2445.8
+ FP    1     11189 152964 2445.8
+ RS    1      9315 154837 2450.8
+ WP    1      1836 162316 2470.2
+ BP    1      1799 162353 2470.3
+ NT    1      1429 162723 2471.3
<none>              164152 2472.9
+ HP    1       647 163505 2473.2
- AP    1    179901 344054 2775.7
- WS    1    741410 905562 3174.5

Step:  AIC=2445.77
MS ~ WS + AP + MP

       Df Sum of Sq    RSS    AIC
+ RS    1      6431 146531 2430.1
+ BP    1      1854 151108 2442.7
+ WP    1      1476 151487 2443.8
+ NT    1       976 151987 2445.1
+ HP    1       791 152172 2445.6
<none>              152963 2445.8
+ FP    1        33 152930 2447.7
- MP    1     11190 164152 2472.9
- AP    1    165487 318450 2745.9
- WS    1    747578 900540 3174.2

Step:  AIC=2430.08
MS ~ WS + AP + MP + RS

       Df Sum of Sq    RSS    AIC
+ BP    1      3517 143014 2422.1
+ HP    1      1869 144663 2426.8
+ WP    1      1789 144743 2427.0
+ NT    1      1035 145496 2429.2
<none>              146531 2430.1
+ FP    1        69 146463 2431.9
- RS    1      6431 152963 2445.8
- MP    1      8306 154837 2450.8
- WS    1     20049 166581 2480.9
- AP    1    170512 317044 2746.1

Step:  AIC=2422.07
MS ~ WS + AP + MP + RS + BP

       Df Sum of Sq    RSS    AIC
+ NT    1       787 142228 2421.8
<none>              143014 2422.1
+ WP    1       602 142412 2422.3
+ HP    1       523 142491 2422.6
+ FP    1        88 142926 2423.8
- BP    1      3517 146531 2430.1
- MP    1      8027 151041 2442.6
- RS    1      8094 151108 2442.8
- WS    1     15913 158927 2463.5
- AP    1    135819 278834 2695.2

Step:  AIC=2421.79
MS ~ WS + AP + MP + RS + BP + NT

       Df Sum of Sq    RSS    AIC
<none>              142228 2421.8
- NT    1       787 143014 2422.1
+ WP    1       432 141796 2422.5
+ HP    1       382 141845 2422.7
+ FP    1        92 142136 2423.5
- BP    1      3269 145496 2429.2
- MP    1      7688 149915 2441.5
- RS    1      8084 150312 2442.6
- WS    1     15337 157565 2462.0
- AP    1    115280 257508 2664.4
> summary(stepwisereg)

Call:
lm(formula = MS ~ WS + AP + MP + RS + BP + NT, data = SAT2)

Residuals:
    Min      1Q  Median      3Q     Max 
-65.209  -9.906  -0.890  10.668 106.919 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 57.161880   8.876950   6.439 3.40e-10 ***
WS           0.480869   0.072765   6.609 1.23e-10 ***
AP           1.444475   0.079726  18.118  < 2e-16 ***
MP           0.339625   0.072588   4.679 3.94e-06 ***
RS           0.349090   0.072760   4.798 2.26e-06 ***
BP          -0.122270   0.040078  -3.051  0.00243 ** 
NT           0.010680   0.007136   1.497  0.13527    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18.74 on 405 degrees of freedom
Multiple R-squared:  0.9182,    Adjusted R-squared:  0.917 
F-statistic: 757.8 on 6 and 405 DF,  p-value: < 2.2e-16

The resulting best model had an AIC value of 2421.79 and and adjusted R-squared value of 0.917

> par(mfrow=c(2,2))
> plot(stepwisereg)

We can see that there are some outliers and high leverage points, but the number is not excessive.

There is some heteroscedasticity of the residuals, mostly because the data is clustered around the averages at the left of the plots.

> #plot residuals by index
> res = residuals(stepwisereg)
> res2 = as.data.frame(res)
> resid = as.numeric(rownames(res2))
> ggplot(res2)+geom_point(aes(x=resid,y=res), color="black",
+   fill="gold", pch=21, size=2)+
+   labs(y="Residuals", x="Index")

> #plot residual histogram
> ggplot(res2)+geom_histogram(aes(x=res), binwidth = 5,
+   fill="gold", color="red")+
+   scale_x_continuous(breaks = seq(from=-70, to=110, by=10))+
+   theme(axis.text.x = element_text(angle = 90))+
+   labs(x="residuals")

> #display the corresponding rows where the 
> #absolute value of residuals is >50
> kable(SAT2[abs(res)>50,3:12])%>%kable_styling(
+   bootstrap_options=c("striped","condensed", full_width=F))
NT RS MS WS AP BP HP WP MP FP
112 332 557 316 84.7 5.2 8.9 0.9 53.7 46.3
79 319 512 357 62.4 7.2 26.4 4.0 51.2 48.8
121 345 517 343 35.3 31.0 28.3 5.0 56.1 43.9
12 416 403 381 48.7 33.2 16.6 1.5 47.7 52.3
9 439 374 418 2.7 88.9 7.1 0.4 99.6 0.4
20 441 499 413 7.5 58.2 26.3 4.5 79.4 20.6
143 323 475 329 48.3 5.6 43.5 2.6 53.2 46.8

When we examine the most extreme residuals it is usally possible to pinpoint the cause. For example the Asian Percentage of 84.7% is unusually high.

Variance Inflation

The Variance Inflation Factor is a measurement that is used to detect collinearity among the regressors. It is calculated by regressing each variable against the other variables in the model, computing the \(R^2\), subtracting the resulting \(R^2\) from 1, and taking the inverse. \(VIF_K=(1-R^2_K)^{-1}\)

A high level of collinearity could make the model unreliable. It can increase the standard error of the estimated coefficients making the t-statistics faulty. It can also cause the individual regressor coefficients to change erratically.

> library(car)
> vif(stepwisereg) # calculate variance inflation factor
       WS        AP        MP        RS        BP        NT 
21.445989  1.632100  1.078509 20.155587  1.290958  1.460455 

There are no extreme results.

RS and WS have the highest values, likely because the test scores have a strong linear relationship.

Cook’s D

Cook’s distance measures the influence of an observation on the regression model. An observation with a high Cook’s distance would meaningfully change the slope of the regression line if included. It can be used to detect leverage points.

It can be computed by:

\[D_i=\frac{ \sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2 }{p(MSE)}\]

\(D_i\) is Cook’s distance for the i^th observation
\(\hat{y}_j\) is the full regression model’s estimate for the j^th observation
\(\hat{y}_{j(i)}\) is the regression model’s (fitted without the i^th observation) estimate for the j^th observation
\(p\) is the number of regressors
\(MSE\) is the mean squared error of the full regression model

> #plot Cook's D
> cooks <- cooks.distance(stepwisereg)
> cooks <- as.data.frame(cooks)
> cooksid <-  as.numeric(rownames(cooks))
> ggplot(cooks)+geom_point(aes(x=cooksid,y=cooks), color="black",
+                         fill="gold", pch=21, size=2)+
+   labs(y="Cook's D", x="Index")

> #display the corresponding rows where the value of cook's D is >0.05
> kable(SAT2[cooks>.05,3:12])%>%kable_styling(bootstrap_options=c("striped","condensed", full_width=F))
NT RS MS WS AP BP HP WP MP FP
112 332 557 316 84.7 5.2 8.9 0.9 53.7 46.3
79 319 512 357 62.4 7.2 26.4 4.0 51.2 48.8
121 345 517 343 35.3 31.0 28.3 5.0 56.1 43.9
12 416 403 381 48.7 33.2 16.6 1.5 47.7 52.3
9 439 374 418 2.7 88.9 7.1 0.4 99.6 0.4
143 323 475 329 48.3 5.6 43.5 2.6 53.2 46.8

Most large Cook’s D values appear to have low math scores relative to a high percentage of Asian students.

Chapter 5 - Cross Validation


Validation Set

The regression model provides us with a \(\hat{y}\) equation that is designed to be optimal for our dataset, but the true value lies in how well it will work with new data. Since fresh data is often unavailable we can simply split our existing dataset into two pieces. One piece will be used to calculate the regression equation, and the other piece will provide us with our out-of-sample values.

We start by randomizing the data and selecting our “training sample,” which is used to calculate the \(\hat{y}\) equation. We then use the remaining rows as our “validation” sample. We can evaluate the RMSE value to get a sense of how well the model might work with out-of-sample data. When comparing two models, the one that produces the lowest test sample RMSE is the preferred model.

> library(caret)
> set.seed(1234)
> # Split the data into training and test set
> training.samples <- SAT2$MS %>% createDataPartition(p = 0.7, list = FALSE)
> train.data  <- SAT2[training.samples, ]
> test.data <- SAT2[-training.samples, ]
> # Build the model (using selected model)
> model <- lm(MS~WS+AP+MP+RS+BP+NT, data = train.data)
> # Make predictions and compute the R2, RMSE and MAE
> predictions <- model %>% predict(test.data)
> data.frame( R2 = R2(predictions, test.data$MS),
+             RMSE = RMSE(predictions, test.data$MS),
+             MAE = MAE(predictions, test.data$MS))
         R2     RMSE      MAE
1 0.8967801 20.36794 13.84854

Leave One Out Cross Validation - LOOCV

  • Leave one datapoint out
  • Build the model on the remaining datapoints
  • Calculate and record the error associated with the datapoint that was excluded
  • Repeat the process for all data points
  • Calculate the average of all the test errors
> # Define training control
> train.control <- trainControl(method = "LOOCV")
> # Train the model
> model <- train(MS~WS+AP+MP+RS+BP+NT, data = SAT2, 
+                method = "lm",
+                trControl = train.control)
> # Summarize the results
> print(model)
Linear Regression 

412 samples
  6 predictor

No pre-processing
Resampling: Leave-One-Out Cross-Validation 
Summary of sample sizes: 411, 411, 411, 411, 411, 411, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  19.15126  0.9131149  13.83611

Tuning parameter 'intercept' was held constant at a value of TRUE

K-fold cross-validation

This is similar to leave one out, except the data is split into k-subsets. 1 subset is used for testing and the remainder are used for training. The process is repeated until all the subsets are tested.

> # Define training control
> set.seed(1234) 
> train.control <- trainControl(method = "cv", number = 10)
> # Train the model
> model <- train(MS~WS+AP+MP+RS+BP+NT, data = SAT2, 
+                method = "lm",
+                trControl = train.control)
> # Summarize the results
> print(model)
Linear Regression 

412 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 371, 372, 369, 370, 371, 371, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  18.88628  0.9102345  13.82922

Tuning parameter 'intercept' was held constant at a value of TRUE

Repeated K-fold cross-validation

We can also repeat k-fold cross validation multiple times

> # Define training control
> set.seed(1234)
> train.control <- trainControl(method = "repeatedcv", 
+                               number = 10, repeats = 3)
> # Train the model
> model <- train(MS~WS+AP+MP+RS+BP+NT, data = SAT2, 
+                method = "lm",
+                trControl = train.control)
> # Summarize the results
> print(model)
Linear Regression 

412 samples
  6 predictor

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times) 
Summary of sample sizes: 371, 372, 369, 370, 371, 371, ... 
Resampling results:

  RMSE      Rsquared   MAE     
  18.82951  0.9110351  13.84007

Tuning parameter 'intercept' was held constant at a value of TRUE

Chapter 6 - Python Implementation


Model Selection

> from sklearn.linear_model import LinearRegression
+ from sklearn.feature_selection import RFE
+ import statsmodels.api as sm

Recursive Feature Elimination:

Recursive Feature Elimination (RFE) selects features by considering a smaller and smaller set of regressors. The starting point is the full set of regressors and variables are recursively pruned from the initial set. The procedure is repeated until a desired set of features remain. That number can either be a priori specified, or can be found using cross validation.

> SAT2 = r.SAT2
+ X = SAT2.drop(columns=['DBN','Borough','MS'])
+ y = SAT2.iloc[:,4]
+ names=pd.DataFrame(X.columns)
> #use linear regression as the model
+ lin_reg = LinearRegression()
> #This is to select 6 variables: 
+ #can be changed and checked in model for accuracy
+ rfe_mod = RFE(lin_reg, 6, step=1)  
+ myvalues=rfe_mod.fit(X,y) #to fit
+ myvalues.support_ #The mask of selected features.
array([False, False,  True,  True, False,  True,  True,  True,  True])
> myvalues.ranking_  
+ #The feature ranking, such that ranking_[i] corresponds to the 
+ #ranking position of the i-th feature. Selected 
+ #(i.e., estimated best) features are assigned rank 1.
array([4, 3, 1, 1, 2, 1, 1, 1, 1])
> rankings=pd.DataFrame(myvalues.ranking_) 
+ #Make it into data frame
> #Concat and name columns
+ ranked=pd.concat([names,rankings], axis=1)
+ ranked.columns = ["Feature", "Rank"]
+ ranked
  Feature  Rank
0      NT     4
1      RS     3
2      WS     1
3      AP     1
4      BP     2
5      HP     1
6      WP     1
7      MP     1
8      FP     1
> #Select most important (Only 1's)
+ most_important = ranked.loc[ranked['Rank'] ==1] 
+ print(most_important)
  Feature  Rank
2      WS     1
3      AP     1
5      HP     1
6      WP     1
7      MP     1
8      FP     1
> most_important['Rank'].count()
6

The six selected variables are different from the one’s selected by stepwise regression using R.

Method Language Variables
Stepwise Regression R WS, AP, MP, RS, BP, NT
Recursive Feature Elimination Python WS, AP, MP, HP, WP, FP

Model performance was similar, with an adjusted \(R^2\) of 0.912 (Python) versus 0.917 (R).

Lasso (L1) Based Feature Selection:

Several models are designed to reduce the number of features. One of the shrinkage methods - Lasso - for example reduces several coefficients to zero leaving only features that are truly important.

Sci-Kit offers SelectFromModel as a tool to run embedded models for feature selection. The module makes use of a threshold parameter, which can be either user specified or heuristically set based on median or mean. Below, the code uses Lasso (L1 penalty) to find features for inclusion. I set the threshold to 0.25, which results in three features being selected.

> from sklearn.feature_selection import SelectFromModel
+ from sklearn.linear_model import LassoCV
> # Use L1 penalty
+ estimator = LassoCV(cv=5, normalize = True)
> # Set a minimum threshold of 0.25
+ sfm = SelectFromModel(estimator, threshold=0.25,
+                       prefit=False, norm_order=1, max_features=None)
> sfm.fit(X, y)
SelectFromModel(estimator=LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001,
                                  fit_intercept=True, max_iter=1000,
                                  n_alphas=100, n_jobs=None, normalize=True,
                                  positive=False, precompute='auto',
                                  random_state=None, selection='cyclic',
                                  tol=0.0001, verbose=False),
                max_features=None, norm_order=1, prefit=False, threshold=0.25)
> feature_idx = sfm.get_support()
+ feature_name = X.columns[feature_idx]
+ feature_name
Index(['RS', 'WS', 'AP'], dtype='object')

The ideal model included just 3 variables, RS, WS, and AP.

Validation Set

I chose to proceed with the Lasso model for cross validation and I used the validation set approach (train/test).

> # ideal regressors
+ X = SAT2.loc[:,['RS','WS','AP']]
> #split data into test/train
+ from sklearn.model_selection import train_test_split
+ 
+ X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.3, random_state=101)
+ 
+ lm = LinearRegression()
+ lm.fit(X_train,y_train)
+ #build model on training data
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
> #fit model to test data
+ #display output
+ import warnings
+ warnings.filterwarnings('ignore')
+ X = X_test
+ X2 = sm.add_constant(X)
+ est = sm.OLS(y_test, X2)
+ est2 = est.fit()
+ print(est2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     MS   R-squared:                       0.915
Model:                            OLS   Adj. R-squared:                  0.913
Method:                 Least Squares   F-statistic:                     428.7
Date:                Fri, 09 Oct 2020   Prob (F-statistic):           6.27e-64
Time:                        12:47:45   Log-Likelihood:                -545.09
No. Observations:                 124   AIC:                             1098.
Df Residuals:                     120   BIC:                             1109.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        101.6956     14.253      7.135      0.000      73.476     129.916
RS             0.5018      0.147      3.408      0.001       0.210       0.793
WS             0.2342      0.150      1.566      0.120      -0.062       0.530
AP             2.0807      0.141     14.764      0.000       1.802       2.360
==============================================================================
Omnibus:                       16.843   Durbin-Watson:                   2.091
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               44.340
Skew:                           0.427   Prob(JB):                     2.35e-10
Kurtosis:                       5.802   Cond. No.                     4.45e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

This model, with only 3 variables, performed similar to both of the models with 6 variables. Generally, simpler is better when model performance is close, and so I would recommend the Lasso approach for this data.

> # test predictions
+ predictions = lm.predict(X_test)

We can plot the actual values versus the predictions. In this case the result is nearly linear.

> # Actual vs Predictions
+ plt.figure(figsize=(6,4))
+ sns.set_style('darkgrid')
+ sns.scatterplot(x=y_test, y=predictions, color='red');
+ plt.xlabel('Actual Math Scores');
+ plt.ylabel('Predicted Math Scores');
+ plt.show()

The residuals closely resemble a normal distribution with the addition of some outliers.

> #residual histogram
+ plt.figure(figsize=(6,4))
+ sns.distplot((y_test-predictions),bins=50, color='red');
+ plt.xlabel('Residuals');
+ plt.show()

Chapter 7 - Categorical Variable


Adding a categorical variable, such as Borough, is easy with R. The regression function automatically converts factors and characters into dummy variables.

> SAT2$Borough <- as.factor(SAT2$Borough)
> summary(lm(MS~Borough+RS+AP+WS, data=SAT2))

Call:
lm(formula = MS ~ Borough + RS + AP + WS, data = SAT2)

Residuals:
    Min      1Q  Median      3Q     Max 
-79.681 -11.238   0.709  10.740 100.475 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         74.22729    7.40677  10.022  < 2e-16 ***
BoroughBrooklyn     -1.18264    2.58729  -0.457   0.6478    
BoroughManhattan     1.64475    2.75195   0.598   0.5504    
BoroughQueens       -7.35978    3.35454  -2.194   0.0288 *  
BoroughStatenIsland 10.60954    6.26756   1.693   0.0913 .  
RS                   0.37495    0.07368   5.089 5.52e-07 ***
AP                   1.69935    0.07922  21.450  < 2e-16 ***
WS                   0.44114    0.07247   6.087 2.68e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.29 on 404 degrees of freedom
Multiple R-squared:  0.9135,    Adjusted R-squared:  0.9121 
F-statistic: 609.9 on 7 and 404 DF,  p-value: < 2.2e-16

As we can see, adding Borough did not improve the model and most of the individual boroughs have large p-values (>0.05). I would not recommend its inclusion.

Adding a categorical variable is a little more complicated in Python because they must be converted to dummy variables first.

One of the dummy variables must be dropped because its value can always be predicted from the remaining dummy variables.

> # must drop one
+ borough = pd.get_dummies(SAT2['Borough'], drop_first=True)
+ borough.head()
   Brooklyn  Manhattan  Queens  StatenIsland
0         0          1       0             0
1         0          1       0             0
2         0          1       0             0
3         0          1       0             0
4         0          1       0             0
> #drop old Borough variable
+ SAT2.drop(['Borough'], axis=1, inplace=True)
+ #combine SAT with dummy variables
+ SAT2 = pd.concat([SAT2,borough],
+                  axis=1)
> SAT3 <- py$SAT2
> SAT3 %>% glimpse()
Rows: 412
Columns: 15
$ DBN          <chr> "01M292", "01M448", "01M450", "01M458", "01M509", "01M...
$ NT           <dbl> 29, 91, 70, 7, 44, 112, 159, 18, 130, 16, 62, 53, 58, ...
$ RS           <dbl> 355, 383, 377, 414, 390, 332, 522, 417, 624, 395, 409,...
$ MS           <dbl> 404, 423, 402, 401, 433, 557, 574, 418, 604, 400, 393,...
$ WS           <dbl> 363, 366, 370, 359, 384, 316, 525, 411, 628, 387, 392,...
$ AP           <dbl> 14.0, 29.2, 9.7, 2.2, 9.3, 84.7, 27.8, 0.5, 15.1, 1.7,...
$ BP           <dbl> 29.1, 22.6, 23.9, 34.4, 31.6, 5.2, 11.7, 45.4, 15.1, 3...
$ HP           <dbl> 53.8, 45.9, 55.4, 59.4, 56.9, 8.9, 14.2, 49.5, 18.2, 5...
$ WP           <dbl> 1.7, 2.3, 10.4, 3.6, 1.6, 0.9, 44.9, 4.1, 49.8, 6.3, 4...
$ MP           <dbl> 61.4, 57.4, 54.7, 43.3, 46.3, 53.7, 49.2, 39.9, 31.3, ...
$ FP           <dbl> 38.6, 42.6, 45.3, 56.7, 53.7, 46.3, 50.8, 60.1, 68.7, ...
$ Brooklyn     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Manhattan    <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Queens       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ StatenIsland <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
> X = SAT2.drop(columns=['DBN','NT','MS','BP','HP',
+                        'WP','MP','FP'])
+ y = SAT2.iloc[:,3]
> X2 = sm.add_constant(X)
+ est = sm.OLS(y, X2)
+ est2 = est.fit()
+ print(est2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                     MS   R-squared:                       0.914
Model:                            OLS   Adj. R-squared:                  0.912
Method:                 Least Squares   F-statistic:                     609.9
Date:                Fri, 09 Oct 2020   Prob (F-statistic):          2.40e-210
Time:                        12:47:46   Log-Likelihood:                -1799.9
No. Observations:                 412   AIC:                             3616.
Df Residuals:                     404   BIC:                             3648.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const           74.2273      7.407     10.022      0.000      59.667      88.788
RS               0.3749      0.074      5.089      0.000       0.230       0.520
WS               0.4411      0.072      6.087      0.000       0.299       0.584
AP               1.6994      0.079     21.450      0.000       1.544       1.855
Brooklyn        -1.1826      2.587     -0.457      0.648      -6.269       3.904
Manhattan        1.6448      2.752      0.598      0.550      -3.765       7.055
Queens          -7.3598      3.355     -2.194      0.029     -13.954      -0.765
StatenIsland    10.6095      6.268      1.693      0.091      -1.712      22.931
==============================================================================
Omnibus:                       37.358   Durbin-Watson:                   2.010
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              134.757
Skew:                           0.298   Prob(JB):                     5.47e-30
Kurtosis:                       5.738   Cond. No.                     4.51e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.51e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Once Borough is converted to a dummy variable we are able to produce the same results from Python.