This is an update from a previous report:
https://rpubs.com/pjozefek/574750
Chapters 6 and 7 were added, formatting was changed, and some errors were corrected.
In this report I will be evaluating the influence of race, gender, and other variables on NYC’s SAT math scores from 2012. The SAT plays an important role in determining the academic and career path for many students, and mathematical achievement seems particularly important in the current environment. Technology, engineering, finance, and science have all become the most lucrative sectors in which to work, and a strong background in math is necessary to pursue these fields.
I gathered the SAT and school data from NYC Open Data, which is a repository for public data generated by various New York City agencies . I was able to link two datasets by a common identifier, DBN (District Borough Number), which is unique to each school.
The file 2012 SAT Results can be found at:
https://data.cityofnewyork.us/Education/2012-SAT-Results/f9bf-2cp4
The file 2006-2012 School Demographics and Accountability Snapshot can be found at:
https://data.cityofnewyork.us/Education/2006-2012-School-Demographics-and-Accountability-S/ihfw-zy9j
I began by importing the 2012 SAT results.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 6 columns):
DBN 478 non-null object
SCHOOL.NAME 478 non-null object
Num.of.SAT.Test.Takers 478 non-null object
SAT.Critical.Reading.Avg..Score 478 non-null object
SAT.Math.Avg..Score 478 non-null object
SAT.Writing.Avg..Score 478 non-null object
dtypes: object(6)
memory usage: 22.5+ KB
There are 478 rows and 6 columns.
> library(knitr)
> library(kableExtra)
> kable(py$SATdesc)%>%kable_styling(bootstrap_options = c("striped","condensed"),
+ full_width = F, font_size = 10,position="left")%>%
+ row_spec(0,background="lightblue")| DBN | SCHOOL.NAME | Num.of.SAT.Test.Takers | SAT.Critical.Reading.Avg..Score | SAT.Math.Avg..Score | SAT.Writing.Avg..Score | |
|---|---|---|---|---|---|---|
| count | 478 | 478 | 478 | 478 | 478 | 478 |
| unique | 478 | 478 | 175 | 164 | 173 | 163 |
| top | 15K464 | MOTT HAVEN VILLAGE PREPARATORY HIGH SCHOOL | s | s | s | s |
| freq | 1 | 1 | 57 | 57 | 57 | 57 |
We can see from the table that there are 57 instances of “s”, which represent missing values. I will eliminate them before moving forward.
> library(tidyverse)
> SATtib <- as_tibble(SATraw)
> SATtib <- SATtib %>% filter(Num.of.SAT.Test.Takers != "s")
> SATtib %>% glimpse()Rows: 421
Columns: 6
$ DBN <chr> "01M292", "01M448", "01M450", "01M4...
$ SCHOOL.NAME <chr> "HENRY STREET SCHOOL FOR INTERNATIO...
$ Num.of.SAT.Test.Takers <chr> "29", "91", "70", "7", "44", "112",...
$ SAT.Critical.Reading.Avg..Score <chr> "355", "383", "377", "414", "390", ...
$ SAT.Math.Avg..Score <chr> "404", "423", "402", "401", "433", ...
$ SAT.Writing.Avg..Score <chr> "363", "366", "370", "359", "384", ...
Unfortunately the “s” caused the numeric values to be parsed as characters. I converted them to integers and simplified the their variable names. I also eliminated the school name, which is already identified by DBN.
> SATtib <- SATtib %>% mutate(
+ NumTestTakers=as.integer(Num.of.SAT.Test.Takers),
+ ReadingAvgScore=as.integer(SAT.Critical.Reading.Avg..Score),
+ MathAvgScore=as.integer(SAT.Math.Avg..Score),
+ WritingAvgScore=as.integer(SAT.Writing.Avg..Score))
>
> SATtib <- SATtib %>% select(-(2:6))
> SATtib %>% glimpse()Rows: 421
Columns: 5
$ DBN <chr> "01M292", "01M448", "01M450", "01M458", "01M509", "...
$ NumTestTakers <int> 29, 91, 70, 7, 44, 112, 159, 18, 130, 16, 62, 53, 5...
$ ReadingAvgScore <int> 355, 383, 377, 414, 390, 332, 522, 417, 624, 395, 4...
$ MathAvgScore <int> 404, 423, 402, 401, 433, 557, 574, 418, 604, 400, 3...
$ WritingAvgScore <int> 363, 366, 370, 359, 384, 316, 525, 411, 628, 387, 3...
The DBN value contains a letter in the third position. This letter represents the borough. I extracted it and created a new column.
> SATtib <- SATtib %>% mutate(Borough=str_extract(DBN,"[MXKQR]"))
>
> SATtib[SATtib$Borough == "M","Borough"] <- "Manhattan"
> SATtib[SATtib$Borough == "X","Borough"] <- "Bronx"
> SATtib[SATtib$Borough == "K","Borough"] <- "Brooklyn"
> SATtib[SATtib$Borough == "Q","Borough"] <- "Queens"
> SATtib[SATtib$Borough == "R","Borough"] <- "StatenIsland"
>
> SATtib %>% group_by(Borough) %>%
+ summarize(count=n()) %>% arrange(desc(count)) # A tibble: 5 x 2
Borough count
<chr> <int>
1 Brooklyn 123
2 Bronx 110
3 Manhattan 106
4 Queens 70
5 StatenIsland 12
Next, I imported the 2006-2012 School Demographics and Accountability Snapshot.
> Demographics<- read.csv("2006_-_2012_School_Demographics_and_Accountability_Snapshot.csv")
> Demotib <- as_tibble(Demographics)
> Demotib %>% glimpseRows: 10,075
Columns: 38
$ DBN <chr> "01M015", "01M015", "01M015", "01M015", "01M015",...
$ Name <chr> "P.S. 015 ROBERTO CLEMENTE", "P.S. 015 ROBERTO CL...
$ schoolyear <int> 20052006, 20062007, 20072008, 20082009, 20092010,...
$ fl_percent <chr> "89.4", "89.4", "89.4", "89.4", " ", " ", "",...
$ frl_percent <dbl> NA, NA, NA, NA, 96.5, 96.5, 89.4, NA, NA, NA, NA,...
$ total_enrollment <int> 281, 243, 261, 252, 208, 203, 189, 402, 312, 338,...
$ prek <int> 15, 15, 18, 17, 16, 13, 13, 15, 13, 28, 33, 35, 3...
$ k <int> 36, 29, 43, 37, 40, 37, 31, 43, 37, 48, 44, 48, 5...
$ grade1 <int> 40, 39, 39, 44, 28, 35, 35, 55, 45, 46, 56, 49, 5...
$ grade2 <int> 33, 38, 36, 32, 32, 33, 28, 53, 52, 47, 44, 56, 4...
$ grade3 <int> 38, 34, 38, 34, 30, 30, 25, 68, 47, 53, 53, 39, 4...
$ grade4 <int> 52, 42, 47, 39, 24, 30, 28, 59, 61, 48, 48, 50, 4...
$ grade5 <int> 29, 46, 40, 49, 38, 25, 29, 64, 57, 68, 47, 44, 4...
$ grade6 <int> 38, NA, NA, NA, NA, NA, NA, 45, NA, NA, NA, NA, N...
$ grade7 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade8 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade9 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade10 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade11 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ grade12 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
$ ell_num <int> 36, 38, 52, 48, 40, 30, 20, 37, 30, 40, 27, 35, 3...
$ ell_percent <dbl> 12.8, 15.6, 19.9, 19.0, 19.2, 14.8, 10.6, 9.2, 9....
$ sped_num <int> 57, 55, 60, 62, 46, 46, 40, 93, 72, 75, 70, 71, 6...
$ sped_percent <dbl> 20.3, 22.6, 23.0, 24.6, 22.1, 22.7, 21.2, 23.1, 2...
$ ctt_num <int> 25, 19, 20, 21, 14, 21, 23, 7, 13, 12, 15, 18, 17...
$ selfcontained_num <int> 9, 15, 14, 17, 14, 9, 7, 37, 22, 19, 19, 22, 17, ...
$ asian_num <int> 10, 18, 16, 16, 16, 13, 12, 40, 30, 42, 39, 43, 4...
$ asian_per <dbl> 3.6, 7.4, 6.1, 6.3, 7.7, 6.4, 6.3, 10.0, 9.6, 12....
$ black_num <int> 74, 68, 77, 75, 67, 75, 63, 103, 70, 72, 83, 87, ...
$ black_per <dbl> 26.3, 28.0, 29.5, 29.8, 32.2, 36.9, 33.3, 25.6, 2...
$ hispanic_num <int> 189, 153, 157, 149, 118, 110, 109, 207, 172, 186,...
$ hispanic_per <dbl> 67.3, 63.0, 60.2, 59.1, 56.7, 54.2, 57.7, 51.5, 5...
$ white_num <int> 5, 4, 7, 7, 6, 4, 4, 39, 19, 22, 32, 35, 31, 28, ...
$ white_per <dbl> 1.8, 1.6, 2.7, 2.8, 2.9, 2.0, 2.1, 9.7, 6.1, 6.5,...
$ male_num <int> 158, 140, 143, 149, 124, 113, 97, 214, 157, 162, ...
$ male_per <dbl> 56.2, 57.6, 54.8, 59.1, 59.6, 55.7, 51.3, 53.2, 5...
$ female_num <int> 123, 103, 118, 103, 84, 90, 92, 188, 155, 176, 16...
$ female_per <dbl> 43.8, 42.4, 45.2, 40.9, 40.4, 44.3, 48.7, 46.8, 4...
We can see that there are 10075 rows. Each school has 7 entries (schoolyear 20052006 to 20112012) and it contains all grade levels. We are only interested in schoolyear 20112012 and in the demographics percentages.
> Demotib <- Demotib %>% filter(schoolyear == 20112012) %>%
+ select(DBN,
+ AsianPct = asian_per,
+ BlackPct = black_per,
+ HispanicPct = hispanic_per,
+ WhitePct = white_per,
+ MalePct = male_per,
+ FemalePct = female_per)
>
> Demotib %>% glimpseRows: 1,509
Columns: 7
$ DBN <chr> "01M015", "01M019", "01M020", "01M034", "01M063", "01M0...
$ AsianPct <dbl> 6.3, 15.5, 30.4, 5.5, 5.1, 8.0, 12.5, 28.2, 18.3, 4.2, ...
$ BlackPct <dbl> 33.3, 24.7, 8.8, 22.4, 23.3, 23.5, 13.2, 20.1, 24.8, 16...
$ HispanicPct <dbl> 57.7, 48.2, 57.0, 68.6, 62.5, 59.6, 44.1, 49.1, 53.5, 7...
$ WhitePct <dbl> 2.1, 8.5, 2.6, 2.0, 8.5, 7.4, 28.2, 2.2, 3.0, 1.7, 2.8,...
$ MalePct <dbl> 51.3, 44.8, 52.7, 50.9, 55.1, 56.8, 49.8, 48.0, 48.3, 5...
$ FemalePct <dbl> 48.7, 55.2, 47.3, 49.1, 44.9, 43.2, 50.2, 52.0, 51.7, 4...
I then joined the two data sets by DBN value and removed 9 NA values.
> SAT <- SATtib %>% left_join(Demotib, by="DBN") %>%
+ relocate(Borough, .after = DBN) %>% na.omit()
>
> SAT %>% glimpse()Rows: 412
Columns: 12
$ DBN <chr> "01M292", "01M448", "01M450", "01M458", "01M509", "...
$ Borough <chr> "Manhattan", "Manhattan", "Manhattan", "Manhattan",...
$ NumTestTakers <int> 29, 91, 70, 7, 44, 112, 159, 18, 130, 16, 62, 53, 5...
$ ReadingAvgScore <int> 355, 383, 377, 414, 390, 332, 522, 417, 624, 395, 4...
$ MathAvgScore <int> 404, 423, 402, 401, 433, 557, 574, 418, 604, 400, 3...
$ WritingAvgScore <int> 363, 366, 370, 359, 384, 316, 525, 411, 628, 387, 3...
$ AsianPct <dbl> 14.0, 29.2, 9.7, 2.2, 9.3, 84.7, 27.8, 0.5, 15.1, 1...
$ BlackPct <dbl> 29.1, 22.6, 23.9, 34.4, 31.6, 5.2, 11.7, 45.4, 15.1...
$ HispanicPct <dbl> 53.8, 45.9, 55.4, 59.4, 56.9, 8.9, 14.2, 49.5, 18.2...
$ WhitePct <dbl> 1.7, 2.3, 10.4, 3.6, 1.6, 0.9, 44.9, 4.1, 49.8, 6.3...
$ MalePct <dbl> 61.4, 57.4, 54.7, 43.3, 46.3, 53.7, 49.2, 39.9, 31....
$ FemalePct <dbl> 38.6, 42.6, 45.3, 56.7, 53.7, 46.3, 50.8, 60.1, 68....
The data set contains 412 NYC schools with the following variables:
| Variable | Definition (all data is for the year 2012) |
|---|---|
| DBN | District Borough Number - school identifier |
| Borough | New York City borough name |
| NumTestTakers | Number of SAT test takers |
| ReadingAvgScore | Average SAT reading score |
| MathAvgScore | Average SAT math score |
| WritingAvgScore | Average SAT writing score |
| AsianPct | Percentage of school students who were Asian |
| BlackPct | Percentage of school students who were Black |
| HispanicPct | Percentage of school students who were Hispanic |
| WhitePct | Percentage of school students who were White |
| MalePct | Percentage of school students who were Male |
| FemalePct | Percentage of school students who were Female |
Below is a small subset of the data:
> kable(SAT[1:10,c(1,3:12)])%>%kable_styling(bootstrap_options = c("striped","condensed"),
+ full_width = F, font_size = 12,position="left")%>%
+ row_spec(0,background="lightblue")| DBN | NumTestTakers | ReadingAvgScore | MathAvgScore | WritingAvgScore | AsianPct | BlackPct | HispanicPct | WhitePct | MalePct | FemalePct |
|---|---|---|---|---|---|---|---|---|---|---|
| 01M292 | 29 | 355 | 404 | 363 | 14.0 | 29.1 | 53.8 | 1.7 | 61.4 | 38.6 |
| 01M448 | 91 | 383 | 423 | 366 | 29.2 | 22.6 | 45.9 | 2.3 | 57.4 | 42.6 |
| 01M450 | 70 | 377 | 402 | 370 | 9.7 | 23.9 | 55.4 | 10.4 | 54.7 | 45.3 |
| 01M458 | 7 | 414 | 401 | 359 | 2.2 | 34.4 | 59.4 | 3.6 | 43.3 | 56.7 |
| 01M509 | 44 | 390 | 433 | 384 | 9.3 | 31.6 | 56.9 | 1.6 | 46.3 | 53.7 |
| 01M515 | 112 | 332 | 557 | 316 | 84.7 | 5.2 | 8.9 | 0.9 | 53.7 | 46.3 |
| 01M539 | 159 | 522 | 574 | 525 | 27.8 | 11.7 | 14.2 | 44.9 | 49.2 | 50.8 |
| 01M650 | 18 | 417 | 418 | 411 | 0.5 | 45.4 | 49.5 | 4.1 | 39.9 | 60.1 |
| 01M696 | 130 | 624 | 604 | 628 | 15.1 | 15.1 | 18.2 | 49.8 | 31.3 | 68.7 |
| 02M047 | 16 | 395 | 400 | 387 | 1.7 | 32.2 | 59.2 | 6.3 | 42.5 | 57.5 |
NumTestTakers ReadingAvgScore MathAvgScore WritingAvgScore
count 412.000000 412.000000 412.000000 412.000000
mean 112.466019 400.968447 413.842233 394.521845
std 156.535939 57.036135 65.046956 58.829309
min 6.000000 279.000000 312.000000 286.000000
25% 43.000000 368.000000 371.750000 360.750000
50% 63.000000 391.000000 395.000000 382.000000
75% 98.250000 416.250000 437.000000 411.000000
max 1277.000000 679.000000 735.000000 682.000000
AsianPct BlackPct HispanicPct WhitePct MalePct FemalePct
count 412.000000 412.000000 412.000000 412.00000 412.000000 412.000000
mean 9.462621 39.139320 42.876456 7.71335 49.786165 50.212864
std 14.812181 26.205454 24.525417 13.38467 13.224766 13.224115
min 0.000000 0.000000 2.400000 0.00000 0.000000 0.000000
25% 1.000000 20.050000 18.825000 0.90000 44.075000 45.100000
50% 2.950000 33.150000 43.650000 1.80000 49.800000 50.200000
75% 9.225000 56.725000 62.500000 7.42500 54.900000 55.925000
max 89.500000 95.700000 100.000000 82.10000 100.000000 100.000000
The test takers are almost evenly split between male and female.
The average scores are all around 400.
The number of test takers has the greatest range and standard deviation.
The race categories also vary widely and there is a surprisingly low white and asian percentage.
> SAT %>% mutate(RacePct = rowSums(.[,7:10])) %>%
+ ggplot(mapping=aes(x=RacePct)) + geom_area(aes(y=..count..,),stat="bin", fill="cyan3")+
+ geom_vline(aes(xintercept=mean(RacePct)),linetype="dashed", size=1)+
+ scale_x_continuous(breaks=seq(from=88, to=101, by=1))+
+ scale_y_continuous(breaks=seq(from=0, to=160, by=20))+
+ labs(x="Total Race Percentage")We can see that the total race percentage does not always equal 100%. Values range from a little over 88% to a little over 100%, with a mean just over 99%. That could be the result of recording errors or from the difficulty in gathering the data, especially when many individuals belong to more than one category.
Math and Scores
> #Python
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ sns.pairplot(SATpy.iloc[:,[1,4,3,5]],
+ hue='Borough', palette='hot');
+ plt.show()We can see a very strong linear relationship between the average reading, writing, and math SAT scores, as expected. The score distributions are all right skewed.
Math and Races
There appears to be a positive linear relationship between the percentage of Asian and White students and the average math SAT score, although there’s a modest level of heteroskedasticity given the concentration of datapoints on the left.
We can also visualize a negative relationship between the percentage of Black and Hispanic students and the average math SAT score, but there is a greater level of heteroskedasticity present and the trend is not as clear.
Math and Gender
The relationship between gender and math scores does not appear to be linear.
I chose to take a closer look at how the percentage of Asian students relates to the average SAT math score, which we can visualize with a scatterplot.
> library(tidyverse)
> library(ggthemes)
> ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=MathAvgScore),
+ color="blue", size=2)+
+ scale_x_continuous(breaks=seq(from=0, to=100, by=10))+
+ labs(x="Percentage of Asian Students", y="Average SAT Math Score")+theme_light()Heteroscedasticity - We can see that there is a moderate level of heteroscedasticity present. There is more variability in the average SAT math score as we move away from the cluster of values on the left.
Curvature - Fortunately there is no evidence of curvature in the scatterplot.
Leverage Points - The points on the right are all far from the clustered values on the left \((\bar{x})\) and would be considered leverage points.
Outliers - A few points represent either a large avg. SAT math score (near 700) or a small avg. SAT math score (near 300), but I don’t think they would be considered outliers. They are not extreme relative to other datapoints.
Influential Points - Although there are leverage points they would not be considered influential. Their removal would not dramatically alter the slope of the \(\hat{y}\) line.
Although sometimes transformations can help with heteroscedasticity, I did not find an adequate solution in this case. Some examples are presented below.
LOG Transformations
> library(gridExtra)
>
> ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=log(MathAvgScore)), color="purple", size=2)+
+ labs(x="LOG Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light() ->p1
>
> ggplot(data=SAT)+geom_point(mapping=aes(x=log(AsianPct), y=MathAvgScore), color="purple", size=2)+
+ labs(x="Log Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2
>
> grid.arrange(p1, p2, ncol = 2)> ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=log(MathAvgScore)), color="purple", size=2)+
+ labs(x="Percentage of Asian Students", y="LOG Average SAT Math Score")+theme_light()
We can see that LOG transformations of the x values, the y values, or both did not solve the problem of heteroscedasticity, and in some cases made it worse.
Square Root Transformations
> ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=sqrt(MathAvgScore)), color="green4", size=2)+
+ labs(x="SQRT Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light() ->p1
>
> ggplot(data=SAT)+geom_point(mapping=aes(x=sqrt(AsianPct), y=MathAvgScore), color="green4", size=2)+
+ labs(x="SQRT Percentage of Asian Students", y="Average SAT Math Score")+theme_light() ->p2
>
> grid.arrange(p1, p2, ncol = 2)> ggplot(data=SAT)+geom_point(mapping=aes(x=AsianPct, y=sqrt(MathAvgScore)), color="green4", size=2)+
+ labs(x="Percentage of Asian Students", y="SQRT Average SAT Math Score")+theme_light()
Similarly, square root transformations were not helpful.
\[Y_i = \beta_0+\beta_1x_i+\epsilon\]
The model does not view \(Y_i\) as a datapoint. The model views \(Y_i\) as a sub-population of values that share the same x value. For this to be possible the model assumes that we could get many fresh samples. In this regression \(Y_i\) represents a sub-population of average SAT math scores for a given percentage of Asian students, \(x_i\) .
Like any sub-population, \(Y_i\) has a population average and a population variance. The first component, \(\beta_0+\beta_1x_1\), supplies the sub-population average. \(\beta_0\) represents the model intercept and \(\beta_1\) represents the model slope.
The second component, \(\epsilon\), supplies the variability. We can think of \(\epsilon\) as the name of a bell curved population of \(Y\) values that happens to be centered at zero. The model assumes that this bell curve is added to each sub-population average and that every bell curve has the same standard deviation.
Call:
lm(formula = MathAvgScore ~ AsianPct, data = SAT)
Residuals:
Min 1Q Median 3Q Max
-164.163 -25.101 -6.255 19.754 212.155
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 385.0746 2.7485 140.10 <2e-16 ***
AsianPct 3.0401 0.1565 19.43 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 47 on 410 degrees of freedom
Multiple R-squared: 0.4793, Adjusted R-squared: 0.478
F-statistic: 377.3 on 1 and 410 DF, p-value: < 2.2e-16
> temp_var <- predict(lm.fit, interval = "prediction")
> new_df <-cbind(SAT, temp_var)
> ggplot(new_df, aes(AsianPct,MathAvgScore))+geom_point()+
+ geom_line(aes(y=lwr),color="red",linetype="dashed", size=1)+
+ geom_line(aes(y=upr),color="red",linetype="dashed", size=1)+
+ geom_smooth(method = lm, se=TRUE)+
+ ggtitle("Fit Plot \n Blue Band = 95% Confidence Interval \n Red Lines = 95% Prediction Interval")+
+ theme(plot.title = element_text(hjust = 0.5))If we examine the regression plot above we can see that there are several linear lines presented. The \(\hat{y}\) line in the center provides us with a mean value of y given x. Surrounding the \(\hat{y}\) line we can see a shaded band, which represents the 95% confidence interval. The upper and lower values give us a range in which we are 95% confident that the population mean will fall for the same x value. The plot’s outer dotted lines represent the prediction interval, which is useful if we wanted to evaluate another sample with the same x value. It allows us to say that we are 95% confident that the value of y (not the mean value of y) will fall within that wider range.
The t-tests
In the regression model we can use a t-test to evaluate the predictive value of \(\beta_1\).
\(H_0:\beta_1=0\)
\(H_0:\beta_1\neq0\)
To evaluate the null hypothesis we calculate the t-statistic and compare it to the critical points from a t-density function with n-2 degrees of freedom.
\[t=\frac{\hat\beta_1-0}{SE(\hat\beta_1)}=\frac{3.0401}{0.1565}=19.43\] The t-critical value with 410 degrees of freedom and \(\alpha=0.05\) is 1.966
We reject the null hypothesis if the |t-stat| is greater than the t-critical value. Therefore, since the t-stat of 19.43 is greater than 1.966, we can reject the null hypothesis.
We are evaluating the value of \(\beta_1\), but we only have a value for \(b_1\). However, \(b_1\) is an unbiased estimator of \(\beta_1\).
Imagine we were evaluating the average SAT math score for the population of all schools in the US (let’s assume that there are 100,000 schools). If we were to take a sample of 100 schools we could calculate a single sample slope. Now imagine we took all possible samples from the population. Without replacement that would result in 100,000 C100 sample points in our sample space, which is an extremely large number. We could then calculate the slope for each sample in the sample space, resulting in a daughter population of sample slopes. The grand average of that daughter population would equal \(\beta_1\), so we can say that is an unbiased estimator. It is unbiased because the methodology is unbiased, not because ever b??? value would equal \(\beta_1\). The method is “on target” because the daughter population average is centered at \(\beta_1\).
The y-hat equation
The equation for the fitted model is as follows:
\[\hat{y_i} = 385.075 + 3.04(x_i)\]
Similar to how \(b_1\) is an unbiased estimator of \(\beta_1\), \(\hat{y_i}\) is an unbiased estimator of \(Y_i\). If we evaluated all possible samples (of the same size) from the population, we could calculate a daughter population of sample slopes and \(\hat{y_i}\) values. The grand average of the daughter population of \(\hat{y_i}\) values would equal \(Y_i\). We therefore say that it is an unbiased estimator.
We can use Matrix algebra to derive the regression equation, and some of the calculations are included below. The calculations are based on the following subset of data:
> SATMat <- SAT[c(42,174,405,375,218,96,347,180),c(4,6,5)]
> SATMat <- as.data.frame(SATMat)
> kable(SATMat)%>%kable_styling(position="left",full_width = F)| ReadingAvgScore | WritingAvgScore | MathAvgScore |
|---|---|---|
| 679 | 682 | 735 |
| 632 | 649 | 688 |
| 635 | 636 | 682 |
| 612 | 596 | 660 |
| 587 | 587 | 659 |
| 605 | 588 | 654 |
| 621 | 638 | 651 |
| 636 | 636 | 648 |
The Average SAT Reading score will be used as the X matrix:
> Xmat=cbind(rep(1,8),SATMat[,1])
> Xmatd=as.data.frame(Xmat)
> kable(Xmatd, col.names = NULL)%>%kable_styling(full_width = F, position="left")| 1 | 679 |
| 1 | 632 |
| 1 | 635 |
| 1 | 612 |
| 1 | 587 |
| 1 | 605 |
| 1 | 621 |
| 1 | 636 |
The Average SAT Math Score represents the Y vector:
> Y=SATMat[,3]
> Y=as.matrix(Y)
> kable(SATMat[,3], col.names = NULL)%>%kable_styling(full_width = F, position="left")| 735 |
| 688 |
| 682 |
| 660 |
| 659 |
| 654 |
| 651 |
| 648 |
Matrix operations:
\(X^{'}X\)
> ab = as.data.frame(t(Xmat)%*%Xmat)
> kable(ab, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+ column_spec(1:2,background="yellow")| 8 | 5007 |
| 5007 | 3138965 |
\((X^{'}X)^{-1}\)
> ac = as.data.frame(solve(t(Xmat)%*%Xmat))
> kable(ac, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+ column_spec(1:2,background="orange")| 75.3273260 | -0.1201555 |
| -0.1201555 | 0.0001920 |
\(X^{'}\)
> ad=as.data.frame(t(Xmat))
> kable(ad, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+ column_spec(1:8,background="orange")| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 679 | 632 | 635 | 612 | 587 | 605 | 621 | 636 |
\(X(X^{'}X)^{-1}X^{'}\)
> ae=as.data.frame(Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))
> kable(ae, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+ column_spec(1:8,background="orchid")| 0.6668187 | 0.1874685 | 0.2180653 | -0.0165103 | -0.2714838 | -0.0879029 | 0.0752802 | 0.2282643 |
| 0.1874685 | 0.1322023 | 0.1357299 | 0.1086847 | 0.0792878 | 0.1004536 | 0.1192676 | 0.1369058 |
| 0.2180653 | 0.1357299 | 0.1409853 | 0.1006935 | 0.0568981 | 0.0884308 | 0.1164599 | 0.1427372 |
| -0.0165103 | 0.1086847 | 0.1006935 | 0.1619592 | 0.2285522 | 0.1806052 | 0.1379856 | 0.0980298 |
| -0.2714838 | 0.0792878 | 0.0568981 | 0.2285522 | 0.4151328 | 0.2807948 | 0.1613832 | 0.0494349 |
| -0.0879029 | 0.1004536 | 0.0884308 | 0.1806052 | 0.2807948 | 0.2086583 | 0.1445370 | 0.0844232 |
| 0.0752802 | 0.1192676 | 0.1164599 | 0.1379856 | 0.1613832 | 0.1445370 | 0.1295625 | 0.1155240 |
| 0.2282643 | 0.1369058 | 0.1427372 | 0.0980298 | 0.0494349 | 0.0844232 | 0.1155240 | 0.1446810 |
\(\hat{y} = X(X^{'}X)^{-1}X^{'}y\)
> af=as.data.frame((Xmat%*%solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
> kable(af, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+ column_spec(1,background="lightblue")| 717.4402 |
| 677.3496 |
| 679.9085 |
| 660.2897 |
| 638.9650 |
| 654.3188 |
| 667.9667 |
| 680.7615 |
b-vectors \(= (X^{'}X)^{-1}X^{'}y\)
> ag=as.data.frame((solve(t(Xmat)%*%Xmat)%*%t(Xmat))%*%Y)
> kable(ag, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+ column_spec(1,background="yellow")| 138.2590771 |
| 0.8529913 |
The matrix calculations match the regression output
Call:
lm(formula = MathAvgScore ~ ReadingAvgScore, data = SATMat)
Coefficients:
(Intercept) ReadingAvgScore
138.259 0.853
We can compute the value of \(h_{2,3}\) in the hat matrix by the following formula:
| x-value | \((x-\bar{x})^2\) |
|---|---|
| 679.00 | 2,822.27 |
| 632.00 | 37.52 |
| 635.00 | 83.27 |
| 612.00 | 192.52 |
| 587.00 | 1,511.27 |
| 605.00 | 435.77 |
| 621.00 | 23.77 |
| 636.00 | 102.52 |
\(\bar{x}=625.88\)
\(ss_x=5208.88\)
\(h_{2,3}=\left(\frac{1}{n}\right)+(x_2-\bar{x})(x_3-\bar{x})/ss_x\)
\(h_{2,3}=\left(\frac{1}{8}\right)+(632-625.88)(635-625.88)/5208.88 = 0.1357\)
The calculation matches the table.
Using two regressors:
> Xmat=cbind(rep(1,8),SATMat[,1],SATMat[,2])
> Xmate=as.data.frame(cbind(rep(1,8),SATMat[,1],SATMat[,2]))
> kable(Xmate, col.names = NULL)%>%kable_styling(full_width = F, position="left")| 1 | 679 | 682 |
| 1 | 632 | 649 |
| 1 | 635 | 636 |
| 1 | 612 | 596 |
| 1 | 587 | 587 |
| 1 | 605 | 588 |
| 1 | 621 | 638 |
| 1 | 636 | 636 |
Using R Matrix operations to fit the model:
> Rhat = as.data.frame(solve(t(Xmat)%*%Xmat)%*%t(Xmat)%*%Y)
> kable(Rhat, col.names = NULL)%>%kable_styling(full_width = F, position="left")%>%
+ column_spec(1,background="yellow")| 134.4315052 |
| 0.9009849 |
| -0.0418363 |
Call:
lm(formula = MathAvgScore ~ ReadingAvgScore + WritingAvgScore,
data = SATMat)
Coefficients:
(Intercept) ReadingAvgScore WritingAvgScore
134.43151 0.90098 -0.04184
The regression output matches the matrix operations.
Before performing model selection I have renamed the variables for ease of use.
> SAT2 <- SAT %>% rename(NT = NumTestTakers,
+ RS = ReadingAvgScore,
+ MS = MathAvgScore,
+ WS = WritingAvgScore,
+ AP = AsianPct,
+ BP = BlackPct,
+ HP = HispanicPct,
+ WP = WhitePct,
+ MP = MalePct,
+ FP = FemalePct)| Old Variable | New Variable | Description |
|---|---|---|
| MathAvgScore | MS | 2012 Average SAT math score for NYC schools |
| ReadingAvgScore | RS | 2012 Average SAT reading score for NYC schools |
| WritingAvgScore | WS | 2012 Average SAT writing score for NYC schools |
| NumTestTakers | NT | Number of test takers |
| AsianPct | AP | Percentage of Asian students in the school |
| BlackPct | BP | Percentage of Black students in the school |
| HispanicPct | HP | Percentage of Hispanic students in the school |
| WhitePct | WP | Percentage of White students in the school |
| MalePct | MP | Percentage of Male students in the school |
| FemalePct | FP | Percentage of Female students in the school |
Next, I used stepwise regression (iteratively adding and removing predictors until the best model is found).
> library(leaps)
> nullmod = lm(MS~1, data=SAT2) # Intercept only model
> fullmod = lm(MS~NT+RS+WS+AP+BP+HP+WP+MP+FP, data=SAT2) # all parameters
> stepwisereg = step(nullmod, scope=list(lower=nullmod, upper=fullmod), direction='both')Start: AIC=3441.29
MS ~ 1
Df Sum of Sq RSS AIC
+ WS 1 1394931 344054 2775.7
+ RS 1 1353411 385574 2822.7
+ AP 1 833423 905562 3174.5
+ WP 1 672844 1066140 3241.7
+ NT 1 486543 1252442 3308.1
+ BP 1 261348 1477637 3376.2
+ HP 1 213384 1525601 3389.4
<none> 1738985 3441.3
+ FP 1 1116 1737869 3443.0
+ MP 1 1111 1737874 3443.0
Step: AIC=2775.74
MS ~ WS
Df Sum of Sq RSS AIC
+ AP 1 179901 164152 2472.9
+ NT 1 37466 306588 2730.2
+ BP 1 35839 308215 2732.4
+ MP 1 25604 318450 2745.9
+ FP 1 25600 318454 2745.9
+ WP 1 4395 339658 2772.4
+ HP 1 3888 340166 2773.1
+ RS 1 3768 340286 2773.2
<none> 344054 2775.7
- WS 1 1394931 1738985 3441.3
Step: AIC=2472.86
MS ~ WS + AP
Df Sum of Sq RSS AIC
+ MP 1 11190 152963 2445.8
+ FP 1 11189 152964 2445.8
+ RS 1 9315 154837 2450.8
+ WP 1 1836 162316 2470.2
+ BP 1 1799 162353 2470.3
+ NT 1 1429 162723 2471.3
<none> 164152 2472.9
+ HP 1 647 163505 2473.2
- AP 1 179901 344054 2775.7
- WS 1 741410 905562 3174.5
Step: AIC=2445.77
MS ~ WS + AP + MP
Df Sum of Sq RSS AIC
+ RS 1 6431 146531 2430.1
+ BP 1 1854 151108 2442.7
+ WP 1 1476 151487 2443.8
+ NT 1 976 151987 2445.1
+ HP 1 791 152172 2445.6
<none> 152963 2445.8
+ FP 1 33 152930 2447.7
- MP 1 11190 164152 2472.9
- AP 1 165487 318450 2745.9
- WS 1 747578 900540 3174.2
Step: AIC=2430.08
MS ~ WS + AP + MP + RS
Df Sum of Sq RSS AIC
+ BP 1 3517 143014 2422.1
+ HP 1 1869 144663 2426.8
+ WP 1 1789 144743 2427.0
+ NT 1 1035 145496 2429.2
<none> 146531 2430.1
+ FP 1 69 146463 2431.9
- RS 1 6431 152963 2445.8
- MP 1 8306 154837 2450.8
- WS 1 20049 166581 2480.9
- AP 1 170512 317044 2746.1
Step: AIC=2422.07
MS ~ WS + AP + MP + RS + BP
Df Sum of Sq RSS AIC
+ NT 1 787 142228 2421.8
<none> 143014 2422.1
+ WP 1 602 142412 2422.3
+ HP 1 523 142491 2422.6
+ FP 1 88 142926 2423.8
- BP 1 3517 146531 2430.1
- MP 1 8027 151041 2442.6
- RS 1 8094 151108 2442.8
- WS 1 15913 158927 2463.5
- AP 1 135819 278834 2695.2
Step: AIC=2421.79
MS ~ WS + AP + MP + RS + BP + NT
Df Sum of Sq RSS AIC
<none> 142228 2421.8
- NT 1 787 143014 2422.1
+ WP 1 432 141796 2422.5
+ HP 1 382 141845 2422.7
+ FP 1 92 142136 2423.5
- BP 1 3269 145496 2429.2
- MP 1 7688 149915 2441.5
- RS 1 8084 150312 2442.6
- WS 1 15337 157565 2462.0
- AP 1 115280 257508 2664.4
Call:
lm(formula = MS ~ WS + AP + MP + RS + BP + NT, data = SAT2)
Residuals:
Min 1Q Median 3Q Max
-65.209 -9.906 -0.890 10.668 106.919
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 57.161880 8.876950 6.439 3.40e-10 ***
WS 0.480869 0.072765 6.609 1.23e-10 ***
AP 1.444475 0.079726 18.118 < 2e-16 ***
MP 0.339625 0.072588 4.679 3.94e-06 ***
RS 0.349090 0.072760 4.798 2.26e-06 ***
BP -0.122270 0.040078 -3.051 0.00243 **
NT 0.010680 0.007136 1.497 0.13527
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.74 on 405 degrees of freedom
Multiple R-squared: 0.9182, Adjusted R-squared: 0.917
F-statistic: 757.8 on 6 and 405 DF, p-value: < 2.2e-16
The resulting best model had an AIC value of 2421.79 and and adjusted R-squared value of 0.917
We can see that there are some outliers and high leverage points, but the number is not excessive.
There is some heteroscedasticity of the residuals, mostly because the data is clustered around the averages at the left of the plots.
> #plot residuals by index
> res = residuals(stepwisereg)
> res2 = as.data.frame(res)
> resid = as.numeric(rownames(res2))
> ggplot(res2)+geom_point(aes(x=resid,y=res), color="black",
+ fill="gold", pch=21, size=2)+
+ labs(y="Residuals", x="Index")> #plot residual histogram
> ggplot(res2)+geom_histogram(aes(x=res), binwidth = 5,
+ fill="gold", color="red")+
+ scale_x_continuous(breaks = seq(from=-70, to=110, by=10))+
+ theme(axis.text.x = element_text(angle = 90))+
+ labs(x="residuals")> #display the corresponding rows where the
> #absolute value of residuals is >50
> kable(SAT2[abs(res)>50,3:12])%>%kable_styling(
+ bootstrap_options=c("striped","condensed", full_width=F))| NT | RS | MS | WS | AP | BP | HP | WP | MP | FP |
|---|---|---|---|---|---|---|---|---|---|
| 112 | 332 | 557 | 316 | 84.7 | 5.2 | 8.9 | 0.9 | 53.7 | 46.3 |
| 79 | 319 | 512 | 357 | 62.4 | 7.2 | 26.4 | 4.0 | 51.2 | 48.8 |
| 121 | 345 | 517 | 343 | 35.3 | 31.0 | 28.3 | 5.0 | 56.1 | 43.9 |
| 12 | 416 | 403 | 381 | 48.7 | 33.2 | 16.6 | 1.5 | 47.7 | 52.3 |
| 9 | 439 | 374 | 418 | 2.7 | 88.9 | 7.1 | 0.4 | 99.6 | 0.4 |
| 20 | 441 | 499 | 413 | 7.5 | 58.2 | 26.3 | 4.5 | 79.4 | 20.6 |
| 143 | 323 | 475 | 329 | 48.3 | 5.6 | 43.5 | 2.6 | 53.2 | 46.8 |
When we examine the most extreme residuals it is usally possible to pinpoint the cause. For example the Asian Percentage of 84.7% is unusually high.
The Variance Inflation Factor is a measurement that is used to detect collinearity among the regressors. It is calculated by regressing each variable against the other variables in the model, computing the \(R^2\), subtracting the resulting \(R^2\) from 1, and taking the inverse. \(VIF_K=(1-R^2_K)^{-1}\)
A high level of collinearity could make the model unreliable. It can increase the standard error of the estimated coefficients making the t-statistics faulty. It can also cause the individual regressor coefficients to change erratically.
WS AP MP RS BP NT
21.445989 1.632100 1.078509 20.155587 1.290958 1.460455
There are no extreme results.
RS and WS have the highest values, likely because the test scores have a strong linear relationship.
Cook’s distance measures the influence of an observation on the regression model. An observation with a high Cook’s distance would meaningfully change the slope of the regression line if included. It can be used to detect leverage points.
It can be computed by:
\[D_i=\frac{ \sum_{j=1}^{n} (\hat{y}_j - \hat{y}_{j(i)})^2 }{p(MSE)}\]
\(D_i\) is Cook’s distance for the i^th observation
\(\hat{y}_j\) is the full regression model’s estimate for the j^th observation
\(\hat{y}_{j(i)}\) is the regression model’s (fitted without the i^th observation) estimate for the j^th observation
\(p\) is the number of regressors
\(MSE\) is the mean squared error of the full regression model
> #plot Cook's D
> cooks <- cooks.distance(stepwisereg)
> cooks <- as.data.frame(cooks)
> cooksid <- as.numeric(rownames(cooks))
> ggplot(cooks)+geom_point(aes(x=cooksid,y=cooks), color="black",
+ fill="gold", pch=21, size=2)+
+ labs(y="Cook's D", x="Index")> #display the corresponding rows where the value of cook's D is >0.05
> kable(SAT2[cooks>.05,3:12])%>%kable_styling(bootstrap_options=c("striped","condensed", full_width=F))| NT | RS | MS | WS | AP | BP | HP | WP | MP | FP |
|---|---|---|---|---|---|---|---|---|---|
| 112 | 332 | 557 | 316 | 84.7 | 5.2 | 8.9 | 0.9 | 53.7 | 46.3 |
| 79 | 319 | 512 | 357 | 62.4 | 7.2 | 26.4 | 4.0 | 51.2 | 48.8 |
| 121 | 345 | 517 | 343 | 35.3 | 31.0 | 28.3 | 5.0 | 56.1 | 43.9 |
| 12 | 416 | 403 | 381 | 48.7 | 33.2 | 16.6 | 1.5 | 47.7 | 52.3 |
| 9 | 439 | 374 | 418 | 2.7 | 88.9 | 7.1 | 0.4 | 99.6 | 0.4 |
| 143 | 323 | 475 | 329 | 48.3 | 5.6 | 43.5 | 2.6 | 53.2 | 46.8 |
Most large Cook’s D values appear to have low math scores relative to a high percentage of Asian students.
The regression model provides us with a \(\hat{y}\) equation that is designed to be optimal for our dataset, but the true value lies in how well it will work with new data. Since fresh data is often unavailable we can simply split our existing dataset into two pieces. One piece will be used to calculate the regression equation, and the other piece will provide us with our out-of-sample values.
We start by randomizing the data and selecting our “training sample,” which is used to calculate the \(\hat{y}\) equation. We then use the remaining rows as our “validation” sample. We can evaluate the RMSE value to get a sense of how well the model might work with out-of-sample data. When comparing two models, the one that produces the lowest test sample RMSE is the preferred model.
> library(caret)
> set.seed(1234)
> # Split the data into training and test set
> training.samples <- SAT2$MS %>% createDataPartition(p = 0.7, list = FALSE)
> train.data <- SAT2[training.samples, ]
> test.data <- SAT2[-training.samples, ]
> # Build the model (using selected model)
> model <- lm(MS~WS+AP+MP+RS+BP+NT, data = train.data)
> # Make predictions and compute the R2, RMSE and MAE
> predictions <- model %>% predict(test.data)
> data.frame( R2 = R2(predictions, test.data$MS),
+ RMSE = RMSE(predictions, test.data$MS),
+ MAE = MAE(predictions, test.data$MS)) R2 RMSE MAE
1 0.8967801 20.36794 13.84854
> # Define training control
> train.control <- trainControl(method = "LOOCV")
> # Train the model
> model <- train(MS~WS+AP+MP+RS+BP+NT, data = SAT2,
+ method = "lm",
+ trControl = train.control)
> # Summarize the results
> print(model)Linear Regression
412 samples
6 predictor
No pre-processing
Resampling: Leave-One-Out Cross-Validation
Summary of sample sizes: 411, 411, 411, 411, 411, 411, ...
Resampling results:
RMSE Rsquared MAE
19.15126 0.9131149 13.83611
Tuning parameter 'intercept' was held constant at a value of TRUE
This is similar to leave one out, except the data is split into k-subsets. 1 subset is used for testing and the remainder are used for training. The process is repeated until all the subsets are tested.
> # Define training control
> set.seed(1234)
> train.control <- trainControl(method = "cv", number = 10)
> # Train the model
> model <- train(MS~WS+AP+MP+RS+BP+NT, data = SAT2,
+ method = "lm",
+ trControl = train.control)
> # Summarize the results
> print(model)Linear Regression
412 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 371, 372, 369, 370, 371, 371, ...
Resampling results:
RMSE Rsquared MAE
18.88628 0.9102345 13.82922
Tuning parameter 'intercept' was held constant at a value of TRUE
We can also repeat k-fold cross validation multiple times
> # Define training control
> set.seed(1234)
> train.control <- trainControl(method = "repeatedcv",
+ number = 10, repeats = 3)
> # Train the model
> model <- train(MS~WS+AP+MP+RS+BP+NT, data = SAT2,
+ method = "lm",
+ trControl = train.control)
> # Summarize the results
> print(model)Linear Regression
412 samples
6 predictor
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 371, 372, 369, 370, 371, 371, ...
Resampling results:
RMSE Rsquared MAE
18.82951 0.9110351 13.84007
Tuning parameter 'intercept' was held constant at a value of TRUE
> from sklearn.linear_model import LinearRegression
+ from sklearn.feature_selection import RFE
+ import statsmodels.api as smRecursive Feature Elimination:
Recursive Feature Elimination (RFE) selects features by considering a smaller and smaller set of regressors. The starting point is the full set of regressors and variables are recursively pruned from the initial set. The procedure is repeated until a desired set of features remain. That number can either be a priori specified, or can be found using cross validation.
> SAT2 = r.SAT2
+ X = SAT2.drop(columns=['DBN','Borough','MS'])
+ y = SAT2.iloc[:,4]
+ names=pd.DataFrame(X.columns)> #This is to select 6 variables:
+ #can be changed and checked in model for accuracy
+ rfe_mod = RFE(lin_reg, 6, step=1)
+ myvalues=rfe_mod.fit(X,y) #to fit
+ myvalues.support_ #The mask of selected features.array([False, False, True, True, False, True, True, True, True])
> myvalues.ranking_
+ #The feature ranking, such that ranking_[i] corresponds to the
+ #ranking position of the i-th feature. Selected
+ #(i.e., estimated best) features are assigned rank 1.array([4, 3, 1, 1, 2, 1, 1, 1, 1])
> #Concat and name columns
+ ranked=pd.concat([names,rankings], axis=1)
+ ranked.columns = ["Feature", "Rank"]
+ ranked Feature Rank
0 NT 4
1 RS 3
2 WS 1
3 AP 1
4 BP 2
5 HP 1
6 WP 1
7 MP 1
8 FP 1
> #Select most important (Only 1's)
+ most_important = ranked.loc[ranked['Rank'] ==1]
+ print(most_important) Feature Rank
2 WS 1
3 AP 1
5 HP 1
6 WP 1
7 MP 1
8 FP 1
6
The six selected variables are different from the one’s selected by stepwise regression using R.
| Method | Language | Variables |
|---|---|---|
| Stepwise Regression | R | WS, AP, MP, RS, BP, NT |
| Recursive Feature Elimination | Python | WS, AP, MP, HP, WP, FP |
Model performance was similar, with an adjusted \(R^2\) of 0.912 (Python) versus 0.917 (R).
Lasso (L1) Based Feature Selection:
Several models are designed to reduce the number of features. One of the shrinkage methods - Lasso - for example reduces several coefficients to zero leaving only features that are truly important.
Sci-Kit offers SelectFromModel as a tool to run embedded models for feature selection. The module makes use of a threshold parameter, which can be either user specified or heuristically set based on median or mean. Below, the code uses Lasso (L1 penalty) to find features for inclusion. I set the threshold to 0.25, which results in three features being selected.
> # Set a minimum threshold of 0.25
+ sfm = SelectFromModel(estimator, threshold=0.25,
+ prefit=False, norm_order=1, max_features=None)SelectFromModel(estimator=LassoCV(alphas=None, copy_X=True, cv=5, eps=0.001,
fit_intercept=True, max_iter=1000,
n_alphas=100, n_jobs=None, normalize=True,
positive=False, precompute='auto',
random_state=None, selection='cyclic',
tol=0.0001, verbose=False),
max_features=None, norm_order=1, prefit=False, threshold=0.25)
Index(['RS', 'WS', 'AP'], dtype='object')
The ideal model included just 3 variables, RS, WS, and AP.
I chose to proceed with the Lasso model for cross validation and I used the validation set approach (train/test).
> #split data into test/train
+ from sklearn.model_selection import train_test_split
+
+ X_train, X_test, y_train, y_test = train_test_split(
+ X, y, test_size=0.3, random_state=101)
+
+ lm = LinearRegression()
+ lm.fit(X_train,y_train)
+ #build model on training dataLinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
> #fit model to test data
+ #display output
+ import warnings
+ warnings.filterwarnings('ignore')
+ X = X_test
+ X2 = sm.add_constant(X)
+ est = sm.OLS(y_test, X2)
+ est2 = est.fit()
+ print(est2.summary()) OLS Regression Results
==============================================================================
Dep. Variable: MS R-squared: 0.915
Model: OLS Adj. R-squared: 0.913
Method: Least Squares F-statistic: 428.7
Date: Fri, 09 Oct 2020 Prob (F-statistic): 6.27e-64
Time: 12:47:45 Log-Likelihood: -545.09
No. Observations: 124 AIC: 1098.
Df Residuals: 120 BIC: 1109.
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 101.6956 14.253 7.135 0.000 73.476 129.916
RS 0.5018 0.147 3.408 0.001 0.210 0.793
WS 0.2342 0.150 1.566 0.120 -0.062 0.530
AP 2.0807 0.141 14.764 0.000 1.802 2.360
==============================================================================
Omnibus: 16.843 Durbin-Watson: 2.091
Prob(Omnibus): 0.000 Jarque-Bera (JB): 44.340
Skew: 0.427 Prob(JB): 2.35e-10
Kurtosis: 5.802 Cond. No. 4.45e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.45e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
This model, with only 3 variables, performed similar to both of the models with 6 variables. Generally, simpler is better when model performance is close, and so I would recommend the Lasso approach for this data.
We can plot the actual values versus the predictions. In this case the result is nearly linear.
> # Actual vs Predictions
+ plt.figure(figsize=(6,4))
+ sns.set_style('darkgrid')
+ sns.scatterplot(x=y_test, y=predictions, color='red');
+ plt.xlabel('Actual Math Scores');
+ plt.ylabel('Predicted Math Scores');
+ plt.show()The residuals closely resemble a normal distribution with the addition of some outliers.
> #residual histogram
+ plt.figure(figsize=(6,4))
+ sns.distplot((y_test-predictions),bins=50, color='red');
+ plt.xlabel('Residuals');
+ plt.show()Adding a categorical variable, such as Borough, is easy with R. The regression function automatically converts factors and characters into dummy variables.
Call:
lm(formula = MS ~ Borough + RS + AP + WS, data = SAT2)
Residuals:
Min 1Q Median 3Q Max
-79.681 -11.238 0.709 10.740 100.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 74.22729 7.40677 10.022 < 2e-16 ***
BoroughBrooklyn -1.18264 2.58729 -0.457 0.6478
BoroughManhattan 1.64475 2.75195 0.598 0.5504
BoroughQueens -7.35978 3.35454 -2.194 0.0288 *
BoroughStatenIsland 10.60954 6.26756 1.693 0.0913 .
RS 0.37495 0.07368 5.089 5.52e-07 ***
AP 1.69935 0.07922 21.450 < 2e-16 ***
WS 0.44114 0.07247 6.087 2.68e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 19.29 on 404 degrees of freedom
Multiple R-squared: 0.9135, Adjusted R-squared: 0.9121
F-statistic: 609.9 on 7 and 404 DF, p-value: < 2.2e-16
As we can see, adding Borough did not improve the model and most of the individual boroughs have large p-values (>0.05). I would not recommend its inclusion.
Adding a categorical variable is a little more complicated in Python because they must be converted to dummy variables first.
One of the dummy variables must be dropped because its value can always be predicted from the remaining dummy variables.
Brooklyn Manhattan Queens StatenIsland
0 0 1 0 0
1 0 1 0 0
2 0 1 0 0
3 0 1 0 0
4 0 1 0 0
> #drop old Borough variable
+ SAT2.drop(['Borough'], axis=1, inplace=True)
+ #combine SAT with dummy variables
+ SAT2 = pd.concat([SAT2,borough],
+ axis=1)Rows: 412
Columns: 15
$ DBN <chr> "01M292", "01M448", "01M450", "01M458", "01M509", "01M...
$ NT <dbl> 29, 91, 70, 7, 44, 112, 159, 18, 130, 16, 62, 53, 58, ...
$ RS <dbl> 355, 383, 377, 414, 390, 332, 522, 417, 624, 395, 409,...
$ MS <dbl> 404, 423, 402, 401, 433, 557, 574, 418, 604, 400, 393,...
$ WS <dbl> 363, 366, 370, 359, 384, 316, 525, 411, 628, 387, 392,...
$ AP <dbl> 14.0, 29.2, 9.7, 2.2, 9.3, 84.7, 27.8, 0.5, 15.1, 1.7,...
$ BP <dbl> 29.1, 22.6, 23.9, 34.4, 31.6, 5.2, 11.7, 45.4, 15.1, 3...
$ HP <dbl> 53.8, 45.9, 55.4, 59.4, 56.9, 8.9, 14.2, 49.5, 18.2, 5...
$ WP <dbl> 1.7, 2.3, 10.4, 3.6, 1.6, 0.9, 44.9, 4.1, 49.8, 6.3, 4...
$ MP <dbl> 61.4, 57.4, 54.7, 43.3, 46.3, 53.7, 49.2, 39.9, 31.3, ...
$ FP <dbl> 38.6, 42.6, 45.3, 56.7, 53.7, 46.3, 50.8, 60.1, 68.7, ...
$ Brooklyn <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ Manhattan <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
$ Queens <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
$ StatenIsland <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
OLS Regression Results
==============================================================================
Dep. Variable: MS R-squared: 0.914
Model: OLS Adj. R-squared: 0.912
Method: Least Squares F-statistic: 609.9
Date: Fri, 09 Oct 2020 Prob (F-statistic): 2.40e-210
Time: 12:47:46 Log-Likelihood: -1799.9
No. Observations: 412 AIC: 3616.
Df Residuals: 404 BIC: 3648.
Df Model: 7
Covariance Type: nonrobust
================================================================================
coef std err t P>|t| [0.025 0.975]
--------------------------------------------------------------------------------
const 74.2273 7.407 10.022 0.000 59.667 88.788
RS 0.3749 0.074 5.089 0.000 0.230 0.520
WS 0.4411 0.072 6.087 0.000 0.299 0.584
AP 1.6994 0.079 21.450 0.000 1.544 1.855
Brooklyn -1.1826 2.587 -0.457 0.648 -6.269 3.904
Manhattan 1.6448 2.752 0.598 0.550 -3.765 7.055
Queens -7.3598 3.355 -2.194 0.029 -13.954 -0.765
StatenIsland 10.6095 6.268 1.693 0.091 -1.712 22.931
==============================================================================
Omnibus: 37.358 Durbin-Watson: 2.010
Prob(Omnibus): 0.000 Jarque-Bera (JB): 134.757
Skew: 0.298 Prob(JB): 5.47e-30
Kurtosis: 5.738 Cond. No. 4.51e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.51e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Once Borough is converted to a dummy variable we are able to produce the same results from Python.