Academic rigor is seen as a prized tenet in the culture of American meritocracy. Hard work, preparation, and grit are seen as genuine characteristics for a thriving and prosperous economy. With this belief, every year over 25,000 eighth grade students in New York City’s public school system take the arduous Specialized High School Admissions Test (SHSAT) to gain admittance to eight prestigious specialized high schools: Bronx High School of Science, Brooklyn Latin School, Brooklyn Technical High School, the High School for Mathematics, Science and Engineering at City College of New York, High School of American Studies at Lehman College, Queens High School for the Sciences at York College, Staten Island Technical High School, and Stuyvesant High School. Yet with a population of over 1 million students, the largest system in the United States, is there equity in the system to allow admittance? It is the plan of this research to look at the data for New York City’s school system for the years of 2018 to 2021 with regards to middle schools and the SHSAT. The system is divided into 32 school districts. This study will examine middle school data to see if there is a correlation between the number of offers given for specialized high school based on the SHSAT and an individual school’s academic data, programs, and district assignment. It is the hope of this study to identify trends for improving success for schools and to, if any, to determine ways to improve schools that have few offers.
Regression, Standardized Testing, Feature Selection, Public Education
The SHSAT and the lack of diversity in NYC Specialized High Schools have been a source of incredible controversy for the past number of years. Different politicians and activists have advocated for various approaches to the lack of diversity in NYC’s Specialized High Schools. Here is a quote from the Brookings Institution’s article “Elite of elitist? Lessons for colleges from selective high schools”:
The question underlying these debates is: what constitutes fair admissions criteria? Supporters of a narrowly meritocratic approach argue it is unfair to deny, say, an Asian-American student a seat in one of these prestigious schools, in favor of, say, a Black student who scored lower on the test. Those who support a more inclusive policy argue that it is fairer to take into account socioeconomic and/or racial background in making admissions decisions.1
While the SHS in some cities (Milwaukee, Philadelphia) are more representative of the demographics of their overall districts, disparities exist throughout each of the SHS systems. Some cities—particularly Boston and New York—have the least representative SHS student populations in terms of race, while others such as Baltimore, Louisville, and DC face greater challenges in terms of economic representation.
Attempts have been made to provide greater access for underrepresented groups, here is a quote from the same article:
There are more modest efforts underway to improve representation in New York’s SHS, however. The Discovery program offers a summer enrichment program for eligible students who take the SHSAT and score just below the cutoff. To be eligible for Discovery, students must be from a low-income household, have scored within a certain range below the cutoff score, and attend a high-poverty high school. Students who participate and complete the program requirements are then admitted to one of the specialized high schools. By the summer of 2020, 20 percent of seats at each specialized high school will be reserved for Discovery program participants. However, the impact on racial diversity is likely to be muted. More than half of the school places offered to Discovery participants in 2019 went to Asian American students.
Here are some of the proposals: * Require all students to take the entrance exams for selective high schools. Currently, there are many gaps in the racial and socioeconomic makeup of the students who choose to take the admissions exams in the first place. These gaps may be due to a lack of sufficient information within households, or other barriers, such as limited transportation to the testing location. Instituting a school day in which the entrance exam is administered to all middle school students may increase the number of students with qualifying scores from diverse backgrounds.
Replace the unique tests schools use for admission with scores on state or national tests. Many selective high school entrance exams, including the ISEE in Boston and the SHSAT in New York, test students on curriculum that has not yet been taught in school. As a result, the students who perform best on the exam often are those who had access to prep courses or tutoring. In Boston, many Black, Hispanic, and low-income students perform well on the fifth-grade standardized test, but just a year later do not score well on the ISEE exam. This suggests that using a standardized test that is more aligned with the curriculum already taught in schools may decrease a portion of the racial and socioeconomic gaps present in current entrance exam scores.
Provide parents and students with better and more accessible information about schools and the admissions processes. Increasing school resources to provide on-demand preparation and coaching by school administrators could help increase diversity by incorporating students into the process who otherwise may not have applied to selective programs at all. Relying on parents and students to navigate a complex application process often shuts out disadvantaged students.
Increase access to advanced academic offerings. In cities like Baltimore, many students who attend middle schools in low-income areas do not have access to advanced or honors courses and are therefore disadvantaged in comparison to other students who can get a GPA boost from these higher-weighted courses. Expanding access to these programs throughout all middle schools could help level the GPA distribution so that not only students who attend middle schools in affluent areas can achieve highly-weighted GPAs. (Obviously this has advantages well beyond SHS admissions).
Provide extra learning opportunities for less advantaged students. Boston, for example, has instituted the Exam School Initiative, a summer and fall ISEE-prep program for children from underserved areas and has expanded enrollment from 450 to 750.
With these proposals, the demand remains for admitting students to these prestigious high schools. Here is how applicants are accepted:
Admission to the specialized schools is based strictly on the SHSAT, which students can choose to take in the fall of 8th grade. On exam day, applicants submit a ranking of their preferred specialized high schools, up to a total of eight. SHSAT scores are sorted from highest to lowest and students are assigned, in order, to the highest-ranked school on their list with seats available (Abdulkadiroğlu et al., 2014; Dobbie & Fryer, 2014; NYC DOE, 2014). Accordingly, cut scores for admission vary by school and year depending on the distribution of scores in that year, student preferences, and the number of seats. Cut scores are not made public, but there is a well-known hierarchy of selectivity, with Stuyvesant requiring the highest SHSAT score, followed by Bronx Science and Brooklyn Tech (Abdulkadiroğlu et al., 2014; Feinman, 2008).2
Regardless of how public policies are implemented, the lack of diversity in NYC’s Specialized High Schools will require a holistic approach that addresses years of district resource disparity and the city’s Gifted & Talented program.
For the exploratory data analysis, I decided early to create maps
that would help me grapple with the location of middle schools and their
SHSAT performance. The following were coded in Python and R using
Plotly and Shiny libraries. The purpose of
these apps were to provide a visualization of the data. When viewing the
apps, you will need an active internet connection. As you interact,
observe the concentration of offers in specific districts.
***
***
Data Dictionary
The following is a brief description for the variables used in this research paper.
| Variable | Description |
|---|---|
| accessibility | Accessibility of the site where the school is located: Functionally Accessible or Not Functionally Accessible |
| address | Primary Street Address |
| coursepassrate | Percent of students who passed core courses |
| district | District is a code which reflects where the school is located. |
| dual_program | If the school has a dual language program |
| elaprof | Percent of students proficient in ELA |
| mathprof | Percent of students proficient in math |
| name | School’s name |
| offers_count | The number of students from the middle school who received an offer to a Specialized High School |
| schooldbn | Unique identifier for each NYC public school. |
| shsat_prep | If the school listed a SHSAT prep program for |
| student_count | Number of 8th grade students from the middle school who applied to high school |
| surveysafety | Percent of students that felt safe in the hallways, bathrooms, locker rooms, and cafeteria |
| testers_count | The number of students from the middle school who took the SHSAT |
The following code sets up necessary libraries and a function for plotting graphs.
library(tidyverse)
library(dplyr)
library(sigr)
library(broom)
library(ggplot2)
library(WVPlots)
library(corrplot)
library(tidymodels)
library(readr)
library(rpart.plot)
library(GGally)
library(summarytools)
library(MASS)
library(AER)
library(Metrics)
# Used for zero-inflation model
library(pscl)
# Used to test for zero inflation
library(performance)
#
library(MuMIn)
# From http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
library(grid)
# Make a list from the ... arguments and plotlist
plots <- c(list(...), plotlist)
numPlots = length(plots)
# If layout is NULL, then use 'cols' to determine layout
if (is.null(layout)) {
# Make the panel
# ncol: Number of columns of plots
# nrow: Number of rows needed, calculated from # of cols
layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
ncol = cols, nrow = ceiling(numPlots/cols))
}
if (numPlots==1) {
print(plots[[1]])
} else {
# Set up the page
grid.newpage()
pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))
# Make each plot, in the correct location
for (i in 1:numPlots) {
# Get the i,j matrix positions of the regions that contain this subplot
matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))
print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
layout.pos.col = matchidx$col))
}
}
}
The following code imports the csv files from GitHub. The csv files were cleaned beforehand. Each section is split by year into different tabs. Two features have been added:
dual_program identifies if the school as having a dual
language program with a value of 0 or 1shsat_prep identifies if the school has a program to
prep for the SHSAT with a value of 0 or 1# School information: Note: only 2018-2021 are available
school_info <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data698/data/2018-2021_school_information.csv')
# Remove NA values for total_students, course pass rate, elaprof, and mathprof
school_info <- school_info %>% drop_na(totalstudents)
school_info <- school_info %>% drop_na(coursepassrate)
school_info <- school_info %>% drop_na(elaprof)
school_info <- school_info %>% drop_na(mathprof)
# Convert totalstudents to numberical
school_info$totalstudents <- as.numeric(as.character(school_info$totalstudents))
# Potential features
## Dual Program feature: If the school has a dual language program
school_info$dual_program <- ifelse(str_detect(school_info$ellprograms, "Dual") & (is.na(school_info$ellprograms) == FALSE), 1, 0)
## Specialized Test Prep feature: If the school has a specialized high school prep class
school_info$shsat_prep <- ifelse(str_detect(school_info$electiveclasses, "Specialized High School Test") & (is.na(school_info$electiveclasses) == FALSE), 1, 0)
# School offers have the offers for middle schools: Note 2016-2021 are available but for the previous data set only 2018-2021 are available
# Remove rows with NA values
school_offers <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data698/data/school_offers.csv')
school_offers$Postcode <- as.character(school_offers$Postcode)
school_offers <- subset(school_offers, select = -c(`name`, `telephone`, `address`, `2016_student_count`, `2016_testers_count`,`2016_offers_count`,`2017_student_count`,`2017_testers_count`,`2017_offers_count`))
school_offers <- na.omit(school_offers)
temp1 <- subset(school_info, select = c(`district`, `schooldbn`, `name`, `year`, `coursepassrate`, `accessibility`, `elaprof`, `mathprof`, `totalstudents`, `surveysafety`,`dual_program`, `shsat_prep`))
temp2018 <- merge(temp1[temp1$year == 2018, ], subset(school_offers, select = c(`dbn`, `2018_student_count`, `2018_testers_count`, `2018_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2018)[names(temp2018) == '2018_student_count'] <- 'student_count'
names(temp2018)[names(temp2018) == '2018_testers_count'] <- 'testers_count'
names(temp2018)[names(temp2018) == '2018_offers_count'] <- 'offers_count'
temp2019 <- merge(temp1[temp1$year == 2019, ], subset(school_offers, select = c(`dbn`, `2019_student_count`, `2019_testers_count`, `2019_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2019)[names(temp2019) == '2019_student_count'] <- 'student_count'
names(temp2019)[names(temp2019) == '2019_testers_count'] <- 'testers_count'
names(temp2019)[names(temp2019) == '2019_offers_count'] <- 'offers_count'
temp2020 <- merge(temp1[temp1$year == 2020, ], subset(school_offers, select = c(`dbn`, `2020_student_count`, `2020_testers_count`, `2020_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2020)[names(temp2020) == '2020_student_count'] <- 'student_count'
names(temp2020)[names(temp2020) == '2020_testers_count'] <- 'testers_count'
names(temp2020)[names(temp2020) == '2020_offers_count'] <- 'offers_count'
temp2021 <- merge(temp1[temp1$year == 2021, ], subset(school_offers, select = c(`dbn`, `2021_student_count`, `2021_testers_count`, `2021_offers_count`)), by.x = 'schooldbn', by.y='dbn')
names(temp2021)[names(temp2021) == '2021_student_count'] <- 'student_count'
names(temp2021)[names(temp2021) == '2021_testers_count'] <- 'testers_count'
names(temp2021)[names(temp2021) == '2021_offers_count'] <- 'offers_count'
# Create model data and concatenate the temp data frames
model_data <- rbind(temp2018, temp2019)
model_data <- rbind(model_data, temp2020)
model_data <- rbind(model_data, temp2021)
# Create rate_offers: number of offers / number of testers
model_data$rate_offers <- model_data$offers_count / model_data$testers_count
model_data$rate_offers[is.na(model_data$rate_offers)] <- 0
# Temp variables for schools
schools2018 <- model_data[model_data$year == 2018, ]
schools2019 <- model_data[model_data$year == 2019, ]
schools2020 <- model_data[model_data$year == 2020, ]
schools2021 <- model_data[model_data$year == 2021, ]
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
16 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
33 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
73 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
79 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
172 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
40 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] |
|
|
207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
130 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
92 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
39 distinct values | 207 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
62 distinct values | 207 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06
ggplot( head(arrange(schools2018, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
32 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
41 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
454 (99.8%) | 1 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
88 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
96 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
362 distinct values | 454 (99.8%) | 1 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
44 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] |
|
|
455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
217 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
131 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
51 distinct values | 455 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
111 distinct values | 455 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06
ggplot( head(arrange(schools2019, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
32 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
41 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
84 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
91 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
366 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
42 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] | 1 distinct value |
|
456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
210 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
133 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
48 distinct values | 456 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
104 distinct values | 456 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06
ggplot( head(arrange(schools2020, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
| No | Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid | Missing | |||||||||||||||||||||||||||||||||||||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | schooldbn [character] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2 | district [numeric] |
|
32 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 3 | name [character] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 4 | year [numeric] | 1 distinct value |
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5 | coursepassrate [numeric] |
|
41 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 6 | accessibility [character] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 7 | elaprof [numeric] |
|
87 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 8 | mathprof [numeric] |
|
93 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 9 | totalstudents [numeric] |
|
380 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 10 | surveysafety [numeric] |
|
41 distinct values | 457 (99.8%) | 1 (0.2%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 11 | dual_program [numeric] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 12 | shsat_prep [numeric] |
|
|
458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 13 | student_count [numeric] |
|
209 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 14 | testers_count [numeric] |
|
121 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 15 | offers_count [numeric] |
|
49 distinct values | 458 (100.0%) | 0 (0.0%) | ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 16 | rate_offers [numeric] |
|
98 distinct values | 458 (100.0%) | 0 (0.0%) |
Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06
ggplot( head(arrange(schools2021, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) +
geom_bar(stat = "identity", fill="#0033a1") +
coord_flip() +
theme_bw()
p1 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p2 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p3 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p4 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter()
p5 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")
multiplot(p1, p2, p3, p4, p5, cols=3)
## Warning: Removed 1 rows containing missing values (geom_point).
## Removed 1 rows containing missing values (geom_point).
# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)
After EDA, I will split the data into test and training partitions,
this section will go over specific models that were made to predict
count_offers (the number of offers given to student).
set.seed(505)
data_split <- initial_split(model_data, prop = 0.8, strata = district)
schools_train <- training(data_split)
schools_test <- testing(data_split)
The following creates training and testing partitions for the data
and constructs a decision tree. From the correlation plot, I decided to
use
offers_count ~ dual_program + shsat_prep + testers_count + student_count + elaprof + mathprof + coursepassrate.
The following code creates the decision tree.
model_spec <- decision_tree() %>%
set_mode("regression") %>%
set_engine("rpart")
model_spec
## Decision Tree Model Specification (regression)
##
## Computational engine: rpart
# Train the model
model <- model_spec %>%
fit(formula = offers_count ~ dual_program + shsat_prep + testers_count + student_count + elaprof + mathprof + coursepassrate,
data = schools_train)
# Information about the model
model
## parsnip model object
##
## n= 1260
##
## node), split, n, deviance, yval
## * denotes terminal node
##
## 1) root 1260 741467.700 8.4349210
## 2) testers_count< 198 1195 178775.700 4.2702930
## 4) mathprof< 89.5 1161 75592.280 2.9750220
## 8) testers_count< 64.5 982 6799.541 0.6425662 *
## 9) testers_count>=64.5 179 34141.610 15.7709500
## 18) mathprof< 53.5 102 6729.578 9.6960780 *
## 19) mathprof>=53.5 77 18661.450 23.8181800 *
## 5) mathprof>=89.5 34 34722.500 48.5000000
## 10) testers_count< 104.5 24 8484.500 32.2500000 *
## 11) testers_count>=104.5 10 4690.500 87.5000000 *
## 3) testers_count>=198 65 160922.000 85.0000000
## 6) elaprof< 82.5 50 39340.020 63.8600000
## 12) mathprof< 62 18 5906.000 36.0000000 *
## 13) mathprof>=62 32 11603.970 79.5312500 *
## 7) elaprof>=82.5 15 24753.730 155.4667000 *
# A plot of the model
model$fit %>% rpart.plot(type = 4, roundint=FALSE)
After building the decision tree model based upon the variables, we will take the mean square error.
# Declare Mean Absolute Error function
MAE <- function(actual, predicted){
return(mean(abs(actual - predicted)))
}
# Generate the predictions using the test data
schools_test$dt_model <- (predict(model, new_data = schools_test))$.pred
# Calculate the mean absolute error
mae_decisiontree_model <- MAE(schools_test$offers_count, schools_test$dt_model)
# Calculate the RMSE
decisiontree_rmse <- schools_test %>%
mutate(residual = dt_model - offers_count) %>%
summarize(rmse = sqrt(mean(residual^2)))
The mean absolute error of the decision tree is
3.7813729..
The two features of dual_program and
shsat_prep were removed because they had a negligible
correlation measure.
This section will build a traditional linear regression model. The training data will be used and then predictions will be run using the testing data set.
first_model <- lm(offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate, data=schools_train)
summary(first_model)
##
## Call:
## lm(formula = offers_count ~ testers_count + student_count + elaprof +
## mathprof + coursepassrate, data = schools_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -70.220 -4.714 0.404 4.452 112.338
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.220274 3.895428 3.137 0.00175 **
## testers_count 0.456845 0.012550 36.401 < 2e-16 ***
## student_count -0.101659 0.005554 -18.304 < 2e-16 ***
## elaprof -0.061336 0.051486 -1.191 0.23376
## mathprof 0.144781 0.047123 3.072 0.00217 **
## coursepassrate -0.141826 0.045641 -3.107 0.00193 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.47 on 1254 degrees of freedom
## Multiple R-squared: 0.7372, Adjusted R-squared: 0.7361
## F-statistic: 703.4 on 5 and 1254 DF, p-value: < 2.2e-16
A plot is shown using the model’s prediction and the actual
offers_count. A RMSE score is also generated.
## Use the model on the testing data
## Create the prediction column based on the model
schools_test$p1 <- predict(first_model, newdata=schools_test)
## Plot to compare the predictions to actual prediction on the x axis
ggplot(schools_test, aes(x=p1, y=offers_count)) + geom_point() + geom_abline(color="blue")
##
GainCurvePlot(
schools_test,
"p1",
"offers_count",
"First Model")
first_mae <- MAE(schools_test$offers_count, schools_test$p1)
# Calculate the RMSE
first_rmse <- schools_test %>%
mutate(residual = p1 - offers_count) %>%
summarize(rmse = sqrt(mean(residual^2)))
The mean absolute error of the first model is
6.8365326.This second model is constructed using a Quasi-Poisson regression.
These specific models are used when a count variable is overly
dispersed. One quick test is to see if the variance
(588.9338074) is greater than the mean
(8.4349206) for offers_count.
# Quasipoisson variance is greater than the mean for offers_count
glm_second_model <- glm(offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate, data=schools_train, family=quasipoisson)
summary(glm_second_model)
##
## Call:
## glm(formula = offers_count ~ testers_count + student_count +
## elaprof + mathprof + coursepassrate, family = quasipoisson,
## data = schools_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -17.9525 -1.5339 -0.9779 -0.6373 9.6923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.8042023 0.6488940 -5.863 5.82e-09 ***
## testers_count 0.0046146 0.0004839 9.537 < 2e-16 ***
## student_count 0.0024473 0.0002893 8.460 < 2e-16 ***
## elaprof -0.0128340 0.0041142 -3.119 0.00185 **
## mathprof 0.0608657 0.0039056 15.584 < 2e-16 ***
## coursepassrate 0.0226981 0.0071408 3.179 0.00152 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 4.730181)
##
## Null deviance: 39950.3 on 1259 degrees of freedom
## Residual deviance: 6278.5 on 1254 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
## Create the prediction column based on the model
schools_test$p2 <- predict(glm_second_model, newdata=schools_test)
# Calculate the MAE
second_mae <- MAE(schools_test$offers_count, schools_test$p2)
# Calculate the RMSE
second_rmse <- schools_test %>%
mutate(residual = p2 - offers_count) %>%
summarize(rmse = sqrt(mean(residual^2)))
# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p2, y = offers_count)) +
geom_point() +
geom_abline(color = "darkblue")
##
GainCurvePlot(
schools_test,
"p2",
"offers_count",
"Second Model")
The third model attempts to compensate for an excessive number of zeroes. A zero-inflation test is performed on the second model to see if this is the case. This model will employ zero-inflation regression.
# Zero Inflation test for glm_second_model
check_zeroinflation(glm_second_model)
## # Check for zero-inflation
##
## Observed zeros: 953
## Predicted zeros: 452
## Ratio: 0.47
## Model is underfitting zeros (probable zero-inflation).
zip_model <- zeroinfl(offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate, data = schools_train)
summary(zip_model)
##
## Call:
## zeroinfl(formula = offers_count ~ testers_count + student_count + elaprof +
## mathprof + coursepassrate, data = schools_train)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -9.15246 -0.18978 -0.05793 -0.02208 7.88307
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.0516479 0.2225555 4.725 2.30e-06 ***
## testers_count 0.0075520 0.0002356 32.048 < 2e-16 ***
## student_count -0.0008180 0.0001434 -5.705 1.16e-08 ***
## elaprof -0.0088870 0.0018667 -4.761 1.93e-06 ***
## mathprof 0.0325666 0.0017960 18.133 < 2e-16 ***
## coursepassrate -0.0007899 0.0023219 -0.340 0.734
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 13.3749090 2.2631489 5.910 3.42e-09 ***
## testers_count -0.0595986 0.0071053 -8.388 < 2e-16 ***
## student_count -0.0005538 0.0019960 -0.277 0.781452
## elaprof -0.0223664 0.0199567 -1.121 0.262397
## mathprof -0.0610413 0.0179027 -3.410 0.000651 ***
## coursepassrate -0.0538042 0.0232936 -2.310 0.020898 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 16
## Log-likelihood: -1748 on 12 Df
## Create the prediction column based on the model
schools_test$p3 <- predict(zip_model, newdata=schools_test)
# Calculate the MAE
third_mae <- MAE(schools_test$offers_count, schools_test$p3)
# Calculate the RMSE
third_rmse <- schools_test %>%
mutate(residual = p3 - offers_count) %>%
summarize(rmse = sqrt(mean(residual^2)))
# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p3, y = offers_count)) +
geom_point() +
geom_abline(color = "darkblue")
##
GainCurvePlot(
schools_test,
"p3",
"offers_count",
"Third Model")
The following is a comparison of the three models that were created in terms of AIC. Note for Quasi-Poisson regressions there is no AIC reported.
AIC(first_model, glm_second_model, zip_model)
For the models, the Zero-Inflated Poisson regression model provided
the best results. A refinement of the second model, the third model has
the lowest mean absolute error (MAE) and root mean square error (RMSE).
From the summary of this model, besides the number of testers for the
SHSAT, the variables elaprof and mathprof were
significant. The percentage of students proficient in ELA and math for
middle schools appears to have a correlation with the number of offers
for Specialized High Schools in terms of the SHSAT.
Proficiency in ELA and math proved to be significant factors for middle school student populations in receiving offers to Specialized High Schools. Because the SHSAT is a standardized exam, it is possible that this test merely reinforces the need to improve proficiency for NY State exams. The presence of scarce SHSAT preparatory or dual language programs at middle schools did not appear to have a strong correlation with the number of offers.
Overall, from this study, I would recommend for additional aid for improving NYS proficiency in ELA and math for all students. This study did not examine the racial demographic of students. However, in future studies, I would encourage that such demographics be ignored and that researchers look at NYS proficiency rates. In addition, as these rates are examined, I would recommend researchers look at the impact of the city’s Gifted & Talented programs. Because lack of diversity is a problem in Specialized High Schools, additional research needs to examine diversity in the early childhood education, specifically the Gifted & Talented programs. Further study should also examine SHSAT awareness in minority communities: perhaps, lack of participation is another inhibiting factor for lack of diversity.
An additional recommendation would be to consider providing additional resources for SHSAT preparation. Perhaps lack of quality test prep programs would provide students the opportunity to catch up to NYS proficiency standards.
The code used for this project can be found at these locations:
Almarode, Subotnik, R., & Lee, G. M. (2016). What Works: Lessons From a Research Study of Specialized Science High School Graduates. Gifted Child Today Magazine, 39(4), 185–190. https://doi.org/10.1177/1076217516662096
Chang, W. (2013, January 1). Cookbook for R. Retrieved April 5, 2022, from http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
Corcoran, S., & Baker-Smith, C. (2015). Pathways to an Elite Education: Exploring Strategies to Diversity NYC’s Specialized High Schools. Institute for Education and Social Policy, and Research Alliance for New York City Schools.
Fadilla, & Usmeldi. (2020). Preliminary study for development of teacher’s Books oriented research-based learning on science lesson in Junior High School. Journal of Physics. Conference Series, 1481(1). https://doi.org/10.1088/1742-6596/1481/1/012068
Nwanganga, Fred; Chapple, Mike. Practical Machine Learning in R (Kindle Locations 3605-3681). Wiley. Kindle Edition.
Reeves, Richard V., and Ashley Schobert. “Elite or Elitist? Lessons for Colleges from Selective High Schools.” Brookings, Brookings, 9 Mar. 2022, https://www.brookings.edu/research/elite-or-elitist-lessons-for-colleges-from-selective-high-schools/.
Taylor, Jonathan James, “Policy Implications of a Predictive Validity Study of the Specialized High School Admissions Test at Three Elite New York City High Schools” (2015). CUNY Academic Works. https://academicworks.cuny.edu/gc_etds/1154
Waran, R. (2021, June 17). Deploying dash apps on azure with windows. Resonance Analytics. Retrieved March, 2022, from https://resonance-analytics.com/blog/deploying-dash-apps-on-azure
Wickham, Hadley; Grolemund, Garrett. R for Data Science . O’Reilly Media. Kindle Edition.