Abstract

Academic rigor is seen as a prized tenet in the culture of American meritocracy. Hard work, preparation, and grit are seen as genuine characteristics for a thriving and prosperous economy. With this belief, every year over 25,000 eighth grade students in New York City’s public school system take the arduous Specialized High School Admissions Test (SHSAT) to gain admittance to eight prestigious specialized high schools: Bronx High School of Science, Brooklyn Latin School, Brooklyn Technical High School, the High School for Mathematics, Science and Engineering at City College of New York, High School of American Studies at Lehman College, Queens High School for the Sciences at York College, Staten Island Technical High School, and Stuyvesant High School. Yet with a population of over 1 million students, the largest system in the United States, is there equity in the system to allow admittance? It is the plan of this research to look at the data for New York City’s school system for the years of 2018 to 2021 with regards to middle schools and the SHSAT. The system is divided into 32 school districts. This study will examine middle school data to see if there is a correlation between the number of offers given for specialized high school based on the SHSAT and an individual school’s academic data, programs, and district assignment. It is the hope of this study to identify trends for improving success for schools and to, if any, to determine ways to improve schools that have few offers.

Keywords

Regression, Standardized Testing, Feature Selection, Public Education

Literature Review

The SHSAT and the lack of diversity in NYC Specialized High Schools have been a source of incredible controversy for the past number of years. Different politicians and activists have advocated for various approaches to the lack of diversity in NYC’s Specialized High Schools. Here is a quote from the Brookings Institution’s article “Elite of elitist? Lessons for colleges from selective high schools”:

The question underlying these debates is: what constitutes fair admissions criteria? Supporters of a narrowly meritocratic approach argue it is unfair to deny, say, an Asian-American student a seat in one of these prestigious schools, in favor of, say, a Black student who scored lower on the test. Those who support a more inclusive policy argue that it is fairer to take into account socioeconomic and/or racial background in making admissions decisions.¹

While the SHS in some cities (Milwaukee, Philadelphia) are more representative of the demographics of their overall districts, disparities exist throughout each of the SHS systems. Some cities—particularly Boston and New York—have the least representative SHS student populations in terms of race, while others such as Baltimore, Louisville, and DC face greater challenges in terms of economic representation.

Attempts have been made to provide greater access for underrepresented groups, here is a quote from the same article:

There are more modest efforts underway to improve representation in New York’s SHS, however. The Discovery program offers a summer enrichment program for eligible students who take the SHSAT and score just below the cutoff. To be eligible for Discovery, students must be from a low-income household, have scored within a certain range below the cutoff score, and attend a high-poverty high school. Students who participate and complete the program requirements are then admitted to one of the specialized high schools. By the summer of 2020, 20 percent of seats at each specialized high school will be reserved for Discovery program participants. However, the impact on racial diversity is likely to be muted. More than half of the school places offered to Discovery participants in 2019 went to Asian American students.

Here are some of the proposals: * Require all students to take the entrance exams for selective high schools. Currently, there are many gaps in the racial and socioeconomic makeup of the students who choose to take the admissions exams in the first place. These gaps may be due to a lack of sufficient information within households, or other barriers, such as limited transportation to the testing location. Instituting a school day in which the entrance exam is administered to all middle school students may increase the number of students with qualifying scores from diverse backgrounds.

Replace the unique tests schools use for admission with scores on state or national tests. Many selective high school entrance exams, including the ISEE in Boston and the SHSAT in New York, test students on curriculum that has not yet been taught in school. As a result, the students who perform best on the exam often are those who had access to prep courses or tutoring. In Boston, many Black, Hispanic, and low-income students perform well on the fifth-grade standardized test, but just a year later do not score well on the ISEE exam. This suggests that using a standardized test that is more aligned with the curriculum already taught in schools may decrease a portion of the racial and socioeconomic gaps present in current entrance exam scores.
Provide parents and students with better and more accessible information about schools and the admissions processes. Increasing school resources to provide on-demand preparation and coaching by school administrators could help increase diversity by incorporating students into the process who otherwise may not have applied to selective programs at all. Relying on parents and students to navigate a complex application process often shuts out disadvantaged students.
Increase access to advanced academic offerings. In cities like Baltimore, many students who attend middle schools in low-income areas do not have access to advanced or honors courses and are therefore disadvantaged in comparison to other students who can get a GPA boost from these higher-weighted courses. Expanding access to these programs throughout all middle schools could help level the GPA distribution so that not only students who attend middle schools in affluent areas can achieve highly-weighted GPAs. (Obviously this has advantages well beyond SHS admissions).
Provide extra learning opportunities for less advantaged students. Boston, for example, has instituted the Exam School Initiative, a summer and fall ISEE-prep program for children from underserved areas and has expanded enrollment from 450 to 750.

With these proposals, the demand remains for admitting students to these prestigious high schools. Here is how applicants are accepted:

Admission to the specialized schools is based strictly on the SHSAT, which students can choose to take in the fall of 8th grade. On exam day, applicants submit a ranking of their preferred specialized high schools, up to a total of eight. SHSAT scores are sorted from highest to lowest and students are assigned, in order, to the highest-ranked school on their list with seats available (Abdulkadiroğlu et al., 2014; Dobbie & Fryer, 2014; NYC DOE, 2014). Accordingly, cut scores for admission vary by school and year depending on the distribution of scores in that year, student preferences, and the number of seats. Cut scores are not made public, but there is a well-known hierarchy of selectivity, with Stuyvesant requiring the highest SHSAT score, followed by Bronx Science and Brooklyn Tech (Abdulkadiroğlu et al., 2014; Feinman, 2008).²

Regardless of how public policies are implemented, the lack of diversity in NYC’s Specialized High Schools will require a holistic approach that addresses years of district resource disparity and the city’s Gifted & Talented program.

Methodology

Exploratory Data Analysis

For the exploratory data analysis, I decided early to create maps that would help me grapple with the location of middle schools and their SHSAT performance. The following were coded in Python and R using Plotly and Shiny libraries. The purpose of these apps were to provide a visualization of the data. When viewing the apps, you will need an active internet connection. As you interact, observe the concentration of offers in specific districts.

***

Data Dictionary

The following is a brief description for the variables used in this research paper.

Variable	Description
accessibility	Accessibility of the site where the school is located: Functionally Accessible or Not Functionally Accessible
address	Primary Street Address
coursepassrate	Percent of students who passed core courses
district	District is a code which reflects where the school is located.
dual_program	If the school has a dual language program
elaprof	Percent of students proficient in ELA
mathprof	Percent of students proficient in math
name	School’s name
offers_count	The number of students from the middle school who received an offer to a Specialized High School
schooldbn	Unique identifier for each NYC public school.
shsat_prep	If the school listed a SHSAT prep program for
student_count	Number of 8th grade students from the middle school who applied to high school
surveysafety	Percent of students that felt safe in the hallways, bathrooms, locker rooms, and cafeteria
testers_count	The number of students from the middle school who took the SHSAT

The following code sets up necessary libraries and a function for plotting graphs.

library(tidyverse)
library(dplyr)
library(sigr)
library(broom)
library(ggplot2)
library(WVPlots)
library(corrplot)
library(tidymodels)
library(readr)
library(rpart.plot)
library(GGally)
library(summarytools)
library(MASS)
library(AER)
library(Metrics)

# Used for zero-inflation model
library(pscl)

# Used to test for zero inflation
library(performance)

# 
library(MuMIn)


# From http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/
# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Years 2018 to 2021

The following code imports the csv files from GitHub. The csv files were cleaned beforehand. Each section is split by year into different tabs. Two features have been added:

dual_program identifies if the school as having a dual language program with a value of 0 or 1
shsat_prep identifies if the school has a program to prep for the SHSAT with a value of 0 or 1

# School information: Note: only 2018-2021 are available
school_info <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data698/data/2018-2021_school_information.csv')

# Remove NA values for total_students, course pass rate, elaprof, and mathprof
school_info <- school_info %>% drop_na(totalstudents)
school_info <- school_info %>% drop_na(coursepassrate)
school_info <- school_info %>% drop_na(elaprof)
school_info <- school_info %>% drop_na(mathprof)

# Convert totalstudents to numberical
school_info$totalstudents <- as.numeric(as.character(school_info$totalstudents))


# Potential features
## Dual Program feature: If the school has a dual language program
school_info$dual_program <- ifelse(str_detect(school_info$ellprograms, "Dual") & (is.na(school_info$ellprograms) == FALSE), 1, 0)

## Specialized Test Prep feature: If the school has a specialized high school prep class 
school_info$shsat_prep <- ifelse(str_detect(school_info$electiveclasses, "Specialized High School Test") & (is.na(school_info$electiveclasses) == FALSE), 1, 0)

# School offers have the offers for middle schools: Note 2016-2021 are available but for the previous data set only 2018-2021 are available
# Remove rows with NA values
school_offers <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data698/data/school_offers.csv')
school_offers$Postcode <- as.character(school_offers$Postcode)
school_offers <- subset(school_offers, select = -c(`name`, `telephone`, `address`, `2016_student_count`,                                  `2016_testers_count`,`2016_offers_count`,`2017_student_count`,`2017_testers_count`,`2017_offers_count`))
school_offers <- na.omit(school_offers) 

temp1 <- subset(school_info, select = c(`district`, `schooldbn`, `name`, `year`, `coursepassrate`, `accessibility`, `elaprof`, `mathprof`, `totalstudents`, `surveysafety`,`dual_program`, `shsat_prep`))


temp2018 <- merge(temp1[temp1$year == 2018, ], subset(school_offers, select = c(`dbn`, `2018_student_count`, `2018_testers_count`, `2018_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2018)[names(temp2018) == '2018_student_count'] <- 'student_count'
names(temp2018)[names(temp2018) == '2018_testers_count'] <- 'testers_count'
names(temp2018)[names(temp2018) == '2018_offers_count'] <- 'offers_count'


temp2019 <- merge(temp1[temp1$year == 2019, ], subset(school_offers, select = c(`dbn`, `2019_student_count`, `2019_testers_count`, `2019_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2019)[names(temp2019) == '2019_student_count'] <- 'student_count'
names(temp2019)[names(temp2019) == '2019_testers_count'] <- 'testers_count'
names(temp2019)[names(temp2019) == '2019_offers_count'] <- 'offers_count'

temp2020 <- merge(temp1[temp1$year == 2020, ], subset(school_offers, select = c(`dbn`, `2020_student_count`, `2020_testers_count`, `2020_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2020)[names(temp2020) == '2020_student_count'] <- 'student_count'
names(temp2020)[names(temp2020) == '2020_testers_count'] <- 'testers_count'
names(temp2020)[names(temp2020) == '2020_offers_count'] <- 'offers_count'

temp2021 <- merge(temp1[temp1$year == 2021, ], subset(school_offers, select = c(`dbn`, `2021_student_count`, `2021_testers_count`, `2021_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2021)[names(temp2021) == '2021_student_count'] <- 'student_count'
names(temp2021)[names(temp2021) == '2021_testers_count'] <- 'testers_count'
names(temp2021)[names(temp2021) == '2021_offers_count'] <- 'offers_count'


# Create model data and concatenate the temp data frames
model_data <- rbind(temp2018, temp2019)
model_data <- rbind(model_data, temp2020)
model_data <- rbind(model_data, temp2021)


# Create rate_offers: number of offers / number of testers
model_data$rate_offers <- model_data$offers_count / model_data$testers_count
model_data$rate_offers[is.na(model_data$rate_offers)] <- 0

# Temp variables for schools
schools2018 <- model_data[model_data$year == 2018, ]
schools2019 <- model_data[model_data$year == 2019, ]
schools2020 <- model_data[model_data$year == 2020, ]
schools2021 <- model_data[model_data$year == 2021, ]

2018

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 05M123

[ 197 others ]

1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
197	(	95.2%	)

207 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 17.5 (8.6)

min ≤ med ≤ max:

1 ≤ 20 ≤ 31

IQR (CV) : 13 (0.5)

16 distinct values

207 (100.0%)

0 (0.0%)

3

name [character]

1. Academy for College Prepa

2. Academy of Applied Mathem

3. Academy of Arts and Lette

4. Academy of Medical Techno

5. ACCION Academy

6. Adrien Block (I.S. 25)

7. Alfred De B. Mason (P.S./

8. America's School of Heroe

9. Amistad Dual Language Sch

10. Anning S. Prall (I.S. 27)

[ 197 others ]

1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
197	(	95.2%	)

207 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2018

:

207

(

100.0%

)

207 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 91.3 (10.6)

min ≤ med ≤ max:

28 ≤ 95 ≤ 100

IQR (CV) : 9 (0.1)

33 distinct values

207 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Accessible

2. Not Accessible

120	(	58.0%	)
87	(	42.0%	)

207 (100.0%)

0 (0.0%)

7

elaprof [numeric]

Mean (sd) : 35.2 (20)

min ≤ med ≤ max:

4 ≤ 30 ≤ 95

IQR (CV) : 28.5 (0.6)

73 distinct values

207 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 32.3 (23.4)

min ≤ med ≤ max:

2 ≤ 26 ≤ 97

IQR (CV) : 36 (0.7)

79 distinct values

207 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 477.7 (451.1)

min ≤ med ≤ max:

38 ≤ 278 ≤ 2175

IQR (CV) : 291.5 (0.9)

172 distinct values

207 (100.0%)

0 (0.0%)

10

surveysafety [numeric]

Mean (sd) : 84.1 (8.5)

min ≤ med ≤ max:

59 ≤ 85 ≤ 100

IQR (CV) : 10 (0.1)

40 distinct values

207 (100.0%)

0 (0.0%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	186	(	89.9%	)
1	:	21	(	10.1%	)

207 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

Min : 0

Mean : 0

Max : 1

0	:	204	(	98.6%	)
1	:	3	(	1.4%	)

207 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 160.7 (152.9)

min ≤ med ≤ max:

15 ≤ 93 ≤ 769

IQR (CV) : 86.5 (1)

130 distinct values

207 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 55.6 (77.5)

min ≤ med ≤ max:

0 ≤ 22 ≤ 394

IQR (CV) : 59.5 (1.4)

92 distinct values

207 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 11.3 (28.5)

min ≤ med ≤ max:

0 ≤ 0 ≤ 205

IQR (CV) : 8 (2.5)

39 distinct values

207 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0.1 (1.9)

62 distinct values

207 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06

ggplot( head(arrange(schools2018, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

2019

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 02M104

[ 445 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
445	(	97.8%	)

455 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 16 (9.4)

min ≤ med ≤ max:

1 ≤ 15 ≤ 32

IQR (CV) : 16.5 (0.6)

32 distinct values

455 (100.0%)

0 (0.0%)

3

name [character]

1. Daniel Carter Beard (J.H

2. J.H.S. 74 Nathaniel Hawt

3. 47 - PS 347 The American

4. Academy for College Prepa

5. Academy for New Americans

6. Academy For Personal Lead

7. Academy for Young Writers

8. Academy of Applied Mathem

9. Academy of Arts and Lette

10. Academy of Medical Techno

[ 445 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
445	(	97.8%	)

455 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2019

:

455

(

100.0%

)

455 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 90.7 (9)

min ≤ med ≤ max:

40 ≤ 93 ≤ 100

IQR (CV) : 11 (0.1)

41 distinct values

455 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Accessible

2. Not Accessible

3. Accessible

4. Fully Accessible

5. Not Accessible

6. Partially Accessible

1	(	0.2%	)
1	(	0.2%	)
4	(	0.9%	)
89	(	19.6%	)
194	(	42.7%	)
165	(	36.3%	)

454 (99.8%)

1 (0.2%)

7

elaprof [numeric]

Mean (sd) : 37.7 (20.9)

min ≤ med ≤ max:

1 ≤ 32 ≤ 100

IQR (CV) : 28 (0.6)

88 distinct values

455 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 31 (23.6)

min ≤ med ≤ max:

0 ≤ 24 ≤ 100

IQR (CV) : 33 (0.8)

96 distinct values

455 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 627.7 (397)

min ≤ med ≤ max:

70 ≤ 525.5 ≤ 2238

IQR (CV) : 454 (0.6)

362 distinct values

454 (99.8%)

1 (0.2%)

10

surveysafety [numeric]

Mean (sd) : 83.4 (9.2)

min ≤ med ≤ max:

19 ≤ 85 ≤ 100

IQR (CV) : 12 (0.1)

44 distinct values

455 (100.0%)

0 (0.0%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	398	(	87.5%	)
1	:	57	(	12.5%	)

455 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

Min : 0

Mean : 0

Max : 1

0	:	448	(	98.5%	)
1	:	7	(	1.5%	)

455 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 148.3 (130.6)

min ≤ med ≤ max:

11 ≤ 100 ≤ 773

IQR (CV) : 97.5 (0.9)

217 distinct values

455 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 47.8 (64.1)

min ≤ med ≤ max:

0 ≤ 25 ≤ 352

IQR (CV) : 41.5 (1.3)

131 distinct values

455 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 8.2 (23.3)

min ≤ med ≤ max:

0 ≤ 0 ≤ 195

IQR (CV) : 6 (2.8)

51 distinct values

455 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0.1 (2.2)

111 distinct values

455 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06

ggplot( head(arrange(schools2019, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

2020

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 02M104

[ 446 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
446	(	97.8%	)

456 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 15.9 (9.4)

min ≤ med ≤ max:

1 ≤ 15 ≤ 32

IQR (CV) : 16 (0.6)

32 distinct values

456 (100.0%)

0 (0.0%)

3

name [character]

1. Academy for College Prepa

2. Academy For Personal Lead

3. Academy for Young Writers

4. Academy of Applied Mathem

5. Academy of Arts and Lette

6. Academy of Medical Techno

7. Academy of Public Relatio

8. Accion Academy (12X341)

9. Albert Shanker School for

10. All City Leadership Secon

[ 446 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
446	(	97.8%	)

456 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2020

:

456

(

100.0%

)

456 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 91.4 (9)

min ≤ med ≤ max:

27 ≤ 94 ≤ 100

IQR (CV) : 10 (0.1)

41 distinct values

456 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Fully Accessible

2. Not Accessible

3. Partially Accessible

115	(	25.2%	)
185	(	40.6%	)
156	(	34.2%	)

456 (100.0%)

0 (0.0%)

7

elaprof [numeric]

Mean (sd) : 43.7 (21.1)

min ≤ med ≤ max:

7 ≤ 39 ≤ 100

IQR (CV) : 30 (0.5)

84 distinct values

456 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 35.4 (23.7)

min ≤ med ≤ max:

2 ≤ 28 ≤ 100

IQR (CV) : 35 (0.7)

91 distinct values

456 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 634.4 (396.9)

min ≤ med ≤ max:

94 ≤ 534.5 ≤ 2251

IQR (CV) : 440.2 (0.6)

366 distinct values

456 (100.0%)

0 (0.0%)

10

surveysafety [numeric]

Mean (sd) : 83.2 (8.2)

min ≤ med ≤ max:

58 ≤ 84 ≤ 100

IQR (CV) : 10.2 (0.1)

42 distinct values

456 (100.0%)

0 (0.0%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	394	(	86.4%	)
1	:	62	(	13.6%	)

456 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

1 distinct value

0

:

456

(

100.0%

)

456 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 145.1 (130.9)

min ≤ med ≤ max:

11 ≤ 96 ≤ 772

IQR (CV) : 89.2 (0.9)

210 distinct values

456 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 47.4 (65.2)

min ≤ med ≤ max:

0 ≤ 23.5 ≤ 369

IQR (CV) : 45 (1.4)

133 distinct values

456 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 7.2 (21.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 214

IQR (CV) : 0 (2.9)

48 distinct values

456 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0 (2.3)

104 distinct values

456 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06

ggplot( head(arrange(schools2020, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

2021

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 02M104

[ 448 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
448	(	97.8%	)

458 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 16 (9.4)

min ≤ med ≤ max:

1 ≤ 15 ≤ 32

IQR (CV) : 16 (0.6)

32 distinct values

458 (100.0%)

0 (0.0%)

3

name [character]

1. Academy for College Prepa

2. Academy for New Americans

3. Academy For Personal Lead

4. Academy for Young Writers

5. Academy of Applied Mathem

6. Academy of Arts and Lette

7. Academy of Medical Techno

8. Accion Academy

9. Albert Shanker School for

10. All City Leadership Secon

[ 448 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
448	(	97.8%	)

458 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2021

:

458

(

100.0%

)

458 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 91.7 (8.8)

min ≤ med ≤ max:

39 ≤ 94 ≤ 100

IQR (CV) : 9.8 (0.1)

41 distinct values

458 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Fully Accessible

2. Not Accessible

3. Partially Accessible

124	(	27.1%	)
184	(	40.2%	)
150	(	32.8%	)

458 (100.0%)

0 (0.0%)

7

elaprof [numeric]

Mean (sd) : 43.4 (21)

min ≤ med ≤ max:

2 ≤ 38 ≤ 100

IQR (CV) : 31.5 (0.5)

87 distinct values

458 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 38.6 (23.4)

min ≤ med ≤ max:

1 ≤ 33.5 ≤ 100

IQR (CV) : 36 (0.6)

93 distinct values

458 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 625.5 (397.7)

min ≤ med ≤ max:

85 ≤ 524.5 ≤ 2265

IQR (CV) : 449 (0.6)

380 distinct values

458 (100.0%)

0 (0.0%)

10

surveysafety [numeric]

Mean (sd) : 82.9 (8.2)

min ≤ med ≤ max:

57 ≤ 84 ≤ 100

IQR (CV) : 11 (0.1)

41 distinct values

457 (99.8%)

1 (0.2%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	399	(	87.1%	)
1	:	59	(	12.9%	)

458 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

Min : 0

Mean : 0.3

Max : 1

0	:	324	(	70.7%	)
1	:	134	(	29.3%	)

458 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 139.4 (127.7)

min ≤ med ≤ max:

12 ≤ 95 ≤ 764

IQR (CV) : 82.8 (0.9)

209 distinct values

458 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 41.4 (59)

min ≤ med ≤ max:

0 ≤ 20 ≤ 374

IQR (CV) : 36.8 (1.4)

121 distinct values

458 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 7.3 (21.6)

min ≤ med ≤ max:

0 ≤ 0 ≤ 221

IQR (CV) : 0 (3)

49 distinct values

458 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0 (2.3)

98 distinct values

458 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-06

ggplot( head(arrange(schools2021, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

## Warning: Removed 1 rows containing missing values (geom_point).
## Removed 1 rows containing missing values (geom_point).

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

Experimentation & Results

After EDA, I will split the data into test and training partitions, this section will go over specific models that were made to predict count_offers (the number of offers given to student).

set.seed(505)
data_split <- initial_split(model_data, prop = 0.8, strata = district)
schools_train <- training(data_split)
schools_test <- testing(data_split)

Decision Tree

The following creates training and testing partitions for the data and constructs a decision tree. From the correlation plot, I decided to use offers_count ~ dual_program + shsat_prep + testers_count + student_count + elaprof + mathprof + coursepassrate.

The following code creates the decision tree.

model_spec <- decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")

model_spec

## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart

# Train the model
model <- model_spec %>%
  fit(formula = offers_count ~ dual_program + shsat_prep + testers_count + student_count + elaprof + mathprof + coursepassrate,
      data = schools_train)

# Information about the model
model

## parsnip model object
## 
## n= 1260 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1260 741467.700   8.4349210  
##    2) testers_count< 198 1195 178775.700   4.2702930  
##      4) mathprof< 89.5 1161  75592.280   2.9750220  
##        8) testers_count< 64.5 982   6799.541   0.6425662 *
##        9) testers_count>=64.5 179  34141.610  15.7709500  
##         18) mathprof< 53.5 102   6729.578   9.6960780 *
##         19) mathprof>=53.5 77  18661.450  23.8181800 *
##      5) mathprof>=89.5 34  34722.500  48.5000000  
##       10) testers_count< 104.5 24   8484.500  32.2500000 *
##       11) testers_count>=104.5 10   4690.500  87.5000000 *
##    3) testers_count>=198 65 160922.000  85.0000000  
##      6) elaprof< 82.5 50  39340.020  63.8600000  
##       12) mathprof< 62 18   5906.000  36.0000000 *
##       13) mathprof>=62 32  11603.970  79.5312500 *
##      7) elaprof>=82.5 15  24753.730 155.4667000 *

# A plot of the model
model$fit  %>% rpart.plot(type = 4, roundint=FALSE)

Performance of the Tree

After building the decision tree model based upon the variables, we will take the mean square error.

# Declare Mean Absolute Error function
MAE <- function(actual, predicted){
  return(mean(abs(actual - predicted)))
}

# Generate the predictions using the test data
schools_test$dt_model <- (predict(model, new_data = schools_test))$.pred

# Calculate the mean absolute error
mae_decisiontree_model <- MAE(schools_test$offers_count, schools_test$dt_model)


# Calculate the RMSE
decisiontree_rmse <- schools_test %>% 
  mutate(residual = dt_model - offers_count) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

The mean absolute error of the decision tree is 3.7813729.
The RMSE of the decision tree is

.

The two features of dual_program and shsat_prep were removed because they had a negligible correlation measure.

First Model

This section will build a traditional linear regression model. The training data will be used and then predictions will be run using the testing data set.

first_model <- lm(offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate, data=schools_train)
summary(first_model)

## 
## Call:
## lm(formula = offers_count ~ testers_count + student_count + elaprof + 
##     mathprof + coursepassrate, data = schools_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -70.220  -4.714   0.404   4.452 112.338 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    12.220274   3.895428   3.137  0.00175 ** 
## testers_count   0.456845   0.012550  36.401  < 2e-16 ***
## student_count  -0.101659   0.005554 -18.304  < 2e-16 ***
## elaprof        -0.061336   0.051486  -1.191  0.23376    
## mathprof        0.144781   0.047123   3.072  0.00217 ** 
## coursepassrate -0.141826   0.045641  -3.107  0.00193 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.47 on 1254 degrees of freedom
## Multiple R-squared:  0.7372, Adjusted R-squared:  0.7361 
## F-statistic: 703.4 on 5 and 1254 DF,  p-value: < 2.2e-16

Performance of First Model

A plot is shown using the model’s prediction and the actual offers_count. A RMSE score is also generated.

## Use the model on the testing data

## Create the prediction column based on the model
schools_test$p1 <- predict(first_model, newdata=schools_test)

## Plot to compare the predictions to actual prediction on the x axis
ggplot(schools_test, aes(x=p1, y=offers_count)) + geom_point() + geom_abline(color="blue")

## 
GainCurvePlot(
  schools_test,
  "p1",
  "offers_count",
  "First Model")

first_mae <- MAE(schools_test$offers_count, schools_test$p1)


# Calculate the RMSE
first_rmse <- schools_test %>% 
  mutate(residual = p1 - offers_count) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

The mean absolute error of the first model is 6.8365326.
The first model’s RMSE is

Second Model

This second model is constructed using a Quasi-Poisson regression. These specific models are used when a count variable is overly dispersed. One quick test is to see if the variance (588.9338074) is greater than the mean (8.4349206) for offers_count.

# Quasipoisson variance is greater than the mean for offers_count
glm_second_model <- glm(offers_count  ~ testers_count + student_count + elaprof + mathprof + coursepassrate, data=schools_train, family=quasipoisson)

summary(glm_second_model)

## 
## Call:
## glm(formula = offers_count ~ testers_count + student_count + 
##     elaprof + mathprof + coursepassrate, family = quasipoisson, 
##     data = schools_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -17.9525   -1.5339   -0.9779   -0.6373    9.6923  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -3.8042023  0.6488940  -5.863 5.82e-09 ***
## testers_count   0.0046146  0.0004839   9.537  < 2e-16 ***
## student_count   0.0024473  0.0002893   8.460  < 2e-16 ***
## elaprof        -0.0128340  0.0041142  -3.119  0.00185 ** 
## mathprof        0.0608657  0.0039056  15.584  < 2e-16 ***
## coursepassrate  0.0226981  0.0071408   3.179  0.00152 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 4.730181)
## 
##     Null deviance: 39950.3  on 1259  degrees of freedom
## Residual deviance:  6278.5  on 1254  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

## Create the prediction column based on the model
schools_test$p2 <- predict(glm_second_model, newdata=schools_test)


# Calculate the MAE
second_mae <- MAE(schools_test$offers_count, schools_test$p2)


# Calculate the RMSE
second_rmse <- schools_test %>% 
  mutate(residual = p2 - offers_count) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p2, y = offers_count)) +
  geom_point() + 
  geom_abline(color = "darkblue")

## 
GainCurvePlot(
  schools_test,
  "p2",
  "offers_count",
  "Second Model")

The mean absolute error of the second model is 6.673356.
The second model’s RMSE is

Third Model

The third model attempts to compensate for an excessive number of zeroes. A zero-inflation test is performed on the second model to see if this is the case. This model will employ zero-inflation regression.

# Zero Inflation test for glm_second_model
check_zeroinflation(glm_second_model)

## # Check for zero-inflation
## 
##    Observed zeros: 953
##   Predicted zeros: 452
##             Ratio: 0.47

## Model is underfitting zeros (probable zero-inflation).

zip_model <- zeroinfl(offers_count  ~ testers_count + student_count + elaprof + mathprof + coursepassrate, data = schools_train)

summary(zip_model)

## 
## Call:
## zeroinfl(formula = offers_count ~ testers_count + student_count + elaprof + 
##     mathprof + coursepassrate, data = schools_train)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -9.15246 -0.18978 -0.05793 -0.02208  7.88307 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     1.0516479  0.2225555   4.725 2.30e-06 ***
## testers_count   0.0075520  0.0002356  32.048  < 2e-16 ***
## student_count  -0.0008180  0.0001434  -5.705 1.16e-08 ***
## elaprof        -0.0088870  0.0018667  -4.761 1.93e-06 ***
## mathprof        0.0325666  0.0017960  18.133  < 2e-16 ***
## coursepassrate -0.0007899  0.0023219  -0.340    0.734    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    13.3749090  2.2631489   5.910 3.42e-09 ***
## testers_count  -0.0595986  0.0071053  -8.388  < 2e-16 ***
## student_count  -0.0005538  0.0019960  -0.277 0.781452    
## elaprof        -0.0223664  0.0199567  -1.121 0.262397    
## mathprof       -0.0610413  0.0179027  -3.410 0.000651 ***
## coursepassrate -0.0538042  0.0232936  -2.310 0.020898 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 16 
## Log-likelihood: -1748 on 12 Df

## Create the prediction column based on the model
schools_test$p3 <- predict(zip_model, newdata=schools_test)


# Calculate the MAE
third_mae <- MAE(schools_test$offers_count, schools_test$p3)


# Calculate the RMSE
third_rmse <- schools_test %>% 
  mutate(residual = p3 - offers_count) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p3, y = offers_count)) +
  geom_point() + 
  geom_abline(color = "darkblue")

## 
GainCurvePlot(
  schools_test,
  "p3",
  "offers_count",
  "Third Model")

The mean absolute error of the second model is 2.8787104.
The second model’s RMSE is

Comparison of AIC Measures

The following is a comparison of the three models that were created in terms of AIC. Note for Quasi-Poisson regressions there is no AIC reported.

AIC(first_model, glm_second_model, zip_model)

Results

For the models, the Zero-Inflated Poisson regression model provided the best results. A refinement of the second model, the third model has the lowest mean absolute error (MAE) and root mean square error (RMSE). From the summary of this model, besides the number of testers for the SHSAT, the variables elaprof and mathprof were significant. The percentage of students proficient in ELA and math for middle schools appears to have a correlation with the number of offers for Specialized High Schools in terms of the SHSAT.

Summary

Proficiency in ELA and math proved to be significant factors for middle school student populations in receiving offers to Specialized High Schools. Because the SHSAT is a standardized exam, it is possible that this test merely reinforces the need to improve proficiency for NY State exams. The presence of scarce SHSAT preparatory or dual language programs at middle schools did not appear to have a strong correlation with the number of offers.

Overall, from this study, I would recommend for additional aid for improving NYS proficiency in ELA and math for all students. This study did not examine the racial demographic of students. However, in future studies, I would encourage that such demographics be ignored and that researchers look at NYS proficiency rates. In addition, as these rates are examined, I would recommend researchers look at the impact of the city’s Gifted & Talented programs. Because lack of diversity is a problem in Specialized High Schools, additional research needs to examine diversity in the early childhood education, specifically the Gifted & Talented programs. Further study should also examine SHSAT awareness in minority communities: perhaps, lack of participation is another inhibiting factor for lack of diversity.

An additional recommendation would be to consider providing additional resources for SHSAT preparation. Perhaps lack of quality test prep programs would provide students the opportunity to catch up to NYS proficiency standards.

Appendix

Code

The code used for this project can be found at these locations:

GitHub for Project: https://github.com/logicalschema/spring2022/tree/main/data698
GitHub for Shiny App: https://github.com/logicalschema/spring2022/tree/main/data698/shinyapp/NYC_DOE_SHSAT
GitHub for Python App: https://github.com/logicalschema/specialized-hs

Links

Bibliography

Almarode, Subotnik, R., & Lee, G. M. (2016). What Works: Lessons From a Research Study of Specialized Science High School Graduates. Gifted Child Today Magazine, 39(4), 185–190. https://doi.org/10.1177/1076217516662096

Chang, W. (2013, January 1). Cookbook for R. Retrieved April 5, 2022, from http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

Corcoran, S., & Baker-Smith, C. (2015). Pathways to an Elite Education: Exploring Strategies to Diversity NYC’s Specialized High Schools. Institute for Education and Social Policy, and Research Alliance for New York City Schools.

Fadilla, & Usmeldi. (2020). Preliminary study for development of teacher’s Books oriented research-based learning on science lesson in Junior High School. Journal of Physics. Conference Series, 1481(1). https://doi.org/10.1088/1742-6596/1481/1/012068

Nwanganga, Fred; Chapple, Mike. Practical Machine Learning in R (Kindle Locations 3605-3681). Wiley. Kindle Edition.

Reeves, Richard V., and Ashley Schobert. “Elite or Elitist? Lessons for Colleges from Selective High Schools.” Brookings, Brookings, 9 Mar. 2022, https://www.brookings.edu/research/elite-or-elitist-lessons-for-colleges-from-selective-high-schools/.

Taylor, Jonathan James, “Policy Implications of a Predictive Validity Study of the Specialized High School Admissions Test at Three Elite New York City High Schools” (2015). CUNY Academic Works. https://academicworks.cuny.edu/gc_etds/1154

Waran, R. (2021, June 17). Deploying dash apps on azure with windows. Resonance Analytics. Retrieved March, 2022, from https://resonance-analytics.com/blog/deploying-dash-apps-on-azure

Wickham, Hadley; Grolemund, Garrett. R for Data Science . O’Reilly Media. Kindle Edition.

Data 698 Project

Sung Lee

2022-04-02

Abstract

Keywords

Literature Review

Methodology

Exploratory Data Analysis

Years 2018 to 2021

2018

2019

2020

2021

Experimentation & Results

Decision Tree

Performance of the Tree

First Model

Performance of First Model

Second Model

Third Model

Comparison of AIC Measures

Results

Summary

Appendix

Code

Links

Bibliography