Assignment

You get to decide which dataset you want to work on. THe data set must be different You can work on a problem from your work, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, Census Bureau, USGS or other open data portal. Select one of the methodologies studied in weeks 1-10, and one methodology from weeks 11-15 to apply in the new dataset selected. To complete this task:. - describe the problem you are trying to solve. - describe your datasets and what you did to prepare the data for analysis. - methodologies you used for analyzing the data - why you did what you did - make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution. Your final presentation could be the traditional R file and essay, or it could be an oral presentation with the execution and explanation of your code, recorded on any platform of your choice (Youtube, Free Cam). If you select the presentation, it should be a 5 to 8 minutes recording.

Problem

The topic of selecting a middle school was a selfish one for me because my child is set to attend middle school in New York City. NYC’s public school system is difficult to navigate and as a parent I sought to select a middle school with an increased likelihood of students attending a specialized high school. The main problem I wish to solve is to select a middle school for my child to attend with an increased likelihood of students attending a specialized high school.

Every year over 25,000 eighth grade students in New York City’s public school system take the arduous Specialized High School Admissions Test (SHSAT) to gain admittance to eight prestigious specialized high schools: Bronx High School of Science, Brooklyn Latin School, Brooklyn Technical High School, the High School for Mathematics, Science and Engineering at City College of New York, High School of American Studies at Lehman College, Queens High School for the Sciences at York College, Staten Island Technical High School, and Stuyvesant High School. Yet with a population of over 1 million students, the largest system in the United States, is there equity in the system to allow admittance? It is the plan of this research to look at the data for New York City’s school system for the years of 2018 to 2021 with regards to middle schools and the SHSAT. The system is divided into 32 school districts. This study will examine middle school data to see if there is a correlation between the number of offers given for specialized high school based on the SHSAT and an individual school’s academic data, programs, and district assignment. It is the hope of this study to identify trends for improving success for schools and to, if any, to determine ways to improve schools that have few offers.

Datasets

The following data sets were used for this homework.

Reasoning

Looking at the NYC Open Data, the data for middle schools proved to only be viable for 2018 to 2021. In addition, the data sources would report a range of 0-5 for offers. Hence, if a middle school received 0 offers for students, the row would have a value of “0 to 5”. I decided to list such values as 0 in addition to removing middle schools with “NA” values for student counts: the reason is that in some years the middle school may not have reported the data or did not exist.
Later in the exploration of models, I found that issues with overdispersion and would need to employ zero-inflation modelling methods.

Methodology

In visualizing and exploring the data, I wanted to be able to identify high performing middle schools by examining them on a map and plotting their SHSAT offers (offers made by specialized high schools to students). By also creating some feature variables, I would examine a correlation matrix for the variables and see to identify what makes a good middle school for producing students with a high probability of being given an offer by a specialized high school. Next, I would use different models, a decision tree, multinomial linear regression, and knn clustering to see if there are possible policies that can be implemented to increase the likelihood of students receiving offers for specialized high schools.

Exploratory Data Analysis

For the exploratory data analysis, I decided early to create maps that would help me grapple with the location of middle schools and their SHSAT performance. The following were coded in Python and R using Plotly and Shiny libraries.

***

https://sslee-specializedhs.azurewebsites.net/

The following code imports that data sets that were created after cleaning.

# School information: Note: only 2018-2021 are available
school_info <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw4/data/2018-2021_school_information.csv')

## Rows: 1906 Columns: 17
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (11): schooldbn, name, neighborhood, address, totalstudents, ellprograms...
## dbl  (6): district, coursepassrate, elaprof, mathprof, surveysafety, year
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Remove NA values for total_students, course pass rate, elaprof, and mathprof
school_info <- school_info %>% drop_na(totalstudents)
school_info <- school_info %>% drop_na(coursepassrate)
school_info <- school_info %>% drop_na(elaprof)
school_info <- school_info %>% drop_na(mathprof)

# Convert totalstudents to numberical
school_info$totalstudents <- as.numeric(as.character(school_info$totalstudents))

## Warning: NAs introduced by coercion

# Potential features
## Dual Program feature: If the school has a dual language program
school_info$dual_program <- ifelse(str_detect(school_info$ellprograms, "Dual") & (is.na(school_info$ellprograms) == FALSE), 1, 0)

## Specialized Test Prep feature: If the school has a specialized high school prep class 
school_info$shsat_prep <- ifelse(str_detect(school_info$electiveclasses, "Specialized High School Test") & (is.na(school_info$electiveclasses) == FALSE), 1, 0)

# School offers have the offers for middle schools: Note 2016-2021 are available but for the previous data set only 2018-2021 are available
# Remove rows with NA values
school_offers <- read_csv('https://raw.githubusercontent.com/logicalschema/spring2022/main/data622/hw4/data/school_offers.csv')

## Rows: 470 Columns: 28
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr  (6): dbn, name, telephone, address, Borough, url
## dbl (22): district, Postcode, Latitude, Longitude, 2016_student_count, 2016_...
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.

school_offers$Postcode <- as.character(school_offers$Postcode)
school_offers <- subset(school_offers, select = -c(`name`, `telephone`, `address`, `2016_student_count`,                                  `2016_testers_count`,`2016_offers_count`,`2017_student_count`,`2017_testers_count`,`2017_offers_count`))
school_offers <- na.omit(school_offers) 

temp1 <- subset(school_info, select = c(`district`, `schooldbn`, `name`, `year`, `coursepassrate`, `accessibility`, `elaprof`, `mathprof`, `totalstudents`, `surveysafety`,`dual_program`, `shsat_prep`))


temp2018 <- merge(temp1[temp1$year == 2018, ], subset(school_offers, select = c(`dbn`, `2018_student_count`, `2018_testers_count`, `2018_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2018)[names(temp2018) == '2018_student_count'] <- 'student_count'
names(temp2018)[names(temp2018) == '2018_testers_count'] <- 'testers_count'
names(temp2018)[names(temp2018) == '2018_offers_count'] <- 'offers_count'


temp2019 <- merge(temp1[temp1$year == 2019, ], subset(school_offers, select = c(`dbn`, `2019_student_count`, `2019_testers_count`, `2019_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2019)[names(temp2019) == '2019_student_count'] <- 'student_count'
names(temp2019)[names(temp2019) == '2019_testers_count'] <- 'testers_count'
names(temp2019)[names(temp2019) == '2019_offers_count'] <- 'offers_count'

temp2020 <- merge(temp1[temp1$year == 2020, ], subset(school_offers, select = c(`dbn`, `2020_student_count`, `2020_testers_count`, `2020_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2020)[names(temp2020) == '2020_student_count'] <- 'student_count'
names(temp2020)[names(temp2020) == '2020_testers_count'] <- 'testers_count'
names(temp2020)[names(temp2020) == '2020_offers_count'] <- 'offers_count'

temp2021 <- merge(temp1[temp1$year == 2021, ], subset(school_offers, select = c(`dbn`, `2021_student_count`, `2021_testers_count`, `2021_offers_count`)), by.x = 'schooldbn', by.y='dbn')

names(temp2021)[names(temp2021) == '2021_student_count'] <- 'student_count'
names(temp2021)[names(temp2021) == '2021_testers_count'] <- 'testers_count'
names(temp2021)[names(temp2021) == '2021_offers_count'] <- 'offers_count'


# Create model data and concatenate the temp data frames
model_data <- rbind(temp2018, temp2019)
model_data <- rbind(model_data, temp2020)
model_data <- rbind(model_data, temp2021)


# Create rate_offers: number of offers / number of testers
model_data$rate_offers <- model_data$offers_count / model_data$testers_count
model_data$rate_offers[is.na(model_data$rate_offers)] <- 0

# Temp variables for schools
schools2018 <- model_data[model_data$year == 2018, ]
schools2019 <- model_data[model_data$year == 2019, ]
schools2020 <- model_data[model_data$year == 2020, ]
schools2021 <- model_data[model_data$year == 2021, ]

Years 2018 to 2021

The following tabs represent a visual analysis of the data for the years 2018 to 2021. The top 25 middle schools with the highest rate of offers to specialized high schools for each year is also listed. Notable schools such as the Christa McAuliffe School would consistently be at the top for high performing middle schools.

2018

print(dfSummary(schools2018, graph.magnif=0.75), method="render")

Data Frame Summary

schools2018

Dimensions: 207 x 16
Duplicates: 0

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 05M123

[ 197 others ]

1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
197	(	95.2%	)

207 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 17.5 (8.6)

min ≤ med ≤ max:

1 ≤ 20 ≤ 31

IQR (CV) : 13 (0.5)

16 distinct values

207 (100.0%)

0 (0.0%)

3

name [character]

1. Academy for College Prepa

2. Academy of Applied Mathem

3. Academy of Arts and Lette

4. Academy of Medical Techno

5. ACCION Academy

6. Adrien Block (I.S. 25)

7. Alfred De B. Mason (P.S./

8. America's School of Heroe

9. Amistad Dual Language Sch

10. Anning S. Prall (I.S. 27)

[ 197 others ]

1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
1	(	0.5%	)
197	(	95.2%	)

207 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2018

:

207

(

100.0%

)

207 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 91.3 (10.6)

min ≤ med ≤ max:

28 ≤ 95 ≤ 100

IQR (CV) : 9 (0.1)

33 distinct values

207 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Accessible

2. Not Accessible

120	(	58.0%	)
87	(	42.0%	)

207 (100.0%)

0 (0.0%)

7

elaprof [numeric]

Mean (sd) : 35.2 (20)

min ≤ med ≤ max:

4 ≤ 30 ≤ 95

IQR (CV) : 28.5 (0.6)

73 distinct values

207 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 32.3 (23.4)

min ≤ med ≤ max:

2 ≤ 26 ≤ 97

IQR (CV) : 36 (0.7)

79 distinct values

207 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 477.7 (451.1)

min ≤ med ≤ max:

38 ≤ 278 ≤ 2175

IQR (CV) : 291.5 (0.9)

172 distinct values

207 (100.0%)

0 (0.0%)

10

surveysafety [numeric]

Mean (sd) : 84.1 (8.5)

min ≤ med ≤ max:

59 ≤ 85 ≤ 100

IQR (CV) : 10 (0.1)

40 distinct values

207 (100.0%)

0 (0.0%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	186	(	89.9%	)
1	:	21	(	10.1%	)

207 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

Min : 0

Mean : 0

Max : 1

0	:	204	(	98.6%	)
1	:	3	(	1.4%	)

207 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 160.7 (152.9)

min ≤ med ≤ max:

15 ≤ 93 ≤ 769

IQR (CV) : 86.5 (1)

130 distinct values

207 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 55.6 (77.5)

min ≤ med ≤ max:

0 ≤ 22 ≤ 394

IQR (CV) : 59.5 (1.4)

92 distinct values

207 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 11.3 (28.5)

min ≤ med ≤ max:

0 ≤ 0 ≤ 205

IQR (CV) : 8 (2.5)

39 distinct values

207 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0.1 (1.9)

62 distinct values

207 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04

ggplot( head(arrange(schools2018, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2018, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

2019

print(dfSummary(schools2019, graph.magnif=0.75), method="render")

Data Frame Summary

schools2019

Dimensions: 455 x 16
Duplicates: 0

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 02M104

[ 445 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
445	(	97.8%	)

455 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 16 (9.4)

min ≤ med ≤ max:

1 ≤ 15 ≤ 32

IQR (CV) : 16.5 (0.6)

32 distinct values

455 (100.0%)

0 (0.0%)

3

name [character]

1. Daniel Carter Beard (J.H

2. J.H.S. 74 Nathaniel Hawt

3. 47 - PS 347 The American

4. Academy for College Prepa

5. Academy for New Americans

6. Academy For Personal Lead

7. Academy for Young Writers

8. Academy of Applied Mathem

9. Academy of Arts and Lette

10. Academy of Medical Techno

[ 445 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
445	(	97.8%	)

455 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2019

:

455

(

100.0%

)

455 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 90.7 (9)

min ≤ med ≤ max:

40 ≤ 93 ≤ 100

IQR (CV) : 11 (0.1)

41 distinct values

455 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Accessible

2. Not Accessible

3. Accessible

4. Fully Accessible

5. Not Accessible

6. Partially Accessible

1	(	0.2%	)
1	(	0.2%	)
4	(	0.9%	)
89	(	19.6%	)
194	(	42.7%	)
165	(	36.3%	)

454 (99.8%)

1 (0.2%)

7

elaprof [numeric]

Mean (sd) : 37.7 (20.9)

min ≤ med ≤ max:

1 ≤ 32 ≤ 100

IQR (CV) : 28 (0.6)

88 distinct values

455 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 31 (23.6)

min ≤ med ≤ max:

0 ≤ 24 ≤ 100

IQR (CV) : 33 (0.8)

96 distinct values

455 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 627.7 (397)

min ≤ med ≤ max:

70 ≤ 525.5 ≤ 2238

IQR (CV) : 454 (0.6)

362 distinct values

454 (99.8%)

1 (0.2%)

10

surveysafety [numeric]

Mean (sd) : 83.4 (9.2)

min ≤ med ≤ max:

19 ≤ 85 ≤ 100

IQR (CV) : 12 (0.1)

44 distinct values

455 (100.0%)

0 (0.0%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	398	(	87.5%	)
1	:	57	(	12.5%	)

455 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

Min : 0

Mean : 0

Max : 1

0	:	448	(	98.5%	)
1	:	7	(	1.5%	)

455 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 148.3 (130.6)

min ≤ med ≤ max:

11 ≤ 100 ≤ 773

IQR (CV) : 97.5 (0.9)

217 distinct values

455 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 47.8 (64.1)

min ≤ med ≤ max:

0 ≤ 25 ≤ 352

IQR (CV) : 41.5 (1.3)

131 distinct values

455 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 8.2 (23.3)

min ≤ med ≤ max:

0 ≤ 0 ≤ 195

IQR (CV) : 6 (2.8)

51 distinct values

455 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0.1 (2.2)

111 distinct values

455 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04

ggplot( head(arrange(schools2019, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2019, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

2020

print(dfSummary(schools2020, graph.magnif=0.75), method="render")

Data Frame Summary

schools2020

Dimensions: 456 x 16
Duplicates: 0

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 02M104

[ 446 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
446	(	97.8%	)

456 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 15.9 (9.4)

min ≤ med ≤ max:

1 ≤ 15 ≤ 32

IQR (CV) : 16 (0.6)

32 distinct values

456 (100.0%)

0 (0.0%)

3

name [character]

1. Academy for College Prepa

2. Academy For Personal Lead

3. Academy for Young Writers

4. Academy of Applied Mathem

5. Academy of Arts and Lette

6. Academy of Medical Techno

7. Academy of Public Relatio

8. Accion Academy (12X341)

9. Albert Shanker School for

10. All City Leadership Secon

[ 446 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
446	(	97.8%	)

456 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2020

:

456

(

100.0%

)

456 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 91.4 (9)

min ≤ med ≤ max:

27 ≤ 94 ≤ 100

IQR (CV) : 10 (0.1)

41 distinct values

456 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Fully Accessible

2. Not Accessible

3. Partially Accessible

115	(	25.2%	)
185	(	40.6%	)
156	(	34.2%	)

456 (100.0%)

0 (0.0%)

7

elaprof [numeric]

Mean (sd) : 43.7 (21.1)

min ≤ med ≤ max:

7 ≤ 39 ≤ 100

IQR (CV) : 30 (0.5)

84 distinct values

456 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 35.4 (23.7)

min ≤ med ≤ max:

2 ≤ 28 ≤ 100

IQR (CV) : 35 (0.7)

91 distinct values

456 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 634.4 (396.9)

min ≤ med ≤ max:

94 ≤ 534.5 ≤ 2251

IQR (CV) : 440.2 (0.6)

366 distinct values

456 (100.0%)

0 (0.0%)

10

surveysafety [numeric]

Mean (sd) : 83.2 (8.2)

min ≤ med ≤ max:

58 ≤ 84 ≤ 100

IQR (CV) : 10.2 (0.1)

42 distinct values

456 (100.0%)

0 (0.0%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	394	(	86.4%	)
1	:	62	(	13.6%	)

456 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

1 distinct value

0

:

456

(

100.0%

)

456 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 145.1 (130.9)

min ≤ med ≤ max:

11 ≤ 96 ≤ 772

IQR (CV) : 89.2 (0.9)

210 distinct values

456 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 47.4 (65.2)

min ≤ med ≤ max:

0 ≤ 23.5 ≤ 369

IQR (CV) : 45 (1.4)

133 distinct values

456 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 7.2 (21.2)

min ≤ med ≤ max:

0 ≤ 0 ≤ 214

IQR (CV) : 0 (2.9)

48 distinct values

456 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0 (2.3)

104 distinct values

456 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04

ggplot( head(arrange(schools2020, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2020, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

2021

print(dfSummary(schools2021, graph.magnif=0.75), method="render")

Data Frame Summary

schools2021

Dimensions: 458 x 16
Duplicates: 0

No

Variable

Stats / Values

Freqs (% of Valid)

Graph

Valid

Missing

1

schooldbn [character]

1. 01M034

2. 01M140

3. 01M184

4. 01M188

5. 01M332

6. 01M378

7. 01M450

8. 01M539

9. 01M839

10. 02M104

[ 448 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
448	(	97.8%	)

458 (100.0%)

0 (0.0%)

2

district [numeric]

Mean (sd) : 16 (9.4)

min ≤ med ≤ max:

1 ≤ 15 ≤ 32

IQR (CV) : 16 (0.6)

32 distinct values

458 (100.0%)

0 (0.0%)

3

name [character]

1. Academy for College Prepa

2. Academy for New Americans

3. Academy For Personal Lead

4. Academy for Young Writers

5. Academy of Applied Mathem

6. Academy of Arts and Lette

7. Academy of Medical Techno

8. Accion Academy

9. Albert Shanker School for

10. All City Leadership Secon

[ 448 others ]

1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
1	(	0.2%	)
448	(	97.8%	)

458 (100.0%)

0 (0.0%)

4

year [numeric]

1 distinct value

2021

:

458

(

100.0%

)

458 (100.0%)

0 (0.0%)

5

coursepassrate [numeric]

Mean (sd) : 91.7 (8.8)

min ≤ med ≤ max:

39 ≤ 94 ≤ 100

IQR (CV) : 9.8 (0.1)

41 distinct values

458 (100.0%)

0 (0.0%)

6

accessibility [character]

1. Fully Accessible

2. Not Accessible

3. Partially Accessible

124	(	27.1%	)
184	(	40.2%	)
150	(	32.8%	)

458 (100.0%)

0 (0.0%)

7

elaprof [numeric]

Mean (sd) : 43.4 (21)

min ≤ med ≤ max:

2 ≤ 38 ≤ 100

IQR (CV) : 31.5 (0.5)

87 distinct values

458 (100.0%)

0 (0.0%)

8

mathprof [numeric]

Mean (sd) : 38.6 (23.4)

min ≤ med ≤ max:

1 ≤ 33.5 ≤ 100

IQR (CV) : 36 (0.6)

93 distinct values

458 (100.0%)

0 (0.0%)

9

totalstudents [numeric]

Mean (sd) : 625.5 (397.7)

min ≤ med ≤ max:

85 ≤ 524.5 ≤ 2265

IQR (CV) : 449 (0.6)

380 distinct values

458 (100.0%)

0 (0.0%)

10

surveysafety [numeric]

Mean (sd) : 82.9 (8.2)

min ≤ med ≤ max:

57 ≤ 84 ≤ 100

IQR (CV) : 11 (0.1)

41 distinct values

457 (99.8%)

1 (0.2%)

11

dual_program [numeric]

Min : 0

Mean : 0.1

Max : 1

0	:	399	(	87.1%	)
1	:	59	(	12.9%	)

458 (100.0%)

0 (0.0%)

12

shsat_prep [numeric]

Min : 0

Mean : 0.3

Max : 1

0	:	324	(	70.7%	)
1	:	134	(	29.3%	)

458 (100.0%)

0 (0.0%)

13

student_count [numeric]

Mean (sd) : 139.4 (127.7)

min ≤ med ≤ max:

12 ≤ 95 ≤ 764

IQR (CV) : 82.8 (0.9)

209 distinct values

458 (100.0%)

0 (0.0%)

14

testers_count [numeric]

Mean (sd) : 41.4 (59)

min ≤ med ≤ max:

0 ≤ 20 ≤ 374

IQR (CV) : 36.8 (1.4)

121 distinct values

458 (100.0%)

0 (0.0%)

15

offers_count [numeric]

Mean (sd) : 7.3 (21.6)

min ≤ med ≤ max:

0 ≤ 0 ≤ 221

IQR (CV) : 0 (3)

49 distinct values

458 (100.0%)

0 (0.0%)

16

rate_offers [numeric]

Mean (sd) : 0.1 (0.1)

min ≤ med ≤ max:

0 ≤ 0 ≤ 0.8

IQR (CV) : 0 (2.3)

98 distinct values

458 (100.0%)

0 (0.0%)

Generated by summarytools 1.0.0 (R version 4.1.3)
2022-05-04

ggplot( head(arrange(schools2021, desc(rate_offers)), n = 25), aes(x= reorder(name, rate_offers), y=rate_offers) ) + 
  geom_bar(stat = "identity", fill="#0033a1") +
  coord_flip() +
  theme_bw()

p1 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=coursepassrate, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p2 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=elaprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p3 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=mathprof, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p4 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=surveysafety, y=rate_offers)) + geom_point(alpha=0.5) + geom_jitter() 
p5 <- ggplot(data=model_data[model_data$year == 2021, ], aes(x=accessibility, y=rate_offers)) + geom_boxplot() + geom_boxplot(outlier.colour = "red")

multiplot(p1, p2, p3, p4, p5, cols=3)

## Warning: Removed 1 rows containing missing values (geom_point).
## Removed 1 rows containing missing values (geom_point).

# Cor plots
ggcorr(schools2021[, c(5,7,8,9,10,11,12,13,14,15,16)], label = T, hjust= 0.9)

Experimentation & Results

Decision Tree

The following creates training and testing partitions for the data and constructs a decision tree. From the correlation plot, I decided to use offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate.

Building the Tree

set.seed(422)
data_split <- initial_split(model_data, prop = 0.8, strata = district)
schools_train <- training(data_split)
schools_test <- testing(data_split)

model_spec <- decision_tree() %>%
  set_mode("regression") %>%
  set_engine("rpart")

model_spec

## Decision Tree Model Specification (regression)
## 
## Computational engine: rpart

# Train the model
model <- model_spec %>%
  fit(offers_count ~ testers_count + student_count + elaprof + mathprof + coursepassrate,
      data = schools_train)

# Information about the model
model

## parsnip model object
## 
## n= 1260 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 1260 637081.600   8.1722220  
##    2) testers_count< 198.5 1195 180419.700   4.4702930  
##      4) testers_count< 100.5 1089  42853.570   2.1698810  
##        8) mathprof< 64.5 990  11025.070   0.9050505 *
##        9) mathprof>=64.5 99  14406.730  14.8181800 *
##      5) testers_count>=100.5 106  72597.860  28.1037700  
##       10) mathprof< 88.5 94  25154.000  21.0000000  
##         20) mathprof< 53.5 56   3919.982  12.5178600 *
##         21) mathprof>=53.5 38  11267.500  33.5000000 *
##       11) mathprof>=88.5 12   5542.250  83.7500000 *
##    3) testers_count>=198.5 65 139207.500  76.2307700  
##      6) elaprof< 82.5 53  44108.750  60.2830200  
##       12) mathprof< 62.5 20   7415.200  33.2000000 *
##       13) mathprof>=62.5 33  13132.970  76.6969700 *
##      7) elaprof>=82.5 12  22084.670 146.6667000 *

# A plot of the model
model$fit  %>% rpart.plot(type = 4, roundint=FALSE)

Performance of the Tree

After building the decision tree model based upon the variables, we will take the mean square error.

# Declare Mean Absolute Error function
MAE <- function(actual, predicted){
  mean(abs(actual - predicted))
}


# Generate the predictions using the test data
predictions <- predict(model, new_data = schools_test) 

# Truncate the preductions
predictions$.pred <- trunc(predictions$.pred)

# Calculate the mean absolute error
mae_decisiontree_model <- MAE(schools_test$offers_count, predictions$.pred)

The mean absolute error of the decision tree is 3.2025316.

First Model

This first model looks to build a traditional regression using the same variables as the decision tree.

first_model <- lm(offers_count ~ coursepassrate + elaprof + mathprof + surveysafety, data=schools_train)
summary(first_model)

## 
## Call:
## lm(formula = offers_count ~ coursepassrate + elaprof + mathprof + 
##     surveysafety, data = schools_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.759  -8.576  -0.876   4.711 175.620 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    13.31048    7.22798   1.842   0.0658 .  
## coursepassrate -0.01141    0.06435  -0.177   0.8593    
## elaprof        -0.15438    0.07322  -2.108   0.0352 *  
## mathprof        0.70883    0.06593  10.751  < 2e-16 ***
## surveysafety   -0.27439    0.06365  -4.311 1.75e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.33 on 1254 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3386, Adjusted R-squared:  0.3365 
## F-statistic: 160.5 on 4 and 1254 DF,  p-value: < 2.2e-16

# To see the performance metrics in an orderly data frame
glance(first_model)

# To see the R-squared
wrapFTest(first_model)

## [1] "F Test summary: (R2=0.3386, F(4,1254)=160.5, p<1e-05)."

## Evaluation R-2
firstmodelr_2 <- summary(first_model)$r.squared


## Use the model on the testing data

## Create the prediction column based on the model
schools_test$p1 <- predict(first_model, newdata=schools_test)

## Plot to compare the predictions to actual prediction on the x axis
ggplot(schools_test, aes(x=p1, y=offers_count)) + geom_point() + geom_abline(color="blue")

# Calculate the RMSE
first_rmse <- schools_test %>% 
  mutate(residual = p1 - offers_count) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

# Get R-squared from glance and print it
(rsq_glance <- glance(first_model)$r.squared)

## [1] 0.3385638

The first model’s RMSE is

Second Model

This second model uses quasipoisson for the regression because the variance is greater than the mean for offers_count. Because this is a quasipoisson, an AIC score cannot be generated. However, a RMSE score is created.

# Quasipoisson variance is greater than the mean for offers_count
glm_second_model <- glm(offers_count  ~ testers_count + coursepassrate + elaprof + mathprof + surveysafety, data=schools_train, family=quasipoisson)

summary(glm_second_model)

## 
## Call:
## glm(formula = offers_count ~ testers_count + coursepassrate + 
##     elaprof + mathprof + surveysafety, family = quasipoisson, 
##     data = schools_train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -14.6335   -1.6865   -1.0574   -0.5426   11.0786  
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -5.0429990  1.1490305  -4.389 1.23e-05 ***
## testers_count   0.0076113  0.0002519  30.220  < 2e-16 ***
## coursepassrate  0.0622752  0.0122191   5.097 3.99e-07 ***
## elaprof        -0.0178286  0.0048805  -3.653  0.00027 ***
## mathprof        0.0573938  0.0046112  12.447  < 2e-16 ***
## surveysafety   -0.0212445  0.0040421  -5.256 1.73e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 7.289936)
## 
##     Null deviance: 37110.5  on 1258  degrees of freedom
## Residual deviance:  6561.6  on 1253  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

schools_test$p2 <- predict(glm_second_model, newdata=schools_test)


# Calculate the RMSE
second_rmse <- schools_test %>% 
  mutate(residual = p2 - offers_count) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p2, y = offers_count)) +
  geom_point() + 
  geom_abline(color = "darkblue")

The second model’s RMSE is

Third Model

For this third model, I will use a zero-inflated model. Firstly, a test is done to see if one is needed. Secondly, if one is needed, then a third model is created.

# Zero Inflation test for glm_second_model
check_zeroinflation(glm_second_model)

## # Check for zero-inflation
## 
##    Observed zeros: 934
##   Predicted zeros: 408
##             Ratio: 0.44

## Model is underfitting zeros (probable zero-inflation).

zip_model <- zeroinfl(offers_count  ~ testers_count + coursepassrate + elaprof + mathprof + surveysafety, data = schools_train)


schools_test$p3 <- predict(zip_model, newdata=schools_test)


# Calculate the RMSE
third_rmse <- schools_test %>% 
  mutate(residual = p2 - offers_count) %>%
  summarize(rmse  = sqrt(mean(residual^2)))

# Plot predictions vs cnt (pred on x-axis)
ggplot(schools_test, aes(x = p2, y = offers_count)) +
  geom_point() + 
  geom_abline(color = "darkblue")

summary(zip_model)

## 
## Call:
## zeroinfl(formula = offers_count ~ testers_count + coursepassrate + elaprof + 
##     mathprof + surveysafety, data = schools_train)
## 
## Pearson residuals:
##      Min       1Q   Median       3Q      Max 
## -8.39681 -0.22161 -0.07272 -0.03212  8.63169 
## 
## Count model coefficients (poisson with log link):
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     9.720e-01  3.499e-01   2.778 0.005474 ** 
## testers_count   6.055e-03  9.574e-05  63.240  < 2e-16 ***
## coursepassrate  6.610e-04  3.603e-03   0.183 0.854430    
## elaprof        -7.009e-03  1.871e-03  -3.747 0.000179 ***
## mathprof        3.404e-02  1.749e-03  19.461  < 2e-16 ***
## surveysafety   -3.903e-03  1.600e-03  -2.440 0.014705 *  
## 
## Zero-inflation model coefficients (binomial with logit link):
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     8.128631   2.398116   3.390   0.0007 ***
## testers_count  -0.052219   0.004471 -11.678  < 2e-16 ***
## coursepassrate -0.030935   0.023298  -1.328   0.1843    
## elaprof        -0.009621   0.017457  -0.551   0.5816    
## mathprof       -0.075576   0.016377  -4.615 3.94e-06 ***
## surveysafety    0.029381   0.017472   1.682   0.0926 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 17 
## Log-likelihood: -1897 on 12 Df

The third model’s RMSE is

Results

Though the first model has the lowest RMSE, the zero-inflated model has a better AIC measure.

AIC(first_model, glm_second_model, zip_model)

With this and the importance from the summary for the model, the variables elaprof and mathprof. According to the data:

elaprof: Percent of students proficient in ELA
mathprof: Percent of students proficient in math

Summary

Because of the strong correlations between proficiency in ELA and math, I would recommend schools work to improve these proficiencies to have a better success rate for obtaining offers from Specialized High Schools (including raising awareness of the examinations and encouraging students to take them). Overall, I selected a middle school with these high variables close to my neighborhood.

Additional research can be done to look at demographics and economic data, but overall, to improve proficiencies in ELA and mathematics seem to be a viable way to go for preparing students and middle schools

Data 622 Homework 4

Sung Lee

2022-05-04

Assignment

Problem

Datasets

Reasoning

Methodology

Exploratory Data Analysis

Years 2018 to 2021

2018

Data Frame Summary

schools2018

2019

Data Frame Summary

schools2019

2020

Data Frame Summary

schools2020

2021

Data Frame Summary

schools2021

Experimentation & Results

Decision Tree

Building the Tree

Performance of the Tree

First Model

Second Model

Third Model

Results

Summary