PASSNYC: Data Science for Good

The overall goal of this notebook is to increase the diversity in Specialized High Schools.
There are three ways to increase the number of students of historically underrepresented demographics in Specialized High Schools.
1. Increase in SHSAT registrations.
2. Increase in actual SHSAT participation. (registering for the test and actually taking it are two different things)
3. Increase in SHS admission offers.

To achieve each of the aforementioned goals, PASSNYC and its partners have to provide outreach services that will improve the chances of students taking the SHSAT and receiving placements in these specialized high schools. Since, non-profits like PASSNYC have limited resources, a data driven solution to accurately identify the affected schools is necessary. Therefore, the schools have to be assigned some sort of ranking based on the likelihood of converting maximum number of students into test takers. This will ensure optimum utilisation of resources.

Now, the question is, how to identify such schools?

Proxies that have been good indicators of these types of schools include data on -
* English Language Learners
* Students with Disabilities
* Students on Free/Reduced Lunch
* Students with Temporary Housing

Based on the kind of problems the school faces, PASSNYC will commence services like after school programs, test preparation, mentoring, or resources for parents.

We’re going to use 3 additional datasets.
* Demographic Snapshot of Schools, 2013-2018
* School Safety Report, 2010-2016
* Admission Offers, 2017-2018

The analysis process will be as follows:
1. Explore the School demographics and geography.
2. Find schools with less number of registrations.
3. Find schools with less number of test takers.
4. Identify the schools with less number of admissions.
5. Suggest schools who need the most help in each of the above 3 scenarios.

Import Libraries

library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(reshape2)
library(RColorBrewer)

library(ggmap)
library(leaflet)
library(htmltools)
library(httr)
library(plotly)
library(GGally)
library(factoextra)
library(NbClust)

# library(naniar)
# library(mice)
# library(rgdal)
# library(geojsonio)

Import data

school_explorer <- read.csv("data/2016 School Explorer.csv", sep=",", header=TRUE)
school_explorer <- tbl_df(school_explorer)
school_explorer_og <- tbl_df(school_explorer)


nyc_shsat_results <- read.csv("data/2017-2018_SHSAT_Admissions_Test_Offers_By_Sending_School.csv", sep=",", header = TRUE)
nyc_shsat_results <- tbl_df(nyc_shsat_results)

d5_shsat <- read.csv("data/D5 SHSAT Registrations and Testers.csv", sep=",", header = TRUE)
d5_shsat <- tbl_df(d5_shsat)

school_dem <- read.csv("data/2013_-_2018_Demographic_Snapshot_School.csv", sep=",", header = TRUE)
school_dem <- tbl_df(school_dem)

school_safety <- read.csv("data/2010-2016-school-safety-report.csv", sep=",", header = TRUE)
school_safety <- tbl_df(school_safety)

Pre-Processing

Pre-Process the school_explorer data frame.

# Subset the columns
school_explorer <- subset(school_explorer, select = -c(Adjusted.Grade, New., Other.Location.Code.in.LCGMS, Address..Full., SED.Code, Grades, Percent.Black...Hispanic))
school_explorer <- select(school_explorer, -contains("Grade"))
school_explorer <- select(school_explorer, -contains("Rating"))


# Convert currency to numbers.
toCurrency <- function(curr) {
    curr <- as.character(curr)
    curr <- gsub("[$,]", "", curr)
    result <- as.numeric(curr)
    return(result)
}

school_explorer$School.Income.Estimate <- toCurrency(school_explorer$School.Income.Estimate)


# Convert percentages to a number divided by 100.
percent_columns <- c("Percent.ELL", "Percent.Asian", "Percent.Black", "Percent.Hispanic", "Percent.White", "Student.Attendance.Rate", "Percent.of.Students.Chronically.Absent", "Rigorous.Instruction..", "Collaborative.Teachers..", "Supportive.Environment..", "Effective.School.Leadership..", "Strong.Family.Community.Ties..", "Trust..")

school_explorer[percent_columns] <- apply(school_explorer[percent_columns], 2, function(y) as.numeric(gsub("%", "", y))/100 )


# Convert to numeric type
school_explorer$Economic.Need.Index = as.numeric(as.character(school_explorer$Economic.Need.Index))
school_explorer$Average.ELA.Proficiency = as.numeric(as.character(school_explorer$Average.ELA.Proficiency))
school_explorer$Average.Math.Proficiency = as.numeric(as.character(school_explorer$Average.Math.Proficiency))
school_explorer$Location.Code = as.character(school_explorer$Location.Code)

head(school_explorer)

Pre-Process the school_dem data frame.

p_cols <- c("X..Female.1", "X..Male.1", "X..Students.with.Disabilities.1", "X..Poverty.1")

school_dem_mfdb <- school_dem %>%
                    filter(Year == "2015-16") %>%
                    select(DBN, X..Female.1, X..Male.1, X..Students.with.Disabilities.1, X..Poverty.1)

school_dem_mfdb[p_cols] <- apply(school_dem_mfdb[p_cols], 2, function(y) as.numeric(gsub("%", "", y))/100 )

Join the school_exp and school_dem data frames.

school_exp_dem <- inner_join(school_explorer, school_dem_mfdb, by = c("Location.Code" = "DBN"))
school_exp_dem <- rename(school_exp_dem, Percent.Female = X..Female.1, Percent.Male = X..Male.1, Percent.Disability = X..Students.with.Disabilities.1 , Percent.Poverty = X..Poverty.1)

head(school_exp_dem)

Pre-Process the school_safety data.

# Select Crime stats attributes from the df
ss <- school_safety %>%
        filter(School.Year == "2015-16") %>%
        select(DBN, AvgOfMajor.N, AvgOfOth.N, AvgOfNoCrim.N, AvgOfProp.N, AvgOfVio.N)

ss$AvgOfMajor.N[ss$AvgOfMajor.N == "N/A"] <- NA
ss$AvgOfOth.N[ss$AvgOfOth.N == "N/A"] <- NA
ss$AvgOfNoCrim.N[ss$AvgOfNoCrim.N == "N/A"] <- NA
ss$AvgOfProp.N[ss$AvgOfProp.N == "N/A"] <- NA
ss$AvgOfVio.N[ss$AvgOfVio.N == "N/A"] <- NA

dbn <- ss$DBN
ss <- apply(ss[,2:6], 2, function(x) as.numeric(x))
ss <- as.data.frame(ss)
ss$DBN <- dbn

# Fill NA values for Crime Stats
# NA values in the crime stats will imputed by a random number generated between the `(mean-standard deviation) and (mean + standard deviation)`.



mean_AvgOfMajor.N <- as.numeric(mean(ss[, "AvgOfMajor.N"], na.rm=TRUE))
sd_AvgOfMajor.N <- as.numeric(sd(ss[, "AvgOfMajor.N"], na.rm=TRUE))
count_nan_AvgOfMajor.N <- sum(is.na(ss[, "AvgOfMajor.N"]))
random_AvgOfMajor.N <- sample((mean_AvgOfMajor.N - sd_AvgOfMajor.N):(mean_AvgOfMajor.N + sd_AvgOfMajor.N), count_nan_AvgOfMajor.N, replace = TRUE)

mean_AvgOfOth.N <- as.numeric(mean(ss[,"AvgOfOth.N"], na.rm=TRUE))
sd_AvgOfOth.N <- as.numeric(sd(ss[,"AvgOfOth.N"], na.rm=TRUE))
count_nan_AvgOfOth.N <- sum(is.na(ss[,"AvgOfOth.N"]))
random_AvgOfOth.N <- sample((mean_AvgOfOth.N - sd_AvgOfOth.N):(mean_AvgOfOth.N + sd_AvgOfOth.N), count_nan_AvgOfOth.N, replace = TRUE)

mean_AvgOfNoCrim.N <- as.numeric(mean(ss[,"AvgOfNoCrim.N"], na.rm=TRUE))
sd_AvgOfNoCrim.N <- as.numeric(sd(ss[,"AvgOfNoCrim.N"], na.rm=TRUE))
count_nan_AvgOfNoCrim.N <- sum(is.na(ss[, "AvgOfNoCrim.N"]))
random_AvgOfNoCrim.N <- sample((mean_AvgOfNoCrim.N - sd_AvgOfNoCrim.N):(mean_AvgOfNoCrim.N + sd_AvgOfNoCrim.N), count_nan_AvgOfNoCrim.N, replace = TRUE)

mean_AvgOfProp.N <- as.numeric(mean(ss[,"AvgOfProp.N"], na.rm=TRUE))
sd_AvgOfProp.N <- as.numeric(sd(ss[,"AvgOfProp.N"], na.rm=TRUE))
count_nan_AvgOfProp.N <- sum(is.na(ss[,"AvgOfProp.N"]))
random_AvgOfProp.N <- sample((mean_AvgOfProp.N - sd_AvgOfProp.N):(mean_AvgOfProp.N + sd_AvgOfProp.N), count_nan_AvgOfProp.N, replace = TRUE)

mean_AvgOfVio.N <- as.numeric(mean(ss[,"AvgOfVio.N"], na.rm=TRUE))
sd_AvgOfVio.N <- as.numeric(sd(ss[,"AvgOfVio.N"], na.rm=TRUE))
count_nan_AvgOfVio.N <- sum(is.na(ss[,"AvgOfVio.N"]))
random_AvgOfVio.N <- sample((mean_AvgOfVio.N - sd_AvgOfVio.N):(mean_AvgOfVio.N + sd_AvgOfVio.N), count_nan_AvgOfVio.N, replace = TRUE)


# Replace NA with the random integers generated above
ss <- mutate(ss, 
               AvgOfMajor.N=ifelse(is.na(AvgOfMajor.N), random_AvgOfMajor.N, AvgOfMajor.N), 
               AvgOfNoCrim.N=ifelse(is.na(AvgOfNoCrim.N), random_AvgOfNoCrim.N, AvgOfNoCrim.N), 
               AvgOfOth.N=ifelse(is.na(AvgOfOth.N), random_AvgOfOth.N, AvgOfOth.N), 
               AvgOfProp.N=ifelse(is.na(AvgOfProp.N), random_AvgOfProp.N, AvgOfProp.N), 
               AvgOfVio.N=ifelse(is.na(AvgOfVio.N), random_AvgOfVio.N, AvgOfVio.N))

Join the school_exp_dem and ss data frames.

school_exp_dem_safety <- left_join(school_exp_dem, ss, by = c("Location.Code" = "DBN"))

head(school_exp_dem_safety)

Pre-Process the nyc_shsat_results data.

# Replace "0-5" values with NA.
nyc_shsat_results <- nyc_shsat_results %>%
                        mutate(Count.of.Testers = as.character(Count.of.Testers), Count.of.Offers = as.character(Count.of.Offers), Feeder.School.DBN = as.character(Feeder.School.DBN))


nyc_shsat_results$Count.of.Testers[nyc_shsat_results$Count.of.Testers == "0-5"] <- NA
nyc_shsat_results$Count.of.Offers[nyc_shsat_results$Count.of.Offers == "0-5"] <- NA

# We will replace the "0-5" factor in the Count.Of.Testers and Count.Of.Offers column by a random number betweem 0 and 5. 
count_nan_testers <- sum(is.na(nyc_shsat_results$Count.of.Testers))
count_nan_offers <- sum(is.na(nyc_shsat_results$Count.of.Offers))

random_testers <- sample(0:5, count_nan_testers, replace = TRUE)
random_offers <- sample(0:5, count_nan_offers, replace = TRUE)

# Replace NA with the random integers generated above
nyc_shsat_results <- mutate(nyc_shsat_results, 
                            Count.of.Testers=ifelse(is.na(Count.of.Testers), random_testers, Count.of.Testers),
                            Count.of.Offers=ifelse(is.na(Count.of.Offers), random_offers, Count.of.Offers))

Join nyc_shsat_results and school_exp_dem data frames.

school_exp_dem_shsat <- left_join(nyc_shsat_results, school_exp_dem, by=c("Feeder.School.DBN" = "Location.Code"))

school_exp_dem_shsat <- school_exp_dem_shsat %>%
                            mutate(Count.of.Testers = as.numeric(as.character(Count.of.Testers)), Count.of.Offers = as.numeric(as.character(Count.of.Offers)), Count.of.Students.in.HS.Admissions = as.numeric(as.character(Count.of.Students.in.HS.Admissions)))

Join nyc_exp_dem_shsat and ss data frames.

school_exp_dem_shsat_safety <- left_join(school_exp_dem_shsat, ss, by=c("Feeder.School.DBN" = "DBN"))

school_exp_dem_shsat_safety <- school_exp_dem_shsat_safety %>%
                                    select(-c(Feeder.School.Name)) %>%
                                    mutate(Percent.Testers = Count.of.Testers/Count.of.Students.in.HS.Admissions, Percent.Offers = Count.of.Offers/Count.of.Testers)


# Replace the 0/0 = inf values by 0.
school_exp_dem_shsat_safety$Percent.Offers[is.infinite(school_exp_dem_shsat_safety$Percent.Offers)] <- 0  

# Create a new column with the majority race.
school_exp_dem_shsat_safety$Majority.Race <- colnames(school_exp_dem_shsat_safety[c("Percent.Asian", "Percent.Black", "Percent.Hispanic", "Percent.White")])[max.col(school_exp_dem_shsat_safety[c("Percent.Asian", "Percent.Black", "Percent.Hispanic", "Percent.White")], ties.method="first")]


school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.Hispanic", "Hispanic", school_exp_dem_shsat_safety$Majority.Race)
school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.Asian", "Asian", school_exp_dem_shsat_safety$Majority.Race)
school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.Black", "Black", school_exp_dem_shsat_safety$Majority.Race)
school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.White", "White", school_exp_dem_shsat_safety$Majority.Race)

head(school_exp_dem_shsat_safety)

Available data frames:

school_exp_dem
school_exp_dem_safety
school_exp_dem_shsat_safety

Geospatial Exploration of The Schools in New York.

Schools in New York

leaflet(data = school_exp_dem) %>% 
    addProviderTiles(providers$Esri.WorldTopoMap) %>%
    addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(),  label = ~htmlEscape(School.Name))

Distribution of schools based on Ethnicity

school_dem %>%
    select(X..Asian, X..Black, X..Hispanic, X..White) %>%
    melt() %>%
    group_by(variable) %>%
    summarise(total = sum(value)) %>%
    ggplot(aes(x = variable, y = total)) + geom_bar(aes(fill = total), stat="identity")

Hispanics make up the largest school attending population in New York.

Distribution of schools with Black Students

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Black)

leaflet(data = school_exp_dem) %>% 
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Percent.Black, color = ~pal1(Percent.Black), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Black")%>% 
 addLegend("bottomright", pal = pal1, values = ~Percent.Black, title = "Percent.Black", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Distribution of schools with Hispanic Students

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Hispanic)

leaflet(data = school_exp_dem) %>% 
    na.omit() %>%
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Percent.Hispanic, color = ~pal1(Percent.Hispanic), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Hispanic")%>% 
 addLegend("bottomright", pal = pal1, values = ~Percent.Hispanic, title = "Percent.Hispanic", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Distribution of schools with Asian Students

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Asian)

leaflet(data = school_exp_dem) %>% 
    na.omit() %>%
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Percent.Asian, color = ~pal1(Percent.Asian), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Asian")%>% 
 addLegend("bottomright", pal = pal1, values = ~Percent.Asian, title = "Percent.Asian", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Distribution of schools with White Students

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.White)

leaflet(data = school_exp_dem) %>% 
    na.omit() %>%
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Percent.White, color = ~pal1(Percent.White), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.White")%>% 
 addLegend("bottomright", pal = pal1, values = ~Percent.White, title = "Percent.White", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Distribution of schools based on Gender

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Female)

leaflet(data = school_exp_dem) %>% 
    na.omit() %>%
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Percent.Female, color = ~pal1(Percent.Female), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Female")%>% 
 addLegend("bottomright", pal = pal1, values = ~Percent.Female, title = "Percent.Female", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Percent Poverty in Schools of NY

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Poverty)

leaflet(data = school_exp_dem) %>% 
    na.omit() %>%
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Percent.Poverty, color = ~pal1(Percent.Poverty), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Poverty")%>% 
 addLegend("bottomright", pal = pal1, values = ~Percent.Poverty, title = "Percent.Poverty", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Economic Need Index Distribution of Schools in NY

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Economic.Need.Index)

leaflet(data = school_exp_dem) %>% 
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Economic.Need.Index, color = ~pal1(Economic.Need.Index), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Economic.Need.Index")%>% 
 addLegend("bottomright", pal = pal1, values = ~Economic.Need.Index, title = "Economic.Need.Index", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Clusters of Community and Non-Community Schools

# Create a palette that maps factor levels to colors
pal <- colorFactor(c("blue", "red"), domain = c("Yes", "No"))

leaflet(school_exp_dem) %>% 
    addTiles() %>%
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircleMarkers(~Longitude,
                     ~Latitude, 
                     color = ~pal(Community.School.),
                     radius = ~ifelse(Community.School. == "Yes", 2, 1)) %>%
    addLegend(position = "bottomright", pal = pal, values = c("Yes", "No"))

Crime statistics in the City

The distribution of crime stats is pretty much the same as that of the Major Crimes.

Major Crimes

pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem_safety$AvgOfMajor.N)

leaflet(data = school_exp_dem_safety) %>% 
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~AvgOfMajor.N *20, color = ~pal1(AvgOfMajor.N), stroke = TRUE, fillOpacity = 0.5, popup = paste("City:", school_exp_dem_safety$City, "<br>","School:", school_exp_dem_safety$School.Name, "<br>"), group = "AvgOfMajor.N")%>% 
 addLegend("bottomright", pal = pal1, values = ~AvgOfMajor.N, title = "AvgOfMajor.N", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

The crime rate is high regions where Black and Hispanic dominated schools exist.

Correlation Between Attributes

Correlation Heatmap

hm.palette <- colorRampPalette(rev(brewer.pal(5, 'RdYlBu')), space='Lab')

school_exp_dem_shsat_safety %>%
    select(-c(School.Name, Feeder.School.DBN, 
              Count.of.Students.in.HS.Admissions, 
              Count.of.Testers, Count.of.Offers, District, 
              Latitude, Longitude, City, Zip, Community.School., 
              School.Income.Estimate, Majority.Race)) %>%
    na.omit() %>%
    cor() %>%
    melt() %>%
    ggplot(aes(Var1, Var2)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_equal() + theme(axis.text.x=element_text(angle=90, hjust=0)) + ggtitle("Correlation Between Attributes") + geom_text(aes(label = round(value, 1)))

# + geom_text(aes(label = round(value, 1)))

The above heatmap can be interpreted in the following way:

Value 1: Ideal Positive linear correlation
Value 0: No linear correlation
Value -1: Ideal Negative linear correlation

We can aggregate the following variables into a single variable to reduce dimenstionality as they have high a correlation amongst them:

School Characterstics
- Rigorous Instruction
- Collaborative Teachers
- Supportive Environment
- School Leadership
- Strong Family Ties
- Trust
Crime Stats
- Major
- Other
- Non Criminal
- Property
- Violent
Average Academic Performance
- Average Math Proficiency
- Average Math Proficiency

We will take the mean of the above 3 batches of highly correlated values into a single value to reduce dimensionality.

Note that disparity between the White/Asian vs Black/Hispanic students in terms of the above factors. White students have better support from schools than anyone else.

Create 3 new colunms called School.Rating, Crime.Stats, and Avg.Academics.

school_rating <- school_exp_dem_shsat_safety %>%
                    select(c(21:26)) %>%
                    mutate(School.Rating = rowMeans(.[,1:6])) %>%
                    select(School.Rating)

crime_stats <- school_exp_dem_shsat_safety %>%
                    select(c(33:37)) %>%
                    rowwise() %>%
                    mutate(Crime.Stats = mean(c(AvgOfMajor.N, AvgOfOth.N, AvgOfNoCrim.N, AvgOfProp.N,AvgOfVio.N))) %>%
                    select(Crime.Stats)

avg_acad_perf <- school_exp_dem_shsat_safety %>%
                    select(Average.ELA.Proficiency, Average.Math.Proficiency) %>%
                    rowwise() %>%
                    mutate(Avg.Academics = mean(c(Average.ELA.Proficiency, Average.Math.Proficiency))) %>%
                    select(Avg.Academics)

school_exp_dem_shsat_safety <- cbind(school_exp_dem_shsat_safety, data.frame(School.Rating = school_rating))
school_exp_dem_shsat_safety <- cbind(school_exp_dem_shsat_safety, data.frame(Crime.Stats = crime_stats))
school_exp_dem_shsat_safety <- cbind(school_exp_dem_shsat_safety, data.frame(Avg.Academics = avg_acad_perf))


school_exp_dem_shsat_safety <-school_exp_dem_shsat_safety %>% select(-c(21:26), -c(33:37), -c(Average.ELA.Proficiency, Average.Math.Proficiency))

Correlation Between Attributes After Dimension Reduction

hm.palette <- colorRampPalette(rev(brewer.pal(5, 'RdYlBu')), space='Lab')

school_exp_dem_shsat_safety %>%
    select(-c(School.Name, Feeder.School.DBN, 
              Count.of.Students.in.HS.Admissions, 
              Count.of.Testers, Count.of.Offers, District, 
              Latitude, Longitude, City, Zip, Community.School., 
              School.Income.Estimate, Majority.Race, Percent.Poverty)) %>%
    na.omit() %>%
    cor() %>%
    melt() %>%
    ggplot(aes(Var1, Var2)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_equal() + theme(axis.text.x=element_text(angle=90, hjust=0)) + ggtitle("Correlation Between Attributes After Dimension Reduction") + geom_text(aes(label = round(value, 1)))

# + geom_text(aes(label = round(value, 1)))

Percent testers is:

Positively Correlated with:
- Attendance Rate
- Average Academics (Math + ELA)
Negatively Correlated with:
- Economic Need Index
- Chronic Absence
- Percent Disability

Interesting correlations to examine further based on the pearson cofficient > 0.5 or pearson cofficient < -0.5:
Race vs the following:

Economic Need Index
Chronic Absence
Percent ELL
Academics(Avg ELA/Math)
Disability
Percent Testers/Offers

Ethinicity vs Economic Need Index

eni_race <- select(school_exp_dem, Economic.Need.Index, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)

eni_asian <- eni_race %>%
    ggplot(aes(x = Percent.Asian, y = Economic.Need.Index)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : Asian")

eni_black <- eni_race %>%
    ggplot(aes(x = Percent.Black, y = Economic.Need.Index)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : Black")

eni_hispanic <- eni_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Economic.Need.Index)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : Hispanic")

eni_white <- eni_race %>%
    ggplot(aes(x = Percent.White, y = Economic.Need.Index)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : White")


grid.arrange(eni_asian, eni_white, eni_black, eni_hispanic, ncol=2)

Hispanic and Black population have a greater economic need while the White and Asian students are well off.

Ethnicity vs Chronic Absence

absence_race <- select(school_exp_dem, Percent.of.Students.Chronically.Absent, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)

absence_asian <- absence_race %>%
    ggplot(aes(x = Percent.Asian, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : Asian")

absence_black <- absence_race %>%
    ggplot(aes(x = Percent.Black, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : Black")

absence_hispanic <- absence_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : Hispanic")

absence_white <- absence_race %>%
    ggplot(aes(x = Percent.White, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : White")


grid.arrange(absence_asian, absence_white, absence_black, absence_hispanic, ncol=2)

Chronic Absence is more amongst black and hispanic dominated schools as compared to the Hispanic and White dominated schools.

Ethnicity vs Percent English Language Learners

ell_race <- select(school_exp_dem_shsat_safety, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)

ell_asian <- ell_race %>%
    ggplot(aes(x = Percent.Asian, y = Percent.ELL)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : Asian")

ell_black <- ell_race %>%
    ggplot(aes(x = Percent.Black, y = Percent.ELL)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : Black")

ell_hispanic <- ell_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Percent.ELL)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : Hispanic")

ell_white <- ell_race %>%
    ggplot(aes(x = Percent.White, y = Percent.ELL)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : White")


grid.arrange(ell_asian, ell_white, ell_hispanic, ell_black, ncol=2)

White/Asian dominated schools have a more supportive environment as compared to Black/Hispanic dominated schools.

Ethnicity vs Average Academics (ELA + Math Proficiency)

ela_race <- select(school_exp_dem, Average.ELA.Proficiency, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
math_race <- select(school_exp_dem, Average.Math.Proficiency, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)

#  ELA 
ela_asian <- ela_race %>%
    ggplot(aes(x = Percent.Asian, y = Average.ELA.Proficiency)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : Asian")

ela_black <- ela_race %>%
    ggplot(aes(x = Percent.Black, y = Average.ELA.Proficiency)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : Black")

ela_hispanic <- ela_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Average.ELA.Proficiency)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : Hispanic")

ela_white <- ela_race %>%
    ggplot(aes(x = Percent.White, y = Average.ELA.Proficiency)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : White")


# Math
math_asian <- math_race %>%
    ggplot(aes(x = Percent.Asian, y = Average.Math.Proficiency)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : Asian")

math_black <- math_race %>%
    ggplot(aes(x = Percent.Black, y = Average.Math.Proficiency)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : Black")

math_hispanic <- math_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Average.Math.Proficiency)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : Hispanic")

math_white <- math_race %>%
    ggplot(aes(x = Percent.White, y = Average.Math.Proficiency)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : White")




grid.arrange(ela_asian, ela_white, ela_hispanic, ela_black, math_asian, math_white, math_hispanic, math_black, ncol=4)

Asian/White dominated schools have higher ELA and Math proficiency as compared to Hispanic and Black dominated schools.
From the above graphs, there seems to exist some correlation between Average ELA proficiency and Average Math Proficiency. Let’s see if they are actually correlated.

Ethnicity vs Disability

disability_race <- select(school_exp_dem, Percent.Disability, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)

disab_asian <- disability_race %>%
    ggplot(aes(x = Percent.Asian, y = Percent.Disability)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : Asian")

disab_black <- disability_race %>%
    ggplot(aes(x = Percent.Black, y = Percent.Disability)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : Black")

disab_hispanic <- disability_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Percent.Disability)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : Hispanic")

disab_white <- disability_race %>%
    ggplot(aes(x = Percent.White, y = Percent.Disability)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : White")


grid.arrange(disab_asian, disab_white, disab_hispanic, disab_black, ncol=2)

Black/Hispanic dominated schools tend to have more students with disabilities as compared to White/ Asian dominated schools.

ENI vs Chronic Absence, Disability, Percent Testers, ELL, Academics

eni_absence <- school_exp_dem_shsat_safety %>%
                    select(Economic.Need.Index, Percent.of.Students.Chronically.Absent) %>%
                    ggplot(aes(x = Economic.Need.Index, y = Percent.of.Students.Chronically.Absent)) + geom_point(color = "orange") + ggtitle("ENI vs Chronic Absence") + geom_smooth(method = lm)

eni_disability <- school_exp_dem_shsat_safety %>%
                    select(Economic.Need.Index, Percent.Disability) %>%
                    ggplot(aes(x = Economic.Need.Index, y = Percent.Disability)) + geom_point(color = "pink")+ ggtitle("ENI vs Percent Disability") + geom_smooth(method = lm)

eni_testers <- school_exp_dem_shsat_safety %>%
                    select(Economic.Need.Index, Percent.Testers) %>%
                    ggplot(aes(x = Economic.Need.Index, y = Percent.Testers)) + geom_point(color = "violet") +  ggtitle("ENI vs Percent Testers") + geom_smooth(method = lm)

eni_offers <- school_exp_dem_shsat_safety %>%
                    select(Economic.Need.Index, Percent.Offers) %>%
                    filter(Percent.Offers <= 1) %>%
                    ggplot(aes(x = Economic.Need.Index, y = Percent.Offers)) + geom_point(color = "yellow") +  ggtitle("ENI vs Percent Offers") + geom_smooth(method = lm)


eni_ELL <- school_exp_dem_shsat_safety %>%
                    select(Economic.Need.Index, Percent.ELL) %>%
                    ggplot(aes(x = Economic.Need.Index, y = Percent.ELL)) + geom_point(color = "lightgreen") +  ggtitle("ENI vs Percent ELL") + geom_smooth(method = lm)

eni_academics <- school_exp_dem_shsat_safety %>%
                    select(Economic.Need.Index, Avg.Academics) %>%
                    ggplot(aes(x = Economic.Need.Index, y = Avg.Academics)) + geom_point(color = "lightblue") + ggtitle("ENI vs Academics") + geom_smooth(method = lm)


grid.arrange(eni_absence, eni_disability, eni_testers, eni_offers, eni_ELL, eni_academics, ncol=3)

SHSAT Registrations

Demographic Overview

school_exp_dem_shsat_safety %>%
    na.omit() %>%
    group_by(Majority.Race) %>%
    summarise(Number.Of.Reg = sum(Count.of.Students.in.HS.Admissions), Number.Of.Testers = sum(Count.of.Testers), Number.Of.Offers = sum(Count.of.Offers)) %>%
    melt(id.vars = c("Majority.Race")) %>%
    ggplot(aes(x=Majority.Race, y=value, fill=variable, color=variable, alpha=variable)) + geom_bar(stat="identity",position ="identity") + scale_colour_manual(values=c("darkblue","red", "orange")) + scale_fill_manual(values=c("lightblue","pink", "#FFFF66")) + scale_alpha_manual(values=c(.3, .5, .8))

Hispanics dominated schools have the highest registration.
White dominated schools have significantly lower registration and yet more number of actual test takers and the highest number of offers.
Asians have the least registrations and test-takers and yet they get more offers than black and hispanic domiated schools.

Mean and Spread of Attributes for all Schools

school_exp_dem_shsat_safety_mean_lol <- school_exp_dem_shsat_safety %>%
                                        select(Economic.Need.Index, Percent.ELL, 
                                               Percent.Asian, Percent.Black, Percent.Hispanic, 
                                               Percent.White, Student.Attendance.Rate, 
                                               Percent.of.Students.Chronically.Absent, Percent.Testers,
                                               Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
                                        filter(Percent.Offers <= 1) %>%
                                        na.omit()
   

school_exp_dem_shsat_safety_mean <- colMeans(school_exp_dem_shsat_safety_mean_lol) 
df_school_exp_dem_shsat_safety_mean <- data.frame(lapply(school_exp_dem_shsat_safety_mean, function(x) t(data.frame(x))))

df_school_exp_dem_shsat_safety_mean %>%
    melt() %>%
    ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for all schools")

school_exp_dem_shsat_safety %>%
    select(Economic.Need.Index, Percent.ELL, 
           Percent.Asian, Percent.Black, Percent.Hispanic, 
           Percent.White, Student.Attendance.Rate, 
           Percent.of.Students.Chronically.Absent, Percent.Testers,
           Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
    filter(Percent.Offers <= 1) %>%
    na.omit() %>%
    melt() %>%
    ggplot(aes(x = variable, y = value)) + geom_boxplot(fill = "purple") + theme(axis.text.x = element_text(angle = 90, hjust = 0))

Ethnicity vs Testers and Offers

tester_race <- select(school_exp_dem_shsat_safety, Percent.Testers, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
offer_race <- select(school_exp_dem_shsat_safety, Percent.Offers, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)

#  Percent Testers 
tester_asian <- tester_race %>%
    ggplot(aes(x = Percent.Asian, y = Percent.Testers)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers: Asian")

tester_black <- tester_race %>%
    ggplot(aes(x = Percent.Black, y = Percent.Testers)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers: Black")

tester_hispanic <- tester_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Percent.Testers)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers : Hispanic")

tester_white <- tester_race %>%
    ggplot(aes(x = Percent.White, y = Percent.Testers)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers : White")


# offer
offer_asian <- offer_race %>%
    ggplot(aes(x = Percent.Asian, y = Percent.Offers)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : Asian")

offer_black <- offer_race %>%
    ggplot(aes(x = Percent.Black, y = Percent.Offers)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : Black")

offer_hispanic <- offer_race %>%
    ggplot(aes(x = Percent.Hispanic, y = Percent.Offers)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : Hispanic")

offer_white <- offer_race %>%
    ggplot(aes(x = Percent.White, y = Percent.Offers)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : White")




grid.arrange(tester_asian, tester_white, tester_hispanic, tester_black, offer_asian, offer_white, offer_hispanic, offer_black, ncol=4)

There are 26 schools where the percent offers > 100%. That is, the number of offers > number of testers.
Those schools are:

school_exp_dem_shsat_safety[(as.numeric(as.character(school_exp_dem_shsat_safety$Count.of.Offers)) > as.numeric(as.character(school_exp_dem_shsat_safety$Count.of.Testers))), ]

Almost all of these schools are Black/ Hispanic dominated schools. How is this possible? Corruption? Some special program?

As mentioned earlier, there are three ways to increase the number of students of historically underrepresented demographics in Specialized High Schools.

Increase in SHSAT registrations.
Increase in actual SHSAT participation. (registering for the test and actually taking it are two different things)
Increase in SHS admission offers.

Now, our aim is to identify what attributes are correlated with the number of SHSAT registrations, test takers, and admission offers.

Increase The Number Of SHSAT Registrations

Registration amongst the minorities like Black and Hispanic population is pretty high. We need to focus on increasing the number of poeple who actually take the test and the number of offers.

Increase The Number of SHSAT Test Takers

The graph below tells us that for majority of the schools, the percentage of students who take the test (\(testTakers/Registrations * 100\)) lies around 25%.
So, majority of the schools have 25% students who appear for the test after registering for the same.

school_exp_dem_shsat_safety %>%
    select(Percent.Testers) %>%
    ggplot(aes(Percent.Testers)) + geom_density(fill = "pink", na.rm= T, alpha = 0.8)

Attributes of Schools With High Percentage of Test Takers (Top 50)

I’ve plotted a heat map of the values scaled between (0, 1) in a sorted data frame to compare the different attributes between schools with low test takers and schools with high test takers.

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')


school_exp_dem_shsat_safety %>%
    select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
    arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
    head(50) %>%
    mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
    select(-c(Percent.Testers, Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
    melt() %>%
    ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) +  scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With High Percentage of Testers(Top 50)") + geom_text(aes(label = round(value, 1)))

From the above graph, we observe that for high test takers,

Attendance is very high.
School rating is very high.
Crime stats are moderately high.
Moderately high Economic Need Index.

Mean of the above attributes.

top_50_tester_school <- school_exp_dem_shsat_safety %>%
                                select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability) %>%
                                arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
                                select(-c(Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
                                head(50) 

top_50_tester_school_mean <- colMeans(top_50_tester_school[, 2:13]) 
df_top_50_tester_school_mean <- data.frame(lapply(top_50_tester_school_mean, function(x) t(data.frame(x))))
df_top_50_tester_school_mean

Attributes of School’s With Low Percentage of Test Takers (Bottom 50)

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')


school_exp_dem_shsat_safety %>%
    select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
    arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
    tail(50) %>%
    mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
    select(-c(Percent.Testers, Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
    melt() %>%
    na.omit() %>%
    ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) +  scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With Low Percentage of Testers (Bottom 50)") + geom_text(aes(label = round(value, 1)))

From the above graph, we observe that for low test takers,

Attendance is high.
School rating is moderately high.
Economic Need Index is very high.

Mean of the above attributes.

bottom_50_tester_school <- school_exp_dem_shsat_safety %>%
                                select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability) %>%
                                arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
                                tail(50) %>%
                                select(-c(Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
                                na.omit()

# bottom_50_tester_school
bottom_50_tester_school_mean <- colMeans(bottom_50_tester_school[, 2:13])

df_bottom_50_tester_school_mean <- data.frame(lapply(bottom_50_tester_school_mean, function(x) t(data.frame(x))))

df_bottom_50_tester_school_mean

Compare the Mean Values for Top-50 and Bottom-50 Test Takers

top_50_testers_plot <- df_top_50_tester_school_mean %>%
                            melt() %>%
                            ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for top-50 test-taking schools")

bottom_50_testers_plot <- df_bottom_50_tester_school_mean %>%
                            melt() %>%
                            ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for bottom-50 test-taking schools")


grid.arrange(top_50_testers_plot, bottom_50_testers_plot, ncol=2)

The average percent testers for the top-50 schools is 53% while that of the bottom-50 schools is 6%.

Comparing the two,

Schools with lesser test takers have:
- Higher Economic Need Index.
- Higher Attendance Rate.
- Lower number of English Language Learners.
- Higher Chronic Absence.
- Higher percent disability.
- Lower Academic performance in Math and ELA.

Based on the correlation heatmap we plotted earlier,

Percent testers is:

Positively Correlated with:
+ Attendance Rate
+ Average Academics (Math + ELA)
Negatively Correlated with:
- Economic Need Index
- Chronic Absence
- Percent Disability

Increase The Number of SHSAT Offers

The graph below tells us that for majority of the schools, the percentage of students who take the test (\(testTakers/Registration * 100\)) lies around 25%.
So, majority of the schools have 25% students who appear for the tests who after registering for the same.

school_exp_dem_shsat_safety %>%
    select(Percent.Offers) %>%
    filter(Percent.Offers <= 1) %>%
    ggplot(aes(Percent.Offers)) + geom_density(fill = "orange", na.rm= T, alpha = 0.6)

In Majority of the schools, only around 12% students secure admissions in Specialised High Schools.

Attributes of School’s With High Percentage of SHSAT offers (Top 50)

I’ve plotted a heat map of the values scaled between (0, 1) in a sorted data frame to compare the different attributes between schools with low test takers and schools with high test takers.

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')


school_exp_dem_shsat_safety %>%
    select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
    filter(Percent.Offers <= 1) %>%
    arrange(desc(Count.of.Offers), Count.of.Testers) %>%
    head(50) %>%
    mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
    select(-c(Percent.Offers, Count.of.Testers, Count.of.Offers)) %>%
    melt() %>%
    ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) +  scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With High Percentage of Offers(Top 50)") + geom_text(aes(label = round(value, 1)))

From the above graph, we observe that for high SHSAT offers,

Attendance is very high.
School rating is very high.
Crime stats are moderately high.
Very less number of English Language Learners.
Moderately low Economic Need Index.

Mean of the above attributes.

top_50_offer_school <- school_exp_dem_shsat_safety %>%
                                select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
                                arrange(desc(Count.of.Offers), Count.of.Testers) %>%
                                select(-c(Count.of.Offers, Count.of.Testers)) %>%
                                head(50) 

top_50_offer_school_mean <- colMeans(top_50_offer_school[, 2:13]) 
df_top_50_offer_school_mean <- data.frame(lapply(top_50_offer_school_mean, function(x) t(data.frame(x))))
df_top_50_offer_school_mean

Attributes of School’s With Low Percentage of SHSAT offers (Bottom 50)

 normalit<-function(m){
   (m - min(m))/(max(m)-min(m))
 }

hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')


school_exp_dem_shsat_safety %>%
    select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
    arrange(desc(Count.of.Offers), Count.of.Testers) %>%
    tail(50) %>%
    mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
    select(-c(Percent.Offers, Count.of.Offers, Count.of.Testers)) %>%
    melt() %>%
    na.omit() %>%
    ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) +  scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With Low Percentage of Offers (Bottom 50)") + geom_text(aes(label = round(value, 1)))

From the above graph, we observe that for low SHSAT offers,

Attendance is high.
School rating is moderately high.
Crime stats are moderately high.
Very less number of English Language Learners.
Very high Economic Need Index.

Mean of the above attributes.

bottom_50_offer_school <- school_exp_dem_shsat_safety %>%
                                select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
                                arrange(desc(Count.of.Offers), Count.of.Testers) %>%
                                tail(50) %>%
                                select(-c(Count.of.Offers, Count.of.Testers)) %>%
                                na.omit()

# bottom_50_tester_school
bottom_50_offer_school_mean <- colMeans(bottom_50_offer_school[, 2:13])

df_bottom_50_offer_school_mean <- data.frame(lapply(bottom_50_offer_school_mean, function(x) t(data.frame(x))))

df_bottom_50_offer_school_mean

Compare the Means for Top-50 and Bottom-50 SHSAT Offers

top_50_offers_plot <- df_top_50_offer_school_mean %>%
                            melt() %>%
                            ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for top-50 SHSAT offers schools")

bottom_50_offers_plot <- df_bottom_50_offer_school_mean %>%
                            melt() %>%
                            ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for bottom-50 SHSAT offers schools")


grid.arrange(top_50_offers_plot, bottom_50_offers_plot, ncol=2)

The average percent offers for the top-50 schools is 36% while that of the bottom-50 schools is 0%.

Comparing the two,

Schools with lesser offers have:
- Drastically Higher Economic Need Index.
- Lower Attendance Rates.
- Higher number of English Language Learners.
- Higher Chronic Absence.
- Higher percent disability.
- Lower Academic performance in Math and ELA.

Conclusion

Naive Approach

# Mean Calculations

mean_precent_testers_all <- school_exp_dem_shsat_safety %>%
                            select(Economic.Need.Index, Percent.ELL, 
                                   Percent.Asian, Percent.Black, Percent.Hispanic, 
                                   Percent.White, Student.Attendance.Rate, 
                                   Percent.of.Students.Chronically.Absent, Percent.Testers,
                                   Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
                            filter(Percent.Offers <= 1) %>%
                            na.omit()


mean_precent_testers <- colMeans(mean_precent_testers_all) 
df_mean_precent_testers <- data.frame(lapply(mean_precent_testers, function(x) t(data.frame(x))))

tester_mean <- df_mean_precent_testers$Percent.Testers
offer_mean <- df_mean_precent_testers$Percent.Offers
eni_mean <- df_mean_precent_testers$Economic.Need.Index
academic_mean <- df_mean_precent_testers$Avg.Academics
attendance_mean <- df_mean_precent_testers$Student.Attendance.Rate
school_rating_mean <- df_mean_precent_testers$School.Rating
chronic_absence_mean <- df_mean_precent_testers$Percent.of.Students.Chronically.Absent

All Schools That Need Help

These are the schools with low than average number of offers.

help_schools_gen <- school_exp_dem_shsat_safety %>%
                        filter(Percent.Offers < offer_mean, Percent.Testers < tester_mean)

help_schools_gen <- help_schools_gen %>%
    select(Feeder.School.DBN,School.Name, Majority.Race, Count.of.Students.in.HS.Admissions,Count.of.Testers,Count.of.Offers, District, Latitude,Longitude, City,Zip,Community.School.,Economic.Need.Index,School.Income.Estimate,Percent.ELL,Percent.Asian,Percent.Black,Percent.Hispanic, Percent.White, Student.Attendance.Rate,Percent.of.Students.Chronically.Absent, Percent.Female, Percent.Male,Percent.Disability,Percent.Poverty, Percent.Testers,Percent.Offers,School.Rating, Crime.Stats, Avg.Academics)
    
help_schools_gen

pal1 <- colorNumeric(palette = "RdYlBu", domain = help_schools_gen$Count.of.Offers)

leaflet(data = help_schools_gen) %>% 
    na.omit() %>%
    addTiles() %>%  
    addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
    addCircles(~Longitude, ~Latitude, radius=~Count.of.Offers*20, color = ~pal1(Count.of.Offers), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", help_schools_gen$City, "<br>","School:", help_schools_gen$School.Name, "<br>"), group = "Count.of.Offers")%>% 
 addLegend("bottomright", pal = pal1, values = ~Count.of.Offers, title = "Count.of.Offers", opacity = 1) %>%
    setView(-73.935242, 40.730610, zoom = 10)

Schools That Need The Most Help

These schools need both financial and Qualitative (Administrative/Teaching) help.

We will select schools whose Percent.Testers, Percent.Offers, Avg.Academic, and Attendance are below average and Economic Index, and Chronic.Absence is above average.
The schools are arranged in ascending order by the number of Percent.Offers, Percent.Testers, and SHSAT Registrations.
There are 597 schools in New York who teach students till atleast Grade 8. The following subset is from these schools.

# school_exp_dem_shsat_safety[school_exp_dem_shsat_safety$Percent.Testers < mean_tester, ]

help_schools <- school_exp_dem_shsat_safety %>%
                        filter(Percent.Testers < tester_mean, Percent.Offers < offer_mean, Economic.Need.Index > eni_mean, Avg.Academics < academic_mean, Student.Attendance.Rate < attendance_mean, School.Rating < school_rating_mean, Percent.of.Students.Chronically.Absent > chronic_absence_mean) %>%
                        arrange(Percent.Offers, Percent.Testers, Count.of.Students.in.HS.Admissions)
    

help_schools <- help_schools %>%
    select(Feeder.School.DBN,School.Name, Majority.Race, Count.of.Students.in.HS.Admissions,Count.of.Testers,Count.of.Offers, District, Latitude,Longitude, City,Zip,Community.School.,Economic.Need.Index,School.Income.Estimate,Percent.ELL,Percent.Asian,Percent.Black,Percent.Hispanic, Percent.White, Student.Attendance.Rate,Percent.of.Students.Chronically.Absent, Percent.Female, Percent.Male,Percent.Disability,Percent.Poverty, Percent.Testers,Percent.Offers,School.Rating, Crime.Stats, Avg.Academics)
    
help_schools

These schools are perfectly represent the demographic that PASSNYC aims to help. All the schools majority Black and/or Hispanic.

leaflet(data = help_schools) %>% 
    addProviderTiles(providers$Esri.WorldTopoMap) %>%
    addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(),  label = ~htmlEscape(School.Name))

Schools That Need Qualitative Help

Schools that have a more than average percentage of test-taking students but less than average percentage of offers, are the schools that are not in dire need of economic/financial aid, but require help with school administration and subjects like Math and ELA. These will benefit the most from after-school programs, teacher-training, and tutoring.

PS: We know that Economic Need Index and Test-Takers are negatively correlated from the correlation plot we made.

help_schools_acad <- school_exp_dem_shsat_safety %>%
                        filter(Percent.Testers > tester_mean, Percent.Offers < offer_mean, Economic.Need.Index < eni_mean, Avg.Academics < academic_mean, School.Rating < school_rating_mean) %>%
                        arrange(Percent.Offers)
    

help_schools_acad <- help_schools_acad %>%
    select(Feeder.School.DBN,School.Name, Majority.Race, Count.of.Students.in.HS.Admissions,Count.of.Testers,Count.of.Offers, District, Latitude,Longitude, City,Zip,Community.School.,Economic.Need.Index,School.Income.Estimate,Percent.ELL,Percent.Asian,Percent.Black,Percent.Hispanic, Percent.White, Student.Attendance.Rate,Percent.of.Students.Chronically.Absent, Percent.Female, Percent.Male,Percent.Disability,Percent.Poverty, Percent.Testers,Percent.Offers,School.Rating, Crime.Stats, Avg.Academics)
    
help_schools_acad

leaflet(data = help_schools_acad) %>% 
    addProviderTiles(providers$Esri.WorldTopoMap) %>%
    addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(),  label = ~htmlEscape(School.Name))

A Granular Approach (Unsupervised)

We will perform k-means clustering to identify similar schools.
Based on the attributes of the clusters, we will identify what kind of help the schools need ie. Monetary or Qualitative.

cluster_df <- school_exp_dem_shsat_safety %>%
                select(Feeder.School.DBN, School.Name, Majority.Race, Economic.Need.Index, Percent.ELL, 
                       Percent.Asian, Percent.Black, Percent.Hispanic, Latitude, Longitude,
                       Percent.White, Student.Attendance.Rate, 
                       Percent.of.Students.Chronically.Absent, Percent.Testers,
                       Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
                filter(Percent.Offers <= 1) %>%
                na.omit()
    
    
scaled_df <- school_exp_dem_shsat_safety %>%
                select(Economic.Need.Index, Percent.ELL, 
                       Percent.Asian, Percent.Black, Percent.Hispanic, 
                       Percent.White, Student.Attendance.Rate, 
                       Percent.of.Students.Chronically.Absent, Percent.Testers,
                       Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
                filter(Percent.Offers <= 1) %>%
                na.omit() %>%
                scale()

# Manual method of finding the number of clusters

# # Identiy the optimum number of clusters
# kmean_withinss <- function(k) {
#     cluster <- kmeans(scaled_df, k)
#     return (cluster$tot.withinss)
# }
# 
# # Set maximum cluster 
# max_k <-10
# 
# # Run algorithm over a range of k 
# wss <- sapply(2:max_k, kmean_withinss)
# 
# # Create a data frame to plot the graph
# elbow <-data.frame(2:max_k, wss)
# 
# # Plot the graph with gglplot
# ggplot(elbow, aes(x = X2.max_k, y = wss)) +
#     geom_point() +
#     geom_line() +
#     scale_x_continuous(breaks = seq(1, 20, by = 1))
# 
# 
# pc_cluster <-kmeans(scaled_df, 3)
# cluster_df$cluster <- pc_cluster$cluster
# 
# cluster_df %>%
#     group_by(cluster) %>%
#     summarise_all(funs(mean)) 
# 
# 
# cluster_df %>%
#     mutate(cluster = as.factor(cluster)) %>%
#     ggplot(aes(x = Economic.Need.Index, y = Avg.Academics)) + geom_point(size = 3, alpha = 0.6, aes(color = cluster)) +
#   scale_color_manual(values = c("red", "green", "blue"), labels = c("Cluster A", "Cluster B", "Cluster C"))

Find Ideal Number of Clusters.

# Elbow method
num_clust_a <- fviz_nbclust(scaled_df, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2) + labs(subtitle = "Elbow method")

# Silhouette method
num_clust_b <- fviz_nbclust(scaled_df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method")

# Gap statistic
num_clust_c <- fviz_nbclust(scaled_df, kmeans, nstart = 25,  method = "gap_stat", nboot = 500, verbose = FALSE) + labs(subtitle = "Gap statistic method")

grid.arrange(num_clust_a, num_clust_b, num_clust_c, ncol=3)

nb <- NbClust(scaled_df, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 6 proposed 6 as the best number of clusters 
## * 1 proposed 7 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  2 
##  
##  
## *******************************************************************

fviz_nbclust(nb)

## Among all indices: 
## ===================
## * 2 proposed  0 as the best number of clusters
## * 1 proposed  1 as the best number of clusters
## * 7 proposed  2 as the best number of clusters
## * 5 proposed  3 as the best number of clusters
## * 1 proposed  5 as the best number of clusters
## * 6 proposed  6 as the best number of clusters
## * 1 proposed  7 as the best number of clusters
## * 1 proposed  8 as the best number of clusters
## * 2 proposed  10 as the best number of clusters
## 
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is  2 .

We’ll take the number of clusters to be 3. The above graph shows 3 as the best number of clusters when I run it on R-Studio. It somehow changes to 4 when I run knit it.
Anywho, we’ll stick with 3 clusters as it helps seperate the clusters better.

# Acual Kmeans Clustering
pc_cluster <-kmeans(scaled_df, 3)
cluster_df$cluster <- as.factor(pc_cluster$cluster)

hm.palette <-colorRampPalette(rev(brewer.pal(10, 'RdYlGn')),space='Lab')

cluster_df %>%
    select(-c(Feeder.School.DBN, School.Name, Majority.Race, Latitude, Longitude, Avg.Academics)) %>%
    melt() %>%
    ggplot(aes(x = variable, y = cluster, fill = value)) + geom_tile() + coord_equal() + scale_fill_gradientn(colours = hm.palette(100)) + theme(axis.text.x = element_text(angle = 90, hjust=0))

cluster_df %>%
    select(-c(Latitude, Longitude, School.Name, Feeder.School.DBN)) %>%
    melt() %>%
    group_by(cluster) %>%
    ggplot(aes(x = variable, y = value)) + geom_boxplot(aes(fill=cluster), position = position_dodge()) +  theme(axis.text.x = element_text(angle=90, hjust=0))

Cluster 1:
- Moderate Economic Need
- Moderate number of english language learners
- Black Majority
- Moderately high attendance
- Moderate chronic absence
- Moderate number of test-takers
- Low number of offers
- Moderate school rating
- Moderate Percent disability
- Moderate Academics (Math, ELA)
Cluster 2:
- Low Economic Need
- Less number of english language learners
- Majority White > Asian > Hispanics
- Highest attendance
- Very low chronic absence
- Highest percent test-takers
- Highest percent offers
- Highest school rating
- Best Acdemics (ELA, Math)
- Least percent disability
Cluster 3:
- Highest Economic Need
- Highest number of english language learners
- Majority Hispanics > Blacks
- Least attendance
- Highest chronic absence
- Lowest percent test-takers
- Moderate percent offers
- Lowest school rating
- Worst Academics (ELA, Math)
- Highest percent disability

Surprisingly, there isnt’t a whole lot of difference in the Offers received between the 2nd and 3rd cluster.

Based on the above characterstics,

Cluster 1:
- Underperforming
- Need extra english classes as high number of Enligh language learners
- Needs more qualitative help than financial
- Will benefit the most with a little help from PASSNYC
Cluster 2:
- Best performing
- Don’t need help from PASSNYC at the moment
Cluster 3:
- Poor performing
- Need immediate financial assistance, Administrative help, Extra english classes, Create awareness about SHSAT tests
- Need the most help from PASSNYC
- Will benefit from a major long-term effort from PASSNYC (resource intensive)

cluster_df %>%
    select(Percent.Offers, cluster) %>%
    ggplot(aes(Percent.Offers)) + geom_density(aes(fill = cluster), alpha = 0.5) + ggtitle("Density plot of Percent.Offers of the 2 clusters")

Majority schools in cluster 1 has around 10% of offers.
Majority schools in cluster 2 has around 12% of offers.
Majority schools in cluster 3 has around 16% of offers.

Clusters Plotted On The Map Of New York

cluster_df %>%
    ggplot(aes(x = Longitude, colour = cluster)) + geom_point(aes(y = Latitude), size=3, alpha=0.5) + scale_color_manual(values = c("red", "darkblue", "green"))

palfact <- colorFactor(c("red", "blue", "green"),cluster_df$cluster)

leaflet() %>% 
    addTiles() %>%
    addProviderTiles(providers$Esri.WorldTopoMap) %>%
    addCircles(data=cluster_df, lat = ~Latitude, lng = ~Longitude,
    color =~palfact(cluster),
    stroke = TRUE,
    radius = 10,
    fillOpacity = 0.75)

Effort Intensive Help (Long Term) from PASNYC

Schools in cluster 3 require the most help. The schools are arranged in the decreasing order of help required. These schools need long term monetary as well as administrative help.

most_help_schools <- cluster_df %>%
                        filter(cluster == 3) %>%
                        arrange(Percent.Offers, Percent.Testers, Percent.Testers, Economic.Need.Index, Avg.Academics)

most_help_schools

leaflet(data = most_help_schools) %>% 
    addProviderTiles(providers$Esri.WorldTopoMap) %>%
    addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(),  label = ~htmlEscape(School.Name))

Non-Intensive Effort (Short Term) from PASNYC

Schools in cluster 1 require the short term help. The schools are arranged in the decreasing order of help required. These schools majorly need administrative help.

medium_help_schools <- cluster_df %>%
                        filter(cluster == 1) %>%
                        arrange(Percent.Offers, Percent.Testers, Percent.Testers, Economic.Need.Index, Avg.Academics)

medium_help_schools

leaflet(data = medium_help_schools) %>% 
    addProviderTiles(providers$Esri.WorldTopoMap) %>%
    addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(),  label = ~htmlEscape(School.Name))

Thank you for reading. Hope you liked it. Constructive criticism is welcome. :)
You can find me on LinkedIn.
View the kernel on kaggle. Read my data science blogs here.

PASSNYC: Data Science for Good

Akshaj Verma

Pre-Processing

Geospatial Exploration of The Schools in New York.

Schools in New York

Distribution of schools based on Ethnicity

Distribution of schools with Black Students

Distribution of schools with Hispanic Students

Distribution of schools with Asian Students

Distribution of schools with White Students

Distribution of schools based on Gender

Percent Poverty in Schools of NY

Economic Need Index Distribution of Schools in NY

Clusters of Community and Non-Community Schools

Crime statistics in the City

Correlation Between Attributes

Correlation Heatmap

Correlation Between Attributes After Dimension Reduction

Ethinicity vs Economic Need Index

Ethnicity vs Chronic Absence

Ethnicity vs Percent English Language Learners

Ethnicity vs Average Academics (ELA + Math Proficiency)

Ethnicity vs Disability

ENI vs Chronic Absence, Disability, Percent Testers, ELL, Academics

SHSAT Registrations

Demographic Overview

Mean and Spread of Attributes for all Schools

Ethnicity vs Testers and Offers

Increase The Number Of SHSAT Registrations

Increase The Number of SHSAT Test Takers

Attributes of Schools With High Percentage of Test Takers (Top 50)

Mean of the above attributes.

Attributes of School’s With Low Percentage of Test Takers (Bottom 50)

Mean of the above attributes.

Compare the Mean Values for Top-50 and Bottom-50 Test Takers

Increase The Number of SHSAT Offers

Attributes of School’s With High Percentage of SHSAT offers (Top 50)

Mean of the above attributes.

Attributes of School’s With Low Percentage of SHSAT offers (Bottom 50)

Mean of the above attributes.

Compare the Means for Top-50 and Bottom-50 SHSAT Offers

Conclusion

Naive Approach

All Schools That Need Help

Schools That Need The Most Help

Schools That Need Qualitative Help

A Granular Approach (Unsupervised)

Find Ideal Number of Clusters.

Clusters Plotted On The Map Of New York

Effort Intensive Help (Long Term) from PASNYC

Non-Intensive Effort (Short Term) from PASNYC