The overall goal of this notebook is to increase the diversity in Specialized High Schools.
There are three ways to increase the number of students of historically underrepresented demographics in Specialized High Schools.
1. Increase in SHSAT registrations.
2. Increase in actual SHSAT participation. (registering for the test and actually taking it are two different things)
3. Increase in SHS admission offers.
To achieve each of the aforementioned goals, PASSNYC and its partners have to provide outreach services that will improve the chances of students taking the SHSAT and receiving placements in these specialized high schools. Since, non-profits like PASSNYC have limited resources, a data driven solution to accurately identify the affected schools is necessary. Therefore, the schools have to be assigned some sort of ranking based on the likelihood of converting maximum number of students into test takers. This will ensure optimum utilisation of resources.
Now, the question is, how to identify such schools?
Proxies that have been good indicators of these types of schools include data on -
* English Language Learners
* Students with Disabilities
* Students on Free/Reduced Lunch
* Students with Temporary Housing
Based on the kind of problems the school faces, PASSNYC will commence services like after school programs, test preparation, mentoring, or resources for parents.
We’re going to use 3 additional datasets.
* Demographic Snapshot of Schools, 2013-2018
* School Safety Report, 2010-2016
* Admission Offers, 2017-2018
The analysis process will be as follows:
1. Explore the School demographics and geography.
2. Find schools with less number of registrations.
3. Find schools with less number of test takers.
4. Identify the schools with less number of admissions.
5. Suggest schools who need the most help in each of the above 3 scenarios.
Import Libraries
library(dplyr)
library(tidyr)
library(ggplot2)
library(gridExtra)
library(reshape2)
library(RColorBrewer)
library(ggmap)
library(leaflet)
library(htmltools)
library(httr)
library(plotly)
library(GGally)
library(factoextra)
library(NbClust)
# library(naniar)
# library(mice)
# library(rgdal)
# library(geojsonio)
Import data
school_explorer <- read.csv("data/2016 School Explorer.csv", sep=",", header=TRUE)
school_explorer <- tbl_df(school_explorer)
school_explorer_og <- tbl_df(school_explorer)
nyc_shsat_results <- read.csv("data/2017-2018_SHSAT_Admissions_Test_Offers_By_Sending_School.csv", sep=",", header = TRUE)
nyc_shsat_results <- tbl_df(nyc_shsat_results)
d5_shsat <- read.csv("data/D5 SHSAT Registrations and Testers.csv", sep=",", header = TRUE)
d5_shsat <- tbl_df(d5_shsat)
school_dem <- read.csv("data/2013_-_2018_Demographic_Snapshot_School.csv", sep=",", header = TRUE)
school_dem <- tbl_df(school_dem)
school_safety <- read.csv("data/2010-2016-school-safety-report.csv", sep=",", header = TRUE)
school_safety <- tbl_df(school_safety)
Pre-Process the school_explorer data frame.
# Subset the columns
school_explorer <- subset(school_explorer, select = -c(Adjusted.Grade, New., Other.Location.Code.in.LCGMS, Address..Full., SED.Code, Grades, Percent.Black...Hispanic))
school_explorer <- select(school_explorer, -contains("Grade"))
school_explorer <- select(school_explorer, -contains("Rating"))
# Convert currency to numbers.
toCurrency <- function(curr) {
curr <- as.character(curr)
curr <- gsub("[$,]", "", curr)
result <- as.numeric(curr)
return(result)
}
school_explorer$School.Income.Estimate <- toCurrency(school_explorer$School.Income.Estimate)
# Convert percentages to a number divided by 100.
percent_columns <- c("Percent.ELL", "Percent.Asian", "Percent.Black", "Percent.Hispanic", "Percent.White", "Student.Attendance.Rate", "Percent.of.Students.Chronically.Absent", "Rigorous.Instruction..", "Collaborative.Teachers..", "Supportive.Environment..", "Effective.School.Leadership..", "Strong.Family.Community.Ties..", "Trust..")
school_explorer[percent_columns] <- apply(school_explorer[percent_columns], 2, function(y) as.numeric(gsub("%", "", y))/100 )
# Convert to numeric type
school_explorer$Economic.Need.Index = as.numeric(as.character(school_explorer$Economic.Need.Index))
school_explorer$Average.ELA.Proficiency = as.numeric(as.character(school_explorer$Average.ELA.Proficiency))
school_explorer$Average.Math.Proficiency = as.numeric(as.character(school_explorer$Average.Math.Proficiency))
school_explorer$Location.Code = as.character(school_explorer$Location.Code)
head(school_explorer)
Pre-Process the school_dem data frame.
p_cols <- c("X..Female.1", "X..Male.1", "X..Students.with.Disabilities.1", "X..Poverty.1")
school_dem_mfdb <- school_dem %>%
filter(Year == "2015-16") %>%
select(DBN, X..Female.1, X..Male.1, X..Students.with.Disabilities.1, X..Poverty.1)
school_dem_mfdb[p_cols] <- apply(school_dem_mfdb[p_cols], 2, function(y) as.numeric(gsub("%", "", y))/100 )
Join the school_exp and school_dem data frames.
school_exp_dem <- inner_join(school_explorer, school_dem_mfdb, by = c("Location.Code" = "DBN"))
school_exp_dem <- rename(school_exp_dem, Percent.Female = X..Female.1, Percent.Male = X..Male.1, Percent.Disability = X..Students.with.Disabilities.1 , Percent.Poverty = X..Poverty.1)
head(school_exp_dem)
Pre-Process the school_safety data.
# Select Crime stats attributes from the df
ss <- school_safety %>%
filter(School.Year == "2015-16") %>%
select(DBN, AvgOfMajor.N, AvgOfOth.N, AvgOfNoCrim.N, AvgOfProp.N, AvgOfVio.N)
ss$AvgOfMajor.N[ss$AvgOfMajor.N == "N/A"] <- NA
ss$AvgOfOth.N[ss$AvgOfOth.N == "N/A"] <- NA
ss$AvgOfNoCrim.N[ss$AvgOfNoCrim.N == "N/A"] <- NA
ss$AvgOfProp.N[ss$AvgOfProp.N == "N/A"] <- NA
ss$AvgOfVio.N[ss$AvgOfVio.N == "N/A"] <- NA
dbn <- ss$DBN
ss <- apply(ss[,2:6], 2, function(x) as.numeric(x))
ss <- as.data.frame(ss)
ss$DBN <- dbn
# Fill NA values for Crime Stats
# NA values in the crime stats will imputed by a random number generated between the `(mean-standard deviation) and (mean + standard deviation)`.
mean_AvgOfMajor.N <- as.numeric(mean(ss[, "AvgOfMajor.N"], na.rm=TRUE))
sd_AvgOfMajor.N <- as.numeric(sd(ss[, "AvgOfMajor.N"], na.rm=TRUE))
count_nan_AvgOfMajor.N <- sum(is.na(ss[, "AvgOfMajor.N"]))
random_AvgOfMajor.N <- sample((mean_AvgOfMajor.N - sd_AvgOfMajor.N):(mean_AvgOfMajor.N + sd_AvgOfMajor.N), count_nan_AvgOfMajor.N, replace = TRUE)
mean_AvgOfOth.N <- as.numeric(mean(ss[,"AvgOfOth.N"], na.rm=TRUE))
sd_AvgOfOth.N <- as.numeric(sd(ss[,"AvgOfOth.N"], na.rm=TRUE))
count_nan_AvgOfOth.N <- sum(is.na(ss[,"AvgOfOth.N"]))
random_AvgOfOth.N <- sample((mean_AvgOfOth.N - sd_AvgOfOth.N):(mean_AvgOfOth.N + sd_AvgOfOth.N), count_nan_AvgOfOth.N, replace = TRUE)
mean_AvgOfNoCrim.N <- as.numeric(mean(ss[,"AvgOfNoCrim.N"], na.rm=TRUE))
sd_AvgOfNoCrim.N <- as.numeric(sd(ss[,"AvgOfNoCrim.N"], na.rm=TRUE))
count_nan_AvgOfNoCrim.N <- sum(is.na(ss[, "AvgOfNoCrim.N"]))
random_AvgOfNoCrim.N <- sample((mean_AvgOfNoCrim.N - sd_AvgOfNoCrim.N):(mean_AvgOfNoCrim.N + sd_AvgOfNoCrim.N), count_nan_AvgOfNoCrim.N, replace = TRUE)
mean_AvgOfProp.N <- as.numeric(mean(ss[,"AvgOfProp.N"], na.rm=TRUE))
sd_AvgOfProp.N <- as.numeric(sd(ss[,"AvgOfProp.N"], na.rm=TRUE))
count_nan_AvgOfProp.N <- sum(is.na(ss[,"AvgOfProp.N"]))
random_AvgOfProp.N <- sample((mean_AvgOfProp.N - sd_AvgOfProp.N):(mean_AvgOfProp.N + sd_AvgOfProp.N), count_nan_AvgOfProp.N, replace = TRUE)
mean_AvgOfVio.N <- as.numeric(mean(ss[,"AvgOfVio.N"], na.rm=TRUE))
sd_AvgOfVio.N <- as.numeric(sd(ss[,"AvgOfVio.N"], na.rm=TRUE))
count_nan_AvgOfVio.N <- sum(is.na(ss[,"AvgOfVio.N"]))
random_AvgOfVio.N <- sample((mean_AvgOfVio.N - sd_AvgOfVio.N):(mean_AvgOfVio.N + sd_AvgOfVio.N), count_nan_AvgOfVio.N, replace = TRUE)
# Replace NA with the random integers generated above
ss <- mutate(ss,
AvgOfMajor.N=ifelse(is.na(AvgOfMajor.N), random_AvgOfMajor.N, AvgOfMajor.N),
AvgOfNoCrim.N=ifelse(is.na(AvgOfNoCrim.N), random_AvgOfNoCrim.N, AvgOfNoCrim.N),
AvgOfOth.N=ifelse(is.na(AvgOfOth.N), random_AvgOfOth.N, AvgOfOth.N),
AvgOfProp.N=ifelse(is.na(AvgOfProp.N), random_AvgOfProp.N, AvgOfProp.N),
AvgOfVio.N=ifelse(is.na(AvgOfVio.N), random_AvgOfVio.N, AvgOfVio.N))
Join the school_exp_dem and ss data frames.
school_exp_dem_safety <- left_join(school_exp_dem, ss, by = c("Location.Code" = "DBN"))
head(school_exp_dem_safety)
Pre-Process the nyc_shsat_results data.
# Replace "0-5" values with NA.
nyc_shsat_results <- nyc_shsat_results %>%
mutate(Count.of.Testers = as.character(Count.of.Testers), Count.of.Offers = as.character(Count.of.Offers), Feeder.School.DBN = as.character(Feeder.School.DBN))
nyc_shsat_results$Count.of.Testers[nyc_shsat_results$Count.of.Testers == "0-5"] <- NA
nyc_shsat_results$Count.of.Offers[nyc_shsat_results$Count.of.Offers == "0-5"] <- NA
# We will replace the "0-5" factor in the Count.Of.Testers and Count.Of.Offers column by a random number betweem 0 and 5.
count_nan_testers <- sum(is.na(nyc_shsat_results$Count.of.Testers))
count_nan_offers <- sum(is.na(nyc_shsat_results$Count.of.Offers))
random_testers <- sample(0:5, count_nan_testers, replace = TRUE)
random_offers <- sample(0:5, count_nan_offers, replace = TRUE)
# Replace NA with the random integers generated above
nyc_shsat_results <- mutate(nyc_shsat_results,
Count.of.Testers=ifelse(is.na(Count.of.Testers), random_testers, Count.of.Testers),
Count.of.Offers=ifelse(is.na(Count.of.Offers), random_offers, Count.of.Offers))
Join nyc_shsat_results and school_exp_dem data frames.
school_exp_dem_shsat <- left_join(nyc_shsat_results, school_exp_dem, by=c("Feeder.School.DBN" = "Location.Code"))
school_exp_dem_shsat <- school_exp_dem_shsat %>%
mutate(Count.of.Testers = as.numeric(as.character(Count.of.Testers)), Count.of.Offers = as.numeric(as.character(Count.of.Offers)), Count.of.Students.in.HS.Admissions = as.numeric(as.character(Count.of.Students.in.HS.Admissions)))
Join nyc_exp_dem_shsat and ss data frames.
school_exp_dem_shsat_safety <- left_join(school_exp_dem_shsat, ss, by=c("Feeder.School.DBN" = "DBN"))
school_exp_dem_shsat_safety <- school_exp_dem_shsat_safety %>%
select(-c(Feeder.School.Name)) %>%
mutate(Percent.Testers = Count.of.Testers/Count.of.Students.in.HS.Admissions, Percent.Offers = Count.of.Offers/Count.of.Testers)
# Replace the 0/0 = inf values by 0.
school_exp_dem_shsat_safety$Percent.Offers[is.infinite(school_exp_dem_shsat_safety$Percent.Offers)] <- 0
# Create a new column with the majority race.
school_exp_dem_shsat_safety$Majority.Race <- colnames(school_exp_dem_shsat_safety[c("Percent.Asian", "Percent.Black", "Percent.Hispanic", "Percent.White")])[max.col(school_exp_dem_shsat_safety[c("Percent.Asian", "Percent.Black", "Percent.Hispanic", "Percent.White")], ties.method="first")]
school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.Hispanic", "Hispanic", school_exp_dem_shsat_safety$Majority.Race)
school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.Asian", "Asian", school_exp_dem_shsat_safety$Majority.Race)
school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.Black", "Black", school_exp_dem_shsat_safety$Majority.Race)
school_exp_dem_shsat_safety$Majority.Race = gsub("Percent.White", "White", school_exp_dem_shsat_safety$Majority.Race)
head(school_exp_dem_shsat_safety)
Available data frames:
leaflet(data = school_exp_dem) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(), label = ~htmlEscape(School.Name))
school_dem %>%
select(X..Asian, X..Black, X..Hispanic, X..White) %>%
melt() %>%
group_by(variable) %>%
summarise(total = sum(value)) %>%
ggplot(aes(x = variable, y = total)) + geom_bar(aes(fill = total), stat="identity")
Hispanics make up the largest school attending population in New York.
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Black)
leaflet(data = school_exp_dem) %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Percent.Black, color = ~pal1(Percent.Black), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Black")%>%
addLegend("bottomright", pal = pal1, values = ~Percent.Black, title = "Percent.Black", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Hispanic)
leaflet(data = school_exp_dem) %>%
na.omit() %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Percent.Hispanic, color = ~pal1(Percent.Hispanic), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Hispanic")%>%
addLegend("bottomright", pal = pal1, values = ~Percent.Hispanic, title = "Percent.Hispanic", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Asian)
leaflet(data = school_exp_dem) %>%
na.omit() %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Percent.Asian, color = ~pal1(Percent.Asian), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Asian")%>%
addLegend("bottomright", pal = pal1, values = ~Percent.Asian, title = "Percent.Asian", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.White)
leaflet(data = school_exp_dem) %>%
na.omit() %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Percent.White, color = ~pal1(Percent.White), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.White")%>%
addLegend("bottomright", pal = pal1, values = ~Percent.White, title = "Percent.White", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Female)
leaflet(data = school_exp_dem) %>%
na.omit() %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Percent.Female, color = ~pal1(Percent.Female), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Female")%>%
addLegend("bottomright", pal = pal1, values = ~Percent.Female, title = "Percent.Female", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Percent.Poverty)
leaflet(data = school_exp_dem) %>%
na.omit() %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Percent.Poverty, color = ~pal1(Percent.Poverty), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Percent.Poverty")%>%
addLegend("bottomright", pal = pal1, values = ~Percent.Poverty, title = "Percent.Poverty", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem$Economic.Need.Index)
leaflet(data = school_exp_dem) %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Economic.Need.Index, color = ~pal1(Economic.Need.Index), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", school_exp_dem$City, "<br>","School:", school_exp_dem$School.Name, "<br>"), group = "Economic.Need.Index")%>%
addLegend("bottomright", pal = pal1, values = ~Economic.Need.Index, title = "Economic.Need.Index", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
# Create a palette that maps factor levels to colors
pal <- colorFactor(c("blue", "red"), domain = c("Yes", "No"))
leaflet(school_exp_dem) %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircleMarkers(~Longitude,
~Latitude,
color = ~pal(Community.School.),
radius = ~ifelse(Community.School. == "Yes", 2, 1)) %>%
addLegend(position = "bottomright", pal = pal, values = c("Yes", "No"))
The distribution of crime stats is pretty much the same as that of the Major Crimes.
Major Crimes
pal1 <- colorNumeric(palette = "YlOrRd", domain = school_exp_dem_safety$AvgOfMajor.N)
leaflet(data = school_exp_dem_safety) %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~AvgOfMajor.N *20, color = ~pal1(AvgOfMajor.N), stroke = TRUE, fillOpacity = 0.5, popup = paste("City:", school_exp_dem_safety$City, "<br>","School:", school_exp_dem_safety$School.Name, "<br>"), group = "AvgOfMajor.N")%>%
addLegend("bottomright", pal = pal1, values = ~AvgOfMajor.N, title = "AvgOfMajor.N", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
The crime rate is high regions where Black and Hispanic dominated schools exist.
hm.palette <- colorRampPalette(rev(brewer.pal(5, 'RdYlBu')), space='Lab')
school_exp_dem_shsat_safety %>%
select(-c(School.Name, Feeder.School.DBN,
Count.of.Students.in.HS.Admissions,
Count.of.Testers, Count.of.Offers, District,
Latitude, Longitude, City, Zip, Community.School.,
School.Income.Estimate, Majority.Race)) %>%
na.omit() %>%
cor() %>%
melt() %>%
ggplot(aes(Var1, Var2)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_equal() + theme(axis.text.x=element_text(angle=90, hjust=0)) + ggtitle("Correlation Between Attributes") + geom_text(aes(label = round(value, 1)))
# + geom_text(aes(label = round(value, 1)))
The above heatmap can be interpreted in the following way:
We can aggregate the following variables into a single variable to reduce dimenstionality as they have high a correlation amongst them:
We will take the mean of the above 3 batches of highly correlated values into a single value to reduce dimensionality.
Note that disparity between the White/Asian vs Black/Hispanic students in terms of the above factors. White students have better support from schools than anyone else.
Create 3 new colunms called School.Rating, Crime.Stats, and Avg.Academics.
school_rating <- school_exp_dem_shsat_safety %>%
select(c(21:26)) %>%
mutate(School.Rating = rowMeans(.[,1:6])) %>%
select(School.Rating)
crime_stats <- school_exp_dem_shsat_safety %>%
select(c(33:37)) %>%
rowwise() %>%
mutate(Crime.Stats = mean(c(AvgOfMajor.N, AvgOfOth.N, AvgOfNoCrim.N, AvgOfProp.N,AvgOfVio.N))) %>%
select(Crime.Stats)
avg_acad_perf <- school_exp_dem_shsat_safety %>%
select(Average.ELA.Proficiency, Average.Math.Proficiency) %>%
rowwise() %>%
mutate(Avg.Academics = mean(c(Average.ELA.Proficiency, Average.Math.Proficiency))) %>%
select(Avg.Academics)
school_exp_dem_shsat_safety <- cbind(school_exp_dem_shsat_safety, data.frame(School.Rating = school_rating))
school_exp_dem_shsat_safety <- cbind(school_exp_dem_shsat_safety, data.frame(Crime.Stats = crime_stats))
school_exp_dem_shsat_safety <- cbind(school_exp_dem_shsat_safety, data.frame(Avg.Academics = avg_acad_perf))
school_exp_dem_shsat_safety <-school_exp_dem_shsat_safety %>% select(-c(21:26), -c(33:37), -c(Average.ELA.Proficiency, Average.Math.Proficiency))
hm.palette <- colorRampPalette(rev(brewer.pal(5, 'RdYlBu')), space='Lab')
school_exp_dem_shsat_safety %>%
select(-c(School.Name, Feeder.School.DBN,
Count.of.Students.in.HS.Admissions,
Count.of.Testers, Count.of.Offers, District,
Latitude, Longitude, City, Zip, Community.School.,
School.Income.Estimate, Majority.Race, Percent.Poverty)) %>%
na.omit() %>%
cor() %>%
melt() %>%
ggplot(aes(Var1, Var2)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_equal() + theme(axis.text.x=element_text(angle=90, hjust=0)) + ggtitle("Correlation Between Attributes After Dimension Reduction") + geom_text(aes(label = round(value, 1)))
# + geom_text(aes(label = round(value, 1)))
Percent testers is:
Interesting correlations to examine further based on the pearson cofficient > 0.5 or pearson cofficient < -0.5:
Race vs the following:
eni_race <- select(school_exp_dem, Economic.Need.Index, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
eni_asian <- eni_race %>%
ggplot(aes(x = Percent.Asian, y = Economic.Need.Index)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : Asian")
eni_black <- eni_race %>%
ggplot(aes(x = Percent.Black, y = Economic.Need.Index)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : Black")
eni_hispanic <- eni_race %>%
ggplot(aes(x = Percent.Hispanic, y = Economic.Need.Index)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : Hispanic")
eni_white <- eni_race %>%
ggplot(aes(x = Percent.White, y = Economic.Need.Index)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Economic Need Index : White")
grid.arrange(eni_asian, eni_white, eni_black, eni_hispanic, ncol=2)
Hispanic and Black population have a greater economic need while the White and Asian students are well off.
absence_race <- select(school_exp_dem, Percent.of.Students.Chronically.Absent, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
absence_asian <- absence_race %>%
ggplot(aes(x = Percent.Asian, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : Asian")
absence_black <- absence_race %>%
ggplot(aes(x = Percent.Black, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : Black")
absence_hispanic <- absence_race %>%
ggplot(aes(x = Percent.Hispanic, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : Hispanic")
absence_white <- absence_race %>%
ggplot(aes(x = Percent.White, y = Percent.of.Students.Chronically.Absent)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Chronic Absence : White")
grid.arrange(absence_asian, absence_white, absence_black, absence_hispanic, ncol=2)
Chronic Absence is more amongst black and hispanic dominated schools as compared to the Hispanic and White dominated schools.
ell_race <- select(school_exp_dem_shsat_safety, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
ell_asian <- ell_race %>%
ggplot(aes(x = Percent.Asian, y = Percent.ELL)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : Asian")
ell_black <- ell_race %>%
ggplot(aes(x = Percent.Black, y = Percent.ELL)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : Black")
ell_hispanic <- ell_race %>%
ggplot(aes(x = Percent.Hispanic, y = Percent.ELL)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : Hispanic")
ell_white <- ell_race %>%
ggplot(aes(x = Percent.White, y = Percent.ELL)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("English Language Learners : White")
grid.arrange(ell_asian, ell_white, ell_hispanic, ell_black, ncol=2)
White/Asian dominated schools have a more supportive environment as compared to Black/Hispanic dominated schools.
ela_race <- select(school_exp_dem, Average.ELA.Proficiency, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
math_race <- select(school_exp_dem, Average.Math.Proficiency, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
# ELA
ela_asian <- ela_race %>%
ggplot(aes(x = Percent.Asian, y = Average.ELA.Proficiency)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : Asian")
ela_black <- ela_race %>%
ggplot(aes(x = Percent.Black, y = Average.ELA.Proficiency)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : Black")
ela_hispanic <- ela_race %>%
ggplot(aes(x = Percent.Hispanic, y = Average.ELA.Proficiency)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : Hispanic")
ela_white <- ela_race %>%
ggplot(aes(x = Percent.White, y = Average.ELA.Proficiency)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average ELA Proficiency : White")
# Math
math_asian <- math_race %>%
ggplot(aes(x = Percent.Asian, y = Average.Math.Proficiency)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : Asian")
math_black <- math_race %>%
ggplot(aes(x = Percent.Black, y = Average.Math.Proficiency)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : Black")
math_hispanic <- math_race %>%
ggplot(aes(x = Percent.Hispanic, y = Average.Math.Proficiency)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : Hispanic")
math_white <- math_race %>%
ggplot(aes(x = Percent.White, y = Average.Math.Proficiency)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Average Math Proficiency : White")
grid.arrange(ela_asian, ela_white, ela_hispanic, ela_black, math_asian, math_white, math_hispanic, math_black, ncol=4)
Asian/White dominated schools have higher ELA and Math proficiency as compared to Hispanic and Black dominated schools.
From the above graphs, there seems to exist some correlation between Average ELA proficiency and Average Math Proficiency. Let’s see if they are actually correlated.
disability_race <- select(school_exp_dem, Percent.Disability, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
disab_asian <- disability_race %>%
ggplot(aes(x = Percent.Asian, y = Percent.Disability)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : Asian")
disab_black <- disability_race %>%
ggplot(aes(x = Percent.Black, y = Percent.Disability)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : Black")
disab_hispanic <- disability_race %>%
ggplot(aes(x = Percent.Hispanic, y = Percent.Disability)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : Hispanic")
disab_white <- disability_race %>%
ggplot(aes(x = Percent.White, y = Percent.Disability)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Disability : White")
grid.arrange(disab_asian, disab_white, disab_hispanic, disab_black, ncol=2)
Black/Hispanic dominated schools tend to have more students with disabilities as compared to White/ Asian dominated schools.
eni_absence <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.of.Students.Chronically.Absent) %>%
ggplot(aes(x = Economic.Need.Index, y = Percent.of.Students.Chronically.Absent)) + geom_point(color = "orange") + ggtitle("ENI vs Chronic Absence") + geom_smooth(method = lm)
eni_disability <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.Disability) %>%
ggplot(aes(x = Economic.Need.Index, y = Percent.Disability)) + geom_point(color = "pink")+ ggtitle("ENI vs Percent Disability") + geom_smooth(method = lm)
eni_testers <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.Testers) %>%
ggplot(aes(x = Economic.Need.Index, y = Percent.Testers)) + geom_point(color = "violet") + ggtitle("ENI vs Percent Testers") + geom_smooth(method = lm)
eni_offers <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.Offers) %>%
filter(Percent.Offers <= 1) %>%
ggplot(aes(x = Economic.Need.Index, y = Percent.Offers)) + geom_point(color = "yellow") + ggtitle("ENI vs Percent Offers") + geom_smooth(method = lm)
eni_ELL <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.ELL) %>%
ggplot(aes(x = Economic.Need.Index, y = Percent.ELL)) + geom_point(color = "lightgreen") + ggtitle("ENI vs Percent ELL") + geom_smooth(method = lm)
eni_academics <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Avg.Academics) %>%
ggplot(aes(x = Economic.Need.Index, y = Avg.Academics)) + geom_point(color = "lightblue") + ggtitle("ENI vs Academics") + geom_smooth(method = lm)
grid.arrange(eni_absence, eni_disability, eni_testers, eni_offers, eni_ELL, eni_academics, ncol=3)
school_exp_dem_shsat_safety %>%
na.omit() %>%
group_by(Majority.Race) %>%
summarise(Number.Of.Reg = sum(Count.of.Students.in.HS.Admissions), Number.Of.Testers = sum(Count.of.Testers), Number.Of.Offers = sum(Count.of.Offers)) %>%
melt(id.vars = c("Majority.Race")) %>%
ggplot(aes(x=Majority.Race, y=value, fill=variable, color=variable, alpha=variable)) + geom_bar(stat="identity",position ="identity") + scale_colour_manual(values=c("darkblue","red", "orange")) + scale_fill_manual(values=c("lightblue","pink", "#FFFF66")) + scale_alpha_manual(values=c(.3, .5, .8))
school_exp_dem_shsat_safety_mean_lol <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.ELL,
Percent.Asian, Percent.Black, Percent.Hispanic,
Percent.White, Student.Attendance.Rate,
Percent.of.Students.Chronically.Absent, Percent.Testers,
Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
filter(Percent.Offers <= 1) %>%
na.omit()
school_exp_dem_shsat_safety_mean <- colMeans(school_exp_dem_shsat_safety_mean_lol)
df_school_exp_dem_shsat_safety_mean <- data.frame(lapply(school_exp_dem_shsat_safety_mean, function(x) t(data.frame(x))))
df_school_exp_dem_shsat_safety_mean %>%
melt() %>%
ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for all schools")
school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.ELL,
Percent.Asian, Percent.Black, Percent.Hispanic,
Percent.White, Student.Attendance.Rate,
Percent.of.Students.Chronically.Absent, Percent.Testers,
Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
filter(Percent.Offers <= 1) %>%
na.omit() %>%
melt() %>%
ggplot(aes(x = variable, y = value)) + geom_boxplot(fill = "purple") + theme(axis.text.x = element_text(angle = 90, hjust = 0))
tester_race <- select(school_exp_dem_shsat_safety, Percent.Testers, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
offer_race <- select(school_exp_dem_shsat_safety, Percent.Offers, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White)
# Percent Testers
tester_asian <- tester_race %>%
ggplot(aes(x = Percent.Asian, y = Percent.Testers)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers: Asian")
tester_black <- tester_race %>%
ggplot(aes(x = Percent.Black, y = Percent.Testers)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers: Black")
tester_hispanic <- tester_race %>%
ggplot(aes(x = Percent.Hispanic, y = Percent.Testers)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers : Hispanic")
tester_white <- tester_race %>%
ggplot(aes(x = Percent.White, y = Percent.Testers)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Testers : White")
# offer
offer_asian <- offer_race %>%
ggplot(aes(x = Percent.Asian, y = Percent.Offers)) + geom_point(color="lightgreen", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : Asian")
offer_black <- offer_race %>%
ggplot(aes(x = Percent.Black, y = Percent.Offers)) + geom_point(color="lightblue", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : Black")
offer_hispanic <- offer_race %>%
ggplot(aes(x = Percent.Hispanic, y = Percent.Offers)) + geom_point(color="pink", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : Hispanic")
offer_white <- offer_race %>%
ggplot(aes(x = Percent.White, y = Percent.Offers)) + geom_point(color="orange", na.rm = TRUE) + geom_smooth(method = lm) + ggtitle("Percent.Offers : White")
grid.arrange(tester_asian, tester_white, tester_hispanic, tester_black, offer_asian, offer_white, offer_hispanic, offer_black, ncol=4)
There are 26 schools where the percent offers > 100%. That is, the number of offers > number of testers.
Those schools are:
school_exp_dem_shsat_safety[(as.numeric(as.character(school_exp_dem_shsat_safety$Count.of.Offers)) > as.numeric(as.character(school_exp_dem_shsat_safety$Count.of.Testers))), ]
Almost all of these schools are Black/ Hispanic dominated schools. How is this possible? Corruption? Some special program?
As mentioned earlier, there are three ways to increase the number of students of historically underrepresented demographics in Specialized High Schools.
Now, our aim is to identify what attributes are correlated with the number of SHSAT registrations, test takers, and admission offers.
Registration amongst the minorities like Black and Hispanic population is pretty high. We need to focus on increasing the number of poeple who actually take the test and the number of offers.
The graph below tells us that for majority of the schools, the percentage of students who take the test (\(testTakers/Registrations * 100\)) lies around 25%.
So, majority of the schools have 25% students who appear for the test after registering for the same.
school_exp_dem_shsat_safety %>%
select(Percent.Testers) %>%
ggplot(aes(Percent.Testers)) + geom_density(fill = "pink", na.rm= T, alpha = 0.8)
I’ve plotted a heat map of the values scaled between (0, 1) in a sorted data frame to compare the different attributes between schools with low test takers and schools with high test takers.
normalit<-function(m){
(m - min(m))/(max(m)-min(m))
}
hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')
school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
head(50) %>%
mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
select(-c(Percent.Testers, Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
melt() %>%
ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With High Percentage of Testers(Top 50)") + geom_text(aes(label = round(value, 1)))
From the above graph, we observe that for high test takers,
top_50_tester_school <- school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability) %>%
arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
select(-c(Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
head(50)
top_50_tester_school_mean <- colMeans(top_50_tester_school[, 2:13])
df_top_50_tester_school_mean <- data.frame(lapply(top_50_tester_school_mean, function(x) t(data.frame(x))))
df_top_50_tester_school_mean
normalit<-function(m){
(m - min(m))/(max(m)-min(m))
}
hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')
school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
tail(50) %>%
mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
select(-c(Percent.Testers, Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
melt() %>%
na.omit() %>%
ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With Low Percentage of Testers (Bottom 50)") + geom_text(aes(label = round(value, 1)))
From the above graph, we observe that for low test takers,
bottom_50_tester_school <- school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Students.in.HS.Admissions, Percent.Testers, School.Rating, Avg.Academics, Percent.Disability) %>%
arrange(desc(Count.of.Testers), Count.of.Students.in.HS.Admissions) %>%
tail(50) %>%
select(-c(Count.of.Testers, Count.of.Students.in.HS.Admissions)) %>%
na.omit()
# bottom_50_tester_school
bottom_50_tester_school_mean <- colMeans(bottom_50_tester_school[, 2:13])
df_bottom_50_tester_school_mean <- data.frame(lapply(bottom_50_tester_school_mean, function(x) t(data.frame(x))))
df_bottom_50_tester_school_mean
top_50_testers_plot <- df_top_50_tester_school_mean %>%
melt() %>%
ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for top-50 test-taking schools")
bottom_50_testers_plot <- df_bottom_50_tester_school_mean %>%
melt() %>%
ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for bottom-50 test-taking schools")
grid.arrange(top_50_testers_plot, bottom_50_testers_plot, ncol=2)
The average percent testers for the top-50 schools is 53% while that of the bottom-50 schools is 6%.
Comparing the two,
Based on the correlation heatmap we plotted earlier,
Percent testers is:
Positively Correlated with:
+ Attendance Rate
+ Average Academics (Math + ELA)
The graph below tells us that for majority of the schools, the percentage of students who take the test (\(testTakers/Registration * 100\)) lies around 25%.
So, majority of the schools have 25% students who appear for the tests who after registering for the same.
school_exp_dem_shsat_safety %>%
select(Percent.Offers) %>%
filter(Percent.Offers <= 1) %>%
ggplot(aes(Percent.Offers)) + geom_density(fill = "orange", na.rm= T, alpha = 0.6)
In Majority of the schools, only around 12% students secure admissions in Specialised High Schools.
I’ve plotted a heat map of the values scaled between (0, 1) in a sorted data frame to compare the different attributes between schools with low test takers and schools with high test takers.
normalit<-function(m){
(m - min(m))/(max(m)-min(m))
}
hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')
school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
filter(Percent.Offers <= 1) %>%
arrange(desc(Count.of.Offers), Count.of.Testers) %>%
head(50) %>%
mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
select(-c(Percent.Offers, Count.of.Testers, Count.of.Offers)) %>%
melt() %>%
ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With High Percentage of Offers(Top 50)") + geom_text(aes(label = round(value, 1)))
From the above graph, we observe that for high SHSAT offers,
top_50_offer_school <- school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
arrange(desc(Count.of.Offers), Count.of.Testers) %>%
select(-c(Count.of.Offers, Count.of.Testers)) %>%
head(50)
top_50_offer_school_mean <- colMeans(top_50_offer_school[, 2:13])
df_top_50_offer_school_mean <- data.frame(lapply(top_50_offer_school_mean, function(x) t(data.frame(x))))
df_top_50_offer_school_mean
normalit<-function(m){
(m - min(m))/(max(m)-min(m))
}
hm.palette <- colorRampPalette(rev(brewer.pal(9, 'RdYlBu')), space='Lab')
school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability, Crime.Stats) %>%
arrange(desc(Count.of.Offers), Count.of.Testers) %>%
tail(50) %>%
mutate(Avg.Academics = normalit(Avg.Academics), Crime.Stats = normalit(Crime.Stats)) %>%
select(-c(Percent.Offers, Count.of.Offers, Count.of.Testers)) %>%
melt() %>%
na.omit() %>%
ggplot(aes(x = School.Name, y = variable)) + geom_tile(aes(fill = value)) + scale_fill_gradientn(colours = hm.palette(100)) + coord_flip() + theme(axis.text.x = element_text(angle = 90, hjust=0)) + ggtitle("Characterstics of Schools With Low Percentage of Offers (Bottom 50)") + geom_text(aes(label = round(value, 1)))
From the above graph, we observe that for low SHSAT offers,
bottom_50_offer_school <- school_exp_dem_shsat_safety %>%
select(School.Name, Economic.Need.Index, Percent.ELL, Percent.Asian, Percent.Black, Percent.Hispanic, Percent.White, Student.Attendance.Rate, Percent.of.Students.Chronically.Absent, Count.of.Testers, Count.of.Offers, Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
arrange(desc(Count.of.Offers), Count.of.Testers) %>%
tail(50) %>%
select(-c(Count.of.Offers, Count.of.Testers)) %>%
na.omit()
# bottom_50_tester_school
bottom_50_offer_school_mean <- colMeans(bottom_50_offer_school[, 2:13])
df_bottom_50_offer_school_mean <- data.frame(lapply(bottom_50_offer_school_mean, function(x) t(data.frame(x))))
df_bottom_50_offer_school_mean
top_50_offers_plot <- df_top_50_offer_school_mean %>%
melt() %>%
ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for top-50 SHSAT offers schools")
bottom_50_offers_plot <- df_bottom_50_offer_school_mean %>%
melt() %>%
ggplot(aes(x = variable, y = value)) + geom_bar(stat="identity", aes(fill = value)) + geom_label(aes(label = round(value, 2))) + theme(axis.text.x = element_text(angle = 90, hjust=0)) + coord_cartesian(ylim = c(0, 4)) + ggtitle("Mean Attributes for bottom-50 SHSAT offers schools")
grid.arrange(top_50_offers_plot, bottom_50_offers_plot, ncol=2)
The average percent offers for the top-50 schools is 36% while that of the bottom-50 schools is 0%.
Comparing the two,
# Mean Calculations
mean_precent_testers_all <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.ELL,
Percent.Asian, Percent.Black, Percent.Hispanic,
Percent.White, Student.Attendance.Rate,
Percent.of.Students.Chronically.Absent, Percent.Testers,
Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
filter(Percent.Offers <= 1) %>%
na.omit()
mean_precent_testers <- colMeans(mean_precent_testers_all)
df_mean_precent_testers <- data.frame(lapply(mean_precent_testers, function(x) t(data.frame(x))))
tester_mean <- df_mean_precent_testers$Percent.Testers
offer_mean <- df_mean_precent_testers$Percent.Offers
eni_mean <- df_mean_precent_testers$Economic.Need.Index
academic_mean <- df_mean_precent_testers$Avg.Academics
attendance_mean <- df_mean_precent_testers$Student.Attendance.Rate
school_rating_mean <- df_mean_precent_testers$School.Rating
chronic_absence_mean <- df_mean_precent_testers$Percent.of.Students.Chronically.Absent
These are the schools with low than average number of offers.
help_schools_gen <- school_exp_dem_shsat_safety %>%
filter(Percent.Offers < offer_mean, Percent.Testers < tester_mean)
help_schools_gen <- help_schools_gen %>%
select(Feeder.School.DBN,School.Name, Majority.Race, Count.of.Students.in.HS.Admissions,Count.of.Testers,Count.of.Offers, District, Latitude,Longitude, City,Zip,Community.School.,Economic.Need.Index,School.Income.Estimate,Percent.ELL,Percent.Asian,Percent.Black,Percent.Hispanic, Percent.White, Student.Attendance.Rate,Percent.of.Students.Chronically.Absent, Percent.Female, Percent.Male,Percent.Disability,Percent.Poverty, Percent.Testers,Percent.Offers,School.Rating, Crime.Stats, Avg.Academics)
help_schools_gen
pal1 <- colorNumeric(palette = "RdYlBu", domain = help_schools_gen$Count.of.Offers)
leaflet(data = help_schools_gen) %>%
na.omit() %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap, group = "City") %>%
addCircles(~Longitude, ~Latitude, radius=~Count.of.Offers*20, color = ~pal1(Count.of.Offers), stroke = TRUE, fillOpacity = 1.0, popup = paste("City:", help_schools_gen$City, "<br>","School:", help_schools_gen$School.Name, "<br>"), group = "Count.of.Offers")%>%
addLegend("bottomright", pal = pal1, values = ~Count.of.Offers, title = "Count.of.Offers", opacity = 1) %>%
setView(-73.935242, 40.730610, zoom = 10)
These schools need both financial and Qualitative (Administrative/Teaching) help.
We will select schools whose Percent.Testers, Percent.Offers, Avg.Academic, and Attendance are below average and Economic Index, and Chronic.Absence is above average.
The schools are arranged in ascending order by the number of Percent.Offers, Percent.Testers, and SHSAT Registrations.
There are 597 schools in New York who teach students till atleast Grade 8. The following subset is from these schools.
# school_exp_dem_shsat_safety[school_exp_dem_shsat_safety$Percent.Testers < mean_tester, ]
help_schools <- school_exp_dem_shsat_safety %>%
filter(Percent.Testers < tester_mean, Percent.Offers < offer_mean, Economic.Need.Index > eni_mean, Avg.Academics < academic_mean, Student.Attendance.Rate < attendance_mean, School.Rating < school_rating_mean, Percent.of.Students.Chronically.Absent > chronic_absence_mean) %>%
arrange(Percent.Offers, Percent.Testers, Count.of.Students.in.HS.Admissions)
help_schools <- help_schools %>%
select(Feeder.School.DBN,School.Name, Majority.Race, Count.of.Students.in.HS.Admissions,Count.of.Testers,Count.of.Offers, District, Latitude,Longitude, City,Zip,Community.School.,Economic.Need.Index,School.Income.Estimate,Percent.ELL,Percent.Asian,Percent.Black,Percent.Hispanic, Percent.White, Student.Attendance.Rate,Percent.of.Students.Chronically.Absent, Percent.Female, Percent.Male,Percent.Disability,Percent.Poverty, Percent.Testers,Percent.Offers,School.Rating, Crime.Stats, Avg.Academics)
help_schools
These schools are perfectly represent the demographic that PASSNYC aims to help. All the schools majority Black and/or Hispanic.
leaflet(data = help_schools) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(), label = ~htmlEscape(School.Name))
Schools that have a more than average percentage of test-taking students but less than average percentage of offers, are the schools that are not in dire need of economic/financial aid, but require help with school administration and subjects like Math and ELA. These will benefit the most from after-school programs, teacher-training, and tutoring.
PS: We know that Economic Need Index and Test-Takers are negatively correlated from the correlation plot we made.
help_schools_acad <- school_exp_dem_shsat_safety %>%
filter(Percent.Testers > tester_mean, Percent.Offers < offer_mean, Economic.Need.Index < eni_mean, Avg.Academics < academic_mean, School.Rating < school_rating_mean) %>%
arrange(Percent.Offers)
help_schools_acad <- help_schools_acad %>%
select(Feeder.School.DBN,School.Name, Majority.Race, Count.of.Students.in.HS.Admissions,Count.of.Testers,Count.of.Offers, District, Latitude,Longitude, City,Zip,Community.School.,Economic.Need.Index,School.Income.Estimate,Percent.ELL,Percent.Asian,Percent.Black,Percent.Hispanic, Percent.White, Student.Attendance.Rate,Percent.of.Students.Chronically.Absent, Percent.Female, Percent.Male,Percent.Disability,Percent.Poverty, Percent.Testers,Percent.Offers,School.Rating, Crime.Stats, Avg.Academics)
help_schools_acad
leaflet(data = help_schools_acad) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(), label = ~htmlEscape(School.Name))
We will perform k-means clustering to identify similar schools.
Based on the attributes of the clusters, we will identify what kind of help the schools need ie. Monetary or Qualitative.
cluster_df <- school_exp_dem_shsat_safety %>%
select(Feeder.School.DBN, School.Name, Majority.Race, Economic.Need.Index, Percent.ELL,
Percent.Asian, Percent.Black, Percent.Hispanic, Latitude, Longitude,
Percent.White, Student.Attendance.Rate,
Percent.of.Students.Chronically.Absent, Percent.Testers,
Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
filter(Percent.Offers <= 1) %>%
na.omit()
scaled_df <- school_exp_dem_shsat_safety %>%
select(Economic.Need.Index, Percent.ELL,
Percent.Asian, Percent.Black, Percent.Hispanic,
Percent.White, Student.Attendance.Rate,
Percent.of.Students.Chronically.Absent, Percent.Testers,
Percent.Offers, School.Rating, Avg.Academics, Percent.Disability) %>%
filter(Percent.Offers <= 1) %>%
na.omit() %>%
scale()
# Manual method of finding the number of clusters
# # Identiy the optimum number of clusters
# kmean_withinss <- function(k) {
# cluster <- kmeans(scaled_df, k)
# return (cluster$tot.withinss)
# }
#
# # Set maximum cluster
# max_k <-10
#
# # Run algorithm over a range of k
# wss <- sapply(2:max_k, kmean_withinss)
#
# # Create a data frame to plot the graph
# elbow <-data.frame(2:max_k, wss)
#
# # Plot the graph with gglplot
# ggplot(elbow, aes(x = X2.max_k, y = wss)) +
# geom_point() +
# geom_line() +
# scale_x_continuous(breaks = seq(1, 20, by = 1))
#
#
# pc_cluster <-kmeans(scaled_df, 3)
# cluster_df$cluster <- pc_cluster$cluster
#
# cluster_df %>%
# group_by(cluster) %>%
# summarise_all(funs(mean))
#
#
# cluster_df %>%
# mutate(cluster = as.factor(cluster)) %>%
# ggplot(aes(x = Economic.Need.Index, y = Avg.Academics)) + geom_point(size = 3, alpha = 0.6, aes(color = cluster)) +
# scale_color_manual(values = c("red", "green", "blue"), labels = c("Cluster A", "Cluster B", "Cluster C"))
# Elbow method
num_clust_a <- fviz_nbclust(scaled_df, kmeans, method = "wss") + geom_vline(xintercept = 4, linetype = 2) + labs(subtitle = "Elbow method")
# Silhouette method
num_clust_b <- fviz_nbclust(scaled_df, kmeans, method = "silhouette") + labs(subtitle = "Silhouette method")
# Gap statistic
num_clust_c <- fviz_nbclust(scaled_df, kmeans, nstart = 25, method = "gap_stat", nboot = 500, verbose = FALSE) + labs(subtitle = "Gap statistic method")
grid.arrange(num_clust_a, num_clust_b, num_clust_c, ncol=3)
nb <- NbClust(scaled_df, distance = "euclidean", min.nc = 2, max.nc = 10, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 6 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 2
##
##
## *******************************************************************
fviz_nbclust(nb)
## Among all indices:
## ===================
## * 2 proposed 0 as the best number of clusters
## * 1 proposed 1 as the best number of clusters
## * 7 proposed 2 as the best number of clusters
## * 5 proposed 3 as the best number of clusters
## * 1 proposed 5 as the best number of clusters
## * 6 proposed 6 as the best number of clusters
## * 1 proposed 7 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
## * 2 proposed 10 as the best number of clusters
##
## Conclusion
## =========================
## * According to the majority rule, the best number of clusters is 2 .
We’ll take the number of clusters to be 3. The above graph shows 3 as the best number of clusters when I run it on R-Studio. It somehow changes to 4 when I run knit it.
Anywho, we’ll stick with 3 clusters as it helps seperate the clusters better.
# Acual Kmeans Clustering
pc_cluster <-kmeans(scaled_df, 3)
cluster_df$cluster <- as.factor(pc_cluster$cluster)
hm.palette <-colorRampPalette(rev(brewer.pal(10, 'RdYlGn')),space='Lab')
cluster_df %>%
select(-c(Feeder.School.DBN, School.Name, Majority.Race, Latitude, Longitude, Avg.Academics)) %>%
melt() %>%
ggplot(aes(x = variable, y = cluster, fill = value)) + geom_tile() + coord_equal() + scale_fill_gradientn(colours = hm.palette(100)) + theme(axis.text.x = element_text(angle = 90, hjust=0))
cluster_df %>%
select(-c(Latitude, Longitude, School.Name, Feeder.School.DBN)) %>%
melt() %>%
group_by(cluster) %>%
ggplot(aes(x = variable, y = value)) + geom_boxplot(aes(fill=cluster), position = position_dodge()) + theme(axis.text.x = element_text(angle=90, hjust=0))
Surprisingly, there isnt’t a whole lot of difference in the Offers received between the 2nd and 3rd cluster.
Based on the above characterstics,
cluster_df %>%
select(Percent.Offers, cluster) %>%
ggplot(aes(Percent.Offers)) + geom_density(aes(fill = cluster), alpha = 0.5) + ggtitle("Density plot of Percent.Offers of the 2 clusters")
Majority schools in cluster 1 has around 10% of offers.
Majority schools in cluster 2 has around 12% of offers.
Majority schools in cluster 3 has around 16% of offers.
cluster_df %>%
ggplot(aes(x = Longitude, colour = cluster)) + geom_point(aes(y = Latitude), size=3, alpha=0.5) + scale_color_manual(values = c("red", "darkblue", "green"))
palfact <- colorFactor(c("red", "blue", "green"),cluster_df$cluster)
leaflet() %>%
addTiles() %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addCircles(data=cluster_df, lat = ~Latitude, lng = ~Longitude,
color =~palfact(cluster),
stroke = TRUE,
radius = 10,
fillOpacity = 0.75)
Schools in cluster 3 require the most help. The schools are arranged in the decreasing order of help required. These schools need long term monetary as well as administrative help.
most_help_schools <- cluster_df %>%
filter(cluster == 3) %>%
arrange(Percent.Offers, Percent.Testers, Percent.Testers, Economic.Need.Index, Avg.Academics)
most_help_schools
leaflet(data = most_help_schools) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(), label = ~htmlEscape(School.Name))
Schools in cluster 1 require the short term help. The schools are arranged in the decreasing order of help required. These schools majorly need administrative help.
medium_help_schools <- cluster_df %>%
filter(cluster == 1) %>%
arrange(Percent.Offers, Percent.Testers, Percent.Testers, Economic.Need.Index, Avg.Academics)
medium_help_schools
leaflet(data = medium_help_schools) %>%
addProviderTiles(providers$Esri.WorldTopoMap) %>%
addMarkers(~Longitude, ~Latitude, clusterOptions = markerClusterOptions(), label = ~htmlEscape(School.Name))
Thank you for reading. Hope you liked it. Constructive criticism is welcome. :)
You can find me on LinkedIn.
View the kernel on kaggle. Read my data science blogs here.