New York City is home to over 1,800 schools, making it the largest public-school system in the United States. While the city is well known for its elite private and public schools, it is also known for its neglected, underperforming ones. Average SAT scores vary drastically across New York City high schools. To understand what drives this variation, I tested whether several external factors were correlated with SAT scores. The factors analyzed were total crimes in school, enrollment, percent of students with disabilities, percent of students receiving free lunch, geography, academic expectations scores, and percent of Hispanic and black students. I hypothesized that most of these factors would be correlated with lower average total SAT scores.
To examine these factors, I combined several datasets from NYC Open Data: the 2012 NYC General Education School Survey, the 2012 SAT Results, the 2010 – 2016 School Safety Report, and the 2011 – 2012 High School Progress Report. I chose 2012 because multiple good-quality datasets were available for that year.
I hypothesized that race, free lunch, school crime, enrollment, and disabilities would all have a negative correlation with SAT scores, meaning that as each of those factors increased, average total SAT scores would decrease. I also hypothesized that geography would not have much impact on SAT scores because of the prevalence of magnet schools in New York City. Finally, I expected academic expectations scores from the annual NYC School Survey to have a positive correlation with SAT scores.
My first step was to load the following libraries in RStudio:
library(sf)
library(tidyverse)
library(mapview)
library(st)
library(sp)
The first dataset I loaded was the 2012 SAT Results dataset, a CSV file read in with the read.csv function. I added stringsAsFactors = FALSE so that text columns would be read as plain characters rather than factors, which makes converting them to numbers straightforward. I renamed the columns I was most interested in (reading, math, and writing scores) to simpler names. Finally, I ran as.numeric() on each of these columns so that I could calculate a total SAT score for each school instead of analyzing the sections separately. I did this by using the rowSums function to generate a new column called “total.”
sat <- read.csv("2012_SAT_Results.csv", stringsAsFactors = FALSE)
colnames(sat)[4] <- "Reading"
colnames(sat)[5] <- "Math"
colnames(sat)[6] <- "Writing"
as.numeric(sat$Math) -> sat$Math
as.numeric(sat$Reading) -> sat$Reading
as.numeric(sat$Writing) -> sat$Writing
sat$total <- rowSums(sat[,4:6])
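The conversion step can be sketched on a small made-up data frame. The school codes and scores below are invented for illustration; only the "s" marker for suppressed scores comes from the SAT file described in this post. as.numeric() turns "s" into NA, and rowSums() then returns NA for that school unless na.rm = TRUE is set, which would produce a misleading partial total.

```r
# Toy stand-in for the SAT file: scores arrive as character, "s" = suppressed
toy_sat <- data.frame(
  DBN     = c("01M001", "01M002", "01M003"),
  Reading = c("400", "s", "510"),
  Math    = c("420", "s", "530"),
  Writing = c("390", "s", "500"),
  stringsAsFactors = FALSE
)

score_cols <- c("Reading", "Math", "Writing")

# Convert each score column; "s" becomes NA (the coercion warning is silenced here)
toy_sat[score_cols] <- lapply(
  toy_sat[score_cols],
  function(x) suppressWarnings(as.numeric(x))
)

# Total per school; a row with any missing section stays NA
toy_sat$total <- rowSums(toy_sat[, score_cols])

# Keep only schools with a complete total
toy_complete <- toy_sat[!is.na(toy_sat$total), ]
print(toy_complete)
```

Filtering on is.na() after the conversion is the reliable way to drop suppressed rows, because the "s" text no longer exists once the columns are numeric.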
The 2012 SAT dataset is very useful, but it has no geometry for the school locations, so it cannot be mapped on its own. I merged it with the NYC School Point Locations dataset to solve this problem. That dataset includes all schools in New York City, so I filtered it to high schools only, since only high schoolers take the SAT. Every school in New York City has a DBN (District Borough Number), a unique identifier that appears in all education datasets. This makes merging education datasets very simple because DBNs do not vary the way school names might. Once the school points dataset had been filtered, I merged it with my SAT dataset to create HSsat1, a mappable SAT dataset. The SAT dataset lists “s” for some schools’ scores, so I also filtered those schools out.
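colnames(sat)[2] <- "SCHOOLNAME"
# nySchools is the NYC School Point Locations layer, read in beforehand
nySchools %>%
  inner_join(sat) -> nycMerged
# The joined data keeps its geometry column, so it is already mappable
str(nycMerged$geometry)
# Keep only high schools, then attach the SAT scores
nySchools %>%
  filter(SCH_TYPE %in% "High school") -> nycHighSchools
nycHighSchools %>%
  inner_join(sat) -> HSsat
# Suppressed scores ("s") became NA in as.numeric(), so drop them with is.na()
HSsat %>%
  filter(!is.na(Reading)) -> HSsat1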
The next dataset I added was the 2010 – 2016 School Safety Report, also as a CSV file. Despite the name, the dataset only has reports from 2013 – 2016, so I filtered it to the 2013 – 2014 school year, which was the closest available to my SAT results. I renamed one of the DBN columns in my original dataset because the merge failed when there were two DBN columns. After that, I was able to merge the two datasets by DBN.
schoolSafety <- read.csv("2010_-_2016_School_Safety_Report.csv", stringsAsFactors = FALSE)
schoolSafety %>%
filter(School.Year %in% "2013-14") -> schoolSafety1
colnames(HSsat1)[1] <- "DBN1"
HSsat1 %>%
inner_join(schoolSafety1) -> safetyScoresMerged
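When two tables hold the same key under different column names, inner_join() from dplyr accepts a named vector in its by argument, which makes the matching columns explicit. The sketch below uses invented data frames; DBN1 mirrors the renamed column above, and the crimes column and all values are hypothetical.

```r
library(dplyr)

toy_scores <- data.frame(
  DBN1  = c("01M001", "01M002", "01M003"),
  total = c(1210, 1350, 1540),
  stringsAsFactors = FALSE
)
toy_safety <- data.frame(
  DBN    = c("01M001", "01M003", "01M004"),
  crimes = c(3, 0, 7),
  stringsAsFactors = FALSE
)

# Match DBN1 in the left table to DBN in the right table
toy_joined <- inner_join(toy_scores, toy_safety, by = c("DBN1" = "DBN"))

# Schools with scores but no safety record; worth checking before trusting a merge
toy_unmatched <- anti_join(toy_scores, toy_safety, by = c("DBN1" = "DBN"))

print(toy_joined)
print(toy_unmatched)
```

An inner join silently drops schools that appear in only one table, so looking at the anti_join() result shows how many schools were lost at each merge.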
Next, I loaded the 2012 NYC General Education School Survey dataset as a CSV file and merged it with the combined SAT and safety dataset to make the data easy to plot and compare.
survey2012 <- read.csv("https://data.cityofnewyork.us/resource/xiyj-m4sj.csv", stringsAsFactors = FALSE)
colnames(survey2012)[1] <- "DBN"
survey2012 %>%
inner_join(safetyScoresMerged1) -> surveyScoresMerge
I found a useful dataset called 2011 – 2012 High School Progress Report. This dataset has information from a report on student progress at NYC public schools. Some of the useful data includes the population, percent of black and Hispanic students, percent of students with disabilities, and percent of students receiving free lunch. I added this dataset into R as a CSV. I joined the new dataset with the SAT dataset and the Safety dataset to be able to test the correlation between a few different variables. I simplified a few of the column names before testing them to make them easy to compare.
ProgressReport <- read.csv("ProgressReport.csv", stringsAsFactors = FALSE)
view(ProgressReport)
ProgressReport %>%
inner_join(survey2012) -> ProgressSurvey
# Simplified column names for the survey join
colnames(ProgressSurvey)[14] <- "PercentDisabilities"
colnames(ProgressSurvey)[15] <- "selfcontained"
colnames(ProgressSurvey)[17] <- "freelunch"
colnames(ProgressSurvey)[18] <- "blackorhispanic"
ProgressReport %>%
  inner_join(safetyScoresMerged1) -> ProgressSafety
ProgressReport %>%
  inner_join(HSsat1) -> ProgressSAT
# Same simplified names for the SAT join
colnames(ProgressSAT)[15] <- "selfcontained"
colnames(ProgressSAT)[17] <- "freelunch"
colnames(ProgressSAT)[18] <- "blackorhispanic"
My first visualization shows the average total SAT score of each New York City public high school on a map. Each point corresponds to a school’s location, and its color reflects the score range shown in the legend. The lowest scores are dark purple and the highest are yellow or lime green. In 2012, the SAT consisted of three sections (math, reading, and writing), each scored on a scale from 200 – 800, for a maximum total of 2400. The lowest average in the 2012 NYC SAT dataset is 887 and the highest is 2100. To view a school’s name and a breakdown of its average reading, math, and writing scores, click on any of the map points. Looking at the map, the standout scores (yellow or lime green) do not appear to cluster; the extremely high scores are spread out across the city. However, there do seem to be clusters of very low scores (purple and dark blue) in one section of Brooklyn and in the Bronx.
mapview(HSsat, zcol = c( "SCHOOLNAME", "Math","Reading", "Writing","total"),legend = list(FALSE,FALSE,FALSE,FALSE,TRUE)) -> satMAP
As the map suggests, geography seems to play less of a role in academics in New York City than one might expect. This is likely because of the prevalence of magnet high schools in the city. A magnet school is a public school that high-achieving students can test into. So, if an academically gifted student lives in a poor neighborhood in New York City, they are not limited to the local public school; they can commute to one of the city’s elite magnet schools. For this reason, I was unable to find much of a relationship between the highest SAT scores and geography.
One of the strongest correlations that I found amongst the four datasets examined was between the percentage of black and Hispanic students within a school and the total SAT score average of each school.
cor.test(ProgressSAT$blackorhispanic, ProgressSAT$total)
ggplot(ProgressSAT, aes(blackorhispanic, total)) +
geom_point() -> scatterplot10
scatterplot10 +
xlab("Percent of Black and Hispanic Students") +
ylab("Total SAT Score (2400)") +
ggtitle("") +
theme_minimal()
The correlation test found a value of -0.7787901 with a 95% confidence interval of -0.8250323 to -0.7221870, meaning that there is a strong, statistically significant negative correlation between the percent of black and Hispanic students and total SAT score. As the percentage of black and Hispanic students increases, the average total SAT score decreases. This correlation relates very closely to the map because New York is heavily segregated by race. The maps in this article show areas where white people make up less than 10% of the population and where black people make up less than 10% of the population. The areas where white people make up less than 10% of the population correspond to the pockets of dark purple markers on the map above, which indicate low SAT scores in parts of Brooklyn and the Bronx. Race gaps in SAT scores have been documented for many years. Part of this gap has been attributed to linguistic differences between racial groups and to the test being written in a way that favors Caucasian and Asian students.
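cor.test() returns a list, so the estimate, confidence interval, and p-value can be pulled out directly rather than copied from the printed output. Here is a minimal sketch on simulated data; the numbers are generated for illustration and are not taken from the NYC datasets.

```r
set.seed(42)

# Simulated stand-ins: a percentage, and a score that falls as the percentage rises
pct   <- runif(200, min = 0, max = 100)
score <- 1600 - 5 * pct + rnorm(200, sd = 100)

result <- cor.test(pct, score)        # Pearson correlation by default

r_value <- unname(result$estimate)    # correlation coefficient
ci      <- result$conf.int            # 95% confidence interval by default
p_value <- result$p.value

# The correlation is statistically significant at the 5% level when the
# interval excludes zero (equivalently, when p_value < 0.05)
excludes_zero <- ci[1] > 0 || ci[2] < 0

print(c(r = r_value, ci_low = ci[1], ci_high = ci[2], p = p_value))
```

Note that the size of the coefficient and its statistical significance are separate questions: a weak correlation can still be significant if the confidence interval stays on one side of zero.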
cor.test(ProgressSAT$freelunch, ProgressSAT$total)
ggplot(ProgressSAT, aes(freelunch, total)) +
geom_point() -> scatterplotlunch
scatterplotlunch +
xlab("Percent of Students Receiving Free Lunch") +
ylab("Total SAT Scores") +
ggtitle("Total SAT Scores and Percent of Students Receiving Free Lunch") +
theme_minimal()
I ran a correlation test to determine the relationship between the percent of students receiving free lunch and the total SAT score average for each high school. The result was -0.6508731 with a 95% confidence interval of -0.7196408 to -0.5694609. This means that there is a moderate, negative correlation between the percent of students receiving free lunch and total SAT scores. This correlation makes sense, because free lunch is an indicator of family income and the SAT is known to favor students from wealthy and highly educated families.
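A fitted line makes a moderate correlation like this easier to see on a scatterplot. The sketch below adds geom_smooth() with a linear model to the same kind of plot used in this post. The data are simulated and the column names are reused only for illustration, so the plot does not reproduce the real result.

```r
library(ggplot2)

set.seed(7)

# Simulated stand-in for the free lunch and SAT columns
toy <- data.frame(freelunch = runif(120, min = 10, max = 100))
toy$total <- 1700 - 5 * toy$freelunch + rnorm(120, sd = 120)

trend_plot <- ggplot(toy, aes(freelunch, total)) +
  geom_point() +
  # Straight-line fit with a shaded 95% confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  xlab("Percent of Students Receiving Free Lunch") +
  ylab("Total SAT Scores") +
  theme_minimal()

# The slope of the same linear fit, in SAT points per percentage point
slope <- unname(coef(lm(total ~ freelunch, data = toy))["freelunch"])

print(trend_plot)
print(slope)
```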
After changing the total number of crimes in school from a character to a number, I ran a correlation test to determine if there was any significant relationship between total SAT scores and the total number of crimes.
as.numeric(safetyScoresMerged1$safetyTotal) -> safetyScoresMerged1$safetyTotal
cor.test(safetyScoresMerged1$safetyTotal, safetyScoresMerged1$total)
ggplot(safetyScoresMerged1, aes(safetyTotal, total)) +
geom_point() -> scatterplot2
scatterplot2 +
xlab("Total Number of Crimes") +
ylab("Average Total SAT Score") +
ggtitle("Relationship Between Crime and SAT Scores in NYC High Schools") +
theme_minimal()
The correlation test found a result of -0.1495631 with a 95% confidence interval of -0.36556855 to 0.08172871. A correlation at or close to 0 means that there is little to no linear relationship between the two variables, and because this interval includes 0, the correlation between average SAT scores and school crime is not statistically significant. An interesting thing to note about the graph is that some of the schools with the lowest SAT scores have similar amounts of crime to the schools with the highest SAT scores. The schools with higher amounts of crime have fairly average or slightly below average scores. This goes against my hypothesis, because I expected at least a mild correlation between SAT scores and school crime.
as.numeric(sub(",", "", safetyScoresMerged1$Register, fixed = TRUE)) -> safetyScoresMerged1$Register
cor.test(safetyScoresMerged1$total, safetyScoresMerged1$Register)
I ran a correlation test to determine if there was a relationship between average total SAT score and the number of students enrolled at each high school. The correlation test returned 0.3533583 with a 95% confidence interval of 0.1358265 to 0.5383867. This is a weak, positive correlation: higher enrollment is modestly associated with higher SAT scores. Because the confidence interval excludes 0, the relationship is statistically significant, but at 0.35 it is far weaker than the correlations for race and free lunch. The result is surprising and goes against my hypothesis, since I have always associated larger public schools with average or less rigorous academics.
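The enrollment column is stored as text with thousands separators, which is why the commas are stripped before converting it to a number. sub() replaces only the first match, which is enough for enrollment figures below one million. gsub() removes every comma and is the safer default. A small sketch with made-up values:

```r
# Made-up enrollment strings, including one with two separators
enrollment_raw <- c("812", "1,204", "1,234,567")

# sub() stops after the first comma in each string
first_only <- sub(",", "", enrollment_raw, fixed = TRUE)

# gsub() removes all commas
all_commas <- gsub(",", "", enrollment_raw, fixed = TRUE)

enrollment <- as.numeric(all_commas)

print(first_only)
print(enrollment)
```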
as.numeric(ProgressSAT$Closing.the.Achievement.Gap.Points) -> ProgressSAT$Closing.the.Achievement.Gap.Points
I ran a correlation test to determine if there was a relationship between the percent of students with disabilities and the school’s average SAT score. The test produced a result of -0.4433127 with a 95% confidence interval of -0.5416118 to -0.3330536. This is a weak-to-moderate negative correlation between SAT scores and the percent of students with disabilities. Because the confidence interval excludes 0, the result is statistically significant and supports my hypothesis, although the relationship is noticeably weaker than those for race and free lunch.
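Because the same test is repeated for each factor, the results are easier to compare side by side in one table. The sketch below wraps cor.test() in a small helper. The helper name summarise_cor, the column names, and the simulated data are all hypothetical and stand in for the merged NYC data frames.

```r
set.seed(1)
n <- 150

# Simulated data: total depends on factor_a but not on factor_b
toy <- data.frame(
  factor_a = rnorm(n),
  factor_b = rnorm(n)
)
toy$total <- 1200 - 80 * toy$factor_a + rnorm(n, sd = 60)

# Run cor.test() for one predictor and return the key numbers as a one-row table
summarise_cor <- function(data, predictor, outcome = "total") {
  res <- cor.test(data[[predictor]], data[[outcome]])
  data.frame(
    predictor = predictor,
    r         = unname(res$estimate),
    ci_low    = res$conf.int[1],
    ci_high   = res$conf.int[2],
    p_value   = res$p.value
  )
}

# One row per predictor
cor_table <- do.call(
  rbind,
  lapply(c("factor_a", "factor_b"), summarise_cor, data = toy)
)
print(cor_table)
```

Reading the ci_low and ci_high columns together shows at a glance which intervals exclude zero, which is the check used throughout this post to judge significance.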
Every year, the NYC Department of Education (NYC DOE) distributes the “School Survey” to parents, teachers and students in grades 6 – 12. The survey is one of the largest of its kind ever conducted nationally.
An overview of the survey’s purpose and significance from the 2011 survey dataset:
“Survey results provide insight into a school’s learning environment and contribute a measure of diversification that goes beyond test scores on the Progress Report. NYC School Survey results contribute 10% - 15% of a school’s Progress Report grade (the exact contribution to the Progress Report is dependent on school type). Survey questions assess the community’s opinions on academic expectations, communication, engagement, and safety and respect. School leaders can use survey results to better understand their own school’s strengths and target areas for improvement. The NYC School Survey helps school leaders understand what key members of the school community say about the learning environment at each school. The information captured by the survey is designed to support a dialogue among all members of the school community about how to make the school a better place to learn. The NYC School Survey has been taken annually since its inception in 2007.”
```
as.numeric(surveyScoresMerge$total_academic_expectations) -> surveyScoresMerge$total_academic_expectations
cor.test(surveyScoresMerge$total, surveyScoresMerge$total_academic_expectations)
ggplot(surveyScoresMerge, aes(total_academic_expectations, total)) +
  geom_point() -> scatterplot8
scatterplot8 +
  xlab("Academic Expectations Score") +
  ylab("Total SAT Scores") +
  ggtitle("Total SAT Scores and Academic Expectations Score") +
  theme_minimal()
```
To compare total SAT scores and academic expectations scores, I began by converting the academic expectations score from a character to a number. Next, I ran a correlation test that returned 0.09627417 with a 95% confidence interval of -0.2139165 to 0.3888281. Because the estimate is close to 0 and the interval includes 0, there is no evidence of a correlation between a school’s academic expectations score and its average total SAT score. This does not support my hypothesis of a positive correlation. It most likely has to do with how subjective survey scores are. I was not able to find the survey questions online, so it is hard to know how accurately these scores reflect academic expectations.
A few of my hypotheses were supported, such as the negative correlations between SAT scores and both the percent of black and Hispanic students and the percent of students receiving free lunch. My hypothesis that geography would not be very relevant to SAT scores was supported by the map. The percent of students with disabilities showed a weaker negative correlation, and enrollment showed a weak positive correlation, the opposite of what I expected. Two further hypotheses were not supported because no significant correlation was found between SAT scores and either the number of crimes in school or the academic expectations score.
The external factors most strongly related to SAT scores were race and the percent of students receiving free lunch. Other factors that seem like they should be linked to SAT scores were not. These results are consistent with the SAT having some level of bias related to race and income, and suggest that it may not be a great indicator of intelligence. This could have to do with linguistic differences between racial groups and with the expensive test prep services available to those who can afford them.