In this project we shall explore the efficacy and fairness of standardized tests. Specifically, we will look into the Scholastic Aptitude Test (SAT) and how its scores correlate to demographic factors of students from New York City who have taken the test. The SAT is taken by high school students who wish to apply to college.
After cleaning and combining various datasets, exploratory analysis brought about interesting relationships between students' performance on the SAT and factors such as gender, race, perceived safety of schools, percentage of AP test takers and total enrollment.
Note: This project is part of Data Quest's Data Scientist in Python track.
For a holistic exploration of the drivers behind students' performance in the SAT, we need to look at a variety of datasets to consolidate information on a high school's demographics, SAT scores, graduation rates, class size and other factors. We will thus download the following datasets and merge them.
ap_2010.csv: Data on AP test results which can be found here. AP class_size.csv: Data on the class sizes of each high school, which can be found heredemographics.csv: Data on the demographic characteristics of each high school's student body, which can be found heregraduation.csv: Data on graduation outcomes of each class, such as graduates as a percentage of total cohort. This dataset can be accessed herehs_directory.csv: NYC High school directory including data on each high school such as school name, school type and number of students. It is available heresat_results.csv: The SAT results at the school level for graduating seniors. This dataset can be found heresurvey_all.txt: Survey data from student, parent and teacher respondents on schools, which can be accessed heresurvey_d75.txt: Survey data from student, parent and teacher respondents on schools from the 75th district%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re
data_files = ["ap_2010.csv", "class_size.csv", "demographics.csv", "graduation.csv",
"hs_directory.csv", "sat_results.csv"]
data = {}
for f in data_files:
d = pd.read_csv("schools/{0}".format(f))
data[f.replace(".csv", "")] = d
Below we take an initial look at the sat_results dataset, which includes the school name and the number of SAT test takers. It also includes the average SAT score on each of the test's different sections: reading, math and writing. A student can score up to 800 points per section, for a perfect total score of 2400. Additionally, it includes the school's DBN, a unique identifier which will be useful when merging the datasets.
print('SAT Scores dataset')
data['sat_results'].head()
Looping through the rest of the datasets, we observe that most of them also include the DBN. For those who don't, we notice that a combination of other variables (CSD, BOROUGH and SCHOOL CODE) appears to make up the identifier. We also notice that some datasets have multiple rows for the same high school. For example, in the graduation dataset, there appears to be a row for each high school's cohort year.
for dataset in data.keys():
print(data[dataset].head())
At this moment we will read in our last two datasets, which are surveys . We also concatenate both surveys into a single dataset, survey.
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0, sort=True)
print('Survey dimensions: {}'.format(survey.shape))
survey.head()
As a next step, let us select our variables of interest given that this dataset has 2,773 columns. Specifically, we shall want to keep the DBN in order to identify schools, and variables such as the number of student and teacher respondents, the school's safety and respect score based on teacher responses, the communication score, the engagement score, academic expectations score, and other metrics evaluating the performance of the school as per student, teacher and parent respondents.
survey["DBN"] = survey["dbn"]
survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11",
"eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11",
"saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11",
"eng_tot_11", "aca_tot_11"]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
survey.head()
We shall now begin cleaning our data in order to be able to merge the most important columns later on. Let us start by the unique identifier of each school, which will allow us to match elements efficiently in one dataset to elements in another.
For the columns that do not have the DBN identifier, we shall have to concatenate as strings the numbers identifying the school's CSD (Central School District), borough and school code. For this purpose, we write a function which pads the CSD numbers, given that the DBN includes this number in a 2-digit form. We then apply this function to the class_size database and compose its DBN with the padded CSD and SCHOOL CODE.
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]
def pad_csd(num):
string_representation = str(num)
if len(string_representation) > 1:
return string_representation
else:
return "0" + string_representation
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]
Next, we observe that the average score columns in the sat_results database are of type object. We will therefore change them to a numeric type in order to manipulate them. We then take an average of the three scores to simplify our analysis.
data['sat_results'].info()
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")
data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]
We also notice that in the hs_directory dataset we have a variable Location 1 which includes text describing the school's address. Moreover, within the text is included the latitude and longitude coordinates for each school. Let us extract these in order to perform a geographical analysis of score distributions. For this, we define two functions find_lat and find_lon which use regular expressions to extract this information.
data['hs_directory']['Location 1'].head()
def find_lat(loc):
coords = re.findall("\(.+, .+\)", loc)
lat = coords[0].split(",")[0].replace("(", "")
return lat
def find_lon(loc):
coords = re.findall("\(.+, .+\)", loc)
lon = coords[0].split(",")[1].replace(")", "").strip()
return lon
data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)
data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")
In order to combine our datasets, we need to aggregate or select those which have multiple rows for a same high school. The class_size dataset has various rows for the same school according to its program type and grade. We will select grade to be 09-12 given that our area of focus is high school students who take the SAT, and GEN ED as our PROGRAM TYPE. Other program types includes CTT, which is an integrated co-teaching program for students with disabilities. In the demographics and graduation datasets, we select a particular school year (2012-12) and to capture all students select All Cohort as our demographic of interest.
data['class_size'].head()
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]
class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size
data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]
data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]
Finally, we also convert to numeric values variables in the ap_2010 dataset, which were also stored as object type values. Specifically, we wish to convert the number of test takers, the number of exams taken and the number of exams with scores of 3, 4 or 5.
data['ap_2010'].info()
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']
for col in cols:
data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")
At this point, we have created unique rows with a shared unique identifier for each data set which we will use to consolidate the datasets in order to perform our analysis. We conduct this merge below, using the left merging method in order to preserve values from the sat_results database as this is our most important variable. On the other hand, we merge the class_size, demographics, survey and hs_directory using an inner join given that they have fewer null values.
for dataset in data.keys():
print(dataset)
print(data[dataset].isna().sum(axis = 0))
combined = data["sat_results"]
combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")
to_merge = ["class_size", "demographics", "survey", "hs_directory"]
for m in to_merge:
combined = combined.merge(data[m], on="DBN", how="inner")
As a next step, we take a look at the number of null values of our combined dataset. We shall impute these by using the fillna method and passing the mean of each column.
print(combined.isna().sum(axis = 0))
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
Before we being our exploratory analysis, we will map each school to its Central School District by extracting the first two digits from its DBN. This will serve as an additional variable that may present interesting patterns.
def get_first_two_chars(dbn):
return dbn[0:2]
combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)
We will begin our analysis by calculating the Pearson correlation between all the variables in our combined dataset and our main variable of interest, sat_score, to gain insight into which interact closely. We see below that, for logical reasons, the average test score in each SAT section is highly correlated to the average score. However, we also observe a relatively high positive correlation with total enrollment and other variables that have a direct relationship with this factor such as total exams taken. We also see a positive relationship with the percentage and number of females, and the percentage and number of Asian students. Additionally, there is a positive relationship with school characteristics evaluated in the surveys such as safety. On the other hand, we obsever a strong negative relationship to the percentage of English language learners at -0.72.
correlations = combined.corr()
correlations = correlations["sat_score"]
plt.figure(figsize=(10,10))
correlations.plot.barh()
plt.title('Variable correlation with SAT score')
plt.show()
Let us take a closer look at the relationship between total_enrollment and sat_score. In the figure below, we can see the slightly positive relationship between both variables. However, the relationship does not appear to be strong and there is an important cluster of points grouping the schools with the lowest enrollment numbers which seem to capture two possibilities: they can have outstanding average scores due to their personalized teaching methods, or they can also present the lowest scores in our universe of schools. Therefore, we cannot say there is a clear relationship between the two variables.
combined.plot.scatter('total_enrollment', 'sat_score', title = 'Relationship between total enrollment and SAT score')
plt.show()
Now we shall dive deeper into the correlation between sat_score and the variables stemming from the survey of each school.
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
survey_data = combined.copy()
fields = survey_fields
fields.append('sat_score')
survey_data = survey_data[fields]
correlations_survey = survey_data.corr()
correlations_survey = correlations_survey["sat_score"]
plt.figure(figsize=(10,8))
correlations_survey.plot.barh()
plt.title('Variable correlation of survey fields with SAT score')
plt.show()
From the above plot, we can once again see a positive relation to variables related to total enrollment such as N_p, the number of parent respondents, and N_s, the number of student respondents. We also see a positive relationship with aca_s_11 and saf_s_11, the safety and respect score based on student responses and the academic expectations score based on student responses. It is interesting to note that this correlation is much lower to that of saf_p_11 and aca_p_11, the safety and respect score and academic expectations score of the parents. We also observe a negative relationship with com_p_11, the communication score based on parent responses. The difference between the correlations between these student-parent variables exhibits a gap in how students and their parents perceive this high school experience.
combined.plot.scatter('saf_s_11', 'sat_score', title = 'Relationship between safety and respect score from students and high school SAT scores')
plt.show()
In the above graph it appears that for higher safety scores from students, high school's SAT performance diverges more, while performance distribution is more concentrated for lower scores. That is, schools perceived as unsafe by students appear to have a lower performance on the SATs with more certainty. To further this analysis, we will compute the average safety scores for each district and map them geographically.
district_scores = combined.copy()
district_scores = district_scores.groupby('school_dist').agg(np.mean)
district_safety_scores = district_scores[['saf_s_11', 'lon', 'lat']]
district_safety_scores.head()
import os
os.environ['PROJ_LIB'] = r'C:\ProgramData\Anaconda3\pkgs\proj4-5.2.0-ha925a31_1\Library\share'
from mpl_toolkits.basemap import Basemap
m = Basemap(
projection='merc',
llcrnrlat=40.496044,
urcrnrlat=40.915256,
llcrnrlon=-74.255735,
urcrnrlon=-73.700272,
resolution='h'
)
m.drawmapboundary(fill_color='grey')
m.drawcoastlines()
m.drawrivers(color='#6D5F47', linewidth=.4)
longitudes = district_safety_scores['lon'].tolist()
latitudes = district_safety_scores['lat'].tolist()
school_districts = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\NYC_school_district_map.jpg')
m.scatter(longitudes, latitudes, c = district_safety_scores['saf_s_11'], cmap = 'Blues', s = 20, zorder= 2, latlon = True)
plt.colorbar(label='Safety score as per student survey')
plt.title('Geographic mapping of safetey scores')
plt.show()
plt.figure(figsize = (11,11))
plt.imshow(school_districts)
plt.title('Location of NYC school districts')
plt.show()
print('Location of NYC school districts image sourced from the New York City Deparment of Education')
From the above images we can observe that the safest schools are those in the Manhattan/Bronx area, the Northern Queens/Brooklyn area and Southwestern Brooklyn area as per above. On the other hand, the least safe schools are located in the Northern Brooklyn districts, Southern Queens and the Southern Queens/Brooklyn districts.
crime_map = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\NYC_crime_map.png')
plt.figure(figsize = (15,15))
plt.imshow(crime_map)
plt.title('NYC Crime Map')
plt.show()
print('Map sourced from NYC Crime Map')
If we compare the distribution of safe schools to the previous figure, a crime map of NYC provided by the government here, we notice similar patterns indicating the overall level of safety in these districts. For instance, we note that the Northeastern and Eastern districts of Brooklyn present higher levels of crime per 1000 residents than surrounding districts, precisely in line with the map of schools. We can further examine the level of concern in terms of safety for these districts in the map below, which was sourced from Google Maps. It indicates the level of safety, with purple areas being the most unsafe.
safety_map = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\brooklyn_safety_map.png')
plt.figure(figsize = (15,15))
plt.imshow(safety_map)
plt.title('Brooklyn Safety Map')
plt.show()
print('Map sourced from Google Images')
We have four variables which indicate the percentage of each race at a particular school: white_per, asian_per, black_per and hispanic_per. Let us now take a look at their correlation to SAT scores.
race_data = combined.copy()
fields = ['sat_score', 'white_per', 'asian_per', 'black_per', 'hispanic_per']
race_data = race_data[fields]
correlations_race = race_data.corr()
correlations_race = correlations_race["sat_score"]
plt.figure(figsize=(10,8))
correlations_race.plot.barh()
plt.title('Variable correlation of race variables with SAT score')
plt.show()
In the above plot we note that there is a positive correlation between SAT scores and the percentage of white and Asian students in the school. On the other hand, there is a negative relationship between scores and percentage of black and hispanic students. It is often the case that a student's performance in school can be largely determined by the support he or she has at home and the extent to which he or she can focus on his/her studies. A student who lives in a high-crime district and needs to take on a more significant role at home would understandably see impacted performance at school. It could be the case that these different races have those favorable and disfavorable forces working in favor or against them and that is why we see these correlations.
We will now explore schools with low SAT scores and high values for the variable hispanic_per. In the plot below, we see a largely negative relationship between these two variables. However, it is interesting to see a cluster of points at the bottom left, which represent schools with a low percentage of hispanic studesnt and low SAT scores.
combined.plot.scatter('hispanic_per', 'sat_score')
plt.title('Correlation between percentage of histpanic students and SAT score')
plt.show()
To examine these schools with a high hispanic student body and low SAT scores, we will select schools with a threshold of above 95% for percentage of hispanics.
hispanic = combined.copy()
hispanic = hispanic[hispanic['hispanic_per'] > 95]
hispanic
We see that there are 8 schools with a percentage of hispanic students higher than 95%. By looking at some of the schools on this list, we note that many of them are public schools with a percentage of minority enrollment of almost 100% and percentage of students who are economically disadvantaged of above 60%. It is interesting to note that some of these schools, such as Multicultural High School, actually have a relatively low student to teacher ratio (11:1 in the case of this high school), as reported by US News (please refer to Sources).
Now, let us take a look at schools on the other end of the spectrum. Below is a table of schools with a percentage of hispanic students lower than 10% and SAT scores greater than 1800.
nonhispanic = combined.copy()
nonhispanic = nonhispanic[(nonhispanic['hispanic_per'] < 10) & (nonhispanic['sat_score'] > 1800)]
nonhispanic
From the schools above, we see most of them mostly have a technical or science focus. Moreover, these are top-rated schools. Stuyvesant High School, for example, was 25 in the National Rankings as per US News, with an overall score of 99.86. Minority enrollment was lower compared to schools in the first group, at around 80%, with economically disadvantaged students at a lower 45%. Enrollment was significantly higher compared to the first group as well, at 3,368 in the case of Stuyvesant. Moreover, the majority of students were Asian (74%) and white (18%), as per US News. In the case of the Bronx High School of Science, it was 49 in national rankings with an overall score of 99.72 and also had a student body that was majority-Asian (63%).
We now turn to investigate gender differences in SAT scores.
gender_data = combined.copy()
fields = ['sat_score', 'female_per', 'male_per']
gender_data = gender_data[fields]
correlations_gender = gender_data.corr()
correlations_gender = correlations_gender["sat_score"]
plt.figure(figsize=(10,8))
correlations_gender.plot.barh()
plt.title('Variable correlation of gender variables with SAT score')
plt.show()
We find that the percentage of female students has a positive correlation with SAT scores, while the case for percentage of males is the opposite. This is a surprising finding as the female population is usually considered to be at a disadvantage. Below we see a scatter plot of percentage of females and SAT scores, and it is interesting to find that the schools with the best performance on the test are those which have a more balanced student body, while those on the extremes (all-male, all-female) register scores that are more in line with the average of all NYC high schools.
combined.plot.scatter('female_per', 'sat_score')
plt.title('Correlation between percentage of female students and SAT score')
plt.show()
We shall perform the same exercise as with schools' hispanic population, but with female students.
females = combined.copy()
females = females.sort_values('female_per', ascending = False)
females = females[(females['sat_score'] > 1700) & (females['female_per'] > 60)]
females = females[['SCHOOL NAME', 'female_per', 'sat_score']]
females
Researching these schools, we note that Fiorello LaGuardia, for example, specializes in the visual and performing arts. The majority of students in this school (93%) feel safe inside the premises and only a small amount (15%) consider that bullying takes place, as per Inside Schools (please review Sources). Townsend Harris High School focuses on the humanities, and a large percentage of teachers (84%) indicate they would recommend this school to families (Inside Schools).
We will now examine the relationship between the percentage of students in each school who take AP exams and the school's performance on the SATs. We expect this relationship to be positive, as high school students take these Advanced Placement exams to earn college credit. We note that our dataset only includes the absolute number of students who toke these tests, so we wil have to create a new variable to reflect percentages.
combined['ap_per'] = combined['AP Test Takers '] / combined['total_enrollment']
combined.plot.scatter('ap_per', 'sat_score')
plt.title('Relationship between percentage of AP test takers and SAT scores')
plt.show()
It is interesting to see that the positive relationship one would expect is not very clear. For a group of schools we do se higher percentage of AP students and higher SAT scores, but for a second cluster of data points we do not see much influence of this variable on the performance in the SATs. This could perhaps be explained by the poor quality of AP classes in some schools; in this case one would not expect a significant positive externality on students' performance in the SATs from having to take those AP courses.
In this project we have gathered data from significant sources to look into the variables that explain high school students' performance on the SAT scores. For this purpose, we have cleaned and combined several datasets and performed exploratory analysis of our data primarily through plots. We found interesting relationships at the level of different factors: gender, race, AP test takers, total enrollment and perceived safety. While some of the relationships we found were expected, such as that between perceived safety of schools and student performance, other correlations were surprising. This was the case of the percentage of female students and total enrollment.
To further this research, one could look at a more granular level into geographic data to identify the neighborhoods that have the best schools. It could also be important to explore the relationship between SAT scores and class size.
Google. Brooklyn Safety Map. Google Maps: Link. Accessed October 31, 2019.
InsideSchools. Fiorello H. LaGuardia High School of Music & Art and Performing Arts. Link. Accessed: October 31, 2019.
InsideSchools. Townsend Harris High School. Link. Accessed: October 31, 2019.
New York City Department of Education. Map of Instructional Divisions. Link. Accessed: October 31, 2019
New York City Crime Map. Link. Accessed: October 31, 2019.
NYC Open Data. 2012 SAT Results. Link. Accessed: October 31, 2019.
NYC Open Data. 2010-2011 School Attendance and Enrollment Statistics by District. Link. Accessed: October 31, 2019.
NYC Open Data. 2010 AP (College Board) School Level Results. Link. Accessed: October 31, 2019.
NYC Open Data. 2005-2010 Graduation Outcomes - School Level. Link. Accessed: October 31, 2019.
NYC Open Data. 2006-2012 School Demographics and Accountability Snapshot. Link. Accessed: October 31, 2019.
NYC Open Data. 2011 NYC School Survey. Link. Accessed: October 31, 2019.
NYC Open Data. 2010-2011 Class Size - School-level detail. Link. Accessed: October 31, 2019.
US News. Manhattan Bridges High School. Link. Accessed: October 31, 2019.
US News. Multicultural High School. Link. Accessed: October 31, 2019.
US News. Pan American International High School. Link. Accessed: October 31, 2019.
US News. Stuyvestant High School. Link. Accessed: October 31, 2019.
US News. Bronx High School of Science. Link. Accessed: October 31, 2019.