Analyzing Data from New York City's High Schools to Evaluate SAT Performance

Andrea Vallebueno

October 31, 2019

Executive Summary

In this project we shall explore the efficacy and fairness of standardized tests. Specifically, we will look into the Scholastic Aptitude Test (SAT) and how its scores correlate to demographic factors of students from New York City who have taken the test. The SAT is taken by high school students who wish to apply to college.

After cleaning and combining various datasets, exploratory analysis brought about interesting relationships between students' performance on the SAT and factors such as gender, race, perceived safety of schools, percentage of AP test takers and total enrollment.

Note: This project is part of Data Quest's Data Scientist in Python track.

Data Overview

For a holistic exploration of the drivers behind students' performance in the SAT, we need to look at a variety of datasets to consolidate information on a high school's demographics, SAT scores, graduation rates, class size and other factors. We will thus download the following datasets and merge them.

  • ap_2010.csv: Data on AP test results which can be found here. AP
  • class_size.csv: Data on the class sizes of each high school, which can be found here
  • demographics.csv: Data on the demographic characteristics of each high school's student body, which can be found here
  • graduation.csv: Data on graduation outcomes of each class, such as graduates as a percentage of total cohort. This dataset can be accessed here
  • hs_directory.csv: NYC High school directory including data on each high school such as school name, school type and number of students. It is available here
  • sat_results.csv: The SAT results at the school level for graduating seniors. This dataset can be found here
  • survey_all.txt: Survey data from student, parent and teacher respondents on schools, which can be accessed here
  • survey_d75.txt: Survey data from student, parent and teacher respondents on schools from the 75th district

Setting the environment

In [73]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

Read in the data

Datasets

Let us begin by reading in each dataset and storing them in a dictionary called data.

In [74]:
data_files = ["ap_2010.csv", "class_size.csv", "demographics.csv", "graduation.csv",
              "hs_directory.csv", "sat_results.csv"]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Below we take an initial look at the sat_results dataset, which includes the school name and the number of SAT test takers. It also includes the average SAT score on each of the test's different sections: reading, math and writing. A student can score up to 800 points per section, for a perfect total score of 2400. Additionally, it includes the school's DBN, a unique identifier which will be useful when merging the datasets.

In [75]:
print('SAT Scores dataset')
data['sat_results'].head()
SAT Scores dataset
Out[75]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score
0 01M292 HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES 29 355 404 363
1 01M448 UNIVERSITY NEIGHBORHOOD HIGH SCHOOL 91 383 423 366
2 01M450 EAST SIDE COMMUNITY SCHOOL 70 377 402 370
3 01M458 FORSYTH SATELLITE ACADEMY 7 414 401 359
4 01M509 MARTA VALLE HIGH SCHOOL 44 390 433 384

Looping through the rest of the datasets, we observe that most of them also include the DBN. For those who don't, we notice that a combination of other variables (CSD, BOROUGH and SCHOOL CODE) appears to make up the identifier. We also notice that some datasets have multiple rows for the same high school. For example, in the graduation dataset, there appears to be a row for each high school's cohort year.

In [76]:
for dataset in  data.keys():
    print(data[dataset].head())
      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  
   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT   
2    1       M        M015  P.S. 015 Roberto Clemente     01       GEN ED   
3    1       M        M015  P.S. 015 Roberto Clemente     01          CTT   
4    1       M        M015  P.S. 015 Roberto Clemente     02       GEN ED   

  CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY)  \
0                                    -                                   -   
1                                    -                                   -   
2                                    -                                   -   
3                                    -                                   -   
4                                    -                                   -   

  SERVICE CATEGORY(K-9* ONLY)  NUMBER OF STUDENTS / SEATS FILLED  \
0                           -                               19.0   
1                           -                               21.0   
2                           -                               17.0   
3                           -                               17.0   
4                           -                               15.0   

   NUMBER OF SECTIONS  AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  \
0                 1.0                19.0                    19.0   
1                 1.0                21.0                    21.0   
2                 1.0                17.0                    17.0   
3                 1.0                17.0                    17.0   
4                 1.0                15.0                    15.0   

   SIZE OF LARGEST CLASS DATA SOURCE  SCHOOLWIDE PUPIL-TEACHER RATIO  
0                   19.0         ATS                             NaN  
1                   21.0         ATS                             NaN  
2                   17.0         ATS                             NaN  
3                   17.0         ATS                             NaN  
4                   15.0         ATS                             NaN  
      DBN                       Name  schoolyear fl_percent  frl_percent  \
0  01M015  P.S. 015 ROBERTO CLEMENTE    20052006       89.4          NaN   
1  01M015  P.S. 015 ROBERTO CLEMENTE    20062007       89.4          NaN   
2  01M015  P.S. 015 ROBERTO CLEMENTE    20072008       89.4          NaN   
3  01M015  P.S. 015 ROBERTO CLEMENTE    20082009       89.4          NaN   
4  01M015  P.S. 015 ROBERTO CLEMENTE    20092010                    96.5   

   total_enrollment prek   k grade1 grade2  ... black_num black_per  \
0               281   15  36     40     33  ...        74      26.3   
1               243   15  29     39     38  ...        68      28.0   
2               261   18  43     39     36  ...        77      29.5   
3               252   17  37     44     32  ...        75      29.8   
4               208   16  40     28     32  ...        67      32.2   

  hispanic_num hispanic_per white_num white_per male_num male_per female_num  \
0          189         67.3         5       1.8    158.0     56.2      123.0   
1          153         63.0         4       1.6    140.0     57.6      103.0   
2          157         60.2         7       2.7    143.0     54.8      118.0   
3          149         59.1         7       2.8    149.0     59.1      103.0   
4          118         56.7         6       2.9    124.0     59.6       84.0   

  female_per  
0       43.8  
1       42.4  
2       45.2  
3       40.9  
4       40.4  

[5 rows x 38 columns]
    Demographic     DBN                            School Name    Cohort  \
0  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2003   
1  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2004   
2  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2005   
3  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2006   
4  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL  2006 Aug   

   Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
0             5               s                         s                 s   
1            55              37                     67.3%                17   
2            64              43                     67.2%                27   
3            78              43                     55.1%                36   
4            78              44                     56.4%                37   

  Total Regents - % of cohort Total Regents - % of grads  ...  \
0                           s                          s  ...   
1                       30.9%                      45.9%  ...   
2                       42.2%                      62.8%  ...   
3                       46.2%                      83.7%  ...   
4                       47.4%                      84.1%  ...   

  Regents w/o Advanced - n Regents w/o Advanced - % of cohort  \
0                        s                                  s   
1                       17                              30.9%   
2                       27                              42.2%   
3                       36                              46.2%   
4                       37                              47.4%   

  Regents w/o Advanced - % of grads Local - n Local - % of cohort  \
0                                 s         s                   s   
1                             45.9%        20               36.4%   
2                             62.8%        16                 25%   
3                             83.7%         7                  9%   
4                             84.1%         7                  9%   

    Local - % of grads Still Enrolled - n Still Enrolled - % of cohort  \
0                    s                  s                            s   
1                54.1%                 15                        27.3%   
2  37.200000000000003%                  9                        14.1%   
3                16.3%                 16                        20.5%   
4                15.9%                 15                        19.2%   

  Dropped Out - n Dropped Out - % of cohort  
0               s                         s  
1               3                      5.5%  
2               9                     14.1%  
3              11                     14.1%  
4              11                     14.1%  

[5 rows x 23 columns]
      dbn                                        school_name       boro  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262              9              12   
1          X400    718-842-0687  718-589-9849              9              12   
2          X240    718-294-8111  718-294-8109              6              12   
3          M520  718-935-3477             NaN              9              10   
4          Q695    718-969-3155  718-969-3552              6              12   

  expgrade_span_min  expgrade_span_max  ...  \
0               NaN                NaN  ...   
1               NaN                NaN  ...   
2               NaN                NaN  ...   
3                 9               14.0  ...   
4               NaN                NaN  ...   

                                          priority02  \
0                    Then to New York City residents   
1  Then to New York City residents who attend an ...   
2  Then to Bronx students or residents who attend...   
3  Then to New York City residents who attend an ...   
4  Then to Districts 28 and 29 students or residents   

                                          priority03  \
0                                                NaN   
1                Then to Bronx students or residents   
2  Then to New York City residents who attend an ...   
3          Then to Manhattan students or residents     
4               Then to Queens students or residents   

                            priority04                       priority05  \
0                                  NaN                              NaN   
1      Then to New York City residents                              NaN   
2  Then to Bronx students or residents  Then to New York City residents   
3      Then to New York City residents                              NaN   
4      Then to New York City residents                              NaN   

  priority06  priority07 priority08  priority09 priority10  \
0        NaN         NaN        NaN         NaN        NaN   
1        NaN         NaN        NaN         NaN        NaN   
2        NaN         NaN        NaN         NaN        NaN   
3        NaN         NaN        NaN         NaN        NaN   
4        NaN         NaN        NaN         NaN        NaN   

                                          Location 1  
0  883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...  
1  1110 Boston Road\nBronx, NY 10456\n(40.8276026...  
2  1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...  
3  411 Pearl Street\nNew York, NY 10038\n(40.7106...  
4  160-20 Goethals Avenue\nJamaica, NY 11432\n(40...  

[5 rows x 58 columns]
      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   
2  01M450                     EAST SIDE COMMUNITY SCHOOL   
3  01M458                      FORSYTH SATELLITE ACADEMY   
4  01M509                        MARTA VALLE HIGH SCHOOL   

  Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score  \
0                     29                             355                 404   
1                     91                             383                 423   
2                     70                             377                 402   
3                      7                             414                 401   
4                     44                             390                 433   

  SAT Writing Avg. Score  
0                    363  
1                    366  
2                    370  
3                    359  
4                    384  

Surveys

At this moment we will read in our last two datasets, which are surveys . We also concatenate both surveys into a single dataset, survey.

In [77]:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0, sort=True)
print('Survey dimensions: {}'.format(survey.shape))
survey.head()
Survey dimensions: (1702, 2773)
Out[77]:
N_p N_s N_t aca_p_11 aca_s_11 aca_t_11 aca_tot_11 bn com_p_11 com_s_11 ... t_q8c_1 t_q8c_2 t_q8c_3 t_q8c_4 t_q9 t_q9_1 t_q9_2 t_q9_3 t_q9_4 t_q9_5
0 90.0 NaN 22.0 7.8 NaN 7.9 7.9 M015 7.6 NaN ... 29.0 67.0 5.0 0.0 NaN 5.0 14.0 52.0 24.0 5.0
1 161.0 NaN 34.0 7.8 NaN 9.1 8.4 M019 7.6 NaN ... 74.0 21.0 6.0 0.0 NaN 3.0 6.0 3.0 78.0 9.0
2 367.0 NaN 42.0 8.6 NaN 7.5 8.0 M020 8.3 NaN ... 33.0 35.0 20.0 13.0 NaN 3.0 5.0 16.0 70.0 5.0
3 151.0 145.0 29.0 8.5 7.4 7.8 7.9 M034 8.2 5.9 ... 21.0 45.0 28.0 7.0 NaN 0.0 18.0 32.0 39.0 11.0
4 90.0 NaN 23.0 7.9 NaN 8.1 8.0 M063 7.9 NaN ... 59.0 36.0 5.0 0.0 NaN 10.0 5.0 10.0 60.0 15.0

5 rows × 2773 columns

As a next step, let us select our variables of interest given that this dataset has 2,773 columns. Specifically, we shall want to keep the DBN in order to identify schools, and variables such as the number of student and teacher respondents, the school's safety and respect score based on teacher responses, the communication score, the engagement score, academic expectations score, and other metrics evaluating the performance of the school as per student, teacher and parent respondents.

In [78]:
survey["DBN"] = survey["dbn"]

survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", 
                 "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", 
                 "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", 
                 "eng_tot_11", "aca_tot_11"]
survey = survey.loc[:,survey_fields]
data["survey"] = survey
In [79]:
survey.head()
Out[79]:
DBN rr_s rr_t rr_p N_s N_t N_p saf_p_11 com_p_11 eng_p_11 ... eng_t_11 aca_t_11 saf_s_11 com_s_11 eng_s_11 aca_s_11 saf_tot_11 com_tot_11 eng_tot_11 aca_tot_11
0 01M015 NaN 88 60 NaN 22.0 90.0 8.5 7.6 7.5 ... 7.6 7.9 NaN NaN NaN NaN 8.0 7.7 7.5 7.9
1 01M019 NaN 100 60 NaN 34.0 161.0 8.4 7.6 7.6 ... 8.9 9.1 NaN NaN NaN NaN 8.5 8.1 8.2 8.4
2 01M020 NaN 88 73 NaN 42.0 367.0 8.9 8.3 8.3 ... 6.8 7.5 NaN NaN NaN NaN 8.2 7.3 7.5 8.0
3 01M034 89.0 73 50 145.0 29.0 151.0 8.8 8.2 8.0 ... 6.8 7.8 6.2 5.9 6.5 7.4 7.3 6.7 7.1 7.9
4 01M063 NaN 100 60 NaN 23.0 90.0 8.7 7.9 8.1 ... 7.8 8.1 NaN NaN NaN NaN 8.5 7.6 7.9 8.0

5 rows × 23 columns

Cleaning the data

We shall now begin cleaning our data in order to be able to merge the most important columns later on. Let us start by the unique identifier of each school, which will allow us to match elements efficiently in one dataset to elements in another.

Add DBN columns

For the columns that do not have the DBN identifier, we shall have to concatenate as strings the numbers identifying the school's CSD (Central School District), borough and school code. For this purpose, we write a function which pads the CSD numbers, given that the DBN includes this number in a 2-digit form. We then apply this function to the class_size database and compose its DBN with the padded CSD and SCHOOL CODE.

In [80]:
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

Convert columns to numeric

Next, we observe that the average score columns in the sat_results database are of type object. We will therefore change them to a numeric type in order to manipulate them. We then take an average of the three scores to simplify our analysis.

In [81]:
data['sat_results'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 6 columns):
DBN                                478 non-null object
SCHOOL NAME                        478 non-null object
Num of SAT Test Takers             478 non-null object
SAT Critical Reading Avg. Score    478 non-null object
SAT Math Avg. Score                478 non-null object
SAT Writing Avg. Score             478 non-null object
dtypes: object(6)
memory usage: 22.5+ KB
In [82]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

We also notice that in the hs_directory dataset we have a variable Location 1 which includes text describing the school's address. Moreover, within the text is included the latitude and longitude coordinates for each school. Let us extract these in order to perform a geographical analysis of score distributions. For this, we define two functions find_lat and find_lon which use regular expressions to extract this information.

In [83]:
data['hs_directory']['Location 1'].head()
Out[83]:
0    883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1    1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2    1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3    411 Pearl Street\nNew York, NY 10038\n(40.7106...
4    160-20 Goethals Avenue\nJamaica, NY 11432\n(40...
Name: Location 1, dtype: object
In [84]:
def find_lat(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

Selecting our data

In order to combine our datasets, we need to aggregate or select those which have multiple rows for a same high school. The class_size dataset has various rows for the same school according to its program type and grade. We will select grade to be 09-12 given that our area of focus is high school students who take the SAT, and GEN ED as our PROGRAM TYPE. Other program types includes CTT, which is an integrated co-teaching program for students with disabilities. In the demographics and graduation datasets, we select a particular school year (2012-12) and to capture all students select All Cohort as our demographic of interest.

In [85]:
data['class_size'].head()
Out[85]:
CSD BOROUGH SCHOOL CODE SCHOOL NAME GRADE PROGRAM TYPE CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY) SERVICE CATEGORY(K-9* ONLY) NUMBER OF STUDENTS / SEATS FILLED NUMBER OF SECTIONS AVERAGE CLASS SIZE SIZE OF SMALLEST CLASS SIZE OF LARGEST CLASS DATA SOURCE SCHOOLWIDE PUPIL-TEACHER RATIO padded_csd DBN
0 1 M M015 P.S. 015 Roberto Clemente 0K GEN ED - - - 19.0 1.0 19.0 19.0 19.0 ATS NaN 01 01M015
1 1 M M015 P.S. 015 Roberto Clemente 0K CTT - - - 21.0 1.0 21.0 21.0 21.0 ATS NaN 01 01M015
2 1 M M015 P.S. 015 Roberto Clemente 01 GEN ED - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN 01 01M015
3 1 M M015 P.S. 015 Roberto Clemente 01 CTT - - - 17.0 1.0 17.0 17.0 17.0 ATS NaN 01 01M015
4 1 M M015 P.S. 015 Roberto Clemente 02 GEN ED - - - 15.0 1.0 15.0 15.0 15.0 ATS NaN 01 01M015
In [86]:
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

Convert AP scores to numeric

Finally, we also convert to numeric values variables in the ap_2010 dataset, which were also stored as object type values. Specifically, we wish to convert the number of test takers, the number of exams taken and the number of exams with scores of 3, 4 or 5.

In [87]:
data['ap_2010'].info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258 entries, 0 to 257
Data columns (total 5 columns):
DBN                                     258 non-null object
SchoolName                              258 non-null object
AP Test Takers                          258 non-null object
Total Exams Taken                       258 non-null object
Number of Exams with scores 3 4 or 5    258 non-null object
dtypes: object(5)
memory usage: 10.2+ KB
In [88]:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

Combine the datasets

At this point, we have created unique rows with a shared unique identifier for each data set which we will use to consolidate the datasets in order to perform our analysis. We conduct this merge below, using the left merging method in order to preserve values from the sat_results database as this is our most important variable. On the other hand, we merge the class_size, demographics, survey and hs_directory using an inner join given that they have fewer null values.

In [89]:
for dataset in data.keys():
    print(dataset)
    print(data[dataset].isna().sum(axis = 0))
ap_2010
DBN                                       0
SchoolName                                0
AP Test Takers                           25
Total Exams Taken                        25
Number of Exams with scores 3 4 or 5    107
dtype: int64
class_size
DBN                                    0
CSD                                    0
NUMBER OF STUDENTS / SEATS FILLED      0
NUMBER OF SECTIONS                     0
AVERAGE CLASS SIZE                     0
SIZE OF SMALLEST CLASS                 0
SIZE OF LARGEST CLASS                  0
SCHOOLWIDE PUPIL-TEACHER RATIO       583
dtype: int64
demographics
DBN                     0
Name                    0
schoolyear              0
fl_percent           1509
frl_percent             0
total_enrollment        0
prek                    0
k                       0
grade1                  0
grade2                  0
grade3                  0
grade4                  0
grade5                  0
grade6                  0
grade7                  0
grade8                  0
grade9                  0
grade10                 0
grade11                 0
grade12                 0
ell_num                 0
ell_percent             0
sped_num                0
sped_percent            0
ctt_num                 0
selfcontained_num       0
asian_num               0
asian_per               0
black_num               0
black_per               0
hispanic_num            0
hispanic_per            0
white_num               0
white_per               0
male_num                0
male_per                0
female_num              0
female_per              0
dtype: int64
graduation
Demographic                           0
DBN                                   0
School Name                           0
Cohort                                0
Total Cohort                          0
Total Grads - n                       0
Total Grads - % of cohort             0
Total Regents - n                     0
Total Regents - % of cohort           0
Total Regents - % of grads            0
Advanced Regents - n                  0
Advanced Regents - % of cohort        0
Advanced Regents - % of grads         0
Regents w/o Advanced - n              0
Regents w/o Advanced - % of cohort    0
Regents w/o Advanced - % of grads     0
Local - n                             0
Local - % of cohort                   0
Local - % of grads                    0
Still Enrolled - n                    0
Still Enrolled - % of cohort          0
Dropped Out - n                       0
Dropped Out - % of cohort             0
dtype: int64
hs_directory
dbn                0
school_name        0
boro               0
building_code      0
phone_number       0
                ... 
priority10       434
Location 1         0
DBN                0
lat                0
lon                0
Length: 61, dtype: int64
sat_results
DBN                                 0
SCHOOL NAME                         0
Num of SAT Test Takers              0
SAT Critical Reading Avg. Score    57
SAT Math Avg. Score                57
SAT Writing Avg. Score             57
sat_score                          57
dtype: int64
survey
DBN             0
rr_s          661
rr_t            0
rr_p            0
N_s           666
N_t             2
N_p             6
saf_p_11        6
com_p_11        6
eng_p_11        6
aca_p_11        6
saf_t_11        2
com_t_11        2
eng_t_11        2
aca_t_11        2
saf_s_11      666
com_s_11      666
eng_s_11      666
aca_s_11      666
saf_tot_11      0
com_tot_11      0
eng_tot_11      0
aca_tot_11      0
dtype: int64
In [90]:
combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

As a next step, we take a look at the number of null values of our combined dataset. We shall impute these by using the fillna method and passing the mean of each column.

In [91]:
print(combined.isna().sum(axis = 0))
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)
DBN                                  0
SCHOOL NAME                          0
Num of SAT Test Takers               0
SAT Critical Reading Avg. Score     28
SAT Math Avg. Score                 28
                                  ... 
priority09                         363
priority10                         363
Location 1                           0
lat                                  0
lon                                  0
Length: 159, dtype: int64

Add a school district column for mapping

Before we being our exploratory analysis, we will map each school to its Central School District by extracting the first two digits from its DBN. This will serve as an additional variable that may present interesting patterns.

In [92]:
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

Exploratory data analysis

Correlation analysis

We will begin our analysis by calculating the Pearson correlation between all the variables in our combined dataset and our main variable of interest, sat_score, to gain insight into which interact closely. We see below that, for logical reasons, the average test score in each SAT section is highly correlated to the average score. However, we also observe a relatively high positive correlation with total enrollment and other variables that have a direct relationship with this factor such as total exams taken. We also see a positive relationship with the percentage and number of females, and the percentage and number of Asian students. Additionally, there is a positive relationship with school characteristics evaluated in the surveys such as safety. On the other hand, we obsever a strong negative relationship to the percentage of English language learners at -0.72.

In [93]:
correlations = combined.corr()
correlations = correlations["sat_score"]
plt.figure(figsize=(10,10))
correlations.plot.barh()
plt.title('Variable correlation with SAT score')
plt.show()

Let us take a closer look at the relationship between total_enrollment and sat_score. In the figure below, we can see the slightly positive relationship between both variables. However, the relationship does not appear to be strong and there is an important cluster of points grouping the schools with the lowest enrollment numbers which seem to capture two possibilities: they can have outstanding average scores due to their personalized teaching methods, or they can also present the lowest scores in our universe of schools. Therefore, we cannot say there is a clear relationship between the two variables.

In [94]:
combined.plot.scatter('total_enrollment', 'sat_score', title = 'Relationship between total enrollment and SAT score')
plt.show()

Correlation to school survey metrics

Now we shall dive deeper into the correlation between sat_score and the variables stemming from the survey of each school.

In [95]:
# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")
In [96]:
survey_data = combined.copy()
fields = survey_fields
fields.append('sat_score')
survey_data = survey_data[fields]
correlations_survey = survey_data.corr()
correlations_survey = correlations_survey["sat_score"]

plt.figure(figsize=(10,8))
correlations_survey.plot.barh()
plt.title('Variable correlation of survey fields with SAT score')
plt.show()

From the above plot, we can once again see a positive relation to variables related to total enrollment such as N_p, the number of parent respondents, and N_s, the number of student respondents. We also see a positive relationship with aca_s_11 and saf_s_11, the safety and respect score based on student responses and the academic expectations score based on student responses. It is interesting to note that this correlation is much lower to that of saf_p_11 and aca_p_11, the safety and respect score and academic expectations score of the parents. We also observe a negative relationship with com_p_11, the communication score based on parent responses. The difference between the correlations between these student-parent variables exhibits a gap in how students and their parents perceive this high school experience.

In [97]:
combined.plot.scatter('saf_s_11', 'sat_score', title = 'Relationship between safety and respect score from students and high school SAT scores')
plt.show()

In the above graph it appears that for higher safety scores from students, high school's SAT performance diverges more, while performance distribution is more concentrated for lower scores. That is, schools perceived as unsafe by students appear to have a lower performance on the SATs with more certainty. To further this analysis, we will compute the average safety scores for each district and map them geographically.

In [119]:
district_scores = combined.copy()
district_scores = district_scores.groupby('school_dist').agg(np.mean)
district_safety_scores = district_scores[['saf_s_11', 'lon', 'lat']]
district_safety_scores.head()
Out[119]:
saf_s_11 lon lat
school_dist
01 6.768611 -73.982377 40.719022
02 6.910660 -73.991386 40.739699
03 6.716667 -73.977370 40.781574
04 6.885714 -73.943215 40.793449
05 6.314286 -73.949251 40.817077
In [128]:
import os
os.environ['PROJ_LIB'] = r'C:\ProgramData\Anaconda3\pkgs\proj4-5.2.0-ha925a31_1\Library\share'
from mpl_toolkits.basemap import Basemap
m = Basemap(
    projection='merc', 
    llcrnrlat=40.496044, 
    urcrnrlat=40.915256, 
    llcrnrlon=-74.255735, 
    urcrnrlon=-73.700272,
    resolution='h'
)

m.drawmapboundary(fill_color='grey')
m.drawcoastlines()
m.drawrivers(color='#6D5F47', linewidth=.4)

longitudes = district_safety_scores['lon'].tolist()
latitudes = district_safety_scores['lat'].tolist()

school_districts = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\NYC_school_district_map.jpg')

m.scatter(longitudes, latitudes, c = district_safety_scores['saf_s_11'], cmap = 'Blues', s = 20, zorder= 2, latlon = True)
plt.colorbar(label='Safety score as per student survey')
plt.title('Geographic mapping of safetey scores')
plt.show()

plt.figure(figsize = (11,11))
plt.imshow(school_districts)
plt.title('Location of NYC school districts')
plt.show()
print('Location of NYC school districts image sourced from the New York City Deparment of Education')
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: MatplotlibDeprecationWarning: 
The dedent function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use inspect.cleandoc instead.
  # Remove the CWD from sys.path while we load stuff.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:15: MatplotlibDeprecationWarning: 
The dedent function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use inspect.cleandoc instead.
  from ipykernel import kernelapp as app
Location of NYC school districts image sourced from the New York City Deparment of Education

From the above images we can observe that the safest schools are those in the Manhattan/Bronx area, the Northern Queens/Brooklyn area and Southwestern Brooklyn area as per above. On the other hand, the least safe schools are located in the Northern Brooklyn districts, Southern Queens and the Southern Queens/Brooklyn districts.

In [136]:
crime_map = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\NYC_crime_map.png')

plt.figure(figsize = (15,15))
plt.imshow(crime_map)
plt.title('NYC Crime Map')
plt.show()
print('Map sourced from NYC Crime Map')
Map sourced from NYC Crime Map

If we compare the distribution of safe schools to the previous figure, a crime map of NYC provided by the government here, we notice similar patterns indicating the overall level of safety in these districts. For instance, we note that the Northeastern and Eastern districts of Brooklyn present higher levels of crime per 1000 residents than surrounding districts, precisely in line with the map of schools. We can further examine the level of concern in terms of safety for these districts in the map below, which was sourced from Google Maps. It indicates the level of safety, with purple areas being the most unsafe.

In [135]:
safety_map = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\brooklyn_safety_map.png')

plt.figure(figsize = (15,15))
plt.imshow(safety_map)
plt.title('Brooklyn Safety Map')
plt.show()
print('Map sourced from Google Images')
Map sourced from Google Images

Correlation to race

We have four variables which indicate the percentage of each race at a particular school: white_per, asian_per, black_per and hispanic_per. Let us now take a look at their correlation to SAT scores.

In [140]:
race_data = combined.copy()
fields = ['sat_score', 'white_per', 'asian_per', 'black_per', 'hispanic_per']
race_data = race_data[fields]
correlations_race = race_data.corr()
correlations_race = correlations_race["sat_score"]

plt.figure(figsize=(10,8))
correlations_race.plot.barh()
plt.title('Variable correlation of race variables with SAT score')
plt.show()

In the above plot we note that there is a positive correlation between SAT scores and the percentage of white and Asian students in the school. On the other hand, there is a negative relationship between scores and percentage of black and hispanic students. It is often the case that a student's performance in school can be largely determined by the support he or she has at home and the extent to which he or she can focus on his/her studies. A student who lives in a high-crime district and needs to take on a more significant role at home would understandably see impacted performance at school. It could be the case that these different races have those favorable and disfavorable forces working in favor or against them and that is why we see these correlations.

We will now explore schools with low SAT scores and high values for the variable hispanic_per. In the plot below, we see a largely negative relationship between these two variables. However, it is interesting to see a cluster of points at the bottom left, which represent schools with a low percentage of hispanic studesnt and low SAT scores.

In [141]:
combined.plot.scatter('hispanic_per', 'sat_score')
plt.title('Correlation between percentage of histpanic students and SAT score')
plt.show()

To examine these schools with a high hispanic student body and low SAT scores, we will select schools with a threshold of above 95% for percentage of hispanics.

In [148]:
hispanic = combined.copy()
hispanic = hispanic[hispanic['hispanic_per'] > 95]
hispanic
Out[148]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority05 priority06 priority07 priority08 priority09 priority10 Location 1 lat lon school_dist
44 02M542 MANHATTAN BRIDGES HIGH SCHOOL 66 336.0 378.0 344.0 1058.0 Manhattan Bridges High School 67.000000 102.000000 ... 0 0 0 0 0 0 525 West 50Th Street\nNew York, NY 10019\n(40.... 40.765027 -73.992517 02
82 06M348 WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL 70 380.0 395.0 399.0 1174.0 0 129.028846 197.038462 ... Then to New York City residents 0 0 0 0 0 511 West 182Nd Street\nNew York, NY 10033\n(40... 40.848879 -73.930807 06
89 06M552 GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M... 56 339.0 349.0 326.0 1014.0 GREGORIO LUPERON HS SCI & MATH 88.000000 138.000000 ... 0 0 0 0 0 0 501 West 165Th\nNew York, NY 10032\n(40.838032... 40.838032 -73.938371 06
125 09X365 ACADEMY FOR LANGUAGE AND TECHNOLOGY 54 315.0 339.0 297.0 951.0 Academy for Language and Technology 20.000000 20.000000 ... 0 0 0 0 0 0 1700 Macombs Road\nBronx, NY 10453\n(40.849102... 40.849102 -73.916088 09
141 10X342 INTERNATIONAL SCHOOL FOR LIBERAL ARTS 49 300.0 333.0 301.0 934.0 International School for Liberal Arts 55.000000 73.000000 ... 0 0 0 0 0 0 2780 Reservoir Avenue\nBronx, NY 10468\n(40.87... 40.870377 -73.898163 10
176 12X388 PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE 30 321.0 351.0 298.0 970.0 0 129.028846 197.038462 ... 0 0 0 0 0 0 1300 Boynton Avenue\nBronx, NY 10472\n(40.8313... 40.831366 -73.878823 12
253 19K583 MULTICULTURAL HIGH SCHOOL 29 279.0 322.0 286.0 887.0 Multicultural High School 44.000000 44.000000 ... 0 0 0 0 0 0 999 Jamaica Avenue\nBrooklyn, NY 11208\n(40.69... 40.691144 -73.868426 19
286 24Q296 PAN AMERICAN INTERNATIONAL HIGH SCHOOL 55 317.0 323.0 311.0 951.0 0 129.028846 197.038462 ... 0 0 0 0 0 0 45-10 94Th Street\nElmhurst, NY 11373\n(40.743... 40.743303 -73.870575 24

8 rows × 160 columns

We see that there are 8 schools with a percentage of hispanic students higher than 95%. By looking at some of the schools on this list, we note that many of them are public schools with a percentage of minority enrollment of almost 100% and percentage of students who are economically disadvantaged of above 60%. It is interesting to note that some of these schools, such as Multicultural High School, actually have a relatively low student to teacher ratio (11:1 in the case of this high school), as reported by US News (please refer to Sources).

Now, let us take a look at schools on the other end of the spectrum. Below is a table of schools with a percentage of hispanic students lower than 10% and SAT scores greater than 1800.

In [149]:
nonhispanic = combined.copy()
nonhispanic = nonhispanic[(nonhispanic['hispanic_per'] < 10) & (nonhispanic['sat_score'] > 1800)]
nonhispanic
Out[149]:
DBN SCHOOL NAME Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score SAT Writing Avg. Score sat_score SchoolName AP Test Takers Total Exams Taken ... priority05 priority06 priority07 priority08 priority09 priority10 Location 1 lat lon school_dist
37 02M475 STUYVESANT HIGH SCHOOL 832 679.0 735.0 682.0 2096.0 STUYVESANT HS 1510.0 2819.0 ... 0 0 0 0 0 0 345 Chambers Street\nNew York, NY 10282\n(40.7... 40.717746 -74.014049 02
151 10X445 BRONX HIGH SCHOOL OF SCIENCE 731 632.0 688.0 649.0 1969.0 BRONX HS OF SCIENCE 1190.0 2435.0 ... 0 0 0 0 0 0 75 West 205 Street\nBronx, NY 10468\n(40.87705... 40.877056 -73.889780 10
187 13K430 BROOKLYN TECHNICAL HIGH SCHOOL 1277 587.0 659.0 587.0 1833.0 BROOKLYN TECHNICAL HS 2117.0 3692.0 ... 0 0 0 0 0 0 29 Ft Greene Place\nBrooklyn, NY 11217\n(40.68... 40.688107 -73.976745 13
327 28Q687 QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO... 121 612.0 660.0 596.0 1868.0 Queens HS for Science York Colllege 215.0 338.0 ... 0 0 0 0 0 0 94-50 159 Street\nJamaica, NY 11433\n(40.70099... 40.700999 -73.798154 28
356 31R605 STATEN ISLAND TECHNICAL HIGH SCHOOL 227 635.0 682.0 636.0 1953.0 STATEN ISLAND TECHNICAL HS 528.0 905.0 ... 0 0 0 0 0 0 485 Clawson Street\nStaten Island, NY 10306\n(... 40.567913 -74.115362 31

5 rows × 160 columns

From the schools above, we see most of them mostly have a technical or science focus. Moreover, these are top-rated schools. Stuyvesant High School, for example, was 25 in the National Rankings as per US News, with an overall score of 99.86. Minority enrollment was lower compared to schools in the first group, at around 80%, with economically disadvantaged students at a lower 45%. Enrollment was significantly higher compared to the first group as well, at 3,368 in the case of Stuyvesant. Moreover, the majority of students were Asian (74%) and white (18%), as per US News. In the case of the Bronx High School of Science, it was 49 in national rankings with an overall score of 99.72 and also had a student body that was majority-Asian (63%).

Correlation to gender

We now turn to investigate gender differences in SAT scores.

In [150]:
gender_data = combined.copy()
fields = ['sat_score', 'female_per', 'male_per']
gender_data = gender_data[fields]
correlations_gender = gender_data.corr()
correlations_gender = correlations_gender["sat_score"]

plt.figure(figsize=(10,8))
correlations_gender.plot.barh()
plt.title('Variable correlation of gender variables with SAT score')
plt.show()

We find that the percentage of female students has a positive correlation with SAT scores, while the case for percentage of males is the opposite. This is a surprising finding as the female population is usually considered to be at a disadvantage. Below we see a scatter plot of percentage of females and SAT scores, and it is interesting to find that the schools with the best performance on the test are those which have a more balanced student body, while those on the extremes (all-male, all-female) register scores that are more in line with the average of all NYC high schools.

In [156]:
combined.plot.scatter('female_per', 'sat_score')
plt.title('Correlation between percentage of female students and SAT score')
plt.show()

We shall perform the same exercise as with schools' hispanic population, but with female students.

In [158]:
females = combined.copy()
females = females.sort_values('female_per', ascending = False)
females = females[(females['sat_score'] > 1700) & (females['female_per'] > 60)]
females = females[['SCHOOL NAME', 'female_per', 'sat_score']]
females
Out[158]:
SCHOOL NAME female_per sat_score
61 FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A... 73.6 1707.0
302 TOWNSEND HARRIS HIGH SCHOOL 71.1 1910.0
5 BARD HIGH SCHOOL EARLY COLLEGE 68.7 1856.0
26 ELEANOR ROOSEVELT HIGH SCHOOL 67.5 1758.0
60 BEACON HIGH SCHOOL 61.0 1744.0

Researching these schools, we note that Fiorello LaGuardia, for example, specializes in the visual and performing arts. The majority of students in this school (93%) feel safe inside the premises and only a small amount (15%) consider that bullying takes place, as per Inside Schools (please review Sources). Townsend Harris High School focuses on the humanities, and a large percentage of teachers (84%) indicate they would recommend this school to families (Inside Schools).

Correlation to AP testing

We will now examine the relationship between the percentage of students in each school who take AP exams and the school's performance on the SATs. We expect this relationship to be positive, as high school students take these Advanced Placement exams to earn college credit. We note that our dataset only includes the absolute number of students who toke these tests, so we wil have to create a new variable to reflect percentages.

In [161]:
combined['ap_per'] = combined['AP Test Takers '] / combined['total_enrollment']

combined.plot.scatter('ap_per', 'sat_score')
plt.title('Relationship between percentage of AP test takers and SAT scores')
plt.show()

It is interesting to see that the positive relationship one would expect is not very clear. For a group of schools we do se higher percentage of AP students and higher SAT scores, but for a second cluster of data points we do not see much influence of this variable on the performance in the SATs. This could perhaps be explained by the poor quality of AP classes in some schools; in this case one would not expect a significant positive externality on students' performance in the SATs from having to take those AP courses.

Conclusions

In this project we have gathered data from significant sources to look into the variables that explain high school students' performance on the SAT scores. For this purpose, we have cleaned and combined several datasets and performed exploratory analysis of our data primarily through plots. We found interesting relationships at the level of different factors: gender, race, AP test takers, total enrollment and perceived safety. While some of the relationships we found were expected, such as that between perceived safety of schools and student performance, other correlations were surprising. This was the case of the percentage of female students and total enrollment.

To further this research, one could look at a more granular level into geographic data to identify the neighborhoods that have the best schools. It could also be important to explore the relationship between SAT scores and class size.

Sources

Google. Brooklyn Safety Map. Google Maps: Link. Accessed October 31, 2019.

InsideSchools. Fiorello H. LaGuardia High School of Music & Art and Performing Arts. Link. Accessed: October 31, 2019.

InsideSchools. Townsend Harris High School. Link. Accessed: October 31, 2019.

New York City Department of Education. Map of Instructional Divisions. Link. Accessed: October 31, 2019

New York City Crime Map. Link. Accessed: October 31, 2019.

NYC Open Data. 2012 SAT Results. Link. Accessed: October 31, 2019.

NYC Open Data. 2010-2011 School Attendance and Enrollment Statistics by District. Link. Accessed: October 31, 2019.

NYC Open Data. 2010 AP (College Board) School Level Results. Link. Accessed: October 31, 2019.

NYC Open Data. 2005-2010 Graduation Outcomes - School Level. Link. Accessed: October 31, 2019.

NYC Open Data. 2006-2012 School Demographics and Accountability Snapshot. Link. Accessed: October 31, 2019.

NYC Open Data. 2011 NYC School Survey. Link. Accessed: October 31, 2019.

NYC Open Data. 2010-2011 Class Size - School-level detail. Link. Accessed: October 31, 2019.

US News. Manhattan Bridges High School. Link. Accessed: October 31, 2019.

US News. Multicultural High School. Link. Accessed: October 31, 2019.

US News. Pan American International High School. Link. Accessed: October 31, 2019.

US News. Stuyvestant High School. Link. Accessed: October 31, 2019.

US News. Bronx High School of Science. Link. Accessed: October 31, 2019.