Analyzing Data from New York City's High Schools to Evaluate SAT Performance¶

Andrea Vallebueno¶

October 31, 2019¶

Executive Summary¶

In this project we shall explore the efficacy and fairness of standardized tests. Specifically, we will look into the Scholastic Aptitude Test (SAT) and how its scores correlate to demographic factors of students from New York City who have taken the test. The SAT is taken by high school students who wish to apply to college.

After cleaning and combining various datasets, exploratory analysis brought about interesting relationships between students' performance on the SAT and factors such as gender, race, perceived safety of schools, percentage of AP test takers and total enrollment.

Note: This project is part of Data Quest's Data Scientist in Python track.

Data Overview¶

For a holistic exploration of the drivers behind students' performance in the SAT, we need to look at a variety of datasets to consolidate information on a high school's demographics, SAT scores, graduation rates, class size and other factors. We will thus download the following datasets and merge them.

ap_2010.csv: Data on AP test results which can be found here. AP
class_size.csv: Data on the class sizes of each high school, which can be found here
demographics.csv: Data on the demographic characteristics of each high school's student body, which can be found here
graduation.csv: Data on graduation outcomes of each class, such as graduates as a percentage of total cohort. This dataset can be accessed here
hs_directory.csv: NYC High school directory including data on each high school such as school name, school type and number of students. It is available here
sat_results.csv: The SAT results at the school level for graduating seniors. This dataset can be found here
survey_all.txt: Survey data from student, parent and teacher respondents on schools, which can be accessed here
survey_d75.txt: Survey data from student, parent and teacher respondents on schools from the 75th district

Setting the environment¶

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import re

Read in the data¶

Datasets¶

Let us begin by reading in each dataset and storing them in a dictionary called data.

data_files = ["ap_2010.csv", "class_size.csv", "demographics.csv", "graduation.csv",
              "hs_directory.csv", "sat_results.csv"]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Below we take an initial look at the sat_results dataset, which includes the school name and the number of SAT test takers. It also includes the average SAT score on each of the test's different sections: reading, math and writing. A student can score up to 800 points per section, for a perfect total score of 2400. Additionally, it includes the school's DBN, a unique identifier which will be useful when merging the datasets.

print('SAT Scores dataset')
data['sat_results'].head()

SAT Scores dataset

Looping through the rest of the datasets, we observe that most of them also include the DBN. For those who don't, we notice that a combination of other variables (CSD, BOROUGH and SCHOOL CODE) appears to make up the identifier. We also notice that some datasets have multiple rows for the same high school. For example, in the graduation dataset, there appears to be a row for each high school's cohort year.

for dataset in  data.keys():
    print(data[dataset].head())

      DBN                             SchoolName AP Test Takers   \
0  01M448           UNIVERSITY NEIGHBORHOOD H.S.              39   
1  01M450                 EAST SIDE COMMUNITY HS              19   
2  01M515                    LOWER EASTSIDE PREP              24   
3  01M539         NEW EXPLORATIONS SCI,TECH,MATH             255   
4  02M296  High School of Hospitality Management               s   

  Total Exams Taken Number of Exams with scores 3 4 or 5  
0                49                                   10  
1                21                                    s  
2                26                                   24  
3               377                                  191  
4                 s                                    s  
   CSD BOROUGH SCHOOL CODE                SCHOOL NAME GRADE  PROGRAM TYPE  \
0    1       M        M015  P.S. 015 Roberto Clemente     0K       GEN ED   
1    1       M        M015  P.S. 015 Roberto Clemente     0K          CTT   
2    1       M        M015  P.S. 015 Roberto Clemente     01       GEN ED   
3    1       M        M015  P.S. 015 Roberto Clemente     01          CTT   
4    1       M        M015  P.S. 015 Roberto Clemente     02       GEN ED   

  CORE SUBJECT (MS CORE and 9-12 ONLY) CORE COURSE (MS CORE and 9-12 ONLY)  \
0                                    -                                   -   
1                                    -                                   -   
2                                    -                                   -   
3                                    -                                   -   
4                                    -                                   -   

  SERVICE CATEGORY(K-9* ONLY)  NUMBER OF STUDENTS / SEATS FILLED  \
0                           -                               19.0   
1                           -                               21.0   
2                           -                               17.0   
3                           -                               17.0   
4                           -                               15.0   

   NUMBER OF SECTIONS  AVERAGE CLASS SIZE  SIZE OF SMALLEST CLASS  \
0                 1.0                19.0                    19.0   
1                 1.0                21.0                    21.0   
2                 1.0                17.0                    17.0   
3                 1.0                17.0                    17.0   
4                 1.0                15.0                    15.0   

   SIZE OF LARGEST CLASS DATA SOURCE  SCHOOLWIDE PUPIL-TEACHER RATIO  
0                   19.0         ATS                             NaN  
1                   21.0         ATS                             NaN  
2                   17.0         ATS                             NaN  
3                   17.0         ATS                             NaN  
4                   15.0         ATS                             NaN  
      DBN                       Name  schoolyear fl_percent  frl_percent  \
0  01M015  P.S. 015 ROBERTO CLEMENTE    20052006       89.4          NaN   
1  01M015  P.S. 015 ROBERTO CLEMENTE    20062007       89.4          NaN   
2  01M015  P.S. 015 ROBERTO CLEMENTE    20072008       89.4          NaN   
3  01M015  P.S. 015 ROBERTO CLEMENTE    20082009       89.4          NaN   
4  01M015  P.S. 015 ROBERTO CLEMENTE    20092010                    96.5   

   total_enrollment prek   k grade1 grade2  ... black_num black_per  \
0               281   15  36     40     33  ...        74      26.3   
1               243   15  29     39     38  ...        68      28.0   
2               261   18  43     39     36  ...        77      29.5   
3               252   17  37     44     32  ...        75      29.8   
4               208   16  40     28     32  ...        67      32.2   

  hispanic_num hispanic_per white_num white_per male_num male_per female_num  \
0          189         67.3         5       1.8    158.0     56.2      123.0   
1          153         63.0         4       1.6    140.0     57.6      103.0   
2          157         60.2         7       2.7    143.0     54.8      118.0   
3          149         59.1         7       2.8    149.0     59.1      103.0   
4          118         56.7         6       2.9    124.0     59.6       84.0   

  female_per  
0       43.8  
1       42.4  
2       45.2  
3       40.9  
4       40.4  

[5 rows x 38 columns]
    Demographic     DBN                            School Name    Cohort  \
0  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2003   
1  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2004   
2  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2005   
3  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL      2006   
4  Total Cohort  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL  2006 Aug   

   Total Cohort Total Grads - n Total Grads - % of cohort Total Regents - n  \
0             5               s                         s                 s   
1            55              37                     67.3%                17   
2            64              43                     67.2%                27   
3            78              43                     55.1%                36   
4            78              44                     56.4%                37   

  Total Regents - % of cohort Total Regents - % of grads  ...  \
0                           s                          s  ...   
1                       30.9%                      45.9%  ...   
2                       42.2%                      62.8%  ...   
3                       46.2%                      83.7%  ...   
4                       47.4%                      84.1%  ...   

  Regents w/o Advanced - n Regents w/o Advanced - % of cohort  \
0                        s                                  s   
1                       17                              30.9%   
2                       27                              42.2%   
3                       36                              46.2%   
4                       37                              47.4%   

  Regents w/o Advanced - % of grads Local - n Local - % of cohort  \
0                                 s         s                   s   
1                             45.9%        20               36.4%   
2                             62.8%        16                 25%   
3                             83.7%         7                  9%   
4                             84.1%         7                  9%   

    Local - % of grads Still Enrolled - n Still Enrolled - % of cohort  \
0                    s                  s                            s   
1                54.1%                 15                        27.3%   
2  37.200000000000003%                  9                        14.1%   
3                16.3%                 16                        20.5%   
4                15.9%                 15                        19.2%   

  Dropped Out - n Dropped Out - % of cohort  
0               s                         s  
1               3                      5.5%  
2               9                     14.1%  
3              11                     14.1%  
4              11                     14.1%  

[5 rows x 23 columns]
      dbn                                        school_name       boro  \
0  17K548                Brooklyn School for Music & Theatre   Brooklyn   
1  09X543                   High School for Violin and Dance      Bronx   
2  09X327        Comprehensive Model School Project M.S. 327      Bronx   
3  02M280     Manhattan Early College School for Advertising  Manhattan   
4  28Q680  Queens Gateway to Health Sciences Secondary Sc...     Queens   

  building_code    phone_number    fax_number grade_span_min  grade_span_max  \
0          K440    718-230-6250  718-230-6262              9              12   
1          X400    718-842-0687  718-589-9849              9              12   
2          X240    718-294-8111  718-294-8109              6              12   
3          M520  718-935-3477             NaN              9              10   
4          Q695    718-969-3155  718-969-3552              6              12   

  expgrade_span_min  expgrade_span_max  ...  \
0               NaN                NaN  ...   
1               NaN                NaN  ...   
2               NaN                NaN  ...   
3                 9               14.0  ...   
4               NaN                NaN  ...   

                                          priority02  \
0                    Then to New York City residents   
1  Then to New York City residents who attend an ...   
2  Then to Bronx students or residents who attend...   
3  Then to New York City residents who attend an ...   
4  Then to Districts 28 and 29 students or residents   

                                          priority03  \
0                                                NaN   
1                Then to Bronx students or residents   
2  Then to New York City residents who attend an ...   
3          Then to Manhattan students or residents     
4               Then to Queens students or residents   

                            priority04                       priority05  \
0                                  NaN                              NaN   
1      Then to New York City residents                              NaN   
2  Then to Bronx students or residents  Then to New York City residents   
3      Then to New York City residents                              NaN   
4      Then to New York City residents                              NaN   

  priority06  priority07 priority08  priority09 priority10  \
0        NaN         NaN        NaN         NaN        NaN   
1        NaN         NaN        NaN         NaN        NaN   
2        NaN         NaN        NaN         NaN        NaN   
3        NaN         NaN        NaN         NaN        NaN   
4        NaN         NaN        NaN         NaN        NaN   

                                          Location 1  
0  883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...  
1  1110 Boston Road\nBronx, NY 10456\n(40.8276026...  
2  1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...  
3  411 Pearl Street\nNew York, NY 10038\n(40.7106...  
4  160-20 Goethals Avenue\nJamaica, NY 11432\n(40...  

[5 rows x 58 columns]
      DBN                                    SCHOOL NAME  \
0  01M292  HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES   
1  01M448            UNIVERSITY NEIGHBORHOOD HIGH SCHOOL   
2  01M450                     EAST SIDE COMMUNITY SCHOOL   
3  01M458                      FORSYTH SATELLITE ACADEMY   
4  01M509                        MARTA VALLE HIGH SCHOOL   

  Num of SAT Test Takers SAT Critical Reading Avg. Score SAT Math Avg. Score  \
0                     29                             355                 404   
1                     91                             383                 423   
2                     70                             377                 402   
3                      7                             414                 401   
4                     44                             390                 433   

  SAT Writing Avg. Score  
0                    363  
1                    366  
2                    370  
3                    359  
4                    384

Surveys¶

At this moment we will read in our last two datasets, which are surveys . We also concatenate both surveys into a single dataset, survey.

all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0, sort=True)
print('Survey dimensions: {}'.format(survey.shape))
survey.head()

Survey dimensions: (1702, 2773)

As a next step, let us select our variables of interest given that this dataset has 2,773 columns. Specifically, we shall want to keep the DBN in order to identify schools, and variables such as the number of student and teacher respondents, the school's safety and respect score based on teacher responses, the communication score, the engagement score, academic expectations score, and other metrics evaluating the performance of the school as per student, teacher and parent respondents.

survey["DBN"] = survey["dbn"]

survey_fields = ["DBN", "rr_s", "rr_t", "rr_p", "N_s", "N_t", "N_p", "saf_p_11", "com_p_11", 
                 "eng_p_11", "aca_p_11", "saf_t_11", "com_t_11", "eng_t_11", "aca_t_11", 
                 "saf_s_11", "com_s_11", "eng_s_11", "aca_s_11", "saf_tot_11", "com_tot_11", 
                 "eng_tot_11", "aca_tot_11"]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

survey.head()

Cleaning the data¶

We shall now begin cleaning our data in order to be able to merge the most important columns later on. Let us start by the unique identifier of each school, which will allow us to match elements efficiently in one dataset to elements in another.

Add DBN columns¶

For the columns that do not have the DBN identifier, we shall have to concatenate as strings the numbers identifying the school's CSD (Central School District), borough and school code. For this purpose, we write a function which pads the CSD numbers, given that the DBN includes this number in a 2-digit form. We then apply this function to the class_size database and compose its DBN with the padded CSD and SCHOOL CODE.

data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

Convert columns to numeric¶

Next, we observe that the average score columns in the sat_results database are of type object. We will therefore change them to a numeric type in order to manipulate them. We then take an average of the three scores to simplify our analysis.

data['sat_results'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 478 entries, 0 to 477
Data columns (total 6 columns):
DBN                                478 non-null object
SCHOOL NAME                        478 non-null object
Num of SAT Test Takers             478 non-null object
SAT Critical Reading Avg. Score    478 non-null object
SAT Math Avg. Score                478 non-null object
SAT Writing Avg. Score             478 non-null object
dtypes: object(6)
memory usage: 22.5+ KB

cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

We also notice that in the hs_directory dataset we have a variable Location 1 which includes text describing the school's address. Moreover, within the text is included the latitude and longitude coordinates for each school. Let us extract these in order to perform a geographical analysis of score distributions. For this, we define two functions find_lat and find_lon which use regular expressions to extract this information.

data['hs_directory']['Location 1'].head()

0    883 Classon Avenue\nBrooklyn, NY 11225\n(40.67...
1    1110 Boston Road\nBronx, NY 10456\n(40.8276026...
2    1501 Jerome Avenue\nBronx, NY 10452\n(40.84241...
3    411 Pearl Street\nNew York, NY 10038\n(40.7106...
4    160-20 Goethals Avenue\nJamaica, NY 11432\n(40...
Name: Location 1, dtype: object

def find_lat(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    coords = re.findall("\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

Selecting our data¶

In order to combine our datasets, we need to aggregate or select those which have multiple rows for a same high school. The class_size dataset has various rows for the same school according to its program type and grade. We will select grade to be 09-12 given that our area of focus is high school students who take the SAT, and GEN ED as our PROGRAM TYPE. Other program types includes CTT, which is an integrated co-teaching program for students with disabilities. In the demographics and graduation datasets, we select a particular school year (2012-12) and to capture all students select All Cohort as our demographic of interest.

data['class_size'].head()

class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

Convert AP scores to numeric¶

Finally, we also convert to numeric values variables in the ap_2010 dataset, which were also stored as object type values. Specifically, we wish to convert the number of test takers, the number of exams taken and the number of exams with scores of 3, 4 or 5.

data['ap_2010'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258 entries, 0 to 257
Data columns (total 5 columns):
DBN                                     258 non-null object
SchoolName                              258 non-null object
AP Test Takers                          258 non-null object
Total Exams Taken                       258 non-null object
Number of Exams with scores 3 4 or 5    258 non-null object
dtypes: object(5)
memory usage: 10.2+ KB

cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

Combine the datasets¶

At this point, we have created unique rows with a shared unique identifier for each data set which we will use to consolidate the datasets in order to perform our analysis. We conduct this merge below, using the left merging method in order to preserve values from the sat_results database as this is our most important variable. On the other hand, we merge the class_size, demographics, survey and hs_directory using an inner join given that they have fewer null values.

for dataset in data.keys():
    print(dataset)
    print(data[dataset].isna().sum(axis = 0))

ap_2010
DBN                                       0
SchoolName                                0
AP Test Takers                           25
Total Exams Taken                        25
Number of Exams with scores 3 4 or 5    107
dtype: int64
class_size
DBN                                    0
CSD                                    0
NUMBER OF STUDENTS / SEATS FILLED      0
NUMBER OF SECTIONS                     0
AVERAGE CLASS SIZE                     0
SIZE OF SMALLEST CLASS                 0
SIZE OF LARGEST CLASS                  0
SCHOOLWIDE PUPIL-TEACHER RATIO       583
dtype: int64
demographics
DBN                     0
Name                    0
schoolyear              0
fl_percent           1509
frl_percent             0
total_enrollment        0
prek                    0
k                       0
grade1                  0
grade2                  0
grade3                  0
grade4                  0
grade5                  0
grade6                  0
grade7                  0
grade8                  0
grade9                  0
grade10                 0
grade11                 0
grade12                 0
ell_num                 0
ell_percent             0
sped_num                0
sped_percent            0
ctt_num                 0
selfcontained_num       0
asian_num               0
asian_per               0
black_num               0
black_per               0
hispanic_num            0
hispanic_per            0
white_num               0
white_per               0
male_num                0
male_per                0
female_num              0
female_per              0
dtype: int64
graduation
Demographic                           0
DBN                                   0
School Name                           0
Cohort                                0
Total Cohort                          0
Total Grads - n                       0
Total Grads - % of cohort             0
Total Regents - n                     0
Total Regents - % of cohort           0
Total Regents - % of grads            0
Advanced Regents - n                  0
Advanced Regents - % of cohort        0
Advanced Regents - % of grads         0
Regents w/o Advanced - n              0
Regents w/o Advanced - % of cohort    0
Regents w/o Advanced - % of grads     0
Local - n                             0
Local - % of cohort                   0
Local - % of grads                    0
Still Enrolled - n                    0
Still Enrolled - % of cohort          0
Dropped Out - n                       0
Dropped Out - % of cohort             0
dtype: int64
hs_directory
dbn                0
school_name        0
boro               0
building_code      0
phone_number       0
                ... 
priority10       434
Location 1         0
DBN                0
lat                0
lon                0
Length: 61, dtype: int64
sat_results
DBN                                 0
SCHOOL NAME                         0
Num of SAT Test Takers              0
SAT Critical Reading Avg. Score    57
SAT Math Avg. Score                57
SAT Writing Avg. Score             57
sat_score                          57
dtype: int64
survey
DBN             0
rr_s          661
rr_t            0
rr_p            0
N_s           666
N_t             2
N_p             6
saf_p_11        6
com_p_11        6
eng_p_11        6
aca_p_11        6
saf_t_11        2
com_t_11        2
eng_t_11        2
aca_t_11        2
saf_s_11      666
com_s_11      666
eng_s_11      666
aca_s_11      666
saf_tot_11      0
com_tot_11      0
eng_tot_11      0
aca_tot_11      0
dtype: int64

combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

As a next step, we take a look at the number of null values of our combined dataset. We shall impute these by using the fillna method and passing the mean of each column.

print(combined.isna().sum(axis = 0))
combined = combined.fillna(combined.mean())
combined = combined.fillna(0)

DBN                                  0
SCHOOL NAME                          0
Num of SAT Test Takers               0
SAT Critical Reading Avg. Score     28
SAT Math Avg. Score                 28
                                  ... 
priority09                         363
priority10                         363
Location 1                           0
lat                                  0
lon                                  0
Length: 159, dtype: int64

Add a school district column for mapping¶

Before we being our exploratory analysis, we will map each school to its Central School District by extracting the first two digits from its DBN. This will serve as an additional variable that may present interesting patterns.

def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

Exploratory data analysis¶

Correlation analysis¶

We will begin our analysis by calculating the Pearson correlation between all the variables in our combined dataset and our main variable of interest, sat_score, to gain insight into which interact closely. We see below that, for logical reasons, the average test score in each SAT section is highly correlated to the average score. However, we also observe a relatively high positive correlation with total enrollment and other variables that have a direct relationship with this factor such as total exams taken. We also see a positive relationship with the percentage and number of females, and the percentage and number of Asian students. Additionally, there is a positive relationship with school characteristics evaluated in the surveys such as safety. On the other hand, we obsever a strong negative relationship to the percentage of English language learners at -0.72.

correlations = combined.corr()
correlations = correlations["sat_score"]
plt.figure(figsize=(10,10))
correlations.plot.barh()
plt.title('Variable correlation with SAT score')
plt.show()

Let us take a closer look at the relationship between total_enrollment and sat_score. In the figure below, we can see the slightly positive relationship between both variables. However, the relationship does not appear to be strong and there is an important cluster of points grouping the schools with the lowest enrollment numbers which seem to capture two possibilities: they can have outstanding average scores due to their personalized teaching methods, or they can also present the lowest scores in our universe of schools. Therefore, we cannot say there is a clear relationship between the two variables.

combined.plot.scatter('total_enrollment', 'sat_score', title = 'Relationship between total enrollment and SAT score')
plt.show()

Correlation to school survey metrics¶

Now we shall dive deeper into the correlation between sat_score and the variables stemming from the survey of each school.

# Remove DBN since it's a unique identifier, not a useful numerical value for correlation.
survey_fields.remove("DBN")

survey_data = combined.copy()
fields = survey_fields
fields.append('sat_score')
survey_data = survey_data[fields]
correlations_survey = survey_data.corr()
correlations_survey = correlations_survey["sat_score"]

plt.figure(figsize=(10,8))
correlations_survey.plot.barh()
plt.title('Variable correlation of survey fields with SAT score')
plt.show()

From the above plot, we can once again see a positive relation to variables related to total enrollment such as N_p, the number of parent respondents, and N_s, the number of student respondents. We also see a positive relationship with aca_s_11 and saf_s_11, the safety and respect score based on student responses and the academic expectations score based on student responses. It is interesting to note that this correlation is much lower to that of saf_p_11 and aca_p_11, the safety and respect score and academic expectations score of the parents. We also observe a negative relationship with com_p_11, the communication score based on parent responses. The difference between the correlations between these student-parent variables exhibits a gap in how students and their parents perceive this high school experience.

combined.plot.scatter('saf_s_11', 'sat_score', title = 'Relationship between safety and respect score from students and high school SAT scores')
plt.show()

In the above graph it appears that for higher safety scores from students, high school's SAT performance diverges more, while performance distribution is more concentrated for lower scores. That is, schools perceived as unsafe by students appear to have a lower performance on the SATs with more certainty. To further this analysis, we will compute the average safety scores for each district and map them geographically.

district_scores = combined.copy()
district_scores = district_scores.groupby('school_dist').agg(np.mean)
district_safety_scores = district_scores[['saf_s_11', 'lon', 'lat']]
district_safety_scores.head()

import os
os.environ['PROJ_LIB'] = r'C:\ProgramData\Anaconda3\pkgs\proj4-5.2.0-ha925a31_1\Library\share'
from mpl_toolkits.basemap import Basemap
m = Basemap(
    projection='merc', 
    llcrnrlat=40.496044, 
    urcrnrlat=40.915256, 
    llcrnrlon=-74.255735, 
    urcrnrlon=-73.700272,
    resolution='h'
)

m.drawmapboundary(fill_color='grey')
m.drawcoastlines()
m.drawrivers(color='#6D5F47', linewidth=.4)

longitudes = district_safety_scores['lon'].tolist()
latitudes = district_safety_scores['lat'].tolist()

school_districts = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\NYC_school_district_map.jpg')

m.scatter(longitudes, latitudes, c = district_safety_scores['saf_s_11'], cmap = 'Blues', s = 20, zorder= 2, latlon = True)
plt.colorbar(label='Safety score as per student survey')
plt.title('Geographic mapping of safetey scores')
plt.show()

plt.figure(figsize = (11,11))
plt.imshow(school_districts)
plt.title('Location of NYC school districts')
plt.show()
print('Location of NYC school districts image sourced from the New York City Deparment of Education')

C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:10: MatplotlibDeprecationWarning: 
The dedent function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use inspect.cleandoc instead.
  # Remove the CWD from sys.path while we load stuff.
C:\ProgramData\Anaconda3\lib\site-packages\ipykernel_launcher.py:15: MatplotlibDeprecationWarning: 
The dedent function was deprecated in Matplotlib 3.1 and will be removed in 3.3. Use inspect.cleandoc instead.
  from ipykernel import kernelapp as app

Location of NYC school districts image sourced from the New York City Deparment of Education

From the above images we can observe that the safest schools are those in the Manhattan/Bronx area, the Northern Queens/Brooklyn area and Southwestern Brooklyn area as per above. On the other hand, the least safe schools are located in the Northern Brooklyn districts, Southern Queens and the Southern Queens/Brooklyn districts.

crime_map = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\NYC_crime_map.png')

plt.figure(figsize = (15,15))
plt.imshow(crime_map)
plt.title('NYC Crime Map')
plt.show()
print('Map sourced from NYC Crime Map')

Map sourced from NYC Crime Map

If we compare the distribution of safe schools to the previous figure, a crime map of NYC provided by the government here, we notice similar patterns indicating the overall level of safety in these districts. For instance, we note that the Northeastern and Eastern districts of Brooklyn present higher levels of crime per 1000 residents than surrounding districts, precisely in line with the map of schools. We can further examine the level of concern in terms of safety for these districts in the map below, which was sourced from Google Maps. It indicates the level of safety, with purple areas being the most unsafe.

safety_map = plt.imread(r'C:\Users\andre\Documents\Data_Science_Courses\DQ_Data_Scientist_in_Python\8. Data Cleaning Walkthrough\Guided_project\images\brooklyn_safety_map.png')

plt.figure(figsize = (15,15))
plt.imshow(safety_map)
plt.title('Brooklyn Safety Map')
plt.show()
print('Map sourced from Google Images')

Map sourced from Google Images

Correlation to race¶

We have four variables which indicate the percentage of each race at a particular school: white_per, asian_per, black_per and hispanic_per. Let us now take a look at their correlation to SAT scores.

race_data = combined.copy()
fields = ['sat_score', 'white_per', 'asian_per', 'black_per', 'hispanic_per']
race_data = race_data[fields]
correlations_race = race_data.corr()
correlations_race = correlations_race["sat_score"]

plt.figure(figsize=(10,8))
correlations_race.plot.barh()
plt.title('Variable correlation of race variables with SAT score')
plt.show()

In the above plot we note that there is a positive correlation between SAT scores and the percentage of white and Asian students in the school. On the other hand, there is a negative relationship between scores and percentage of black and hispanic students. It is often the case that a student's performance in school can be largely determined by the support he or she has at home and the extent to which he or she can focus on his/her studies. A student who lives in a high-crime district and needs to take on a more significant role at home would understandably see impacted performance at school. It could be the case that these different races have those favorable and disfavorable forces working in favor or against them and that is why we see these correlations.

We will now explore schools with low SAT scores and high values for the variable hispanic_per. In the plot below, we see a largely negative relationship between these two variables. However, it is interesting to see a cluster of points at the bottom left, which represent schools with a low percentage of hispanic studesnt and low SAT scores.

combined.plot.scatter('hispanic_per', 'sat_score')
plt.title('Correlation between percentage of histpanic students and SAT score')
plt.show()

To examine these schools with a high hispanic student body and low SAT scores, we will select schools with a threshold of above 95% for percentage of hispanics.

hispanic = combined.copy()
hispanic = hispanic[hispanic['hispanic_per'] > 95]
hispanic

We see that there are 8 schools with a percentage of hispanic students higher than 95%. By looking at some of the schools on this list, we note that many of them are public schools with a percentage of minority enrollment of almost 100% and percentage of students who are economically disadvantaged of above 60%. It is interesting to note that some of these schools, such as Multicultural High School, actually have a relatively low student to teacher ratio (11:1 in the case of this high school), as reported by US News (please refer to Sources).

Now, let us take a look at schools on the other end of the spectrum. Below is a table of schools with a percentage of hispanic students lower than 10% and SAT scores greater than 1800.

nonhispanic = combined.copy()
nonhispanic = nonhispanic[(nonhispanic['hispanic_per'] < 10) & (nonhispanic['sat_score'] > 1800)]
nonhispanic

From the schools above, we see most of them mostly have a technical or science focus. Moreover, these are top-rated schools. Stuyvesant High School, for example, was 25 in the National Rankings as per US News, with an overall score of 99.86. Minority enrollment was lower compared to schools in the first group, at around 80%, with economically disadvantaged students at a lower 45%. Enrollment was significantly higher compared to the first group as well, at 3,368 in the case of Stuyvesant. Moreover, the majority of students were Asian (74%) and white (18%), as per US News. In the case of the Bronx High School of Science, it was 49 in national rankings with an overall score of 99.72 and also had a student body that was majority-Asian (63%).

Correlation to gender¶

We now turn to investigate gender differences in SAT scores.

gender_data = combined.copy()
fields = ['sat_score', 'female_per', 'male_per']
gender_data = gender_data[fields]
correlations_gender = gender_data.corr()
correlations_gender = correlations_gender["sat_score"]

plt.figure(figsize=(10,8))
correlations_gender.plot.barh()
plt.title('Variable correlation of gender variables with SAT score')
plt.show()

We find that the percentage of female students has a positive correlation with SAT scores, while the case for percentage of males is the opposite. This is a surprising finding as the female population is usually considered to be at a disadvantage. Below we see a scatter plot of percentage of females and SAT scores, and it is interesting to find that the schools with the best performance on the test are those which have a more balanced student body, while those on the extremes (all-male, all-female) register scores that are more in line with the average of all NYC high schools.

combined.plot.scatter('female_per', 'sat_score')
plt.title('Correlation between percentage of female students and SAT score')
plt.show()

We shall perform the same exercise as with schools' hispanic population, but with female students.

females = combined.copy()
females = females.sort_values('female_per', ascending = False)
females = females[(females['sat_score'] > 1700) & (females['female_per'] > 60)]
females = females[['SCHOOL NAME', 'female_per', 'sat_score']]
females

Researching these schools, we note that Fiorello LaGuardia, for example, specializes in the visual and performing arts. The majority of students in this school (93%) feel safe inside the premises and only a small amount (15%) consider that bullying takes place, as per Inside Schools (please review Sources). Townsend Harris High School focuses on the humanities, and a large percentage of teachers (84%) indicate they would recommend this school to families (Inside Schools).

Correlation to AP testing¶

We will now examine the relationship between the percentage of students in each school who take AP exams and the school's performance on the SATs. We expect this relationship to be positive, as high school students take these Advanced Placement exams to earn college credit. We note that our dataset only includes the absolute number of students who toke these tests, so we wil have to create a new variable to reflect percentages.

combined['ap_per'] = combined['AP Test Takers '] / combined['total_enrollment']

combined.plot.scatter('ap_per', 'sat_score')
plt.title('Relationship between percentage of AP test takers and SAT scores')
plt.show()

It is interesting to see that the positive relationship one would expect is not very clear. For a group of schools we do se higher percentage of AP students and higher SAT scores, but for a second cluster of data points we do not see much influence of this variable on the performance in the SATs. This could perhaps be explained by the poor quality of AP classes in some schools; in this case one would not expect a significant positive externality on students' performance in the SATs from having to take those AP courses.

Conclusions¶

In this project we have gathered data from significant sources to look into the variables that explain high school students' performance on the SAT scores. For this purpose, we have cleaned and combined several datasets and performed exploratory analysis of our data primarily through plots. We found interesting relationships at the level of different factors: gender, race, AP test takers, total enrollment and perceived safety. While some of the relationships we found were expected, such as that between perceived safety of schools and student performance, other correlations were surprising. This was the case of the percentage of female students and total enrollment.

To further this research, one could look at a more granular level into geographic data to identify the neighborhoods that have the best schools. It could also be important to explore the relationship between SAT scores and class size.

Sources¶

Google. Brooklyn Safety Map. Google Maps: Link. Accessed October 31, 2019.

InsideSchools. Fiorello H. LaGuardia High School of Music & Art and Performing Arts. Link. Accessed: October 31, 2019.

InsideSchools. Townsend Harris High School. Link. Accessed: October 31, 2019.

New York City Department of Education. Map of Instructional Divisions. Link. Accessed: October 31, 2019

New York City Crime Map. Link. Accessed: October 31, 2019.

NYC Open Data. 2012 SAT Results. Link. Accessed: October 31, 2019.

NYC Open Data. 2010-2011 School Attendance and Enrollment Statistics by District. Link. Accessed: October 31, 2019.

NYC Open Data. 2010 AP (College Board) School Level Results. Link. Accessed: October 31, 2019.

NYC Open Data. 2005-2010 Graduation Outcomes - School Level. Link. Accessed: October 31, 2019.

NYC Open Data. 2006-2012 School Demographics and Accountability Snapshot. Link. Accessed: October 31, 2019.

NYC Open Data. 2011 NYC School Survey. Link. Accessed: October 31, 2019.

NYC Open Data. 2010-2011 Class Size - School-level detail. Link. Accessed: October 31, 2019.

US News. Manhattan Bridges High School. Link. Accessed: October 31, 2019.

US News. Multicultural High School. Link. Accessed: October 31, 2019.

US News. Pan American International High School. Link. Accessed: October 31, 2019.

US News. Stuyvestant High School. Link. Accessed: October 31, 2019.

US News. Bronx High School of Science. Link. Accessed: October 31, 2019.

	saf_s_11	lon	lat
school_dist
01	6.768611	-73.982377	40.719022
02	6.910660	-73.991386	40.739699
03	6.716667	-73.977370	40.781574
04	6.885714	-73.943215	40.793449
05	6.314286	-73.949251	40.817077

	DBN	SCHOOL NAME	Num of SAT Test Takers	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	SchoolName	AP Test Takers	Total Exams Taken	...	priority05	Location 1	lat	lon	school_dist
44	02M542	MANHATTAN BRIDGES HIGH SCHOOL	66	336.0	378.0	344.0	1058.0	Manhattan Bridges High School	67.000000	102.000000	...	0	525 West 50Th Street\nNew York, NY 10019\n(40....	40.765027	-73.992517	02
82	06M348	WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL	70	380.0	395.0	399.0	1174.0	0	129.028846	197.038462	...	Then to New York City residents	511 West 182Nd Street\nNew York, NY 10033\n(40...	40.848879	-73.930807	06
89	06M552	GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...	56	339.0	349.0	326.0	1014.0	GREGORIO LUPERON HS SCI & MATH	88.000000	138.000000	...	0	501 West 165Th\nNew York, NY 10032\n(40.838032...	40.838032	-73.938371	06
125	09X365	ACADEMY FOR LANGUAGE AND TECHNOLOGY	54	315.0	339.0	297.0	951.0	Academy for Language and Technology	20.000000	20.000000	...	0	1700 Macombs Road\nBronx, NY 10453\n(40.849102...	40.849102	-73.916088	09
141	10X342	INTERNATIONAL SCHOOL FOR LIBERAL ARTS	49	300.0	333.0	301.0	934.0	International School for Liberal Arts	55.000000	73.000000	...	0	2780 Reservoir Avenue\nBronx, NY 10468\n(40.87...	40.870377	-73.898163	10
176	12X388	PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE	30	321.0	351.0	298.0	970.0	0	129.028846	197.038462	...	0	1300 Boynton Avenue\nBronx, NY 10472\n(40.8313...	40.831366	-73.878823	12
253	19K583	MULTICULTURAL HIGH SCHOOL	29	279.0	322.0	286.0	887.0	Multicultural High School	44.000000	44.000000	...	0	999 Jamaica Avenue\nBrooklyn, NY 11208\n(40.69...	40.691144	-73.868426	19
286	24Q296	PAN AMERICAN INTERNATIONAL HIGH SCHOOL	55	317.0	323.0	311.0	951.0	0	129.028846	197.038462	...	0	45-10 94Th Street\nElmhurst, NY 11373\n(40.743...	40.743303	-73.870575	24

	DBN	SCHOOL NAME	Num of SAT Test Takers	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score	sat_score	SchoolName	AP Test Takers	Total Exams Taken	...	Location 1	lat	lon	school_dist
37	02M475	STUYVESANT HIGH SCHOOL	832	679.0	735.0	682.0	2096.0	STUYVESANT HS	1510.0	2819.0	...	345 Chambers Street\nNew York, NY 10282\n(40.7...	40.717746	-74.014049	02
151	10X445	BRONX HIGH SCHOOL OF SCIENCE	731	632.0	688.0	649.0	1969.0	BRONX HS OF SCIENCE	1190.0	2435.0	...	75 West 205 Street\nBronx, NY 10468\n(40.87705...	40.877056	-73.889780	10
187	13K430	BROOKLYN TECHNICAL HIGH SCHOOL	1277	587.0	659.0	587.0	1833.0	BROOKLYN TECHNICAL HS	2117.0	3692.0	...	29 Ft Greene Place\nBrooklyn, NY 11217\n(40.68...	40.688107	-73.976745	13
327	28Q687	QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...	121	612.0	660.0	596.0	1868.0	Queens HS for Science York Colllege	215.0	338.0	...	94-50 159 Street\nJamaica, NY 11433\n(40.70099...	40.700999	-73.798154	28
356	31R605	STATEN ISLAND TECHNICAL HIGH SCHOOL	227	635.0	682.0	636.0	1953.0	STATEN ISLAND TECHNICAL HS	528.0	905.0	...	485 Clawson Street\nStaten Island, NY 10306\n(...	40.567913	-74.115362	31

	SCHOOL NAME	female_per	sat_score
61	FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...	73.6	1707.0
302	TOWNSEND HARRIS HIGH SCHOOL	71.1	1910.0
5	BARD HIGH SCHOOL EARLY COLLEGE	68.7	1856.0
26	ELEANOR ROOSEVELT HIGH SCHOOL	67.5	1758.0
60	BEACON HIGH SCHOOL	61.0	1744.0

	DBN	SCHOOL NAME	Num of SAT Test Takers	SAT Critical Reading Avg. Score	SAT Math Avg. Score	SAT Writing Avg. Score
0	01M292	HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES	29	355	404	363
1	01M448	UNIVERSITY NEIGHBORHOOD HIGH SCHOOL	91	383	423	366
2	01M450	EAST SIDE COMMUNITY SCHOOL	70	377	402	370
3	01M458	FORSYTH SATELLITE ACADEMY	7	414	401	359
4	01M509	MARTA VALLE HIGH SCHOOL	44	390	433	384

	N_p	N_s	N_t	aca_p_11	aca_s_11	aca_t_11	aca_tot_11	bn	com_p_11	com_s_11	...	t_q8c_1	t_q8c_2	t_q8c_3	t_q8c_4	t_q9	t_q9_1	t_q9_2	t_q9_3	t_q9_4	t_q9_5
0	90.0	NaN	22.0	7.8	NaN	7.9	7.9	M015	7.6	NaN	...	29.0	67.0	5.0	0.0	NaN	5.0	14.0	52.0	24.0	5.0
1	161.0	NaN	34.0	7.8	NaN	9.1	8.4	M019	7.6	NaN	...	74.0	21.0	6.0	0.0	NaN	3.0	6.0	3.0	78.0	9.0
2	367.0	NaN	42.0	8.6	NaN	7.5	8.0	M020	8.3	NaN	...	33.0	35.0	20.0	13.0	NaN	3.0	5.0	16.0	70.0	5.0
3	151.0	145.0	29.0	8.5	7.4	7.8	7.9	M034	8.2	5.9	...	21.0	45.0	28.0	7.0	NaN	0.0	18.0	32.0	39.0	11.0
4	90.0	NaN	23.0	7.9	NaN	8.1	8.0	M063	7.9	NaN	...	59.0	36.0	5.0	0.0	NaN	10.0	5.0	10.0	60.0	15.0

	DBN	rr_s	rr_t	rr_p	N_s	N_t	N_p	saf_p_11	com_p_11	eng_p_11	...	eng_t_11	aca_t_11	saf_s_11	com_s_11	eng_s_11	aca_s_11	saf_tot_11	com_tot_11	eng_tot_11	aca_tot_11
0	01M015	NaN	88	60	NaN	22.0	90.0	8.5	7.6	7.5	...	7.6	7.9	NaN	NaN	NaN	NaN	8.0	7.7	7.5	7.9
1	01M019	NaN	100	60	NaN	34.0	161.0	8.4	7.6	7.6	...	8.9	9.1	NaN	NaN	NaN	NaN	8.5	8.1	8.2	8.4
2	01M020	NaN	88	73	NaN	42.0	367.0	8.9	8.3	8.3	...	6.8	7.5	NaN	NaN	NaN	NaN	8.2	7.3	7.5	8.0
3	01M034	89.0	73	50	145.0	29.0	151.0	8.8	8.2	8.0	...	6.8	7.8	6.2	5.9	6.5	7.4	7.3	6.7	7.1	7.9
4	01M063	NaN	100	60	NaN	23.0	90.0	8.7	7.9	8.1	...	7.8	8.1	NaN	NaN	NaN	NaN	8.5	7.6	7.9	8.0

	CSD	BOROUGH	SCHOOL CODE	SCHOOL NAME	GRADE	PROGRAM TYPE	CORE SUBJECT (MS CORE and 9-12 ONLY)	CORE COURSE (MS CORE and 9-12 ONLY)	SERVICE CATEGORY(K-9* ONLY)	NUMBER OF STUDENTS / SEATS FILLED	NUMBER OF SECTIONS	AVERAGE CLASS SIZE	SIZE OF SMALLEST CLASS	SIZE OF LARGEST CLASS	DATA SOURCE	SCHOOLWIDE PUPIL-TEACHER RATIO	padded_csd	DBN
0	1	M	M015	P.S. 015 Roberto Clemente	0K	GEN ED	-	-	-	19.0	1.0	19.0	19.0	19.0	ATS	NaN	01	01M015
1	1	M	M015	P.S. 015 Roberto Clemente	0K	CTT	-	-	-	21.0	1.0	21.0	21.0	21.0	ATS	NaN	01	01M015
2	1	M	M015	P.S. 015 Roberto Clemente	01	GEN ED	-	-	-	17.0	1.0	17.0	17.0	17.0	ATS	NaN	01	01M015
3	1	M	M015	P.S. 015 Roberto Clemente	01	CTT	-	-	-	17.0	1.0	17.0	17.0	17.0	ATS	NaN	01	01M015
4	1	M	M015	P.S. 015 Roberto Clemente	02	GEN ED	-	-	-	15.0	1.0	15.0	15.0	15.0	ATS	NaN	01	01M015