Introduction

Data cleaning is crucial to any data analysis task. After all, any analytic model is as good as the data from which it is built. The purpose of data cleaning is to prepare data for analysis and this involves a wide range of tasks including removing duplicate data, reshaping data, handling missing values, etc. Data cleaning is said to consume most of the time of a data analysis work.

In this project, I clean messy, real-world datasets which contain information about the demographic of New York City(NYC) schools, test scores, graduation and dropout rates, and school locations.Some analyses of the cleaned data are performed to understand the impact of NYC high schools’ demographics (race, sex, income, etc.) on students’ academic performance.

The datasets used in this project come from the New York City Department of Education. They are stored in the following six files:

sat_results.csv: Data on Scholastic Aptitude Test (SAT) scores. The SAT is a standardized test taken by most high school students that is used to assess preparation for college. The dataset contains 2012 SAT results.
ap_2010.csv - Data on Advanced Placement (AP) test results. AP tests are taken by advanced high school students, often at schools that can afford to provide the specialized classes needed to prepare for them.AP results for 2010 are currently available.
class_size.csv: Data on average school class size by grade and program type.The dataset includes 2010-2011 class size data.
demographics.csv: Data on demographics of NYC school students from 2006 - 2012, including race and sex.
graduation.csv: Data on graduation outcomes for cohorts of 2001 through 2006, including percentages of students who graduated or dropped out of school.
hs_directory.csv: A directory of high schools that contains location information for each school.

Cleaning the Datasets

Usually, the first task of working with any new data is to familiarize yourself with the data by reading the descriptions of the datasets (i.e., the metadata) and if necessary, doing additional research to ensure you understand what each data represents. After gaining familiarity with the data, you can then focus on the data cleaning operations to perform to prepare the data for analysis. The data cleaning tasks can be framed as questions including:

What data do we need based on the purpose of analysis for which the data is to be used?
Are the data of the correct type?
Are there any data duplicates?
Are there missing values?
Do we need to create new variables, combine different datasets, reshape the data, etc?

I walk through the process of cleaning each data guided by these questions as appropriate. Let’s begin by importing all R modules relevant for this project and also reading in the above described datasets.

Importind Modules and Data

library(readr)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(ggplot2)

sat_results <- read_csv("sat_results.csv") 
ap_2010 <- read_csv("ap_2010.csv") 
class_size <- read_csv("class_size.csv")
demographics <- read_csv("demographics.csv") 
graduation <- read_csv("graduation.csv") 
hs_directory <- read_csv("hs_directory.csv")

Create a function get_df_dims() which displays the dimensions (i.e., number of rows and columns) of the datasets imported as dataframes using the read_csv() function.

list_of_df <- list(sat_results, ap_2010, class_size, demographics, graduation, hs_directory)
names(list_of_df) <- c("sat_results", "ap_2010", "class_size", "demographics", "graduation", "hs_directory")

get_df_dims <- function(df_list) {
 
  dim_mat <- matrix(ncol = 2)
  
  for (df in df_list) {
    if (all(is.na(dim_mat))) {
      dim_mat <- rbind(c(dim(df)[1], dim(df)[2]))
    }
    else 
      dim_mat <- rbind(dim_mat, c(dim(df)[1], dim(df)[2]))
    
  }
  
  colnames(dim_mat) <- c("Num_of_rows", "Num_of_cols")
  rownames(dim_mat) <- names(df_list)
  dim_mat
}

get_df_dims(list_of_df)

##              Num_of_rows Num_of_cols
## sat_results          478           6
## ap_2010              258           5
## class_size         27611          16
## demographics       10075          38
## graduation         25096          23
## hs_directory         435          58

class_size includes data about different classes and grades and demographics and graduation have data over several years. It is thus not surprising these three are the largest datasets.sat_results includes only 2012 data but we can see that it has more observations than hs_directory. This either suggests duplicate data in the SAT dataset or that there are some schools in the SAT data not listed in the directory data. We will check for duplicates after we have done some initial cleaning works on each individual dataset.

Cleaning SAT Data

As noted above, the SAT dataset contains information about the most recent (i.e., 2012) school level SAT results.As only high school students take the SAT, we can expect that the schools listed in the dataset are all high schools.

#explore few rows of the SAT data using the `head()` function.
head(sat_results, 10)

We can see SAT data has six variables which together identify each school and provide information about the average scores students at each high school earned on the three SAT sections: Critical Reading, Math and Writing. All the variables are relevant for our purpose of understanding the effect of demographics on test performance and should be retained.It seems the DBN variable uniquely identifies each school’s data; it can be used as a key to combine the SAT dataframe with other dataframes.

Data cleaning tasks to be performed:

It may be useful to have a variable containing average total SAT scores which is the sum of the three SAT section scores. We will call this variable avg_sat_score.
The data for the SAT section scores are character data. To calculate the sum of the SAT section scores, we need to change the data to numeric.

Let’s implement these changes and overwrite the original SAT data.

sat_results <- sat_results %>%
    mutate(`Num of SAT Test Takers` = as.numeric(`Num of SAT Test Takers`),
           `SAT Writing Avg. Score` = as.numeric(`SAT Writing Avg. Score`),
           `SAT Critical Reading Avg. Score` = as.numeric(`SAT Critical Reading Avg. Score`),
           `SAT Math Avg. Score` = as.numeric(`SAT Math Avg. Score`)) %>% 
    mutate(avg_sat_score = `SAT Writing Avg. Score` + `SAT Critical Reading Avg. Score` + `SAT Math Avg. Score`)

head(sat_results)

Cleaning AP Exam Data

Only high school students take AP exams and thus, we can expect that the data in ap_2010 are for high schools.

head(ap_2010, 10)

Similar to the SAT data, all the five variables in the AP data are relevant for our analysis as they identify schools, number of students who took AP exam per school, number of exams taken (students can take exams in multiple subjects) and number of students who earned high scores on the exams, which are scored on a scale from one (lowest) to five (highest).

The metadata of the AP dataset indicates that test records of schools where 5 or fewer students took the test are suppressed in the dataset. This may explain the missing data in the dataset represented as s. We will address the issue of missing data later on. For now, here are the cleaning opperations to perform:

Since students can take multiple AP exams, it may be interesting to calculate the average number of exams taken per student by dividing Total Exams Taken by AP Test Takers. We could also calculate the percentage of high-scoring AP exams by dividing Number of Exams with scores 3 4 or 5 by Total Exams Taken and multiplying by 100.
The AP data are all formatted as character data. We’ll need to change the last three variables to numeric.

ap_2010 <- ap_2010 %>% 
    mutate_at(vars(`AP Test Takers`, `Total Exams Taken`, 
                    `Number of Exams with scores 3 4 or 5`), as.numeric) %>%
    mutate(exams_per_student = `Total Exams Taken`/`AP Test Takers`) %>%
    mutate(high_score_percent = (`Number of Exams with scores 3 4 or 5`/`Total Exams Taken`)*100)

head(ap_2010)

Cleaning Class Size Data

This dataset contains information about class sizes associated with different grades and classes of NYC schools.

head(class_size, 10)

We can see that there are multiple rows for each school in the class size dataset. This is due to the different types of grades and programs recorded for each school. Since we are interested high schools’ data, we will keep only data where the GRADE variable has the value of “09-12” — grades which are considered part of high schools in the U.S. Similarly, for consistency among datasets, we will keep rows for which PROGRAM TYPE has the value of “GEN ED”(i.e., “general education”).

Let’s filter the class size dataset to retain relevant rows based on the above conditions.

class_size <- class_size %>% filter (GRADE == "09-12" & `PROGRAM TYPE` == "GEN ED")
head(class_size)

The filtered class size dataframe still contains multiple rows for some high schools. These rows contain information about class sizes of various courses.To calculate single values for class size variables for each high school, we can calculate an average across courses. Particular variables of interest include AVERAGE CLASS SIZE, SIZE OF SMALLEST CLASS, and SIZE OF LARGEST CLASS which may be useful in understanding the impacts of demographic factors on student academic performance. For example, as wealtheir schools can afford more teachers, they may be able to offer smaller classes.

To calculate averages for the three variables, we need to group the class sise data by a variable that uniquely identifies each high school. We saw earlier that DBN is a key in the AP and SAT datasets. In the class size data, DBN can be created from the CSD and SCHOOL CODE variables. Thus, we can group the class size data by CSD, SCHOOL CODE and SCHOOL NAME to calculate averages for smallest class size, largest class size, and average class size.

class_size <- class_size %>% 
    group_by(CSD, `SCHOOL CODE`, `SCHOOL NAME`) %>%
    summarize(avg_class_size = mean(`AVERAGE CLASS SIZE`),
              avg_largest_class = mean(`SIZE OF LARGEST CLASS`),
              avg_smallest_class = mean(`SIZE OF SMALLEST CLASS`))

Let’s add a DBN variable to the class size dataframe which we can use as a key to join other dataframes.

class_size <- class_size %>%
  mutate(DBN = str_c(CSD, `SCHOOL CODE`, sep = "")) %>%
  mutate(DBN = str_pad(DBN, width = 6, side = 'left', pad = "0"))

#reorder columns and drop CSD and `SCHOOL CODE` variables
class_size <- class_size[c(7,3,4,5,6)]

head(class_size)

Cleaning Graduation Data

The graduation data contains graduation details of schools including percentages of students who graduated or dropped out of school.

head(graduation, 10)

Graduation data for cohorts 2001 through 2006 are contained in the graduation dataset. Since none of the other datasets contains school data between 2001 and 2005, we will filter the graduation dataset to retain data for 2006. To understand students’ academic success, the most relevant variables are probably the ones that have to do with whether students successfully graduate or if they drop out: Total Grads - % of cohort and Dropped Out - % of cohort. These variables are thus retained.

graduation <- graduation %>% 
    filter (Demographic == "Total Cohort" & Cohort == "2006") %>%
    select (DBN, `School Name`, `Total Grads - % of cohort`, `Dropped Out - % of cohort`) 

head(graduation)

The percentage values in the Total Grads - % of cohort and Dropped Out - % of cohort are formatted as character strings. Let’s extract the numeric portion using the parse_number() function.

graduation <- graduation %>%
    mutate_at(vars(`Total Grads - % of cohort` : `Dropped Out - % of cohort`), parse_number)

head(graduation)

Cleaning Demographics Data

The demographics dataset includes demographics of students in NYC.

head(demographics, 10)

Data on students’ demographics from 2006 through 2012 are contained in the dataset. For now, we will retain the data for all years since the other datasets together also have data between 2006 and 2012.We will select demographics data for a particular year when combining the demographics datat with other datasets for analysis later on.

We can see that demographics for students other than high school students are included the data. We need to filter the data to select rows pertaining to high schools. Since grades 9 through 12 represent high schools, selecting rows where values of grade9 are not missing will fetch high schools’ data.

For our analysis, we will retain variables that provide information about percentages of students of different races and sexes as well as variables that provide information about the percentage of students in programs that can indicate socioeconomic conditions of students.

Let’s implement these changes.

demographics <- demographics %>% 
    filter (grade9 != "NA") %>%
    select (DBN, schoolyear, frl_percent, total_enrollment, ell_percent, sped_percent, asian_per, black_per, hispanic_per, white_per, male_per)

head(demographics)

Cleaning High School Directory Data

The directory data contains location information including GPS coordinates which can be useful when we want to look at patterns based on school locations. These coordinates are buried within character strings contained in the Location 1 variable.

head(hs_directory)

Let’s simplify the directory data to retain the boro and Location 1 columns in addition to columns identifying each school.

hs_directory <- hs_directory %>%
    rename (DBN = dbn, Borough=boro) %>%
    select (DBN, school_name, Borough, `Location 1`)

head(hs_directory)

Let’s isolate the latitude and longitude data, of each school contained in the Location 1 variable. We can use this information in visualizing spatial trends in the future. The latitude values are those that begin with "40" and the longitude values are those that begin with "-73".

#Create a new hs_directory dataframe adding a new variable, lat_long, which contains the #coordinates extracted from `Location 1` column.

lat_long <- str_split(hs_directory$`Location 1`, "\n", simplify=TRUE)

hs_directory <- hs_directory %>% mutate (lat_long = lat_long[,3])

#Create lat and long variables from lat_long and add them to the hs_directory dataframe.
split_location <- str_split(hs_directory$lat_long, ",", simplify=TRUE)

hs_directory <- hs_directory %>%
    mutate(lat = split_location[,1], long = split_location[,2])

#Remove the opening and closing parentheses from lat and long cols. 
lat_clean <- str_sub(hs_directory$lat, 2, -1)
long_clean <- str_sub(hs_directory$long, 1, -2)

#Convert the data type of the lat and long variables from character to numeric.
hs_directory <- hs_directory %>%
    mutate(lat = lat_clean, long = long_clean) %>%
    mutate_at(vars(lat : long), as.numeric) %>%
    select(-`Location 1`)

head(hs_directory, 10)

Exploratory Data Analysis (EDA)

So far we have done initial cleaning of each dataset separately. It is time to combine the datasets and perform some exploratory analyses to understand demographic disparities in students’ academic performance. We will continue data cleaning as part of the analysis, this time focusing on reshaping data, checking for duplicates and addressing missing values as appropriate.

Grades and grade point average(GPA) are the most commonly used measure of academic performance/success in educational research. Thus, variables in our datasets such as SAT scores and graduation rate can be used as a measure of academic success. We will begin our analysis by exploring the relationship between demographic factors and SAT scores.

Relationship Between Demographics and SAT Scores

Selecting Relevant Data

We will use the SAT data which has 2012 SAT scores.
Since we have 2012 SAT scores, we need demographics data for 2012.
It may be important to also understand how academic performance and demographics variables vary geographically in NYC. Thus, we also need the high school directory data.
The other datasets (graduation, class size and AP_2010) also have important variables but since they do not contain data for the year 2012, we will use them in separate analyses.

The cleaned SAT and directory data can be used as is. However, we need to select 2012 demographics data from the demographics dataset.

demographics2012 <- demographics %>% 
    filter (schoolyear == "20112012") %>%
    select (-schoolyear)

head(demographics2012)

Checking for Duplicates

Before we combine the cleaned datasets for analysis, we will check for duplicates.As noted previously, data for each school in the datasets is uniquely identified by the DBN variable. We will check for duplicate values of DBN using the duplicated() function.

#create a list of the six dataframes named `ny_schools`
ny_schools <- list(sat_results, demographics2012, hs_directory, graduation, class_size, ap_2010)

#assign the dataframes in the list proper name attributes
names(ny_schools) <- c("sat_results", "demographics2012", "hs_directory", "graduation", "class_size", "ap_2010")

#return a list of rows from each dataframe that contain duplicate values of DBN
duplicate_DBN <- ny_schools %>% 
    map(mutate, is_dup = duplicated(DBN)) %>%
    map(filter, is_dup == "TRUE")

duplicate_DBN

## $sat_results
## # A tibble: 0 x 8
## # ... with 8 variables: DBN <chr>, `SCHOOL NAME` <chr>, `Num of SAT Test
## #   Takers` <dbl>, `SAT Critical Reading Avg. Score` <dbl>, `SAT Math Avg.
## #   Score` <dbl>, `SAT Writing Avg. Score` <dbl>, avg_sat_score <dbl>,
## #   is_dup <lgl>
## 
## $demographics2012
## # A tibble: 0 x 11
## # ... with 11 variables: DBN <chr>, frl_percent <dbl>, total_enrollment <dbl>,
## #   ell_percent <dbl>, sped_percent <dbl>, asian_per <dbl>, black_per <dbl>,
## #   hispanic_per <dbl>, white_per <dbl>, male_per <dbl>, is_dup <lgl>
## 
## $hs_directory
## # A tibble: 0 x 7
## # ... with 7 variables: DBN <chr>, school_name <chr>, Borough <chr>,
## #   lat_long <chr>, lat <dbl>, long <dbl>, is_dup <lgl>
## 
## $graduation
## # A tibble: 0 x 5
## # ... with 5 variables: DBN <chr>, `School Name` <chr>, `Total Grads - % of
## #   cohort` <dbl>, `Dropped Out - % of cohort` <dbl>, is_dup <lgl>
## 
## $class_size
## # A tibble: 0 x 6
## # ... with 6 variables: DBN <chr>, `SCHOOL NAME` <chr>, avg_class_size <dbl>,
## #   avg_largest_class <dbl>, avg_smallest_class <dbl>, is_dup <lgl>
## 
## $ap_2010
## # A tibble: 1 x 8
##   DBN   SchoolName `AP Test Takers` `Total Exams Ta~ `Number of Exam~
##   <chr> <chr>                 <dbl>            <dbl>            <dbl>
## 1 04M6~ YOUNG WOM~               NA               NA               NA
## # ... with 3 more variables: exams_per_student <dbl>, high_score_percent <dbl>,
## #   is_dup <lgl>

The check for duplicates showed that only the ap_2010 dataframe contains 1 duplicate value: DBN = "04M610". Let’s filter the data to identify the rows containing the duplicated DBN value.

check_dup <- ap_2010 %>%
            filter(DBN == "04M610")

check_dup

We found 2 rows corresponding to the DBN value 04M610. We can remove the row with the school name “YOUNG WOMEN’S LEADERSHIP SCH” which does not contain data.

ap_2010 <- ap_2010 %>%
            filter (SchoolName != "YOUNG WOMEN'S LEADERSHIP SCH")

#check if only one "04M610" DBN value remains
check_dup <- ap_2010 %>%
            filter(DBN == "04M610")

check_dup

Combining Datasets

We are ready to combine the following dataframes for our first set of analyses: sat_results, demographics2012 and hs_directory. Since SAT scores represent our main variable of interest, we will keep all observations in the SAT data and matching data in the other datasets. Therefore, We will use the left_join() function to combine the dataframes on the key variable DBN.

sat_demog_dir <- sat_results %>%
  left_join(demographics2012, by = "DBN") %>%
  left_join(hs_directory, by = "DBN") %>%
  rename(School_Name = `SCHOOL NAME`) %>%
  select(-school_name, -lat_long, -lat, -long)

head(sat_demog_dir)

Handling Missing Values

Earlier in this project, we saw that there are more observations in the SAT data(478) than in the directory data(435). We also found no duplicates in either dataset indicating that some schools in the SAT dataset are not listed in the directory data. Let’s review all missing values in our combined dataset and track the schools without location information.

#find the number of missing values for each variable in the combined dataframe: sat_demog_dir

na_count <- colSums(is.na(sat_demog_dir)) %>%
            as_tibble(rownames = "Variable")

na_count

There are 57 schools without SAT data and 41 without student demographic details. As noted earlier, the location (boroughs) of quite a number of schools are missing. We need both SAT records and demographic data for our analysis so we will drop schools with either of these information missing.

sat_demog_dir <- sat_demog_dir %>%
  filter(!is.na(avg_sat_score)) %>%
  filter(!is.na(frl_percent))

sat_demog_dir

There were 478 observations in the combined dataset, sat_demog_dir. A total of 401 observations remain after filtering out missing SAT or demographic observations. Thus, there were 77 schools with no SAT scores or demographic data or both.

Let’s identify schools without location data (boroughs).

na_boro <- sat_demog_dir %>%
  filter(is.na(Borough)) %>%
  select(DBN, Borough, School_Name)

 
na_boro

Geographically, NYC is divided into five administrative districts called boroughs.The metadata of some of the datasets shows that letter in the DBN variable represents a borough. Below are the letters and their corresponding boroughs:

K : Brooklyn
M : Manhattan
R : Staten Island
Q : Queens
X : Bronx

Let’s use this information to replace the missing boroughs.

sat_demog_dir <- sat_demog_dir %>%
  mutate (
    Borough = case_when(
        str_extract(DBN, "[A-Z]") == "M" ~ replace_na(Borough, "Manhattan"),
        str_extract(DBN, "[A-Z]") == "X" ~ replace_na(Borough, "Bronx"),
        str_extract(DBN, "[A-Z]") == "K" ~ replace_na(Borough, "Brooklyn"),
        str_extract(DBN, "[A-Z]") == "Q" ~ replace_na(Borough, "Queens"),
        str_extract(DBN, "[A-Z]") == "R" ~ replace_na(Borough, "Staten Island")
    )
  )

#check a few of the missing boroughs to see if inputation was successful
na_boro_repl <- sat_demog_dir %>%
  filter(DBN %in% c("01M458", "07X321", "13K336", "25Q792", "31R470")) %>%
  select(DBN, Borough, `School_Name`)
  

na_boro_repl

EDA 1: Race vs SAT Scores

We are ready to analyse the data. Let’s begin by using a scatter plot to examine the relationship between race(Asian, Black, Hispanic, White) and SAT scores. The race variables are stored as seperate columns in the dataset sat_demog_dir. To create a scatter plot, we need to reshape the dataset to store the race variables in one column and their percentage values in another column. Below, We use the pivot_longer() function to reshape the data and then create the scatter plot.

sat_race <- sat_demog_dir %>%
    pivot_longer(cols = c(asian_per, black_per, hispanic_per, white_per),
                 names_to = "Race",
                 values_to = "Percent")

ggplot(data = sat_race,
        aes(x = Percent, y = avg_sat_score, color = Race)) +
        geom_point() +
        labs(x="Percent", y="Avg. SAT Score") +
        theme (
          #strip.background = element_blank(),
          strip.text.x = element_blank()) +
        facet_wrap(~Race)

We can see from the results that while the percentages of White and Asian students display positive relationships with SAT score, the percentages of Black and Hispanic students display a negative relationship with SAT score. This finding may be explained by the reported pervasive inequalities regarding resource allocation to schools with a high number of Black and Hispanic students. Such schools may not have experienced teachers, advanced courses, and high-quality instructional materials that help prepare students for the SAT.

Next, we will quantify the strength of the relationship between SAT score and race as well as other demographic factors using correlation analysis.

Below, we create a correlation matrix to investigate relationships among all pairs of numeric variables in the combined dataframe. After creating the correlation matrix, we will select the columns that contain relationships between all variables paired with average SAT score.

#create a correlation matrix of all pairs of numeric variables
cor_mat <- sat_demog_dir %>% 
    select_if(is.numeric) %>%
    cor(use = "pairwise.complete.obs")

#convert a correation matrix to a tibble
cor_tib <- cor_mat %>%
  as_tibble(rownames = "variable")

#Now that we have a dataframe (tibble) version of the correlation matrix, we can:
#Select the columns that contain relationships between all variables paired with avg_sat_score and the names of the variables.
#Filter the selected columns to retain only rows with correlation coefficient values greater than 0.25 or less than -0.25.

sat_cors <- cor_tib %>%
  select(variable, avg_sat_score) %>%
  filter(avg_sat_score > 0.25 | avg_sat_score < -0.25) %>%
  arrange(avg_sat_score)

sat_cors

Confirming earlier results, the correlation analysis shows racial inequality of SAT scores; schools with higher percentages of Asian and White students tend to have higher average SAT scores, while schools with higher percentages of Hispanic and Black students tend to have lower SAT scores.

Not surprisingly, all the socioeconomic variables (frl_percent, ell_percent, sped_percent) display negative correlations with SAT scores. Meanings of the variables are provided below:

frl_percent: percentage of students eligible for free or reduced lunch based on household income.
ell_percent: percentage of students who are learning to speak English.
sped_percent: percentage of students who receive specialized instruction to accommodate special needs such as learning or physical disabilities.

Overall, the findings suggest socioeconomic and demographic disparities in students’ performance on standardized tests.

EDA 2: Geographic Distribution of Demographic factors

Before we explore geographic distribution of high school students’ demographics, let’s look at the income distribution in NYC. As previously noted, NYC is divided into boroughs. From the below income distribution of these boroughs between 2009 and 2012, Staten Island was the wealthiest borough whereas the poorest boroughs included Bronx and Brooklyn.

Income Distribution by Borough

Eearlier, we saw socioeconomic factors (frl_percent, ell_percent, sped_percent) were negatively related to SAT scores. Let’s investigate whether these factors vary geographically.

sat_socioecon_fac <- sat_demog_dir %>%
    group_by(Borough) %>%
    summarize(frl_percent = mean(frl_percent),
              sped_percent = mean(sped_percent),
              ell_percent = mean(ell_percent))

sat_socioecon_fac <- sat_socioecon_fac %>%
    pivot_longer(cols = c(frl_percent, sped_percent, ell_percent),
                 names_to = "socioecon_factors",
                 values_to = "Percent")


ggplot(data = sat_socioecon_fac, 
         aes(x= Borough, y=Percent, fill=reorder(socioecon_factors, -Percent))) +
         geom_bar(stat = "identity", position = "dodge") +
         labs(
              title = "Geographic Distribution of Socioeconomic Factors", 
              x="", 
              y="Avg. Percent") +
         theme(
              axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
              axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
              panel.background = element_rect(fill = "white"),
              legend.title = element_blank()) +
        coord_cartesian(ylim = c(0, 80)) +
        geom_text(
              aes(label = paste(round(Percent), "%", sep="")),
              position = position_dodge(0.9),
              vjust = -0.3, size = 3.5
        )

On average, the percentage of students who were English language learners(ell_percent) or had special learning needs (sped_percent) was far lower than the percentange of students who qualified for free or reduced lunch (frl_percent) in each borough. The socio-economic factors, frl_percent and ell_percent, may be especially associated with less wealthy areas and poorly funded schools. It is thus not surprising that the percentage values for these factors were higher in the poorest boroughs (Bronx and Brooklyn) than in the wealthiest borough, Staten Island.

Next, let’s check racial breakdown of students by borough.

socio_fc <- sat_demog_dir %>%
                      group_by(Borough) %>%
                      summarise(mean(ell_percent), mean(frl_percent), mean(sped_percent))

socio_fc

race_fc <- sat_demog_dir %>%
                      group_by(Borough) %>%
                      summarise(mean(black_per), mean(hispanic_per), mean(asian_per), mean(white_per))

race_fc

sat_race <- sat_demog_dir %>%
    group_by(Borough) %>%
    summarize(black_per = mean(black_per),
              hispanic_per = mean(hispanic_per),
              asian_per = mean(asian_per),
              white_per = mean(white_per))

sat_race <- sat_race %>%
    pivot_longer(cols = c(black_per, hispanic_per, asian_per, white_per),
                 names_to = "Race",
                 values_to = "Percent")

ggplot(data = sat_race,
        aes(x = Borough, y = Percent, fill=Race)) +
        geom_bar(stat = "identity") +
        labs(x="", y="Avg. Percent") +
        theme(
          axis.text.x = element_text(angle = 90, vjust = 0.2, hjust = 0.95),
          axis.line = element_line(color = "black", size = 0.7, linetype = "solid"),
          panel.background = element_rect(fill = "white"),
          strip.text.x = element_blank()) +
        coord_cartesian(ylim = c(0, 65)) +
        facet_wrap(~Race) +
        geom_text(
              aes(label = paste(round(Percent), "%", sep="")),
              position = position_dodge(0.9),
              vjust = -0.3, size = 3.5
        )

Interestingly, all boroughs had higher percentages of Black and Hispanic students than Asian and White students. The only exception was Staten Island where on average 46% of students were White. However, the largest representations of Black and Hispanic students were especially in the poorest boroughs. In contrast, there were fewer percentages of Asian and White students in poor boroughs and higher percentages of these students in wealthy areas.

EDA 3: Geographic Distribution of SAT Scores

The preceding analyses have shown differences in SAT scores based on socioeconomic and demographic factors which in turn vary geographically. Thus, we would also expect SAT scores to vary based on the geographic locations of schools. Let’s investigate this.

sat_boro <- sat_demog_dir %>%
  group_by(Borough) %>%
  summarise(
            SAT_Avg = mean(avg_sat_score),
            SAT_Median = median(avg_sat_score))

sat_boro

As expected, SAT scores varied geographically (in 2012); the wealthier the borough the higher the SAT scores and vice-versa. Let’s explore further the distribution of SAT scores for each borough using box plots.

ggplot(data = sat_demog_dir,
       aes(x=Borough, y=avg_sat_score)) +
       geom_boxplot()

Here are some observations from the box plots:

Consistent with the average values observed earlier, SAT scores were in general higher for schools in wealthy neighborhoods than in poor neighborhoods.
Interestingly, the lowest SAT score in Staten Island, the wealthiest borough, was higher than most SAT scores in the poorest boroughs: more than 75% of scores in Bronx and about 73% of scores in Brooklyn.
There was more variation in SAT scores of schools in Queens and Manhattan than in the other three boroughs.

Conclusion : Demographics vs SAT Scores

In this project, I have cleaned several messy datasets containing information (from 2006 to 2012) about the demographic of New York City(NYC) schools, test scores, graduation and dropout rates, and school locations. I analysed the 2012 data to understand the impact of demographics on students’ academic performance.

Drawing on the existing research, I used SAT scores as a measure of academic performance. Overall, the analysis revealed socioeoconomic and demogrpahic disparities in SAT scores. Higher percentages of Asian and White students are associated with higher SAT scores. In contrast, higher percentages of Black and Hispanic students are associated with lower SAT scores. Socioeconomic factors such as percentages of students qualifying for free/reduced lunch, learning English, etc. are also negatively correlated with SAT scores.

The analysis also found that demographic variables and SAT scores vary geographically. SAT scores were found to be higher in wealthy boroughs than in poor boroughs. Not surprisingly, the percentage of students qualifying for free or reduced lunch was higher in poor boroughs than in rich boroughs. Higher percentages of Asian and White students were in wealthy boroughs, whereas the largest representations of Black and Hispanic students were in poor boroughs.

I hope in the future to analyse recent data (when publicly available) to understand whether the pattern of results obtained from the 2012 data still hold.

Cleaning and Preparing Data for Analysis

E Adu