NYC SAT Performance Analysis.

Author

Melvin

Introduction

This project analyzes the 2012 SAT Results dataset, which provides average scores for NYC public high schools.

Variables Defined:

- DBN (District, Borough, Number): A unique six-character identifier for every NYC school.
- School Name: The official name of the educational institution.
- Average Scores (Reading, Math, Writing): The mean score achieved by students at that school in each respective category.
- Number of Test Takers: The total count of students who sat for the exam at that location.

Research Goal

I plan to explore whether a school’s performance in Critical Reading is a reliable predictor of its performance in Writing and Math, and how these trends vary by Borough.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Load the data
# read_csv() comes from readr, which library(tidyverse) has already attached
sat_data <- read_csv("2012_SAT_Results_20260324.csv")

Rows: 478 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): DBN, SCHOOL NAME, Num of SAT Test Takers, SAT Critical Reading Avg....

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

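# Clean the data
# Rename the long column names, convert the privacy-masked "s" values to NA,
# make the score and test-taker columns numeric, and drop incomplete rows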
sat_clean <- sat_data %>%
  rename(
    reading = `SAT Critical Reading Avg. Score`,
    math = `SAT Math Avg. Score`,
    writing = `SAT Writing Avg. Score`,
    takers = `Num of SAT Test Takers`
  ) %>%
  mutate(across(c(reading, math, writing, takers), ~as.numeric(na_if(., "s")))) %>%
  drop_na() %>%
  mutate(Borough = case_when(
    str_detect(DBN, "M") ~ "Manhattan",
    str_detect(DBN, "X") ~ "Bronx",
    str_detect(DBN, "K") ~ "Brooklyn",
    str_detect(DBN, "Q") ~ "Queens",
    str_detect(DBN, "R") ~ "Staten Island"
  ))
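To see what the na_if() step does in isolation, here is a minimal sketch on a made-up score column. The values are illustrative and not taken from the dataset. Converting "s" to NA before calling as.numeric() avoids the coercion warning that as.numeric("s") would otherwise raise.

```r
library(dplyr)

# Hypothetical raw column: scores stored as text, with "s" marking a masked value
raw_scores <- c("355", "s", "404", "s", "423")

# Step 1: replace the "s" placeholder with NA
# Step 2: convert the remaining text to numbers
clean_scores <- as.numeric(na_if(raw_scores, "s"))

clean_scores
sum(is.na(clean_scores))  # how many entries were masked
```

Counting the NA values before calling drop_na() is one way to check how many rows the cleaning step is about to remove.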

# We are predicting Writing scores based on Reading scores
sat_model <- lm(writing ~ reading, data = sat_clean)
summary(sat_model)


Call:
lm(formula = writing ~ reading, data = sat_clean)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.292  -7.206   0.851   8.866  44.999 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.52321    4.93528  -1.524    0.128    
reading      1.00164    0.01219  82.166   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.19 on 419 degrees of freedom
Multiple R-squared:  0.9416,    Adjusted R-squared:  0.9414 
F-statistic:  6751 on 1 and 419 DF,  p-value: < 2.2e-16
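As a rough guide to reading this output, the sketch below plugs the printed coefficients into the fitted line for a hypothetical school with an average Reading score of 400. The score of 400 is an assumption chosen only for illustration. The other numbers are copied from the summary above.

```r
# Coefficients copied from the summary() output above
intercept <- -7.52321
slope     <- 1.00164

# Hypothetical school with an average Reading score of 400
reading_score <- 400
predicted_writing <- intercept + slope * reading_score
predicted_writing  # about 393.1

# Correlation implied by R-squared (positive because the slope is positive)
r <- sqrt(0.9416)
r

# Observations used in the fit: residual df plus the two estimated coefficients
n_schools <- 419 + 2
n_schools
```

With a slope this close to 1 and an intercept that is not significantly different from 0, the model predicts a Writing average slightly below the Reading average across the observed range.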

# Visualization
ggplot(sat_clean, aes(x = reading, y = math, color = Borough)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", color = "black") +
  scale_color_brewer(palette = "Set1") + # a trick to get 5 colors automatically
  theme_minimal() +
  labs(
    title = "NYC High School SAT Scores (2012)",
    x = "Average Reading Score",
    y = "Average Math Score",
    caption = "Source: NYC Open Data"
  )

`geom_smooth()` using formula = 'y ~ x'
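The plot above draws a single trend line for all schools. One possible way to look at how the Reading and Math relationship differs by Borough is to summarise each group separately. The sketch below uses a small made-up data frame in place of sat_clean, so the numbers are hypothetical. The slope uses the least-squares identity cov(x, y) / var(x).

```r
library(dplyr)

# Hypothetical toy data standing in for sat_clean (values are made up)
toy <- data.frame(
  Borough = c("Bronx", "Bronx", "Bronx", "Queens", "Queens", "Queens"),
  reading = c(350, 400, 450, 400, 450, 500),
  math    = c(360, 410, 460, 420, 480, 540)
)

borough_trends <- toy %>%
  group_by(Borough) %>%
  summarise(
    n_schools   = n(),
    correlation = cor(reading, math),
    # Least-squares slope of math on reading within each borough
    slope       = cov(reading, math) / var(reading)
  )

borough_trends
```

In the plot itself, adding facet_wrap(~ Borough) would be one way to give each borough its own panel and fitted line.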

Conclusion and Reflection

Data Cleaning and Methodology

I cleaned the 2012 NYC SAT dataset with tidyverse functions to handle privacy-masked values. I used rename() to give the variables simpler names and na_if() to convert the character “s” into NA. I then converted the score and test-taker columns to numeric with as.numeric() and removed incomplete rows with drop_na(), which left 421 of the original 478 rows. Finally, I created a categorical Borough variable by using str_detect() on the DBN codes within a case_when() statement to allow for group-based visualization (I am proud of that last one).
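A possible alternative to searching the whole DBN for a letter is to read the borough code from a fixed position. The sketch below assumes the DBN layout is a two-digit district, one borough letter, and a three-digit school number, and it uses hypothetical DBN values. The borough_code column is a helper introduced here for illustration.

```r
library(dplyr)
library(stringr)

# Hypothetical DBN codes, assuming: 2-digit district + borough letter + 3-digit number
schools <- data.frame(DBN = c("01M292", "07X221", "13K350", "24Q264", "31R080"))

schools_with_borough <- schools %>%
  mutate(
    borough_code = str_sub(DBN, 3, 3),  # third character only
    Borough = case_when(
      borough_code == "M" ~ "Manhattan",
      borough_code == "X" ~ "Bronx",
      borough_code == "K" ~ "Brooklyn",
      borough_code == "Q" ~ "Queens",
      borough_code == "R" ~ "Staten Island"
    )
  )

schools_with_borough
```

Codes that match none of the five letters come back as NA, which makes unexpected values easy to count.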

Visualization and Key Findings

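The regression of Writing on Reading shows a very tight linear relationship: the slope is 1.00 (p < 2e-16) and R-squared is 0.94, so each additional point of average Reading score is associated with about one additional point of average Writing score. The visualization shows a strong, positive linear relationship between Reading and Math scores across all boroughs. A key observation is that schools excelling in one subject almost always excel in the others.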

Limitations and Future Work

I originally intended to include a geographic map to see whether school location correlates with performance. However, due to time constraints, I focused on the linear regression and static plotting. In the future, I would like to join this dataset with demographic data to explore the socioeconomic factors behind differences in scores between schools.