DS Labs

Author

Charanpreet Singh

The NYC Regents Scores Dataset

The dataset used in this project, nyc_regents_scores from the dslabs package, contains a tally of New York City Regents exam results across several subjects. Each row represents a score value (the score column), and the remaining columns give the number of students who earned that score in Integrated Algebra, Global History, Living Environment, English, and US History. The Regents are standardized high school exit exams administered across New York State. Some entries are missing and some counts are unusually high, suggesting either incomplete records or outliers that may need further investigation. This dataset offers a valuable opportunity to explore cross-subject patterns and identify correlations among the score distributions of the different exams.

Loaded the required packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dslabs)

Understand the dataset & what I have to work with

head(nyc_regents_scores)
  score integrated_algebra global_history living_environment english us_history
1     0                 56             55                 66     165         65
2     1                 NA              8                  3      69          4
3     2                  1              9                  2     237         16
4     3                 NA              3                  1     190         10
5     4                  3             15                  1     109          6
6     5                  2             11                 10     122          8

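One thing that confused me for a while was why this dataset is so small, and whether I was doing something wrong, because I am used to much larger datasets. The table is small because it has one row per score value (the score column) rather than one row per student.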
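A quick way to settle the size question is to check the dimensions and column types directly. The sketch below rebuilds the six rows printed by head() as a stand-in so it runs on its own; the name sample_scores is hypothetical, and with dslabs loaded the same calls work on nyc_regents_scores itself.

```r
# Stand-in data: the six rows shown by head(nyc_regents_scores) above.
# `sample_scores` is a hypothetical name used only for this sketch.
sample_scores <- data.frame(
  score              = 0:5,
  integrated_algebra = c(56, NA, 1, NA, 3, 2),
  global_history     = c(55, 8, 9, 3, 15, 11),
  living_environment = c(66, 3, 2, 1, 1, 10),
  english            = c(165, 69, 237, 190, 109, 122),
  us_history         = c(65, 4, 16, 10, 6, 8)
)

dim(sample_scores)             # number of rows and columns
str(sample_scores)             # column names and types
range(sample_scores$score)     # smallest and largest score value present
colSums(is.na(sample_scores))  # missing values per column
```

Running dim() and range() on the full dataset shows how many score values the table covers, which is why the row count stays small no matter how many students took the exams.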

Filter & clean the dataset

# Drop rows with missing values in the three columns used in the plot
filtered_scores <- nyc_regents_scores |>
  filter(!is.na(living_environment), !is.na(english), !is.na(us_history))

I used !is.na() to filter out the rows with NAs in the three columns that I was going to use in my ggplot.
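It can help to know how many rows the NA filter actually removes. The sketch below does the same kind of filtering in base R with complete.cases() and reports the number of dropped rows; drop_incomplete is a hypothetical helper, and the toy data frame is made up purely to show the call.

```r
# Hypothetical helper: keep only rows with no NA in the listed columns
# and report how many rows were removed.
drop_incomplete <- function(data, cols) {
  keep <- complete.cases(data[, cols, drop = FALSE])
  message(sum(!keep), " of ", nrow(data), " rows removed")
  data[keep, , drop = FALSE]
}

# Made-up toy data with the same column names, only to show the call.
toy <- data.frame(
  living_environment = c(66, NA, 2, 1),
  english            = c(165, 69, NA, 190),
  us_history         = c(65, 4, 16, 10)
)
cleaned <- drop_incomplete(toy, c("living_environment", "english", "us_history"))
nrow(cleaned)
```

With dslabs loaded, passing nyc_regents_scores instead of the toy data should give the same rows as the filter() call above.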

Created a scatterplot of Living Environment vs English scores

# colored by US History scores
ggplot(filtered_scores, aes(x = living_environment, y = english, color = us_history)) +
  geom_point(size = 4, alpha = 0.8) +
  labs(
    title = "Living Environment vs English Regents Scores",
    subtitle = "Colored by US History Performance",
    x = "Living Environment Score",
    y = "English Score",
    color = "US History Score"
  ) +
  theme_light() +  # Change ggplot theme
  scale_color_viridis_c(option = "C")  # Color scale for US History

Data Analysis

The scatterplot in this project examines the relationship between the Living Environment and English Regents exams, with US History represented through color. Because each row of the dataset is a score value and each subject column holds the number of students who earned that score, every point stands for one score value rather than one student. The plot shows a clear positive correlation between the Living Environment and English counts: score values that many students earned on one exam tend to be common on the other as well. This pattern suggests that the score distributions of the two exams have a similar shape. It cannot show whether the same students do well in both subjects, because the data does not follow individual students. The visualization also reveals a few extreme points with values in the thousands, far beyond the cluster of typical values. These are most likely very common score values rather than “overachievers”, but they deserve a closer look.

In conclusion, this visualization reveals useful trends in Regents exam results across multiple subjects. It illustrates a strong link between the English and Living Environment score distributions, adds the US History counts through color, and uncovers potential anomalies in the data. These findings open the door to future analyses that might explore additional factors such as school district, demographics, or instructional methods.
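The "clear positive correlation" read off the plot can be backed with a number. The sketch below uses cor() from base R and reports both Pearson and Spearman coefficients, since a few extreme points can pull the Pearson value around. The helper column_correlation is hypothetical, and the toy data is made up only to show the call.

```r
# Hypothetical helper: Pearson and Spearman correlation between two columns,
# using only rows where both values are present.
column_correlation <- function(data, x, y) {
  c(
    pearson  = cor(data[[x]], data[[y]], use = "complete.obs", method = "pearson"),
    spearman = cor(data[[x]], data[[y]], use = "complete.obs", method = "spearman")
  )
}

# Made-up toy data, only to show the call; with dslabs loaded, pass
# nyc_regents_scores with "living_environment" and "english" instead.
toy <- data.frame(a = c(1, 2, 3, 4, NA), b = c(2, 4, 6, 8, 10))
column_correlation(toy, "a", "b")
```

If the two coefficients differ a lot on the real data, that would be a hint that the extreme points are driving the apparent trend.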

Sources

https://ggplot2.tidyverse.org/reference/scale_viridis.html

^ ggplot2 reference for the viridis color scales (scale_color_viridis_c), used for the point colors