dslabsgraph

Author

Emma Poch

# Loading all necessary libraries
setwd("C:/Users/emmap/Downloads/DATA110")
library(dslabs)

Warning: package 'dslabs' was built under R version 4.3.3

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(viridis)

Warning: package 'viridis' was built under R version 4.3.3

Loading required package: viridisLite

library(plotly)

Warning: package 'plotly' was built under R version 4.3.3


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Loading the NYC Regents exam score data from 2010
data("nyc_regents_scores")

# Removing rows that have an NA value in any column
# (comma-separated conditions in filter() are combined with AND)
scores2 <- nyc_regents_scores |>
  filter(
    !is.na(score),
    !is.na(integrated_algebra),
    !is.na(global_history),
    !is.na(living_environment),
    !is.na(english),
    !is.na(us_history)
  )

# Comparing the total number of scores recorded for the algebra and English exams
sum(scores2$integrated_algebra) - sum(scores2$english)

[1] 28243
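The same comparison can be made for several exams at once by summing each frequency column. The following is a minimal sketch in base R; `toy_scores` is a hypothetical stand-in with made-up counts that only mimics the shape of `scores2`, and `exam_totals` and `total_diff` are names introduced here for illustration.

```r
# Hypothetical toy table in the same shape as scores2:
# one row per score, one frequency column per exam (counts are made up).
toy_scores <- data.frame(
  score = c(55, 65, 75, 85),
  integrated_algebra = c(10, 40, 30, 20),
  english = c(5, 25, 35, 15)
)

# Total number of recorded scores per exam (sum of each frequency column)
exam_totals <- colSums(toy_scores[, c("integrated_algebra", "english")])
exam_totals

# Difference between the two totals, mirroring the comparison above
total_diff <- exam_totals[["integrated_algebra"]] - exam_totals[["english"]]
total_diff
```

Applied to the real data, replacing `toy_scores` with `scores2` and listing all five exam columns should give the totals for every subject in one call.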

# Creating an interactive plot of the frequency distributions of math and English scores
plot1 <- scores2 |>
  ggplot(aes(x = score, y = integrated_algebra, col = english))+
  geom_jitter()+
  labs(x = "Test Score", y = "Frequency on Integrated Algebra Exam", col = "Frequency on \n English Exam", title = "Frequency of NYC Math and English Score Distributions 2010")+
  theme_bw()+
  scale_color_viridis(option = "A")
plot2 <- ggplotly(plot1)
plot2

I used the nyc_regents_scores dataset, which records the frequency of each score (on a scale from 0 to 100) for five subjects of the NYC Regents exams. The Regents exams are standardized tests administered statewide in New York to determine student proficiency in a variety of subjects. I was interested in the distribution of these scores, particularly for math and English, since these two are arguably the most central of the subjects included.

The integrated algebra frequencies follow a roughly normal distribution. There is a moderate left skew, meaning the scores cluster toward the higher end with a longer tail of low scores, but it is not pronounced enough to meaningfully change the overall shape. The English frequencies are harder to read because they are encoded in the color scale, but they also appear to follow a roughly normal distribution. It is worth noting that the number of scores recorded for the math exam exceeds the number recorded for the English exam by 28,243, which makes a direct comparison of raw frequencies more difficult. However, the overall patterns in the two distributions are similar enough that I feel the comparison is justified.
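The skew and the unequal totals described above can also be checked numerically rather than read off the plot. This is only a sketch under stated assumptions: `toy_scores` holds made-up counts, not the Regents data, and `weighted_skewness()` is a hypothetical helper written here (a frequency-weighted moment coefficient of skewness), not a function from dslabs or the tidyverse.

```r
# Hypothetical frequency table (made-up counts, same layout as scores2)
toy_scores <- data.frame(
  score = c(50, 60, 70, 80, 90),
  integrated_algebra = c(5, 10, 20, 40, 25),
  english = c(4, 8, 16, 32, 20)
)

# Moment coefficient of skewness for a score/frequency table.
# Negative values indicate a left skew, positive values a right skew.
weighted_skewness <- function(x, freq) {
  n  <- sum(freq)
  m  <- sum(x * freq) / n               # frequency-weighted mean
  m2 <- sum(freq * (x - m)^2) / n       # second central moment
  m3 <- sum(freq * (x - m)^3) / n       # third central moment
  m3 / m2^1.5
}

algebra_skew <- weighted_skewness(toy_scores$score, toy_scores$integrated_algebra)
algebra_skew

# Converting counts to proportions removes the effect of unequal totals,
# so the shapes of the two distributions can be compared directly.
algebra_prop <- toy_scores$integrated_algebra / sum(toy_scores$integrated_algebra)
english_prop <- toy_scores$english / sum(toy_scores$english)
data.frame(score = toy_scores$score, algebra_prop, english_prop)
```

In this toy table the two exams have different totals but identical proportions, which is the kind of similarity in shape that raw frequencies can hide. Running the same steps on `scores2` would show how closely the real distributions match.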