Week8_DS Labs Datasets

Loading necessary libraries

library(dslabs)
library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.3
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

For this assignment, I am using the nyc_regents_scores dataset from the dslabs package. It contains the distribution of scores for the New York City Regents Integrated Algebra, Global History, Living Environment, English, and U.S. History exams. I want to visualize score frequency for four of these subjects: Algebra, Global History, English, and U.S. History.

data(nyc_regents_scores)
head(nyc_regents_scores)

##   score integrated_algebra global_history living_environment english us_history
## 1     0                 56             55                 66     165         65
## 2     1                 NA              8                  3      69          4
## 3     2                  1              9                  2     237         16
## 4     3                 NA              3                  1     190         10
## 5     4                  3             15                  1     109          6
## 6     5                  2             11                 10     122          8
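The rows printed above already contain NA entries, so it helps to count missing values per column before deciding how to handle them. The following is a minimal sketch in base R. The `toy` data frame is only the six rows copied from the head() output, not the full dataset, and the names `toy`, `na_counts`, and `complete_rows` are my own.

```r
# First six rows, copied from the head() output above (not the full dataset).
toy <- data.frame(
  score              = 0:5,
  integrated_algebra = c(56, NA, 1, NA, 3, 2),
  global_history     = c(55, 8, 9, 3, 15, 11),
  living_environment = c(66, 3, 2, 1, 1, 10),
  english            = c(165, 69, 237, 190, 109, 122),
  us_history         = c(65, 4, 16, 10, 6, 8)
)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))
print(na_counts)

# Rows with no missing value in any column.
complete_rows <- toy[complete.cases(toy), ]
print(nrow(complete_rows))
```

In these six rows only integrated_algebra has missing values (two of them), which leaves four complete rows. The same two calls can be run on nyc_regents_scores itself.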

For my visualization, I decided to create a line graph with scores on the x-axis and the frequency of each score, by subject, on the y-axis. I start by keeping only the rows in which score, integrated_algebra, global_history, english, and us_history are all non-missing, using !is.na. Next, I select and rename the subjects I am interested in and reshape them with gather into two new columns, Subject and Frequency. After adding labels to the plot, I use scale_color_manual to supply my own set of colors for the color aesthetic. Finally, I split the line plot into one panel per subject using facet_grid, switch from the default ggplot theme to theme_dark, and remove the default strip labels from the panels.

nyc_regents_scores %>%
    # Keep only rows where the score and all four subjects are present
    filter(!is.na(score) &
             !is.na(integrated_algebra) &
             !is.na(global_history) &
             !is.na(english) &
             !is.na(us_history)) %>%
    # Keep and rename the columns of interest
    select(Score = score,
           Algebra = integrated_algebra,
           History = global_history,
           English = english,
           US = us_history) %>%
    # Reshape to long format: one row per Score/Subject pair
    gather(Subject, Frequency, Algebra, History, English, US) %>%
    ggplot(aes(Score, Frequency, col = Subject)) +
    geom_line(size = 0.5) +
    ylab("Score Frequency") +
    xlab("Scores from 0 to 100") +
    ggtitle("NYC Regents Exam Frequency Plot for Different Subjects") +
    xlim(c(0, 100)) +
    # One color per subject, assigned in alphabetical order of Subject
    scale_color_manual(values = c("#335c67", "#fff3b0", "#e09f3e", "#9e2a2b")) +
    # One panel per subject, side by side
    facet_grid(. ~ Subject) +
    theme_dark() +
    # Hide the panel strip labels
    theme(strip.background = element_blank(),
          strip.text.x = element_blank(),
          strip.text.y = element_blank())
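Filtering before reshaping drops an entire score row whenever any one subject is missing, so valid counts for the other subjects go with it. As a sketch of an alternative (not what the plot above does), tidyr's pivot_longer, the successor to gather, can drop only the missing cells through its values_drop_na argument. The `wide` table below is a small illustration built from the first four rows of two columns shown earlier, and the object names are my own.

```r
library(tidyr)

# First four rows of two subjects, copied from the head() output shown earlier.
wide <- data.frame(
  Score   = 0:3,
  Algebra = c(56, NA, 1, NA),
  History = c(55, 8, 9, 3)
)

# Row-wise approach: drop every row with any NA, then reshape.
long_rowwise <- pivot_longer(
  drop_na(wide),
  cols = -Score, names_to = "Subject", values_to = "Frequency"
)

# Cell-wise approach: reshape first and drop only the missing cells.
long_cellwise <- pivot_longer(
  wide,
  cols = -Score, names_to = "Subject", values_to = "Frequency",
  values_drop_na = TRUE
)

print(nrow(long_rowwise))   # 4: only scores 0 and 2 survive, for both subjects
print(nrow(long_cellwise))  # 6: all four History rows kept, two Algebra rows kept
```

Which behavior is preferable depends on the goal: the row-wise filter keeps the subjects aligned on exactly the same scores, while the cell-wise version keeps every observed count.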

According to the New York State Regents Examinations guidelines, students must achieve a score of 65 or higher to pass. The score distributions have a similar shape across the four subjects in this visualization, and the largest spike in frequency generally appears right around the passing threshold. In other words, many more students receive Regents exam scores at or just around the passing threshold than scores well above or below it.
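One way to put a number on the spike is to compare the count at the passing score with the count one point below it. The helper below is a hypothetical sketch in base R: `threshold_jump` is my own name, not part of dslabs or the tidyverse, and the `demo_*` vectors are made-up values used only to show the call.

```r
# Ratio of the count at the threshold to the count one point below it.
# Returns NA when either count is unavailable or the lower count is zero.
threshold_jump <- function(score, freq, threshold = 65) {
  at    <- freq[which(score == threshold)]
  below <- freq[which(score == threshold - 1)]
  if (length(at) != 1 || length(below) != 1 ||
      is.na(at) || is.na(below) || below == 0) {
    return(NA_real_)
  }
  at / below
}

# Made-up counts, purely to illustrate the call (not Regents data).
demo_score <- 63:67
demo_freq  <- c(10, 5, 40, 12, 11)
print(threshold_jump(demo_score, demo_freq))  # 40 / 5 = 8

# On the real data, if dslabs is installed: one ratio per plotted subject.
if (requireNamespace("dslabs", quietly = TRUE)) {
  scores   <- dslabs::nyc_regents_scores
  subjects <- c("integrated_algebra", "global_history", "english", "us_history")
  print(sapply(scores[subjects], function(f) threshold_jump(scores$score, f)))
}
```

A ratio well above 1 for a subject would be consistent with the jump seen in the plot. This looks only at the 64-to-65 step and is not a formal test.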

Source: https://www.regents.nysed.gov/common/regents/files/718PathwaystoGraduation.pdf

Amnah Mahmood