Assignment 7

Author

Mia Ramirez

Assignment 7

I start by installing the necessary packages, including the Data Science Labs package “dslabs”.

install.packages("tidyverse")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

install.packages("dslabs")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

install.packages("ggplot2")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

library(tidyr)
library(ggplot2)
library(dslabs)
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

I’m interested in the dataset of NYC Regents scores! The Regents are standardized exams that almost all New York State public school students must take to qualify for their high school diplomas (some alternative schools exempt their students from these tests, though, and those students qualify in other ways).

Background information on the dataset was found here: https://cran.r-project.org/web/packages/dslabs/refman/dslabs.html#nyc_regents_scores

data("nyc_regents_scores")

# convert the dataset to a tibble with a shorter name to make it easier to work with
regents <- as_tibble(nyc_regents_scores)
head(regents)

# A tibble: 6 × 6
  score integrated_algebra global_history living_environment english us_history
  <dbl>              <dbl>          <dbl>              <dbl>   <dbl>      <dbl>
1     0                 56             55                 66     165         65
2     1                 NA              8                  3      69          4
3     2                  1              9                  2     237         16
4     3                 NA              3                  1     190         10
5     4                  3             15                  1     109          6
6     5                  2             11                 10     122          8

I see that the frequencies of the test scores vary a lot, so I’m interested to see whether the total number of tests taken is the same for each exam. I’m going to use sum() to find the total number of tests taken for each exam, with na.rm = TRUE so that missing counts are skipped.

totalalgebra <- sum(regents$integrated_algebra, na.rm = TRUE)
totalglobal <- sum(regents$global_history, na.rm = TRUE)
totalenvironment <- sum(regents$living_environment, na.rm = TRUE)
totalenglish <- sum(regents$english, na.rm = TRUE)
totalhistory <- sum(regents$us_history, na.rm = TRUE)

sums <- c(totalalgebra, totalglobal, totalenvironment, totalenglish, totalhistory)

head(sums)

[1] 131172 113869 104296 103972  91839
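The same five totals could also be computed in one step. This is a minimal sketch, assuming the dataset has the column layout shown in the head() output above; the names exam_counts and totals are only illustrative and are not used elsewhere in this report.

```r
library(dslabs)

data("nyc_regents_scores")

# Drop the score column, which holds the score values rather than counts
exam_counts <- nyc_regents_scores[, names(nyc_regents_scores) != "score"]

# Sum every remaining exam column at once, skipping missing counts
totals <- colSums(exam_counts, na.rm = TRUE)
totals
```

The result is a named vector, so each total stays labeled with its exam instead of relying on the order of the values.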

I see that they are not in fact equal! So I’m curious: for each of the Regents exams, which scores were the MOST frequent? What percentage of the tests taken for an exam received a given score, for example, a 42 on the English exam?

I write code to help me sort and clean this data. For each exam, I add a category column, since I’m going to tie each score to its relevant exam. I use rename() to relabel the exam’s column as score_frequency. I use mutate() to add a column that divides each score frequency by the total number of tests taken for that exam. Then I use subset() to keep only the columns I need, so each new data frame “regents[exam]” has all of the information in one place.

regentsalgebra <- regents
regentsalgebra$category <- "Integrated Algebra"
regentsalgebra <- rename(regentsalgebra, score_frequency = integrated_algebra)
regentsalgebra <- regentsalgebra %>% mutate(percentoftest = (score_frequency / totalalgebra))
regentsalgebra <- subset(regentsalgebra, select = c(score,score_frequency, category, percentoftest))

regentsglobal <- regents
regentsglobal$category <- "Global History"
regentsglobal <- rename(regentsglobal, score_frequency = global_history)
regentsglobal <- regentsglobal %>% mutate(percentoftest = (score_frequency / totalglobal))
regentsglobal <- subset(regentsglobal, select = c(score,score_frequency, category, percentoftest))

regentsenvironment <- regents
regentsenvironment$category <- "Living Environment"
regentsenvironment <- rename(regentsenvironment, score_frequency = living_environment)
regentsenvironment <- regentsenvironment %>% mutate(percentoftest = (score_frequency / totalenvironment))
regentsenvironment <- subset(regentsenvironment, select = c(score,score_frequency, category, percentoftest))

regentsenglish <- regents
regentsenglish$category <- "English"
regentsenglish <- rename(regentsenglish, score_frequency = english)
regentsenglish <- regentsenglish %>% mutate(percentoftest = (score_frequency / totalenglish))
regentsenglish <- subset(regentsenglish, select = c(score,score_frequency, category, percentoftest))

regentshistory <- regents
regentshistory$category <- "US History"
regentshistory <- rename(regentshistory, score_frequency = us_history)
regentshistory <- regentshistory %>% mutate(percentoftest = (score_frequency / totalhistory))
regentshistory <- subset(regentshistory, select = c(score,score_frequency, category, percentoftest))

Next, I’m going to combine all the information into one data frame. My goal is to have an object that records the possible scores (0-100), the exam categories (5 Regents exams), and the respective score frequency and percentage for each exam. For example, for a score of 65, the Living Environment exam had a total of 7,978 tests with that score, which was 7.65% of the total tests taken for that exam.

cleanregents <- bind_rows(regentsalgebra,regentsenglish,regentsenvironment,regentsglobal,regentshistory)

cleanregents <- cleanregents %>% group_by(score)
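The five near-identical blocks and the bind_rows() call above could probably be collapsed into a single pipeline. This is a rough sketch using pivot_longer() from tidyr, not the approach used in this report; regents_long is a hypothetical name, and the category values here would be the raw column names (such as "living_environment") rather than the nicer labels typed by hand above.

```r
library(dslabs)
library(dplyr)
library(tidyr)

data("nyc_regents_scores")

regents_long <- as_tibble(nyc_regents_scores) %>%
  # Stack the five exam columns into one category / score_frequency pair
  pivot_longer(cols = -score,
               names_to = "category",
               values_to = "score_frequency") %>%
  # Divide each frequency by the total number of tests for its own exam
  group_by(category) %>%
  mutate(percentoftest = score_frequency / sum(score_frequency, na.rm = TRUE)) %>%
  ungroup()

# Look up the example from the text: a score of 65 on Living Environment
filter(regents_long, score == 65, category == "living_environment")
```

Grouping by category before mutate() means the total is computed separately for each exam, so no separate total objects are needed.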

Now I’m going to make a graph!

install.packages("plotly")

Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

library(scales)

I want to make an area graph so we can see all 5 exams together, follow how the percentages change across the range of test scores, and spot any “spikes” in the percentage of tests with a given score. I use the scales library and scale_y_continuous to display the y-axis as percentages. I add the relevant labels.

regentsarea <- ggplot(cleanregents,
                      aes(x = score,
                          y = percentoftest,
                          fill = category)) +
  geom_area() +
  scale_y_continuous(labels = percent) +
  scale_fill_brewer(palette = "Accent") +
  labs(title = "Percentage of Scores on NYC Regents Exams",
       subtitle = "An exam score of 65 is the most frequent result for test-takers.",
       caption = "Source: Data Science Labs (dslabs)",
       x = "Regents Exam Score",
       y = "Percentage of Tests Taken",
       fill = "Regents Exam") +
  # apply the complete theme first so it does not reset the legend position
  theme_dark() +
  theme(legend.position = "bottom")

regentsarea

Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_align()`).
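The warning most likely comes from missing (NA) counts in the original data, such as the NA values for integrated_algebra visible in the head() output earlier. A quick way to see where the missing values sit is to count them per column. This is only a diagnostic sketch; na_counts is an illustrative name.

```r
library(dslabs)

data("nyc_regents_scores")

# is.na() marks each missing cell; colSums() then counts them per column
na_counts <- colSums(is.na(nyc_regents_scores))
na_counts
```

Whether a missing count should be read as zero tests at that score is an assumption that would need to be checked against the dataset documentation before filling the gaps in.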

My biggest insight is that across all 5 exam types there is a major spike at a score of 65, which is the score needed to pass. A score of 65 clearly makes up a larger percentage of tests than any other score, and the gap is noticeable compared to 64 (which is just barely failing) and even compared to the slightly higher scores of 66, 67, and 68. This is interesting, and it makes me wonder whether many students study and practice just enough to make sure they don’t fail and then don’t care about doing any better on the test, or whether there is a bias in the grading system.
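The reading of the graph could be checked numerically by looking up the most frequent score for each exam. This is a small sketch, assuming the same column layout as before; exam_cols and most_frequent are illustrative names only.

```r
library(dslabs)

data("nyc_regents_scores")

# Every column except score holds the counts for one exam
exam_cols <- setdiff(names(nyc_regents_scores), "score")

# which.max() skips missing counts and returns the row of the largest count;
# that row number is then used to look up the matching score
most_frequent <- sapply(exam_cols, function(exam) {
  nyc_regents_scores$score[which.max(nyc_regents_scores[[exam]])]
})
most_frequent
```

If the spike described above holds for every exam, each entry of the result would be 65.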