I start by installing the necessary packages, including the Data Science Labs package “dslabs”.
install.packages("tidyverse")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("dslabs")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
install.packages("ggplot2")
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
I’m interested in the dataset with NYC Regents scores! The Regents are standardized tests that almost all New York State public school students need to take in order to qualify for their high school diplomas (though some alternative schools exempt their students from these tests, and those students qualify in other ways).
I see that the frequencies of the test scores vary a lot, so I’m interested to see if the total number of tests taken is the same for each exam. I’m going to use sum() to find the total number of tests taken for each exam.
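A minimal sketch of this totals step, assuming the dslabs data frame is named nyc_regents_scores and that `score` is its only non-exam column (both are assumptions about the dataset layout, and `exam_totals` is a name chosen here for illustration):

```r
library(dslabs)
library(dplyr)

data("nyc_regents_scores")

# Sum the test counts in every exam column, ignoring missing values.
# `score` is assumed to be the only column that is not an exam.
exam_totals <- nyc_regents_scores |>
  summarise(across(-score, ~ sum(.x, na.rm = TRUE)))

exam_totals
```

The result is a one-row data frame with one total per exam, which makes it easy to compare the exams side by side.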
I see that they are not in fact equal! So I’m curious: for each of the Regents exams, which scores were the MOST frequent? What percentage of the total number of tests taken scored, for example, a 42 on the English exam?
I write some code to help me sort and clean this data. I create a new column for category, since I’m going to try to tie each score to its relevant exam. I use the rename function to label the frequency. I use mutate to add a column where I take the score frequency and divide it by the total number of tests taken for that category of exam. Then I create a new data frame “regents[exam]” for each exam so I have all of that exam’s information in one place.
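Here is a sketch of what this cleaning step could look like for a single exam. It assumes the dataset has `score` and `english` columns; `regentsenglish` and `frequency` are illustrative names that follow the “regents[exam]” pattern, not necessarily the ones used in the original code.

```r
library(dslabs)
library(dplyr)

data("nyc_regents_scores")

# One exam at a time: keep the score and that exam's counts,
# label the counts, tag the exam, and compute each score's share.
regentsenglish <- nyc_regents_scores |>
  select(score, english) |>
  rename(frequency = english) |>
  mutate(
    category = "English",
    percentoftest = frequency / sum(frequency, na.rm = TRUE)
  )

head(regentsenglish)
```

Repeating the same pattern with the other exam columns gives one data frame per exam, each with the same four columns.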
Next, I’m going to combine all the information into one data frame. My goal is to have an object that records the possible scores (0-100), the exam categories (the 5 Regents exams), and the frequency of each score for each exam, both as a count and as a percentage. For example, for a score of 65, the Living Environment exam had a total of 7,978 tests with that score, which was 7.65% of the total tests taken for that exam.
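One way to reach the same combined shape is sketched below. Note that it swaps in tidyr’s pivot_longer() instead of binding separate per-exam data frames with bind_rows(), so it is an alternative route rather than the exact code used here. The name `cleanregents_sketch` is hypothetical, and `score` is again assumed to be the only non-exam column.

```r
library(dslabs)
library(dplyr)
library(tidyr)

data("nyc_regents_scores")

# Stack the exam columns into one long table: one row per score and exam.
cleanregents_sketch <- nyc_regents_scores |>
  pivot_longer(-score, names_to = "category", values_to = "frequency") |>
  group_by(category) |>
  # Share of each score within its own exam
  mutate(percentoftest = frequency / sum(frequency, na.rm = TRUE)) |>
  ungroup()

# Look at the rows for a score of 65 across all exams
cleanregents_sketch |> filter(score == 65)
```

With this layout, `category` holds the exam column names as they appear in the dataset, so they may need relabeling before plotting.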
Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.6'
(as 'lib' is unspecified)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(scales)
I want to make an area graph so we can see all 5 exams together, follow how the share of tests changes across the range of scores, and spot any “spikes” in the percentage of tests with a given score. I use the scales package with scale_y_continuous() to display the y-axis as percentages, and I add the relevant labels.
regentsarea <- ggplot(cleanregents, aes(x = score, y = percentoftest, fill = category)) +
  geom_area() +
  scale_y_continuous(labels = percent) +
  scale_fill_brewer(palette = "Accent") +
  labs(
    title = "Percentage of Scores on NYC Regents Exams",
    subtitle = "An exam score of 65 is the most frequent result for test-takers.",
    caption = "Source: Data Science Labs (dslabs)",
    # Axis labels match the aesthetics: score on x, share of tests on y
    x = "Regents Exam Score",
    y = "Percentage of Tests Taken",
    fill = "Regents Exam"
  ) +
  # theme_dark() is a complete theme, so it goes before the legend tweak
  theme_dark() +
  theme(legend.position = "bottom")
regentsarea
Warning: Removed 14 rows containing non-finite outside the scale range
(`stat_align()`).
My biggest insight is that, across all 5 exam types, we see a major spike at a score of 65, which is the score needed to pass. Compared to any other test score, 65 clearly makes up the largest percentage of all scores, and the gap is noticeable next to 64 (which is just barely failing) and even next to the slightly higher scores of 66, 67, and 68. This is interesting, and it makes me wonder whether many students study and practice just enough to make sure they don’t fail and then don’t care about doing any better on the test, or whether there is a bias in how the exams are graded.