NYC-scores

Author

Davi Krause

New York City Regents scores for 2010

This project looks at the correlation between scores for the NYC Regents exams. The data comes from the dslabs dataset collection.

Opening data set and libraries

To begin any project, we must load our data and the tools we will use to dissect it into something meaningful.

# install.packages("dslabs")  # these are data science labs
library("dslabs")
Warning: package 'dslabs' was built under R version 4.3.3
library(reshape2)
Warning: package 'reshape2' was built under R version 4.3.3
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggrepel)
Warning: package 'ggrepel' was built under R version 4.3.3
Loading required package: ggplot2
Warning: package 'ggplot2' was built under R version 4.3.3
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
data("nyc_regents_scores")
head(nyc_regents_scores)
  score integrated_algebra global_history living_environment english us_history
1     0                 56             55                 66     165         65
2     1                 NA              8                  3      69          4
3     2                  1              9                  2     237         16
4     3                 NA              3                  1     190         10
5     4                  3             15                  1     109          6
6     5                  2             11                 10     122          8
tail(nyc_regents_scores)
    score integrated_algebra global_history living_environment english
97     96                125            547                403     729
98     97                110           1229                446    1071
99     98                 55            764                 87     171
100    99                 19            499                 NA     638
101   100                 NA             NA                 NA      NA
102    NA                148             65                 95      86
    us_history
97         972
98        3039
99        2074
100       1710
101         NA
102         83

The data set has 6 variables. The first is the test score, from 0 to 100; the other 5 give the number of applicants who got that score on each test: integrated algebra, global history, living environment, English and US history. In other words, they give the frequency with which each score was achieved. The goal from now on is to see where these applicants' scores fall among the A, B, C, D and F grades.
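As a rough illustration of how frequency counts can be grouped into letter grades, the sketch below bins a few rows with `cut()`. The English counts are copied from the `head()` and `tail()` output above; the cut-offs (60, 70, 80, 90) are an assumption that mirrors the shaded bands used in the plots further down, and the name `toy` is hypothetical.

```r
# A few (score, count) pairs for English, copied from the output above
toy <- data.frame(
  score   = c(0, 1, 2, 3, 4, 5, 96, 97, 98, 99),
  english = c(165, 69, 237, 190, 109, 122, 729, 1071, 171, 638)
)

# Assumed grade bands: [0,60) F, [60,70) D, [70,80) C, [80,90) B, [90,100] A
toy$grade <- cut(
  toy$score,
  breaks = c(0, 60, 70, 80, 90, 100),
  labels = c("F", "D", "C", "B", "A"),
  right = FALSE,          # intervals are closed on the left
  include.lowest = TRUE   # so that a score of 100 still counts as an A
)

# Sum the frequencies inside each grade; grades with no rows sum to 0
by_grade <- sapply(split(toy$english, toy$grade), sum)
by_grade
```

Because the table stores counts rather than one row per student, the frequencies are summed within each band instead of counting rows.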

Data manipulation

Another thing we can see is that there are many NAs. In the count columns, an NA means that no one got that score on the test. The single NA in the score column represents the candidates who did not take the test, which I will fold into the 0 score, because that is what happens when someone does not take a test.

# Replace every NA (the missing counts and the missing score) with 0
datanona <- replace(nyc_regents_scores, is.na(nyc_regents_scores), 0)
# Group by the column you want to "leave alone" (score) and sum the others within each score;
# across() replaces the deprecated summarise_each(funs(sum))
datanona <- datanona %>% group_by(score) %>% summarise(across(everything(), sum))
head(datanona)
# A tibble: 6 × 6
  score integrated_algebra global_history living_environment english us_history
  <dbl>              <dbl>          <dbl>              <dbl>   <dbl>      <dbl>
1     0                204            120                161     251        148
2     1                  0              8                  3      69          4
3     2                  1              9                  2     237         16
4     3                  0              3                  1     190         10
5     4                  3             15                  1     109          6
6     5                  2             11                 10     122          8
tail(datanona)
# A tibble: 6 × 6
  score integrated_algebra global_history living_environment english us_history
  <dbl>              <dbl>          <dbl>              <dbl>   <dbl>      <dbl>
1    95                242           1388                457     521       1083
2    96                125            547                403     729        972
3    97                110           1229                446    1071       3039
4    98                 55            764                 87     171       2074
5    99                 19            499                  0     638       1710
6   100                  0              0                  0       0          0
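The effect of folding the NA-score row into score 0 can be checked on a small example. This is only a sketch: the three rows are copied from the raw `head()` and `tail()` output above, and the names `toy`, `toy_nona` and `toy_sum` are hypothetical.

```r
library(dplyr)

# Rows for score 0, score 1 and the NA score, copied from the raw output above
toy <- data.frame(
  score              = c(0, 1, NA),
  integrated_algebra = c(56, NA, 148),
  global_history     = c(55, 8, 65),
  living_environment = c(66, 3, 95),
  english            = c(165, 69, 86),
  us_history         = c(65, 4, 83)
)

# Every NA becomes 0, including the missing score itself
toy_nona <- replace(toy, is.na(toy), 0)

# The former NA-score row now shares score 0 with the first row,
# so summing within each score merges the two
toy_sum <- toy_nona %>%
  group_by(score) %>%
  summarise(across(everything(), sum))

toy_sum
```

The score 0 row of `toy_sum` holds 204, 120, 161, 251 and 148, which matches the first row of `head(datanona)`.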

Now we must reshape the data into long format so that ggplot treats all five columns as measurements of the same thing, which makes it easier to draw a clean plot. We also rename the variables so they are understandable to readers who are not familiar with the data.

# Friendlier column names for the plot legend
datanona <- setNames(datanona, c("Scores", "Algebra", "GlobalHistory", "Science", "English", "USHistory"))

# Long format: one row per (Scores, variable) pair, with the count in `value`
newdata <- melt(datanona, id.vars = "Scores",
                measure.vars = c("Algebra", "GlobalHistory", "USHistory", "Science", "English"))
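To make the reshaping step concrete, here is a small sketch of what `melt()` does to two rows of the table. The values are copied from `head(datanona)` above; the names `wide` and `long` are hypothetical.

```r
library(reshape2)

# Two rows in wide format: one column per subject
wide <- data.frame(
  Scores        = c(0, 1),
  Algebra       = c(204, 0),
  GlobalHistory = c(120, 8),
  Science       = c(161, 3),
  English       = c(251, 69),
  USHistory     = c(148, 4)
)

# Long format: every non-id column becomes a (variable, value) pair per score
long <- melt(wide, id.vars = "Scores")
long
```

The 2 rows by 5 subjects become 10 rows with the columns `Scores`, `variable` and `value`, which is the shape that `aes(Scores, value, colour = variable)` expects.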

Plot

Now it is time to plot it all and see how it looks.

# Shaded bands at 0-60, 60-70, 70-80, 80-90 and 90-100, read here as grades F, D, C, B and A
plot <- ggplot(newdata, aes(Scores, value, colour = variable)) +
  geom_rect(aes(xmin = 0, xmax = 60, ymin = 0, ymax = 8500), fill = "gray", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 60, xmax = 70, ymin = 0, ymax = 8500), fill = "white", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 70, xmax = 80, ymin = 0, ymax = 8500), fill = "slategray1", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 80, xmax = 90, ymin = 0, ymax = 8500), fill = "slateblue1", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 90, xmax = 100, ymin = 0, ymax = 8500), fill = "blue", alpha = 0.1, color = NA) +
  geom_line() +
  geom_point() +
  ggtitle("NYC-Regents Scores") +
  xlab("Score") +      # was a second ylab(), which the next line overrode
  ylab("Frequency") +
  scale_color_brewer(palette = "Set1") +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 8500))
plot

A color change:

# Same plot with a red failing band and the "Dark2" line palette
plot <- ggplot(newdata, aes(Scores, value, colour = variable)) +
  geom_rect(aes(xmin = 0, xmax = 60, ymin = 0, ymax = 8500), fill = "red3", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 60, xmax = 70, ymin = 0, ymax = 8500), fill = "white", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 70, xmax = 80, ymin = 0, ymax = 8500), fill = "slategray1", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 80, xmax = 90, ymin = 0, ymax = 8500), fill = "slateblue1", alpha = 0.3, color = NA) +
  geom_rect(aes(xmin = 90, xmax = 100, ymin = 0, ymax = 8500), fill = "blue", alpha = 0.1, color = NA) +
  geom_line() +
  geom_point() +
  ggtitle("NYC-Regents Scores") +
  xlab("Score") +      # was a second ylab(), which the next line overrode
  ylab("Frequency") +
  scale_color_brewer(palette = "Dark2") +
  theme_minimal() +
  theme(legend.position = "top") +
  scale_x_continuous(limits = c(0, 100)) +
  scale_y_continuous(limits = c(0, 8500))
plot

We can see that many applicants would not have passed. And the great majority of Regents scores would be rather mediocre.

Future improvements

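The plots are nice, but unfortunately there is room for improvement: the alpha was not working and there are many dots. The alpha problem is most likely because geom_rect() with constants inside aes() draws one rectangle for every row of the data, so the translucent layers stack up until they look opaque. Showing percentages instead of raw counts could be nice, as could putting the labels for every grade (“A, B, C, D, F”) on the graph.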
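A minimal sketch of how these improvements might look, not a finished plot. It assumes a small stand-in for `newdata` (the hypothetical `toy_long`, with a few counts copied from `tail(datanona)`), a hypothetical `bands` table holding the grade cut-offs used in the plots above, and percentages taken within each subject. Passing the bands as their own data frame draws each rectangle once, so the alpha stays visible.

```r
library(ggplot2)

# Hypothetical stand-in for `newdata`: a few long-format rows
toy_long <- data.frame(
  Scores   = c(95, 96, 97, 95, 96, 97),
  variable = rep(c("Algebra", "English"), each = 3),
  value    = c(242, 125, 110, 521, 729, 1071)
)

# Percentage of each subject's (toy) total instead of raw counts
toy_long$percent <- 100 * toy_long$value /
  ave(toy_long$value, toy_long$variable, FUN = sum)

# One row per grade band (assumed cut-offs, mirroring the rectangles above)
bands <- data.frame(
  grade = c("F", "D", "C", "B", "A"),
  xmin  = c(0, 60, 70, 80, 90),
  xmax  = c(60, 70, 80, 90, 100)
)

p <- ggplot(toy_long, aes(Scores, percent, colour = variable)) +
  # data = bands plus inherit.aes = FALSE: five rectangles in total, not five per data row
  geom_rect(data = bands,
            aes(xmin = xmin, xmax = xmax, ymin = -Inf, ymax = Inf, fill = grade),
            alpha = 0.2, colour = NA, inherit.aes = FALSE) +
  # Grade letter centred at the top of each band
  geom_text(data = bands,
            aes(x = (xmin + xmax) / 2, y = Inf, label = grade),
            vjust = 1.5, inherit.aes = FALSE) +
  geom_line() +   # lines only, which avoids the crowd of dots
  xlab("Score") +
  ylab("Share of tests (%)") +
  theme_minimal() +
  theme(legend.position = "top")

p
```

With the real data, `toy_long` would be replaced by `newdata`; the band colours and the choice to drop `geom_point()` are matters of taste.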