Note: The original file is 57MB so I did minor manipulation using excel to create a smaller file. I copy and pasted all columns for the statewide data and each NYC county to a new csv file. This is the only data manipulation done outside of R and this subset of data will be used in this project.
#install.packages("kableExtra") #install this package if it is not already installed
library(knitr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(RCurl)
library(kableExtra)
library(tidyverse)
df <- read.table(text = getURL("https://raw.githubusercontent.com/ltcancel/Project2/master/3-8%20ELA%20and%20Math%20Scores%20by%20County.csv"), header = TRUE, sep = ",")
head(df)
## X.U.FEFF.SY_END_DATE NRC_CODE NRC_DESC COUNTY_CODE COUNTY_DESC BEDSCODE
## 1 6/30/2019 NA NA NA 0
## 2 6/30/2019 NA NA NA 0
## 3 6/30/2019 NA NA NA 0
## 4 6/30/2019 NA NA NA 0
## 5 6/30/2019 NA NA NA 0
## 6 6/30/2019 NA NA NA 0
## NAME ITEM_SUBJECT_AREA ITEM_DESC
## 1 STATEWIDE - ALL DISTRICTS AND CHARTERS ELA Grade 3 ELA
## 2 STATEWIDE - ALL DISTRICTS AND CHARTERS ELA Grade 3 ELA
## 3 STATEWIDE - ALL DISTRICTS AND CHARTERS ELA Grade 3 ELA
## 4 STATEWIDE - ALL DISTRICTS AND CHARTERS ELA Grade 3 ELA
## 5 STATEWIDE - ALL DISTRICTS AND CHARTERS ELA Grade 3 ELA
## 6 STATEWIDE - ALL DISTRICTS AND CHARTERS ELA Grade 3 ELA
## SUBGROUP_CODE SUBGROUP_NAME TOTAL_TESTED L1_COUNT
## 1 1 All Students 172149 26358
## 2 2 Female 84744 10645
## 3 3 Male 87405 15713
## 4 4 American Indian or Alaska Native 1216 187
## 5 5 Black or African American 29834 6181
## 6 6 Hispanic or Latino 50297 10216
## L1_PCT L2_COUNT L2_PCT L3_COUNT L3_PCT L4_COUNT L4_PCT L2.L4_PCT
## 1 15% 55837 32% 77771 45% 12183 7% 85%
## 2 13% 26224 31% 40751 48% 7124 8% 87%
## 3 18% 29613 34% 37020 42% 5059 6% 82%
## 4 15% 450 37% 499 41% 80 7% 85%
## 5 21% 10264 34% 11721 39% 1668 6% 79%
## 6 20% 18514 37% 19334 38% 2233 4% 80%
## L3.L4_PCT MEAN_SCALE_SCORE
## 1 52% 600
## 2 56% 602
## 3 48% 598
## 4 48% 599
## 5 45% 596
## 6 43% 595
Gender subgroup with only the columns that are most important for this analysis
##all genders by boro excluding statewide data
All.gender.df <- df %>% select(COUNTY_DESC, NAME, SUBGROUP_CODE, SUBGROUP_NAME, TOTAL_TESTED, MEAN_SCALE_SCORE) %>% filter(SUBGROUP_CODE %in% c(1:3), COUNTY_DESC !="")
head(All.gender.df)
## COUNTY_DESC NAME SUBGROUP_CODE SUBGROUP_NAME TOTAL_TESTED
## 1 QUEENS QUEENS COUNTY 1 All Students 20691
## 2 QUEENS QUEENS COUNTY 2 Female 10185
## 3 QUEENS QUEENS COUNTY 3 Male 10506
## 4 QUEENS QUEENS COUNTY 1 All Students 21104
## 5 QUEENS QUEENS COUNTY 2 Female 10346
## 6 QUEENS QUEENS COUNTY 3 Male 10758
## MEAN_SCALE_SCORE
## 1 603
## 2 605
## 3 601
## 4 603
## 5 603
## 6 603
Now that a filter dataframe has been created, let us compare the average test scores by gender for each county. If we only look at each group, ignoring the colors within each subgroup for a moment, we can see that the Bronx population of 3-8 graders had the lowest average score while the New York county had the highest. Each of the five counties has a similar distribution of males and female scores where females (green) tend to score higher than males (blue). The Bronx and Richmond county had the greatest spread of scores while Queens is a little more condensed, showing us that most of the students had similar test scores.
##all gender plot by boro
ggplot(All.gender.df, aes(x=COUNTY_DESC, y=MEAN_SCALE_SCORE, color = SUBGROUP_NAME)) + geom_dotplot(binaxis = "y",stackdir = "center", binwidth = 1, fill = "white", stroke = 2) + labs(title = "Average Scores By Gender", x = "County", y = "Average Score")
Subgroup based on immigration status and living status (homeless status and foster care status)
All.status.df <- df %>% select(COUNTY_DESC, NAME, SUBGROUP_CODE, SUBGROUP_NAME, TOTAL_TESTED, MEAN_SCALE_SCORE) %>% filter(SUBGROUP_CODE %in% c(17,18,20,21,22,23))
head(All.status.df)
## COUNTY_DESC NAME SUBGROUP_CODE
## 1 STATEWIDE - ALL DISTRICTS AND CHARTERS 17
## 2 STATEWIDE - ALL DISTRICTS AND CHARTERS 18
## 3 STATEWIDE - ALL DISTRICTS AND CHARTERS 20
## 4 STATEWIDE - ALL DISTRICTS AND CHARTERS 21
## 5 STATEWIDE - ALL DISTRICTS AND CHARTERS 22
## 6 STATEWIDE - ALL DISTRICTS AND CHARTERS 23
## SUBGROUP_NAME TOTAL_TESTED MEAN_SCALE_SCORE
## 1 Migrant 125 583
## 2 Not Migrant 172024 600
## 3 Homeless 8675 591
## 4 Not Homeless 163474 600
## 5 In Foster Care 209 588
## 6 Not in Foster Care 171940 600
This boxplot shows us a greater disparity based on home satus, greater than the gender breakdown. Statewide data is included in this dataset but appears with no label under county. The population of immigrant students and students in foster care were only included in the statewide data and were not broken down by county. However we can see that this population of students scored the lowest when compared to all other subgroups. Homeless students are right in the middle across all counties. All other students who are not homeless, not in foster care, and not immigrants, all had very similar scores.
ggplot(All.status.df, aes(x=COUNTY_DESC, y=MEAN_SCALE_SCORE, color = SUBGROUP_NAME)) + geom_boxplot() + labs(title = "Average Scores by Home Status", x = "County", y = "Average Score")
Subgroup based on race/ethnicity.
race.df <- df %>% select(COUNTY_DESC, NAME, SUBGROUP_CODE, SUBGROUP_NAME, TOTAL_TESTED, MEAN_SCALE_SCORE) %>% filter(SUBGROUP_CODE == "1" | SUBGROUP_CODE %in% c(4:9))
##rename columns in new filtered dataframe
colnames(race.df) <- c("COUNTY", "NAME", "SUB_CODE", "RACE_ETHNICITY", "TOTAL_TESTED", "MEAN_SCORE")
head(race.df)
## COUNTY NAME SUB_CODE
## 1 STATEWIDE - ALL DISTRICTS AND CHARTERS 1
## 2 STATEWIDE - ALL DISTRICTS AND CHARTERS 4
## 3 STATEWIDE - ALL DISTRICTS AND CHARTERS 5
## 4 STATEWIDE - ALL DISTRICTS AND CHARTERS 6
## 5 STATEWIDE - ALL DISTRICTS AND CHARTERS 7
## 6 STATEWIDE - ALL DISTRICTS AND CHARTERS 8
## RACE_ETHNICITY TOTAL_TESTED MEAN_SCORE
## 1 All Students 172149 600
## 2 American Indian or Alaska Native 1216 599
## 3 Black or African American 29834 596
## 4 Hispanic or Latino 50297 595
## 5 Asian or Pacific Islander 17623 609
## 6 White 67699 602
This chart color codes all students based on their race/ethnicity so we can compare the average score between each group. We can see that towards that top of each county, green seems to take up most of the space and this is the Asian or Pacific Islander group. They are acutally hold the highest score across counties. New York has a larger Multiracial population and this group scored higher when compared to the other counties. New York also has an outlier group of American Indian or Alaskan Native that is their lowest scoring group.
##Plot average scores for each county based on race/ethnicity
ggplot(race.df, aes(x=COUNTY, y=MEAN_SCORE)) + geom_raster(aes(fill = RACE_ETHNICITY), hjust = 0.2, vjust = 0.2, interpolate = FALSE ) + labs(title = "Average Score by Race/Ethnicity per County", x = "County", y = "Mean Score")