Dataset #1

Data Description: ELA and Math test scores for grades 3-8 by state and county

Note: The original file is 57MB so I did minor manipulation using excel to create a smaller file. I copy and pasted all columns for the statewide data and each NYC county to a new csv file. This is the only data manipulation done outside of R and this subset of data will be used in this project.


Load all libraries

#install.packages("kableExtra") #install this package if it is not already installed
library(knitr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(RCurl)
library(kableExtra)
library(tidyverse)

Load csv file from Github

df <- read.table(text = getURL("https://raw.githubusercontent.com/ltcancel/Project2/master/3-8%20ELA%20and%20Math%20Scores%20by%20County.csv"), header = TRUE, sep = ",")
head(df)
##   X.U.FEFF.SY_END_DATE NRC_CODE NRC_DESC COUNTY_CODE COUNTY_DESC BEDSCODE
## 1            6/30/2019       NA       NA          NA                    0
## 2            6/30/2019       NA       NA          NA                    0
## 3            6/30/2019       NA       NA          NA                    0
## 4            6/30/2019       NA       NA          NA                    0
## 5            6/30/2019       NA       NA          NA                    0
## 6            6/30/2019       NA       NA          NA                    0
##                                     NAME ITEM_SUBJECT_AREA   ITEM_DESC
## 1 STATEWIDE - ALL DISTRICTS AND CHARTERS               ELA Grade 3 ELA
## 2 STATEWIDE - ALL DISTRICTS AND CHARTERS               ELA Grade 3 ELA
## 3 STATEWIDE - ALL DISTRICTS AND CHARTERS               ELA Grade 3 ELA
## 4 STATEWIDE - ALL DISTRICTS AND CHARTERS               ELA Grade 3 ELA
## 5 STATEWIDE - ALL DISTRICTS AND CHARTERS               ELA Grade 3 ELA
## 6 STATEWIDE - ALL DISTRICTS AND CHARTERS               ELA Grade 3 ELA
##   SUBGROUP_CODE                    SUBGROUP_NAME TOTAL_TESTED L1_COUNT
## 1             1                     All Students       172149    26358
## 2             2                           Female        84744    10645
## 3             3                             Male        87405    15713
## 4             4 American Indian or Alaska Native         1216      187
## 5             5        Black or African American        29834     6181
## 6             6               Hispanic or Latino        50297    10216
##   L1_PCT L2_COUNT L2_PCT L3_COUNT L3_PCT L4_COUNT L4_PCT L2.L4_PCT
## 1    15%    55837    32%    77771    45%    12183     7%       85%
## 2    13%    26224    31%    40751    48%     7124     8%       87%
## 3    18%    29613    34%    37020    42%     5059     6%       82%
## 4    15%      450    37%      499    41%       80     7%       85%
## 5    21%    10264    34%    11721    39%     1668     6%       79%
## 6    20%    18514    37%    19334    38%     2233     4%       80%
##   L3.L4_PCT MEAN_SCALE_SCORE
## 1       52%              600
## 2       56%              602
## 3       48%              598
## 4       48%              599
## 5       45%              596
## 6       43%              595

Create dataframes based on subgroups.

Gender subgroup with only the columns that are most important for this analysis

##all genders by boro excluding statewide data
All.gender.df <- df %>% select(COUNTY_DESC, NAME, SUBGROUP_CODE, SUBGROUP_NAME, TOTAL_TESTED, MEAN_SCALE_SCORE) %>% filter(SUBGROUP_CODE %in% c(1:3), COUNTY_DESC !="") 
head(All.gender.df)
##   COUNTY_DESC          NAME SUBGROUP_CODE SUBGROUP_NAME TOTAL_TESTED
## 1      QUEENS QUEENS COUNTY             1  All Students        20691
## 2      QUEENS QUEENS COUNTY             2        Female        10185
## 3      QUEENS QUEENS COUNTY             3          Male        10506
## 4      QUEENS QUEENS COUNTY             1  All Students        21104
## 5      QUEENS QUEENS COUNTY             2        Female        10346
## 6      QUEENS QUEENS COUNTY             3          Male        10758
##   MEAN_SCALE_SCORE
## 1              603
## 2              605
## 3              601
## 4              603
## 5              603
## 6              603

Now that a filter dataframe has been created, let us compare the average test scores by gender for each county. If we only look at each group, ignoring the colors within each subgroup for a moment, we can see that the Bronx population of 3-8 graders had the lowest average score while the New York county had the highest. Each of the five counties has a similar distribution of males and female scores where females (green) tend to score higher than males (blue). The Bronx and Richmond county had the greatest spread of scores while Queens is a little more condensed, showing us that most of the students had similar test scores.

##all gender plot by boro
ggplot(All.gender.df, aes(x=COUNTY_DESC, y=MEAN_SCALE_SCORE, color = SUBGROUP_NAME)) + geom_dotplot(binaxis = "y",stackdir = "center", binwidth = 1, fill = "white", stroke = 2) + labs(title = "Average Scores By Gender", x = "County", y = "Average Score")

Subgroup based on immigration status and living status (homeless status and foster care status)

All.status.df <- df %>% select(COUNTY_DESC, NAME, SUBGROUP_CODE, SUBGROUP_NAME, TOTAL_TESTED, MEAN_SCALE_SCORE) %>% filter(SUBGROUP_CODE %in% c(17,18,20,21,22,23)) 
head(All.status.df)
##   COUNTY_DESC                                   NAME SUBGROUP_CODE
## 1             STATEWIDE - ALL DISTRICTS AND CHARTERS            17
## 2             STATEWIDE - ALL DISTRICTS AND CHARTERS            18
## 3             STATEWIDE - ALL DISTRICTS AND CHARTERS            20
## 4             STATEWIDE - ALL DISTRICTS AND CHARTERS            21
## 5             STATEWIDE - ALL DISTRICTS AND CHARTERS            22
## 6             STATEWIDE - ALL DISTRICTS AND CHARTERS            23
##        SUBGROUP_NAME TOTAL_TESTED MEAN_SCALE_SCORE
## 1            Migrant          125              583
## 2        Not Migrant       172024              600
## 3           Homeless         8675              591
## 4       Not Homeless       163474              600
## 5     In Foster Care          209              588
## 6 Not in Foster Care       171940              600

This boxplot shows us a greater disparity based on home satus, greater than the gender breakdown. Statewide data is included in this dataset but appears with no label under county. The population of immigrant students and students in foster care were only included in the statewide data and were not broken down by county. However we can see that this population of students scored the lowest when compared to all other subgroups. Homeless students are right in the middle across all counties. All other students who are not homeless, not in foster care, and not immigrants, all had very similar scores.

ggplot(All.status.df, aes(x=COUNTY_DESC, y=MEAN_SCALE_SCORE, color = SUBGROUP_NAME)) + geom_boxplot() + labs(title = "Average Scores by Home Status", x = "County", y = "Average Score")

Subgroup based on race/ethnicity.

race.df <- df %>% select(COUNTY_DESC, NAME, SUBGROUP_CODE, SUBGROUP_NAME, TOTAL_TESTED, MEAN_SCALE_SCORE) %>% filter(SUBGROUP_CODE == "1" | SUBGROUP_CODE %in% c(4:9))
##rename columns in new filtered dataframe
colnames(race.df) <- c("COUNTY", "NAME", "SUB_CODE", "RACE_ETHNICITY", "TOTAL_TESTED", "MEAN_SCORE")
head(race.df)
##   COUNTY                                   NAME SUB_CODE
## 1        STATEWIDE - ALL DISTRICTS AND CHARTERS        1
## 2        STATEWIDE - ALL DISTRICTS AND CHARTERS        4
## 3        STATEWIDE - ALL DISTRICTS AND CHARTERS        5
## 4        STATEWIDE - ALL DISTRICTS AND CHARTERS        6
## 5        STATEWIDE - ALL DISTRICTS AND CHARTERS        7
## 6        STATEWIDE - ALL DISTRICTS AND CHARTERS        8
##                     RACE_ETHNICITY TOTAL_TESTED MEAN_SCORE
## 1                     All Students       172149        600
## 2 American Indian or Alaska Native         1216        599
## 3        Black or African American        29834        596
## 4               Hispanic or Latino        50297        595
## 5        Asian or Pacific Islander        17623        609
## 6                            White        67699        602

This chart color codes all students based on their race/ethnicity so we can compare the average score between each group. We can see that towards that top of each county, green seems to take up most of the space and this is the Asian or Pacific Islander group. They are acutally hold the highest score across counties. New York has a larger Multiracial population and this group scored higher when compared to the other counties. New York also has an outlier group of American Indian or Alaskan Native that is their lowest scoring group.

##Plot average scores for each county based on race/ethnicity
ggplot(race.df, aes(x=COUNTY, y=MEAN_SCORE)) + geom_raster(aes(fill = RACE_ETHNICITY), hjust = 0.2, vjust = 0.2, interpolate = FALSE ) + labs(title = "Average Score by Race/Ethnicity per County", x = "County", y = "Mean Score")