Data Set suggestion came from Sung Lee. He suggested analysis is to compare the schools to see which one had the highest performing ratio of exams with a 3, 4, or 5 and cross-reference with the location of the school.
Raw file is obtained from http://opendata.cityofnewyork.us/
alt text
if (!require('tidyverse')) install.packages('tidyverse')
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
if (!require('scales')) install.packages('scales')
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
#library(tidyverse)
#library(scales)
#install.packages("readxl")
#library(readxl)
# Reading csv file
school <- read.csv(file='https://data.cityofnewyork.us/resource/itfs-ms3e.csv')
# Table visualization
head(school)
## dbn schoolname ap_test_takers_
## 1 01M448 UNIVERSITY NEIGHBORHOOD H.S. 39
## 2 01M450 EAST SIDE COMMUNITY HS 19
## 3 01M515 LOWER EASTSIDE PREP 24
## 4 01M539 NEW EXPLORATIONS SCI,TECH,MATH 255
## 5 02M296 High School of Hospitality Management NA
## 6 02M298 Pace High School 21
## total_exams_taken number_of_exams_with_scores_3_4_or_5
## 1 49 10
## 2 21 NA
## 3 26 24
## 4 377 191
## 5 NA NA
## 6 21 NA
From the data dictionary, field definitions are: DBN: District, Borough and school number
School Name: Name of school
Num of AP Test Takers: Number of students who took AP test
Num of AP Total Exams Taken: Number of AP exams taken by students
Num of AP Exams Passed: Number of AP exams passed by students
Some data exploration and tidying
Using summary() we get an idea of number of NAs, for example.
# Data summary
summary(school)
## dbn schoolname
## 04M610 : 2 A PHILIP RANDOLPH CAMPUS : 1
## 01M448 : 1 ABRAHAM LINCOLN HS : 1
## 01M450 : 1 Academy for College Preparation and Career Explora: 1
## 01M515 : 1 Academy for Language and Technology : 1
## 01M539 : 1 Academy for Scholarship and Entrepreneurship: A Co: 1
## 02M296 : 1 Academy for Social Action: A College Board School : 1
## (Other):251 (Other) :252
## ap_test_takers_ total_exams_taken number_of_exams_with_scores_3_4_or_5
## Min. : 6.0 Min. : 6.0 Min. : 6.0
## 1st Qu.: 24.0 1st Qu.: 30.0 1st Qu.: 14.0
## Median : 46.0 Median : 65.0 Median : 37.0
## Mean : 118.9 Mean : 180.5 Mean : 143.6
## 3rd Qu.: 114.0 3rd Qu.: 149.0 3rd Qu.: 106.0
## Max. :2117.0 Max. :3692.0 Max. :2687.0
## NA's :25 NA's :25 NA's :107
There are 25 NAs in third column, 25 in the fourth column and 107 in the last column.
These need to be ignored, so a transformed data set is generated - school1.
# Data summary
school1<-na.omit(school)
summary(school1)
## dbn schoolname
## 01M448 : 1 A PHILIP RANDOLPH CAMPUS : 1
## 01M515 : 1 ABRAHAM LINCOLN HS : 1
## 01M539 : 1 Academy for College Preparation and Career Explora: 1
## 02M300 : 1 Academy for Language and Technology : 1
## 02M305 : 1 ACADEMY OF AMER. STUDIES HS : 1
## 02M400 : 1 Academy of Finance and Enterprise : 1
## (Other):145 (Other) :145
## ap_test_takers_ total_exams_taken number_of_exams_with_scores_3_4_or_5
## Min. : 12.0 Min. : 12.0 Min. : 6.0
## 1st Qu.: 43.5 1st Qu.: 57.0 1st Qu.: 14.0
## Median : 84.0 Median : 117.0 Median : 37.0
## Mean : 169.6 Mean : 261.5 Mean : 143.6
## 3rd Qu.: 179.5 3rd Qu.: 246.0 3rd Qu.: 106.0
## Max. :2117.0 Max. :3692.0 Max. :2687.0
##
I will also split column dbn into district, borough and school id, as line out in the data dictionary. I will also add a column containing the borough name.
# District
school1$district<-str_extract(school1$dbn, "^.{2}")
# borough code
school1$borough<-substr(school1$dbn, 3, 3)
school1$borough_desc<-ifelse(school1$borough=="M",'Manhattan',ifelse(school1$borough=="Q",'Queens',ifelse(school1$borough=="K",'Brooklyn',ifelse(school1$borough=="X",'Bronx',ifelse(school1$borough=="R",'Staten Island',NA)))))
# School id
school1$id<-substr(school1$dbn, 4, 6)
head(school1)
## dbn schoolname ap_test_takers_
## 1 01M448 UNIVERSITY NEIGHBORHOOD H.S. 39
## 3 01M515 LOWER EASTSIDE PREP 24
## 4 01M539 NEW EXPLORATIONS SCI,TECH,MATH 255
## 7 02M300 Urban Assembly School of Design and Construction, 99
## 9 02M305 Urban Assembly Academy of Government and Law, The 25
## 11 02M400 HS FOR ENVIRONMENTAL STUDIES 213
## total_exams_taken number_of_exams_with_scores_3_4_or_5 district borough
## 1 49 10 01 M
## 3 26 24 01 M
## 4 377 191 01 M
## 7 117 10 02 M
## 9 37 15 02 M
## 11 298 152 02 M
## borough_desc id
## 1 Manhattan 448
## 3 Manhattan 515
## 4 Manhattan 539
## 7 Manhattan 300
## 9 Manhattan 305
## 11 Manhattan 400
As suggested by my colleague Sung Lee, I will compare schools to see which one had the highest performing ratio of exams with a 3, 4, or 5 and cross-reference with the location of the school.
Performing ratio has to be normalized by the total of exams taken. A box plot is showed for comparison.
# Normalizing performance ratio by total exams taken
#(school1<-mutate(school1, perf_ratio = paste(round(number_of_exams_with_scores_3_4_or_5 / total_exams_taken*100,1), "%", sep="") ) )
school1<-mutate(school1, perf_ratio = number_of_exams_with_scores_3_4_or_5 / total_exams_taken )
# Boxpplot
ggplot(school1, mapping=aes(borough_desc,perf_ratio)) + geom_boxplot() + scale_x_discrete(name ="Borough") + scale_y_continuous(name = "Performance Ratio")
Some Highlights: Manhattan has the highest median while Bronx has the lowest. Highest individual performance found in Bronx. Highest dispersion of performance is found in Bronx. Lowest dispersion is in Staten Island.
Summarizing by the mean and median of performance ratio and grouped by borough
school2<-school1 %>%
group_by(borough_desc) %>%
summarize(Mean_Perf=mean(perf_ratio), Median_Perf=median(perf_ratio)) %>%
arrange(desc(Mean_Perf),Median_Perf)
# ggplot(school2) + geom_col(mapping=aes(x=reorder(borough_desc,-Mean_Perf),Mean_Perf)) + scale_x_discrete(name ="Borough")+ scale_y_continuous(name ="Mean Performance")
Summarizing by the mean and median of performance ratio and grouped by district and borough
# Summarizing by the mean of performance ratio and grouped by district
(school3<-school1 %>%
group_by(district, borough_desc) %>%
summarize(Mean=percent(mean(perf_ratio)), Median=percent(median(perf_ratio))) %>%
arrange(desc(Mean), Median))
## # A tibble: 27 x 4
## # Groups: district [27]
## district borough_desc Mean Median
## <chr> <chr> <chr> <chr>
## 1 05 Manhattan 64% 64%
## 2 04 Manhattan 57% 57%
## 3 01 Manhattan 54% 51%
## 4 22 Brooklyn 54% 54%
## 5 26 Queens 50% 58%
## 6 19 Brooklyn 49% 48%
## 7 02 Manhattan 47% 43%
## 8 13 Brooklyn 46% 42%
## 9 24 Queens 45% 42%
## 10 07 Bronx 44% 36%
## # … with 17 more rows
Top 3 districts with highest means are all in Manhattan.
As suggested by my colleage, I’ll display the top schools.
school4<-school1 %>%
select(dbn,schoolname,borough_desc,perf_ratio) %>%
mutate(rank=min_rank(perf_ratio)) %>%
arrange(desc(rank))
head(school4)
## dbn schoolname borough_desc
## 1 09X365 Academy for Language and Technology Bronx
## 2 02M475 STUYVESANT HS Manhattan
## 3 02M545 High School for Dual Language and Asian Studies Manhattan
## 4 01M515 LOWER EASTSIDE PREP Manhattan
## 5 07X520 FOREIGN LANG ACAD/GLOBAL STUD Bronx
## 6 10X445 BRONX HS OF SCIENCE Bronx
## perf_ratio rank
## 1 1.0000000 151
## 2 0.9393402 150
## 3 0.9270833 149
## 4 0.9230769 147
## 5 0.9230769 147
## 6 0.8989733 146
Top school is Academy for Language and Technology, located in Bronx with a performance ratio of 100%.
I will also explore demography data related to each school using another dataset provide by NYC Open Data.
demog<-read.csv(file='https://data.cityofnewyork.us/resource/ihfw-zy9j.csv')
head(demog)
## dbn name schoolyear fl_percent frl_percent
## 1 01M015 P.S. 015 ROBERTO CLEMENTE 20052006 89.4 NA
## 2 01M015 P.S. 015 ROBERTO CLEMENTE 20062007 89.4 NA
## 3 01M015 P.S. 015 ROBERTO CLEMENTE 20072008 89.4 NA
## 4 01M015 P.S. 015 ROBERTO CLEMENTE 20082009 89.4 NA
## 5 01M015 P.S. 015 ROBERTO CLEMENTE 20092010 NA 96.5
## 6 01M015 P.S. 015 ROBERTO CLEMENTE 20102011 NA 96.5
## total_enrollment prek k grade1 grade2 grade3 grade4 grade5 grade6 grade7
## 1 281 15 36 40 33 38 52 29 38 NA
## 2 243 15 29 39 38 34 42 46 NA NA
## 3 261 18 43 39 36 38 47 40 NA NA
## 4 252 17 37 44 32 34 39 49 NA NA
## 5 208 16 40 28 32 30 24 38 NA NA
## 6 203 13 37 35 33 30 30 25 NA NA
## grade8 grade9 grade10 grade11 grade12 ell_num ell_percent sped_num
## 1 NA NA NA NA NA 36 12.8 57
## 2 NA NA NA NA NA 38 15.6 55
## 3 NA NA NA NA NA 52 19.9 60
## 4 NA NA NA NA NA 48 19.0 62
## 5 NA NA NA NA NA 40 19.2 46
## 6 NA NA NA NA NA 30 14.8 46
## sped_percent ctt_num selfcontained_num asian_num asian_per black_num
## 1 20.3 25 9 10 3.6 74
## 2 22.6 19 15 18 7.4 68
## 3 23.0 20 14 16 6.1 77
## 4 24.6 21 17 16 6.3 75
## 5 22.1 14 14 16 7.7 67
## 6 22.7 21 9 13 6.4 75
## black_per hispanic_num hispanic_per white_num white_per male_num male_per
## 1 26.3 189 67.3 5 1.8 158 56.2
## 2 28.0 153 63.0 4 1.6 140 57.6
## 3 29.5 157 60.2 7 2.7 143 54.8
## 4 29.8 149 59.1 7 2.8 149 59.1
## 5 32.2 118 56.7 6 2.9 124 59.6
## 6 36.9 110 54.2 4 2.0 113 55.7
## female_num female_per
## 1 123 43.8
## 2 103 42.4
## 3 118 45.2
## 4 103 40.9
## 5 84 40.4
## 6 90 44.3
Both data sets are linked by the DBN field, which will be a way to join them.
There are just too many columns to display, exactly 38 columns. Data dictionary location is here
Main columns I will be using are: DBN Year The school year ‘#Female The number of females %Female The percentage of females’#Male The number of males %Male The percentage of males ‘#Asian The number of asians %Asian The percentage of asians’#Black The number of blacks %Black The percentage of blacks ‘#Hispanic The number of hispanics %Hispanic The percentage of hispanics’#Other The number of others %Other The percentage of others ’#White The number of whites %White The percentage of whites
#Filtering by year of first data set
demog1<-demog %>%
filter(schoolyear=="20102011")
#left join by unique key
school5<-school4 %>%
left_join(demog1, by="dbn")
## Warning: Column `dbn` joining factors with different levels, coercing to
## character vector
head(school5)
## dbn schoolname borough_desc
## 1 09X365 Academy for Language and Technology Bronx
## 2 02M475 STUYVESANT HS Manhattan
## 3 02M545 High School for Dual Language and Asian Studies Manhattan
## 4 01M515 LOWER EASTSIDE PREP Manhattan
## 5 07X520 FOREIGN LANG ACAD/GLOBAL STUD Bronx
## 6 10X445 BRONX HS OF SCIENCE Bronx
## perf_ratio rank name schoolyear
## 1 1.0000000 151 <NA> NA
## 2 0.9393402 150 STUYVESANT HIGH SCHOOL 20102011
## 3 0.9270833 149 HIGH SCHOOL FOR DUAL LANGUAGE & ASIAN STUDIES 20102011
## 4 0.9230769 147 LOWER EAST SIDE PREPARATORY HIGH SCHOOL 20102011
## 5 0.9230769 147 <NA> NA
## 6 0.8989733 146 <NA> NA
## fl_percent frl_percent total_enrollment prek k grade1 grade2 grade3 grade4
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA 41.3 3287 NA NA NA NA NA NA
## 3 NA 91.3 339 NA NA NA NA NA NA
## 4 NA 95.5 573 NA NA NA NA NA NA
## 5 NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA
## grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 ell_num
## 1 NA NA NA NA NA NA NA NA NA
## 2 NA NA NA NA 842 855 811 779 NA
## 3 NA NA NA NA 86 105 72 76 150
## 4 NA NA NA NA NA 289 204 80 486
## 5 NA NA NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA NA NA
## ell_percent sped_num sped_percent ctt_num selfcontained_num asian_num
## 1 NA NA NA NA NA NA
## 2 0.0 10 0.3 NA NA 2353
## 3 44.2 4 1.2 NA NA 293
## 4 84.8 2 0.3 NA NA 537
## 5 NA NA NA NA NA NA
## 6 NA NA NA NA NA NA
## asian_per black_num black_per hispanic_num hispanic_per white_num white_per
## 1 NA NA NA NA NA NA NA
## 2 71.6 40 1.2 94 2.9 791 24.1
## 3 86.4 22 6.5 10 2.9 10 2.9
## 4 93.7 12 2.1 20 3.5 3 0.5
## 5 NA NA NA NA NA NA NA
## 6 NA NA NA NA NA NA NA
## male_num male_per female_num female_per
## 1 NA NA NA NA
## 2 1934 58.8 1353 41.2
## 3 177 52.2 162 47.8
## 4 322 56.2 251 43.8
## 5 NA NA NA NA
## 6 NA NA NA NA
Top School does not have any demography data, but second best ranked has 58.8% males, and located in 71.6% Asians, for example.