Project 2 - Data Set 1

Data Set suggestion came from Sung Lee. He suggested analysis is to compare the schools to see which one had the highest performing ratio of exams with a 3, 4, or 5 and cross-reference with the location of the school.

Raw file is obtained from http://opendata.cityofnewyork.us/

alt text

alt text

if (!require('tidyverse')) install.packages('tidyverse')
## Loading required package: tidyverse
## ── Attaching packages ───────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
if (!require('scales')) install.packages('scales')
## Loading required package: scales
## 
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
## 
##     discard
## The following object is masked from 'package:readr':
## 
##     col_factor
#library(tidyverse)
#library(scales)
#install.packages("readxl")
#library(readxl)

# Reading csv file
school <- read.csv(file='https://data.cityofnewyork.us/resource/itfs-ms3e.csv')

# Table visualization
head(school)
##      dbn                            schoolname ap_test_takers_
## 1 01M448          UNIVERSITY NEIGHBORHOOD H.S.              39
## 2 01M450                EAST SIDE COMMUNITY HS              19
## 3 01M515                   LOWER EASTSIDE PREP              24
## 4 01M539        NEW EXPLORATIONS SCI,TECH,MATH             255
## 5 02M296 High School of Hospitality Management              NA
## 6 02M298                      Pace High School              21
##   total_exams_taken number_of_exams_with_scores_3_4_or_5
## 1                49                                   10
## 2                21                                   NA
## 3                26                                   24
## 4               377                                  191
## 5                NA                                   NA
## 6                21                                   NA

Data Dictionary

From the data dictionary, field definitions are: DBN: District, Borough and school number
School Name: Name of school
Num of AP Test Takers: Number of students who took AP test
Num of AP Total Exams Taken: Number of AP exams taken by students
Num of AP Exams Passed: Number of AP exams passed by students

Some data exploration and tidying

Using summary() we get an idea of number of NAs, for example.

# Data summary

summary(school)
##       dbn                                                   schoolname 
##  04M610 :  2   A PHILIP RANDOLPH CAMPUS                          :  1  
##  01M448 :  1   ABRAHAM LINCOLN HS                                :  1  
##  01M450 :  1   Academy for College Preparation and Career Explora:  1  
##  01M515 :  1   Academy for Language and Technology               :  1  
##  01M539 :  1   Academy for Scholarship and Entrepreneurship: A Co:  1  
##  02M296 :  1   Academy for Social Action: A College Board School :  1  
##  (Other):251   (Other)                                           :252  
##  ap_test_takers_  total_exams_taken number_of_exams_with_scores_3_4_or_5
##  Min.   :   6.0   Min.   :   6.0    Min.   :   6.0                      
##  1st Qu.:  24.0   1st Qu.:  30.0    1st Qu.:  14.0                      
##  Median :  46.0   Median :  65.0    Median :  37.0                      
##  Mean   : 118.9   Mean   : 180.5    Mean   : 143.6                      
##  3rd Qu.: 114.0   3rd Qu.: 149.0    3rd Qu.: 106.0                      
##  Max.   :2117.0   Max.   :3692.0    Max.   :2687.0                      
##  NA's   :25       NA's   :25        NA's   :107

There are 25 NAs in third column, 25 in the fourth column and 107 in the last column.

These need to be ignored, so a transformed data set is generated - school1.

# Data summary
school1<-na.omit(school)
summary(school1)
##       dbn                                                   schoolname 
##  01M448 :  1   A PHILIP RANDOLPH CAMPUS                          :  1  
##  01M515 :  1   ABRAHAM LINCOLN HS                                :  1  
##  01M539 :  1   Academy for College Preparation and Career Explora:  1  
##  02M300 :  1   Academy for Language and Technology               :  1  
##  02M305 :  1   ACADEMY OF AMER. STUDIES HS                       :  1  
##  02M400 :  1   Academy of Finance and  Enterprise                :  1  
##  (Other):145   (Other)                                           :145  
##  ap_test_takers_  total_exams_taken number_of_exams_with_scores_3_4_or_5
##  Min.   :  12.0   Min.   :  12.0    Min.   :   6.0                      
##  1st Qu.:  43.5   1st Qu.:  57.0    1st Qu.:  14.0                      
##  Median :  84.0   Median : 117.0    Median :  37.0                      
##  Mean   : 169.6   Mean   : 261.5    Mean   : 143.6                      
##  3rd Qu.: 179.5   3rd Qu.: 246.0    3rd Qu.: 106.0                      
##  Max.   :2117.0   Max.   :3692.0    Max.   :2687.0                      
## 

I will also split column dbn into district, borough and school id, as line out in the data dictionary. I will also add a column containing the borough name.

# District 
school1$district<-str_extract(school1$dbn, "^.{2}")

# borough code
school1$borough<-substr(school1$dbn, 3, 3)
school1$borough_desc<-ifelse(school1$borough=="M",'Manhattan',ifelse(school1$borough=="Q",'Queens',ifelse(school1$borough=="K",'Brooklyn',ifelse(school1$borough=="X",'Bronx',ifelse(school1$borough=="R",'Staten Island',NA))))) 

# School id
school1$id<-substr(school1$dbn, 4, 6)


head(school1)
##       dbn                                        schoolname ap_test_takers_
## 1  01M448                      UNIVERSITY NEIGHBORHOOD H.S.              39
## 3  01M515                               LOWER EASTSIDE PREP              24
## 4  01M539                    NEW EXPLORATIONS SCI,TECH,MATH             255
## 7  02M300 Urban Assembly School of Design and Construction,              99
## 9  02M305 Urban Assembly Academy of Government and Law, The              25
## 11 02M400                      HS FOR ENVIRONMENTAL STUDIES             213
##    total_exams_taken number_of_exams_with_scores_3_4_or_5 district borough
## 1                 49                                   10       01       M
## 3                 26                                   24       01       M
## 4                377                                  191       01       M
## 7                117                                   10       02       M
## 9                 37                                   15       02       M
## 11               298                                  152       02       M
##    borough_desc  id
## 1     Manhattan 448
## 3     Manhattan 515
## 4     Manhattan 539
## 7     Manhattan 300
## 9     Manhattan 305
## 11    Manhattan 400

Part 1 - Analysis of AP test data set

As suggested by my colleague Sung Lee, I will compare schools to see which one had the highest performing ratio of exams with a 3, 4, or 5 and cross-reference with the location of the school.

Performing ratio has to be normalized by the total of exams taken. A box plot is showed for comparison.

# Normalizing performance ratio by total exams taken
#(school1<-mutate(school1, perf_ratio = paste(round(number_of_exams_with_scores_3_4_or_5 / total_exams_taken*100,1), "%", sep="") ) )

school1<-mutate(school1, perf_ratio = number_of_exams_with_scores_3_4_or_5 / total_exams_taken )

# Boxpplot
ggplot(school1, mapping=aes(borough_desc,perf_ratio)) + geom_boxplot() + scale_x_discrete(name ="Borough") + scale_y_continuous(name = "Performance Ratio")

Some Highlights: Manhattan has the highest median while Bronx has the lowest. Highest individual performance found in Bronx. Highest dispersion of performance is found in Bronx. Lowest dispersion is in Staten Island.

Summarizing by the mean and median of performance ratio and grouped by borough

school2<-school1 %>%
  group_by(borough_desc) %>%
  summarize(Mean_Perf=mean(perf_ratio), Median_Perf=median(perf_ratio)) %>%
  arrange(desc(Mean_Perf),Median_Perf)

# ggplot(school2) + geom_col(mapping=aes(x=reorder(borough_desc,-Mean_Perf),Mean_Perf)) + scale_x_discrete(name ="Borough")+ scale_y_continuous(name ="Mean Performance")

Summarizing by the mean and median of performance ratio and grouped by district and borough

# Summarizing by the mean of performance ratio and grouped by district
(school3<-school1 %>%
  group_by(district, borough_desc) %>%
  summarize(Mean=percent(mean(perf_ratio)), Median=percent(median(perf_ratio))) %>%
  arrange(desc(Mean), Median))
## # A tibble: 27 x 4
## # Groups:   district [27]
##    district borough_desc Mean  Median
##    <chr>    <chr>        <chr> <chr> 
##  1 05       Manhattan    64%   64%   
##  2 04       Manhattan    57%   57%   
##  3 01       Manhattan    54%   51%   
##  4 22       Brooklyn     54%   54%   
##  5 26       Queens       50%   58%   
##  6 19       Brooklyn     49%   48%   
##  7 02       Manhattan    47%   43%   
##  8 13       Brooklyn     46%   42%   
##  9 24       Queens       45%   42%   
## 10 07       Bronx        44%   36%   
## # … with 17 more rows

Top 3 districts with highest means are all in Manhattan.

As suggested by my colleage, I’ll display the top schools.

school4<-school1 %>%
  select(dbn,schoolname,borough_desc,perf_ratio) %>%
  mutate(rank=min_rank(perf_ratio)) %>%
  arrange(desc(rank))

head(school4)
##      dbn                                      schoolname borough_desc
## 1 09X365             Academy for Language and Technology        Bronx
## 2 02M475                                   STUYVESANT HS    Manhattan
## 3 02M545 High School for Dual Language and Asian Studies    Manhattan
## 4 01M515                             LOWER EASTSIDE PREP    Manhattan
## 5 07X520                   FOREIGN LANG ACAD/GLOBAL STUD        Bronx
## 6 10X445                             BRONX HS OF SCIENCE        Bronx
##   perf_ratio rank
## 1  1.0000000  151
## 2  0.9393402  150
## 3  0.9270833  149
## 4  0.9230769  147
## 5  0.9230769  147
## 6  0.8989733  146

Top school is Academy for Language and Technology, located in Bronx with a performance ratio of 100%.

Additional data set

I will also explore demography data related to each school using another dataset provide by NYC Open Data.

demog<-read.csv(file='https://data.cityofnewyork.us/resource/ihfw-zy9j.csv')
head(demog)
##      dbn                      name schoolyear fl_percent frl_percent
## 1 01M015 P.S. 015 ROBERTO CLEMENTE   20052006       89.4          NA
## 2 01M015 P.S. 015 ROBERTO CLEMENTE   20062007       89.4          NA
## 3 01M015 P.S. 015 ROBERTO CLEMENTE   20072008       89.4          NA
## 4 01M015 P.S. 015 ROBERTO CLEMENTE   20082009       89.4          NA
## 5 01M015 P.S. 015 ROBERTO CLEMENTE   20092010         NA        96.5
## 6 01M015 P.S. 015 ROBERTO CLEMENTE   20102011         NA        96.5
##   total_enrollment prek  k grade1 grade2 grade3 grade4 grade5 grade6 grade7
## 1              281   15 36     40     33     38     52     29     38     NA
## 2              243   15 29     39     38     34     42     46     NA     NA
## 3              261   18 43     39     36     38     47     40     NA     NA
## 4              252   17 37     44     32     34     39     49     NA     NA
## 5              208   16 40     28     32     30     24     38     NA     NA
## 6              203   13 37     35     33     30     30     25     NA     NA
##   grade8 grade9 grade10 grade11 grade12 ell_num ell_percent sped_num
## 1     NA     NA      NA      NA      NA      36        12.8       57
## 2     NA     NA      NA      NA      NA      38        15.6       55
## 3     NA     NA      NA      NA      NA      52        19.9       60
## 4     NA     NA      NA      NA      NA      48        19.0       62
## 5     NA     NA      NA      NA      NA      40        19.2       46
## 6     NA     NA      NA      NA      NA      30        14.8       46
##   sped_percent ctt_num selfcontained_num asian_num asian_per black_num
## 1         20.3      25                 9        10       3.6        74
## 2         22.6      19                15        18       7.4        68
## 3         23.0      20                14        16       6.1        77
## 4         24.6      21                17        16       6.3        75
## 5         22.1      14                14        16       7.7        67
## 6         22.7      21                 9        13       6.4        75
##   black_per hispanic_num hispanic_per white_num white_per male_num male_per
## 1      26.3          189         67.3         5       1.8      158     56.2
## 2      28.0          153         63.0         4       1.6      140     57.6
## 3      29.5          157         60.2         7       2.7      143     54.8
## 4      29.8          149         59.1         7       2.8      149     59.1
## 5      32.2          118         56.7         6       2.9      124     59.6
## 6      36.9          110         54.2         4       2.0      113     55.7
##   female_num female_per
## 1        123       43.8
## 2        103       42.4
## 3        118       45.2
## 4        103       40.9
## 5         84       40.4
## 6         90       44.3

Both data sets are linked by the DBN field, which will be a way to join them.

Data Dictionary

There are just too many columns to display, exactly 38 columns. Data dictionary location is here

https://data.cityofnewyork.us/api/views/ihfw-zy9j/files/9e1aa617-1249-43ea-b7eb-2240bbed5b41?download=true&filename=Demographic%20Data%20DD.xlsx

Main columns I will be using are: DBN Year The school year ‘#Female The number of females %Female The percentage of females’#Male The number of males %Male The percentage of males ‘#Asian The number of asians %Asian The percentage of asians’#Black The number of blacks %Black The percentage of blacks ‘#Hispanic The number of hispanics %Hispanic The percentage of hispanics’#Other The number of others %Other The percentage of others ’#White The number of whites %White The percentage of whites

#Filtering by year of first data set

demog1<-demog %>%
  filter(schoolyear=="20102011")

#left join by unique key 
school5<-school4 %>%  
  left_join(demog1, by="dbn")
## Warning: Column `dbn` joining factors with different levels, coercing to
## character vector
head(school5)
##      dbn                                      schoolname borough_desc
## 1 09X365             Academy for Language and Technology        Bronx
## 2 02M475                                   STUYVESANT HS    Manhattan
## 3 02M545 High School for Dual Language and Asian Studies    Manhattan
## 4 01M515                             LOWER EASTSIDE PREP    Manhattan
## 5 07X520                   FOREIGN LANG ACAD/GLOBAL STUD        Bronx
## 6 10X445                             BRONX HS OF SCIENCE        Bronx
##   perf_ratio rank                                          name schoolyear
## 1  1.0000000  151                                          <NA>         NA
## 2  0.9393402  150                        STUYVESANT HIGH SCHOOL   20102011
## 3  0.9270833  149 HIGH SCHOOL FOR DUAL LANGUAGE & ASIAN STUDIES   20102011
## 4  0.9230769  147       LOWER EAST SIDE PREPARATORY HIGH SCHOOL   20102011
## 5  0.9230769  147                                          <NA>         NA
## 6  0.8989733  146                                          <NA>         NA
##   fl_percent frl_percent total_enrollment prek  k grade1 grade2 grade3 grade4
## 1         NA          NA               NA   NA NA     NA     NA     NA     NA
## 2         NA        41.3             3287   NA NA     NA     NA     NA     NA
## 3         NA        91.3              339   NA NA     NA     NA     NA     NA
## 4         NA        95.5              573   NA NA     NA     NA     NA     NA
## 5         NA          NA               NA   NA NA     NA     NA     NA     NA
## 6         NA          NA               NA   NA NA     NA     NA     NA     NA
##   grade5 grade6 grade7 grade8 grade9 grade10 grade11 grade12 ell_num
## 1     NA     NA     NA     NA     NA      NA      NA      NA      NA
## 2     NA     NA     NA     NA    842     855     811     779      NA
## 3     NA     NA     NA     NA     86     105      72      76     150
## 4     NA     NA     NA     NA     NA     289     204      80     486
## 5     NA     NA     NA     NA     NA      NA      NA      NA      NA
## 6     NA     NA     NA     NA     NA      NA      NA      NA      NA
##   ell_percent sped_num sped_percent ctt_num selfcontained_num asian_num
## 1          NA       NA           NA      NA                NA        NA
## 2         0.0       10          0.3      NA                NA      2353
## 3        44.2        4          1.2      NA                NA       293
## 4        84.8        2          0.3      NA                NA       537
## 5          NA       NA           NA      NA                NA        NA
## 6          NA       NA           NA      NA                NA        NA
##   asian_per black_num black_per hispanic_num hispanic_per white_num white_per
## 1        NA        NA        NA           NA           NA        NA        NA
## 2      71.6        40       1.2           94          2.9       791      24.1
## 3      86.4        22       6.5           10          2.9        10       2.9
## 4      93.7        12       2.1           20          3.5         3       0.5
## 5        NA        NA        NA           NA           NA        NA        NA
## 6        NA        NA        NA           NA           NA        NA        NA
##   male_num male_per female_num female_per
## 1       NA       NA         NA         NA
## 2     1934     58.8       1353       41.2
## 3      177     52.2        162       47.8
## 4      322     56.2        251       43.8
## 5       NA       NA         NA         NA
## 6       NA       NA         NA         NA

Top School does not have any demography data, but second best ranked has 58.8% males, and located in 71.6% Asians, for example.