Explore the 2021 Department of Education Middle School Directory

The purpose is to explore the 2021 DOE Middle School Directory data from the New York City Open Data Portal. This activity is an introduction to exploratory data analysis and visualizations using R and RStudio by CUNY CITE Project in 2022.

Code

Loading the Data from NYC Open Data

Before starting, we will load the packages that we need for the exploration.

library(tidyverse)
library(scales)
library(sf)
library(RColorBrewer)
library(RSocrata)
library(ggcharts)
library(ggblanket)
library(knitr)
library(DT)
options(scipen = 999)

With the RSocrata package, we can load the CSV data directly from the website with API.

# Import the data directly into RStudio using url path
data <- read.socrata("https://data.cityofnewyork.us/resource/f6s7-vytj.csv")

Selecting a Subset of the Data

Instead of working with the full dataset, which has 327 variables, it is more convenient to select only the variables that we want and work with. The dplyr package from the tidyverse collection allows us to do that using the select() verb as follows:

# Selected variables (columns)
columns <-c("district","name","borough","latitude","longitude",
            "coursepassrate","elaprof","mathprof","surveysafety",
            "totalstudents","gradespan","tophs1","tophs2","tophs3",
            "acceleratedclasses","electiveclasses","languageclasses",
            "diversityinadmissions")

mydata <- data |> 
  select(all_of(columns))

In line 2 in the code chunk above, we create the variable columns as a character vector, and we use the symbol <- to assign to columns the vector with the names of the variables that we are interested in. The function c() is used to combine the names into a character vector. The function c() can also be used to combine numbers into numerical vectors.

Lines 7 and 8 are the key lines of code, where we create the dataframe mydata, which is a subset of the original dataframe data, obtained by selecting all of the columns listed in the vector columns, implemented by the call select(all_of(columns)).

Note

|> means “Create the variable mydata by starting with the dataframe data and then (|>) selecting all of the columns of data specified by the names in the variable columns.”

Exploratory Data Analysis with Tidyverse

We can embed in our report a part of the data by slicing the dataframe to display only specific rows and columns that we may be interested in

mydata |>
  select(1:3) |>
  slice(1:7) |> kable()

district	name	borough
1	P.S. 034 Franklin D. Roosevelt	MANHATTAN
1	P.S. 140 Nathan Straus	MANHATTAN
1	P.S. 184m Shuang Wen	MANHATTAN
1	P.S. 188 The Island School	MANHATTAN
1	University Neighborhood Middle School	MANHATTAN
1	School for Global Leaders	MANHATTAN
1	East Side Community School	MANHATTAN

Filtering the Data

We can use the dplyr package from the tidyverse collection of packages for data filtering and pretty much any data analysis that we may want to perform.

For example, we can select from the data only certain variables and then filter the data using one or more of these variables. In the code below, we select the variables name, coursepassrate, mathprof, and elaprof, and we filter the variable borough to extract only those observations in the dataframe that correspond to the borough being “MANHATTAN”. That way we can print a table with the filtered data.

mydata |> 
  filter(borough == "MANHATTAN") |> 
  select(c(name,coursepassrate,mathprof,elaprof)) |> 
  arrange(desc(mathprof)) |> 
  slice(1:10) |> 
  kable()

name	coursepassrate	mathprof	elaprof
Special Music School	100	98	93
East Side Middle School	100	97	94
The Anderson School	100	97	97
Tag Young Scholars	99	96	92
New Explorations into Science, Technology & Math	99	95	94
New York City Lab Middle School for Collaborative Studies	100	94	91
The Clinton School	100	93	92
M.S. 255 Salk School of Science	100	92	94
M.S. 243 Center School	99	90	94
Columbia Secondary School	95	88	87

District Statistics

mydata |> 
  filter(district == 13 | district == 2) |> 
  select(c(district,name,mathprof)) |> 
  arrange(desc(mathprof)) |> 
  datatable() # An interactive table

district_stats<-mydata |> 
  group_by(district) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE))

district_stats |> 
  arrange(desc(med_mathprof)) |> 
  #kable()
  datatable()

bar_chart(district_stats, x=district, y=med_mathprof)

Borough Statistics

The school P.S. 046 Arthur Tappan in district 5 is missing a borough value, but district 5 is in Manhattan, so we can add the missing value.

School M.S. 935 in district 19 is also missing a borough value. Find which borough district 19 is in and add its value for this school.

One can use the package EditData to edit a dataframe interactively.

mydata <- mydata |> 
  mutate(borough = replace(borough, 
                           name == "P.S. 046 Arthur Tappan", "MANHATTAN")) |> 
  mutate(borough = replace(borough, 
                           name == "M.S. 935", "BROOKLYN"))

borough_stats<-mydata |> 
  group_by(borough) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE))

borough_stats |> 
  arrange(desc(med_mathprof)) |> 
  #kable()
  datatable()

bar_chart(borough_stats, x=borough, y=med_mathprof)

A more complex example is to summarize and visualize the total number of schools for each school district. We can do this by first grouping the data based on districts, and then compute the sizes of these groups, which is the same as the number of schools within each district.

mydata |> 
  group_by(district) |> 
  summarize(nschools = n()) |> 
  mutate(district=reorder(district,nschools)) |> 
  ggplot(aes(x=district,y=nschools,fill=district)) +
  geom_col(show.legend = FALSE) +
  coord_flip()

mathprof_zscore <- mydata |>
  select(district,borough,name,mathprof) |> 
  mutate(zscore = scale(mathprof)) |> 
  filter(!is.na(zscore))

mathprof_zscore |> 
  filter(borough %in% c("QUEENS","MANHATTAN","BROOKLYN","BRONX")) |>
  gg_histogram(x=zscore,facet=borough, bins=12)

## ℹ For further ggblanket information, see https://davidhodge931.github.io/ggblanket/
## This message is displayed once every 8 hours.