The purpose is to explore the 2021 DOE Middle School Directory data from the New York City Open Data Portal. This activity is an introduction to exploratory data analysis and visualizations using R and RStudio by CUNY CITE Project in 2022.

Code


Loading the Data from NYC Open Data

Before starting, we will load the packages that we need for the exploration.

library(tidyverse)
library(scales)
library(sf)
library(RColorBrewer)
library(RSocrata)
library(ggcharts)
library(ggblanket)
library(knitr)
library(DT)
options(scipen = 999)

With the RSocrata package, we can load the CSV data directly from the website with API.

# Import the data directly into RStudio using url path
data <- read.socrata("https://data.cityofnewyork.us/resource/f6s7-vytj.csv")

Selecting a Subset of the Data

Instead of working with the full dataset, which has 327 variables, it is more convenient to select only the variables that we want and work with. The dplyr package from the tidyverse collection allows us to do that using the select() verb as follows:

# Selected variables (columns)
columns <-c("district","name","borough","latitude","longitude",
            "coursepassrate","elaprof","mathprof","surveysafety",
            "totalstudents","gradespan","tophs1","tophs2","tophs3",
            "acceleratedclasses","electiveclasses","languageclasses",
            "diversityinadmissions")

mydata <- data |> 
  select(all_of(columns))

In line 2 in the code chunk above, we create the variable columns as a character vector, and we use the symbol <- to assign to columns the vector with the names of the variables that we are interested in. The function c() is used to combine the names into a character vector. The function c() can also be used to combine numbers into numerical vectors.

Lines 7 and 8 are the key lines of code, where we create the dataframe mydata, which is a subset of the original dataframe data, obtained by selecting all of the columns listed in the vector columns, implemented by the call select(all_of(columns)).

Note

|> means “Create the variable mydata by starting with the dataframe data and then (|>) selecting all of the columns of data specified by the names in the variable columns.”


Exploratory Data Analysis with Tidyverse

We can embed in our report a part of the data by slicing the dataframe to display only specific rows and columns that we may be interested in

mydata |>
  select(1:3) |>
  slice(1:7) |> kable()
district name borough
1 P.S. 034 Franklin D. Roosevelt MANHATTAN
1 P.S. 140 Nathan Straus MANHATTAN
1 P.S. 184m Shuang Wen MANHATTAN
1 P.S. 188 The Island School MANHATTAN
1 University Neighborhood Middle School MANHATTAN
1 School for Global Leaders MANHATTAN
1 East Side Community School MANHATTAN

Filtering the Data

We can use the dplyr package from the tidyverse collection of packages for data filtering and pretty much any data analysis that we may want to perform.

For example, we can select from the data only certain variables and then filter the data using one or more of these variables. In the code below, we select the variables name, coursepassrate, mathprof, and elaprof, and we filter the variable borough to extract only those observations in the dataframe that correspond to the borough being “MANHATTAN”. That way we can print a table with the filtered data.

mydata |> 
  filter(borough == "MANHATTAN") |> 
  select(c(name,coursepassrate,mathprof,elaprof)) |> 
  arrange(desc(mathprof)) |> 
  slice(1:10) |> 
  kable()
name coursepassrate mathprof elaprof
Special Music School 100 98 93
East Side Middle School 100 97 94
The Anderson School 100 97 97
Tag Young Scholars 99 96 92
New Explorations into Science, Technology & Math 99 95 94
New York City Lab Middle School for Collaborative Studies 100 94 91
The Clinton School 100 93 92
M.S. 255 Salk School of Science 100 92 94
M.S. 243 Center School 99 90 94
Columbia Secondary School 95 88 87

District Statistics

mydata |> 
  filter(district == 13 | district == 2) |> 
  select(c(district,name,mathprof)) |> 
  arrange(desc(mathprof)) |> 
  datatable() # An interactive table
district_stats<-mydata |> 
  group_by(district) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE)) 
district_stats |> 
  arrange(desc(med_mathprof)) |> 
  #kable()
  datatable()
bar_chart(district_stats, x=district, y=med_mathprof)


Borough Statistics

The school P.S. 046 Arthur Tappan in district 5 is missing a borough value, but district 5 is in Manhattan, so we can add the missing value.

School M.S. 935 in district 19 is also missing a borough value. Find which borough district 19 is in and add its value for this school.

One can use the package EditData to edit a dataframe interactively.

mydata <- mydata |> 
  mutate(borough = replace(borough, 
                           name == "P.S. 046 Arthur Tappan", "MANHATTAN")) |> 
  mutate(borough = replace(borough, 
                           name == "M.S. 935", "BROOKLYN"))
borough_stats<-mydata |> 
  group_by(borough) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE)) 
borough_stats |> 
  arrange(desc(med_mathprof)) |> 
  #kable()
  datatable()
bar_chart(borough_stats, x=borough, y=med_mathprof)

A more complex example is to summarize and visualize the total number of schools for each school district. We can do this by first grouping the data based on districts, and then compute the sizes of these groups, which is the same as the number of schools within each district.

mydata |> 
  group_by(district) |> 
  summarize(nschools = n()) |> 
  mutate(district=reorder(district,nschools)) |> 
  ggplot(aes(x=district,y=nschools,fill=district)) +
  geom_col(show.legend = FALSE) +
  coord_flip()

mathprof_zscore <- mydata |>
  select(district,borough,name,mathprof) |> 
  mutate(zscore = scale(mathprof)) |> 
  filter(!is.na(zscore))
mathprof_zscore |> 
  filter(borough %in% c("QUEENS","MANHATTAN","BROOKLYN","BRONX")) |>
  gg_histogram(x=zscore,facet=borough, bins=12)
## ℹ For further ggblanket information, see https://davidhodge931.github.io/ggblanket/
## This message is displayed once every 8 hours.