The purpose is to explore the 2021 DOE Middle School Directory data from the New York City Open Data Portal. This activity is an introduction to exploratory data analysis and visualizations using R and RStudio by CUNY CITE Project in 2022.
Before starting, we will load the packages that we need for the exploration.
library(tidyverse)
library(scales)
library(sf)
library(RColorBrewer)
library(RSocrata)
library(ggcharts)
library(ggblanket)
library(knitr)
library(DT)
options(scipen = 999)
With the RSocrata package, we can load the CSV data directly from the website with API.
# Import the data directly into RStudio using url path
data <- read.socrata("https://data.cityofnewyork.us/resource/f6s7-vytj.csv")
Instead of working with the full dataset, which has 327 variables, it is more convenient to select only the variables that we want and work with. The dplyr package from the tidyverse collection allows us to do that using the select() verb as follows:
# Selected variables (columns)
columns <-c("district","name","borough","latitude","longitude",
"coursepassrate","elaprof","mathprof","surveysafety",
"totalstudents","gradespan","tophs1","tophs2","tophs3",
"acceleratedclasses","electiveclasses","languageclasses",
"diversityinadmissions")
mydata <- data |>
select(all_of(columns))
In line 2 in the code chunk above, we create the variable columns as a character vector, and we use the symbol <- to assign to columns the vector with the names of the variables that we are interested in. The function c() is used to combine the names into a character vector. The function c() can also be used to combine numbers into numerical vectors.
Lines 7 and 8 are the key lines of code, where we create the dataframe mydata, which is a subset of the original dataframe data, obtained by selecting all of the columns listed in the vector columns, implemented by the call select(all_of(columns)).
|> means “Create the variable mydata by starting with the dataframe data and then (|>) selecting all of the columns of data specified by the names in the variable columns.”
We can embed in our report a part of the data by slicing the dataframe to display only specific rows and columns that we may be interested in
mydata |>
select(1:3) |>
slice(1:7) |> kable()
| district | name | borough |
|---|---|---|
| 1 | P.S. 034 Franklin D. Roosevelt | MANHATTAN |
| 1 | P.S. 140 Nathan Straus | MANHATTAN |
| 1 | P.S. 184m Shuang Wen | MANHATTAN |
| 1 | P.S. 188 The Island School | MANHATTAN |
| 1 | University Neighborhood Middle School | MANHATTAN |
| 1 | School for Global Leaders | MANHATTAN |
| 1 | East Side Community School | MANHATTAN |
We can use the dplyr package from the tidyverse collection of packages for data filtering and pretty much any data analysis that we may want to perform.
For example, we can select from the data only certain variables and then filter the data using one or more of these variables. In the code below, we select the variables name, coursepassrate, mathprof, and elaprof, and we filter the variable borough to extract only those observations in the dataframe that correspond to the borough being “MANHATTAN”. That way we can print a table with the filtered data.
mydata |>
filter(borough == "MANHATTAN") |>
select(c(name,coursepassrate,mathprof,elaprof)) |>
arrange(desc(mathprof)) |>
slice(1:10) |>
kable()
| name | coursepassrate | mathprof | elaprof |
|---|---|---|---|
| Special Music School | 100 | 98 | 93 |
| East Side Middle School | 100 | 97 | 94 |
| The Anderson School | 100 | 97 | 97 |
| Tag Young Scholars | 99 | 96 | 92 |
| New Explorations into Science, Technology & Math | 99 | 95 | 94 |
| New York City Lab Middle School for Collaborative Studies | 100 | 94 | 91 |
| The Clinton School | 100 | 93 | 92 |
| M.S. 255 Salk School of Science | 100 | 92 | 94 |
| M.S. 243 Center School | 99 | 90 | 94 |
| Columbia Secondary School | 95 | 88 | 87 |
mydata |>
filter(district == 13 | district == 2) |>
select(c(district,name,mathprof)) |>
arrange(desc(mathprof)) |>
datatable() # An interactive table
district_stats<-mydata |>
group_by(district) |>
summarize(med_mathprof = median(mathprof, na.rm=TRUE),
avg_mathprof = mean(mathprof, na.rm=TRUE))
district_stats |>
arrange(desc(med_mathprof)) |>
#kable()
datatable()
bar_chart(district_stats, x=district, y=med_mathprof)
The school P.S. 046 Arthur Tappan in district 5 is missing a borough value, but district 5 is in Manhattan, so we can add the missing value.
School M.S. 935 in district 19 is also missing a borough value. Find which borough district 19 is in and add its value for this school.
One can use the package EditData to edit a dataframe interactively.
mydata <- mydata |>
mutate(borough = replace(borough,
name == "P.S. 046 Arthur Tappan", "MANHATTAN")) |>
mutate(borough = replace(borough,
name == "M.S. 935", "BROOKLYN"))
borough_stats<-mydata |>
group_by(borough) |>
summarize(med_mathprof = median(mathprof, na.rm=TRUE),
avg_mathprof = mean(mathprof, na.rm=TRUE))
borough_stats |>
arrange(desc(med_mathprof)) |>
#kable()
datatable()
bar_chart(borough_stats, x=borough, y=med_mathprof)
A more complex example is to summarize and visualize the total number of
schools for each school district. We can do this by first grouping the
data based on districts, and then compute the sizes of these groups,
which is the same as the number of schools within each district.
mydata |>
group_by(district) |>
summarize(nschools = n()) |>
mutate(district=reorder(district,nschools)) |>
ggplot(aes(x=district,y=nschools,fill=district)) +
geom_col(show.legend = FALSE) +
coord_flip()
mathprof_zscore <- mydata |>
select(district,borough,name,mathprof) |>
mutate(zscore = scale(mathprof)) |>
filter(!is.na(zscore))
mathprof_zscore |>
filter(borough %in% c("QUEENS","MANHATTAN","BROOKLYN","BRONX")) |>
gg_histogram(x=zscore,facet=borough, bins=12)
## ℹ For further ggblanket information, see https://davidhodge931.github.io/ggblanket/
## This message is displayed once every 8 hours.