library(RSocrata) # for loading the data from NYC Open Data
library(tidyverse) # for data analysis and visualizations
library(ggcharts) # for easy visualizations based on ggplot2
library(ggblanket) # for easy visualizations based on ggplot2
library(knitr) # for printing tables
library(DT) # for interactive tables in html formatData Analysis Using R
CUNY CITE Project, 2022
Introduction
The goal of this activity is to explore the 2021 DOE Middle School Directory data from the New York City Open Data Portal. This activity is an introduction to exploratory data analysis and visualizations using R and RStudio.
Loading the data from NYC Open Data
We use the R packages below, which must be installed first, and then loaded with library().
We can use the RSocrata package to load the data in the CSV format, directly from the API tab in the NYC Open Data Portal, using the unique identifier f6s7-vytj for the data.
API stands for Application Programming Interface. From a data analytics perspective, think about it as a way to access and load data directly into another application, such as the RStudio Integrated Development Environment (IDE).
We can load the data directly into RStudio using the function read.socrata(path) from the RSocrata package, where path is a string with the full url path for the data copied from the API tab in the NYC Open Data Portal, where we have chosen the CSV format instead of the default JSON format, shown in Figure 1.
The code that loads the data directly into RStudio using an url path, and creates a dataframe called data is shown in the code chunk below.
A dataframe is a table or a two-dimensional array structure in which each column represents a variable and each row represents one particular observation from this variable. The variables can be of different type, numerical, character, logical etc.
Alternatively, the data can be downloaded first, by clicking the Export tab and selecting the CSV format, as shown in Figure 2. Once the data file 2021_DOE_Middle_School_Directory.csv has been downloaded and placed in the same directory where our Quarto notebook is located, we can load the data in RStudio using the read_csv() function from the readr package, a part of the tidyverse collection of packages.
Keep in mind that if the dataset is being updated on a regular basis, and we want to recompute our data analysis every time the data changes then it becomes impractical to keep downloading the updated dataset, and it is better to import it directly.
Selecting a Subset of the Data
We can run dim(data) to find the dimension of the data, i.e. the number of rows, being the number of observations, and the number of columns, being the number of variables. The dataframe has 474 rows (observations) and 327 columns (variables). However, we want to use for our analysis only selected variables, whose names and descriptions are given below.
Selected variables:
district: district in which school is locatedname: school’s nameborough: NYC borough in which school is locatedlatitude: geographic latitude of the schoollongitude: geographic longitude of the schooldiversityinadmissions: admissions criteria for diversitycoursepassrate: Percent of students who passed core courseselaprof: Percent of students proficient in ELAmathprof: Percent of students proficient in mathsurveysafety: Percent of students in the school year before last that felt safe in the hallways, bathrooms, locker rooms, and cafeteriatotalstudents: Total number of students enrolled in the schoolgradespan: The grades that will be served by the schoolacceleratedclasses: Accelerated classes offered by the schoolelectiveclasses: Elective classes offered by the schoollanguageclasses: language classes offered by the schooltophs1: 1st most frequently attended high schooltophs2: 2nd most frequently attended high schooltophs3: 3rd most frequently attended high school
Instead of working with the full dataset, which has 327 variables, it is more convenient to select only the variables that we want and work with. The dplyr package from the tidyverse collection allows us to do that using the select() verb as follows:
# selected variables (columns)
columns <-c("district","name","borough","latitude","longitude",
"coursepassrate","elaprof","mathprof","surveysafety",
"totalstudents","gradespan","tophs1","tophs2","tophs3",
"acceleratedclasses","electiveclasses","languageclasses",
"diversityinadmissions")
mydata <- data |>
select(all_of(columns))In line 2 in the code chunk above, we create the variable columns as a character vector, and we use the symbol <- to assign to columns the vector with the names of the variables that we are interested in. The function c() is used to combine the names into a character vector. The function c() can also be used to combine numbers into numerical vectors.
Lines 7 and 8 are the key lines of code, where we create the dataframe mydata, which is a subset of the original dataframe data, obtained by selecting all of the columns listed in the vector columns, implemented by the call select(all_of(columns)).
The symbol |> in line 7 represents composition of functions, so that lines 7 and 8 can be read as follows: “Create the variable mydata by starting with the dataframe data and then (|>) selecting all of the columns of data specified by the names in the variable columns.”
The mydata dataframe is now a subset of the original data with only 18 variables (columns).
Exploratory Data Analysis with Tidyverse
The Tidyverse is a collection of R packages, developed by RStudio, for doing modern data analysis and visualizations. We will use the Tidyverse for all our data analysis and visualizations needs.
The functions read.socrata() and read_csv() both create a dataframe in the RStudio environment. The dataframe can be viewed inside RStudio by simply clicking on the white play button inside the blue circle in the Environment pane in RStudio. This opens a new tab in the RStudio editor with the data displayed in a spreadsheet style that can be browsed across rows and columns.
Additionally, we can embed in our report a part of the data by slicing the dataframe to display only specific rows and columns that we may be interested in.
| district | name | borough |
|---|---|---|
| 1 | P.S. 034 Franklin D. Roosevelt | MANHATTAN |
| 1 | P.S. 140 Nathan Straus | MANHATTAN |
| 1 | P.S. 184m Shuang Wen | MANHATTAN |
| 1 | P.S. 188 The Island School | MANHATTAN |
| 1 | University Neighborhood Middle School | MANHATTAN |
| 1 | School for Global Leaders | MANHATTAN |
| 1 | East Side Community School | MANHATTAN |
Filtering the data
We can use the dplyr package from the tidyverse collection of packages for data filtering and pretty much any data analysis that we may want to perform.
For example, we can select from the data only certain variables and then filter the data using one or more of these variables. In the code below, we select the variables name, coursepassrate, mathprof, and elaprof, and we filter the variable borough to extract only those observations in the dataframe that correspond to the borough being "MANHATTAN". That way we can print a table with the filtered data.
The coding exercises below will be properly scaffolded and detailed descriptions will be added to guide the students through the data analysis.
mydata |>
filter(borough == "MANHATTAN") |>
select(c(name,coursepassrate,mathprof,elaprof)) |>
arrange(desc(mathprof)) |>
slice(1:10) |>
kable()| name | coursepassrate | mathprof | elaprof |
|---|---|---|---|
| Special Music School | 100 | 98 | 93 |
| East Side Middle School | 100 | 97 | 94 |
| The Anderson School | 100 | 97 | 97 |
| Tag Young Scholars | 99 | 96 | 92 |
| New Explorations into Science, Technology & Math | 99 | 95 | 94 |
| New York City Lab Middle School for Collaborative Studies | 100 | 94 | 91 |
| The Clinton School | 100 | 93 | 92 |
| M.S. 255 Salk School of Science | 100 | 92 | 94 |
| M.S. 243 Center School | 99 | 90 | 94 |
| Columbia Secondary School | 95 | 88 | 87 |
District Statistics
mydata |>
filter(district == 13 | district == 2) |>
select(c(district,name,mathprof)) |>
arrange(desc(mathprof)) |>
datatable() # for html onlyBorough Statistics
The school P.S. 046 Arthur Tappan in district 5 is missing a borough value, but district 5 is in Manhattan, so we can add the missing value.
School M.S. 935 in district 19 is also missing a borough value. Find which borough district 19 is in and add its value for this school.
One can use the package EditData to edit a dataframe interactively.
A more complex example is to summarize and visualize the total number of schools for each school district. We can do this by first grouping the data based on districts, and then compute the sizes of these groups, which is the same as the number of schools within each district.
- The code below is to be explained in great detail, step by step.
mydata |>
group_by(district) |>
summarize(nschools = n()) |>
mutate(district=reorder(district,nschools)) |>
ggplot(aes(x=district,y=nschools,fill=district)) +
geom_col(show.legend = FALSE) +
coord_flip()