Data Analysis Using R

CUNY CITE Project, 2022

Authors

Nadia Kennedy, City Tech

Boyan Kostadinov City Tech

Ariane Masuda, City Tech

Published

August 10, 2022

Introduction

The goal of this activity is to explore the 2021 DOE Middle School Directory data from the New York City Open Data Portal. This activity is an introduction to exploratory data analysis and visualizations using R and RStudio.

Loading the data from NYC Open Data

We use the R packages below, which must be installed first, and then loaded with library().

library(RSocrata) # for loading the data from NYC Open Data
library(tidyverse) # for data analysis and visualizations
library(ggcharts) # for easy visualizations based on ggplot2
library(ggblanket) # for easy visualizations based on ggplot2
library(knitr) # for printing tables
library(DT) # for interactive tables in html format

We can use the RSocrata package to load the data in the CSV format, directly from the API tab in the NYC Open Data Portal, using the unique identifier f6s7-vytj for the data.

Note

API stands for Application Programming Interface. From a data analytics perspective, think about it as a way to access and load data directly into another application, such as the RStudio Integrated Development Environment (IDE).

Figure 1: The API tab in NYC Open Data, with the url link to the data.

We can load the data directly into RStudio using the function read.socrata(path) from the RSocrata package, where path is a string with the full url path for the data copied from the API tab in the NYC Open Data Portal, where we have chosen the CSV format instead of the default JSON format, shown in Figure 1.

Figure 2: The Export tab in NYC Open Data, with the url link to download the data.

The code that loads the data directly into RStudio using an url path, and creates a dataframe called data is shown in the code chunk below.

# import the data directly into RStudio using url path
data <- read.socrata("https://data.cityofnewyork.us/resource/f6s7-vytj.csv")

Important

A dataframe is a table or a two-dimensional array structure in which each column represents a variable and each row represents one particular observation from this variable. The variables can be of different type, numerical, character, logical etc.

Alternatively, the data can be downloaded first, by clicking the Export tab and selecting the CSV format, as shown in Figure 2. Once the data file 2021_DOE_Middle_School_Directory.csv has been downloaded and placed in the same directory where our Quarto notebook is located, we can load the data in RStudio using the read_csv() function from the readr package, a part of the tidyverse collection of packages.

# load the data by downloading it first
data <- read_csv("2021_DOE_Middle_School_Directory.csv")

Tip

Keep in mind that if the dataset is being updated on a regular basis, and we want to recompute our data analysis every time the data changes then it becomes impractical to keep downloading the updated dataset, and it is better to import it directly.

Selecting a Subset of the Data

We can run dim(data) to find the dimension of the data, i.e. the number of rows, being the number of observations, and the number of columns, being the number of variables. The dataframe has 474 rows (observations) and 327 columns (variables). However, we want to use for our analysis only selected variables, whose names and descriptions are given below.

Selected variables:

district: district in which school is located
name: school’s name
borough: NYC borough in which school is located
latitude: geographic latitude of the school
longitude: geographic longitude of the school
diversityinadmissions: admissions criteria for diversity
coursepassrate: Percent of students who passed core courses
elaprof: Percent of students proficient in ELA
mathprof: Percent of students proficient in math
surveysafety: Percent of students in the school year before last that felt safe in the hallways, bathrooms, locker rooms, and cafeteria
totalstudents: Total number of students enrolled in the school
gradespan: The grades that will be served by the school
acceleratedclasses: Accelerated classes offered by the school
electiveclasses: Elective classes offered by the school
languageclasses: language classes offered by the school
tophs1: 1st most frequently attended high school
tophs2: 2nd most frequently attended high school
tophs3: 3rd most frequently attended high school

Instead of working with the full dataset, which has 327 variables, it is more convenient to select only the variables that we want and work with. The dplyr package from the tidyverse collection allows us to do that using the select() verb as follows:

# selected variables (columns)
columns <-c("district","name","borough","latitude","longitude",
            "coursepassrate","elaprof","mathprof","surveysafety",
            "totalstudents","gradespan","tophs1","tophs2","tophs3",
            "acceleratedclasses","electiveclasses","languageclasses",
            "diversityinadmissions")
mydata <- data |> 
  select(all_of(columns))

In line 2 in the code chunk above, we create the variable columns as a character vector, and we use the symbol <- to assign to columns the vector with the names of the variables that we are interested in. The function c() is used to combine the names into a character vector. The function c() can also be used to combine numbers into numerical vectors.

Lines 7 and 8 are the key lines of code, where we create the dataframe mydata, which is a subset of the original dataframe data, obtained by selecting all of the columns listed in the vector columns, implemented by the call select(all_of(columns)).

Important

The symbol |> in line 7 represents composition of functions, so that lines 7 and 8 can be read as follows: “Create the variable mydata by starting with the dataframe data and then (|>) selecting all of the columns of data specified by the names in the variable columns.”

Practice in RStudio

Create the numerical vector (3,6,9,12) in the Console.
Copy and run in a Quarto notebook all code chunks up to this point.
Run the command view(mydata) in the Console to view the data.

The mydata dataframe is now a subset of the original data with only 18 variables (columns).

Exploratory Data Analysis with Tidyverse

The Tidyverse is a collection of R packages, developed by RStudio, for doing modern data analysis and visualizations. We will use the Tidyverse for all our data analysis and visualizations needs.

The functions read.socrata() and read_csv() both create a dataframe in the RStudio environment. The dataframe can be viewed inside RStudio by simply clicking on the white play button inside the blue circle in the Environment pane in RStudio. This opens a new tab in the RStudio editor with the data displayed in a spreadsheet style that can be browsed across rows and columns.

Additionally, we can embed in our report a part of the data by slicing the dataframe to display only specific rows and columns that we may be interested in.

mydata |>
  select(1:3) |>
  slice(1:7) |> kable()

district	name	borough
1	P.S. 034 Franklin D. Roosevelt	MANHATTAN
1	P.S. 140 Nathan Straus	MANHATTAN
1	P.S. 184m Shuang Wen	MANHATTAN
1	P.S. 188 The Island School	MANHATTAN
1	University Neighborhood Middle School	MANHATTAN
1	School for Global Leaders	MANHATTAN
1	East Side Community School	MANHATTAN

Filtering the data

We can use the dplyr package from the tidyverse collection of packages for data filtering and pretty much any data analysis that we may want to perform.

For example, we can select from the data only certain variables and then filter the data using one or more of these variables. In the code below, we select the variables name, coursepassrate, mathprof, and elaprof, and we filter the variable borough to extract only those observations in the dataframe that correspond to the borough being "MANHATTAN". That way we can print a table with the filtered data.

Note

The coding exercises below will be properly scaffolded and detailed descriptions will be added to guide the students through the data analysis.

mydata |> 
  filter(borough == "MANHATTAN") |> 
  select(c(name,coursepassrate,mathprof,elaprof)) |> 
  arrange(desc(mathprof)) |> 
  slice(1:10) |> 
  kable()

name	coursepassrate	mathprof	elaprof
Special Music School	100	98	93
East Side Middle School	100	97	94
The Anderson School	100	97	97
Tag Young Scholars	99	96	92
New Explorations into Science, Technology & Math	99	95	94
New York City Lab Middle School for Collaborative Studies	100	94	91
The Clinton School	100	93	92
M.S. 255 Salk School of Science	100	92	94
M.S. 243 Center School	99	90	94
Columbia Secondary School	95	88	87

District Statistics

mydata |> 
  filter(district == 13 | district == 2) |> 
  select(c(district,name,mathprof)) |> 
  arrange(desc(mathprof)) |> 
  datatable() # for html only

  #kable() # for pdf and html

district_stats<-mydata |> 
  group_by(district) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE))

district_stats |> 
  arrange(desc(med_mathprof)) |> 
  #kable()
  datatable()

bar_chart(district_stats, x=district, y=med_mathprof)

Borough Statistics

The school P.S. 046 Arthur Tappan in district 5 is missing a borough value, but district 5 is in Manhattan, so we can add the missing value.

School M.S. 935 in district 19 is also missing a borough value. Find which borough district 19 is in and add its value for this school.

One can use the package EditData to edit a dataframe interactively.

mydata <- mydata |> 
  mutate(borough = replace(borough, 
                           name == "P.S. 046 Arthur Tappan", "MANHATTAN")) |> 
  mutate(borough = replace(borough, 
                           name == "M.S. 935", "BROOKLYN"))

borough_stats<-mydata |> 
  group_by(borough) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE))

borough_stats |> 
  arrange(desc(med_mathprof)) |> 
  #kable()
  datatable()

bar_chart(borough_stats, x=borough, y=med_mathprof)

A more complex example is to summarize and visualize the total number of schools for each school district. We can do this by first grouping the data based on districts, and then compute the sizes of these groups, which is the same as the number of schools within each district.

The code below is to be explained in great detail, step by step.

mydata |> 
  group_by(district) |> 
  summarize(nschools = n()) |> 
  mutate(district=reorder(district,nschools)) |> 
  ggplot(aes(x=district,y=nschools,fill=district)) +
  geom_col(show.legend = FALSE) +
  coord_flip()

mathprof_zscore <- mydata |>
  select(district,borough,name,mathprof) |> 
  mutate(zscore = scale(mathprof)) |> 
  filter(!is.na(zscore))

mathprof_zscore |> 
  filter(borough %in% c("QUEENS","MANHATTAN","BROOKLYN","BRONX")) |>
  gg_histogram(x=zscore,facet=borough, bins=12)