Data Analysis Using R

CUNY CITE Project, 2022

AUTHORS PUBLISHED
Nadia Kennedy, City Tech August 10, 2022
Boyan Kostadinov City Tech
Ariane Masuda, City Tech

Introduction

The goal of this activity is to explore the 2021 DOE Middle School Directory data from the New York City Open Data Portal. This activity is an introduction to exploratory data analysis and visualizations using R and RStudio.

Loading the data from NYC Open Data

We use the R packages below, which must be installed first, and then loaded with library().

Import the data

directly into RStudio using url path

OpenData <- read.socrata("https://data.cityofnewyork.us/resource/f6s7-vytj.csv")

Selecting a Subset of the Data selected variables (columns)

columns <-c("district","name","borough","latitude","longitude",
            "coursepassrate","elaprof","mathprof","surveysafety",
            "totalstudents","gradespan","tophs1","tophs2","tophs3",
            "acceleratedclasses","electiveclasses","languageclasses",
            "diversityinadmissions")
mydata <- OpenData |> 
  select(all_of(columns))

Exploratory Data Analysis with Tidyverse

mydata |>
  select(1:3) |>
  slice(1:10) |> 
  kable()
district name borough
1 P.S. 034 Franklin D. Roosevelt MANHATTAN
1 P.S. 140 Nathan Straus MANHATTAN
1 P.S. 184m Shuang Wen MANHATTAN
1 P.S. 188 The Island School MANHATTAN
1 University Neighborhood Middle School MANHATTAN
1 School for Global Leaders MANHATTAN
1 East Side Community School MANHATTAN
1 New Explorations into Science, Technology & Math MANHATTAN
1 Tompkins Square Middle School MANHATTAN
2 J.H.S. 104 Simon Baruch MANHATTAN

Filtering the data

mydata |> 
  filter(borough == "MANHATTAN") |> 
  select(c(name,coursepassrate,mathprof,elaprof)) |> 
  arrange(desc(mathprof)) |> 
  slice(1:10) |> 
  kable()
name coursepassrate mathprof elaprof
Special Music School 100 98 93
East Side Middle School 100 97 94
The Anderson School 100 97 97
Tag Young Scholars 99 96 92
New Explorations into Science, Technology & Math 99 95 94
New York City Lab Middle School for Collaborative Studies 100 94 91
The Clinton School 100 93 92
M.S. 255 Salk School of Science 100 92 94
M.S. 243 Center School 99 90 94
Columbia Secondary School 95 88 87

District Statistics

mydata |> 
  filter(district == 13 | district == 2) |> 
  select(c(district,name,mathprof)) |> 
  arrange(desc(mathprof)) |> 
  slice(1:10) |>
  kable()
district name mathprof
2 East Side Middle School 97
2 New York City Lab Middle School for Collaborative Studies 94
2 The Clinton School 93
2 M.S. 255 Salk School of Science 92
2 Manhattan Academy of Technology/Jacob Riis 83
2 Battery Park City School 82
2 Ballet Tech, NYC Public School for Dance 81
2 Yorkville East Middle School 80
2 J.H.S. 167 Robert F. Wagner 79
2 J.H.S. 104 Simon Baruch 77
district_stats<-mydata |> 
  group_by(district) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE))
district_stats |> 
  arrange(desc(med_mathprof)) |> 
  slice(1:10) |>
  kable()
district med_mathprof avg_mathprof
26 75.0 72.00000
2 71.5 66.72727
25 64.0 60.33333
20 59.0 62.53333
30 58.5 54.25000
24 56.5 51.92857
31 53.0 47.06667
15 49.0 52.90909
28 48.0 47.46154
21 46.0 47.75000

Plot chart

bar_chart(district_stats, x=district, y=med_mathprof)

Borough Statistics

mydata <- mydata |> 
  mutate(borough = replace(borough, 
                           name == "P.S. 046 Arthur Tappan", "MANHATTAN")) |> 
  mutate(borough = replace(borough, 
                           name == "M.S. 935", "BROOKLYN"))
borough_stats<-mydata |> 
  group_by(borough) |> 
  summarize(med_mathprof = median(mathprof, na.rm=TRUE),
            avg_mathprof = mean(mathprof, na.rm=TRUE))
borough_stats |> 
  arrange(desc(med_mathprof)) |> 
  slice(1:10) |>
  kable()
borough med_mathprof avg_mathprof
STATEN IS 53 47.06667
QUEENS 49 47.10476
MANHATTAN 37 44.38710
BROOKLYN 29 35.70714
BRONX 25 27.06957

Plot chart

bar_chart(borough_stats, x=borough, y=med_mathprof)

Summarize and visualize the total number of schools for each school district

mydata |> 
  group_by(district) |> 
  summarize(nschools = n()) |> 
  mutate(district=reorder(district,nschools)) |> 
  ggplot(aes(x=district,y=nschools,fill=district)) +
  geom_col(show.legend = FALSE) +
  coord_flip()

mathprof_zscore <- mydata |>
  select(district,borough,name,mathprof) |> 
  mutate(zscore = scale(mathprof)) |> 
  filter(!is.na(zscore))
mathprof_zscore |> 
  filter(borough %in% c("QUEENS","MANHATTAN","BROOKLYN","BRONX")) |>
  gg_histogram(x=zscore,facet=borough, bins=12)