CUNY CITE Project, 2022
| AUTHORS | PUBLISHED |
|---|---|
| Nadia Kennedy, City Tech | August 10, 2022 |
| Boyan Kostadinov City Tech | |
| Ariane Masuda, City Tech |
The goal of this activity is to explore the 2021 DOE Middle School Directory data from the New York City Open Data Portal. This activity is an introduction to exploratory data analysis and visualizations using R and RStudio.
We use the R packages below, which must be installed first, and then loaded with library().
directly into RStudio using url path
OpenData <- read.socrata("https://data.cityofnewyork.us/resource/f6s7-vytj.csv")
Selecting a Subset of the Data selected variables (columns)
columns <-c("district","name","borough","latitude","longitude",
"coursepassrate","elaprof","mathprof","surveysafety",
"totalstudents","gradespan","tophs1","tophs2","tophs3",
"acceleratedclasses","electiveclasses","languageclasses",
"diversityinadmissions")
mydata <- OpenData |>
select(all_of(columns))
mydata |>
select(1:3) |>
slice(1:10) |>
kable()
| district | name | borough |
|---|---|---|
| 1 | P.S. 034 Franklin D. Roosevelt | MANHATTAN |
| 1 | P.S. 140 Nathan Straus | MANHATTAN |
| 1 | P.S. 184m Shuang Wen | MANHATTAN |
| 1 | P.S. 188 The Island School | MANHATTAN |
| 1 | University Neighborhood Middle School | MANHATTAN |
| 1 | School for Global Leaders | MANHATTAN |
| 1 | East Side Community School | MANHATTAN |
| 1 | New Explorations into Science, Technology & Math | MANHATTAN |
| 1 | Tompkins Square Middle School | MANHATTAN |
| 2 | J.H.S. 104 Simon Baruch | MANHATTAN |
mydata |>
filter(borough == "MANHATTAN") |>
select(c(name,coursepassrate,mathprof,elaprof)) |>
arrange(desc(mathprof)) |>
slice(1:10) |>
kable()
| name | coursepassrate | mathprof | elaprof |
|---|---|---|---|
| Special Music School | 100 | 98 | 93 |
| East Side Middle School | 100 | 97 | 94 |
| The Anderson School | 100 | 97 | 97 |
| Tag Young Scholars | 99 | 96 | 92 |
| New Explorations into Science, Technology & Math | 99 | 95 | 94 |
| New York City Lab Middle School for Collaborative Studies | 100 | 94 | 91 |
| The Clinton School | 100 | 93 | 92 |
| M.S. 255 Salk School of Science | 100 | 92 | 94 |
| M.S. 243 Center School | 99 | 90 | 94 |
| Columbia Secondary School | 95 | 88 | 87 |
mydata |>
filter(district == 13 | district == 2) |>
select(c(district,name,mathprof)) |>
arrange(desc(mathprof)) |>
slice(1:10) |>
kable()
| district | name | mathprof |
|---|---|---|
| 2 | East Side Middle School | 97 |
| 2 | New York City Lab Middle School for Collaborative Studies | 94 |
| 2 | The Clinton School | 93 |
| 2 | M.S. 255 Salk School of Science | 92 |
| 2 | Manhattan Academy of Technology/Jacob Riis | 83 |
| 2 | Battery Park City School | 82 |
| 2 | Ballet Tech, NYC Public School for Dance | 81 |
| 2 | Yorkville East Middle School | 80 |
| 2 | J.H.S. 167 Robert F. Wagner | 79 |
| 2 | J.H.S. 104 Simon Baruch | 77 |
district_stats<-mydata |>
group_by(district) |>
summarize(med_mathprof = median(mathprof, na.rm=TRUE),
avg_mathprof = mean(mathprof, na.rm=TRUE))
district_stats |>
arrange(desc(med_mathprof)) |>
slice(1:10) |>
kable()
| district | med_mathprof | avg_mathprof |
|---|---|---|
| 26 | 75.0 | 72.00000 |
| 2 | 71.5 | 66.72727 |
| 25 | 64.0 | 60.33333 |
| 20 | 59.0 | 62.53333 |
| 30 | 58.5 | 54.25000 |
| 24 | 56.5 | 51.92857 |
| 31 | 53.0 | 47.06667 |
| 15 | 49.0 | 52.90909 |
| 28 | 48.0 | 47.46154 |
| 21 | 46.0 | 47.75000 |
bar_chart(district_stats, x=district, y=med_mathprof)
mydata <- mydata |>
mutate(borough = replace(borough,
name == "P.S. 046 Arthur Tappan", "MANHATTAN")) |>
mutate(borough = replace(borough,
name == "M.S. 935", "BROOKLYN"))
borough_stats<-mydata |>
group_by(borough) |>
summarize(med_mathprof = median(mathprof, na.rm=TRUE),
avg_mathprof = mean(mathprof, na.rm=TRUE))
borough_stats |>
arrange(desc(med_mathprof)) |>
slice(1:10) |>
kable()
| borough | med_mathprof | avg_mathprof |
|---|---|---|
| STATEN IS | 53 | 47.06667 |
| QUEENS | 49 | 47.10476 |
| MANHATTAN | 37 | 44.38710 |
| BROOKLYN | 29 | 35.70714 |
| BRONX | 25 | 27.06957 |
Plot chart
bar_chart(borough_stats, x=borough, y=med_mathprof)
Summarize and visualize the total number of schools for each school district
mydata |>
group_by(district) |>
summarize(nschools = n()) |>
mutate(district=reorder(district,nschools)) |>
ggplot(aes(x=district,y=nschools,fill=district)) +
geom_col(show.legend = FALSE) +
coord_flip()
mathprof_zscore <- mydata |>
select(district,borough,name,mathprof) |>
mutate(zscore = scale(mathprof)) |>
filter(!is.na(zscore))
mathprof_zscore |>
filter(borough %in% c("QUEENS","MANHATTAN","BROOKLYN","BRONX")) |>
gg_histogram(x=zscore,facet=borough, bins=12)