Basic Data Extraction: NYC Parks





Overview

This project explores the NYC park data from the “NYC Open Data” project. The question we will answer is which neighborhood has the most, or best, park land. We will measure it as park acreage relative to population.




Let's begin.




Imports. Constants.

library(readxl)
library(tidyverse)         # ggplot2, dplyr, tidyr, readr, tibble, stringr and more
library(knitr)

CURR_PATH<-str_trim(getwd())





Download the Data

Download the NYC Park data into a local csv file and bring it into a data frame.

download_parks<-"https://data.cityofnewyork.us/api/views/enfh-gkve/rows.csv?accessType=DOWNLOAD"

destfile<-paste0(CURR_PATH,"/parks.csv")


download.file(download_parks, destfile)


cfile<-read.csv(destfile)
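Re-running the notebook downloads the file again every time. A minimal sketch of one way to avoid that, assuming a cached local copy is acceptable, is shown below. The helper name download_if_missing and the example URL are hypothetical and are not part of the original project.

```r
# Hypothetical helper: download a file only when no local copy exists yet.
download_if_missing <- function(url, destfile) {
  if (file.exists(destfile)) {
    return(FALSE)   # reuse the cached copy, nothing is downloaded
  }
  download.file(url, destfile, mode = "wb")
  TRUE
}

# Demo with a file that already exists, so no network access is needed.
cached <- tempfile(fileext = ".csv")
writeLines("SIGNNAME,ACRES", cached)
was_downloaded <- download_if_missing("https://example.invalid/parks.csv", cached)
```

In the notebook this would be called with download_parks and destfile. Delete the local file to force a fresh download.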


Download an extract of neighborhoods and zip codes so we can group each park into neighborhoods.

download_neighborhoods<- "https://data.beta.nyc/dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/7caac650-d082-4aea-9f9b-3681d568e8a5/download/nyc_zip_borough_neighborhoods_pop.csv"

destfile<-paste0(CURR_PATH,"/neighborhoods.csv")

download.file(download_neighborhoods, destfile)

zfile<-read.csv(destfile)





A look at the Data


Just because it's in the database doesn't mean it adds to quality of life. Some parks are beautiful…

Gantry Park




Others not so beautiful.
Crotona Parkway Malls




We will make some effort to restrict the data set to just the kinds of park land that add to quality of life.





Filter data


Remove some duff data, and remove the records that really aren't “quality of life” parks.

# get rid of some of the duff data; we don't need the records that are not of class PARK

cfile<-subset(cfile, CLASS=="PARK")


cfile<-subset(cfile, TYPECATEGORY!="Cemetery" & TYPECATEGORY!="Mall" & TYPECATEGORY!="Parkway"  )

cfile<-subset(cfile, SUBCATEGORY!="Lot" & SUBCATEGORY!="Parking Lot" & SUBCATEGORY!="PKWY"  & SUBCATEGORY!="EXWY" & SUBCATEGORY!="Building" & SUBCATEGORY!="Facility" )


# I noticed that if the sign just says "Park" it's usually a very tiny piece of land
cfile<-subset(cfile, SIGNNAME != "Park")
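The chain of != comparisons above can also be written with exclusion lists, which are easier to extend. The following is a sketch on made-up rows, not the real parks file. The names toy, drop_types and drop_subtypes are introduced here for illustration only.

```r
library(dplyr)

# Toy rows standing in for the parks file (all values are made up).
toy <- tibble(
  SIGNNAME     = c("Example Park", "Park", "Some Mall", "Some Lot"),
  CLASS        = c("PARK", "PARK", "PARK", "PARK"),
  TYPECATEGORY = c("Community Park", "Community Park", "Mall", "Community Park"),
  SUBCATEGORY  = c("Large Park", "Large Park", "Large Park", "Parking Lot")
)

# The same categories the filter step excludes, kept in two vectors.
drop_types    <- c("Cemetery", "Mall", "Parkway")
drop_subtypes <- c("Lot", "Parking Lot", "PKWY", "EXWY", "Building", "Facility")

kept <- toy %>%
  filter(
    CLASS == "PARK",
    !TYPECATEGORY %in% drop_types,
    !SUBCATEGORY %in% drop_subtypes,
    SIGNNAME != "Park"
  )
```

One behavioral difference to be aware of: a missing category compared with != yields NA, so subset() drops that row, while %in% never returns NA, so this version keeps rows whose category is missing.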





Tidy data


Now we need to join our neighborhood data to the park data, joining by zip code.


The problem is that many parks span multiple zip codes. Their ZIPCODE values are read as one large number and appear as 1.110311e+34 or 1.000110e+09.


So we convert the value to a string, remove the dot, take the first five characters as the park's zip code, and mark the park as a multiple zip code park.


Note I created a zipcode table using the unique identifier “GISOBJID”.

multiple_zips <- cfile %>% filter(ZIPCODE > 99999) %>% select("GISOBJID","ZIPCODE")

multiple_zips$ZIPCODE<-as.character(multiple_zips$ZIPCODE)

multiple_zips$ZIPCODE<-str_replace_all(multiple_zips$ZIPCODE, "\\.", "")


multiple_zips$ZIPCODE<-str_sub(multiple_zips$ZIPCODE, 1, 5) 

multiple_zips<- multiple_zips %>% mutate(multi_zip = 'Y'   )

# this approach was my plan B, create a table with all zip codes
single_zips <- cfile %>% filter(ZIPCODE <= 99999) %>% select("GISOBJID","ZIPCODE")   # <= so no five-digit value falls between the two filters
single_zips<- single_zips %>% mutate(multi_zip = 'N'   )
all_zips<-rbind(multiple_zips, single_zips)
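An alternative that avoids the numeric round trip is to treat the zip code column as text from the start. The sketch below assumes that a multi-zip value is simply several five-digit codes run together (which is what the 1.000110e+09 example suggests, but it is an assumption about the raw file). The vector raw_zip is made-up sample data.

```r
library(stringr)

# Reading the column as text would look like this (not run here):
#   cfile <- read.csv(destfile, colClasses = c(ZIPCODE = "character"))

# Made-up sample values: one single-zip park and two multi-zip parks.
raw_zip <- c("11101", "1000110011", "111031110211101")

first_zip <- str_sub(raw_zip, 1, 5)                 # zip used for the join
multi_zip <- ifelse(nchar(raw_zip) > 5, "Y", "N")   # flag parks spanning several zips
all_zip   <- str_extract_all(raw_zip, "\\d{5}")     # every zip, one list element per park
```

Keeping all_zip would also make it possible to spread a park's acreage across every zip code it touches instead of crediting only the first one.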


Now that we have created a zip code table, join the 3 tables into one.

#  now merge the 2 tables
#   1) update the original zipcode with the new zipcode
#   2) add the multi_zip  field





# reduce the data set to just the fields we care about; note I'm not including zipcode or borough
# note I'm not bringing in EAPPLY or NAME311, as SIGNNAME seems to be the most reliable of the 3
df_temp<-cfile[c("SIGNNAME","ACRES", "DEPARTMENT", "URL",  "GISOBJID",
            "JURISDICTION", "TYPECATEGORY", "SUBCATEGORY", "WATERFRONT")]



# all.x = TRUE means we keep every record from df_temp, even those without a match in all_zips
df_temp<-merge(df_temp, all_zips, by.x ="GISOBJID", by.y = "GISOBJID", all.x = TRUE)   


# commit point

# now merge again with neighborhood file
df_final<-merge(df_temp, zfile, by.x ="ZIPCODE", by.y = "zip", all.x = TRUE)  
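For readers more used to dplyr, merge(..., all.x = TRUE) corresponds to a left join. The sketch below uses two tiny made-up tables (parks and hoods are illustrative names and values, not the real data) to show that unmatched parks are kept with NA in the neighborhood columns.

```r
library(dplyr)

# Made-up park and neighborhood tables for illustration.
parks <- tibble(
  gisobjid = c(1, 2, 3),
  signname = c("A Park", "B Park", "C Park"),
  zipcode  = c("11101", "10001", "99999")    # the last zip has no neighborhood
)
hoods <- tibble(
  zip          = c("11101", "10001"),
  neighborhood = c("Neighborhood One", "Neighborhood Two")
)

# Left join: every park is kept; parks without a match get NA.
joined    <- left_join(parks, hoods, by = c("zipcode" = "zip"))
unmatched <- joined %>% filter(is.na(neighborhood))
```

Unlike merge(), left_join() refuses to join a character key to a numeric key, so both zip columns need the same type first.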


Great. Clean up the columns a bit.

# clean up the columns
names(df_final)<-tolower(names(df_final))



# rename some of the columns to help clarify things

df_final<-df_final %>% 
  rename(
    size_acres=acres,              
    pd_type=typecategory,          # Planning and Development Division type
    om_type=subcategory           # Operations & Management Division type
)


df_final <- df_final %>% drop_na(borough)   # a few bad zip codes


Convert the acres to square miles.

# an acre is 0.0015625 square miles

df_final<-df_final %>% dplyr::mutate(size_sm = size_acres*0.0015625)
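The factor comes from the definition of a square mile as 640 acres, so one acre is 1/640 = 0.0015625 square miles. A quick check, with acres_to_sq_miles as a hypothetical helper name:

```r
# One square mile is 640 acres, so divide acres by 640.
acres_to_sq_miles <- function(acres) acres / 640

factor_check <- all.equal(1 / 640, 0.0015625)   # TRUE

# Pelham Bay Park is listed at 2,772 acres in the tables in this report.
pelham_sq_miles <- acres_to_sq_miles(2772)      # 4.33125
```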





Take a peek with a couple of queries


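The 6 biggest parks. Note that this query, like the next one, keeps only waterfront parks, so the two tables are identical.

# display the biggest parks (the subset below keeps waterfront parks only)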
df<-subset(df_final, waterfront=="true")

df<-df %>% select("signname", "borough", "neighborhood", "zipcode", "size_acres", "om_type", "pd_type", "waterfront")

df<-df[order(-df$size_acres),]

kable(head(df), caption="Biggest Parks",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))
Biggest Parks

| signname | borough | neighborhood | zipcode | size_acres | om_type | pd_type | waterfront |
|:---|:---|:---|---:|---:|:---|:---|:---|
| Pelham Bay Park | Bronx | Southeast Bronx | 10461 | 2,772 | Flagship Park | Flagship Park | true |
| Freshkills Park | Staten Island | South Shore | 10312 | 920 | Flagship Park | Undeveloped | true |
| Flushing Meadows Corona Park | Queens | North Queens | 11354 | 898 | Flagship Park | Flagship Park | true |
| Marine Park | Brooklyn | Southern Brooklyn | 11229 | 798 | Large Park | Community Park | true |
| Bronx Park | Bronx | Bronx Park and Fordham | 10458 | 718 | Large Park | Flagship Park | true |
| Franklin D. Roosevelt Boardwalk and Beach | Staten Island | Stapleton and St. George | 10305 | 644 | Large Park | Waterfront Facility | true |
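The query above filters on waterfront before sorting, so it cannot surface large inland parks. A sketch of an overall ranking next to a waterfront-only ranking, using dplyr's slice_max(), is shown below. The three real rows reuse acreages from the table. "Hypothetical Inland Park" and its size are invented purely to show the difference.

```r
library(dplyr)

toy <- tibble(
  signname   = c("Pelham Bay Park", "Freshkills Park", "Marine Park",
                 "Hypothetical Inland Park"),        # last row is invented
  size_acres = c(2772, 920, 798, 1000),
  waterfront = c("true", "true", "true", "false")
)

# Largest parks overall: no waterfront filter.
biggest_overall <- toy %>%
  slice_max(size_acres, n = 3, with_ties = FALSE)

# Largest waterfront parks: filter first, then rank.
biggest_waterfront <- toy %>%
  filter(waterfront == "true") %>%
  slice_max(size_acres, n = 3, with_ties = FALSE)
```

slice_max() returns rows in descending order of size_acres, and n states the number of rows explicitly, whereas head() with no n returns six.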


The 6 biggest waterfront parks.

# display the biggest waterfront parks 
df<-subset(df_final, waterfront=="true")

df<-df %>% select("signname", "borough", "neighborhood", "zipcode", "size_acres", "om_type", "pd_type", "waterfront")

df<-df[order(-df$size_acres),]

kable(head(df), caption="Big Waterfront Parks",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))
Big Waterfront Parks

| signname | borough | neighborhood | zipcode | size_acres | om_type | pd_type | waterfront |
|:---|:---|:---|---:|---:|:---|:---|:---|
| Pelham Bay Park | Bronx | Southeast Bronx | 10461 | 2,772 | Flagship Park | Flagship Park | true |
| Freshkills Park | Staten Island | South Shore | 10312 | 920 | Flagship Park | Undeveloped | true |
| Flushing Meadows Corona Park | Queens | North Queens | 11354 | 898 | Flagship Park | Flagship Park | true |
| Marine Park | Brooklyn | Southern Brooklyn | 11229 | 798 | Large Park | Community Park | true |
| Bronx Park | Bronx | Bronx Park and Fordham | 10458 | 718 | Large Park | Flagship Park | true |
| Franklin D. Roosevelt Boardwalk and Beach | Staten Island | Stapleton and St. George | 10305 | 644 | Large Park | Waterfront Facility | true |


The 6 biggest parks in Borough Park, Brooklyn (not very big).

df<-subset(df_final, neighborhood=="Borough Park")

df<-df %>% select("signname", "borough", "neighborhood", "size_acres", "zipcode", "population")

df<-df[order(-df$size_acres),]


kable(head(df), caption=" ",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))
| signname | borough | neighborhood | size_acres | zipcode | population |
|:---|:---|:---|---:|---:|---:|
| Leif Ericson Park | Brooklyn | Borough Park | 16.80 | 11219 | 92,221 |
| Friends Field | Brooklyn | Borough Park | 6.70 | 11230 | 86,408 |
| Gravesend Park | Brooklyn | Borough Park | 6.38 | 11204 | 78,134 |
| Seth Low Playground/ Bealin Square | Brooklyn | Borough Park | 4.95 | 11204 | 78,134 |
| Greenwood Playground | Brooklyn | Borough Park | 3.39 | 11218 | 75,220 |
| Colonel David Marcus Playground | Brooklyn | Borough Park | 1.97 | 11230 | 86,408 |


The 6 biggest parks in the South Shore of Staten Island (fairly big).

df<-subset(df_final, neighborhood=="South Shore")

df<-df %>% select("signname", "borough", "neighborhood", "size_acres", "zipcode", "population")

df<-df[order(-df$size_acres),]


kable(head(df), caption=" ",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))
| signname | borough | neighborhood | size_acres | zipcode | population |
|:---|:---|:---|---:|---:|---:|
| Freshkills Park | Staten Island | South Shore | 920 | 10312 | 59,304 |
| LaTourette Park & Golf Course | Staten Island | South Shore | 761 | 10306 | 55,909 |
| Great Kills Park | Staten Island | South Shore | 315 | 10306 | 55,909 |
| Wolfe’s Pond Park | Staten Island | South Shore | 303 | 10309 | 32,519 |
| Conference House Park | Staten Island | South Shore | 286 | 10307 | 14,096 |
| Brookfield Park | Staten Island | South Shore | 259 | 10308 | 27,357 |





Aggregate By Neighborhood






Create summary by zip code.


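Recall that our population is recorded per zip code, while acreage is recorded per park. Each zip code's population is therefore repeated on every park row in that zip code, so we take it once per zip code before summing.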

df_sum_zip <-df_final %>%
  group_by(zipcode) %>%
  summarise(borough = min(borough), neighborhood=min(neighborhood), population=min(population), acres = sum(size_acres), parks=n()  )


Create summary by neighborhood.

df_sum_neighborhood <- df_sum_zip %>%
  group_by(neighborhood) %>%
  summarise(borough = min(borough), population=sum(population), parks = sum(parks), acres = sum(acres))

# note: despite its name, parkland_per_person is acres of parkland per 100,000 residents
df_sum_neighborhood <- df_sum_neighborhood %>% mutate(parkland_per_person = acres/(population/100000))
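The two-stage aggregation matters because population is repeated on every park row within a zip code. The sketch below replays it on the six Borough Park rows listed earlier (rounded acreages as printed, so not all 20 parks) and compares it with a naive one-step sum. The column name acres_per_100k is introduced here for illustration.

```r
library(dplyr)

# The six Borough Park rows from the table earlier in this report.
parks <- tibble(
  neighborhood = "Borough Park",
  zipcode      = c(11219, 11230, 11204, 11204, 11218, 11230),
  population   = c(92221, 86408, 78134, 78134, 75220, 86408),
  size_acres   = c(16.80, 6.70, 6.38, 4.95, 3.39, 1.97)
)

# Stage 1: one row per zip code, population counted once per zip.
by_zip <- parks %>%
  group_by(zipcode) %>%
  summarise(neighborhood = min(neighborhood),
            population   = min(population),
            acres        = sum(size_acres),
            parks        = n())

# Stage 2: roll the zip codes up to the neighborhood.
by_hood <- by_zip %>%
  group_by(neighborhood) %>%
  summarise(population = sum(population),
            parks      = sum(parks),
            acres      = sum(acres)) %>%
  mutate(acres_per_100k = acres / (population / 100000))

# Naive one-step sum double counts zips 11204 and 11230.
naive <- parks %>%
  group_by(neighborhood) %>%
  summarise(population = sum(population))
```

The two-stage population is 331,983, which matches the Borough Park figure in the neighborhood summary. The naive sum gives 496,525 because two zip codes appear twice.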


A few records from the neighborhood summary.

kable(head(df_sum_neighborhood), caption="By Neighborhood",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))
By Neighborhood

| neighborhood | borough | population | parks | acres | parkland_per_person |
|:---|:---|---:|---:|---:|---:|
| Borough Park | Brooklyn | 331,983 | 20 | 47.9 | 14.4 |
| Bronx Park and Fordham | Bronx | 252,655 | 37 | 1,954.1 | 773.4 |
| Bushwick and Williamsburg | Brooklyn | 210,468 | 59 | 51.6 | 24.5 |
| Canarsie and Flatlands | Brooklyn | 195,027 | 32 | 507.8 | 260.4 |
| Central Bronx | Bronx | 206,116 | 68 | 105.8 | 51.3 |
| Central Brooklyn | Brooklyn | 318,898 | 111 | 189.4 | 59.4 |





The Summary in 3 Bar Charts






Population of each neighborhood.

options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = population)) +
  geom_bar(stat = "identity") +
  coord_flip() + scale_y_continuous(name="Population") +
  scale_x_discrete(name="Neighborhood") +
  theme(axis.text.x = element_text(face="bold", color="#008000",
                                   size=8, angle=0),
        axis.text.y = element_text(face="bold", color="#008000",
                                   size=8, angle=0))


Parkland of each neighborhood.

options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = acres)) +
  geom_bar(stat = "identity") +
  coord_flip() + scale_y_continuous(name="Parkland in Acres") +
  scale_x_discrete(name="Neighborhood") +
  theme(axis.text.x = element_text(face="bold", color="#008000",
                                   size=8, angle=0),
        axis.text.y = element_text(face="bold", color="#008000",
                                   size=8, angle=0))


Parkland per 100,000 residents.

options(repr.plot.width=8, repr.plot.height=3)
# parkland_per_person holds acres per 100,000 residents (see the mutate step above)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = parkland_per_person)) +
  geom_bar(stat = "identity") +
  coord_flip() + scale_y_continuous(name="Parkland Acres per 100,000 Residents") +
  scale_x_discrete(name="Neighborhood") +
  theme(axis.text.x = element_text(face="bold", color="#008000",
                                   size=8, angle=0),
        axis.text.y = element_text(face="bold", color="#008000",
                                   size=8, angle=0))
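The charts above list neighborhoods alphabetically on a linear scale, so a few very large values flatten everything else. One possible variation, sketched here on the six rows from the By Neighborhood table, sorts the neighborhoods by value and uses a log scale. It draws points rather than bars, because bar length has no natural baseline on a log axis. The data frame toy is a stand-in for df_sum_neighborhood.

```r
library(ggplot2)

# Six rows copied from the By Neighborhood table (acres per 100,000 residents).
toy <- data.frame(
  neighborhood = c("Borough Park", "Bronx Park and Fordham",
                   "Bushwick and Williamsburg", "Canarsie and Flatlands",
                   "Central Bronx", "Central Brooklyn"),
  parkland_per_person = c(14.4, 773.4, 24.5, 260.4, 51.3, 59.4)
)

# reorder() sorts the neighborhoods by value instead of alphabetically.
p <- ggplot(toy, aes(x = reorder(neighborhood, parkland_per_person),
                     y = parkland_per_person)) +
  geom_point(size = 3) +
  coord_flip() +
  scale_y_log10(name = "Parkland acres per 100,000 residents (log scale)") +
  scale_x_discrete(name = "Neighborhood")
```

Printing p draws the chart. Whether a log scale suits the audience is a judgment call, since it makes the gap between neighborhoods look smaller than it is.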


There are some massive parklands that skew the results.
But this study does illuminate some neighborhoods (like Borough Park) that could stand to have more space for parks.


There's a lot of context and many caveats, so this study could definitely be expanded.