Basic Data Extraction : NYC Parks

Overview

This project will explore the NYC park data from the “NYC Open Data” project and the question we will answer is what neighborhood has the most or best park lands. We will look at it by acreage vs. population.

Lets begin.

Imports. Constants.

library(readxl)
library(tidyverse)         # ggplot2, dplyr, tidyr, readr, tibble, sringr and more
library(knitr)

CURR_PATH<-str_trim(getwd())

Download the Data

Download the NYC Park data into a local csv file and bring it into a data frame.

download_parks<-"https://data.cityofnewyork.us/api/views/enfh-gkve/rows.csv?accessType=DOWNLOAD"

destfile<-paste0(CURR_PATH,"/parks.csv")


download.file(download_parks, destfile)


cfile<-read.csv(destfile)

Download an extract of neighborhoods and zip codes so we can group each park into neighborhoods.

download_neighborhoods<- "https://data.beta.nyc/dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/7caac650-d082-4aea-9f9b-3681d568e8a5/download/nyc_zip_borough_neighborhoods_pop.csv"

destfile<-paste0(CURR_PATH,"/neighborhoods.csv")

download.file(download_neighborhoods, destfile)

zfile<-read.csv(destfile)

A look at the Data

Just because its in the database doesnt mean it adds to quality of life. Some parks are beautiful…

Gantry Park

Others not so beautiful.
Crotona Parkway Malls

We will make some effort to restrict the data set to just the kinds of park land that add to quality of life.

Filter data

Remove some duff data and remove the ones that really arent “quality of life” parks.

# get rid of some of the the duff data, we dont need the records that are not of class PARK

cfile<-subset(cfile, CLASS=="PARK")


cfile<-subset(cfile, TYPECATEGORY!="Cemetery" & TYPECATEGORY!="Mall" & TYPECATEGORY!="Parkway"  )

cfile<-subset(cfile, SUBCATEGORY!="Lot" & SUBCATEGORY!="Parking Lot" & SUBCATEGORY!="PKWY"  & SUBCATEGORY!="EXWY" & SUBCATEGORY!="Building" & SUBCATEGORY!="Facility" )


# i noticed if the sign just says "Park" its usually a very tiny piece of land
cfile<-subset(cfile, SIGNNAME != "Park")

Tidy data

Now we need to join our Neighbood data to the Park data, joining by zip code.

The problem is many parks span multiple zipcodes and appear as 1.110311e+34 or 1.000110e+09.

So convert to string remove the dot, substring out the first zip code, and mark it as a multiple zip code park.

Note I created a zipcode table using the unique identifier “GISOBJID”.

multiple_zips <- cfile %>% filter(ZIPCODE > 99999) %>% select("GISOBJID","ZIPCODE")

multiple_zips$ZIPCODE<-as.character(multiple_zips$ZIPCODE)

multiple_zips$ZIPCODE<-str_replace_all(multiple_zips$ZIPCODE, "\\.", "")


multiple_zips$ZIPCODE<-str_sub(multiple_zips$ZIPCODE, 1, 5) 

multiple_zips<- multiple_zips %>% mutate(multi_zip = 'Y'   )

# this approach was my plan B, create a table with all zip codes
single_zips <- cfile %>% filter(ZIPCODE < 99999) %>% select("GISOBJID","ZIPCODE")
single_zips<- single_zips %>% mutate(multi_zip = 'N'   )
all_zips<-rbind(multiple_zips, single_zips)

Now that we created a zipcode table. Join the 3 tables into one.

#  now merge the 2 tables
#   1) update the original zipcode with the new zipcode
#   2) add the multi_zip  field





# reduce the data set to just the fields we care about, note im not including zipcode or borough
# note im not bringinging in EAPPLY or NAME311, as SIGNNAME seems to be the most reliable of the 3
df_temp<-cfile[c("SIGNNAME","ACRES", "DEPARTMENT", "URL",  "GISOBJID",
            "JURISDICTION", "TYPECATEGORY", "SUBCATEGORY", "WATERFRONT")]



# all.X=True means we will include all records from df_temp regardless
df_temp<-merge(df_temp, all_zips, by.x ="GISOBJID", by.y = "GISOBJID", all.x = TRUE)   


# commit point

# now merge again with neighborhood file
df_final<-merge(df_temp, zfile, by.x ="ZIPCODE", by.y = "zip", all.x = TRUE)

Great. Cleanup the columns a bit.

# clean up the columns
names(df_final)<-tolower(names(df_final))



# rename some of the columns to help clarify things

df_final<-df_final %>% 
  rename(
    size_acres=acres,              
    pd_type=typecategory,          # Planning and Development Division type
    om_type=subcategory           # Operations & Management Division type
)


df_final <- df_final %>% drop_na(borough)   # a few bad zip codes

Convert the acres to square miles.

# an acre is 0.0015625 square miles

df_final<-df_final %>% dplyr::mutate(size_sm = size_acres*0.0015625)

Take a peek with a couple of queries

The 5 biggest parks.

# display the biggest waterfront parks 
df<-subset(df_final, waterfront=="true")

df<-df %>% select("signname", "borough", "neighborhood", "zipcode", "size_acres", "om_type", "pd_type", "waterfront")

df<-df[order(-df$size_acres),]

kable(head(df), caption="Biggest Parks",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))

Biggest Parks
signname	borough	neighborhood	zipcode	size_acres	om_type	pd_type	waterfront
Pelham Bay Park	Bronx	Southeast Bronx	10461	2,772	Flagship Park	Flagship Park	true
Freshkills Park	Staten Island	South Shore	10312	920	Flagship Park	Undeveloped	true
Flushing Meadows Corona Park	Queens	North Queens	11354	898	Flagship Park	Flagship Park	true
Marine Park	Brooklyn	Southern Brooklyn	11229	798	Large Park	Community Park	true
Bronx Park	Bronx	Bronx Park and Fordham	10458	718	Large Park	Flagship Park	true
Franklin D. Roosevelt Boardwalk and Beach	Staten Island	Stapleton and St. George	10305	644	Large Park	Waterfront Facility	true

The 5 biggest waterfront parks.

# display the biggest waterfront parks 
df<-subset(df_final, waterfront=="true")

df<-df %>% select("signname", "borough", "neighborhood", "zipcode", "size_acres", "om_type", "pd_type", "waterfront")

df<-df[order(-df$size_acres),]

kable(head(df), caption="Big Waterfront Parks",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))

Big Waterfront Parks
signname	borough	neighborhood	zipcode	size_acres	om_type	pd_type	waterfront
Pelham Bay Park	Bronx	Southeast Bronx	10461	2,772	Flagship Park	Flagship Park	true
Freshkills Park	Staten Island	South Shore	10312	920	Flagship Park	Undeveloped	true
Flushing Meadows Corona Park	Queens	North Queens	11354	898	Flagship Park	Flagship Park	true
Marine Park	Brooklyn	Southern Brooklyn	11229	798	Large Park	Community Park	true
Bronx Park	Bronx	Bronx Park and Fordham	10458	718	Large Park	Flagship Park	true
Franklin D. Roosevelt Boardwalk and Beach	Staten Island	Stapleton and St. George	10305	644	Large Park	Waterfront Facility	true

The 5 biggest parks in Borough Park, Brooklyn.(not very big)

df<-subset(df_final, neighborhood=="Borough Park")

df<-df %>% select("signname", "borough", "neighborhood", "size_acres", "zipcode", "population")

df<-df[order(-df$size_acres),]


kable(head(df), caption=" ",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))

signname	borough	neighborhood	size_acres	zipcode	population
Leif Ericson Park	Brooklyn	Borough Park	16.80	11219	92,221
Friends Field	Brooklyn	Borough Park	6.70	11230	86,408
Gravesend Park	Brooklyn	Borough Park	6.38	11204	78,134
Seth Low Playground/ Bealin Square	Brooklyn	Borough Park	4.95	11204	78,134
Greenwood Playground	Brooklyn	Borough Park	3.39	11218	75,220
Colonel David Marcus Playground	Brooklyn	Borough Park	1.97	11230	86,408

The 5 biggest parks in the South Shore of Staten Island (fairly big).

df<-subset(df_final, neighborhood=="South Shore")

df<-df %>% select("signname", "borough", "neighborhood", "size_acres", "zipcode", "population")

df<-df[order(-df$size_acres),]


kable(head(df), caption=" ",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))

signname	borough	neighborhood	size_acres	zipcode	population
Freshkills Park	Staten Island	South Shore	920	10312	59,304
LaTourette Park & Golf Course	Staten Island	South Shore	761	10306	55,909
Great Kills Park	Staten Island	South Shore	315	10306	55,909
Wolfe’s Pond Park	Staten Island	South Shore	303	10309	32,519
Conference House Park	Staten Island	South Shore	286	10307	14,096
Brookfield Park	Staten Island	South Shore	259	10308	27,357

Aggregate By Neighborhood

Create summary by zip code.

Recall that our population is by zip code. The acres is by park.

df_sum_zip <-df_final %>%
  group_by(zipcode) %>%
  summarise(borough = min(borough), neighborhood=min(neighborhood), population=min(population), acres = sum(size_acres), parks=n()  )

Create summary by neighborhood.

df_sum_neighborhood <- df_sum_zip %>%
  group_by(neighborhood) %>%
  summarise(borough = min(borough), population=sum(population), parks = sum(parks), acres = sum(acres))

df_sum_neighborhood <- df_sum_neighborhood %>% mutate(parkland_per_person = acres/(population/100000))

A few records from the neiborhood summary.

kable(head(df_sum_neighborhood), caption="By Neighborhood",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))

By Neighborhood
neighborhood	borough	population	parks	acres	parkland_per_person
Borough Park	Brooklyn	331,983	20	47.9	14.4
Bronx Park and Fordham	Bronx	252,655	37	1,954.1	773.4
Bushwick and Williamsburg	Brooklyn	210,468	59	51.6	24.5
Canarsie and Flatlands	Brooklyn	195,027	32	507.8	260.4
Central Bronx	Bronx	206,116	68	105.8	51.3
Central Brooklyn	Brooklyn	318,898	111	189.4	59.4

The Summary in 3 Bar Charts

Population of each neighborhood.

options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = population, main=" ")) +
  geom_bar(stat = "identity") +
  coord_flip() + scale_y_continuous(name="Population") +
  scale_x_discrete(name="Neighborhood") +
  theme(axis.text.x = element_text(face="bold", color="#008000",
                                   size=8, angle=0),
        axis.text.y = element_text(face="bold", color="#008000",
                                   size=8, angle=0))

Parkland of each neighborhood.

options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = acres, main=" ")) +
  geom_bar(stat = "identity") +
  coord_flip() + scale_y_continuous(name="Parkland in Acres") +
  scale_x_discrete(name="Neighborhood") +
  theme(axis.text.x = element_text(face="bold", color="#008000",
                                   size=8, angle=0),
        axis.text.y = element_text(face="bold", color="#008000",
                                   size=8, angle=0))

Parkland per Person.

options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = parkland_per_person, main=" ")) +
  geom_bar(stat = "identity") +
  coord_flip() + scale_y_continuous(name="Parkland Acreage per Population") +
  scale_x_discrete(name="Neighborhood") +
  theme(axis.text.x = element_text(face="bold", color="#008000",
                                   size=8, angle=0),
        axis.text.y = element_text(face="bold", color="#008000",
                                   size=8, angle=0))

There are some massive parklands that skew the results.
But this study does illuminate some neighborhoods (like Borough Park) that could stand to have more space for parks.

Theres a lot of context and caveats so this is a study that definitely could be expanded.

Data607_Project2_Week6

Tom Buonora

September 26, 2021