Basic Data Extraction : NYC Parks
This project will explore the NYC park data from the “NYC Open Data” project and the question we will answer is what neighborhood has the most or best park lands. We will look at it by acreage vs. population.
Lets begin.
Imports. Constants.
library(readxl)
library(tidyverse) # ggplot2, dplyr, tidyr, readr, tibble, sringr and more
library(knitr)
CURR_PATH<-str_trim(getwd())Download the NYC Park data into a local csv file and bring it into a data frame.
download_parks<-"https://data.cityofnewyork.us/api/views/enfh-gkve/rows.csv?accessType=DOWNLOAD"
destfile<-paste0(CURR_PATH,"/parks.csv")
download.file(download_parks, destfile)
cfile<-read.csv(destfile)
Download an extract of neighborhoods and zip codes so we can group each park into neighborhoods.
download_neighborhoods<- "https://data.beta.nyc/dataset/0ff93d2d-90ba-457c-9f7e-39e47bf2ac5f/resource/7caac650-d082-4aea-9f9b-3681d568e8a5/download/nyc_zip_borough_neighborhoods_pop.csv"
destfile<-paste0(CURR_PATH,"/neighborhoods.csv")
download.file(download_neighborhoods, destfile)
zfile<-read.csv(destfile)
Just because its in the database doesnt mean it adds to quality of life. Some parks are beautiful…
We will make some effort to restrict the data set to just the kinds of park land that add to quality of life.
Remove some duff data and remove the ones that really arent “quality of life” parks.
# get rid of some of the the duff data, we dont need the records that are not of class PARK
cfile<-subset(cfile, CLASS=="PARK")
cfile<-subset(cfile, TYPECATEGORY!="Cemetery" & TYPECATEGORY!="Mall" & TYPECATEGORY!="Parkway" )
cfile<-subset(cfile, SUBCATEGORY!="Lot" & SUBCATEGORY!="Parking Lot" & SUBCATEGORY!="PKWY" & SUBCATEGORY!="EXWY" & SUBCATEGORY!="Building" & SUBCATEGORY!="Facility" )
# i noticed if the sign just says "Park" its usually a very tiny piece of land
cfile<-subset(cfile, SIGNNAME != "Park")
Now we need to join our Neighbood data to the Park data, joining by zip code.
The problem is many parks span multiple zipcodes and appear as 1.110311e+34 or 1.000110e+09.
So convert to string remove the dot, substring out the first zip code, and mark it as a multiple zip code park.
Note I created a zipcode table using the unique identifier “GISOBJID”.
multiple_zips <- cfile %>% filter(ZIPCODE > 99999) %>% select("GISOBJID","ZIPCODE")
multiple_zips$ZIPCODE<-as.character(multiple_zips$ZIPCODE)
multiple_zips$ZIPCODE<-str_replace_all(multiple_zips$ZIPCODE, "\\.", "")
multiple_zips$ZIPCODE<-str_sub(multiple_zips$ZIPCODE, 1, 5)
multiple_zips<- multiple_zips %>% mutate(multi_zip = 'Y' )
# this approach was my plan B, create a table with all zip codes
single_zips <- cfile %>% filter(ZIPCODE < 99999) %>% select("GISOBJID","ZIPCODE")
single_zips<- single_zips %>% mutate(multi_zip = 'N' )
all_zips<-rbind(multiple_zips, single_zips)
Now that we created a zipcode table. Join the 3 tables into one.
# now merge the 2 tables
# 1) update the original zipcode with the new zipcode
# 2) add the multi_zip field
# reduce the data set to just the fields we care about, note im not including zipcode or borough
# note im not bringinging in EAPPLY or NAME311, as SIGNNAME seems to be the most reliable of the 3
df_temp<-cfile[c("SIGNNAME","ACRES", "DEPARTMENT", "URL", "GISOBJID",
"JURISDICTION", "TYPECATEGORY", "SUBCATEGORY", "WATERFRONT")]
# all.X=True means we will include all records from df_temp regardless
df_temp<-merge(df_temp, all_zips, by.x ="GISOBJID", by.y = "GISOBJID", all.x = TRUE)
# commit point
# now merge again with neighborhood file
df_final<-merge(df_temp, zfile, by.x ="ZIPCODE", by.y = "zip", all.x = TRUE)
Great. Cleanup the columns a bit.
# clean up the columns
names(df_final)<-tolower(names(df_final))
# rename some of the columns to help clarify things
df_final<-df_final %>%
rename(
size_acres=acres,
pd_type=typecategory, # Planning and Development Division type
om_type=subcategory # Operations & Management Division type
)
df_final <- df_final %>% drop_na(borough) # a few bad zip codes
Convert the acres to square miles.
# an acre is 0.0015625 square miles
df_final<-df_final %>% dplyr::mutate(size_sm = size_acres*0.0015625)
The 5 biggest parks.
# display the biggest waterfront parks
df<-subset(df_final, waterfront=="true")
df<-df %>% select("signname", "borough", "neighborhood", "zipcode", "size_acres", "om_type", "pd_type", "waterfront")
df<-df[order(-df$size_acres),]
kable(head(df), caption="Biggest Parks",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))| signname | borough | neighborhood | zipcode | size_acres | om_type | pd_type | waterfront |
|---|---|---|---|---|---|---|---|
| Pelham Bay Park | Bronx | Southeast Bronx | 10461 | 2,772 | Flagship Park | Flagship Park | true |
| Freshkills Park | Staten Island | South Shore | 10312 | 920 | Flagship Park | Undeveloped | true |
| Flushing Meadows Corona Park | Queens | North Queens | 11354 | 898 | Flagship Park | Flagship Park | true |
| Marine Park | Brooklyn | Southern Brooklyn | 11229 | 798 | Large Park | Community Park | true |
| Bronx Park | Bronx | Bronx Park and Fordham | 10458 | 718 | Large Park | Flagship Park | true |
| Franklin D. Roosevelt Boardwalk and Beach | Staten Island | Stapleton and St. George | 10305 | 644 | Large Park | Waterfront Facility | true |
The 5 biggest waterfront parks.
# display the biggest waterfront parks
df<-subset(df_final, waterfront=="true")
df<-df %>% select("signname", "borough", "neighborhood", "zipcode", "size_acres", "om_type", "pd_type", "waterfront")
df<-df[order(-df$size_acres),]
kable(head(df), caption="Big Waterfront Parks",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))| signname | borough | neighborhood | zipcode | size_acres | om_type | pd_type | waterfront |
|---|---|---|---|---|---|---|---|
| Pelham Bay Park | Bronx | Southeast Bronx | 10461 | 2,772 | Flagship Park | Flagship Park | true |
| Freshkills Park | Staten Island | South Shore | 10312 | 920 | Flagship Park | Undeveloped | true |
| Flushing Meadows Corona Park | Queens | North Queens | 11354 | 898 | Flagship Park | Flagship Park | true |
| Marine Park | Brooklyn | Southern Brooklyn | 11229 | 798 | Large Park | Community Park | true |
| Bronx Park | Bronx | Bronx Park and Fordham | 10458 | 718 | Large Park | Flagship Park | true |
| Franklin D. Roosevelt Boardwalk and Beach | Staten Island | Stapleton and St. George | 10305 | 644 | Large Park | Waterfront Facility | true |
The 5 biggest parks in Borough Park, Brooklyn.(not very big)
df<-subset(df_final, neighborhood=="Borough Park")
df<-df %>% select("signname", "borough", "neighborhood", "size_acres", "zipcode", "population")
df<-df[order(-df$size_acres),]
kable(head(df), caption=" ",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))| signname | borough | neighborhood | size_acres | zipcode | population |
|---|---|---|---|---|---|
| Leif Ericson Park | Brooklyn | Borough Park | 16.80 | 11219 | 92,221 |
| Friends Field | Brooklyn | Borough Park | 6.70 | 11230 | 86,408 |
| Gravesend Park | Brooklyn | Borough Park | 6.38 | 11204 | 78,134 |
| Seth Low Playground/ Bealin Square | Brooklyn | Borough Park | 4.95 | 11204 | 78,134 |
| Greenwood Playground | Brooklyn | Borough Park | 3.39 | 11218 | 75,220 |
| Colonel David Marcus Playground | Brooklyn | Borough Park | 1.97 | 11230 | 86,408 |
The 5 biggest parks in the South Shore of Staten Island (fairly big).
df<-subset(df_final, neighborhood=="South Shore")
df<-df %>% select("signname", "borough", "neighborhood", "size_acres", "zipcode", "population")
df<-df[order(-df$size_acres),]
kable(head(df), caption=" ",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))| signname | borough | neighborhood | size_acres | zipcode | population |
|---|---|---|---|---|---|
| Freshkills Park | Staten Island | South Shore | 920 | 10312 | 59,304 |
| LaTourette Park & Golf Course | Staten Island | South Shore | 761 | 10306 | 55,909 |
| Great Kills Park | Staten Island | South Shore | 315 | 10306 | 55,909 |
| Wolfe’s Pond Park | Staten Island | South Shore | 303 | 10309 | 32,519 |
| Conference House Park | Staten Island | South Shore | 286 | 10307 | 14,096 |
| Brookfield Park | Staten Island | South Shore | 259 | 10308 | 27,357 |
Create summary by zip code.
Recall that our population is by zip code. The acres is by park.
df_sum_zip <-df_final %>%
group_by(zipcode) %>%
summarise(borough = min(borough), neighborhood=min(neighborhood), population=min(population), acres = sum(size_acres), parks=n() )
Create summary by neighborhood.
df_sum_neighborhood <- df_sum_zip %>%
group_by(neighborhood) %>%
summarise(borough = min(borough), population=sum(population), parks = sum(parks), acres = sum(acres))
df_sum_neighborhood <- df_sum_neighborhood %>% mutate(parkland_per_person = acres/(population/100000))
A few records from the neiborhood summary.
kable(head(df_sum_neighborhood), caption="By Neighborhood",row.names = FALSE, booktabs=TRUE,format.args = list(decimal.mark = '.', big.mark = ",", digits=3))| neighborhood | borough | population | parks | acres | parkland_per_person |
|---|---|---|---|---|---|
| Borough Park | Brooklyn | 331,983 | 20 | 47.9 | 14.4 |
| Bronx Park and Fordham | Bronx | 252,655 | 37 | 1,954.1 | 773.4 |
| Bushwick and Williamsburg | Brooklyn | 210,468 | 59 | 51.6 | 24.5 |
| Canarsie and Flatlands | Brooklyn | 195,027 | 32 | 507.8 | 260.4 |
| Central Bronx | Bronx | 206,116 | 68 | 105.8 | 51.3 |
| Central Brooklyn | Brooklyn | 318,898 | 111 | 189.4 | 59.4 |
Population of each neighborhood.
options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = population, main=" ")) +
geom_bar(stat = "identity") +
coord_flip() + scale_y_continuous(name="Population") +
scale_x_discrete(name="Neighborhood") +
theme(axis.text.x = element_text(face="bold", color="#008000",
size=8, angle=0),
axis.text.y = element_text(face="bold", color="#008000",
size=8, angle=0))
Parkland of each neighborhood.
options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = acres, main=" ")) +
geom_bar(stat = "identity") +
coord_flip() + scale_y_continuous(name="Parkland in Acres") +
scale_x_discrete(name="Neighborhood") +
theme(axis.text.x = element_text(face="bold", color="#008000",
size=8, angle=0),
axis.text.y = element_text(face="bold", color="#008000",
size=8, angle=0))
Parkland per Person.
options(repr.plot.width=8, repr.plot.height=3)
ggplot(df_sum_neighborhood, aes(x = neighborhood, y = parkland_per_person, main=" ")) +
geom_bar(stat = "identity") +
coord_flip() + scale_y_continuous(name="Parkland Acreage per Population") +
scale_x_discrete(name="Neighborhood") +
theme(axis.text.x = element_text(face="bold", color="#008000",
size=8, angle=0),
axis.text.y = element_text(face="bold", color="#008000",
size=8, angle=0))
There are some massive parklands that skew the results.
But this study does illuminate some neighborhoods (like Borough Park) that could stand to have more space for parks.
Theres a lot of context and caveats so this is a study that definitely could be expanded.