Data Science Assignment

Description of problem

The data (sourced from NYC OpenData: https://opendata.cityofnewyork.us/) contains information about the status of building permits being issued. It includes the locations of the buildings, the type of buildings, etc.

This exercise involves the exploration of this data. The main questions are: In what areas are residential buildings being built? In what areas are non-residential buildings being built? Are there any specific factors being kept in mind when deciding the location of these buildings? Other interesting characteristics might also come up.

Data

The dataset has 3508249 rows and 60 columns. The data is quite unclean, with NA values as well as blanks in most columns. The most important columns for this exploration are latitude, longitude, borough, building type, construction type and status. For this exercise, only in-process applications are considered, since the aim is to explore the location decisions for buildings that will be constructed in the future.

setwd("C:/Users/Eashani/Desktop/topos")
permit_data <- read.csv("permit_data.csv")
subway <- read.csv("subway1.csv")
library(leaflet)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(sp)
library(maptools)

## Checking rgeos availability: FALSE
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()

library(httr)
library(rgdal)

## rgdal: version: 1.4-3, (SVN revision 828)
##  Geospatial Data Abstraction Library extensions to R successfully loaded
##  Loaded GDAL runtime: GDAL 2.2.3, released 2017/11/20
##  Path to GDAL shared files: C:/Users/Eashani/Documents/R/win-library/3.5/rgdal/gdal
##  GDAL binary built with GEOS: TRUE 
##  Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
##  Path to PROJ.4 shared files: C:/Users/Eashani/Documents/R/win-library/3.5/rgdal/proj
##  Linking to sp version: 1.3-1

library(tigris)

## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.

## 
## Attaching package: 'tigris'

## The following object is masked from 'package:graphics':
## 
##     plot

library(leaflet.extras)
library(geosphere)
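
The borough column appears later in this analysis as `ï..BOROUGH`. That prefix is most likely a UTF-8 byte order mark (BOM) at the start of the CSV being read as part of the first header. The sketch below, using a small temporary file rather than the real permit data, shows how `read.csv` can strip the mark with `fileEncoding = "UTF-8-BOM"`. The file contents and the name `demo` are illustrative assumptions; if the real file is read this way, the column would presumably be called `BOROUGH` and the later subsetting code would need to use that name.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (illustrative data only).
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("BOROUGH,LATITUDE\nBRONX,40.85\n")), tmp)

# "UTF-8-BOM" tells R to drop the mark instead of folding it into the first header.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = FALSE)
print(names(demo))

unlink(tmp)
```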

Data cleaning

The data is checked for NA values

# Number of distinct rows that contain at least one NA in any column
na_values <- length(unique(unlist(lapply(permit_data, function(column) which(is.na(column))))))

na_values

## [1] 162319

percent_of_na <- na_values/nrow(permit_data)

percent_of_na

## [1] 0.04626781

As the rows with missing values form a very small section of the data (about 4.6%), and the data contains non-numeric columns for which imputation through the mean or mode would be incorrect, it makes more sense to get rid of these rows.
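
The check above counts NA values only, while the data is also described as having blanks in most columns, and `na.omit()` does not drop rows whose only problem is an empty string. A minimal sketch of counting both kinds of missing value is given below. It runs on a toy data frame, and the helper `is_missing` is a hypothetical name introduced here, not part of the original analysis.

```r
# Toy stand-in for the permit data (illustrative values only).
toy <- data.frame(
  borough = c("BRONX", "", "QUEENS", NA),
  lat     = c(40.85, 40.70, NA, 40.60),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is NA or, for text columns, blank.
is_missing <- function(x) {
  blank <- if (is.character(x) || is.factor(x)) trimws(as.character(x)) == "" else FALSE
  is.na(x) | (blank %in% TRUE)
}

# Missing values per column, and the share of rows with at least one.
missing_per_col   <- vapply(toy, function(col) sum(is_missing(col)), integer(1))
rows_with_missing <- sum(Reduce(`|`, lapply(toy, is_missing)))
share_missing     <- rows_with_missing / nrow(toy)

print(missing_per_col)
print(share_missing)
```

If blanks turn out to be common in the columns of interest, the share of affected rows may be larger than the NA-only figure suggests.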

Residential Buildings

The data is subset to the buildings that are being constructed for residential purposes.

The locations of these buildings are plotted on a map of NYC and are colored according to borough.

permit2 <- na.omit(permit_data)
permit2 <- permit2[permit2$Permit.Status=="IN PROCESS",]
permit2 <- permit2[permit2$Residential=="YES",]
permit2 <- permit2[permit2$Job.Type =="NB",]
r <- GET('http://data.beta.nyc//dataset/68c0332f-c3bb-4a78-a0c1-32af515892d6/resource/7c164faa-4458-4ff2-9ef0-09db00b509ef/download/42c737fd496f4d6683bba25fb0e86e1dnycboroughboundaries.geojson')
nyc_neighborhoods <- readOGR(content(r,'text'), 'OGRGeoJSON', verbose = F)

## No encoding supplied: defaulting to UTF-8.

points <- data.frame(lat=permit2$LATITUDE, lng=permit2$LONGITUDE)
points_spdf <- points
coordinates(points_spdf) <- ~lng + lat
proj4string(points_spdf) <- proj4string(nyc_neighborhoods)
matches <- over(points_spdf, nyc_neighborhoods)
points <- cbind(points, matches)
points_by_borough <- points %>%
  group_by(borough) %>%
  summarize(num_points=n())




map_data <- geo_join(nyc_neighborhoods, points_by_borough, "borough", "borough")

points_bronx1 = permit2[permit2$ï..BOROUGH=="BRONX",]
points_brooklyn1 = permit2[permit2$ï..BOROUGH=="BROOKLYN",]
points_manhattan1 = permit2[permit2$ï..BOROUGH=="MANHATTAN",]
points_queens1 = permit2[permit2$ï..BOROUGH=="QUEENS",]
points_staten1 = permit2[permit2$ï..BOROUGH=="STATEN ISLAND",]

pal <- colorNumeric(palette = "Blues",
                    domain = range(map_data@data$num_points, na.rm=T))

n <- leaflet() %>%
  addTiles() %>% 
  addPolygons(fillColor = ~pal(num_points), popup = ~borough,data=map_data) %>% 
  addCircleMarkers(data=points_bronx1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="RED",  fillColor="red", stroke = TRUE, fillOpacity = 0.8, group="Red")%>%
  addCircleMarkers(data=points_brooklyn1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="blue",  fillColor="blue", stroke = TRUE, fillOpacity = 0.8, group="blue")%>%
  addCircleMarkers(data=points_manhattan1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="green",  fillColor="green", stroke = TRUE, fillOpacity = 0.8, group="green")%>%
  addCircleMarkers(data=points_queens1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="yellow",  fillColor="yellow", stroke = TRUE, fillOpacity = 0.8, group="yellow")%>%
  addCircleMarkers(data=points_staten1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="orange",  fillColor="orange", stroke = TRUE, fillOpacity = 0.8, group="orange")%>%
addProviderTiles("CartoDB.Positron")  %>% setView(-74.00, 40.71, zoom = 10.5)
n
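
The five per-borough subsets and marker layers above differ only in colour. One possible simplification, sketched here on toy data, is to look the colour up from a named vector and store it in a column, so that a single marker layer can be drawn. The names `borough_colours`, `toy` and `colour` are assumptions made for this illustration, and the borough column is written as `BOROUGH` rather than the `ï..BOROUGH` name that appears in the code above.

```r
# Colour for each borough, matching the colours used in the maps above.
borough_colours <- c(
  "BRONX"         = "red",
  "BROOKLYN"      = "blue",
  "MANHATTAN"     = "green",
  "QUEENS"        = "yellow",
  "STATEN ISLAND" = "orange"
)

# Toy stand-in for permit2 (illustrative coordinates only).
toy <- data.frame(
  BOROUGH   = factor(c("BRONX", "BROOKLYN", "STATEN ISLAND")),
  LONGITUDE = c(-73.87, -73.95, -74.15),
  LATITUDE  = c(40.85, 40.65, 40.58)
)

# as.character() matters: indexing a named vector by a factor would use its integer codes.
toy$colour <- unname(borough_colours[as.character(toy$BOROUGH)])
print(toy)

# With leaflet loaded, one layer could then replace the five separate calls, e.g.:
# leaflet(toy) %>%
#   addCircleMarkers(lng = ~LONGITUDE, lat = ~LATITUDE, color = ~colour, radius = 0.3)
```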

The residential buildings under construction in Manhattan show no obvious spatial pattern, whereas the constructions in the other boroughs seem to be clustered in the areas that neighbour Manhattan. This makes sense, as Manhattan has a lot of offices and people would like to live in areas that are close to their workplace. On Staten Island, most constructions are centered around the north side. This part is closest to Manhattan and is accessible by the Staten Island Ferry. However, part of the constructions seem to be coming up only on one side of the island, which is an interesting observation.

Let’s see how the position of these buildings compares to the locations of the NYC subway stations.

m <- leaflet() %>%
  addTiles() %>% 
  addPolygons(fillColor = ~pal(num_points), popup = ~borough,data=map_data) %>% 
  addCircleMarkers(data=points_bronx1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="RED",  fillColor="red", stroke = TRUE, fillOpacity = 0.8, group="Red")%>%
  addCircleMarkers(data=points_brooklyn1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="blue",  fillColor="blue", stroke = TRUE, fillOpacity = 0.8, group="blue")%>%
  addCircleMarkers(data=points_manhattan1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="green",  fillColor="green", stroke = TRUE, fillOpacity = 0.8, group="green")%>%
  addCircleMarkers(data=points_queens1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="yellow",  fillColor="yellow", stroke = TRUE, fillOpacity = 0.8, group="yellow")%>%
  addCircleMarkers(data=points_staten1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="orange",  fillColor="orange", stroke = TRUE, fillOpacity = 0.8, group="orange")%>%
     addCircleMarkers(data=subway, lng=~longitude, lat=~latitude , color="pink",radius=0.1 ,   stroke = TRUE, fillOpacity = 0.8)%>%
  addProviderTiles("CartoDB.Positron")  %>% setView(-74.00, 40.71, zoom = 10.5)

m

The map shows that quite a lot of the constructions are in the vicinity of subway stations.
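
This visual impression could be put into numbers by computing, for each construction site, the distance to its nearest subway station. The sketch below does this in base R with the haversine formula on made-up coordinates. The function names, the toy data frames and the 500 m threshold are all assumptions for illustration; the geosphere package loaded earlier provides `distHaversine()` for the same purpose.

```r
# Great-circle distance in metres between points given in degrees.
haversine_m <- function(lng1, lat1, lng2, lat2, r = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlng <- (lng2 - lng1) * to_rad
  a <- sin(dlat / 2)^2 + cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlng / 2)^2
  2 * r * asin(pmin(1, sqrt(a)))
}

# Distance from every site to the closest station.
nearest_station_m <- function(sites, stations) {
  vapply(seq_len(nrow(sites)), function(i) {
    min(haversine_m(sites$lng[i], sites$lat[i], stations$lng, stations$lat))
  }, numeric(1))
}

# Toy coordinates (not taken from the permit or subway files).
sites    <- data.frame(lng = c(-73.9857, -74.1502), lat = c(40.7484, 40.5795))
stations <- data.frame(lng = c(-73.9855, -73.9442), lat = c(40.7480, 40.8075))

d <- nearest_station_m(sites, stations)
share_within_500m <- mean(d <= 500)  # 500 m is an arbitrary walking-distance cutoff
print(round(d))
print(share_within_500m)
```

Applied to the real data, the share of sites within a chosen walking distance would give a concrete figure to report alongside the map.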

Non-residential Buildings

permit2 <- na.omit(permit_data)
permit2 <- permit2[permit2$Permit.Status=="IN PROCESS",]
permit2 <- permit2[permit2$Residential != "YES",]
permit2 <- permit2[permit2$Job.Type =="NB",]
r <- GET('http://data.beta.nyc//dataset/68c0332f-c3bb-4a78-a0c1-32af515892d6/resource/7c164faa-4458-4ff2-9ef0-09db00b509ef/download/42c737fd496f4d6683bba25fb0e86e1dnycboroughboundaries.geojson')
nyc_neighborhoods <- readOGR(content(r,'text'), 'OGRGeoJSON', verbose = F)

## No encoding supplied: defaulting to UTF-8.

points <- data.frame(lat=permit2$LATITUDE, lng=permit2$LONGITUDE)
points_spdf <- points
coordinates(points_spdf) <- ~lng + lat
proj4string(points_spdf) <- proj4string(nyc_neighborhoods)
matches <- over(points_spdf, nyc_neighborhoods)
points <- cbind(points, matches)
points_by_borough <- points %>%
  group_by(borough) %>%
  summarize(num_points=n())

map_data <- geo_join(nyc_neighborhoods, points_by_borough, "borough", "borough")

points_bronx1 = permit2[permit2$ï..BOROUGH=="BRONX",]
points_brooklyn1 = permit2[permit2$ï..BOROUGH=="BROOKLYN",]
points_manhattan1 = permit2[permit2$ï..BOROUGH=="MANHATTAN",]
points_queens1 = permit2[permit2$ï..BOROUGH=="QUEENS",]
points_staten1 = permit2[permit2$ï..BOROUGH=="STATEN ISLAND",]

pal <- colorNumeric(palette = "Blues",
                    domain = range(map_data@data$num_points, na.rm=T))

n <- leaflet() %>%
  addTiles() %>% 
  addPolygons(fillColor = ~pal(num_points), popup = ~borough,data=map_data) %>% 
  addCircleMarkers(data=points_bronx1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="RED",  fillColor="red", stroke = TRUE, fillOpacity = 0.8, group="Red")%>%
  addCircleMarkers(data=points_brooklyn1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="blue",  fillColor="blue", stroke = TRUE, fillOpacity = 0.8, group="blue")%>%
  addCircleMarkers(data=points_manhattan1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="green",  fillColor="green", stroke = TRUE, fillOpacity = 0.8, group="green")%>%
  addCircleMarkers(data=points_queens1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="yellow",  fillColor="yellow", stroke = TRUE, fillOpacity = 0.8, group="yellow")%>%
  addCircleMarkers(data=points_staten1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="orange",  fillColor="orange", stroke = TRUE, fillOpacity = 0.8, group="orange")%>%
  addProviderTiles("CartoDB.Positron")  %>% setView(-74.00, 40.71, zoom = 10.5)
n

The distribution of non-residential buildings seems to be mostly random in placement with no specific pattern. Let us compare it with the subway map.

m <- leaflet() %>%
  addTiles() %>% 
  addPolygons(fillColor = ~pal(num_points), popup = ~borough,data=map_data) %>% 
  addCircleMarkers(data=points_bronx1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="RED",  fillColor="red", stroke = TRUE, fillOpacity = 0.8, group="Red")%>%
  addCircleMarkers(data=points_brooklyn1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="blue",  fillColor="blue", stroke = TRUE, fillOpacity = 0.8, group="blue")%>%
  addCircleMarkers(data=points_manhattan1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="green",  fillColor="green", stroke = TRUE, fillOpacity = 0.8, group="green")%>%
  addCircleMarkers(data=points_queens1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="yellow",  fillColor="yellow", stroke = TRUE, fillOpacity = 0.8, group="yellow")%>%
  addCircleMarkers(data=points_staten1, lng=~LONGITUDE, lat=~LATITUDE, radius=0.3 , color="orange",  fillColor="orange", stroke = TRUE, fillOpacity = 0.8, group="orange")%>%
 addCircleMarkers(data=subway, lng=~longitude, lat=~latitude , color="pink",radius=0.1 ,  stroke = TRUE, fillOpacity = 0.8)%>%
  addProviderTiles("CartoDB.Positron")  %>% setView(-74.00, 40.71, zoom = 10.5)

m

The location of the non-residential buildings, although without a specific pattern, does coincide with the positions of the subway stations.
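
A claim like this could be checked more formally by comparing nearest-station distances for the construction sites against those of a baseline, for example randomly placed points in the same area. The sketch below uses simulated distances only, so its result says nothing about the real data; the variable names and the choice of a one-sided Wilcoxon rank-sum test are assumptions, not the method used in this report.

```r
set.seed(42)  # simulated distances in metres, for illustration only

# Pretend construction sites tend to sit closer to stations than random points do.
site_dist_m     <- rexp(200, rate = 1 / 300)
baseline_dist_m <- rexp(200, rate = 1 / 900)

# One-sided test: are site distances shifted below the baseline distances?
res <- wilcox.test(site_dist_m, baseline_dist_m, alternative = "less")
print(res$p.value)
```

A rank-based test is used here because distances are typically right-skewed; a small p-value would support the idea that sites are closer to stations than chance placement would suggest.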

Conclusion

Upcoming residential constructions in NYC are mostly centered around the areas surrounding Manhattan, with a few variations in this pattern, especially around Staten Island. The pattern of construction can be further explained by using the locations of subway stations. The constructions that are further away from Manhattan are usually close to subway stations or close to the ferry terminal (Staten Island).

Non-residential constructions have no apparent pattern on simple observation; they are located all over the boroughs. However, when they are compared with subway station locations, most constructions are around these stations, even if away from Manhattan.

The only unexplained construction is in Queens, in areas which are away from Manhattan as well as away from subway stations. It is possible that another factor, such as school locations, is responsible. This can be verified with more data.

The validity of these claims can be tested using hypothesis testing. It was not