Introduction

This is the final project for IS607 Data Acquisition and Management class in the M.S. of Data Analytics program at The City University of New York.

This is an observational study.

Motivation

The motivation of this project is to understand the relationship between crime and property values in New York City. The majority of the project team are New York City residents and this is of interest.

The results of this study could be useful to those interested in purchasing real estate in New York City.

The Data

The data were obtained from two sources. Crime data were collected from New York City Police Department complaint data. The crime dataset includes all valid felony, misdemeanor, and violation crimes reported to the New York City Police Department (NYPD) during 2015. These data represent criminal offenses according to New York State Penal Law definitions, not FBI Uniform Crime Report definitions, and are therefore not comparable to UCR reported crime.

Each row represents a complaint reported to NYPD. Only valid complaints are included in this release. Complaints deemed unfounded due to reporter error or misinformation are excluded from the data set, as they are not reflected in official figures nor are they considered to have actually occurred in a criminal context. Similarly, complaints that were voided due to internal error are also excluded from the data set.

Property sales data from August to November 2016 were scraped from the real estate listing site Trulia.

Finally, a dissaggregate analysis was also conducted using data from Zillow. Data sources and some summary observations are in the disaggregate analysis section.

Obtain the data

#library(plyr)
library(dplyr)
library(corrplot)
library(ggmap)
library(RMySQL)
library(RCurl)
library(XML)
library(tidyr)
library(stringr)
library(ggplot2)
library(geosphere)
temp <- tempfile()
download.file("https://raw.githubusercontent.com/cunyauthor/FinalProject/master/NYPD_Complaint_Data_Historic.zip",temp)
NYPD <- read.csv(unz(temp, "NYPD_Complaint_Data_Historic.csv"), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)
unlink(temp)

str(NYPD)
head(NYPD)

Scrub and explore the data

A subset of the crime data was created with only variables required for the analysis. All missing data were removed from the dataset, the resulting dataset contained 4500,000+ cases.

A random sample was taken to create a sample of 11,700 cases. The sample was broken into 5 smaller samples in order to add Google API to match longitude and latitude in the data to street addresses. Google API allowed 2,500 free downloads per 24 hour period.

We had a challenge here as there is no single standard defining of violent crime, there are two standards, NCVS Bureau of Justice Statistics National Crime Victimization Survey and the UCR Federal Bureau of Investigation’s Uniform Crime Report and they were not fully compatible with the description of the NYC PD.

To solve this we create our own definition of violent crime, we filter from the main the follow crime categories: DANGEROUS WEAPONS,FELONY ASSAULT,KIDNAPPING & RELATED OFFENSES,MURDER & NON-NEGL. MANSLAUGHTER,ROBBERY and SEX CRIMES.

class(NYPD) #view class
dim(NYPD)   #view dimensions
names(NYPD) #explore the column names
unstand the structure
str(NYPD)  

new <- select(NYPD, RPT_DT, OFNS_DESC, OFNS_DESC, LAW_CAT_CD, BORO_NM, ADDR_PCT_CD,
              Latitude, Longitude, Lat_Lon) # data with required variables

sapply(new, function(x) sum(is.na(x))) # review NAs in the data
noNas <- na.omit(new) # remove NAs from data
sapply(noNas, function(x) sum(is.na(x)))
glimpse(noNas) #look at the data

summary(noNas) #summarize the data
Create datasets with Google API addresses, crime index
#set.seed(123) # set the seed to standardize
#nyc<-noNas[sample(nrow(noNas),458557),] # random the sample to standardize the sequence
#nyc$id<-1:458557 # create a id to check

# Create datasets with Google API
#nyc_1<-nyc[1:2400,]# 
#res <- mapply(FUN = function(lon, lat) {
        #revgeocode(c(lon, lat), output = "more")
#},
#nyc_1$Longitude, nyc_1$Latitude
#)
#res1<-rbind_all(lapply(res, as.data.frame))
#View(res1)
#nyc_1<-cbind(nyc_1,res1) # add full address to data base
#write.csv(nyc_51 file = 'nyc_1.csv', row.names = FALSE) #write to csv
#head(nyc_1)

#nyc_2<-nyc[2401:4800,]  
#res <- mapply(FUN = function(lon, lat) {
        #revgeocode(c(lon, lat), output = "more")
#},
#nyc_2$Longitude, nyc_2$Latitude
#)
#res1<-rbind_all(lapply(res, as.data.frame))
#View(res1)
#nyc_2<-cbind(nyc_2,res1) # add full address to data base
#write.csv(nyc_2, file = 'nyc_2.csv', row.names = FALSE) #write to csv
#head(nyc_2)

#nyc_3<-nyc[4801:6900,]# 2nd Dec Sharon sample # (for today)
#res <- mapply(FUN = function(lon, lat) {
        #revgeocode(c(lon, lat), output = "more")
#},
#nyc_3$Longitude, nyc_3$Latitude
#)
#res1<-rbind_all(lapply(res, as.data.frame))
#View(res1)
#nyc_3<-cbind(nyc_3,res1) # add full address to data base
#write.csv(nyc_3, file = 'nyc_3.csv', row.names = FALSE) #write to csv
#head(nyc_3)

#nyc_4<-nyc[7201:9600,]# 
#res <- mapply(FUN = function(lon, lat) {
        #revgeocode(c(lon, lat), output = "more")
#},
#nyc_4$Longitude, nyc_4$Latitude
#)
#res1<-rbind_all(lapply(res, as.data.frame))
#View(res1)
#nyc_4<-cbind(nyc_4,res1) # add full address to data base
#write.csv(nyc_4, file = 'nyc_4.csv', row.names = FALSE) #write to csv
#head(nyc_4)

#nyc_5<-nyc[9600:12000,]# 
#res <- mapply(FUN = function(lon, lat) {
        #revgeocode(c(lon, lat), output = "more")
#},
#nyc_5$Longitude, nyc_5$Latitude
#)
#res1<-rbind_all(lapply(res, as.data.frame))
#View(res1)
#nyc_5<-cbind(nyc_5,res1) # add full address to data base
#write.csv(nyc_5, file = 'nyc_5.csv', row.names = FALSE) #write to csv
#head(nyc_5)
# read the nyc data
urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/nyc_15.csv"
nyc_1<-read.csv(url(urlfile), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)

nyc_1$postal_code<-as.character(nyc_1$postal_code)

# create new column for violent crime 
nyc_1<-nyc_1[-10639,]
index<-c("DANGEROUS WEAPONS","FELONY ASSAULT","KIDNAPPING & RELATED OFFENSES","MURDER & NON-NEGL. MANSLAUGHTER","ROBBERY","SEX CRIMES")
nyc_1$crime_v<-index[(match(nyc_1$OFNS_DESC,index))]
nyc_1$crime_v<-as.factor(nyc_1$crime_v)

# agrregate the crime data by zip code
nyc_s1<-aggregate(OFNS_DESC ~ postal_code, nyc_1, length)
nyc_s2<-aggregate(crime_v ~ postal_code, nyc_1, length)

colnames(nyc_s1) <- c("zip","freq_crime")
colnames(nyc_s2) <- c("zip","freq_violent_crime")

# joint two tables of crime data
nyc_s12<-left_join(nyc_s1, nyc_s2, "zip")
# change NA for zero 
nyc_s12[is.na(nyc_s12)]<-0
# Crime table
head(nyc_s12)
##     zip freq_crime freq_violent_crime
## 1 10001        161                  8
## 2 10002        141                 10
## 3 10003        140                  4
## 4 10004         18                  0
## 5 10005          7                  1
## 6 10006          6                  0
#crime summary
summary(nyc_s12$freq_crime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   27.00   51.00   63.21   93.00  222.00
sd(nyc_s12$freq_crime)
## [1] 48.22886
summary(nyc_s12$freq_violent_crime)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    1.00    5.00    6.73   10.00   39.00
sd(nyc_s12$freq_violent_crime)
## [1] 6.985422
# Read data from real state 
urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/trulia_sqft.csv"
trulia<-read.csv(url(urlfile), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)

# Read the table
urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/zip_neig.csv"
zip_neig<-read.csv(url(urlfile), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)
 
#joint crime data with borough and neighborhood
 zip_neig$zip<-as.character(zip_neig$zip)
nyc_s12<-left_join(nyc_s12,zip_neig, "zip")
head(nyc_s12)
##     zip freq_crime freq_violent_crime   Borough        Neighborhood
## 1 10001        161                  8 Manhattan Chelsea and Clinton
## 2 10002        141                 10 Manhattan     Lower East Side
## 3 10003        140                  4 Manhattan     Lower East Side
## 4 10004         18                  0 Manhattan     Lower Manhattan
## 5 10005          7                  1 Manhattan     Lower Manhattan
## 6 10006          6                  0 Manhattan     Lower Manhattan
# Statistics summary of NYC prices
summary(trulia$Median_sales_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  400000  879000 1040000 1271000 1656000 3200000
summary(trulia$Price_sqft)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     515    1100    1374    1348    1684    2074
# Trulia table price by zip
head(trulia)
##   X   zip Median_sales_price Price_sqft
## 1 1 10001            1687500       2074
## 2 2 10002             875000       1216
## 3 3 10003            2395000       1837
## 4 4 10004            1260000       1450
## 5 5 10005            1513779       1404
## 6 6 10006             749000       1276
# Change zip to as.character
trulia$zip<-as.character(trulia$zip)
# Join the two tables
nyc_full<-inner_join(trulia,nyc_s12, "zip")
head(nyc_full)
##   X   zip Median_sales_price Price_sqft freq_crime freq_violent_crime
## 1 1 10001            1687500       2074        161                  8
## 2 2 10002             875000       1216        141                 10
## 3 3 10003            2395000       1837        140                  4
## 4 4 10004            1260000       1450         18                  0
## 5 5 10005            1513779       1404          7                  1
## 6 6 10006             749000       1276          6                  0
##     Borough        Neighborhood
## 1 Manhattan Chelsea and Clinton
## 2 Manhattan     Lower East Side
## 3 Manhattan     Lower East Side
## 4 Manhattan     Lower Manhattan
## 5 Manhattan     Lower Manhattan
## 6 Manhattan     Lower Manhattan
# read population by ZIP frm 2010 US Census
urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/pop_zip_census.csv"
zip_pop<-read.csv(url(urlfile), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)

colnames(zip_pop) <- c("zip","pop")
zip_pop$zip<-as.character(zip_pop$zip)

# Join the two tables, add population data
nyc_full<-inner_join(nyc_full,zip_pop, "zip")

# Include crime rate data per 1000.000 habit per year by zip
nyc_full$crime_rate<-(nyc_full$freq_crime/nyc_full$pop)*100000 * 458557/nrow(nyc_1)
nyc_full$v_crime_rate<-(nyc_full$freq_violent_crime/nyc_full$pop)*100000 * 458557/nrow(nyc_1)
#Database schema
#source("logincredentiasl.R")
#rmysql.settingsfile<-"C:/Program Files/MySQL/MySQL Server 5.0/my.ini"
#connection <- dbConnect(MySQL(),username="root", password="password")

#dbSendQuery(connection, 'CREATE SCHEMA IF NOT EXISTS SYS;')
#dbSendQuery(connection, 'USE SYS;')
#dbSendQuery(connection, 'DROP TABLE IF EXISTS nyc_1;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_full;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_s1;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_s12;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_s2;')
#dbSendQuery(connection, 'Drop Table If Exists trulia;')
#dbSendQuery(connection, 'Drop Table If Exists zip_pop;')

#urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/nyc_15.csv"
#nyc_1<-read.csv(url(urlfile), encoding="latin1", na.strings=c("","NA"), stringsAsFactors = F)

#dbWriteTable(connection, "tbl_nyc1", nyc_1, append = TRUE, row.names = FALSE)
#dbSendQuery(connection, "ALTER TABLE tbl_nyc1
            #MODIFY COLUMN id varchar(100) NOT NULL,            
            #MODIFY COLUMN RPT_DT varchar(100) NOT NULL,
            #MODIFY COLUMN OFNS_DESC varchar(100) NOT NULL,
            #MODIFY COLUMN BORO_NM varchar(100) NOT NULL,
            #MODIFY COLUMN ADDR_PCT_CD varchar(100) NULL,
            #MODIFY COLUMN Latitude varchar(100) NULL,
            #MODIFY COLUMN Longitude varchar(100) NULL,
            #MODIFY COLUMN Lat_Lon varchar(100) NULL,
            #MODIFY COLUMN address varchar(200) NULL,
            #MODIFY COLUMN street_number varchar(100) NULL,
            #MODIFY COLUMN route varchar(100) NULL,
            #MODIFY COLUMN neighborhood varchar(100) NULL,
            #MODIFY COLUMN political varchar(100) NULL,
            #MODIFY COLUMN administrative_area_level_2 varchar(100) NULL,
            #MODIFY COLUMN administrative_area_level_1 varchar(100) NULL,
            #MODIFY COLUMN country varchar(100) NULL,
            #MODIFY COLUMN postal_code varchar(100) NULL,
            #MODIFY COLUMN postal_code_suffix varchar(100) NULL,
            #MODIFY COLUMN locality varchar(100) NULL,
            #MODIFY COLUMN premise varchar(100) NULL,
            #MODIFY COLUMN crime_v  varchar(150) NULL
            #;")


#urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/pop_zip_census.csv"
#zip_pop<-read.csv(url(urlfile), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)
#colnames(zip_pop) <- c("zip","pop")# rename variables
#dbWriteTable(connection, "tbl_zip_pop", zip_pop, append = TRUE, row.names = FALSE)
#dbSendQuery(connection, "ALTER TABLE tbl_zip_pop
            #MODIFY COLUMN zip varchar(10) NOT NULL,
            #MODIFY COLUMN pop varchar(50) NOT NULL
            #;")

#urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/zip_neig.csv"
#zip_neig<-read.csv(url(urlfile), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)

#dbWriteTable(connection, "tbl_zip_neig", zip_neig, append = TRUE, row.names = FALSE)
#dbSendQuery(connection, "ALTER TABLE tbl_zip_neig
            #MODIFY COLUMN Borough varchar(100) NOT NULL,
            #MODIFY COLUMN Neighborhood varchar(100) NOT NULL,
            #MODIFY COLUMN zip varchar(10) NOT NULL
            #;")

#urlfile<-"https://raw.githubusercontent.com/cunyauthor/FinalProject/master/trulia_sqft.csv"
#trulia<-read.csv(url(urlfile), encoding="UTF-8", na.strings=c("","NA"), stringsAsFactors = F)

#dbWriteTable(connection, "tbl_trulia", trulia, append = TRUE, row.names = FALSE)
#dbSendQuery(connection, "ALTER TABLE tbl_trulia
            #MODIFY COLUMN x varchar(10) NOT NULL,
            #MODIFY COLUMN zip varchar(10) NOT NULL,
            #MODIFY COLUMN Median_sales_price varchar(50) NOT NULL,
            #MODIFY COLUMN Price_sqft varchar(50) NOT NULL
            #;")

#dbSendQuery(connection, 'DROP TABLE IF EXISTS nyc_1;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_full;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_s1;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_s12;')
#dbSendQuery(connection, 'Drop Table If Exists nyc_s2;')
#dbSendQuery(connection, 'Drop Table If Exists trulia;')
#dbSendQuery(connection, 'Drop Table If Exists zip_pop;')
#dbDisconnect(connection)

Model the Data

#Correlation matrix
nyc_full$X <- NULL
m1<- subset(nyc_full, select=-c(zip,Borough,Neighborhood))
colnames(m1)<-c("median crime", "price sqr/ft", "count crime", "count violent crime",
                 "population", "crime rate", "violent crime rate") 
m<-cor(m1)
corrplot(m, type="upper", order="hclust", t1.col="black",t1.srt=45)

# Scatterplot with regression line
ggplot(nyc_full,aes(y = Price_sqft, x = freq_violent_crime)) +
    geom_point() + ggtitle("Sales Price per Square Foot vs Violent Crime by ZIP") + 
    xlab("Violent Crime") + ylab("Price per Square Foot") + 
    geom_smooth(method="lm", fill=NA) +
    theme(plot.title = element_text(color="blue", size=14, face="bold"),
          axis.title.x = element_text(color="black", size=10, face="bold"),
          axis.title.y = element_text(color="black", size=10, face="bold"))

# Regression analysis
lmt<-lm(Price_sqft ~ freq_violent_crime, nyc_full)
summary(lmt)
## 
## Call:
## lm(formula = Price_sqft ~ freq_violent_crime, data = nyc_full)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -675.57 -267.43    6.02  244.03  817.33 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1519.88      82.50   18.42  < 2e-16 ***
## freq_violent_crime   -32.90      11.23   -2.93  0.00552 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 379.8 on 41 degrees of freedom
## Multiple R-squared:  0.1731, Adjusted R-squared:  0.153 
## F-statistic: 8.585 on 1 and 41 DF,  p-value: 0.005518

Dissaggregate Analysis

Goals and Aims

In addition to the aggregate analysis (focusing on median prices at the zip code level), we wanted to understand the impact of crime and other attibutes on property values(number of bedrooms, bathrooms, distance from the Central Business District and “neigbhorhood premium”) at a disaggregate level. Due to Zillow API call limits and time the analysis focused on Northwest and Central Brooklyn, in New York City - however the approach is very general and can easily be extended to other areas of Brooklyn and even New York City.

Phase I: Getting the data

The data is collected from different sources.
1. Planning data from Pluto for Addresses (input to Zillow) 2. Real Estate data from Zillow (using Zillow’s deepsearch API)
3. Crime data from NYPD

i) Planning database

Pluto data from NYC planning department is used to provide addresses to get the latest data from Zillow (source:http://www1.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page). Addresses obtained are then used to make call from the Zillow API.
The code below identifies how these addresses were selected - primarily using building codes. Since the zillow API places a constraint of 1000 calls per day and due to limited time, the analysis

#source pluto database from NYC
plutodata<-read.csv(text=getURL("https://raw.githubusercontent.com/cunyauthor/FinalProject/master/BK.csv"),header=TRUE,stringsAsFactors=FALSE) 

#focus analysis on certain zip codes
#central brooklyn, northwest brooklyn and greenpoint
#filter by zip code
nw<-c("11201", "11205", "11215", "11217", "11231")
gp<-c("11211", "11222")
cb<-c("11212", "11213", "11216", "11233", "11238")

#select residential buildings 
#filter data by building types and codes
#two family
fmly<-c("B1","B2","B3")
#walkup Apartments
walkup<-c("C0","C1","C2","C3","C4","C5","C6","C7")
# Elevator Apartments
elev<-c("D0","D1","D3","D4","D5","D6","D7","D8")
#Condominiums
condo<-c("R1","R2","R3","R4","R6","RD","RM","RR")
#combined buildings
# limiting type of buildings due to constraints imposed by zillow of 1000 API calls per day
buildings<-c(elev,condo)

#select neighborhood areas and building types
selectdata<-plutodata %>% filter(ZipCode==nw[1] | ZipCode==nw[2] | ZipCode==nw[3] | ZipCode==nw[4]| ZipCode==nw[5] | ZipCode==cb[1] | ZipCode==cb[2]| ZipCode==cb[3]| ZipCode==cb[4] | ZipCode==cb[5])

#select buidings
finaldata<-selectdata[(selectdata$BldgClass %in% buildings),]
#select columns
finaldata<-select(finaldata,one_of(c("CD","CT2010","CB2010","SchoolDist","ZipCode","Address","BldgClass","UnitsRes","YearBuilt","NumFloors")))

# dimensions of data
dim(finaldata)
## [1] 1992   10
# head of the data
head(finaldata)
##     CD CT2010 CB2010 SchoolDist ZipCode              Address BldgClass
## 3  302     21   2000         13   11201        1 JOHN STREET        RM
## 22 302     21   2002         13   11201  135 PLYMOUTH STREET        D5
## 29 302     21   3008         13   11201  183 PLYMOUTH STREET        RM
## 45 302     21   1010         13   11201 26 WASHINGTON STREET        D7
## 46 302     21   1010         13   11201        1 MAIN STREET        RM
## 47 302     21   2009         13   11201 25 WASHINGTON STREET        D7
##    UnitsRes YearBuilt NumFloors
## 3        42         0        12
## 22      100      1900         7
## 29       10      1920         6
## 45       94      1930         7
## 46      124      1913        15
## 47      106      1902         8
#convert factors to character types
finaldata$Address<-as.character(finaldata$Address)
finaldata$BldgClass<-as.character(finaldata$BldgClass)
#structure of dataset
str(finaldata)
## 'data.frame':    1992 obs. of  10 variables:
##  $ CD        : int  302 302 302 302 302 302 302 302 302 302 ...
##  $ CT2010    : num  21 21 21 21 21 21 21 21 21 21 ...
##  $ CB2010    : int  2000 2002 3008 1010 1010 2009 2009 2009 3009 3009 ...
##  $ SchoolDist: int  13 13 13 13 13 13 13 13 13 13 ...
##  $ ZipCode   : int  11201 11201 11201 11201 11201 11201 11201 11201 11201 11201 ...
##  $ Address   : chr  "1 JOHN STREET" "135 PLYMOUTH STREET" "183 PLYMOUTH STREET" "26 WASHINGTON STREET" ...
##  $ BldgClass : chr  "RM" "D5" "RM" "D7" ...
##  $ UnitsRes  : int  42 100 10 94 124 106 13 52 74 6 ...
##  $ YearBuilt : int  0 1900 1920 1930 1913 1902 1930 2006 1905 1900 ...
##  $ NumFloors : num  12 7 6 7 15 8 6 12 7 5 ...
# Count of Zipcodes covered
finaldata %>% count(ZipCode)
## # A tibble: 10 × 2
##    ZipCode     n
##      <int> <int>
## 1    11201   321
## 2    11205   275
## 3    11212    65
## 4    11213   116
## 5    11215   338
## 6    11216   128
## 7    11217   198
## 8    11231   155
## 9    11233    78
## 10   11238   318
# Count of Building Class
finaldata %>% count(BldgClass)
## # A tibble: 15 × 2
##    BldgClass     n
##        <chr> <int>
## 1         D0     3
## 2         D1   318
## 3         D3    78
## 4         D4   173
## 5         D5    26
## 6         D6    44
## 7         D7   110
## 8         D8     8
## 9         R1   504
## 10        R2   112
## 11        R3    66
## 12        R4   147
## 13        R6   150
## 14        RM   251
## 15        RR     2

ii) Zillow Data

Addresses from the New York City Pluto database are used to make API calls to Zillow. Addresses are a required input to the Zillow API. As part of the analysis we define three functions to collect data from Zillow.The code was modified and adapted (signficantly) from the following stackoverflow source (http://stackoverflow.com/questions/17349630/r-dataframe-from-xml-when-values-are-multiple-or-missing).
The zillowcaller function takes in as inputs addresses in zillow api call form, zip codes and the number of calls to make and the starting number of the call. The zillowfunction function takes as input the parsed XML from the API and then selects various attributes from the XML nodes. Finally the zillowaggregator combined all the attributes into a single dataframe.
We use the GetDeepSearch API from zillow for this analysis.

#select all the addresses
address<-finaldata$Address
#make it to zillow form
address<-str_replace_all(address," ","+")
#city state to zillow form
citystate<-str_c("Brooklyn%2C+NY+",finaldata$ZipCode)

#URL for the deep search API at Zillow
zillow_url<-"http://www.zillow.com/webservice/GetDeepSearchResults.htm?"

#define a function to make api calls
zillowcaller<-function(add,zip,nos,start,zillowadd){
for (i in 1:nos) {
  feed<-getURL(paste0(zillowadd,"zws-id=",getOption("zillowid"),"&address=",add[i+start],"&citystatezip=",zip[i+start]))
  Sys.sleep(1)
  if(i==1){
    last<-xmlParse(feed)
  }
  else {
    temp<-xmlParse(feed)
    last<-c(last,temp)
  }
}
  return(last)
}


# define zillow function to parse and extract relevant attributes from zillow api calls
zillowfunction<-function(xmldata){
do.call(rbind, xpathApply(xmldata, "//result", function(node) {

   zpid<-xmlValue(node[["zpid"]])
   bathrooms <- xmlValue(node[["bathrooms"]])
   bedrooms <- xmlValue(node[["bedrooms"]])
   sqft<-xmlValue(node[["finishedSqFt"]])
   st <- xpathSApply(node,"./address/street", xmlValue)
   zip <- xpathSApply(node,"./address/zipcode", xmlValue)
   city<-xpathSApply(node,"./address/city", xmlValue)
   state<-xpathSApply(node,"./address/state", xmlValue)
   lat<-xpathSApply(node,"./address/latitude", xmlValue)
   long<-xpathSApply(node,"./address/longitude", xmlValue)
   valuation<-xpathSApply(node,"./zestimate/amount", xmlValue)
   lowval<-xpathSApply(node,"./zestimate/valuationRange/low", xmlValue)
   highval<-xpathSApply(node,"./zestimate/valuationRange/high", xmlValue)
   nb<-xpathSApply(node,"//result/localRealEstate/region",xmlAttrs)[[1]]
   nbid<-xpathSApply(node,"//result/localRealEstate/region",xmlAttrs)[[2]]
   return(data.frame(zpid,bathrooms, bedrooms,sqft,st,city,state,lat,long,valuation,lowval,highval,nb,nbid, stringsAsFactors = FALSE))
}))
}

#zillow aggregator function
zillowaggregator <- function(call){
for (i in 2:length(call)) {
  if(i==2) {
    output<-rbind(zillowfunction(call[[1]]),zillowfunction(call[[2]]))
  }
else {
  output<-rbind(output,zillowfunction(call[[i]]))
     }
}
  return(output)
}

We make the zillow api calls below to collect the data. However, due to call limits we have already collected the data and stored on github for this analysis.

#return a list of calls to api
#daily limit of 1000 calls  

#calls were made over 2 days to meet the limits

#firstcall<-zillowcaller(address,citystate,40,0,zillow_url)
#zillowdata<-zillowaggregator(firstcall)
#secondcall<-zillowcaller(address,citystate,980,40,zillow_url)
#two<-zillowaggregator(secondcall)
#thirdcall<-zillowcaller(address,citystate,972,1020,zillow_url)
#three<-zillowaggregator(thirdcall)
#zillowdata<-rbind(zillowdata,three,two)

#read in the precollected data from github
zillowdata<-read.csv(text=getURL("https://raw.githubusercontent.com/cunyauthor/FinalProject/master/zillowdata.csv"),stringsAsFactors = FALSE)
zillowdata<-rename(zillowdata,number=X)

Phase II: Cleaning the data

zillowdata$zpid<-as.numeric(zillowdata$zpid)
zillowdata$nbid<-as.numeric(zillowdata$nbid)
zillowdata$valuation<-as.numeric(zillowdata$valuation)
zillowdata$lowval<-as.numeric(zillowdata$lowval)
zillowdata$highval<-as.numeric(zillowdata$highval)
zillowdata$bedrooms<-as.integer(zillowdata$bedrooms)
zillowdata$bathrooms<-as.numeric(zillowdata$bathrooms)
zillowdata$sqft<-as.numeric(zillowdata$sqft)
str(zillowdata)
## 'data.frame':    4183 obs. of  15 variables:
##  $ number   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ zpid     : num  2.10e+09 1.22e+08 1.22e+08 1.22e+08 6.26e+07 ...
##  $ bathrooms: num  4 NA NA NA 1 1 1 2 1 1 ...
##  $ bedrooms : int  4 NA NA NA 1 1 1 2 1 1 ...
##  $ sqft     : num  3655 2156 3056 NA 2200 ...
##  $ st       : chr  "1 John St" "183 Plymouth St" "183 Plymouth St" "183 Plymouth St" ...
##  $ city     : chr  "Brooklyn" "Brooklyn" "Brooklyn" "Brooklyn" ...
##  $ state    : chr  "NY" "NY" "NY" "NY" ...
##  $ lat      : num  40.7 40.7 40.7 40.7 40.7 ...
##  $ long     : num  -74 -74 -74 -74 -74 ...
##  $ valuation: num  8073177 3875849 4865945 4521751 NA ...
##  $ lowval   : num  7588786 3682057 4622648 4295663 NA ...
##  $ highval  : num  8557568 4069641 5109242 4747839 NA ...
##  $ nb       : chr  "DUMBO" "DUMBO" "DUMBO" "DUMBO" ...
##  $ nbid     : num  270841 270841 270841 270841 270841 ...
head(zillowdata)
##   number       zpid bathrooms bedrooms sqft              st     city state
## 1      1 2096383529         4        4 3655       1 John St Brooklyn    NY
## 2      2  121711481        NA       NA 2156 183 Plymouth St Brooklyn    NY
## 3      3  121711954        NA       NA 3056 183 Plymouth St Brooklyn    NY
## 4      4  122026423        NA       NA   NA 183 Plymouth St Brooklyn    NY
## 5      5   62646114         1        1 2200       1 Main St Brooklyn    NY
## 6      6 2099573457         1        1   NA       1 Main St Brooklyn    NY
##        lat      long valuation  lowval highval    nb   nbid
## 1 40.70513 -73.98749   8073177 7588786 8557568 DUMBO 270841
## 2 40.70378 -73.98589   3875849 3682057 4069641 DUMBO 270841
## 3 40.70378 -73.98589   4865945 4622648 5109242 DUMBO 270841
## 4 40.70378 -73.98589   4521751 4295663 4747839 DUMBO 270841
## 5 40.70356 -73.99021        NA      NA      NA DUMBO 270841
## 6 40.70340 -73.99062   1411114 1298225 1524003 DUMBO 270841
#select data to ensure unique observations
zillowdata<-zillowdata[(zillowdata$zpid %in% unique(zillowdata$zpid)),]
##select data where valuation and bedrooms are non-missing
zillow_clean<-zillowdata[!is.na(zillowdata$bedrooms)&!is.na(zillowdata$valuation),]
#remove homes greater than 5 bedrooms
zillow_clean<-filter(zillow_clean,bedrooms<=5)
# remove homes greater $10 million
zillow_clean<-filter(zillow_clean,valuation<=10000000)
# remove neighborhoods with less than 10 observations
neighborhoods<-c("Canarsie","Williamsburg","Ocean Hill","Brownsville","Chelsea","Vinegar Hill","Garment District","Greenwood","Red Hook","Wingate")

zillow_clean<-zillow_clean[!(zillow_clean$nb %in% neighborhoods),]

Calculating Distance

Property prices have a distance and accessibility premium. Residential properties

# Identify a point in Manhattan
#location refers to mid-town Manhattan
ref_lat=40.75047
ref_long=-73.98961
ref_loc<-c(ref_long,ref_lat)
zillow_clean$ref_long<-ref_long
zillow_clean$ref_lat<-ref_lat

# calculate "crow-fly" distance between mid-town and property location 
zillow_clean$distance <- distGeo(zillow_clean[,c('long','lat')], zillow_clean[,c('ref_long','ref_lat')],a=6378137,f=1/298.257223563
)
#convert distance to miles from meters
zillow_clean$distance<-zillow_clean$distance*0.000621371

Summary statistics

Summary statistics from the data.

zillow_clean %>% count(nb) %>% arrange(desc(n))
## # A tibble: 16 × 2
##                                     nb     n
##                                  <chr> <int>
## 1                     Brooklyn Heights   282
## 2                           Park Slope   271
## 3                         Clinton Hill   196
## 4                          Fort Greene   191
## 5                        Crown Heights   172
## 6                   Bedford Stuyvesant   142
## 7                     Prospect Heights    78
## 8                                DUMBO    66
## 9                              Gowanus    50
## 10                         Boerum Hill    49
## 11                     Carroll Gardens    38
## 12 Columbia Street Waterfront District    18
## 13                         Cobble Hill    15
## 14                       East Flatbush    13
## 15                            Downtown    12
## 16                     Windsor Terrace    12
zillow_clean %>% count(bedrooms)
## # A tibble: 6 × 2
##   bedrooms     n
##      <int> <int>
## 1        0   198
## 2        1   631
## 3        2   544
## 4        3   187
## 5        4    38
## 6        5     7
zillow_clean %>% count(bathrooms)%>% arrange(desc(n))
## # A tibble: 10 × 2
##    bathrooms     n
##        <dbl> <int>
## 1        1.0  1093
## 2        2.0   341
## 3        1.5    53
## 4        3.0    41
## 5         NA    30
## 6        2.5    26
## 7        4.0    10
## 8        3.5     8
## 9        0.5     2
## 10      15.0     1
summarize(group_by(zillow_clean,nb),med=median(valuation),avg=mean(valuation))%>% arrange(desc(med))
## # A tibble: 16 × 3
##                                     nb       med       avg
##                                  <chr>     <dbl>     <dbl>
## 1  Columbia Street Waterfront District 1720725.5 1782031.9
## 2                              Gowanus 1526602.5 1431016.1
## 3                     Brooklyn Heights 1455288.5 2106239.9
## 4                             Downtown 1355506.5 1478971.3
## 5                      Carroll Gardens 1349613.0 1626519.6
## 6                           Park Slope 1294308.0 1565576.4
## 7                          Boerum Hill 1284026.0 1547857.5
## 8                                DUMBO 1257340.5 1487110.2
## 9                     Prospect Heights 1094135.0 1301814.3
## 10                  Bedford Stuyvesant 1014597.0 1156665.2
## 11                         Fort Greene 1011282.0 1193400.2
## 12                        Clinton Hill  961819.5 1123980.0
## 13                         Cobble Hill  938279.0 1575225.0
## 14                     Windsor Terrace  921110.0  956955.2
## 15                       Crown Heights  845173.0 1015211.5
## 16                       East Flatbush  701231.0  707515.9
summarize(group_by(zillow_clean,bedrooms),med=median(valuation),avg=mean(valuation))%>% arrange(desc(med))
## # A tibble: 6 × 3
##   bedrooms     med     avg
##      <int>   <dbl>   <dbl>
## 1        4 1957341 2891292
## 2        3 1804316 2212880
## 3        5 1646615 3826170
## 4        2 1253066 1425014
## 5        0  938395 1176441
## 6        1  925172 1188206

Conducting some plots and visualizations

The visualizations below show the valuation of the properties. Valuations show a very long tail and is similar to a log normal distribution

#plot of the valuation data
plot<-ggplot(data=zillow_clean,aes(x=valuation))
plot+geom_histogram(fill="blue",binwidth = 100000)+ggtitle("Zillow House Prices")

plot_lo<-ggplot(data=zillow_clean,aes(x=lowval))
plot_lo+geom_histogram(fill="orange",binwidth = 100000)+ggtitle("Zillow House Prices- Low Estimate")+xlab("Low Valuation")

plot_hi<-ggplot(data=zillow_clean,aes(x=highval))
plot_hi+geom_histogram(fill="darkgreen",binwidth = 100000)+ggtitle("Zillow House Prices- High Estimate")+xlab("High Valuation")

#convert to factor
zillow_clean$bedrooms<-as.factor(zillow_clean$bedrooms)
zillow_clean$bathrooms<-as.factor(zillow_clean$bathrooms)
#plot of square feet and valuation
plot<-ggplot(data=zillow_clean[zillow_clean$sqft<10000&!is.na(zillow_clean$sqft),],aes(x=sqft, y=valuation))
plot+geom_point(aes(color=bedrooms))+ggtitle("Valuations vs Sq.Ft")+ylab("Property Value in $")+scale_y_continuous(breaks=seq(0,10000000,2000000))

#plot of distance and valuation
plot<-ggplot(data=zillow_clean,aes(x=distance, y=valuation))
plot+geom_point(aes(color=bedrooms))+ggtitle("Valuations vs Distance")+ylab("Property Value in $")+xlab("Distance in miles")+scale_y_continuous(breaks=seq(0,10000000,2000000))

#Box plot of property prices 
ggplot(zillow_clean, aes(y=valuation, x=bedrooms)) + geom_boxplot(aes(color=bedrooms))+ylab("Property Value in $")+xlab("Bedrooms")+ggtitle("Box Plots of Property Value")

#Box plot of property sizes 
ggplot(zillow_clean[zillow_clean$sqft<10000&!is.na(zillow_clean$sqft),], aes(y=sqft, x=bedrooms)) + geom_boxplot(aes(color=bedrooms))+ylab("Size in Square Feet")+xlab("Bedrooms")+ggtitle("Box Plots of Property Size")

Findings and Conclusion

Summary Analysis

All data was joined by ZIP code, ZIP was the better standard to connect all data base, we try by neighborhood but each source had its own definition, so we abandoned. We aggregate all data by ZIP code.

There are differences between the full sales price among zip codes. The median sales price is different from mean, in some zipcode there is big change in the mean. The best central tendency here is median, which better reflect the reality.

The price per square/foot is more stable with lower difference between mean and median, here mean can be adopted.

In our analysis we focus in the price per square/foot this better to do comparison, remove the size, in the value

In the crime data by zip there are large variability, for both crimes. The variability illustrates some areas are more violent than others.

The times series plot by crime (See charts Tableau Public.) The mosaic plot shows, petit larceny is the most frequent, followed by harassment during the summer months so do overall crime rates.

Correlation Matrix Plot

The price square/foot has negative relationship with violent crime. The common crime doesn’t have relationship with price square/foot or median sales price. The price square/foot doesn’t have relationship with crime rate (crime/population).

Regression Analysis

There is negative relationship between violent crime and price square/foot, for each occurrence of violent crime the price drops US$ 32.90, at 0.05 significant level. There is a ($500) mean drop in price between areas with the most or least violent crimes. The crime frequency explain 17% of property value.

There is no relationship with price and crime rate or violent crime rate, this is an interesting finding, for people interested in how the crime impacts price. Is not taken into account in the number of people (population). This can lead to an incorrect crime perception.