1 Background

This tutorial page describes the final project we conducted for Professor Geoffrey Parker’s Spring 2020 Data Analytics course (ENGM 182) at Dartmouth College’s Thayer School of Engineering. The goal of the project was to create a compelling and approachable a visualization of historical NYC real estate sales and crime data, and an accompanying model for predicting the price of a hypothetical real estate sale.

We chose to use the R programming language to complete this task. The secions below describe the code and methodologies for acquiring, cleaning, and processing the data, as well as the code that drives the Shiny app visualization.

The project’s source code is available on Github, and the Shiny app is hosted for viewing on shinyapps.io. Our data sources can be downloaded at the links below:

NYC Crimes 2006-2017: https://data.cityofnewyork.us/Public-Safety/NYPD-Complaint-Data-Historic/qgea-i56i
NYC Real Estate Data 2003-2015: https://data.cityofnewyork.us/Housing-Development/NYC-Calendar-Sales-Archive-/uzf5-f8n2
NYC Real Estate Data 2016-2017: https://www.kaggle.com/new-york-city/nyc-property-sales
NYC Real Estate Past Year: https://www1.nyc.gov/site/finance/taxes/property-rolling-sales-data.page
NYC Census and Economic Data 2015: https://www.kaggle.com/muonneutrino/new-york-city-census-data

2 Importing Data

This section describes how we imported the data from each of our data sources into an R environment, where it can be then be easily cleaned and processed. Expand the tab below to see the directory structure we used for our downloaded data files.

Data directory structure

Data
├── Census_2015
│   ├── census_block_loc.csv
│   └── nyc_census_tracts.csv
├── Crimes_2006-2017
│   └── NYPD_Complaint_Data_Historic.csv
├── Crimes_2014-2015
│   ├── Crime_Column_Description.csv
│   ├── NYPD_Complaint_Data_Historic.csv
│   └── Population_by_Borough_NYC.csv
├── RE_2003-2015
│   ├── 2009_bronx.xls
│   ├── 2009_brooklyn.xls
│   ├── 2009_manhattan.xls
│   ├── 2009_queens.xls
│   ├── 2009_statenisland.xls
│   ├── 2010_bronx.xls
│   ├── 2010_brooklyn.xls
│   ├── 2010_manhattan.xls
│   ├── 2010_queens.xls
│   ├── 2010_statenisland.xls
│   ├── 2011_bronx.xls
│   ├── 2011_brooklyn.xls
│   ├── 2011_manhattan.xls
│   ├── 2011_queens.xls
│   ├── 2011_statenisland.xls
│   ├── 2012_bronx.xls
│   ├── 2012_brooklyn.xls
│   ├── 2012_manhattan.xls
│   ├── 2012_queens.xls
│   ├── 2012_statenisland.xls
│   ├── 2013_bronx.xls
│   ├── 2013_brooklyn.xls
│   ├── 2013_manhattan.xls
│   ├── 2013_queens.xls
│   ├── 2013_statenisland.xls
│   ├── 2014_bronx.xls
│   ├── 2014_brooklyn.xls
│   ├── 2014_manhattan.xls
│   ├── 2014_queens.xls
│   ├── 2014_statenisland.xls
│   ├── 2015_bronx.xls
│   ├── 2015_brooklyn.xls
│   ├── 2015_manhattan.xls
│   ├── 2015_queens.xls
│   ├── 2015_statenisland.xls
│   ├── glossary_rsf071607.pdf
│   ├── sales_2007_bronx.xls
│   ├── sales_2007_brooklyn.xls
│   ├── sales_2007_manhattan.xls
│   ├── sales_2007_queens.xls
│   ├── sales_2007_statenisland.xls
│   ├── sales_2008_bronx.xls
│   ├── sales_2008_brooklyn.xls
│   ├── sales_2008_manhattan.xls
│   ├── sales_2008_queens.xls
│   ├── sales_2008_statenisland.xls
│   ├── sales_bronx_03.xls
│   ├── sales_bronx_04.xls
│   ├── sales_bronx_05.xls
│   ├── sales_bronx_06.xls
│   ├── sales_brooklyn_03.xls
│   ├── sales_brooklyn_04.xls
│   ├── sales_brooklyn_05.xls
│   ├── sales_brooklyn_06.xls
│   ├── sales_manhattan_03.xls
│   ├── sales_manhattan_04.xls
│   ├── sales_manhattan_05.xls
│   ├── sales_manhattan_06.xls
│   ├── sales_queens_03.xls
│   ├── sales_queens_04.xls
│   ├── sales_queens_05.xls
│   ├── sales_queens_06.xls
│   ├── sales_si_03.xls
│   ├── sales_si_04.xls
│   ├── sales_si_05.xls
│   └── sales_si_06.xls
├── RE_2016-2017
│   └── nyc-rolling-sales.csv
└── RE_Apr2019-Mar2020
    ├── rollingsales_bronx.xls
    ├── rollingsales_brooklyn.xls
    ├── rollingsales_manhattan.xls
    ├── rollingsales_queens.xls
    └── rollingsales_statenisland.xls

2.1 Importing the Real Estate Data

The yearly real estate data we downloaded from the city of New York’s OpenData portal were formatted in Excel files (with the exception of that from 2016-2017, which were contained in a single CSV file, and 2018, which were missing entirely). Each Excel file contained all the real estate sales in a particular borough during a particular calendar year from 2003 to 2020.

To import the Excel file data into R, we start by creating a list of the all the filenames.

# Get list of all excel file names for 2003-2015 and 2019-2020 data
all_files = c(
    list.files(path="../Data/RE_2003-2015", pattern="*.xls", full.names=TRUE, recursive=FALSE),
    list.files(path="../Data/RE_Apr2019-Mar2020", pattern="*.xls", full.names=TRUE, recursive=FALSE)
)

Next, we create a custom function to pass to lapply() that will import the data from each file into a dataframe, and store all those dataframes into a corresponding list called dfs_list. Although all the data across every file were consistently formatted with the same column headers, frustratingly, they did not always start on the same row (but always started on one of the first 10 rows). Because of this, we had to do two reads of each file—an initial read of the first 10 rows and columns just to find the index of the first column header (titled “BOROUGH” in our case), and a second read to actually import the data starting from that index. This stack overflow post was very helpful in setting up this ‘double-read’ functionality. We also use clean_names() from the janitor package to make the column names easier to work with later on.

library(tidyverse)
library(readxl)
library(data.table)
library(janitor)

# Import data from all files into a list of datarames
dfs_list = lapply(all_files, function(filename) {
  
    # Setting up the first read
    temp_read = suppressMessages(read_excel(filename))
    desired_sheet = 1
    skip_rows = NULL
    col_skip = 0
    search_string = "BOROUGH"
    max_cols_to_search = 10
    max_rows_to_search = 10
    
    # First read to get the cell index of the first column header
    while (length(skip_rows) == 0) {
        col_skip = col_skip + 1
        if (col_skip == max_cols_to_search) break
        skip_rows = which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0
    }
  
    # Second read to actually import the data
    df = suppressMessages(read_excel(
        filename,
        sheet = desired_sheet,
        skip = skip_rows,
        .name_repair="universal"
    ) ) %>% clean_names()
  
    return(df)
})

The only data we still need to import are the 2016-2017 data from the CSV file. We append it to our list of dataframes before combining that list into a single dataframe with rbindlist() and saving it as in an RDA file for easy access later on.

# Append 2016-2017 data from csv file to list of data frames
dfs_list = append(dfs_list, list(read_csv("../Data/RE_2016-2017/nyc-rolling-sales.csv")[,-(1)] %>% clean_names()))

# Combine list of data frames into a single data frame and save as RDA file
df_housing = rbindlist(dfs_list)
save(df_housing, file="Data_Housing.rda")

Now that we have our housing data, we can move on to importing the crime data!

**df_housing**
borough	neighborhood	building_class_category	tax_class_at_present	block	lot	ease_ment	building_class_at_present	address	apartment_number	zip_code	residential_units	commercial_units	total_units	land_square_feet	gross_square_feet	year_built	tax_class_at_time_of_sale	building_class_at_time_of_sale	sale_price	sale_date
2	BATHGATE	01 ONE FAMILY HOMES	1	3035	5	NA	A9	451 EAST 178 STREET	NA	10457	1	0	1	1287	1992	1899	1	A9	0	2009-12-07
2	BATHGATE	01 ONE FAMILY HOMES	1	3046	42	NA	A1	2069 BATHGATE AVENUE	NA	10457	1	0	1	1964	1424	1899	1	A1	100	2009-04-28
2	BATHGATE	01 ONE FAMILY HOMES	1	3046	42	NA	A1	2069 BATHGATE AVE	NA	10457	1	0	1	1964	1424	1899	1	A1	100	2009-04-13
2	BATHGATE	01 ONE FAMILY HOMES	1	3048	19	NA	A5	2186 BATHGATE AVENUE	NA	10457	1	0	1	1768	1188	1901	1	A5	345752	2009-03-16
2	BATHGATE	01 ONE FAMILY HOMES	1	3048	28	NA	A1	540 EAST 182ND STREET	NA	10457	1	0	1	1209	1048	1901	1	A1	0	2009-05-14
2	BATHGATE	01 ONE FAMILY HOMES	1	3053	105	NA	A1	2323 BASSFORD AVENUE	NA	10458	1	0	1	922	1248	1901	1	A1	0	2009-01-22
2	BATHGATE	02 TWO FAMILY HOMES	1	3030	71	NA	B2	4443 PARK AVENUE	NA	10457	2	0	2	2352	1632	1899	1	B2	135000	2009-05-08
2	BATHGATE	02 TWO FAMILY HOMES	1	3030	76	NA	B2	4435 PARK AVENUE	NA	10457	2	0	2	2444	2125	1899	1	B2	325000	2009-03-30
2	BATHGATE	02 TWO FAMILY HOMES	1	3035	131	NA	B1	464 EAST 179TH	NA	10457	2	0	2	1790	2340	1998	1	B1	0	2009-10-07
2	BATHGATE	02 TWO FAMILY HOMES	1	3036	33	NA	B3	2091 WASHINGTON AVE	NA	10457	2	0	2	2022	3854	1899	1	B3	0	2009-05-08
2	BATHGATE	02 TWO FAMILY HOMES	1	3036	62	NA	B1	459 EAST 179 STREET	NA	10457	2	0	2	2000	2340	1998	1	B1	344500	2009-11-20
2	BATHGATE	02 TWO FAMILY HOMES	1	3036	62	NA	B1	459 EAST 179TH STREET	NA	10457	2	0	2	2000	2340	1998	1	B1	316474	2009-05-04
2	BATHGATE	02 TWO FAMILY HOMES	1	3037	6	NA	B2	4392 PARK AVENUE	NA	10457	2	0	2	2275	3240	1899	1	B2	359038	2009-01-19
2	BATHGATE	02 TWO FAMILY HOMES	1	3039	12	NA	B2	454 EAST 185TH STREET	NA	10458	2	0	2	1667	1944	1910	1	B2	183028	2009-11-02
2	BATHGATE	02 TWO FAMILY HOMES	1	3039	45	NA	B1	446 EAST 186 STREET	NA	10458	2	0	2	2000	2060	1901	1	B1	286000	2009-10-09
2	BATHGATE	02 TWO FAMILY HOMES	1	3047	16	NA	B2	2125 BATHGATE AVENUE	NA	10457	2	0	2	3362	1744	1901	1	B2	0	2009-04-03
2	BATHGATE	02 TWO FAMILY HOMES	1	3050	94	NA	B2	2240 BASSFORD AVENUE	NA	10457	2	0	2	1300	2300	1901	1	B2	389000	2009-07-28
2	BATHGATE	02 TWO FAMILY HOMES	1	3053	22	NA	S2	511 EAST 183 STREET	NA	10458	2	1	3	2011	4280	1901	1	S2	1	2009-11-18
2	BATHGATE	02 TWO FAMILY HOMES	1	3053	22	NA	S2	511 EAST 183RD STREET	NA	10458	2	1	3	2011	4280	1901	1	S2	415440	2009-08-24
2	BATHGATE	03 THREE FAMILY HOMES	1	2912	116	NA	C0	1536 WASHINGTON AVENUE	NA	10457	3	0	3	2000	2400	1993	1	C0	319500	2009-10-07
2	BATHGATE	03 THREE FAMILY HOMES	1	2912	116	NA	C0	1536 WASHINGTON AVENUE	NA	10457	3	0	3	2000	2400	1993	1	C0	100	2009-01-12
2	BATHGATE	03 THREE FAMILY HOMES	1	3029	28	NA	C0	2092 WEBSTER AVENUE	NA	10457	3	0	3	2122	3510	1899	1	C0	0	2009-11-18
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	67	NA	C0	460 EAST 182 STREET	NA	10457	3	0	3	2000	3744	1899	1	C0	162240	2009-11-12
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	103	NA	C0	2135 WASHINGTON AVENUE	NA	10457	3	0	3	1701	4600	2006	1	C0	630000	2009-07-17
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	104	NA	C0	465 EAST 181 STREET	NA	10457	3	0	3	1095	3706	2006	1	C0	602500	2009-09-02
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	105	NA	C0	463 EAST 181 STREET	NA	10457	3	0	3	1095	3706	2006	1	C0	602500	2009-04-24
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	106	NA	C0	461 EAST 181 STREET	NA	10457	3	0	3	1095	3706	2006	1	C0	602500	2009-04-27
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	107	NA	C0	459 EAST 181 STREET	NA	10457	3	0	3	1095	3706	2006	1	C0	640000	2009-05-20
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	108	NA	C0	457 EAST 181 STREET	NA	10457	3	0	3	1419	3706	2006	1	C0	602500	2009-07-23
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	109	NA	C0	455 EAST 181 STREET	NA	10457	3	0	3	1687	4050	2006	1	C0	602500	2009-07-15
2	BATHGATE	03 THREE FAMILY HOMES	1	3037	110	NA	C0	453 EAST 181 STREET	NA	10457	3	0	3	1687	4050	2006	1	C0	602500	2009-04-29
2	BATHGATE	03 THREE FAMILY HOMES	1	3044	51	NA	C0	1966 BATHGATE AVENUE	NA	10457	3	0	3	2880	3132	1910	1	C0	101400	2009-12-16
2	BATHGATE	03 THREE FAMILY HOMES	1	3044	51	NA	C0	1964 BATHGATE AVENUE	NA	10457	3	0	3	2880	3132	1910	1	C0	0	2009-02-23
2	BATHGATE	03 THREE FAMILY HOMES	1	3050	3	NA	C0	2190 WASHINGTON AVENUE	NA	10457	3	0	3	1331	2244	1901	1	C0	178500	2009-07-07
2	BATHGATE	05 TAX CLASS 1 VACANT LAND	1B	3050	12	NA	V0	2246 WASHINGTON AVENUE	NA	10457	0	0	0	2901	0	0	1	V0	0	2009-09-30
2	BATHGATE	05 TAX CLASS 1 VACANT LAND	1B	3050	13	NA	V0	2250 WASHINGTON AVENUE	NA	10457	0	0	0	2668	0	0	1	V0	0	2009-09-30
2	BATHGATE	05 TAX CLASS 1 VACANT LAND	1B	3050	15	NA	V0	2254 WASHINGTON AVENUE	NA	10457	0	0	0	2673	0	0	1	V0	0	2009-09-30
2	BATHGATE	05 TAX CLASS 1 VACANT LAND	2	3050	41	NA	D1	491 FLETCHER PLACE	NA	10457	0	0	0	1637	0	0	1	V0	0	2009-09-30
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2	3027	29	NA	C1	422 EAST 178TH STREET	NA	10457	16	0	16	3344	11918	1909	2	C1	1315000	2009-08-26
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2	3028	27	NA	C7	4305 PARK AVENUE	NA	10457	24	1	25	5568	16680	1909	2	C7	0	2009-08-26
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2	3028	29	NA	C1	4301 PARK AVENUE	NA	10457	24	0	24	5189	15300	1909	2	C1	0	2009-08-26
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2	3028	34	NA	C1	4293 PARK AVENUE	NA	10457	20	0	20	6045	15512	1909	2	C1	0	2009-08-26
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2	3028	44	NA	C1	4289 PARK AVENUE	NA	10457	22	0	22	6098	16728	1908	2	C1	6885000	2009-08-26
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2A	3037	99	NA	C3	2137 WASHINGTON AVENUE	NA	10457	4	0	4	2719	3348	1901	2	C3	10	2009-04-17
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2B	3041	2	NA	C1	4678 PARK AVENUE	NA	10458	8	0	8	3417	9360	1910	2	C1	0	2009-11-04
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2B	3041	3	NA	C1	4682 PARK AVENUE	NA	10458	8	0	8	3425	9360	1910	2	C1	0	2009-11-04
2	BATHGATE	07 RENTALS - WALKUP APARTMENTS	2B	3048	20	NA	C1	2190 BATHGATE AVE	NA	10457	7	0	7	2417	5400	1931	2	C1	1e+05	2009-02-19
2	BATHGATE	08 RENTALS - ELEVATOR APARTMENTS	2	3050	10	NA	D1	2240 WASHINGTON AVENUE	NA	10457	68	1	69	13424	54365	2010	2	D1	0	2009-09-30
2	BATHGATE	10 COOPS - ELEVATOR APARTMENTS	2	3035	48	NA	D4	463 EAST 178TH STREET, 1E	NA	10457	0	0	0	0	0	1935	2	D4	24000	2009-01-21
2	BATHGATE	10 COOPS - ELEVATOR APARTMENTS	2	3035	48	NA	D4	463 EAST 178TH STREET, 6E	NA	10457	0	0	0	0	0	1935	2	D4	20000	2009-12-09

2.2 Importing crime data

2.2.1 Loading tidyverse package

library(tidyverse)

2.2.2 Reading crime data

We use read.csv instead of read_csv to get a data.frame instead of tibble, which is more useful when doing analysis in other programs

df_crime <- read_csv("../Data/Crimes_2006-2017/NYPD_Complaint_Data_Historic.csv")

2.3 Importing the Census Data

2.3.1 Downloading the census block data

Download the United States census bureau data estimates for 2015 for all 2,167 individual NYC census tracts. This data is in two parts. Import the first part which includes the block code/census tract, its latitude and longitude as well as the name of the county and state through the following code. After getting them in a dataframe (block_loc), remove the rows with ‘NA’. A basic ‘unique value check’ shows that there are 38,936 locations in 20,406 blocks in 15 counties. These longitude and latitude locations would help us later when we classify the demographic data by zip code.

#Code for census data
block_loc <- read.csv("../Data/Census_2015/census_block_loc.csv")

#This data frame has 5 columns (Latitude, longitude and block code for a block as well as its county and state) and 38,396 rows


#Checking if there are any missing or infinite 
missing_val = any(is.na(block_loc))
#There are no missing values in the dataframe

#Checking for unique values 
unique_state <- unique(block_loc$State)
unique_county <- unique(block_loc$County)
unique_block <- unique(block_loc$BlockCode)
#There are 38,936 locations in 20,406 blocks which are in 15 counties in NJ and NY in the data

The dataframe looks like this:

Latitude	Longitude	BlockCode	County	State
40.48	-74.28000	3.402301e+14	Middlesex	NJ
40.48	-74.27683	3.402301e+14	Middlesex	NJ
40.48	-74.27367	3.402301e+14	Middlesex	NJ
40.48	-74.27050	3.402301e+14	Middlesex	NJ
40.48	-74.26734	3.402301e+14	Middlesex	NJ
40.48	-74.26417	3.402301e+14	Middlesex	NJ

We divide this dataframe into different counties in NY and NJ to get a better sense about the data. This gives us information about the number of locations in each of the counties in our data.

#Splitting dataframes into different states and then different counties
library(dplyr)
block_loc_NY <- filter(block_loc, State == "NY")
#This has 24,240 locations
block_loc_NJ <- filter(block_loc, State == "NJ")
#This has 14,156 locations
unique(block_loc_NY$County)
block_loc_Richmond <- filter(block_loc, County == "Richmond") #3933 locations
block_loc_Queens <- filter(block_loc, County == "Queens") #6872 locations
block_loc_Nassau <- filter(block_loc, County == "Nassau") #4906 locations
block_loc_Kings <- filter(block_loc, County == "Kings") #3729 locations
block_loc_NewYork <- filter(block_loc, County == "New York") #1295 locations
block_loc_Bronx <- filter(block_loc, County == "Bronx") #2224 locations
block_loc_Westchester <- filter(block_loc, County == "Westchester") #1281 locations

unique(block_loc_NJ$County)
block_loc_Middlesex <- filter(block_loc, County == "Middlesex") #760 locations
block_loc_Monmouth <- filter(block_loc, County == "Monmouth") #1117 locations
block_loc_Union<- filter(block_loc, County == "Union") #1553 locations
block_loc_Hudson <- filter(block_loc, County == "Hudson") #2400 locations
block_loc_Essex <- filter(block_loc, County == "Essex") #3302 locations
block_loc_Bergen <- filter(block_loc, County == "Bergen") #3456 locations
block_loc_Passaic <- filter(block_loc, County == "Passaic") #1539 locations
block_loc_Morris <- filter(block_loc, County == "Morris") #29 locations

2.3.2 Downloading the 2nd part NYC census tracts data

The second part of the census data is the NYC census tracts data which is imported below. This is also split into different dataframes which contained data about sex, race, income, jobs, commute and employment, albeit these dataframes will not be used separately for analysis in the future.

#importing nyc census tracts file
nyc_tracts <- read.csv("../Data/Census_2015/nyc_census_tracts.csv")

#Splitting this into different dataframes
nyc_tracts_sex <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Men, Women) #Actual number of men and women in population
nyc_tracts_race <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Hispanic, White, Black, Native, Asian) #Percentage of different race in population
nyc_tracts_income <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Professional, Service, Office, Construction, Production) #Percentage of different jobs
nyc_tracts_jobs <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Income, IncomeErr, IncomePerCap, IncomePerCapErr, Poverty, ChildPoverty) #Income and Poverty data
nyc_tracts_commute <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Drive, Carpool, Transit, Walk, OtherTransp, WorkAtHome, MeanCommute) #Percentage of different ways that people use for commuting. Don't know MeanCommute, maybe it is the percentage of people commuting every day
nyc_tracts_employment <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Employed, PrivateWork, PublicWork, SelfEmployed, FamilyWork, Unemployment)#Percentage of different kinds of employment

The second dataframe looks like this:

CensusTract	County	Borough	TotalPop	Men	Women	Hispanic	White	Black	Native	Asian	Citizen	Income	IncomeErr	IncomePerCap	IncomePerCapErr	Poverty	ChildPoverty	Professional	Service	Office	Construction	Production	Drive	Carpool	Transit	Walk	OtherTransp	WorkAtHome	MeanCommute	Employed	PrivateWork	PublicWork	SelfEmployed	FamilyWork	Unemployment
3.6005e+10	Bronx	Bronx	7703	7133	570	29.9	6.1	60.9	0.2	1.6	6476	NA	NA	2440	373	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	0	NA	NA	NA	NA	NA
3.6005e+10	Bronx	Bronx	5403	2659	2744	75.8	2.3	16.0	0.0	4.2	3639	72034	13991	22180	2206	20.0	20.7	28.7	17.1	23.9	8.0	22.3	44.8	13.7	38.6	2.9	0.0	0.0	43.0	2308	80.8	16.2	2.9	0.0	7.7
3.6005e+10	Bronx	Bronx	5915	2896	3019	62.7	3.6	30.7	0.0	0.3	4100	74836	8407	27700	2449	13.2	23.6	32.2	23.4	24.9	9.0	10.5	41.3	10.0	44.6	1.4	0.5	2.1	45.0	2675	71.7	25.3	2.5	0.6	9.5
3.6005e+10	Bronx	Bronx	5879	2558	3321	65.1	1.6	32.4	0.0	0.0	3536	32312	6859	17526	2945	26.3	35.9	19.1	36.1	26.2	4.9	13.8	37.2	5.3	45.5	8.6	1.6	1.7	38.8	2120	75.0	21.3	3.8	0.0	8.7
3.6005e+10	Bronx	Bronx	2591	1206	1385	55.4	9.0	29.0	0.0	2.1	1557	37936	3771	17986	2692	37.1	31.5	35.4	20.9	26.2	6.6	11.0	19.2	5.3	63.9	3.0	2.4	6.2	45.4	1083	76.8	15.5	7.7	0.0	19.2
3.6005e+10	Bronx	Bronx	8516	3301	5215	61.1	1.6	31.1	0.3	3.3	5436	18086	3694	12023	2346	53.2	67.7	14.5	41.1	16.7	7.1	20.6	19.6	7.0	68.2	4.3	1.0	0.0	46.0	2508	71.0	21.3	7.7	0.0	17.2

3 Data Cleaning and Processing

After importing the data, we moved on to cleaning and processing. Cleaning the data involved removing the entries that were either irrelevant or incomplete for the purposes of our project. Processing involved geocoding the data, grouping it spatiotemporally, and then computing new metrics to help describe each group.

Grouping the data would allow us to more easily draw observe trends and draw meaningful conclusions, and we decided to group the real estate and crime data entries by zip code and month. The ~250 zip code areas in NYC would give us enough granularity to see the how the myriad of different regions in New York City differ from one another over time, but also enough courseness to guarantee acceptable response times of our visualization app, and ensure that there were usually multiple entries in each group to average across.

3.1 Crime Data

3.1.1 Cleaning the Crime Data

Generally, the crime data is cleaned as part of the other sections. When the crime scores are taken to classify the crimes and these scores are summed across zipcodes, crimes that could not be classified into defined categories are excluded. While excluding NAs could introduce bias, it is believed that the effect would be consistent across time and population for the analysis. However, it did appear that there was a difference in the reporting for crime scores after 2006, which was accounted for when representing the data visually.

3.1.2 Assigning Crime Rating

3.1.2.1 Adding zipcode to each criminal activity

The df_crime dataframe that contains data about all the crimes committed is first of all purged of all NA rows in the latitude and longitude columns. Then by reverse geocoding, the latitude and longitude values are mapped to a zip code in the following way.

# These should be taken care of with the higher level Import_All.R:
# source("../Engm182_Project/Import_Scripts/Import_Crime_Data.R")
# head(df_crime)
library(tidyr)
library(dplyr)
#Dropping the rows whcih have blank in latitude or longitude column
df_crime2 <- drop_na(df_crime, Latitude, Longitude)

#Reverse geocoding to find the zip code for each Lonlat
library("ggmap")
dset_crime2 <- as.data.frame(df_crime2[,29:28]) 
loc <- dset_crime2
res <- lapply(seq(nrow(location)), function(i){ 
    revgeocode(location[i,], 
               output = c("all"), 
               messaging = FALSE, 
               sensor = FALSE, 
               override_limit = FALSE)
})

3.1.2.2 Assigning weights to each crime

Assign a weight to every crime based on the degree of severity. A violation which includes minor offenses like speeding ticket, public intoxication etc. is assigned a weight of 1. A misdemeanour which can include a simple assault or driving under influence is assigned a wight of 2. These were easy to identify as the LAW_CAT_CD column in df_crime gives the category of the crime (Violation, Misdemeanour or Felony).

#Assigning weights to each crime
#Violations are the most minor of offenses. They have been assigned a weight of 1
#source: https://patrickparrottalaw.com/differences-between-a-violation-a-misdemeanor-and-a-felony/
df_crime2$weight <- 0
df_crime2$weight[which(df_crime2$LAW_CAT_CD=="VIOLATION")]=1

#Misdemeanours are more serious than violations but less severe than felonies, they can carry up to a year in jail and have score of 2
#source: https://patrickparrottalaw.com/differences-between-a-violation-a-misdemeanor-and-a-felony/
df_crime2$weight[which(df_crime2$LAW_CAT_CD=="MISDEMEANOR")]=2

Assigning weights to felonies is a little more tricky. Since, felonies include a vast variety of criminal activity, we decided to first divide them into 4 categories, 1st to 4th degree. First-degree felonyincludes murder, rape, kidnapping, arson and fraud. Second-degree felony includes aggravated assault, felony assault, arson, manslaughter, possession of a controlled substance and child molestation. Third-degree felony includes assault and battery, elder abuse, transmission of pornography, driving under the influence, fraud and arson. Finally, fourth-degree felony includes involuntary manslaughter, burglary, larceny and resisting arrest. (These lists are just examples and not exhaustive and are based on a source: https://legaldictionary.net/felony/) From the OFNS_DESC column generate a list of unique keywords and group them into 4 lists pertaining to the type of feloniy. Then, the crime is assigned a weight of 6, 5, 4 or 3 for felonies of 1st degree, 2nd degree, 3rd degree and 4th degree respectively depending on which category the OFNS_DESC column of that crime suggests.

#Felonies are the most serious of offenses and will be sub-classified
#Common law and statutes in most states divide felonies into first through fourth degree felonies, each carrying decreasing penalties
#source: https://legaldictionary.net/felony/

df_crime2_felony <- subset(df_crime2, df_crime2$LAW_CAT_CD=="FELONY")
felonytypes <- unique(df_crime2_felony$OFNS_DESC)
felony_1stdegree <- c("RAPE", "THEFT-FRAUD", "ARSON", "KIDNAPPING & RELATED OFFENSES", "MURDER & NON-NEGL. MANSLAUGHTER", "KIDNAPPING", "KIDNAPPING AND RELATED OFFENSES", "NYS LAWS-UNCLASSIFIED FELONY" )
felony_2nddegree <- c("FELONY ASSAULT", "FORGERY", "SEX CRIMES", "DANGEROUS DRUGS", "DANGEROUS WEAPONS")
felony_3rddegree <- c("INTOXICATED/IMPAIRED DRIVING","ROBBERY", "CRIMINAL MISCHIEF & RELATED OF", "PROSTITUTION & RELATED OFFENSES")
felony_4thdegree <- c("GAMBLING","", "BURGLARY","GRAND LARCENY", "GRAND LARCENY OF MOTOR VEHICLE", "MISCELLANEOUS PENAL LAW", "POSSESSION OF STOLEN PROPERTY", "CHILD ABANDONMENT/NON SUPPORT", "HOMICIDE-NEGLIGENT-VEHICLE", "HOMICIDE-NEGLIGENT,UNCLASSIFIE", "ENDAN WELFARE INCOMP")

df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_1stdegree == TRUE)]=6
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_2nddegree == TRUE)]=5
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_3rddegree == TRUE)]=4
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_4thdegree == TRUE)]=3

df_crime <- df_crime2
save(df_crime, file='Data_Crime.rda')

#Extract month and year for each crime
df_crime3 <- subset(df_crime2, select = c(CMPLNT_FR_DT, OFNS_DESC, LAW_CAT_CD, Longitude, Latitude, weight))
head(df_crime3)

3.1.3 Group crime information by month, year, and zipcode

First, the relevant datafiles are imported into the environment and the packages are loaded:

library(lubridate)
library(dplyr)
load(file = "Data_Crime_w_Zipcodes.rda")
load(file = "Data_Zip_Population.rda")

For analysis, it is helpful to have the events grouped by month, year, and zipcode. Then, a field like the crime score can be aggregated and normalized for this criterion. In this case, the crime data is formatted in the desired way with the following command:

score_by_time_and_rating <- df_crime %>% group_by(month=floor_date(CMPLNT_FR_DT, "month"), zip_code)  %>% summarize(summary_variable=sum(weight))

Source: https://stackoverflow.com/questions/33221425/how-do-i-group-my-date-variable-into-month-year-in-r

3.1.4 Group other fields by time and zipcode

In addition, commands are run to normalize the crime score by the total population and only take data greater than the year 2000. There are extraneous entries in the data for crimes hundreds of years ago and 2000 is a good baseline. It also appears that the reporting method for these crimes changes in 2006, as evidenced by a dramatic change in crime scores at this time.

# Summing the scores
names(score_by_time_and_rating)[names(score_by_time_and_rating) == 'summary_variable'] <- 'sum_weight'

# Joining the zipcodes
score_by_time_and_rating <- left_join(score_by_time_and_rating, zip_population, by="zip_code")

# Dividing sum by the total population (at 2015 census levels)
score_by_time_and_rating$weight <- score_by_time_and_rating$sum_weight / score_by_time_and_rating$TotalPop

# Dipslaying the head
head(score_by_time_and_rating)

save(score_by_time_and_rating, file = "Data_Score_by_Time_and_Rating.rda")

#Group by year and rating (This is for the regression model for house sales)
score_by_zip_and_year <- df_crime %>% group_by(year=floor_date(CMPLNT_FR_DT, "year"), zip_code)  %>% summarize(summary_variable=sum(weight))

#Summing the scores
names(score_by_zip_and_year)[names(score_by_zip_and_year) == 'summary_variable'] <- 'sum_weight'

#Extracting the year
score_by_zip_and_year$year <- year(score_by_zip_and_year$year)
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(year,zip_code,sum_weight))

# Joining the zipcodes with populations
score_by_zip_and_year <- left_join(score_by_zip_and_year, zip_population, by="zip_code")
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(year,zip_code,sum_weight, TotalPop))

# Dividing sum by the total population (at 2015 census levels)
score_by_zip_and_year$weight <- score_by_zip_and_year$sum_weight / score_by_zip_and_year$TotalPop

#Taking data only for years 2001 and above 
score_by_zip_and_year <- subset(score_by_zip_and_year, year > 2000)
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(zip_code, year, sum_weight, weight, TotalPop))

save(score_by_zip_and_year, file = "Data_Score_by_year_and_zipcode.rda")

3.1.5 Converting coordinates to zipcodes

There was difficulty with initially processing the location data into a format for analysis. For the crime and census data:

Crime data - complaints are recorded with only a latitude and longitude
Census data - the census data uses census tracts for identifying locations

For the regressions and displaying the data visually, the data needs to be grouped by similar location. In this case, zipcodes were chosen to be the level to group by. By classifying the data entries in a zipcode, the effect of the zipcode can be studied.

At first, an API was used to convert the coordinates from the crime data to zip codes. The revgeocode function from the ggmap package was used and applied to each crime complaint. However, it was taking too long to make the requests to retrieve each of the zipcodes. Although this route was not pursued, it may be possible to perform the API requests in parallel and improve performance.

Instead, it was chosen to perform the reverse geocoding locally and avoid having to make requests to an API service for each zip code. Further, since the zip codes are limited to only New York, it was possible to only download the location data for this area. In this section, the same method will be applied to extract the zipcodes.

3.1.6 Sources for latitude and longitude to zipcode code and data

The source for the code to retrieve the zipcodes from latitude and longitude coordinates locally is: https://stackoverflow.com/questions/46267287/reverse-geocoding-speed

The source for the geospatial polygon data to process the coordinates is: https://jsspina.carto.com/tables/nyc_zip_code_tabulation_areas_polygons/public/map

3.1.7 Processing the crime data

First, the packages are loaded:

library(sf)
library(purrr)
library(lubridate)
library(dplyr)

The crime data is read in with the following command:

sf <- sf::st_read("nyc_zip_code_tabulation_areas_polygons.geojson")

Then the desired latitude and longitude fields are taken. the complaint number is also taken so that the crime-zipcode data can be joined again later. The NA values are omitted.

CMPLNT_NUM <- c(df_crime$CMPLNT_NUM)
latitude <- c(df_crime$Latitude)
longitude <- c(df_crime$Longitude)

x = data.frame(CMPLNT_NUM, longitude, latitude)
x <- na.omit(x)

sf_x <- sf::st_as_sf(x, coords = c("longitude", "latitude"))

As stated in the sourced guide above for reverse geocoding this data, “To perform spatial operations, the coordinate reference system needs to match between the two geometries.”

st_crs(sf_x) <- st_crs(sf)

The coordinates are processed with the st_within function and put into an output using sapply:

res <- st_within(sf_x, sf)

out <- sapply(res, function(x) as.character(sf$postalcode[x]))

The output for each zipcode was a list of zipcodes even though there was just one. There were multiple outputs in the case that one was trying to link overlapping geometries with the data. The zipcodes are converted to the proper format:

out2 <- map(out, 1) # Take first element from each list
out3 <- lapply(out, function(x) ifelse(length(x) == 1, x[1], NA)) # Handle NAs
out4 <- vapply(out3, paste, collapse = ", ", character(1L)) # Flattten listst; source: https://stackoverflow.com/questions/24829027/unimplemented-type-list-when-trying-to-write-table

sf_x$zip_code <- out4

Finally, the zipcodes are linked back with the crime data and saved:

# Putting it back with the sf_data
df_crime <- left_join(df_crime, sf_x, by = "CMPLNT_NUM")

# Format column as dates
df_crime$CMPLNT_FR_DT <- as.Date(df_crime$CMPLNT_FR_DT, format = "%m/%d/%Y")

# This column NEEDS to be removed because it causes errors
df_crime$geometry<-NULL

# Saving the data with thet important information
save(df_crime, file='Data_Crime_w_Zipcodes.rda')

3.2 Census Data

3.2.1 Cleaning the census data

After importing the census data in section “02C: Data_Import_Census”, no more data cleaning is required as the data was organized into a usable format there itself and the NA values were removed.

3.2.2 Processing the census data

First, the packages are loaded:

library(readxl)
library(dplyr)
library(plyr)

The process for finding the zipcodes that corresponded to the census data is different. The census tract information is more granular than at the zipcode level. The mapping between tracts and zipcodes is imported and joined to the census data, so that each element has a corresponding zipcode. The Excel file can be downloaded from the source at end of this section.

tract_zips <- my_data <- read_excel("TRACT_ZIP_032020.xlsx")

names(tract_zips)[names(tract_zips) == 'ZIP'] <- 'zip_code'
names(tract_zips)[names(tract_zips) == 'TRACT'] <- 'CensusTract'
tract_zips$CensusTract <- as.numeric(tract_zips$CensusTract)

nyc_tracts <- left_join(nyc_tracts, tract_zips, by="CensusTract")

# Save updated nyc_tracs
save(nyc_tracts, file='Data_NYC_Tracts.rda')

#Adding all the parameters required for regression in zip_population file
zip_population <- subset(nyc_tracts, select = c(CensusTract,zip_code,TotalPop, Men, Women, Hispanic, White, Black, Native, Asian, IncomePerCap, Unemployment))

The most important element was the corresponding populations for each zipcode. The population data for each demographic was summed across each zipcode. The demographic information was initially stored as percentages, so it was necessary to multiply those percentages by the total population for each census tract.

#Converting the demographic and unemployment percentages to numbers and income per capita to income
zip_population$Hispanic.number <- zip_population$Hispanic*zip_population$TotalPop/100
zip_population$White.number <- zip_population$White*zip_population$TotalPop/100
zip_population$Black.number <- zip_population$Black*zip_population$TotalPop/100
zip_population$Native.number <- zip_population$Native*zip_population$TotalPop/100
zip_population$Asian.number <- zip_population$Asian*zip_population$TotalPop/100
zip_population$IncomeTot <- zip_population$IncomePerCap*zip_population$TotalPop
#Percentage of people in the age group of 18-65 in NY is 65% source: https://www.census.gov/quickfacts/newyorkcitynewyork
zip_population$Unemployment.number <- zip_population$Unemployment*0.65*zip_population$TotalPop/100

#Aggregating the various census tract data into 1 zip code
zip_population <- subset(zip_population, select = c(zip_code,TotalPop, Men, Women, Hispanic.number, White.number, Black.number, Native.number, Asian.number, IncomeTot, Unemployment.number))
zip_population <- aggregate(.~zip_code, data = zip_population, FUN = sum)

#Convert back the demographics and unemployment to % and IncomeTot to per capita income
zip_population$Hispanic.number <- zip_population$Hispanic.number*100/zip_population$TotalPop
zip_population$White.number <- zip_population$White.number*100/zip_population$TotalPop
zip_population$Black.number <- zip_population$Black.number*100/zip_population$TotalPop
zip_population$Native.number <- zip_population$Native.number*100/zip_population$TotalPop
zip_population$Asian.number <- zip_population$Asian.number*100/zip_population$TotalPop
zip_population$IncomeTot <- zip_population$IncomeTot/zip_population$TotalPop
zip_population$Unemployment.number <- zip_population$Unemployment.number*100/(0.65*zip_population$TotalPop)

Finally, the columns are renamed as needed. The data for zipcode to population demographic was saved for analysis later:

#Renaming the column headers
colnames(zip_population) <- c("zip_code", "TotalPop", "Men", "Women", "Hispanic", "White", "Black", "Native", "Asian", "PerCapitaIncome", "Unemployed")

save(zip_population, file = "Data_Zip_Population.rda")

3.2.3 Source for census tract to zipccode mapping data

The following site includes the most up to date census tract to zipcode information: https://www.huduser.gov/portal/datasets/usps_crosswalk.html

3.3 Real Estate Data

3.3.1 Geocoding the Real Estate Data

In order to group the real estate data by zip code, we first need to attach a zip code value to each entry in our data (a process known as “geocoding”, or “reverse-geocoding” to be more precise). To do this, we use the Geoclient API available on the city’s Developer Portal. Before pulling from the API, we do some quick cleaning by converting the borough, sale_price, and gross_square_feet columns to the numeric data type (to make sure they’re not incorrectly stored as character or factor types). Then we filter out some of the data with bad sale_price and gross_square_feet values, as this makes for fewer API queries and we know we won’t want use those entries in our analysis anyway. The last step before pulling from the API is creating the bbl column, which is a 10 digit combination of the borough, block, and lot columns. We will pass each entry’s bbl number to the API, which will then return the zip code (among other data) for that specific plot of land.

library(tidyverse)

load("Data_Housing.rda")

# Convert chr variables to numerics
df_housing$borough = as.numeric(df_housing$borough)
df_housing$sale_price = as.numeric(df_housing$sale_price)
df_housing$gross_square_feet = as.numeric(df_housing$gross_square_feet)

# Get only data with valid sale prices and square footages
# Build borough-block-lot (BBL) ID for reverse geocoding
df_housing = df_housing %>%
  filter(sale_price > 0) %>%
  filter(gross_square_feet > 1) %>%
  mutate(bbl = borough * 1e9 + block * 1e4 + lot)

A nifty CRAN package appropriately titled geoclient allows us to pull data from the Geoclient API in just a single line of code. However, because the API is public and free, the rate of requests from a single account is limited and it took ~7 hours to query all 1.4 mllion properties. After the queries are complete, we bind the returned API data to the existing df_housing data frame in a new data frame called geocoded_df_housing.

library(geoclient)

# Query the API
bbl_df = geo_bbl(df_housing$bbl, id = APPLICATION_ID, key = API_KEY)

# Join the existing data with the API data
geocoded_df_housing = bind_cols(df_housing, bbl_df)

3.3.2 Cleaning the Real Estate Data

In the next step, we modify data from a CSV file with all NYC building class codes to include a column labeling the building category of each class code (e.g., Residential, Commercial, Industrial, etc.).

building_class_key = read_csv("BUILDING_CLASS.csv") %>%
  group_by(BUILDING_CODE_ID) %>%
  summarize(category = str_to_title(TYPE[1])) %>%
  rename(class_code = BUILDING_CODE_ID) %>%
  mutate(category = case_when(
    substring(class_code, 1,1) %in% c("A","B","C","D","R") ~ "Residential",
    substring(class_code, 1,1) %in% c("S") ~ "Mixed",
    substring(class_code, 1,1) %in% c("H","K","L","O") ~ "Commercial",
    substring(class_code, 1,1) %in% c("E","F","G","T","U") ~ "Industrial",
    substring(class_code, 1,1) %in% c("V") ~ "Vacant",
    substring(class_code, 1,1) %in% c("I","J","M","N","P","Q","W") ~ "Civic",
    substring(class_code, 1,1) %in% c("Y") ~ "Government",
    substring(class_code, 1,1) %in% c("Z") ~ "Other"
  ))

**building_class_key**
class_code	category
A0	Residential
A1	Residential
A2	Residential
A3	Residential
A4	Residential
A5	Residential
A6	Residential
A7	Residential
A8	Residential
A9	Residential
B1	Residential
B2	Residential
B3	Residential
B9	Residential
C0	Residential
C1	Residential
C2	Residential
C3	Residential
C4	Residential
C5	Residential
C6	Residential
C7	Residential
C8	Residential
C9	Residential
CM	Residential
D0	Residential
D1	Residential
D2	Residential
D3	Residential
D4	Residential
D5	Residential
D6	Residential
D7	Residential
D8	Residential
D9	Residential
E1	Industrial
E2	Industrial
E3	Industrial
E4	Industrial
E7	Industrial
E9	Industrial
F1	Industrial
F2	Industrial
F4	Industrial
F5	Industrial
F8	Industrial
F9	Industrial
G0	Industrial
G1	Industrial
G2	Industrial
G3	Industrial
G4	Industrial
G5	Industrial
G6	Industrial
G7	Industrial
G8	Industrial
G9	Industrial
GU	Industrial
GW	Industrial
H1	Commercial
H2	Commercial
H3	Commercial
H4	Commercial
H5	Commercial
H6	Commercial
H7	Commercial
H8	Commercial
H9	Commercial
HB	Commercial
HH	Commercial
HR	Commercial
HS	Commercial
I1	Civic
I2	Civic
I3	Civic
I4	Civic
I5	Civic
I6	Civic
I7	Civic
I9	Civic
J1	Civic
J2	Civic
J3	Civic
J4	Civic
J5	Civic
J6	Civic
J7	Civic
J8	Civic
J9	Civic
K1	Commercial
K2	Commercial
K3	Commercial
K4	Commercial
K5	Commercial
K6	Commercial
K7	Commercial
K8	Commercial
K9	Commercial
L1	Commercial
L2	Commercial
L3	Commercial
L8	Commercial
L9	Commercial
M1	Civic
M2	Civic
M3	Civic
M4	Civic
M9	Civic
N1	Civic
N2	Civic
N3	Civic
N4	Civic
N9	Civic
O1	Commercial
O2	Commercial
O3	Commercial
O4	Commercial
O5	Commercial
O6	Commercial
O7	Commercial
O8	Commercial
O9	Commercial
P1	Civic
P2	Civic
P3	Civic
P4	Civic
P5	Civic
P6	Civic
P7	Civic
P8	Civic
P9	Civic
Q1	Civic
Q2	Civic
Q3	Civic
Q4	Civic
Q5	Civic
Q6	Civic
Q7	Civic
Q8	Civic
Q9	Civic
R0	Residential
R1	Residential
R2	Residential
R3	Residential
R4	Residential
R5	Residential
R6	Residential
R7	Residential
R8	Residential
R9	Residential
RA	Residential
RB	Residential
RG	Residential
RH	Residential
RK	Residential
RP	Residential
RR	Residential
RS	Residential
RT	Residential
RW	Residential
S0	Mixed
S1	Mixed
S2	Mixed
S3	Mixed
S4	Mixed
S5	Mixed
S9	Mixed
T1	Industrial
T2	Industrial
T9	Industrial
U0	Industrial
U1	Industrial
U2	Industrial
U3	Industrial
U4	Industrial
U5	Industrial
U6	Industrial
U7	Industrial
U8	Industrial
U9	Industrial
V0	Vacant
V1	Vacant
V2	Vacant
V3	Vacant
V4	Vacant
V5	Vacant
V6	Vacant
V7	Vacant
V8	Vacant
V9	Vacant
W1	Civic
W2	Civic
W3	Civic
W4	Civic
W5	Civic
W6	Civic
W7	Civic
W8	Civic
W9	Civic
Y1	Government
Y2	Government
Y3	Government
Y4	Government
Y5	Government
Y6	Government
Y7	Government
Y8	Government
Y9	Government
Z0	Other
Z1	Other
Z2	Other
Z3	Other
Z4	Other
Z5	Other
Z7	Other
Z8	Other
Z9	Other

Now we select the columns from geocoded_df_housing that we want to work with moving forward, filter out missing values and high/low-priced outliers, and use the building class codes to assign a category to each building sale. This is all stored and saved new data frame called cleaned_df_housing, which we will use for grouping in the next sub-section.

cleaned_df_housing = geocoded_df_housing %>%
  select(
    building_class_at_present,
    zip_code,
    gross_square_feet,
    sale_price,
    sale_date,
  ) %>%
  rename(lat = latitudeInternalLabel) %>%
  rename(lng = longitudeInternalLabel) %>%
  filter(!(is.na(lat) | is.na(lng))) %>%
  filter(!(is.na(zip_code) | zip_code == 0)) %>%
  filter(!(is.na(sale_price))) %>%
  filter(!(is.na(gross_square_feet) | gross_square_feet <= 0)) %>%
  mutate(price_per_sqft = sale_price/gross_square_feet) %>%
  filter(price_per_sqft <= 2500) %>%
  filter(price_per_sqft > 20) %>%
  mutate(zip_code = sapply(zip_code, as.character)) %>%
  left_join(building_class_key,  by = c("building_class_at_present" = "class_code"))

save(cleaned_df_housing, file="cleaned_housing.rda")

**cleaned_df_housing**
building_class_at_present	zip_code	gross_square_feet	sale_price	sale_date	price_per_sqft	category
A5	10457	1188	345752	2009-03-16	291.03704	Residential
B2	10457	1632	135000	2009-05-08	82.72059	Residential
B2	10457	2125	325000	2009-03-30	152.94118	Residential
B1	10457	2340	344500	2009-11-20	147.22222	Residential
B1	10457	2340	316474	2009-05-04	135.24530	Residential
B2	10457	3240	359038	2009-01-19	110.81420	Residential
B2	10458	1944	183028	2009-11-02	94.15021	Residential
B1	10458	2060	286000	2009-10-09	138.83495	Residential
B2	10457	2300	389000	2009-07-28	169.13043	Residential
S2	10458	4280	415440	2009-08-24	97.06542	Mixed
C0	10457	2400	319500	2009-10-07	133.12500	Residential
C0	10457	3744	162240	2009-11-12	43.33333	Residential
C0	10457	4600	630000	2009-07-17	136.95652	Residential
C0	10457	3706	602500	2009-09-02	162.57420	Residential
C0	10457	3706	602500	2009-04-24	162.57420	Residential
C0	10457	3706	602500	2009-04-27	162.57420	Residential
C0	10457	3706	640000	2009-05-20	172.69293	Residential
C0	10457	3706	602500	2009-07-23	162.57420	Residential
C0	10457	4050	602500	2009-07-15	148.76543	Residential
C0	10457	4050	602500	2009-04-29	148.76543	Residential
C0	10457	3132	101400	2009-12-16	32.37548	Residential
C0	10457	2244	178500	2009-07-07	79.54545	Residential
C1	10457	11918	1315000	2009-08-26	110.33730	Residential
C1	10457	16728	6885000	2009-08-26	411.58537	Residential
K1	10457	1775	336000	2009-07-23	189.29577	Commercial
S4	10457	3233	185000	2009-07-07	57.22239	Mixed
M1	10458	9000	860000	2009-07-15	95.55556	Civic
A5	10469	1512	370000	2009-05-04	244.70899	Residential
A5	10469	1701	258570	2009-06-04	152.01058	Residential
S1	10469	1889	325000	2009-09-04	172.04870	Mixed
A1	10469	2232	349990	2009-10-09	156.80556	Residential
A5	10469	1681	260000	2009-09-17	154.66984	Residential
A5	10469	1920	312000	2009-06-12	162.50000	Residential
A5	10469	1920	315000	2009-08-05	164.06250	Residential
A5	10469	1920	308510	2009-10-28	160.68229	Residential
A5	10469	1742	330000	2009-03-30	189.43743	Residential
A5	10469	1440	310000	2009-04-17	215.27778	Residential
A5	10469	1778	380000	2009-03-25	213.72328	Residential
A5	10469	1280	300000	2009-10-05	234.37500	Residential
A5	10469	1320	364000	2009-01-12	275.75758	Residential
A5	10469	1890	423470	2009-06-18	224.05820	Residential
A2	10469	1896	395000	2009-04-13	208.33333	Residential
A5	10469	1534	250000	2009-06-08	162.97262	Residential
A2	10469	779	125000	2009-08-24	160.46213	Residential
A2	10469	396	113645	2009-07-27	286.98232	Residential
A1	10469	1288	400000	2009-06-25	310.55901	Residential
A2	10469	1140	135000	2009-09-30	118.42105	Residential
A2	10469	2234	320000	2009-07-23	143.24082	Residential
A2	10469	1000	325943	2009-09-03	325.94300	Residential
B3	10469	4019	165000	2009-11-09	41.05499	Residential

3.3.3 Grouping the Real Estate Data

Now that we have the real estate geocoded and cleaned, we can group it by aggregating all sales that took place in the same zip code and during the same month. We will create 3 new variables to describe each group: average price per square foot (avg_price_per_sqft), total number of sales (num_sales), and total proceeds from all sales (total_proceeds).

load("cleaned_housing.rda")

grouped_housing_all = cleaned_df_housing %>%
    mutate(month_char = format(as.Date(sale_date), "%Y-%m")) %>%
    group_by(zip_code, month_char) %>%
    summarize(
        category = "All",
        avg_price_per_sqft = mean(price_per_sqft),
        num_sales = n(),
        total_proceeds = sum(sale_price)
    )

Next, we peform the same grouping function again, but with building category as an extra grouping variable. This will allow us to filter the displayed data by building category in the UI for our visualization. We combine the groups of individual categories with the groups of all categories to create a single data frame. Then we filter high-sales outliers and save this data frame for visualization in our Shiny app.

grouped_housing = cleaned_df_housing %>%
    mutate(month_char = format(as.Date(sale_date), "%Y-%m")) %>%
    group_by(zip_code, month_char, category) %>%
    summarize(
        avg_price_per_sqft = mean(price_per_sqft),
        num_sales = n(),
        total_proceeds = sum(sale_price)
    ) %>%
    bind_rows(grouped_housing_all) %>%
    filter(total_proceeds < 200000000) %>%
    data.frame()

save(grouped_housing, file="grouped_housing.rda")

**grouped_housing**
zip_code	month_char	category	avg_price_per_sqft	num_sales	total_proceeds
10001	2003-01	Commercial	277.77778	1	2200000
10001	2003-01	Mixed	418.26420	1	2400000
10001	2003-01	Residential	201.08860	1	665000
10001	2003-02	Commercial	391.52249	7	65050000
10001	2003-03	Commercial	267.30755	1	10900000
10001	2003-04	Commercial	541.50438	2	4670000
10001	2003-04	Residential	295.63932	1	2000000
10001	2003-06	Commercial	258.90032	2	4780000
10001	2003-07	Commercial	298.33823	1	8025000
10001	2003-09	Commercial	506.40168	1	2650000
10001	2003-09	Mixed	377.17284	1	1150000
10001	2003-10	Commercial	148.92465	1	11550000
10001	2003-11	Commercial	108.53280	2	4263310
10001	2003-12	Commercial	77.25308	2	5234000
10001	2004-01	Commercial	84.60919	1	210000
10001	2004-01	Industrial	179.62312	1	38319000
10001	2004-02	Commercial	456.69442	3	17785074
10001	2004-02	Mixed	314.91003	1	1225000
10001	2004-03	Commercial	275.29213	2	3550000
10001	2004-04	Commercial	603.31827	10	58435000
10001	2004-04	Residential	143.24058	2	19900000
10001	2004-06	Commercial	132.41494	5	96970000
10001	2004-07	Commercial	177.95878	3	18380000
10001	2004-08	Civic	822.78481	1	11375000
10001	2004-08	Residential	468.70275	1	1550000
10001	2004-09	Commercial	154.36808	3	14033333
10001	2004-10	Commercial	200.30904	2	8350000
10001	2004-10	Residential	616.69386	2	4625000
10001	2005-01	Commercial	217.81244	4	12475000
10001	2005-01	Mixed	437.01799	1	1700000
10001	2005-01	Residential	321.61235	1	1875000
10001	2005-02	Commercial	133.75306	2	7700000
10001	2005-02	Residential	483.55932	2	9933590
10001	2005-03	Commercial	100.31802	2	16800000
10001	2005-03	Residential	134.52915	1	3000000
10001	2005-04	Commercial	285.02850	1	3800000
10001	2005-05	Civic	85.59236	1	940660
10001	2005-05	Commercial	139.86014	1	1000000
10001	2005-06	Commercial	667.11141	1	4000000
10001	2005-07	Commercial	602.09691	2	28500000
10001	2005-07	Residential	419.50293	2	8350000
10001	2005-09	Commercial	280.31495	4	120277465
10001	2005-10	Commercial	218.40519	1	31750000
10001	2005-11	Commercial	226.83706	1	35500000
10001	2005-12	Commercial	35.30153	1	3605875
10001	2005-12	Mixed	332.08247	2	3818622
10001	2006-01	Commercial	175.38672	2	32525000
10001	2006-01	Residential	132.40123	1	12500000
10001	2006-02	Commercial	638.95109	3	32350000
10001	2006-02	Residential	329.28586	1	17600000

3.4 Creating a single dataframe for modeling

3.4.1 Section of housing sales data

First, load the 2 relevant files. Data_Housing.rda (df_housing) contains all the housing sales data. Data_Zip_Population.rda (zip_population) contains the census data for each zip code.

load(file="Data_Housing.rda")
load(file="Data_Zip_Population.rda")

Next, remove rows which are either blank or have NA values in the sale_price or land_square_feet column of df_housing.

# Only sales prices > 0 for housing data

# Only take > 0; source: https://stackoverflow.com/questions/33535006/remove-rows-if-it-is-negative-number
df <- df_housing[df_housing$land_square_feet > 0, ]
df <- df[df$sale_price > 0, ]

# Get rid of NA values; source: https://stackoverflow.com/questions/48658832/how-to-remove-row-if-it-has-a-na-value-in-one-certain-column/48659101
df <- df[!is.na(df$sale_price), ]
df <- df[!is.na(df$land_square_feet), ]

Create a new dataframe, df_sale, which containse the relevant sales price prediction input variables from the df_housing dataframe. Add 2 more column, sale_year and sale_month, which contain the year and month of the sale. These values are extracted from the sale_date column by using the functions year() and month() respectively.
Add a new column, crime_score_year. If the sale was completed in the months of January, February or March, add the year just before the sale_year in the crime_score_year column. If the sale was completed in the rest of the months, crime_score_year = sale_year. This is done to map the relevant year’s crime score for each sale’s analysis. For e.g. if a sale happens in January of 2006, then it doesn’t make sense to add the crime score for 2006. Crimes happening in 2005 would have affected that sale’s price more than 2006.

library(lubridate)

#Identify the input variables for sales price prediction
#Zipcode (will be used to join census and crime data), land_square_feet, gross_square_feet, year built, tax_class_at_time_of_sale, building_class_at_time_of_sale, sale_date (will be used to link crime rating for that year)
#Subset new dataframe with relevant variables
df_sale <- subset(df, select = c(zip_code,land_square_feet, gross_square_feet, year_built, tax_class_at_time_of_sale, building_class_at_time_of_sale, sale_date, sale_price))
df_sale$sale_year <- year(df_sale$sale_date) 
df_sale$sale_month <- month(df_sale$sale_date)
#If sale is in months 1, 2, 3 then crime score will be taken from previous year else crime score of that year
df_sale$crime_score_year <- ifelse(df_sale$sale_month == 1 | df_sale$sale_month == 2 | df_sale$sale_month == 3, df_sale$sale_year-1, df_sale$sale_year)

3.4.2 Adding the section about census data

Now, we will add the census data to the housing sales data. First find the zipcodes which are present in both df_sale and zip_population dataframes. Then add the df_sale and zip_population dataframes by these unique zipcodes through the following code. This merged dataframe is saved as df_sale_census.

#Adding the census data by linking through zipcode
#Taking a subset of df_sale for only the zip codes for which we have census data
zip_codes_inhousing <- unique(df_sale$zip_code)
zip_codes_incensus <- unique(zip_population$zip_code)
zip_codes_inhousing_incensus <- intersect(zip_codes_inhousing, zip_codes_incensus)

#Merging the sales and census data into one dataframe
df_sale$zip_code <- as.character(df_sale$zip_code)
df_sale_census <- merge(df_sale[df_sale$zip_code %in% zip_codes_inhousing_incensus, ], zip_population[zip_population$zip_code %in% zip_codes_inhousing_incensus, ], by = "zip_code")

3.4.3 Adding the crime score

Load the file Data_Score_by_year_and_zipcode.rda (score_by_zip_and_year) that was created in section “04B: Data_Processing_Month_Year”. Find the zipcodes that are present in both df_sale_census and score_by_zip_and_year and store them in zip_codes_inhousing_incensus_incrime. Merge the df_sale_census and score_by_zip_year dataframes by zipcodes and store them in df_sale_census_crime. Remove all rows from this final dataframe which have a zero value and save this dataframe as Data_sale_census_crime.rda for future analysis.

load(file = "Data_Score_by_year_and_zipcode.rda")
#Adding the crime score data for each year
zip_codes_inhousing_incensus_incrime <- intersect(unique(df_sale_census$zip_code), unique(score_by_zip_and_year$zip_code))
colnames(score_by_zip_and_year) <- c("zip_code", "crime_score_year", "sum_weight", "weight", "TotalPop")
df_sale_census_crime <- merge(df_sale_census[df_sale_census$zip_code %in% zip_codes_inhousing_incensus_incrime, ], score_by_zip_and_year[score_by_zip_and_year$zip_code %in% zip_codes_inhousing_incensus_incrime, ], by = c("zip_code", "crime_score_year"))
df_sale_census_crime <- na.omit(df_sale_census_crime)
#any(is.na(df_sale_census_crime))

#Remove values of 0
#source: https://stackoverflow.com/questions/9977686/how-to-remove-rows-with-any-zero-value
zero_rows = apply(df_sale_census_crime, 1, function(row) all(row != 0))
df_sale_census_crime <- df_sale_census_crime[zero_rows, ]

save(df_sale_census_crime, file = "Data_sale_census_crime.rda")

**Final Dataframe**
zip_code	crime_score_year	land_square_feet	gross_square_feet	year_built	tax_class_at_time_of_sale	building_class_at_time_of_sale	sale_date	sale_price	sale_year	sale_month	TotalPop.x	Men	Women	Hispanic	White	Black	Native	Asian	PerCapitaIncome	Unemployed	sum_weight	weight	TotalPop.y
10001	2002	4937	54000	1912	4	L1	2003-02-12	5e+06	2003	2	127587350	61690470	65896880	3.824723	74.48506	2.424281	1.450839	15.4563	114905.6	4.76503	19	1e-07	127587350
10001	2002	4937	54000	1912	4	L1	2003-02-12	13500000	2003	2	127587350	61690470	65896880	3.824723	74.48506	2.424281	1.450839	15.4563	114905.6	4.76503	19	1e-07	127587350
10001	2002	1480	7060	1920	4	L9	2003-02-07	2150000	2003	2	127587350	61690470	65896880	3.824723	74.48506	2.424281	1.450839	15.4563	114905.6	4.76503	19	1e-07	127587350
10001	2002	2469	4494	1901	2	S4	2003-01-03	24375	2003	1	127587350	61690470	65896880	3.824723	74.48506	2.424281	1.450839	15.4563	114905.6	4.76503	19	1e-07	127587350
10001	2002	2345	7035	1920	4	K9	2003-01-03	46687	2003	1	127587350	61690470	65896880	3.824723	74.48506	2.424281	1.450839	15.4563	114905.6	4.76503	19	1e-07	127587350
10001	2002	1975	5738	1910	2	S5	2003-01-17	2400000	2003	1	127587350	61690470	65896880	3.824723	74.48506	2.424281	1.450839	15.4563	114905.6	4.76503	19	1e-07	127587350

4 Regression Modeling

The regression model helps to establish a baseline. Unfortunately, when working with a large set of data, there are issues faced when running regressions:

Large dataset - it takes a long time to run the regression
P-values - they go to zero quickly (they all became significant)
Mean-squared error (MSE) - the mean square errors of the predicted values are high, despite the amount of data

Each of the issues can be respectively addressed:

A function called fastLm removes overhead from the regression
The data can be partitioned, and the accuracy prediction can be analyzed (MSE)
Machine learning models can be used to capture data granularity (see machine learning code section)

4.1 Code to setup and run regressions on combined crime data

The steps for setting up the regression, running the model, and making predictions are in the following sections.

4.1.1 Load packages and import data

First, the packages are loaded:

library(dplyr)
library(RcppArmadillo)
library(lubridate)

The desired data is imported into R for the regression.

#Regress sale price with sale year, land area, gross area, tax class, building class, year built
load(file = "Data_sale_census_crime.rda")

4.1.2 Variable selection

The desired independent and dependent (sale_price) variables are selected with the code below and certain parameters are set to limit the amount of data.

# Make sure sale prices are numeric
df_sale_census_crime$sale_price <- as.numeric(df_sale_census_crime$sale_price)

# Remove large house sale prices; source: https://stackoverflow.com/questions/25764810/delete-rows-based-on-range-of-values-in-column
df_sale_census_crime <- df_sale_census_crime[with(df_sale_census_crime, sale_price <= 1000000), ]

#Regression 1
x <- df_sale_census_crime %>% select("land_square_feet", "PerCapitaIncome", "Unemployed", "TotalPop.x", "Men", "Women")

y <- df_sale_census_crime %>% select("sale_price")

4.1.3 Variable coercion

The variables are coerced into the proper types for the regression (e.g. factor variable).

x$land_square_feet <- as.numeric(x[[1]])
x$PerCapitaIncome <- as.numeric(x[[2]])
x$Unemployed <- as.numeric(x[[3]])
x$TotalPop.x <- as.numeric(x[[4]])
x$Men <- as.numeric(x[[5]])
x$Women <- as.numeric(x[[6]])

# Setting up factor variable
df_sale_census_crime$sale_year <- as.character(df_sale_census_crime$sale_year)
df_sale_census_crime$year_built <- as.character(df_sale_census_crime$year_built)

building_class <- model.matrix( ~ building_class_at_time_of_sale - 1, data=df_sale_census_crime )
sale_year <- model.matrix( ~ sale_year - 1, data=df_sale_census_crime )
year_built <- model.matrix( ~ year_built - 1, data=df_sale_census_crime )

# Binding factor variables
x<-cbind(x, building_class)
x<-cbind(x, sale_year)
x<-cbind(x, year_built)

# Add intercept column and renaming
x<-cbind(x, 1)
colnames(x)[dim(x)[2]] <- "Intercept" # Source: https://www.dummies.com/programming/r/how-to-name-matrix-rows-and-columns-in-r/

# Setup the y output variable
y <- as.numeric(y[[1]])

# Display the classes
sapply(x, class)
sapply(y, class)

4.1.4 Partition the data

The data is partitioned into training and test sets for running regressions and determining the accuracy of a model on the test set respectively. For this initial regression analysis, a subset of 80% of the data is used for training and the rest is used for testing the accuracy of the predictions.

# ----- Split data into training and validation sets ----- 
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- sort(sample(nrow(x), nrow(x)*.8))
# select 20% of the data for validation
x_test <- x[-validation_index,]
y_test <- y[-validation_index]
# use the remaining 80% of data to training and testing the models
x_train <- x[validation_index,]
y_train <- y[validation_index]

4.1.5 Run the regression using fastLm

The regression is run using fastLm to expedite the process and remove the overhead from the normal regression functions.

# Run regression
fit1 <- fastLm(x_train, y_train)

4.1.6 Summarize results

The results are summarized for analysis.

summary(fit1)
#R-squared value is 25.63%

# Perform predictions on test set
reg_pred1 <- predict(fit1, as.matrix(x_test))

# See the comparison
results <- data.frame(cbind(y_test, reg_pred1))
results$diff <- results$reg_pred1 - results$y_test
results$diff_squared <- '^' (results$diff, 2)
# RMS = 
rms = sqrt(mean(results$diff_squared))
rms

# RMS Error = 
rms_error = rms/mean(results$y_test)
rms_error

4.2 Results from the regression models

The results that were found when the regression was run for this project are summarized:

Measure	Value
Adjusted R-squared value ($R^2$)	27.26%
Root Mean Squared (RMS)	198,239.1
Root Mean Squared Error (RMSE)	43.59%

The more detailed output from the regression model is summarized below. While it may be possible to improve the regression results by running further regressions, the machine learning models explored next will likely help to capture the granularity in the data.

Regression output

> # Run regression
> fit1 <- fastLm(x_train, y_train)
> summary(fit1)

Call:
fastLm.default(X = x_train, y = y_train)

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-951240.0  -99101.0    2810.9  110820.0 1088100.0 

                                    Estimate      StdErr     t.value   p.value    
land_square_feet                  2.8982e-02  7.3239e-03  3.9572e+00 7.583e-05 ***
PerCapitaIncome                   1.5932e+00  5.5788e-02  2.8558e+01 < 2.2e-16 ***
Unemployed                       -1.2420e+04  1.6778e+02 -7.4027e+01 < 2.2e-16 ***
TotalPop.x                       -6.1569e+09  2.5369e-01 -2.4269e+10 < 2.2e-16 ***
Men                               6.1569e+09  2.4254e-01  2.5385e+10 < 2.2e-16 ***
Women                             6.1569e+09  2.9053e-01  2.1192e+10 < 2.2e-16 ***
building_class_at_time_of_saleA0  1.3952e+05  3.7534e+03  3.7171e+01 < 2.2e-16 ***
building_class_at_time_of_saleA1  1.0504e+05  1.4056e+03  7.4727e+01 < 2.2e-16 ***
building_class_at_time_of_saleA2  6.4862e+04  1.7563e+03  3.6931e+01 < 2.2e-16 ***
building_class_at_time_of_saleA3  3.3492e+05  5.4030e+03  6.1988e+01 < 2.2e-16 ***
building_class_at_time_of_saleA4  1.9749e+05  6.5229e+03  3.0276e+01 < 2.2e-16 ***
building_class_at_time_of_saleA5  3.9112e+04  1.4470e+03  2.7030e+01 < 2.2e-16 ***
building_class_at_time_of_saleA6 -1.5064e+05  7.0867e+03 -2.1257e+01 < 2.2e-16 ***
building_class_at_time_of_saleA7  1.8340e+05  5.3564e+01  3.4239e+03 < 2.2e-16 ***
building_class_at_time_of_saleA8 -1.1559e+05  1.4973e+01 -7.7196e+03 < 2.2e-16 ***
building_class_at_time_of_saleA9  6.3681e+04  2.0931e+03  3.0424e+01 < 2.2e-16 ***
building_class_at_time_of_saleB1  1.7355e+05  1.5154e+03  1.1453e+02 < 2.2e-16 ***
building_class_at_time_of_saleB2  1.5230e+05  1.4673e+03  1.0380e+02 < 2.2e-16 ***
building_class_at_time_of_saleB3  1.3706e+05  1.5737e+03  8.7097e+01 < 2.2e-16 ***
building_class_at_time_of_saleB9  1.2594e+05  2.0994e+03  5.9986e+01 < 2.2e-16 ***
building_class_at_time_of_saleC0  2.1164e+05  1.5671e+03  1.3505e+02 < 2.2e-16 ***
building_class_at_time_of_saleC1  2.3698e+05  4.3097e+03  5.4987e+01 < 2.2e-16 ***
building_class_at_time_of_saleC2  1.9560e+05  2.9009e+03  6.7429e+01 < 2.2e-16 ***
building_class_at_time_of_saleC3  2.1607e+05  2.6559e+03  8.1355e+01 < 2.2e-16 ***
building_class_at_time_of_saleC4  1.3695e+05  9.0729e+03  1.5095e+01 < 2.2e-16 ***
building_class_at_time_of_saleC5  1.9362e+05  6.7181e+03  2.8821e+01 < 2.2e-16 ***
building_class_at_time_of_saleC6 -1.3938e+05  3.8236e+02 -3.6451e+02 < 2.2e-16 ***
building_class_at_time_of_saleC7  1.6761e+05  7.7880e+03  2.1522e+01 < 2.2e-16 ***
building_class_at_time_of_saleC8 -3.9534e+05  3.8622e+00 -1.0236e+05 < 2.2e-16 ***
building_class_at_time_of_saleC9 -5.3525e+04  1.4023e+02 -3.8170e+02 < 2.2e-16 ***
building_class_at_time_of_saleCM -1.0373e+01          NA          NA        NA    
building_class_at_time_of_saleD0  8.2579e+04  2.5786e+01  3.2024e+03 < 2.2e-16 ***
building_class_at_time_of_saleD1 -3.0391e+04  4.1447e+02 -7.3326e+01 < 2.2e-16 ***
building_class_at_time_of_saleD2 -4.5215e+05  7.4887e+00 -6.0378e+04 < 2.2e-16 ***
building_class_at_time_of_saleD3 -1.8254e+05  9.2733e+01 -1.9684e+03 < 2.2e-16 ***
building_class_at_time_of_saleD4 -1.6205e+05  7.7115e+03 -2.1014e+01 < 2.2e-16 ***
building_class_at_time_of_saleD5  9.2455e+04  7.8430e+00  1.1788e+04 < 2.2e-16 ***
building_class_at_time_of_saleD6 -1.5591e+05  6.7311e+01 -2.3162e+03 < 2.2e-16 ***
building_class_at_time_of_saleD7 -1.0718e+05  9.0214e+01 -1.1881e+03 < 2.2e-16 ***
building_class_at_time_of_saleD8 -2.3003e+04  1.2368e+01 -1.8598e+03 < 2.2e-16 ***
building_class_at_time_of_saleD9 -1.9105e+05  7.6408e+01 -2.5004e+03 < 2.2e-16 ***
building_class_at_time_of_saleE1  1.5457e+05  4.4341e+02  3.4859e+02 < 2.2e-16 ***
building_class_at_time_of_saleE2  4.3728e+05  1.0070e+01  4.3426e+04 < 2.2e-16 ***
building_class_at_time_of_saleE3  1.4641e+05  3.0501e+02  4.8002e+02 < 2.2e-16 ***
building_class_at_time_of_saleE4  1.0907e+05  5.2813e+01  2.0653e+03 < 2.2e-16 ***
building_class_at_time_of_saleE7 -2.9949e+05  1.9951e+01 -1.5011e+04 < 2.2e-16 ***
building_class_at_time_of_saleE9  1.1786e+05  8.8237e+03  1.3357e+01 < 2.2e-16 ***
building_class_at_time_of_saleF1  1.5106e+05  1.6581e+02  9.1102e+02 < 2.2e-16 ***
building_class_at_time_of_saleF2  2.1667e+05  3.0781e+01  7.0391e+03 < 2.2e-16 ***
building_class_at_time_of_saleF4  1.6584e+05  6.3631e+02  2.6062e+02 < 2.2e-16 ***
building_class_at_time_of_saleF5  9.9344e+04  1.5017e+02  6.6155e+02 < 2.2e-16 ***
building_class_at_time_of_saleF8 -2.8220e+05  1.3758e+00 -2.0512e+05 < 2.2e-16 ***
building_class_at_time_of_saleF9  1.5692e+05  8.5776e+03  1.8295e+01 < 2.2e-16 ***
building_class_at_time_of_saleG0 -7.4920e+04  9.3074e+02 -8.0494e+01 < 2.2e-16 ***
building_class_at_time_of_saleG1  1.0360e+05  1.7836e+02  5.8084e+02 < 2.2e-16 ***
building_class_at_time_of_saleG2  1.3491e+05  7.8585e+03  1.7168e+01 < 2.2e-16 ***
building_class_at_time_of_saleG3  2.1477e+05  1.5108e+01  1.4215e+04 < 2.2e-16 ***
building_class_at_time_of_saleG4  2.3700e+05  1.2326e+02  1.9228e+03 < 2.2e-16 ***
building_class_at_time_of_saleG5  1.4077e+05  5.4218e+01  2.5963e+03 < 2.2e-16 ***
building_class_at_time_of_saleG6  1.1741e+05  4.5981e+01  2.5535e+03 < 2.2e-16 ***
building_class_at_time_of_saleG7 -1.6497e+04  4.4057e+03 -3.7444e+00 0.0001808 ***
building_class_at_time_of_saleG8  3.1372e+03  2.6921e+01  1.1653e+02 < 2.2e-16 ***
building_class_at_time_of_saleG9  1.2263e+05  7.6113e+03  1.6111e+01 < 2.2e-16 ***
building_class_at_time_of_saleGU  2.8203e+05  2.6368e+00  1.0696e+05 < 2.2e-16 ***
building_class_at_time_of_saleGW  2.0901e+05  9.7865e+00  2.1357e+04 < 2.2e-16 ***
building_class_at_time_of_saleH1 -1.7314e+00  2.4147e-12 -7.1701e+11 < 2.2e-16 ***
building_class_at_time_of_saleH2 -5.1859e+05  4.9986e+03 -1.0375e+02 < 2.2e-16 ***
building_class_at_time_of_saleH3 -5.6200e+05  8.2590e+03 -6.8046e+01 < 2.2e-16 ***
building_class_at_time_of_saleH4  1.6690e+05  1.9417e+01  8.5958e+03 < 2.2e-16 ***
building_class_at_time_of_saleH6  1.1853e+04  5.5558e+00  2.1334e+03 < 2.2e-16 ***
building_class_at_time_of_saleH7 -2.1248e+05  1.0631e+01 -1.9987e+04 < 2.2e-16 ***
building_class_at_time_of_saleH8  7.7614e-01          NA          NA        NA    
building_class_at_time_of_saleH9  2.4038e+04  2.7452e+01  8.7564e+02 < 2.2e-16 ***
building_class_at_time_of_saleHR  2.6907e+05  2.6293e+01  1.0234e+04 < 2.2e-16 ***
building_class_at_time_of_saleI1 -4.2960e+05  1.5147e+01 -2.8361e+04 < 2.2e-16 ***
building_class_at_time_of_saleI3  2.0919e+05  2.7345e+00  7.6501e+04 < 2.2e-16 ***
building_class_at_time_of_saleI4 -5.5951e+04  1.7755e+01 -3.1513e+03 < 2.2e-16 ***
building_class_at_time_of_saleI5  1.7474e+05  5.9982e+01  2.9132e+03 < 2.2e-16 ***
building_class_at_time_of_saleI6  1.1961e+05  2.4592e+01  4.8636e+03 < 2.2e-16 ***
building_class_at_time_of_saleI7  6.3836e+04  1.0070e+02  6.3393e+02 < 2.2e-16 ***
building_class_at_time_of_saleI9  8.6528e+04  4.7252e+01  1.8312e+03 < 2.2e-16 ***
building_class_at_time_of_saleJ1 -2.8717e+05  9.3349e+00 -3.0763e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ2 -4.9867e+05  5.4591e+00 -9.1348e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ4 -9.4846e+04  2.5902e+01 -3.6618e+03 < 2.2e-16 ***
building_class_at_time_of_saleJ6 -5.4367e+05  2.1236e+01 -2.5602e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ7 -4.6257e+05  3.0157e+00 -1.5339e+05 < 2.2e-16 ***
building_class_at_time_of_saleJ9 -8.2693e-01          NA          NA        NA    
building_class_at_time_of_saleK1  1.3066e+05  5.0570e+03  2.5838e+01 < 2.2e-16 ***
building_class_at_time_of_saleK2  1.8075e+05  8.3157e+03  2.1736e+01 < 2.2e-16 ***
building_class_at_time_of_saleK3 -4.7864e+05  1.6826e+00 -2.8447e+05 < 2.2e-16 ***
building_class_at_time_of_saleK4  2.0110e+05  7.0629e+03  2.8473e+01 < 2.2e-16 ***
building_class_at_time_of_saleK5  1.5896e+05  1.5770e+02  1.0080e+03 < 2.2e-16 ***
building_class_at_time_of_saleK6  5.7383e+04  1.8493e+01  3.1029e+03 < 2.2e-16 ***
building_class_at_time_of_saleK7  1.8497e+05  4.8547e+01  3.8101e+03 < 2.2e-16 ***
building_class_at_time_of_saleK8 -3.5744e+05  2.2884e+01 -1.5620e+04 < 2.2e-16 ***
building_class_at_time_of_saleK9  1.6681e+05  8.5161e+03  1.9588e+01 < 2.2e-16 ***
building_class_at_time_of_saleL1 -3.2798e+05  8.1924e+01 -4.0035e+03 < 2.2e-16 ***
building_class_at_time_of_saleL2 -4.6219e+05  2.0444e+01 -2.2608e+04 < 2.2e-16 ***
building_class_at_time_of_saleL3 -1.0495e+00          NA          NA        NA    
building_class_at_time_of_saleL8 -7.9144e+04  1.0960e+02 -7.2209e+02 < 2.2e-16 ***
building_class_at_time_of_saleL9  4.3745e+04  3.5515e+01  1.2318e+03 < 2.2e-16 ***
building_class_at_time_of_saleM1  1.5553e+05  6.4322e+02  2.4180e+02 < 2.2e-16 ***
building_class_at_time_of_saleM2  4.7448e+05  1.3285e+00  3.5715e+05 < 2.2e-16 ***
building_class_at_time_of_saleM3  1.8350e+05  2.2997e+01  7.9795e+03 < 2.2e-16 ***
building_class_at_time_of_saleM4  3.5926e+05  1.0860e+01  3.3079e+04 < 2.2e-16 ***
building_class_at_time_of_saleM9  1.7297e+05  1.9606e+02  8.8223e+02 < 2.2e-16 ***
building_class_at_time_of_saleN1 -2.5565e+05  1.2544e+00 -2.0381e+05 < 2.2e-16 ***
building_class_at_time_of_saleN2 -1.0812e+05  3.7291e+01 -2.8993e+03 < 2.2e-16 ***
building_class_at_time_of_saleN3  3.8823e+05  1.4154e+00  2.7429e+05 < 2.2e-16 ***
building_class_at_time_of_saleN4  2.5708e+05  2.3806e+00  1.0799e+05 < 2.2e-16 ***
building_class_at_time_of_saleN9  7.9835e+04  4.0318e+01  1.9802e+03 < 2.2e-16 ***
building_class_at_time_of_saleO1  9.8493e+04  5.4205e+01  1.8170e+03 < 2.2e-16 ***
building_class_at_time_of_saleO2  5.6520e+04  8.4064e+01  6.7235e+02 < 2.2e-16 ***
building_class_at_time_of_saleO3 -2.5890e+05  7.4544e+01 -3.4732e+03 < 2.2e-16 ***
building_class_at_time_of_saleO4 -2.5858e+05  4.7554e+01 -5.4377e+03 < 2.2e-16 ***
building_class_at_time_of_saleO5  2.0678e+05  6.5123e+01  3.1753e+03 < 2.2e-16 ***
building_class_at_time_of_saleO6 -1.8394e+05  1.6759e+02 -1.0975e+03 < 2.2e-16 ***
building_class_at_time_of_saleO7  1.5411e+05  9.9111e+02  1.5549e+02 < 2.2e-16 ***
building_class_at_time_of_saleO8  1.8789e+05  2.4518e+02  7.6634e+02 < 2.2e-16 ***
building_class_at_time_of_saleO9  1.2536e+05  9.2309e+03  1.3580e+01 < 2.2e-16 ***
building_class_at_time_of_saleP1 -5.0139e+05  4.3587e+00 -1.1503e+05 < 2.2e-16 ***
building_class_at_time_of_saleP2  1.6676e+05  4.9800e+01  3.3487e+03 < 2.2e-16 ***
building_class_at_time_of_saleP3  4.0284e+05  9.5488e+00  4.2188e+04 < 2.2e-16 ***
building_class_at_time_of_saleP5  1.1493e+05  3.6320e+01  3.1643e+03 < 2.2e-16 ***
building_class_at_time_of_saleP6  4.5337e+05  1.9562e+01  2.3177e+04 < 2.2e-16 ***
building_class_at_time_of_saleP7  2.5193e+05  3.6547e+00  6.8933e+04 < 2.2e-16 ***
building_class_at_time_of_saleP8 -2.4175e+04  1.0560e+01 -2.2893e+03 < 2.2e-16 ***
building_class_at_time_of_saleP9  9.8652e+04  4.2028e+01  2.3473e+03 < 2.2e-16 ***
building_class_at_time_of_saleQ1 -6.3134e+05  2.0798e+01 -3.0355e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ2 -1.9214e+04  7.9179e+00 -2.4267e+03 < 2.2e-16 ***
building_class_at_time_of_saleQ3 -2.1168e+05  3.9467e+00 -5.3634e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ8  3.3311e+05  5.2588e+00  6.3344e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ9  5.6611e+04  2.5971e+01  2.1798e+03 < 2.2e-16 ***
building_class_at_time_of_saleR0 -6.2189e+04  4.0096e+01 -1.5510e+03 < 2.2e-16 ***
building_class_at_time_of_saleR1 -2.1086e+05  8.5324e+00 -2.4713e+04 < 2.2e-16 ***
building_class_at_time_of_saleR4  1.0408e+05  2.3851e+00  4.3635e+04 < 2.2e-16 ***
building_class_at_time_of_saleR9 -1.3032e+05  7.4740e+01 -1.7436e+03 < 2.2e-16 ***
building_class_at_time_of_saleRR -2.9996e+05  4.0817e+01 -7.3490e+03 < 2.2e-16 ***
building_class_at_time_of_saleS0  1.3380e+05  1.0122e+02  1.3220e+03 < 2.2e-16 ***
building_class_at_time_of_saleS1  1.2850e+05  3.9465e+03  3.2560e+01 < 2.2e-16 ***
building_class_at_time_of_saleS2  1.7599e+05  2.8663e+03  6.1399e+01 < 2.2e-16 ***
building_class_at_time_of_saleS3  1.9130e+05  6.4023e+03  2.9880e+01 < 2.2e-16 ***
building_class_at_time_of_saleS4  2.2695e+05  6.7895e+03  3.3426e+01 < 2.2e-16 ***
building_class_at_time_of_saleS5  2.2889e+05  7.6276e+03  3.0009e+01 < 2.2e-16 ***
building_class_at_time_of_saleS9  2.2762e+05  6.1226e+03  3.7177e+01 < 2.2e-16 ***
building_class_at_time_of_saleT2  2.3426e+04  2.9571e+00  7.9217e+03 < 2.2e-16 ***
building_class_at_time_of_saleT9 -6.9349e+04  8.5254e+00 -8.1344e+03 < 2.2e-16 ***
building_class_at_time_of_saleU7 -1.1867e+05  3.5107e+00 -3.3803e+04 < 2.2e-16 ***
building_class_at_time_of_saleV0 -2.2715e+04  3.7162e+03 -6.1123e+00 9.828e-10 ***
building_class_at_time_of_saleV1 -6.4640e+04  7.6833e+03 -8.4130e+00 < 2.2e-16 ***
building_class_at_time_of_saleV2 -5.6667e+04  7.3045e+01 -7.7578e+02 < 2.2e-16 ***
building_class_at_time_of_saleV3 -6.8084e+04  9.0265e+01 -7.5427e+02 < 2.2e-16 ***
building_class_at_time_of_saleV5  3.5594e+05  4.0220e+00  8.8499e+04 < 2.2e-16 ***
building_class_at_time_of_saleV9 -1.1566e+05  6.3938e+01 -1.8090e+03 < 2.2e-16 ***
building_class_at_time_of_saleW1 -1.3223e+05  2.2746e+01 -5.8133e+03 < 2.2e-16 ***
building_class_at_time_of_saleW2  6.7435e+04  2.2155e+01  3.0438e+03 < 2.2e-16 ***
building_class_at_time_of_saleW3 -4.6763e+04  4.0345e+01 -1.1591e+03 < 2.2e-16 ***
building_class_at_time_of_saleW4 -7.7198e+04  2.8433e+00 -2.7151e+04 < 2.2e-16 ***
building_class_at_time_of_saleW6 -5.9114e+05  5.1179e+00 -1.1550e+05 < 2.2e-16 ***
building_class_at_time_of_saleW7 -3.1586e+05  8.8954e+00 -3.5508e+04 < 2.2e-16 ***
building_class_at_time_of_saleW8  2.2198e+05  2.7235e+01  8.1506e+03 < 2.2e-16 ***
building_class_at_time_of_saleW9  1.4626e+05  4.4200e+01  3.3089e+03 < 2.2e-16 ***
building_class_at_time_of_saleY1 -2.5078e+05  1.5060e+00 -1.6652e+05 < 2.2e-16 ***
building_class_at_time_of_saleY2 -3.4598e+05  7.8875e+00 -4.3864e+04 < 2.2e-16 ***
building_class_at_time_of_saleY4 -3.1975e+05  1.7551e+00 -1.8218e+05 < 2.2e-16 ***
building_class_at_time_of_saleY5 -4.7741e+05  3.0906e+00 -1.5447e+05 < 2.2e-16 ***
building_class_at_time_of_saleY6 -1.1746e+05  1.9593e+01 -5.9949e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ0 -1.3031e+05  4.8316e+01 -2.6969e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ1 -3.3428e+05  3.9923e+00 -8.3731e+04 < 2.2e-16 ***
building_class_at_time_of_saleZ2 -2.1807e+05  2.8958e+01 -7.5306e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ3 -1.2179e+05  3.0810e+00 -3.9531e+04 < 2.2e-16 ***
building_class_at_time_of_saleZ8 -2.0401e+05  1.4632e+00 -1.3942e+05 < 2.2e-16 ***
building_class_at_time_of_saleZ9  4.2683e+04  7.9866e+03  5.3443e+00 9.081e-08 ***
sale_year2003                    -1.0892e+05  1.1024e+03 -9.8802e+01 < 2.2e-16 ***
sale_year2004                    -5.1748e+04  1.0476e+03 -4.9395e+01 < 2.2e-16 ***
sale_year2005                     1.2800e+04  1.0659e+03  1.2008e+01 < 2.2e-16 ***
sale_year2006                     5.6075e+04  1.1214e+03  5.0002e+01 < 2.2e-16 ***
sale_year2007                     5.4919e+04  1.2462e+03  4.4070e+01 < 2.2e-16 ***
sale_year2008                     8.8428e+03  1.3830e+03  6.3940e+00 1.618e-10 ***
sale_year2009                    -3.6373e+04  1.4479e+03 -2.5120e+01 < 2.2e-16 ***
sale_year2010                    -3.7572e+04  1.4401e+03 -2.6089e+01 < 2.2e-16 ***
sale_year2011                    -3.6054e+04  1.5219e+03 -2.3690e+01 < 2.2e-16 ***
sale_year2012                    -2.8094e+04  1.4739e+03 -1.9061e+01 < 2.2e-16 ***
sale_year2013                    -8.9054e+03  1.3906e+03 -6.4040e+00 1.515e-10 ***
sale_year2014                     1.5791e+04  1.3950e+03  1.1320e+01 < 2.2e-16 ***
sale_year2015                     3.2799e+04  1.3994e+03  2.3437e+01 < 2.2e-16 ***
sale_year2016                     6.7075e+04  2.3071e+03  2.9073e+01 < 2.2e-16 ***
sale_year2017                     7.8186e+04  1.7508e+03  4.4658e+01 < 2.2e-16 ***
sale_year2019                     1.3103e+05  1.6770e+03  7.8134e+01 < 2.2e-16 ***
sale_year2020                     1.5247e+05  3.2118e+03  4.7472e+01 < 2.2e-16 ***
year_built0                      -1.3214e+04  4.1824e+03 -3.1593e+00 0.0015815 ** 
year_built1380                   -5.1417e+05  1.2867e+00 -3.9962e+05 < 2.2e-16 ***
year_built1800                   -2.3758e+04  7.1415e+01 -3.3268e+02 < 2.2e-16 ***
year_built1814                   -1.2841e+05  2.4966e+00 -5.1435e+04 < 2.2e-16 ***
year_built1825                   -5.7477e+05  2.6333e+00 -2.1828e+05 < 2.2e-16 ***
year_built1826                    9.3972e+04  4.4656e+00  2.1044e+04 < 2.2e-16 ***
year_built1827                    7.1977e+05  4.6640e+00  1.5432e+05 < 2.2e-16 ***
year_built1838                   -4.0203e+05  3.1975e+00 -1.2573e+05 < 2.2e-16 ***
year_built1839                   -5.6975e+05  2.6778e+00 -2.1277e+05 < 2.2e-16 ***
year_built1840                    2.8993e+04  1.3322e+00  2.1764e+04 < 2.2e-16 ***
 [ reached getOption("max.print") -- omitted 155 rows ]
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.963e+05 on 349122 degrees of freedom
Multiple R-squared: 0.2733, Adjusted R-squared: 0.2726
> #R-squared value is 25.63%
> 
> # Perform predictions on test set
> reg_pred1 <- predict(fit1, as.matrix(x_test))
> 
> # See the comparison
> results <- data.frame(cbind(y_test, reg_pred1))
> results$diff <- results$reg_pred1 - results$y_test
> results$diff_squared <- '^' (results$diff, 2)
> # RMS = 
> rms = sqrt(mean(results$diff_squared))
> rms
[1] 198239.1
> 
> # RMS Error = 
> rms_error = rms/mean(results$y_test)
> rms_error
[1] 0.4358773

5 Machine Learning Modeling

5.1 Preparing the dataframe for Machine Learning models

Add the required libraries. Also, load the file Data_sale_census_crime.rda (df_sale_census_crime) that was created in “04D: Data_Processing_Single_Dataframe”.

#Attach packages
library(caret)
library(dplyr)
library(lubridate)
library(deepnet)

load(file = "Data_sale_census_crime.rda")

Remove rows that have zero value in any column. Also, make sure that the sale_price column is numeric.

#Remove values of 0
#source: https://stackoverflow.com/questions/9977686/how-to-remove-rows-with-any-zero-value
zero_rows = apply(df_sale_census_crime, 1, function(row) all(row != 0))
df_sale_census_crime <- df_sale_census_crime[zero_rows, ]

# Make sure sale prices are numeric
df_sale_census_crime$sale_price <- as.numeric(df_sale_census_crime$sale_price)

Instead of looking at the entire data we will look at sale_price < $ 1M and 5M (90% and 97% of the data respectively). This value can be changed through this line of code.

source: https://stackoverflow.com/questions/25764810/delete-rows-based-on-range-of-values-in-column
df_sale_census_crime <- df_sale_census_crime[with(df_sale_census_crime, sale_price <= 5000000), ]

Instead of using the year_built of a house as an input variable, create another column, “age” of the building. This is calculated as below. Also, remember to remove all zero rows again.

#Adding a column about age of building 
df_sale_census_crime$age <- df_sale_census_crime$sale_year - df_sale_census_crime$year_built
#Remove negative age values
df_sale_census_crime <- df_sale_census_crime[age>0, ]

#Remove values of 0
#source: https://stackoverflow.com/questions/9977686/how-to-remove-rows-with-any-zero-value
zero_rows = apply(df_sale_census_crime, 1, function(row) all(row != 0))
df_sale_census_crime <- df_sale_census_crime[zero_rows, ]

Instead of using the entire database, use 1% of the data for trial run. This can be incrementally increased to 20% and then 100% of the data.

setup_stage <- sort(sample(nrow(df_sale_census_crime), nrow(df_sale_census_crime)*.01))
df_sale_census_crime <- df_sale_census_crime[setup_stage]

Make sure that gross_square_feet is numeric and all NA values are removed from the dataframe.

df_sale_census_crime$gross_square_feet <- as.numeric(df_sale_census_crime$gross_square_feet)
sapply(df_sale_census_crime, class)
df_sale_census_crime <- na.omit(df_sale_census_crime)
any(is.na(df_sale_census_crime))

5.2 Splitting the data into training and test data

Select data of sales from 2003 to 2013 for training and data from beyond 2013 for testing. Variable y is defined to include just the sale_price; variable y_test is the testing component of y. Variable is a matrix of all the input variables for the models; variable x_test is the testing component of x. Make sure that all the variables are formatted in the right format and all NA rows are removed before proceeding.

summary(df_sale_census_crime)
# select sales from 2003 to 2013 for validation and >2013 for testing
df_sale_census_crime_test <- df_sale_census_crime[sale_year > 2013,]
# use data from 2003 to 2013 as training data
df_sale_census_crime_train <- df_sale_census_crime[sale_year <= 2013,]

# Data setup

# Setup the y output variable
y <- df_sale_census_crime_train %>% select("sale_price")
y <- as.numeric(y[[1]]) # Making sure it's numeric
y_test <- df_sale_census_crime_test %>% select("sale_price")

# Setup the input variable
x <- df_sale_census_crime_train %>% select("gross_square_feet","PerCapitaIncome", "Unemployed", "TotalPop.x", "Hispanic", "White", "Black", "Native", "Asian", "age", "weight", "sale_year")
x_test <- df_sale_census_crime_test %>% select("gross_square_feet","PerCapitaIncome", "Unemployed", "TotalPop.x", "Hispanic", "White", "Black", "Native", "Asian", "age", "weight", "sale_year")
# NEED to make sure each column is the right format (numeric / factor as appropriate)

# BEFORE
sapply(x, class)

x$gross_square_feet <- as.numeric(x$gross_square_feet)
x$sale_year <- as.numeric(x$sale_year)

# AFTER (CHECK that these all make sense)
sapply(x, class)

#Same for x_test
sapply(x_test, class)
x_test$sale_year <- as.numeric(x_test$sale_year)
any(is.na(x_test))

5.3 Running the ML models

Set the control method of the model as cross validation (cv) or none. It is advisable to do 5-fold or 10-fold cross validation. Cross validation takes a much longer time to run then just running without any cross calidation. Root Mean Square (RMSE) is being used here as the parameter to optimize the models. Mean Average Error (MAE) can also be used instead.

# Source: https://machinelearningmastery.com/k-fold-cross-validation/
control <- trainControl(method="cv", number=10)
metric <- "RMSE"

Run the neural network, deep neural net, random forest, bagged cart and k-nearest neighbours models as shown below.

# Neural Network model
set.seed(7)
fit.nnet <- train(x, y, method="nnet", metric=metric, trControl=control, linout=TRUE, preProcess="pca") # Need linout TRUE: https://stackoverflow.com/questions/21622975/how-to-model-a-neural-network-through-the-use-of-caret-r

# Deep Neural Net
set.seed(7)
fit.dnn <- train(x, y, output="linear", metric=metric, trControl=control, linout=TRUE, preProcess="pca") # Need linout TRUE: https://stackoverflow.com/questions/21622975/how-to-model-a-neural-network-through-the-use-of-caret-r

# Random Forest
set.seed(7)
fit.rf <- train(x, y, method="rf", metric=metric, trControl=control, preProcess="pca")

#Trees using genetic algorithm- This is taking extremely long amounts of time so aborting this model
#set.seed(7)   
#fit.ga <- train(x, y, method="evtree", metric=metric, trControl=control, preProcess="pca")

#Adaptive- Network based fuzzy inference system-This is taking extremely long amounts of time so aborting this model
#install.packages("frbs")
#library(frbs)
#set.seed(7)   
#fit.anfis <- train(x, y, method="ANFIS", metric=metric, trControl=control, preProcess="pca")

#Bagged CART
set.seed(7)   
fit.bgcrt <- train(x, y, method="treebag", metric=metric, trControl=control, preProcess="pca")

#KNN
set.seed(7)   
fit.knn <- train(x, y, method="knn", metric=metric, trControl=control, preProcess="pca")

5.4 Results of validation and training

Let’s see how our models performed in 5-fold cross validation with 1% data (sale_price < 1M), 1% data (sale_price < 5M) and 100% data (sale_price < 1M, only for neural network and bagged cart).

results <- resamples(list(nnet=fit.nnet, dnn=fit.dnn, rf=fit.rf, cart=fit.bgcrt, knn=fit.knn))
summary(results)

Validation results of 1% data with sale_price <5M
metric	neural_net1	deep_neural_net1	random_forest1	bagged_cart1	k_nearest_neighbors1
RMSE	525659.600	410711.800	405804.900	437638.00	444834.500
MAE	280704.900	214190.900	213700.300	241710.50	227947.200
R-squared	0.067	0.422	0.442	0.35	0.334

Validation results of 1% data with sale_price <1M
metric	neural_net2	deep_neural_net2	random_forest2	bagged_cart2	k_nearest_neighbors2
RMSE	211077.400	193911.100	193911.100	200277.800	194873.80
MAE	164014.600	142400.700	142400.700	151958.000	144403.90
R-squared	0.107	0.256	0.256	0.196	0.25

Validation results of 100% data with sale_price <1M
metric	neural_net3	bagged_cart3
RMSE	208134.700	200819.900
MAE	161860.400	154242.500
R-squared	0.097	0.159

5.5 Results of predictions on testing data

Run the models on the x_test dataset to get the predicted sales price value.

# compare the predictions
nnet_pred <- predict(fit.nnet, x_test) 
dnn_pred <- predict(fit.dnn, x_test)
rf_pred <- predict(fit.rf, x_test)
bgcrt_pred <- predict(fit.bgcrt, x_test)
knn_pred <- predict(fit.knn, x_test)

Find the difference between the predicted values and the y_test values and calculate the RMSE and MAE values as shown below.

results3 <- data.frame(cbind(y_test, nnet_pred, dnn_pred, rf_pred, bgcrt_pred, knn_pred))
#RMSE on test data for each model
#Calculating diff from actual values
results3$diff_nnet <- abs(results3$sale_price - results3$nnet_pred)
results3$diff_dnn <- abs(results3$sale_price - results3$dnn_pred)
results3$diff_rf <- abs(results3$sale_price - results3$rf_pred)
results3$diff_bgcrt <- abs(results3$sale_price - results3$bgcrt_pred)
results3$diff_knn <- abs(results3$sale_price - results3$knn_pred)

#Calculating RMSE
RMSE_nnet <- sqrt(mean('^'(results3$diff_nnet,2))) #705,330
RMSE_dnn <- sqrt(mean('^'(results3$diff_dnn,2))) #697,094
RMSE_rf <- sqrt(mean('^'(results3$diff_rf,2))) #689,789
RMSE_bgcrt <- sqrt(mean('^'(results3$diff_bgcrt,2))) #634,809
RMSE_knn <- sqrt(mean('^'(results3$diff_knn,2))) #641,702

#Calculating MAE
MAE_nnet <- mean(results3$diff_nnet) #401,590
MAE_dnn <- mean(results3$diff_dnn) #466,828
MAE_rf <- mean(results3$diff_rf) #462,907
MAE_bgcrt <- mean(results3$diff_bgcrt) #401,397
MAE_knn <- mean(results3$diff_knn) #363,609

Calculate the MAE %error and RMSE %error by dividing the MAE and RMSE values by mean sales price in y_test. This gives a figure which can be easily compared across models. Let’s see what the results are.

#Calculating %error (RMSE/Average price)
error_nnet <- RMSE_nnet/mean(results3$sale_price) 
error_dnn <- RMSE_dnn/mean(results3$sale_price) 
error_rf <- RMSE_rf/mean(results3$sale_price) 
error_bgcrt <- RMSE_bgcrt/mean(results3$sale_price) 
error_knn <- RMSE_knn/mean(results3$sale_price) 

#Calculating %error (MAE/Average price)
error2_nnet <- MAE_nnet/mean(results3$sale_price) 
error2_dnn <- MAE_dnn/mean(results3$sale_price) 
error2_rf <- MAE_rf/mean(results3$sale_price) 
error2_bgcrt <- MAE_bgcrt/mean(results3$sale_price) 
error2_knn <- MAE_knn/mean(results3$sale_price)

Testing results of 1% data with sale_price <5M
metric4	neural_net4	deep_neural_net4	random_forest4	bagged_cart4	k_nearest_neighbors4
RMSE/mean	0.92	0.91	0.90	0.83	0.84
MAE/mean	0.53	0.61	0.61	0.53	0.48

Testing results of 1% data with sale_price <1M
metric4	neural_net4	deep_neural_net4	random_forest4	bagged_cart4	k_nearest_neighbors4
RMSE/mean	0.92	0.91	0.90	0.83	0.84
MAE/mean	0.53	0.61	0.61	0.53	0.48

Testing results of 100% (for neural_net & bagged_cart) data and 20% (for deep neural net & random forest and no cross validation) with sale_price <1M
metric4	neural_net6	deep_neural_net6	random_forest6	bagged_cart6
RMSE/mean	0.49	0.46	0.46	0.49
MAE/mean	0.40	0.37	0.37	0.39

As we can see, the model improves a lot when we only look at sale_prices < $ 1M. The final %MAE errors were in the range of 35-40% with the random forest and deep neural net performing the best.

6 Shiny App Visualization

This section describes the implmentation of our shiny app visalization, which can be viewed live on shinyapps.io.

6.1 Data Preparation

Before booting up the shiny app, we’ll perform a few more processing steps on the data and set up our polygon shapefile for drawing the zip code regions on a leaflet map.

We start by loading in the semi-processed crime data and then filtering out data before 2007, as upon closer inspection, the complaints from before that point seems to have been reported differently and would skew graphs we plan to show in the visualization.

load("Data_Score_by_Time_and_Rating.rda")

# Remove NAs and outliers
score_by_time_and_rating = score_by_time_and_rating %>%
    filter(!is.na(weight)) %>%
    filter(month > "2006-12-31")

Next, we normalize the crime score by mulitplying by 10000 (to make the minimum score greater than 1), taking it’s log, and then dividing by the maximum value to get an approximately normal distribution from 0 to 1.

hist(score_by_time_and_rating$weight)

score_by_time_and_rating$weight_transform = log((score_by_time_and_rating$weight)*100000)

score_by_time_and_rating$weight_normalized = score_by_time_and_rating$weight_transform/max(score_by_time_and_rating$weight_transform)

hist(score_by_time_and_rating$weight_normalized)

To display the 2015 census data for each zip code, we grab all the unique zip codes from the score_by_time_and_rating data frame, and their assciated demographic and economic characteristics.

unique_census = score_by_time_and_rating %>%
    group_by(zip_code) %>%
    summarize(
        PerCapitaIncome = PerCapitaIncome[1],
        Unemployed = Unemployed[1],
        TotalPop = TotalPop[1],
        Hispanic =  Hispanic[1],
        White = White[1],
        Black = Black[1], 
        Native = Native[1], 
        Asian = Asian[1],
        weight = weight[1]
    )

**unique_census**
zip_code	PerCapitaIncome	Unemployed	TotalPop	Hispanic	White	Black	Native	Asian	weight
10001	100234.00	6.768366	54931	11.217202	59.960081	7.1104367	0.3122954	18.801769	0.0302743
10002	28315.73	9.336805	89904	27.622018	20.842063	7.8137391	0.1529887	40.790532	0.0141484
10003	83351.19	5.529082	78507	6.906509	72.975596	3.8197919	0.2961048	12.900415	0.0179220
10004	97170.22	3.899262	21681	9.655233	65.617859	1.9047138	0.0000000	20.412117	0.0024445
10005	86428.86	3.880739	17798	6.414541	68.356489	2.9669963	0.2072705	20.326306	0.0043825
10006	93787.00	3.300000	4374	10.600000	66.000000	1.9000000	0.0000000	19.500000	0.0182899
10007	119242.13	3.674118	45430	8.511321	60.119892	4.3140678	0.0501695	22.972307	0.0057451
10009	51209.57	6.516734	85886	20.749042	54.242942	6.9598060	0.1141618	14.408297	0.0096873
10010	92893.21	5.785252	69574	11.339020	66.640557	4.0195015	0.3870339	14.968094	0.0093857
10011	106752.14	4.910633	82322	10.392998	70.414786	5.0095126	0.3011091	11.063606	0.0158281
10012	100495.24	4.410209	49380	5.790992	68.621183	2.2355508	0.1364682	19.006819	0.0114621
10013	113579.10	4.277498	43974	6.965232	58.260233	3.7364693	0.1236640	27.022588	0.0166007
10014	104205.84	5.425317	46324	8.411761	78.552526	2.1064567	0.2773681	7.920266	0.0145281
10016	85612.76	5.328540	71412	10.220809	65.417396	4.2940822	0.0167325	17.410382	0.0085420
10017	110212.58	3.163597	40598	6.453774	72.890379	1.7914602	0.0782255	16.572893	0.0081777
10018	91959.49	6.145309	26529	13.087546	54.653455	5.5875118	0.0715368	23.273821	0.0339628
10019	90112.07	5.688635	61795	17.626635	58.933995	4.8051477	0.0796909	15.995618	0.0165386
10020	107136.40	4.164439	6389	11.773767	60.485976	3.0516669	0.3054938	22.833793	0.0075129
10021	115294.98	3.470039	72771	6.231300	81.083166	1.7386486	0.0469473	8.290485	0.0037790
10022	130688.86	4.342849	47254	5.793163	80.830304	2.1141660	0.1132920	9.954484	0.0140517
10023	107509.81	4.706067	73211	8.576637	75.611438	2.5184795	0.0421235	10.600240	0.0078540
10024	97532.70	4.252122	81434	13.985623	71.642281	4.5466660	0.0342228	7.268941	0.0075767
10025	58202.50	8.159557	106152	21.748584	55.391525	10.1917854	0.1149446	9.581334	0.0113422
10026	33252.99	8.963258	56175	24.506978	21.874485	45.3300970	0.0698709	5.677186	0.0151313
10027	28991.62	7.676539	88710	23.809475	23.575135	41.1918916	0.1541066	8.754981	0.0200654
10028	112773.22	3.862112	83249	6.478951	83.344795	1.2763817	0.1274238	6.597452	0.0047808
10029	27052.87	11.629428	91158	48.233006	16.363058	24.2480781	0.0000000	8.805455	0.0194497
10030	26133.35	13.623160	46266	23.605652	10.925371	59.4541218	0.3181429	2.453601	0.0180694
10031	22642.82	10.759977	77692	52.828847	13.299143	27.7892524	0.0684202	3.125435	0.0181228
10032	19268.96	13.613160	97708	70.033369	10.336216	14.4302462	0.0617043	3.075249	0.0125271
10033	23814.46	12.810512	79100	68.683254	21.558509	4.5760455	0.0746119	3.380865	0.0137674
10034	23260.78	15.452328	46763	74.138492	17.012326	4.3897590	0.0000000	2.229989	0.0105425
10035	22672.20	10.971860	62345	42.117809	11.437370	38.2822263	0.0805245	5.174718	0.0193761
10036	72432.87	6.843628	31652	19.769809	54.093994	6.1912991	0.0616643	17.260862	0.0325098
10037	25108.63	13.591314	54422	23.733512	8.588791	61.6120172	0.2243192	2.784892	0.0066150
10038	52062.64	6.983886	44669	10.349260	35.814330	5.2738163	1.0839464	44.768287	0.0049475
10039	23355.00	18.036523	46042	33.090024	9.100765	53.3986860	0.4099778	1.598367	0.0136614
10040	24461.43	13.019628	75427	69.108541	21.784732	5.0114588	0.0686094	2.716590	0.0070930
10044	48284.78	5.483773	11783	14.601986	43.112908	13.9744972	0.0000000	24.635772	0.0059408
10065	120842.61	3.969969	34215	5.147400	82.168511	2.8653251	0.0998509	8.221274	0.0132398
10069	89272.56	5.022828	26752	10.560728	71.359640	3.9002841	0.0000000	10.937840	0.0002243
10075	116356.48	4.586323	72466	6.518072	84.444395	0.9766125	0.0493142	5.592533	0.0021803
10111	118143.00	6.200000	982	7.100000	72.800000	2.2000000	0.0000000	17.800000	0.0020367
10115	45187.00	5.100000	5069	7.800000	68.200000	7.6000000	0.0000000	14.600000	0.0007891
10128	89075.02	4.399807	75267	11.503636	71.495084	4.0066762	0.0000000	9.897987	0.0064969
10280	111236.72	3.224324	15762	10.034234	64.906306	1.3563063	0.0000000	18.871171	0.0015226
10281	115333.00	2.700000	8875	7.500000	69.800000	1.4000000	0.0000000	15.900000	0.0012394
10282	125477.38	2.737017	14091	8.832595	70.318231	2.7696118	0.0000000	12.827628	0.0020581
10301	33413.80	7.340241	67183	22.126743	49.492686	19.5170013	0.1022818	6.106892	0.0107170
10302	27652.69	5.647956	33898	31.306803	43.054755	16.3130981	0.0431707	6.997079	0.0083781

Now that we’ve extracted the census data, we can remove all those unecessary columns from the score_by_time_and_rating and reformat month column as a character to match the style of the grouped_housing data frame. We save this as a new data frame that we’ll load into our Shiny app.

crime_scores = score_by_time_and_rating %>%
    mutate(month_char = format(as.Date(month), "%Y-%m"))%>%
    data.frame() %>%
    select(-c("weight","weight_transform", "sum_weight","Men", "Women", "Hispanic", "White", "Black", "Native", "Asian", "TotalPop", "PerCapitaIncome", "Unemployed"))

save(crime_scores, file = "crime_scores.rda")

**crime_scores**
month	zip_code	weight_normalized	month_char
2007-01-01	10001	0.7741732	2007-01
2007-01-01	10002	0.7007013	2007-01
2007-01-01	10003	0.7235363	2007-01
2007-01-01	10004	0.5311227	2007-01
2007-01-01	10005	0.5875057	2007-01
2007-01-01	10006	0.7254990	2007-01
2007-01-01	10007	0.6136537	2007-01
2007-01-01	10009	0.6641158	2007-01
2007-01-01	10010	0.6610612	2007-01
2007-01-01	10011	0.7115365	2007-01
2007-01-01	10012	0.6803649	2007-01
2007-01-01	10013	0.7161397	2007-01
2007-01-01	10014	0.7032591	2007-01
2007-01-01	10016	0.6519636	2007-01
2007-01-01	10017	0.6477547	2007-01
2007-01-01	10018	0.7852772	2007-01
2007-01-01	10019	0.7157773	2007-01
2007-01-01	10020	0.6395650	2007-01
2007-01-01	10021	0.5731948	2007-01
2007-01-01	10022	0.7000389	2007-01
2007-01-01	10023	0.6438535	2007-01
2007-01-01	10024	0.6403814	2007-01
2007-01-01	10025	0.6793492	2007-01
2007-01-01	10026	0.7071881	2007-01
2007-01-01	10027	0.7344474	2007-01
2007-01-01	10028	0.5959080	2007-01
2007-01-01	10029	0.7314376	2007-01
2007-01-01	10030	0.7243277	2007-01
2007-01-01	10031	0.7246128	2007-01
2007-01-01	10032	0.6889462	2007-01
2007-01-01	10033	0.6980645	2007-01
2007-01-01	10034	0.6722874	2007-01
2007-01-01	10035	0.7310709	2007-01
2007-01-01	10036	0.7810539	2007-01
2007-01-01	10037	0.6272710	2007-01
2007-01-01	10038	0.5992176	2007-01
2007-01-01	10039	0.6973183	2007-01
2007-01-01	10040	0.6340093	2007-01
2007-01-01	10044	0.6168883	2007-01
2007-01-01	10065	0.6942905	2007-01
2007-01-01	10069	0.3004099	2007-01
2007-01-01	10075	0.5200755	2007-01
2007-01-01	10111	0.5134916	2007-01
2007-01-01	10128	0.6255310	2007-01
2007-01-01	10280	0.4853990	2007-01
2007-01-01	10281	0.4655224	2007-01
2007-01-01	10282	0.5145008	2007-01
2007-01-01	10301	0.6738727	2007-01
2007-01-01	10302	0.6500923	2007-01
2007-01-01	10303	0.6575468	2007-01

To display the neighborhood names for each zip code (if they have one), we’ll pull them in from a csv file created using information from the New York State’s Department of Health.

neighborhoods = read_csv("neighborhoods.csv") %>%
    mutate(zips = sapply(zips, function(x) as.list(strsplit(x," ")))) %>%
    unnest(zips) %>%
    add_row(zips = "00083", neighborhood = "Central Park")

**neighborhoods**
neighborhood	zips
Central Bronx	10453
Central Bronx	10457
Central Bronx	10460
Bronx Park and Fordham	10458
Bronx Park and Fordham	10467
Bronx Park and Fordham	10468
High Bridge and Morrisania	10451
High Bridge and Morrisania	10452
High Bridge and Morrisania	10456
Hunts Point and Mott Haven	10454
Hunts Point and Mott Haven	10455
Hunts Point and Mott Haven	10459
Hunts Point and Mott Haven	10474
Kingsbridge and Riverdale	10463
Kingsbridge and Riverdale	10471
Northeast Bronx	10466
Northeast Bronx	10469
Northeast Bronx	10470
Northeast Bronx	10475
Southeast Bronx	10461
Southeast Bronx	10462
Southeast Bronx	10464
Southeast Bronx	10465
Southeast Bronx	10472
Southeast Bronx	10473
Central Brooklyn	11212
Central Brooklyn	11213
Central Brooklyn	11216
Central Brooklyn	11233
Central Brooklyn	11238
Southwest Brooklyn	11209
Southwest Brooklyn	11214
Southwest Brooklyn	11228
Borough Park	11204
Borough Park	11218
Borough Park	11219
Borough Park	11230
Canarsie and Flatlands	11234
Canarsie and Flatlands	11236
Canarsie and Flatlands	11239
Southern Brooklyn	11223
Southern Brooklyn	11224
Southern Brooklyn	11229
Southern Brooklyn	11235
Northwest Brooklyn	11201
Northwest Brooklyn	11205
Northwest Brooklyn	11215
Northwest Brooklyn	11217
Northwest Brooklyn	11231
Flatbush	11203

Finally, we’ll load in the shapefile that contains all of the zip code polygins (with st_read) and merge it with the neighborhoods key and the 2015 census data. Before we saving it as an RDA file also we reduce the number of polygon vertices using `ms_simplify()’. This will allow for better performance of our Shiny app.

library(jsonlite)
library(sf)

zip_sf = st_read("https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson", stringsAsFactors = FALSE) %>%
    select(OBJECTID, postalCode, geometry) %>%
    rename(postalcode = postalCode, shape_id = OBJECTID)

## Reading layer `nyc-zip-code-tabulation-areas-polygons' from data source `https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson' using driver `GeoJSON'
## Simple feature collection with 262 features and 11 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -74.25576 ymin: 40.49584 xmax: -73.6996 ymax: 40.91517
## CRS:            4326

zip_sf = merge(zip_sf, unique_census, by.x="postalcode", by.y="zip_code", all.x=TRUE)
zip_sf = merge(zip_sf, neighborhoods, by.x="postalcode", by.y="zips", all.x=TRUE)

zip_sf = rmapshaper::ms_simplify(zip_sf, keep = 0.05, keep_shapes=TRUE)

save(zip_sf, file="zip_polygons.rda")

6.2 Running the App

In this sub-section, we review the specific files that are run each time we start up an instance of out Shiny app visualization.

6.2.1 global.R

This file is always executed before server.R and ui.R and allows us run preliminary code and create global variables used by both the server and UI. We begin the file by loading in the tidyverse package and creating a temporary directory where we can store the plots we’ll be generating when a user clicks a particular polygon.

library(tidyverse)

# Create temp folder for popup plots and rdas
folder <- tempfile()
dir.create(folder)

Next, we set the location from where we’ll be loading our rda files, as this is different depending on whether we’re running it locally or on shinyapps.io (in which case the RDA files are stored on dropbox). Before pushing the app to shinyapps.io, one simply has to uncomment the commented out lines so that the app can pull from dropbox rather than just outside its local directory.

rda_loc = "../"

# These lines are for hosting the Shiny app on shinyapps.io
# Leave them commented out when you're running on your local machine
# library(rdrop2)
# drop_auth(rdstoken = "token.rds")
# dir_info = drop_dir("Rstudio_cloud/engm_182_project_data")
# mapply(drop_get,
#     path = dir_info$path_display,
#     local_file = paste(folder,dir_info$name,sep="/"),
#     overwrite = rep(TRUE, nrow(dir_info))
# )
# rda_loc = paste0(folder, "/")

Now that we’ve set their location, we can load in the RDA files. We’ll also create a unique_zips data frame to hold popup titles for each zipcode polygon in the shapefile.

# Load data gropued by month and zip code
load(paste0(rda_loc, "grouped_housing.rda"))
load(paste0(rda_loc, "crime_scores.rda"))
load(paste0(rda_loc, "zip_polygons.rda"))

unique_zips = zip_sf %>% data.frame() %>%
    select(postalcode, neighborhood) %>%
    group_by(postalcode) %>%
    summarize(neighborhood = neighborhood[ifelse(min(which(!is.na(neighborhood)))==Inf,1,min(which(!is.na(neighborhood))))]) %>%
    mutate(select_names = paste0(ifelse(is.na(.data$neighborhood), "", paste0(.data$neighborhood, " - ")), .data$postalcode)) %>%
    data.frame() %>%
    select(-neighborhood)

**unique_zips**
postalcode	select_names
00083	Central Park - 00083
10001	Chelsea and Clinton - 10001
10002	Lower East Side - 10002
10003	Lower East Side - 10003
10004	Lower Manhattan - 10004
10005	Lower Manhattan - 10005
10006	Lower Manhattan - 10006
10007	Lower Manhattan - 10007
10009	Lower East Side - 10009
10010	Gramercy Park and Murray Hill - 10010
10011	Chelsea and Clinton - 10011
10012	Greenwich Village and Soho - 10012
10013	Greenwich Village and Soho - 10013
10014	Greenwich Village and Soho - 10014
10016	Gramercy Park and Murray Hill - 10016
10017	Gramercy Park and Murray Hill - 10017
10018	Chelsea and Clinton - 10018
10019	Chelsea and Clinton - 10019
10020	Chelsea and Clinton - 10020
10021	Upper East Side - 10021
10022	Gramercy Park and Murray Hill - 10022
10023	Upper West Side - 10023
10024	Upper West Side - 10024
10025	Upper West Side - 10025
10026	Central Harlem - 10026
10027	Central Harlem - 10027
10028	Upper East Side - 10028
10029	East Harlem - 10029
10030	Central Harlem - 10030
10031	Inwood and Washington Heights - 10031
10032	Inwood and Washington Heights - 10032
10033	Inwood and Washington Heights - 10033
10034	Inwood and Washington Heights - 10034
10035	East Harlem - 10035
10036	Chelsea and Clinton - 10036
10037	Central Harlem - 10037
10038	Lower Manhattan - 10038
10039	Central Harlem - 10039
10040	Inwood and Washington Heights - 10040
10041	10041
10043	10043
10044	Upper East Side - 10044
10045	10045
10047	10047
10048	10048
10055	10055
10065	Upper East Side - 10065
10069	10069
10075	Upper East Side - 10075
10080	10080

Next we build a list to hold the radio button choices to select which data the user wants to look at.

# Build list of radio button choice names and values
radioButtonOptions = setNames(
    as.list(c(colnames(grouped_housing)[-c(1:3)],colnames(crime_scores)[3])), # values
    c( # choice names
        "Average price per sq. ft.",
        "Number of sales",
        "Total proceeds from sales",
        "Crime score"
    )
)

Finally, we create some custom CSS and HTML that will be used to fix a bug in the display positioning of the NA color within the leaflet legend. We found this solution in a Github issue thread.

# Legend NA position fix
library(magrittr)
library(htmlwidgets)
css_fix <- "div.info.legend.leaflet-control br {clear: both;}"
html_fix <- as.character(htmltools::tags$style(type = "text/css", css_fix))

6.2.2 server.R

Server is the part of shiny application where processing and updating of data, map, predicted values happen. We’ll gradually build the code for Sever in steps. The inputIds in ui.R script will be linked in Sever script.

6.2.2.1 Importing necessary libraries

library(shiny)
library(leaflet)
library(sf)
library(RColorBrewer)
library(zoo)
library(scales)
library(ggplot2)
library(caret)
library(randomForest)

6.2.2.2 Loading the prediction model files

We’ll be using Neural net and Random forest models to compute prices of real estate based on 4 inputs from user.

load("neural_net.rda")
load("random_forest.rda")

6.2.2.3 Creating reactive merged dataframe and mapping

We start coding inside shinyServer function. In next steps, we build chunks of code that will go inside shinyServer function. They execute different processes in the app. Final code of server script is given in the end.

In this first step, we create a reactive obeject that changes the merged dateframe as per the user inputs: time selected and property type selected. The merged dataframe has spatial data. renderLeaflet function creates the map and shows NYC as we have put the coordinates.

shinyServer(function(input, output) {

    df <- reactive({
        if (housing_data_select()) {
            data = grouped_housing %>% 
                filter(category == input$prop_category, month_char == input$date_select) %>%
                select(zip_code, disp_data = !!input$data_select)
        } else {
            data = crime_scores %>%
                filter(month_char == input$date_select) %>%
                select(zip_code, disp_data = !!input$data_select)
        }
        tmp = merge(zip_sf, data, by.x="postalcode", by.y="zip_code", all.x=TRUE, no.dups=FALSE)
        return(tmp)
    })
    
    output$map <- renderLeaflet({
        leaflet() %>%
            addTiles() %>%
            setView(-74.0060,40.7128, zoom=11)
    })
    
})

6.2.2.4 Mapping the reactive data and adding legend

Now, we add a color palette. Using reverse geocoding, we map the reactive data based on the ZIP code as per the attribute chosen by the user with radioButton and time chosen on sliderInput. We also highlight the ZIP code areas for mouse hover.

    observe({
        
        
        tmp = df()
        legend_labels = NULL
        
        pal = colorNumeric(
            palette = rev(brewer.pal(n=9, name = "RdYlGn")),
            domain = NULL
        )
        
        if (all(is.na(tmp$disp_data))) {
            func = labelFormat()
            tmp$disp_data = as.factor(rep("NA", nrow(tmp)))
            pal = colorFactor(palette = "#808080", domain = NULL)
        } else if (input$data_select == "avg_price_per_sqft") {
            func = labelFormat(prefix = " $")
        } else if (input$data_select == "total_proceeds") {
            func = labelFormat(prefix = " $", suffix = "M", transform=function(x) x/1E6)
        } else {
            func = labelFormat(prefix = " ")
        }
        
        leafletProxy("map", data = tmp) %>%
            clearControls() %>%
            clearShapes() %>%
            clearPopups() %>%
            addPolygons(
                fillColor = ~pal(disp_data),
                color = "#b2aeae", # you need to use hex colors
                fillOpacity = 0.7, 
                weight = 1, 
                smoothFactor = 0.2,
                # Highlight neighbourhoods upon mouseover
                highlight = highlightOptions(
                    weight = 3,
                    color = "black",
                    opacity = 1.0
                ),
                layerId = ~shape_id
            ) %>%
            addLegend(
                title=names(which(radioButtonOptions == input$data_select)),
                pal = pal, 
                values = ~disp_data, 
                position = "bottomright", 
                labFormat = func,
                labels = legend_labels
            )
    })

6.2.2.5 Clicking on ZIP code areas

We create a function observeEvent that will check the click on a ZIP code area. It will show a pop up inside which we can see a graph of the chosen attribute over time. Code for the pop up in shown in the next step.

    # Observe shape clicks to display popup
    observeEvent(input$map_shape_click,{
        event = input$map_shape_click
        if (is.null(event))
            return()
        else {
            zip_clicked = zip_sf$postalcode[zip_sf$shape_id == event$id]
            curr = FALSE
            if (housing_data_select()) {
                data = grouped_housing %>%
                    filter(category == input$prop_category, zip_code == zip_clicked) %>%
                    select(month_char, !!input$data_select)
                title_suffix_str = paste("-", input$prop_category)
                if (input$prop_category == "All") {
                    title_suffix_str = paste(title_suffix_str, "Categories")
                }
                if (input$data_select %in% c("avg_price_per_sqft", "total_proceeds")) {
                    curr = TRUE
                }
            } else {
                data = crime_scores %>%
                    filter(zip_code == zip_clicked) %>%
                    select(month_char, weight_normalized)
                title_suffix_str = "over Time"
            }
            showPopup(data, zip_clicked, event$id, event$lat, event$lng, title_suffix_str, curr)
        }
    })

6.2.2.6 Adding pop up

In this step, we add pop up on each ZIP code area showing the graph of the attribute chosen over time. Pop us also shows demographic information from the 2015 census.

showPopup <- function(df, zip, id, lat, lng, suff_str, curr = FALSE) {
        
        # Get plot object and other popup data for the chosen zip code
        zip_data = zip_sf[which(zip_sf$postalcode == zip),]
        if (nrow(df) > 0) {
            y = pull(df, 2)
            x = as.yearmon(pull(df, 1))
            title = paste(names(which(radioButtonOptions == colnames(df)[2])), suff_str)
            ylabel = names(which(radioButtonOptions == colnames(df)[2]))
            
            if (curr == TRUE) {
                if (max(y, na.rm=TRUE) > 1e6) {
                    label = label_number(prefix = "$", suffix = "M", scale = 1e-6)
                } else if (max(y, na.rm = TRUE) > 1e4) {
                    label =  label_number(prefix = "$", suffix = "K", scale = 1e-3)
                } else {
                    label = label_number(prefix = "$")
                }
            } else {
                label = label_number()
            }
            plot = ggplot(data=NULL, aes(x, y)) +
                geom_line() +
                geom_smooth() +
                labs(title = title, y = ylabel, x = "Month")  +
                scale_y_continuous(label = label, expand = c(0,0), limits = c(0, ifelse(colnames(df)[2] == "weight_normalized", 1, max(y, na.rm=TRUE)))) +
                theme_classic() +
                theme(plot.title = element_text(size = 11, hjust = 0.5), axis.title = element_text(size = 10))
    
            # Write svg file to temporary folder
            svg(filename= paste(folder,"plot.svg", sep = "/"),
                width = 500 * 0.01, height = 300 * 0.01)
            print(plot)
            dev.off()
    
            content <- paste(readLines(paste(folder,"plot.svg",sep="/")), collapse = "")
        } else {
            content = ""
        }
        
        if (is.na(zip_data$neighborhood)) {
            neighborhood_str = ""
        } else {
            neighborhood_str = paste(zip_data$neighborhood, " - ")
        }
        # Create popup
        leafletProxy("map") %>%
            addPopups(
                lng,
                lat,
                popup = paste0(
                    "<h4>", neighborhood_str, zip, "</h4><br/>",
                    content, "<br/>",
                    "<b>2015 Census Data</b><br/>",
                    "Per capita income:    $", round(zip_data$PerCapitaIncome),"</b><br/>",
                    "Total Population: ", zip_data$TotalPop, "</b><br/>",
                    "Unemployment rate: ", round(zip_data$Unemployed), "%"
                ),
                layerId = id,
                options = popupOptions(maxWidth = 500)
            )
    }

6.2.2.7 Price predictor

In this step, we integrate the predictive models in the shiny app. It takes 4 inputs from the user: Age of the builing, area, sale year and ZIP code. ZIP code pulls demographic data from the 2015 cenus that’s in the dataframe. It feeds these inputs in the model chosen by the user and gives a predicted price.

    price_text = eventReactive(input$predict_price, {
        if (input$zip != "")
            zip = unique_zips[which(unique_zips$select_names==input$zip), 1]
        else {
            zip = NA
        }
        input_data = data.frame(gross_square_feet = input$sqft, age = input$age, sale_year = input$sale_year, zip_code = zip, stringsAsFactors = FALSE)
        bad_cols = colnames(input_data)[colSums(is.na(input_data)) > 0]
        if (length(bad_cols) == 0){
            zip_data = zip_sf %>%
                filter(postalcode == zip) %>%
                slice(1) %>%
                data.frame() %>%
                select(postalcode, PerCapitaIncome, White, Black, Asian, Hispanic, Native, TotalPop.x = TotalPop, Unemployed, weight) %>%
                inner_join(input_data, by = c("postalcode" = "zip_code")) %>%
                select(-postalcode)
            if (is.na(zip_data)) {return(paste0("<font color=\"#eb4034\">Error: No valid census data for this zip code!</font>"))}
            if (input$model_select == "fit.rf") {
                predicted_val = predict(fit.rf, newdata=zip_data)
            } else if (input$model_select == "fit.nnet") {
                predicted_val = predict(fit.nnet, newdata=zip_data)
            } else if (input$model_select == "fit.bgcrt") {
                predicted_val = predict(fit.bgcrt, newdata=zip_data)
            }
            return(paste0("<br/>", dollar(predicted_val)))
        } else {
            msg = ""
            if ("gross_square_feet" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a square footage!")
            }
            if ("age" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter building age!")
            }
            if ("sale_year" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a sale year!")
            }
            if ("zip_code" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a location!")
            }
            return(paste0("<font color=\"#eb4034\">", msg, "</font>"))
        }
    })

output$price = renderText({
        price_text()
    })

6.2.2.8 Choice of attribute

housing_data_select takes a boolean value of either TRUE or FALSE depending on whether the attribute chosen by the user is within the choices provided.

outputOptions(output, "housing_data_select", suspendWhenHidden = FALSE)
    
    output$housing_data_select = reactive({
        return(housing_data_select())
    })
    
    housing_data_select = reactive({
        if (input$data_select %in% colnames(grouped_housing)[-c(1:3)]) {
            return(TRUE)
        } else {
            return(FALSE)
        }
    })

6.2.2.9 Final code of server.R

Now we combine all chunks of code that we saw in steps so far inside the shinyServer function. This is the final script for server.R. This script along with ui.R and Global.R runs the shiny app.

library(shiny)
library(leaflet)
library(sf)
library(RColorBrewer)
library(zoo)
library(scales)
library(ggplot2)
library(caret)
library(randomForest)

load("neural_net.rda")
# load(paste0(rda_loc, "bagged_cart.rda"))
load("random_forest.rda")

shinyServer(function(input, output) {
    
    output$housing_data_select = reactive({
        return(housing_data_select())
    })
    
    housing_data_select = reactive({
        if (input$data_select %in% colnames(grouped_housing)[-c(1:3)]) {
            return(TRUE)
        } else {
            return(FALSE)
        }
    })
   
    
    output$price = renderText({
        price_text()
    })
    
    price_text = eventReactive(input$predict_price, {
        if (input$zip != "")
            zip = unique_zips[which(unique_zips$select_names==input$zip), 1]
        else {
            zip = NA
        }
        input_data = data.frame(gross_square_feet = input$sqft, age = input$age, sale_year = input$sale_year, zip_code = zip, stringsAsFactors = FALSE)
        bad_cols = colnames(input_data)[colSums(is.na(input_data)) > 0]
        if (length(bad_cols) == 0){
            zip_data = zip_sf %>%
                filter(postalcode == zip) %>%
                slice(1) %>%
                data.frame() %>%
                select(postalcode, PerCapitaIncome, White, Black, Asian, Hispanic, Native, TotalPop.x = TotalPop, Unemployed, weight) %>%
                inner_join(input_data, by = c("postalcode" = "zip_code")) %>%
                select(-postalcode)
            if (is.na(zip_data)) {return(paste0("<font color=\"#eb4034\">Error: No valid census data for this zip code!</font>"))}
            if (input$model_select == "fit.rf") {
                predicted_val = predict(fit.rf, newdata=zip_data)
            } else if (input$model_select == "fit.nnet") {
                predicted_val = predict(fit.nnet, newdata=zip_data)
            } else if (input$model_select == "fit.bgcrt") {
                predicted_val = predict(fit.bgcrt, newdata=zip_data)
            }
            return(paste0("<br/>", dollar(predicted_val)))
        } else {
            msg = ""
            if ("gross_square_feet" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a square footage!")
            }
            if ("age" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter building age!")
            }
            if ("sale_year" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a sale year!")
            }
            if ("zip_code" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a location!")
            }
            return(paste0("<font color=\"#eb4034\">", msg, "</font>"))
        }
    })

    df <- reactive({
        if (housing_data_select()) {
            data = grouped_housing %>% 
                filter(category == input$prop_category, month_char == input$date_select) %>%
                select(zip_code, disp_data = !!input$data_select)
        } else {
            data = crime_scores %>%
                filter(month_char == input$date_select) %>%
                select(zip_code, disp_data = !!input$data_select)
        }
        tmp = merge(zip_sf, data, by.x="postalcode", by.y="zip_code", all.x=TRUE, no.dups=FALSE)
        return(tmp)
    })
    
    output$map <- renderLeaflet({
        leaflet() %>%
            addTiles() %>%
            setView(-74.0060,40.7128, zoom=11)
    })
    
    observe({
        
        
        tmp = df()
        legend_labels = NULL
        
        pal = colorNumeric(
            palette = rev(brewer.pal(n=9, name = "RdYlGn")),
            domain = NULL
        )
        
        if (all(is.na(tmp$disp_data))) {
            func = labelFormat()
            tmp$disp_data = as.factor(rep("NA", nrow(tmp)))
            pal = colorFactor(palette = "#808080", domain = NULL)
        } else if (input$data_select == "avg_price_per_sqft") {
            func = labelFormat(prefix = " $")
        } else if (input$data_select == "total_proceeds") {
            func = labelFormat(prefix = " $", suffix = "M", transform=function(x) x/1E6)
        } else {
            func = labelFormat(prefix = " ")
        }
        
        leafletProxy("map", data = tmp) %>%
            clearControls() %>%
            clearShapes() %>%
            clearPopups() %>%
            addPolygons(
                fillColor = ~pal(disp_data),
                color = "#b2aeae", # you need to use hex colors
                fillOpacity = 0.7, 
                weight = 1, 
                smoothFactor = 0.2,
                # Highlight neighbourhoods upon mouseover
                highlight = highlightOptions(
                    weight = 3,
                    color = "black",
                    opacity = 1.0
                ),
                layerId = ~shape_id
            ) %>%
            addLegend(
                title=names(which(radioButtonOptions == input$data_select)),
                pal = pal, 
                values = ~disp_data, 
                position = "bottomright", 
                labFormat = func,
                labels = legend_labels
            )
    })
    
    
    # Observe shape clicks to display popup
    observeEvent(input$map_shape_click,{
        event = input$map_shape_click
        if (is.null(event))
            return()
        else {
            zip_clicked = zip_sf$postalcode[zip_sf$shape_id == event$id]
            curr = FALSE
            if (housing_data_select()) {
                data = grouped_housing %>%
                    filter(category == input$prop_category, zip_code == zip_clicked) %>%
                    select(month_char, !!input$data_select)
                title_suffix_str = paste("-", input$prop_category)
                if (input$prop_category == "All") {
                    title_suffix_str = paste(title_suffix_str, "Categories")
                }
                if (input$data_select %in% c("avg_price_per_sqft", "total_proceeds")) {
                    curr = TRUE
                }
            } else {
                data = crime_scores %>%
                    filter(zip_code == zip_clicked) %>%
                    select(month_char, weight_normalized)
                title_suffix_str = "over Time"
            }
            showPopup(data, zip_clicked, event$id, event$lat, event$lng, title_suffix_str, curr)
        }
    })
    
    showPopup <- function(df, zip, id, lat, lng, suff_str, curr = FALSE) {
        
        # Get plot object and other popup data for the chosen zip code
        zip_data = zip_sf[which(zip_sf$postalcode == zip),]
        if (nrow(df) > 0) {
            y = pull(df, 2)
            x = as.yearmon(pull(df, 1))
            title = paste(names(which(radioButtonOptions == colnames(df)[2])), suff_str)
            ylabel = names(which(radioButtonOptions == colnames(df)[2]))
            
            if (curr == TRUE) {
                if (max(y, na.rm=TRUE) > 1e6) {
                    label = label_number(prefix = "$", suffix = "M", scale = 1e-6)
                } else if (max(y, na.rm = TRUE) > 1e4) {
                    label =  label_number(prefix = "$", suffix = "K", scale = 1e-3)
                } else {
                    label = label_number(prefix = "$")
                }
            } else {
                label = label_number()
            }
            plot = ggplot(data=NULL, aes(x, y)) +
                geom_line() +
                geom_smooth() +
                labs(title = title, y = ylabel, x = "Month")  +
                scale_y_continuous(label = label, expand = c(0,0), limits = c(0, ifelse(colnames(df)[2] == "weight_normalized", 1, max(y, na.rm=TRUE)))) +
                theme_classic() +
                theme(plot.title = element_text(size = 11, hjust = 0.5), axis.title = element_text(size = 10))
    
            # Write svg file to temporary folder
            svg(filename= paste(folder,"plot.svg", sep = "/"),
                width = 500 * 0.01, height = 300 * 0.01)
            print(plot)
            dev.off()
    
            content <- paste(readLines(paste(folder,"plot.svg",sep="/")), collapse = "")
        } else {
            content = ""
        }
        
        if (is.na(zip_data$neighborhood)) {
            neighborhood_str = ""
        } else {
            neighborhood_str = paste(zip_data$neighborhood, " - ")
        }
        # Create popup
        leafletProxy("map") %>%
            addPopups(
                lng,
                lat,
                popup = paste0(
                    "<h4>", neighborhood_str, zip, "</h4><br/>",
                    content, "<br/>",
                    "<b>2015 Census Data</b><br/>",
                    "Per capita income:    $", round(zip_data$PerCapitaIncome),"</b><br/>",
                    "Total Population: ", zip_data$TotalPop, "</b><br/>",
                    "Unemployment rate: ", round(zip_data$Unemployed), "%"
                ),
                layerId = id,
                options = popupOptions(maxWidth = 500)
            )
    }
    
    outputOptions(output, "housing_data_select", suspendWhenHidden = FALSE)
})

6.2.3 ui.R

User interface is the user facing side of the Shiny application.It contains various display elements such as titlePanel, sidebarLayout, radioButtons, tabPanel etc. Following steps show how we step-wise buit up code for UI.

6.2.3.1 Importing necessary libraries

library(shiny)
library(leaflet)
library(shinyWidgets)

6.2.3.2 Creating basic layout of UI and display elements

Now that we have imported libraries, we start adding code inside fluidPage to create display elements. We’ll have user inputs in the sidebarPanel. We’ll add those in later steps. Final code of ui script is given in the end.

shinyUI(fluidPage(
    HTML(html_fix),
    titlePanel("NYC Real Estate and Crime"),
    sidebarLayout(
        sidebarPanel(),
        mainPanel(
            leafletOutput(
                "map",
                width="100%",
                height="90vh"
            )
        ),
        position="left"
    )
))

6.2.3.3 Creating display elements in sidebarLayout

Now we’ll start coding inside sidebarLayout to create display elements. We created 2 tabs, first tab for the attribute we want to see at a chosen time. sliderTextInput allows us to choose a specific time. radioButtons links to radioButtonOptions in Global.R file. It allows us to choose the attribute of the data we want to see. conditionalPanel shows real esate based on the “property type” we have chosen from the drop down list.

    titlePanel("NYC Real Estate and Crime"),
    sidebarLayout(
        sidebarPanel(
            tabsetPanel(
                tabPanel(
                    "Map Controls",
                    HTML("<br/>"),
                    sliderTextInput(
                        inputId = "date_select",
                        label="Month",
                        choices = sort(unique(c(unique(crime_scores$month_char),unique(grouped_housing$month_char)))),
                        animate=TRUE,
                        selected="2020-03"
                    ),
                    radioButtons(
                        inputId = "data_select",
                        label = "Mapped Data",
                        radioButtonOptions # set up in global.R
                    ),
                    conditionalPanel(
                        condition = "output.housing_data_select == true",
                        selectInput(
                            inputId = "prop_category",
                            label = "Property Category",
                            choices = sort(unique(grouped_housing$category))
                        )
                    )
                )

6.2.3.4 Predictive model inputs

Now we create second tab for the predictive model integration in the app. We build radioButtons to choose between models. Price predictor takes 4 inputs to compute the price. Clicking on the actionButton shows predicted price.

          tabPanel(
                    "Price Predictor",
                    HTML("<br/>"),
                    numericInput("age", "Building age (years):", 50, min=1800, max = 2030),
                    numericInput("sqft", "Square footage:", 2000, min=0, max = 1000000),
                    numericInput("sale_year", "Sale Year:", 2010, min=1950, max = 2030),
                    selectizeInput("zip", "Location:",
                        selected = NULL,
                        choices = unique_zips$select_names,
                        options = list(
                            placeholder = "Search by Neighborhood or Zip Code",
                            onInitialize = I('function() { this.setValue(""); }')
                        )
                    ),
                    radioButtons(
                        "model_select",
                        label="Predictive Model:",
                        choiceNames = c(
                            # "Bagged cart",
                            "Neural net",
                            "Random forest"
                        ),
                        choiceValues = c(
                            # "fit.bgcrt", 
                            "fit.nnet",
                            "fit.rf"
                        )
                    ),
                    actionButton("predict_price", label = "Compute Price"),
                    HTML("<hr/><b>Predicted Price: </b>"),
                    htmlOutput("price")
                )

6.2.3.5 Final code of ui.R

Now we combine all chunks of code that we saw in steps so far inside the shinyUI function. This is the final script for ui.R. This script along with server.R and Global.R runs the shiny app.

library(shiny)
library(leaflet)
library(shinyWidgets)

# Define UI for application that draws a histogram
shinyUI(fluidPage(
    HTML(html_fix),
    titlePanel("NYC Real Estate and Crime"),
    sidebarLayout(
        sidebarPanel(
            tabsetPanel(
                tabPanel(
                    "Map Controls",
                    HTML("<br/>"),
                    sliderTextInput(
                        inputId = "date_select",
                        label="Month",
                        choices = sort(unique(c(unique(crime_scores$month_char),unique(grouped_housing$month_char)))),
                        animate=TRUE,
                        selected="2020-03"
                    ),
                    radioButtons(
                        inputId = "data_select",
                        label = "Mapped Data",
                        radioButtonOptions # set up in global.R
                    ),
                    conditionalPanel(
                        condition = "output.housing_data_select == true",
                        selectInput(
                            inputId = "prop_category",
                            label = "Property Category",
                            choices = sort(unique(grouped_housing$category))
                        )
                    )
                ),
                tabPanel(
                    "Price Predictor",
                    HTML("<br/>"),
                    numericInput("age", "Building age (years):", 50, min=1800, max = 2030),
                    numericInput("sqft", "Square footage:", 2000, min=0, max = 1000000),
                    numericInput("sale_year", "Sale Year:", 2010, min=1950, max = 2030),
                    selectizeInput("zip", "Location:",
                        selected = NULL,
                        choices = unique_zips$select_names,
                        options = list(
                            placeholder = "Search by Neighborhood or Zip Code",
                            onInitialize = I('function() { this.setValue(""); }')
                        )
                    ),
                    radioButtons(
                        "model_select",
                        label="Predictive Model:",
                        choiceNames = c(
                            # "Bagged cart",
                            "Neural net",
                            "Random forest"
                        ),
                        choiceValues = c(
                            # "fit.bgcrt", 
                            "fit.nnet",
                            "fit.rf"
                        )
                    ),
                    actionButton("predict_price", label = "Compute Price"),
                    HTML("<hr/><b>Predicted Price: </b>"),
                    htmlOutput("price")
                )
            )
        ),
        mainPanel(
            leafletOutput(
                "map",
                width="100%",
                height="90vh"
            )
        ),
        position="left"
    )
))

7 Conclusion and next steps

7.1 Challenges faced

Selecting relevant variables

Many factors go into predicting the housing prices. So it’s difficult to select a set of variables from the data available.
Cleaning the data

Cleaning the raw data is an important step. Often, the data collected is not in the required shape so a lot wrangling has to be done before it’s ready for further processing.
Missing data

There were many data points that were missing. We couldn’t do much about them. There are many ways to deal with missing data. However, they were not within the scope of the project. Grey areas seen on the map are due to these missing data points.
Computing power limitations We experienced computing power limitations while working with machine learning models. In future, these models can be run with better computing.

7.2 What we learned

Data cleaning takes time

Cleaning and preparing the data into required form is an important step and can be time consuming. Often, it takes more time to clean and prepare the date than processing it.
Higher computing or APIs needed to run the models

To run the machine learning models, higher computing power would be useful. Othewise, using some API package like rcpp the code could be executed faster.
Lost insights due to missing data We may have missed out on many insights due to missing data. There was nothing much we could do about this. In future, missing data could be handled better.

Measure	Value
Adjusted R-squared value (\(R^2\))	27.26%
Root Mean Squared (RMS)	198,239.1
Root Mean Squared Error (RMSE)	43.59%

NYC Crime and Real Estate Data Project Tutorial

ENGM 182 - Data Analytics

Vikhyat Khare, Omkar Kshirsagar, Carter Noordsij, John Sullivan

June 9, 2020

1 Background

2 Importing Data

2.1 Importing the Real Estate Data

2.2 Importing crime data

2.2.1 Loading tidyverse package

2.2.2 Reading crime data

2.3 Importing the Census Data

2.3.1 Downloading the census block data

2.3.2 Downloading the 2nd part NYC census tracts data

3 Data Cleaning and Processing

3.1 Crime Data

3.1.1 Cleaning the Crime Data

3.1.2 Assigning Crime Rating

3.1.2.1 Adding zipcode to each criminal activity

3.1.2.2 Assigning weights to each crime

3.1.3 Group crime information by month, year, and zipcode

3.1.4 Group other fields by time and zipcode

3.1.5 Converting coordinates to zipcodes

3.1.6 Sources for latitude and longitude to zipcode code and data

3.1.7 Processing the crime data

3.2 Census Data

3.2.1 Cleaning the census data

3.2.2 Processing the census data

3.2.3 Source for census tract to zipccode mapping data

3.3 Real Estate Data

3.3.1 Geocoding the Real Estate Data

3.3.2 Cleaning the Real Estate Data

3.3.3 Grouping the Real Estate Data

3.4 Creating a single dataframe for modeling

3.4.1 Section of housing sales data

3.4.2 Adding the section about census data

3.4.3 Adding the crime score

4 Regression Modeling

4.1 Code to setup and run regressions on combined crime data

4.1.1 Load packages and import data

4.1.2 Variable selection

4.1.3 Variable coercion

4.1.4 Partition the data

4.1.5 Run the regression using fastLm

4.1.6 Summarize results

4.2 Results from the regression models

5 Machine Learning Modeling

5.1 Preparing the dataframe for Machine Learning models

5.2 Splitting the data into training and test data

5.3 Running the ML models

5.4 Results of validation and training

5.5 Results of predictions on testing data

6 Shiny App Visualization

6.1 Data Preparation

6.2 Running the App

6.2.1 global.R

6.2.2 server.R

6.2.2.1 Importing necessary libraries

6.2.2.2 Loading the prediction model files

6.2.2.3 Creating reactive merged dataframe and mapping

6.2.2.4 Mapping the reactive data and adding legend

6.2.2.5 Clicking on ZIP code areas

6.2.2.6 Adding pop up

6.2.2.7 Price predictor

6.2.2.8 Choice of attribute

6.2.2.9 Final code of server.R

6.2.3 ui.R

6.2.3.1 Importing necessary libraries

6.2.3.2 Creating basic layout of UI and display elements

6.2.3.3 Creating display elements in sidebarLayout

6.2.3.4 Predictive model inputs

6.2.3.5 Final code of ui.R

7 Conclusion and next steps

7.1 Challenges faced

7.2 What we learned