1 Background

This tutorial page describes the final project we conducted for Professor Geoffrey Parker’s Spring 2020 Data Analytics course (ENGM 182) at Dartmouth College’s Thayer School of Engineering. The goal of the project was to create a compelling and approachable a visualization of historical NYC real estate sales and crime data, and an accompanying model for predicting the price of a hypothetical real estate sale.

We chose to use the R programming language to complete this task. The secions below describe the code and methodologies for acquiring, cleaning, and processing the data, as well as the code that drives the Shiny app visualization.

The project’s source code is available on Github, and the Shiny app is hosted for viewing on shinyapps.io. Our data sources can be downloaded at the links below:

2 Importing Data

This section describes how we imported the data from each of our data sources into an R environment, where it can be then be easily cleaned and processed. Expand the tab below to see the directory structure we used for our downloaded data files.

Data directory structure

Data
├── Census_2015
│   ├── census_block_loc.csv
│   └── nyc_census_tracts.csv
├── Crimes_2006-2017
│   └── NYPD_Complaint_Data_Historic.csv
├── Crimes_2014-2015
│   ├── Crime_Column_Description.csv
│   ├── NYPD_Complaint_Data_Historic.csv
│   └── Population_by_Borough_NYC.csv
├── RE_2003-2015
│   ├── 2009_bronx.xls
│   ├── 2009_brooklyn.xls
│   ├── 2009_manhattan.xls
│   ├── 2009_queens.xls
│   ├── 2009_statenisland.xls
│   ├── 2010_bronx.xls
│   ├── 2010_brooklyn.xls
│   ├── 2010_manhattan.xls
│   ├── 2010_queens.xls
│   ├── 2010_statenisland.xls
│   ├── 2011_bronx.xls
│   ├── 2011_brooklyn.xls
│   ├── 2011_manhattan.xls
│   ├── 2011_queens.xls
│   ├── 2011_statenisland.xls
│   ├── 2012_bronx.xls
│   ├── 2012_brooklyn.xls
│   ├── 2012_manhattan.xls
│   ├── 2012_queens.xls
│   ├── 2012_statenisland.xls
│   ├── 2013_bronx.xls
│   ├── 2013_brooklyn.xls
│   ├── 2013_manhattan.xls
│   ├── 2013_queens.xls
│   ├── 2013_statenisland.xls
│   ├── 2014_bronx.xls
│   ├── 2014_brooklyn.xls
│   ├── 2014_manhattan.xls
│   ├── 2014_queens.xls
│   ├── 2014_statenisland.xls
│   ├── 2015_bronx.xls
│   ├── 2015_brooklyn.xls
│   ├── 2015_manhattan.xls
│   ├── 2015_queens.xls
│   ├── 2015_statenisland.xls
│   ├── glossary_rsf071607.pdf
│   ├── sales_2007_bronx.xls
│   ├── sales_2007_brooklyn.xls
│   ├── sales_2007_manhattan.xls
│   ├── sales_2007_queens.xls
│   ├── sales_2007_statenisland.xls
│   ├── sales_2008_bronx.xls
│   ├── sales_2008_brooklyn.xls
│   ├── sales_2008_manhattan.xls
│   ├── sales_2008_queens.xls
│   ├── sales_2008_statenisland.xls
│   ├── sales_bronx_03.xls
│   ├── sales_bronx_04.xls
│   ├── sales_bronx_05.xls
│   ├── sales_bronx_06.xls
│   ├── sales_brooklyn_03.xls
│   ├── sales_brooklyn_04.xls
│   ├── sales_brooklyn_05.xls
│   ├── sales_brooklyn_06.xls
│   ├── sales_manhattan_03.xls
│   ├── sales_manhattan_04.xls
│   ├── sales_manhattan_05.xls
│   ├── sales_manhattan_06.xls
│   ├── sales_queens_03.xls
│   ├── sales_queens_04.xls
│   ├── sales_queens_05.xls
│   ├── sales_queens_06.xls
│   ├── sales_si_03.xls
│   ├── sales_si_04.xls
│   ├── sales_si_05.xls
│   └── sales_si_06.xls
├── RE_2016-2017
│   └── nyc-rolling-sales.csv
└── RE_Apr2019-Mar2020
    ├── rollingsales_bronx.xls
    ├── rollingsales_brooklyn.xls
    ├── rollingsales_manhattan.xls
    ├── rollingsales_queens.xls
    └── rollingsales_statenisland.xls

2.1 Importing the Real Estate Data

The yearly real estate data we downloaded from the city of New York’s OpenData portal were formatted in Excel files (with the exception of that from 2016-2017, which were contained in a single CSV file, and 2018, which were missing entirely). Each Excel file contained all the real estate sales in a particular borough during a particular calendar year from 2003 to 2020.

To import the Excel file data into R, we start by creating a list of the all the filenames.

Next, we create a custom function to pass to lapply() that will import the data from each file into a dataframe, and store all those dataframes into a corresponding list called dfs_list. Although all the data across every file were consistently formatted with the same column headers, frustratingly, they did not always start on the same row (but always started on one of the first 10 rows). Because of this, we had to do two reads of each file—an initial read of the first 10 rows and columns just to find the index of the first column header (titled “BOROUGH” in our case), and a second read to actually import the data starting from that index. This stack overflow post was very helpful in setting up this ‘double-read’ functionality. We also use clean_names() from the janitor package to make the column names easier to work with later on.

The only data we still need to import are the 2016-2017 data from the CSV file. We append it to our list of dataframes before combining that list into a single dataframe with rbindlist() and saving it as in an RDA file for easy access later on.

Now that we have our housing data, we can move on to importing the crime data!

df_housing

borough neighborhood building_class_category tax_class_at_present block lot ease_ment building_class_at_present address apartment_number zip_code residential_units commercial_units total_units land_square_feet gross_square_feet year_built tax_class_at_time_of_sale building_class_at_time_of_sale sale_price sale_date
2 BATHGATE 01 ONE FAMILY HOMES 1 3035 5 NA A9 451 EAST 178 STREET NA 10457 1 0 1 1287 1992 1899 1 A9 0 2009-12-07
2 BATHGATE 01 ONE FAMILY HOMES 1 3046 42 NA A1 2069 BATHGATE AVENUE NA 10457 1 0 1 1964 1424 1899 1 A1 100 2009-04-28
2 BATHGATE 01 ONE FAMILY HOMES 1 3046 42 NA A1 2069 BATHGATE AVE NA 10457 1 0 1 1964 1424 1899 1 A1 100 2009-04-13
2 BATHGATE 01 ONE FAMILY HOMES 1 3048 19 NA A5 2186 BATHGATE AVENUE NA 10457 1 0 1 1768 1188 1901 1 A5 345752 2009-03-16
2 BATHGATE 01 ONE FAMILY HOMES 1 3048 28 NA A1 540 EAST 182ND STREET NA 10457 1 0 1 1209 1048 1901 1 A1 0 2009-05-14
2 BATHGATE 01 ONE FAMILY HOMES 1 3053 105 NA A1 2323 BASSFORD AVENUE NA 10458 1 0 1 922 1248 1901 1 A1 0 2009-01-22
2 BATHGATE 02 TWO FAMILY HOMES 1 3030 71 NA B2 4443 PARK AVENUE NA 10457 2 0 2 2352 1632 1899 1 B2 135000 2009-05-08
2 BATHGATE 02 TWO FAMILY HOMES 1 3030 76 NA B2 4435 PARK AVENUE NA 10457 2 0 2 2444 2125 1899 1 B2 325000 2009-03-30
2 BATHGATE 02 TWO FAMILY HOMES 1 3035 131 NA B1 464 EAST 179TH NA 10457 2 0 2 1790 2340 1998 1 B1 0 2009-10-07
2 BATHGATE 02 TWO FAMILY HOMES 1 3036 33 NA B3 2091 WASHINGTON AVE NA 10457 2 0 2 2022 3854 1899 1 B3 0 2009-05-08
2 BATHGATE 02 TWO FAMILY HOMES 1 3036 62 NA B1 459 EAST 179 STREET NA 10457 2 0 2 2000 2340 1998 1 B1 344500 2009-11-20
2 BATHGATE 02 TWO FAMILY HOMES 1 3036 62 NA B1 459 EAST 179TH STREET NA 10457 2 0 2 2000 2340 1998 1 B1 316474 2009-05-04
2 BATHGATE 02 TWO FAMILY HOMES 1 3037 6 NA B2 4392 PARK AVENUE NA 10457 2 0 2 2275 3240 1899 1 B2 359038 2009-01-19
2 BATHGATE 02 TWO FAMILY HOMES 1 3039 12 NA B2 454 EAST 185TH STREET NA 10458 2 0 2 1667 1944 1910 1 B2 183028 2009-11-02
2 BATHGATE 02 TWO FAMILY HOMES 1 3039 45 NA B1 446 EAST 186 STREET NA 10458 2 0 2 2000 2060 1901 1 B1 286000 2009-10-09
2 BATHGATE 02 TWO FAMILY HOMES 1 3047 16 NA B2 2125 BATHGATE AVENUE NA 10457 2 0 2 3362 1744 1901 1 B2 0 2009-04-03
2 BATHGATE 02 TWO FAMILY HOMES 1 3050 94 NA B2 2240 BASSFORD AVENUE NA 10457 2 0 2 1300 2300 1901 1 B2 389000 2009-07-28
2 BATHGATE 02 TWO FAMILY HOMES 1 3053 22 NA S2 511 EAST 183 STREET NA 10458 2 1 3 2011 4280 1901 1 S2 1 2009-11-18
2 BATHGATE 02 TWO FAMILY HOMES 1 3053 22 NA S2 511 EAST 183RD STREET NA 10458 2 1 3 2011 4280 1901 1 S2 415440 2009-08-24
2 BATHGATE 03 THREE FAMILY HOMES 1 2912 116 NA C0 1536 WASHINGTON AVENUE NA 10457 3 0 3 2000 2400 1993 1 C0 319500 2009-10-07
2 BATHGATE 03 THREE FAMILY HOMES 1 2912 116 NA C0 1536 WASHINGTON AVENUE NA 10457 3 0 3 2000 2400 1993 1 C0 100 2009-01-12
2 BATHGATE 03 THREE FAMILY HOMES 1 3029 28 NA C0 2092 WEBSTER AVENUE NA 10457 3 0 3 2122 3510 1899 1 C0 0 2009-11-18
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 67 NA C0 460 EAST 182 STREET NA 10457 3 0 3 2000 3744 1899 1 C0 162240 2009-11-12
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 103 NA C0 2135 WASHINGTON AVENUE NA 10457 3 0 3 1701 4600 2006 1 C0 630000 2009-07-17
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 104 NA C0 465 EAST 181 STREET NA 10457 3 0 3 1095 3706 2006 1 C0 602500 2009-09-02
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 105 NA C0 463 EAST 181 STREET NA 10457 3 0 3 1095 3706 2006 1 C0 602500 2009-04-24
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 106 NA C0 461 EAST 181 STREET NA 10457 3 0 3 1095 3706 2006 1 C0 602500 2009-04-27
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 107 NA C0 459 EAST 181 STREET NA 10457 3 0 3 1095 3706 2006 1 C0 640000 2009-05-20
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 108 NA C0 457 EAST 181 STREET NA 10457 3 0 3 1419 3706 2006 1 C0 602500 2009-07-23
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 109 NA C0 455 EAST 181 STREET NA 10457 3 0 3 1687 4050 2006 1 C0 602500 2009-07-15
2 BATHGATE 03 THREE FAMILY HOMES 1 3037 110 NA C0 453 EAST 181 STREET NA 10457 3 0 3 1687 4050 2006 1 C0 602500 2009-04-29
2 BATHGATE 03 THREE FAMILY HOMES 1 3044 51 NA C0 1966 BATHGATE AVENUE NA 10457 3 0 3 2880 3132 1910 1 C0 101400 2009-12-16
2 BATHGATE 03 THREE FAMILY HOMES 1 3044 51 NA C0 1964 BATHGATE AVENUE NA 10457 3 0 3 2880 3132 1910 1 C0 0 2009-02-23
2 BATHGATE 03 THREE FAMILY HOMES 1 3050 3 NA C0 2190 WASHINGTON AVENUE NA 10457 3 0 3 1331 2244 1901 1 C0 178500 2009-07-07
2 BATHGATE 05 TAX CLASS 1 VACANT LAND 1B 3050 12 NA V0 2246 WASHINGTON AVENUE NA 10457 0 0 0 2901 0 0 1 V0 0 2009-09-30
2 BATHGATE 05 TAX CLASS 1 VACANT LAND 1B 3050 13 NA V0 2250 WASHINGTON AVENUE NA 10457 0 0 0 2668 0 0 1 V0 0 2009-09-30
2 BATHGATE 05 TAX CLASS 1 VACANT LAND 1B 3050 15 NA V0 2254 WASHINGTON AVENUE NA 10457 0 0 0 2673 0 0 1 V0 0 2009-09-30
2 BATHGATE 05 TAX CLASS 1 VACANT LAND 2 3050 41 NA D1 491 FLETCHER PLACE NA 10457 0 0 0 1637 0 0 1 V0 0 2009-09-30
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2 3027 29 NA C1 422 EAST 178TH STREET NA 10457 16 0 16 3344 11918 1909 2 C1 1315000 2009-08-26
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2 3028 27 NA C7 4305 PARK AVENUE NA 10457 24 1 25 5568 16680 1909 2 C7 0 2009-08-26
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2 3028 29 NA C1 4301 PARK AVENUE NA 10457 24 0 24 5189 15300 1909 2 C1 0 2009-08-26
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2 3028 34 NA C1 4293 PARK AVENUE NA 10457 20 0 20 6045 15512 1909 2 C1 0 2009-08-26
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2 3028 44 NA C1 4289 PARK AVENUE NA 10457 22 0 22 6098 16728 1908 2 C1 6885000 2009-08-26
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2A 3037 99 NA C3 2137 WASHINGTON AVENUE NA 10457 4 0 4 2719 3348 1901 2 C3 10 2009-04-17
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2B 3041 2 NA C1 4678 PARK AVENUE NA 10458 8 0 8 3417 9360 1910 2 C1 0 2009-11-04
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2B 3041 3 NA C1 4682 PARK AVENUE NA 10458 8 0 8 3425 9360 1910 2 C1 0 2009-11-04
2 BATHGATE 07 RENTALS - WALKUP APARTMENTS 2B 3048 20 NA C1 2190 BATHGATE AVE NA 10457 7 0 7 2417 5400 1931 2 C1 1e+05 2009-02-19
2 BATHGATE 08 RENTALS - ELEVATOR APARTMENTS 2 3050 10 NA D1 2240 WASHINGTON AVENUE NA 10457 68 1 69 13424 54365 2010 2 D1 0 2009-09-30
2 BATHGATE 10 COOPS - ELEVATOR APARTMENTS 2 3035 48 NA D4 463 EAST 178TH STREET, 1E NA 10457 0 0 0 0 0 1935 2 D4 24000 2009-01-21
2 BATHGATE 10 COOPS - ELEVATOR APARTMENTS 2 3035 48 NA D4 463 EAST 178TH STREET, 6E NA 10457 0 0 0 0 0 1935 2 D4 20000 2009-12-09

2.2 Importing crime data

2.2.1 Loading tidyverse package

2.2.2 Reading crime data

We use read.csv instead of read_csv to get a data.frame instead of tibble, which is more useful when doing analysis in other programs

2.3 Importing the Census Data

2.3.1 Downloading the census block data

Download the United States census bureau data estimates for 2015 for all 2,167 individual NYC census tracts. This data is in two parts. Import the first part which includes the block code/census tract, its latitude and longitude as well as the name of the county and state through the following code. After getting them in a dataframe (block_loc), remove the rows with ‘NA’. A basic ‘unique value check’ shows that there are 38,936 locations in 20,406 blocks in 15 counties. These longitude and latitude locations would help us later when we classify the demographic data by zip code.


The dataframe looks like this:
Latitude Longitude BlockCode County State
40.48 -74.28000 3.402301e+14 Middlesex NJ
40.48 -74.27683 3.402301e+14 Middlesex NJ
40.48 -74.27367 3.402301e+14 Middlesex NJ
40.48 -74.27050 3.402301e+14 Middlesex NJ
40.48 -74.26734 3.402301e+14 Middlesex NJ
40.48 -74.26417 3.402301e+14 Middlesex NJ


We divide this dataframe into different counties in NY and NJ to get a better sense about the data. This gives us information about the number of locations in each of the counties in our data.


2.3.2 Downloading the 2nd part NYC census tracts data

The second part of the census data is the NYC census tracts data which is imported below. This is also split into different dataframes which contained data about sex, race, income, jobs, commute and employment, albeit these dataframes will not be used separately for analysis in the future.


The second dataframe looks like this:
CensusTract County Borough TotalPop Men Women Hispanic White Black Native Asian Citizen Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp WorkAtHome MeanCommute Employed PrivateWork PublicWork SelfEmployed FamilyWork Unemployment
3.6005e+10 Bronx Bronx 7703 7133 570 29.9 6.1 60.9 0.2 1.6 6476 NA NA 2440 373 NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 NA NA NA NA NA
3.6005e+10 Bronx Bronx 5403 2659 2744 75.8 2.3 16.0 0.0 4.2 3639 72034 13991 22180 2206 20.0 20.7 28.7 17.1 23.9 8.0 22.3 44.8 13.7 38.6 2.9 0.0 0.0 43.0 2308 80.8 16.2 2.9 0.0 7.7
3.6005e+10 Bronx Bronx 5915 2896 3019 62.7 3.6 30.7 0.0 0.3 4100 74836 8407 27700 2449 13.2 23.6 32.2 23.4 24.9 9.0 10.5 41.3 10.0 44.6 1.4 0.5 2.1 45.0 2675 71.7 25.3 2.5 0.6 9.5
3.6005e+10 Bronx Bronx 5879 2558 3321 65.1 1.6 32.4 0.0 0.0 3536 32312 6859 17526 2945 26.3 35.9 19.1 36.1 26.2 4.9 13.8 37.2 5.3 45.5 8.6 1.6 1.7 38.8 2120 75.0 21.3 3.8 0.0 8.7
3.6005e+10 Bronx Bronx 2591 1206 1385 55.4 9.0 29.0 0.0 2.1 1557 37936 3771 17986 2692 37.1 31.5 35.4 20.9 26.2 6.6 11.0 19.2 5.3 63.9 3.0 2.4 6.2 45.4 1083 76.8 15.5 7.7 0.0 19.2
3.6005e+10 Bronx Bronx 8516 3301 5215 61.1 1.6 31.1 0.3 3.3 5436 18086 3694 12023 2346 53.2 67.7 14.5 41.1 16.7 7.1 20.6 19.6 7.0 68.2 4.3 1.0 0.0 46.0 2508 71.0 21.3 7.7 0.0 17.2

3 Data Cleaning and Processing

After importing the data, we moved on to cleaning and processing. Cleaning the data involved removing the entries that were either irrelevant or incomplete for the purposes of our project. Processing involved geocoding the data, grouping it spatiotemporally, and then computing new metrics to help describe each group.

Grouping the data would allow us to more easily draw observe trends and draw meaningful conclusions, and we decided to group the real estate and crime data entries by zip code and month. The ~250 zip code areas in NYC would give us enough granularity to see the how the myriad of different regions in New York City differ from one another over time, but also enough courseness to guarantee acceptable response times of our visualization app, and ensure that there were usually multiple entries in each group to average across.

3.1 Crime Data

3.1.1 Cleaning the Crime Data

Generally, the crime data is cleaned as part of the other sections. When the crime scores are taken to classify the crimes and these scores are summed across zipcodes, crimes that could not be classified into defined categories are excluded. While excluding NAs could introduce bias, it is believed that the effect would be consistent across time and population for the analysis. However, it did appear that there was a difference in the reporting for crime scores after 2006, which was accounted for when representing the data visually.

3.1.2 Assigning Crime Rating

3.1.2.2 Assigning weights to each crime

Assign a weight to every crime based on the degree of severity. A violation which includes minor offenses like speeding ticket, public intoxication etc. is assigned a weight of 1. A misdemeanour which can include a simple assault or driving under influence is assigned a wight of 2. These were easy to identify as the LAW_CAT_CD column in df_crime gives the category of the crime (Violation, Misdemeanour or Felony).


Assigning weights to felonies is a little more tricky. Since, felonies include a vast variety of criminal activity, we decided to first divide them into 4 categories, 1st to 4th degree. First-degree felonyincludes murder, rape, kidnapping, arson and fraud. Second-degree felony includes aggravated assault, felony assault, arson, manslaughter, possession of a controlled substance and child molestation. Third-degree felony includes assault and battery, elder abuse, transmission of pornography, driving under the influence, fraud and arson. Finally, fourth-degree felony includes involuntary manslaughter, burglary, larceny and resisting arrest. (These lists are just examples and not exhaustive and are based on a source: https://legaldictionary.net/felony/) From the OFNS_DESC column generate a list of unique keywords and group them into 4 lists pertaining to the type of feloniy. Then, the crime is assigned a weight of 6, 5, 4 or 3 for felonies of 1st degree, 2nd degree, 3rd degree and 4th degree respectively depending on which category the OFNS_DESC column of that crime suggests.

#Felonies are the most serious of offenses and will be sub-classified
#Common law and statutes in most states divide felonies into first through fourth degree felonies, each carrying decreasing penalties
#source: https://legaldictionary.net/felony/

df_crime2_felony <- subset(df_crime2, df_crime2$LAW_CAT_CD=="FELONY")
felonytypes <- unique(df_crime2_felony$OFNS_DESC)
felony_1stdegree <- c("RAPE", "THEFT-FRAUD", "ARSON", "KIDNAPPING & RELATED OFFENSES", "MURDER & NON-NEGL. MANSLAUGHTER", "KIDNAPPING", "KIDNAPPING AND RELATED OFFENSES", "NYS LAWS-UNCLASSIFIED FELONY" )
felony_2nddegree <- c("FELONY ASSAULT", "FORGERY", "SEX CRIMES", "DANGEROUS DRUGS", "DANGEROUS WEAPONS")
felony_3rddegree <- c("INTOXICATED/IMPAIRED DRIVING","ROBBERY", "CRIMINAL MISCHIEF & RELATED OF", "PROSTITUTION & RELATED OFFENSES")
felony_4thdegree <- c("GAMBLING","", "BURGLARY","GRAND LARCENY", "GRAND LARCENY OF MOTOR VEHICLE", "MISCELLANEOUS PENAL LAW", "POSSESSION OF STOLEN PROPERTY", "CHILD ABANDONMENT/NON SUPPORT", "HOMICIDE-NEGLIGENT-VEHICLE", "HOMICIDE-NEGLIGENT,UNCLASSIFIE", "ENDAN WELFARE INCOMP")

df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_1stdegree == TRUE)]=6
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_2nddegree == TRUE)]=5
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_3rddegree == TRUE)]=4
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_4thdegree == TRUE)]=3

df_crime <- df_crime2
save(df_crime, file='Data_Crime.rda')

#Extract month and year for each crime
df_crime3 <- subset(df_crime2, select = c(CMPLNT_FR_DT, OFNS_DESC, LAW_CAT_CD, Longitude, Latitude, weight))
head(df_crime3)

3.1.3 Group crime information by month, year, and zipcode

First, the relevant datafiles are imported into the environment and the packages are loaded:

For analysis, it is helpful to have the events grouped by month, year, and zipcode. Then, a field like the crime score can be aggregated and normalized for this criterion. In this case, the crime data is formatted in the desired way with the following command:

Source: https://stackoverflow.com/questions/33221425/how-do-i-group-my-date-variable-into-month-year-in-r

3.1.4 Group other fields by time and zipcode

In addition, commands are run to normalize the crime score by the total population and only take data greater than the year 2000. There are extraneous entries in the data for crimes hundreds of years ago and 2000 is a good baseline. It also appears that the reporting method for these crimes changes in 2006, as evidenced by a dramatic change in crime scores at this time.

# Summing the scores
names(score_by_time_and_rating)[names(score_by_time_and_rating) == 'summary_variable'] <- 'sum_weight'

# Joining the zipcodes
score_by_time_and_rating <- left_join(score_by_time_and_rating, zip_population, by="zip_code")

# Dividing sum by the total population (at 2015 census levels)
score_by_time_and_rating$weight <- score_by_time_and_rating$sum_weight / score_by_time_and_rating$TotalPop

# Dipslaying the head
head(score_by_time_and_rating)

save(score_by_time_and_rating, file = "Data_Score_by_Time_and_Rating.rda")

#Group by year and rating (This is for the regression model for house sales)
score_by_zip_and_year <- df_crime %>% group_by(year=floor_date(CMPLNT_FR_DT, "year"), zip_code)  %>% summarize(summary_variable=sum(weight))

#Summing the scores
names(score_by_zip_and_year)[names(score_by_zip_and_year) == 'summary_variable'] <- 'sum_weight'

#Extracting the year
score_by_zip_and_year$year <- year(score_by_zip_and_year$year)
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(year,zip_code,sum_weight))

# Joining the zipcodes with populations
score_by_zip_and_year <- left_join(score_by_zip_and_year, zip_population, by="zip_code")
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(year,zip_code,sum_weight, TotalPop))

# Dividing sum by the total population (at 2015 census levels)
score_by_zip_and_year$weight <- score_by_zip_and_year$sum_weight / score_by_zip_and_year$TotalPop

#Taking data only for years 2001 and above 
score_by_zip_and_year <- subset(score_by_zip_and_year, year > 2000)
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(zip_code, year, sum_weight, weight, TotalPop))

save(score_by_zip_and_year, file = "Data_Score_by_year_and_zipcode.rda")

3.1.5 Converting coordinates to zipcodes

There was difficulty with initially processing the location data into a format for analysis. For the crime and census data:

  • Crime data - complaints are recorded with only a latitude and longitude
  • Census data - the census data uses census tracts for identifying locations

For the regressions and displaying the data visually, the data needs to be grouped by similar location. In this case, zipcodes were chosen to be the level to group by. By classifying the data entries in a zipcode, the effect of the zipcode can be studied.

At first, an API was used to convert the coordinates from the crime data to zip codes. The revgeocode function from the ggmap package was used and applied to each crime complaint. However, it was taking too long to make the requests to retrieve each of the zipcodes. Although this route was not pursued, it may be possible to perform the API requests in parallel and improve performance.

Instead, it was chosen to perform the reverse geocoding locally and avoid having to make requests to an API service for each zip code. Further, since the zip codes are limited to only New York, it was possible to only download the location data for this area. In this section, the same method will be applied to extract the zipcodes.

3.1.6 Sources for latitude and longitude to zipcode code and data

The source for the code to retrieve the zipcodes from latitude and longitude coordinates locally is: https://stackoverflow.com/questions/46267287/reverse-geocoding-speed

The source for the geospatial polygon data to process the coordinates is: https://jsspina.carto.com/tables/nyc_zip_code_tabulation_areas_polygons/public/map

3.1.7 Processing the crime data

First, the packages are loaded:

The crime data is read in with the following command:

Then the desired latitude and longitude fields are taken. the complaint number is also taken so that the crime-zipcode data can be joined again later. The NA values are omitted.

As stated in the sourced guide above for reverse geocoding this data, “To perform spatial operations, the coordinate reference system needs to match between the two geometries.”

The coordinates are processed with the st_within function and put into an output using sapply:

The output for each zipcode was a list of zipcodes even though there was just one. There were multiple outputs in the case that one was trying to link overlapping geometries with the data. The zipcodes are converted to the proper format:

Finally, the zipcodes are linked back with the crime data and saved:

3.2 Census Data

3.2.1 Cleaning the census data


After importing the census data in section “02C: Data_Import_Census”, no more data cleaning is required as the data was organized into a usable format there itself and the NA values were removed.

3.2.2 Processing the census data

First, the packages are loaded:

The process for finding the zipcodes that corresponded to the census data is different. The census tract information is more granular than at the zipcode level. The mapping between tracts and zipcodes is imported and joined to the census data, so that each element has a corresponding zipcode. The Excel file can be downloaded from the source at end of this section.

The most important element was the corresponding populations for each zipcode. The population data for each demographic was summed across each zipcode. The demographic information was initially stored as percentages, so it was necessary to multiply those percentages by the total population for each census tract.

#Converting the demographic and unemployment percentages to numbers and income per capita to income
zip_population$Hispanic.number <- zip_population$Hispanic*zip_population$TotalPop/100
zip_population$White.number <- zip_population$White*zip_population$TotalPop/100
zip_population$Black.number <- zip_population$Black*zip_population$TotalPop/100
zip_population$Native.number <- zip_population$Native*zip_population$TotalPop/100
zip_population$Asian.number <- zip_population$Asian*zip_population$TotalPop/100
zip_population$IncomeTot <- zip_population$IncomePerCap*zip_population$TotalPop
#Percentage of people in the age group of 18-65 in NY is 65% source: https://www.census.gov/quickfacts/newyorkcitynewyork
zip_population$Unemployment.number <- zip_population$Unemployment*0.65*zip_population$TotalPop/100

#Aggregating the various census tract data into 1 zip code
zip_population <- subset(zip_population, select = c(zip_code,TotalPop, Men, Women, Hispanic.number, White.number, Black.number, Native.number, Asian.number, IncomeTot, Unemployment.number))
zip_population <- aggregate(.~zip_code, data = zip_population, FUN = sum)

#Convert back the demographics and unemployment to % and IncomeTot to per capita income
zip_population$Hispanic.number <- zip_population$Hispanic.number*100/zip_population$TotalPop
zip_population$White.number <- zip_population$White.number*100/zip_population$TotalPop
zip_population$Black.number <- zip_population$Black.number*100/zip_population$TotalPop
zip_population$Native.number <- zip_population$Native.number*100/zip_population$TotalPop
zip_population$Asian.number <- zip_population$Asian.number*100/zip_population$TotalPop
zip_population$IncomeTot <- zip_population$IncomeTot/zip_population$TotalPop
zip_population$Unemployment.number <- zip_population$Unemployment.number*100/(0.65*zip_population$TotalPop)

Finally, the columns are renamed as needed. The data for zipcode to population demographic was saved for analysis later:

3.2.3 Source for census tract to zipccode mapping data

The following site includes the most up to date census tract to zipcode information: https://www.huduser.gov/portal/datasets/usps_crosswalk.html

3.3 Real Estate Data

3.3.1 Geocoding the Real Estate Data

In order to group the real estate data by zip code, we first need to attach a zip code value to each entry in our data (a process known as “geocoding”, or “reverse-geocoding” to be more precise). To do this, we use the Geoclient API available on the city’s Developer Portal. Before pulling from the API, we do some quick cleaning by converting the borough, sale_price, and gross_square_feet columns to the numeric data type (to make sure they’re not incorrectly stored as character or factor types). Then we filter out some of the data with bad sale_price and gross_square_feet values, as this makes for fewer API queries and we know we won’t want use those entries in our analysis anyway. The last step before pulling from the API is creating the bbl column, which is a 10 digit combination of the borough, block, and lot columns. We will pass each entry’s bbl number to the API, which will then return the zip code (among other data) for that specific plot of land.

A nifty CRAN package appropriately titled geoclient allows us to pull data from the Geoclient API in just a single line of code. However, because the API is public and free, the rate of requests from a single account is limited and it took ~7 hours to query all 1.4 mllion properties. After the queries are complete, we bind the returned API data to the existing df_housing data frame in a new data frame called geocoded_df_housing.

3.3.2 Cleaning the Real Estate Data

In the next step, we modify data from a CSV file with all NYC building class codes to include a column labeling the building category of each class code (e.g., Residential, Commercial, Industrial, etc.).

building_class_key

class_code category
A0 Residential
A1 Residential
A2 Residential
A3 Residential
A4 Residential
A5 Residential
A6 Residential
A7 Residential
A8 Residential
A9 Residential
B1 Residential
B2 Residential
B3 Residential
B9 Residential
C0 Residential
C1 Residential
C2 Residential
C3 Residential
C4 Residential
C5 Residential
C6 Residential
C7 Residential
C8 Residential
C9 Residential
CM Residential
D0 Residential
D1 Residential
D2 Residential
D3 Residential
D4 Residential
D5 Residential
D6 Residential
D7 Residential
D8 Residential
D9 Residential
E1 Industrial
E2 Industrial
E3 Industrial
E4 Industrial
E7 Industrial
E9 Industrial
F1 Industrial
F2 Industrial
F4 Industrial
F5 Industrial
F8 Industrial
F9 Industrial
G0 Industrial
G1 Industrial
G2 Industrial
G3 Industrial
G4 Industrial
G5 Industrial
G6 Industrial
G7 Industrial
G8 Industrial
G9 Industrial
GU Industrial
GW Industrial
H1 Commercial
H2 Commercial
H3 Commercial
H4 Commercial
H5 Commercial
H6 Commercial
H7 Commercial
H8 Commercial
H9 Commercial
HB Commercial
HH Commercial
HR Commercial
HS Commercial
I1 Civic
I2 Civic
I3 Civic
I4 Civic
I5 Civic
I6 Civic
I7 Civic
I9 Civic
J1 Civic
J2 Civic
J3 Civic
J4 Civic
J5 Civic
J6 Civic
J7 Civic
J8 Civic
J9 Civic
K1 Commercial
K2 Commercial
K3 Commercial
K4 Commercial
K5 Commercial
K6 Commercial
K7 Commercial
K8 Commercial
K9 Commercial
L1 Commercial
L2 Commercial
L3 Commercial
L8 Commercial
L9 Commercial
M1 Civic
M2 Civic
M3 Civic
M4 Civic
M9 Civic
N1 Civic
N2 Civic
N3 Civic
N4 Civic
N9 Civic
O1 Commercial
O2 Commercial
O3 Commercial
O4 Commercial
O5 Commercial
O6 Commercial
O7 Commercial
O8 Commercial
O9 Commercial
P1 Civic
P2 Civic
P3 Civic
P4 Civic
P5 Civic
P6 Civic
P7 Civic
P8 Civic
P9 Civic
Q1 Civic
Q2 Civic
Q3 Civic
Q4 Civic
Q5 Civic
Q6 Civic
Q7 Civic
Q8 Civic
Q9 Civic
R0 Residential
R1 Residential
R2 Residential
R3 Residential
R4 Residential
R5 Residential
R6 Residential
R7 Residential
R8 Residential
R9 Residential
RA Residential
RB Residential
RG Residential
RH Residential
RK Residential
RP Residential
RR Residential
RS Residential
RT Residential
RW Residential
S0 Mixed
S1 Mixed
S2 Mixed
S3 Mixed
S4 Mixed
S5 Mixed
S9 Mixed
T1 Industrial
T2 Industrial
T9 Industrial
U0 Industrial
U1 Industrial
U2 Industrial
U3 Industrial
U4 Industrial
U5 Industrial
U6 Industrial
U7 Industrial
U8 Industrial
U9 Industrial
V0 Vacant
V1 Vacant
V2 Vacant
V3 Vacant
V4 Vacant
V5 Vacant
V6 Vacant
V7 Vacant
V8 Vacant
V9 Vacant
W1 Civic
W2 Civic
W3 Civic
W4 Civic
W5 Civic
W6 Civic
W7 Civic
W8 Civic
W9 Civic
Y1 Government
Y2 Government
Y3 Government
Y4 Government
Y5 Government
Y6 Government
Y7 Government
Y8 Government
Y9 Government
Z0 Other
Z1 Other
Z2 Other
Z3 Other
Z4 Other
Z5 Other
Z7 Other
Z8 Other
Z9 Other


Now we select the columns from geocoded_df_housing that we want to work with moving forward, filter out missing values and high/low-priced outliers, and use the building class codes to assign a category to each building sale. This is all stored and saved new data frame called cleaned_df_housing, which we will use for grouping in the next sub-section.

cleaned_df_housing

building_class_at_present zip_code gross_square_feet sale_price sale_date price_per_sqft category
A5 10457 1188 345752 2009-03-16 291.03704 Residential
B2 10457 1632 135000 2009-05-08 82.72059 Residential
B2 10457 2125 325000 2009-03-30 152.94118 Residential
B1 10457 2340 344500 2009-11-20 147.22222 Residential
B1 10457 2340 316474 2009-05-04 135.24530 Residential
B2 10457 3240 359038 2009-01-19 110.81420 Residential
B2 10458 1944 183028 2009-11-02 94.15021 Residential
B1 10458 2060 286000 2009-10-09 138.83495 Residential
B2 10457 2300 389000 2009-07-28 169.13043 Residential
S2 10458 4280 415440 2009-08-24 97.06542 Mixed
C0 10457 2400 319500 2009-10-07 133.12500 Residential
C0 10457 3744 162240 2009-11-12 43.33333 Residential
C0 10457 4600 630000 2009-07-17 136.95652 Residential
C0 10457 3706 602500 2009-09-02 162.57420 Residential
C0 10457 3706 602500 2009-04-24 162.57420 Residential
C0 10457 3706 602500 2009-04-27 162.57420 Residential
C0 10457 3706 640000 2009-05-20 172.69293 Residential
C0 10457 3706 602500 2009-07-23 162.57420 Residential
C0 10457 4050 602500 2009-07-15 148.76543 Residential
C0 10457 4050 602500 2009-04-29 148.76543 Residential
C0 10457 3132 101400 2009-12-16 32.37548 Residential
C0 10457 2244 178500 2009-07-07 79.54545 Residential
C1 10457 11918 1315000 2009-08-26 110.33730 Residential
C1 10457 16728 6885000 2009-08-26 411.58537 Residential
K1 10457 1775 336000 2009-07-23 189.29577 Commercial
S4 10457 3233 185000 2009-07-07 57.22239 Mixed
M1 10458 9000 860000 2009-07-15 95.55556 Civic
A5 10469 1512 370000 2009-05-04 244.70899 Residential
A5 10469 1701 258570 2009-06-04 152.01058 Residential
S1 10469 1889 325000 2009-09-04 172.04870 Mixed
A1 10469 2232 349990 2009-10-09 156.80556 Residential
A5 10469 1681 260000 2009-09-17 154.66984 Residential
A5 10469 1920 312000 2009-06-12 162.50000 Residential
A5 10469 1920 315000 2009-08-05 164.06250 Residential
A5 10469 1920 308510 2009-10-28 160.68229 Residential
A5 10469 1742 330000 2009-03-30 189.43743 Residential
A5 10469 1440 310000 2009-04-17 215.27778 Residential
A5 10469 1778 380000 2009-03-25 213.72328 Residential
A5 10469 1280 300000 2009-10-05 234.37500 Residential
A5 10469 1320 364000 2009-01-12 275.75758 Residential
A5 10469 1890 423470 2009-06-18 224.05820 Residential
A2 10469 1896 395000 2009-04-13 208.33333 Residential
A5 10469 1534 250000 2009-06-08 162.97262 Residential
A2 10469 779 125000 2009-08-24 160.46213 Residential
A2 10469 396 113645 2009-07-27 286.98232 Residential
A1 10469 1288 400000 2009-06-25 310.55901 Residential
A2 10469 1140 135000 2009-09-30 118.42105 Residential
A2 10469 2234 320000 2009-07-23 143.24082 Residential
A2 10469 1000 325943 2009-09-03 325.94300 Residential
B3 10469 4019 165000 2009-11-09 41.05499 Residential


3.3.3 Grouping the Real Estate Data

Now that we have the real estate geocoded and cleaned, we can group it by aggregating all sales that took place in the same zip code and during the same month. We will create 3 new variables to describe each group: average price per square foot (avg_price_per_sqft), total number of sales (num_sales), and total proceeds from all sales (total_proceeds).

Next, we peform the same grouping function again, but with building category as an extra grouping variable. This will allow us to filter the displayed data by building category in the UI for our visualization. We combine the groups of individual categories with the groups of all categories to create a single data frame. Then we filter high-sales outliers and save this data frame for visualization in our Shiny app.

grouped_housing

zip_code month_char category avg_price_per_sqft num_sales total_proceeds
10001 2003-01 Commercial 277.77778 1 2200000
10001 2003-01 Mixed 418.26420 1 2400000
10001 2003-01 Residential 201.08860 1 665000
10001 2003-02 Commercial 391.52249 7 65050000
10001 2003-03 Commercial 267.30755 1 10900000
10001 2003-04 Commercial 541.50438 2 4670000
10001 2003-04 Residential 295.63932 1 2000000
10001 2003-06 Commercial 258.90032 2 4780000
10001 2003-07 Commercial 298.33823 1 8025000
10001 2003-09 Commercial 506.40168 1 2650000
10001 2003-09 Mixed 377.17284 1 1150000
10001 2003-10 Commercial 148.92465 1 11550000
10001 2003-11 Commercial 108.53280 2 4263310
10001 2003-12 Commercial 77.25308 2 5234000
10001 2004-01 Commercial 84.60919 1 210000
10001 2004-01 Industrial 179.62312 1 38319000
10001 2004-02 Commercial 456.69442 3 17785074
10001 2004-02 Mixed 314.91003 1 1225000
10001 2004-03 Commercial 275.29213 2 3550000
10001 2004-04 Commercial 603.31827 10 58435000
10001 2004-04 Residential 143.24058 2 19900000
10001 2004-06 Commercial 132.41494 5 96970000
10001 2004-07 Commercial 177.95878 3 18380000
10001 2004-08 Civic 822.78481 1 11375000
10001 2004-08 Residential 468.70275 1 1550000
10001 2004-09 Commercial 154.36808 3 14033333
10001 2004-10 Commercial 200.30904 2 8350000
10001 2004-10 Residential 616.69386 2 4625000
10001 2005-01 Commercial 217.81244 4 12475000
10001 2005-01 Mixed 437.01799 1 1700000
10001 2005-01 Residential 321.61235 1 1875000
10001 2005-02 Commercial 133.75306 2 7700000
10001 2005-02 Residential 483.55932 2 9933590
10001 2005-03 Commercial 100.31802 2 16800000
10001 2005-03 Residential 134.52915 1 3000000
10001 2005-04 Commercial 285.02850 1 3800000
10001 2005-05 Civic 85.59236 1 940660
10001 2005-05 Commercial 139.86014 1 1000000
10001 2005-06 Commercial 667.11141 1 4000000
10001 2005-07 Commercial 602.09691 2 28500000
10001 2005-07 Residential 419.50293 2 8350000
10001 2005-09 Commercial 280.31495 4 120277465
10001 2005-10 Commercial 218.40519 1 31750000
10001 2005-11 Commercial 226.83706 1 35500000
10001 2005-12 Commercial 35.30153 1 3605875
10001 2005-12 Mixed 332.08247 2 3818622
10001 2006-01 Commercial 175.38672 2 32525000
10001 2006-01 Residential 132.40123 1 12500000
10001 2006-02 Commercial 638.95109 3 32350000
10001 2006-02 Residential 329.28586 1 17600000


3.4 Creating a single dataframe for modeling

3.4.1 Section of housing sales data

First, load the 2 relevant files. Data_Housing.rda (df_housing) contains all the housing sales data. Data_Zip_Population.rda (zip_population) contains the census data for each zip code.


Next, remove rows which are either blank or have NA values in the sale_price or land_square_feet column of df_housing.


Create a new dataframe, df_sale, which containse the relevant sales price prediction input variables from the df_housing dataframe. Add 2 more column, sale_year and sale_month, which contain the year and month of the sale. These values are extracted from the sale_date column by using the functions year() and month() respectively.
Add a new column, crime_score_year. If the sale was completed in the months of January, February or March, add the year just before the sale_year in the crime_score_year column. If the sale was completed in the rest of the months, crime_score_year = sale_year. This is done to map the relevant year’s crime score for each sale’s analysis. For e.g. if a sale happens in January of 2006, then it doesn’t make sense to add the crime score for 2006. Crimes happening in 2005 would have affected that sale’s price more than 2006.

3.4.3 Adding the crime score

Load the file Data_Score_by_year_and_zipcode.rda (score_by_zip_and_year) that was created in section “04B: Data_Processing_Month_Year”. Find the zipcodes that are present in both df_sale_census and score_by_zip_and_year and store them in zip_codes_inhousing_incensus_incrime. Merge the df_sale_census and score_by_zip_year dataframes by zipcodes and store them in df_sale_census_crime. Remove all rows from this final dataframe which have a zero value and save this dataframe as Data_sale_census_crime.rda for future analysis.

Final Dataframe

zip_code crime_score_year land_square_feet gross_square_feet year_built tax_class_at_time_of_sale building_class_at_time_of_sale sale_date sale_price sale_year sale_month TotalPop.x Men Women Hispanic White Black Native Asian PerCapitaIncome Unemployed sum_weight weight TotalPop.y
10001 2002 4937 54000 1912 4 L1 2003-02-12 5e+06 2003 2 127587350 61690470 65896880 3.824723 74.48506 2.424281 1.450839 15.4563 114905.6 4.76503 19 1e-07 127587350
10001 2002 4937 54000 1912 4 L1 2003-02-12 13500000 2003 2 127587350 61690470 65896880 3.824723 74.48506 2.424281 1.450839 15.4563 114905.6 4.76503 19 1e-07 127587350
10001 2002 1480 7060 1920 4 L9 2003-02-07 2150000 2003 2 127587350 61690470 65896880 3.824723 74.48506 2.424281 1.450839 15.4563 114905.6 4.76503 19 1e-07 127587350
10001 2002 2469 4494 1901 2 S4 2003-01-03 24375 2003 1 127587350 61690470 65896880 3.824723 74.48506 2.424281 1.450839 15.4563 114905.6 4.76503 19 1e-07 127587350
10001 2002 2345 7035 1920 4 K9 2003-01-03 46687 2003 1 127587350 61690470 65896880 3.824723 74.48506 2.424281 1.450839 15.4563 114905.6 4.76503 19 1e-07 127587350
10001 2002 1975 5738 1910 2 S5 2003-01-17 2400000 2003 1 127587350 61690470 65896880 3.824723 74.48506 2.424281 1.450839 15.4563 114905.6 4.76503 19 1e-07 127587350

4 Regression Modeling

The regression model helps to establish a baseline. Unfortunately, when working with a large set of data, there are issues faced when running regressions:

  1. Large dataset - it takes a long time to run the regression
  2. P-values - they go to zero quickly (they all became significant)
  3. Mean-squared error (MSE) - the mean square errors of the predicted values are high, despite the amount of data

Each of the issues can be respectively addressed:

  1. A function called fastLm removes overhead from the regression
  2. The data can be partitioned, and the accuracy prediction can be analyzed (MSE)
  3. Machine learning models can be used to capture data granularity (see machine learning code section)

4.1 Code to setup and run regressions on combined crime data

The steps for setting up the regression, running the model, and making predictions are in the following sections.

4.1.1 Load packages and import data

First, the packages are loaded:

The desired data is imported into R for the regression.

4.1.4 Partition the data

The data is partitioned into training and test sets for running regressions and determining the accuracy of a model on the test set respectively. For this initial regression analysis, a subset of 80% of the data is used for training and the rest is used for testing the accuracy of the predictions.

4.1.5 Run the regression using fastLm

The regression is run using fastLm to expedite the process and remove the overhead from the normal regression functions.

4.2 Results from the regression models

The results that were found when the regression was run for this project are summarized:

Measure Value
Adjusted R-squared value (\(R^2\)) 27.26%
Root Mean Squared (RMS) 198,239.1
Root Mean Squared Error (RMSE) 43.59%

The more detailed output from the regression model is summarized below. While it may be possible to improve the regression results by running further regressions, the machine learning models explored next will likely help to capture the granularity in the data.

Regression output

> # Run regression
> fit1 <- fastLm(x_train, y_train)
> summary(fit1)

Call:
fastLm.default(X = x_train, y = y_train)

Residuals:
     Min.   1st Qu.    Median   3rd Qu.      Max. 
-951240.0  -99101.0    2810.9  110820.0 1088100.0 

                                    Estimate      StdErr     t.value   p.value    
land_square_feet                  2.8982e-02  7.3239e-03  3.9572e+00 7.583e-05 ***
PerCapitaIncome                   1.5932e+00  5.5788e-02  2.8558e+01 < 2.2e-16 ***
Unemployed                       -1.2420e+04  1.6778e+02 -7.4027e+01 < 2.2e-16 ***
TotalPop.x                       -6.1569e+09  2.5369e-01 -2.4269e+10 < 2.2e-16 ***
Men                               6.1569e+09  2.4254e-01  2.5385e+10 < 2.2e-16 ***
Women                             6.1569e+09  2.9053e-01  2.1192e+10 < 2.2e-16 ***
building_class_at_time_of_saleA0  1.3952e+05  3.7534e+03  3.7171e+01 < 2.2e-16 ***
building_class_at_time_of_saleA1  1.0504e+05  1.4056e+03  7.4727e+01 < 2.2e-16 ***
building_class_at_time_of_saleA2  6.4862e+04  1.7563e+03  3.6931e+01 < 2.2e-16 ***
building_class_at_time_of_saleA3  3.3492e+05  5.4030e+03  6.1988e+01 < 2.2e-16 ***
building_class_at_time_of_saleA4  1.9749e+05  6.5229e+03  3.0276e+01 < 2.2e-16 ***
building_class_at_time_of_saleA5  3.9112e+04  1.4470e+03  2.7030e+01 < 2.2e-16 ***
building_class_at_time_of_saleA6 -1.5064e+05  7.0867e+03 -2.1257e+01 < 2.2e-16 ***
building_class_at_time_of_saleA7  1.8340e+05  5.3564e+01  3.4239e+03 < 2.2e-16 ***
building_class_at_time_of_saleA8 -1.1559e+05  1.4973e+01 -7.7196e+03 < 2.2e-16 ***
building_class_at_time_of_saleA9  6.3681e+04  2.0931e+03  3.0424e+01 < 2.2e-16 ***
building_class_at_time_of_saleB1  1.7355e+05  1.5154e+03  1.1453e+02 < 2.2e-16 ***
building_class_at_time_of_saleB2  1.5230e+05  1.4673e+03  1.0380e+02 < 2.2e-16 ***
building_class_at_time_of_saleB3  1.3706e+05  1.5737e+03  8.7097e+01 < 2.2e-16 ***
building_class_at_time_of_saleB9  1.2594e+05  2.0994e+03  5.9986e+01 < 2.2e-16 ***
building_class_at_time_of_saleC0  2.1164e+05  1.5671e+03  1.3505e+02 < 2.2e-16 ***
building_class_at_time_of_saleC1  2.3698e+05  4.3097e+03  5.4987e+01 < 2.2e-16 ***
building_class_at_time_of_saleC2  1.9560e+05  2.9009e+03  6.7429e+01 < 2.2e-16 ***
building_class_at_time_of_saleC3  2.1607e+05  2.6559e+03  8.1355e+01 < 2.2e-16 ***
building_class_at_time_of_saleC4  1.3695e+05  9.0729e+03  1.5095e+01 < 2.2e-16 ***
building_class_at_time_of_saleC5  1.9362e+05  6.7181e+03  2.8821e+01 < 2.2e-16 ***
building_class_at_time_of_saleC6 -1.3938e+05  3.8236e+02 -3.6451e+02 < 2.2e-16 ***
building_class_at_time_of_saleC7  1.6761e+05  7.7880e+03  2.1522e+01 < 2.2e-16 ***
building_class_at_time_of_saleC8 -3.9534e+05  3.8622e+00 -1.0236e+05 < 2.2e-16 ***
building_class_at_time_of_saleC9 -5.3525e+04  1.4023e+02 -3.8170e+02 < 2.2e-16 ***
building_class_at_time_of_saleCM -1.0373e+01          NA          NA        NA    
building_class_at_time_of_saleD0  8.2579e+04  2.5786e+01  3.2024e+03 < 2.2e-16 ***
building_class_at_time_of_saleD1 -3.0391e+04  4.1447e+02 -7.3326e+01 < 2.2e-16 ***
building_class_at_time_of_saleD2 -4.5215e+05  7.4887e+00 -6.0378e+04 < 2.2e-16 ***
building_class_at_time_of_saleD3 -1.8254e+05  9.2733e+01 -1.9684e+03 < 2.2e-16 ***
building_class_at_time_of_saleD4 -1.6205e+05  7.7115e+03 -2.1014e+01 < 2.2e-16 ***
building_class_at_time_of_saleD5  9.2455e+04  7.8430e+00  1.1788e+04 < 2.2e-16 ***
building_class_at_time_of_saleD6 -1.5591e+05  6.7311e+01 -2.3162e+03 < 2.2e-16 ***
building_class_at_time_of_saleD7 -1.0718e+05  9.0214e+01 -1.1881e+03 < 2.2e-16 ***
building_class_at_time_of_saleD8 -2.3003e+04  1.2368e+01 -1.8598e+03 < 2.2e-16 ***
building_class_at_time_of_saleD9 -1.9105e+05  7.6408e+01 -2.5004e+03 < 2.2e-16 ***
building_class_at_time_of_saleE1  1.5457e+05  4.4341e+02  3.4859e+02 < 2.2e-16 ***
building_class_at_time_of_saleE2  4.3728e+05  1.0070e+01  4.3426e+04 < 2.2e-16 ***
building_class_at_time_of_saleE3  1.4641e+05  3.0501e+02  4.8002e+02 < 2.2e-16 ***
building_class_at_time_of_saleE4  1.0907e+05  5.2813e+01  2.0653e+03 < 2.2e-16 ***
building_class_at_time_of_saleE7 -2.9949e+05  1.9951e+01 -1.5011e+04 < 2.2e-16 ***
building_class_at_time_of_saleE9  1.1786e+05  8.8237e+03  1.3357e+01 < 2.2e-16 ***
building_class_at_time_of_saleF1  1.5106e+05  1.6581e+02  9.1102e+02 < 2.2e-16 ***
building_class_at_time_of_saleF2  2.1667e+05  3.0781e+01  7.0391e+03 < 2.2e-16 ***
building_class_at_time_of_saleF4  1.6584e+05  6.3631e+02  2.6062e+02 < 2.2e-16 ***
building_class_at_time_of_saleF5  9.9344e+04  1.5017e+02  6.6155e+02 < 2.2e-16 ***
building_class_at_time_of_saleF8 -2.8220e+05  1.3758e+00 -2.0512e+05 < 2.2e-16 ***
building_class_at_time_of_saleF9  1.5692e+05  8.5776e+03  1.8295e+01 < 2.2e-16 ***
building_class_at_time_of_saleG0 -7.4920e+04  9.3074e+02 -8.0494e+01 < 2.2e-16 ***
building_class_at_time_of_saleG1  1.0360e+05  1.7836e+02  5.8084e+02 < 2.2e-16 ***
building_class_at_time_of_saleG2  1.3491e+05  7.8585e+03  1.7168e+01 < 2.2e-16 ***
building_class_at_time_of_saleG3  2.1477e+05  1.5108e+01  1.4215e+04 < 2.2e-16 ***
building_class_at_time_of_saleG4  2.3700e+05  1.2326e+02  1.9228e+03 < 2.2e-16 ***
building_class_at_time_of_saleG5  1.4077e+05  5.4218e+01  2.5963e+03 < 2.2e-16 ***
building_class_at_time_of_saleG6  1.1741e+05  4.5981e+01  2.5535e+03 < 2.2e-16 ***
building_class_at_time_of_saleG7 -1.6497e+04  4.4057e+03 -3.7444e+00 0.0001808 ***
building_class_at_time_of_saleG8  3.1372e+03  2.6921e+01  1.1653e+02 < 2.2e-16 ***
building_class_at_time_of_saleG9  1.2263e+05  7.6113e+03  1.6111e+01 < 2.2e-16 ***
building_class_at_time_of_saleGU  2.8203e+05  2.6368e+00  1.0696e+05 < 2.2e-16 ***
building_class_at_time_of_saleGW  2.0901e+05  9.7865e+00  2.1357e+04 < 2.2e-16 ***
building_class_at_time_of_saleH1 -1.7314e+00  2.4147e-12 -7.1701e+11 < 2.2e-16 ***
building_class_at_time_of_saleH2 -5.1859e+05  4.9986e+03 -1.0375e+02 < 2.2e-16 ***
building_class_at_time_of_saleH3 -5.6200e+05  8.2590e+03 -6.8046e+01 < 2.2e-16 ***
building_class_at_time_of_saleH4  1.6690e+05  1.9417e+01  8.5958e+03 < 2.2e-16 ***
building_class_at_time_of_saleH6  1.1853e+04  5.5558e+00  2.1334e+03 < 2.2e-16 ***
building_class_at_time_of_saleH7 -2.1248e+05  1.0631e+01 -1.9987e+04 < 2.2e-16 ***
building_class_at_time_of_saleH8  7.7614e-01          NA          NA        NA    
building_class_at_time_of_saleH9  2.4038e+04  2.7452e+01  8.7564e+02 < 2.2e-16 ***
building_class_at_time_of_saleHR  2.6907e+05  2.6293e+01  1.0234e+04 < 2.2e-16 ***
building_class_at_time_of_saleI1 -4.2960e+05  1.5147e+01 -2.8361e+04 < 2.2e-16 ***
building_class_at_time_of_saleI3  2.0919e+05  2.7345e+00  7.6501e+04 < 2.2e-16 ***
building_class_at_time_of_saleI4 -5.5951e+04  1.7755e+01 -3.1513e+03 < 2.2e-16 ***
building_class_at_time_of_saleI5  1.7474e+05  5.9982e+01  2.9132e+03 < 2.2e-16 ***
building_class_at_time_of_saleI6  1.1961e+05  2.4592e+01  4.8636e+03 < 2.2e-16 ***
building_class_at_time_of_saleI7  6.3836e+04  1.0070e+02  6.3393e+02 < 2.2e-16 ***
building_class_at_time_of_saleI9  8.6528e+04  4.7252e+01  1.8312e+03 < 2.2e-16 ***
building_class_at_time_of_saleJ1 -2.8717e+05  9.3349e+00 -3.0763e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ2 -4.9867e+05  5.4591e+00 -9.1348e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ4 -9.4846e+04  2.5902e+01 -3.6618e+03 < 2.2e-16 ***
building_class_at_time_of_saleJ6 -5.4367e+05  2.1236e+01 -2.5602e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ7 -4.6257e+05  3.0157e+00 -1.5339e+05 < 2.2e-16 ***
building_class_at_time_of_saleJ9 -8.2693e-01          NA          NA        NA    
building_class_at_time_of_saleK1  1.3066e+05  5.0570e+03  2.5838e+01 < 2.2e-16 ***
building_class_at_time_of_saleK2  1.8075e+05  8.3157e+03  2.1736e+01 < 2.2e-16 ***
building_class_at_time_of_saleK3 -4.7864e+05  1.6826e+00 -2.8447e+05 < 2.2e-16 ***
building_class_at_time_of_saleK4  2.0110e+05  7.0629e+03  2.8473e+01 < 2.2e-16 ***
building_class_at_time_of_saleK5  1.5896e+05  1.5770e+02  1.0080e+03 < 2.2e-16 ***
building_class_at_time_of_saleK6  5.7383e+04  1.8493e+01  3.1029e+03 < 2.2e-16 ***
building_class_at_time_of_saleK7  1.8497e+05  4.8547e+01  3.8101e+03 < 2.2e-16 ***
building_class_at_time_of_saleK8 -3.5744e+05  2.2884e+01 -1.5620e+04 < 2.2e-16 ***
building_class_at_time_of_saleK9  1.6681e+05  8.5161e+03  1.9588e+01 < 2.2e-16 ***
building_class_at_time_of_saleL1 -3.2798e+05  8.1924e+01 -4.0035e+03 < 2.2e-16 ***
building_class_at_time_of_saleL2 -4.6219e+05  2.0444e+01 -2.2608e+04 < 2.2e-16 ***
building_class_at_time_of_saleL3 -1.0495e+00          NA          NA        NA    
building_class_at_time_of_saleL8 -7.9144e+04  1.0960e+02 -7.2209e+02 < 2.2e-16 ***
building_class_at_time_of_saleL9  4.3745e+04  3.5515e+01  1.2318e+03 < 2.2e-16 ***
building_class_at_time_of_saleM1  1.5553e+05  6.4322e+02  2.4180e+02 < 2.2e-16 ***
building_class_at_time_of_saleM2  4.7448e+05  1.3285e+00  3.5715e+05 < 2.2e-16 ***
building_class_at_time_of_saleM3  1.8350e+05  2.2997e+01  7.9795e+03 < 2.2e-16 ***
building_class_at_time_of_saleM4  3.5926e+05  1.0860e+01  3.3079e+04 < 2.2e-16 ***
building_class_at_time_of_saleM9  1.7297e+05  1.9606e+02  8.8223e+02 < 2.2e-16 ***
building_class_at_time_of_saleN1 -2.5565e+05  1.2544e+00 -2.0381e+05 < 2.2e-16 ***
building_class_at_time_of_saleN2 -1.0812e+05  3.7291e+01 -2.8993e+03 < 2.2e-16 ***
building_class_at_time_of_saleN3  3.8823e+05  1.4154e+00  2.7429e+05 < 2.2e-16 ***
building_class_at_time_of_saleN4  2.5708e+05  2.3806e+00  1.0799e+05 < 2.2e-16 ***
building_class_at_time_of_saleN9  7.9835e+04  4.0318e+01  1.9802e+03 < 2.2e-16 ***
building_class_at_time_of_saleO1  9.8493e+04  5.4205e+01  1.8170e+03 < 2.2e-16 ***
building_class_at_time_of_saleO2  5.6520e+04  8.4064e+01  6.7235e+02 < 2.2e-16 ***
building_class_at_time_of_saleO3 -2.5890e+05  7.4544e+01 -3.4732e+03 < 2.2e-16 ***
building_class_at_time_of_saleO4 -2.5858e+05  4.7554e+01 -5.4377e+03 < 2.2e-16 ***
building_class_at_time_of_saleO5  2.0678e+05  6.5123e+01  3.1753e+03 < 2.2e-16 ***
building_class_at_time_of_saleO6 -1.8394e+05  1.6759e+02 -1.0975e+03 < 2.2e-16 ***
building_class_at_time_of_saleO7  1.5411e+05  9.9111e+02  1.5549e+02 < 2.2e-16 ***
building_class_at_time_of_saleO8  1.8789e+05  2.4518e+02  7.6634e+02 < 2.2e-16 ***
building_class_at_time_of_saleO9  1.2536e+05  9.2309e+03  1.3580e+01 < 2.2e-16 ***
building_class_at_time_of_saleP1 -5.0139e+05  4.3587e+00 -1.1503e+05 < 2.2e-16 ***
building_class_at_time_of_saleP2  1.6676e+05  4.9800e+01  3.3487e+03 < 2.2e-16 ***
building_class_at_time_of_saleP3  4.0284e+05  9.5488e+00  4.2188e+04 < 2.2e-16 ***
building_class_at_time_of_saleP5  1.1493e+05  3.6320e+01  3.1643e+03 < 2.2e-16 ***
building_class_at_time_of_saleP6  4.5337e+05  1.9562e+01  2.3177e+04 < 2.2e-16 ***
building_class_at_time_of_saleP7  2.5193e+05  3.6547e+00  6.8933e+04 < 2.2e-16 ***
building_class_at_time_of_saleP8 -2.4175e+04  1.0560e+01 -2.2893e+03 < 2.2e-16 ***
building_class_at_time_of_saleP9  9.8652e+04  4.2028e+01  2.3473e+03 < 2.2e-16 ***
building_class_at_time_of_saleQ1 -6.3134e+05  2.0798e+01 -3.0355e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ2 -1.9214e+04  7.9179e+00 -2.4267e+03 < 2.2e-16 ***
building_class_at_time_of_saleQ3 -2.1168e+05  3.9467e+00 -5.3634e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ8  3.3311e+05  5.2588e+00  6.3344e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ9  5.6611e+04  2.5971e+01  2.1798e+03 < 2.2e-16 ***
building_class_at_time_of_saleR0 -6.2189e+04  4.0096e+01 -1.5510e+03 < 2.2e-16 ***
building_class_at_time_of_saleR1 -2.1086e+05  8.5324e+00 -2.4713e+04 < 2.2e-16 ***
building_class_at_time_of_saleR4  1.0408e+05  2.3851e+00  4.3635e+04 < 2.2e-16 ***
building_class_at_time_of_saleR9 -1.3032e+05  7.4740e+01 -1.7436e+03 < 2.2e-16 ***
building_class_at_time_of_saleRR -2.9996e+05  4.0817e+01 -7.3490e+03 < 2.2e-16 ***
building_class_at_time_of_saleS0  1.3380e+05  1.0122e+02  1.3220e+03 < 2.2e-16 ***
building_class_at_time_of_saleS1  1.2850e+05  3.9465e+03  3.2560e+01 < 2.2e-16 ***
building_class_at_time_of_saleS2  1.7599e+05  2.8663e+03  6.1399e+01 < 2.2e-16 ***
building_class_at_time_of_saleS3  1.9130e+05  6.4023e+03  2.9880e+01 < 2.2e-16 ***
building_class_at_time_of_saleS4  2.2695e+05  6.7895e+03  3.3426e+01 < 2.2e-16 ***
building_class_at_time_of_saleS5  2.2889e+05  7.6276e+03  3.0009e+01 < 2.2e-16 ***
building_class_at_time_of_saleS9  2.2762e+05  6.1226e+03  3.7177e+01 < 2.2e-16 ***
building_class_at_time_of_saleT2  2.3426e+04  2.9571e+00  7.9217e+03 < 2.2e-16 ***
building_class_at_time_of_saleT9 -6.9349e+04  8.5254e+00 -8.1344e+03 < 2.2e-16 ***
building_class_at_time_of_saleU7 -1.1867e+05  3.5107e+00 -3.3803e+04 < 2.2e-16 ***
building_class_at_time_of_saleV0 -2.2715e+04  3.7162e+03 -6.1123e+00 9.828e-10 ***
building_class_at_time_of_saleV1 -6.4640e+04  7.6833e+03 -8.4130e+00 < 2.2e-16 ***
building_class_at_time_of_saleV2 -5.6667e+04  7.3045e+01 -7.7578e+02 < 2.2e-16 ***
building_class_at_time_of_saleV3 -6.8084e+04  9.0265e+01 -7.5427e+02 < 2.2e-16 ***
building_class_at_time_of_saleV5  3.5594e+05  4.0220e+00  8.8499e+04 < 2.2e-16 ***
building_class_at_time_of_saleV9 -1.1566e+05  6.3938e+01 -1.8090e+03 < 2.2e-16 ***
building_class_at_time_of_saleW1 -1.3223e+05  2.2746e+01 -5.8133e+03 < 2.2e-16 ***
building_class_at_time_of_saleW2  6.7435e+04  2.2155e+01  3.0438e+03 < 2.2e-16 ***
building_class_at_time_of_saleW3 -4.6763e+04  4.0345e+01 -1.1591e+03 < 2.2e-16 ***
building_class_at_time_of_saleW4 -7.7198e+04  2.8433e+00 -2.7151e+04 < 2.2e-16 ***
building_class_at_time_of_saleW6 -5.9114e+05  5.1179e+00 -1.1550e+05 < 2.2e-16 ***
building_class_at_time_of_saleW7 -3.1586e+05  8.8954e+00 -3.5508e+04 < 2.2e-16 ***
building_class_at_time_of_saleW8  2.2198e+05  2.7235e+01  8.1506e+03 < 2.2e-16 ***
building_class_at_time_of_saleW9  1.4626e+05  4.4200e+01  3.3089e+03 < 2.2e-16 ***
building_class_at_time_of_saleY1 -2.5078e+05  1.5060e+00 -1.6652e+05 < 2.2e-16 ***
building_class_at_time_of_saleY2 -3.4598e+05  7.8875e+00 -4.3864e+04 < 2.2e-16 ***
building_class_at_time_of_saleY4 -3.1975e+05  1.7551e+00 -1.8218e+05 < 2.2e-16 ***
building_class_at_time_of_saleY5 -4.7741e+05  3.0906e+00 -1.5447e+05 < 2.2e-16 ***
building_class_at_time_of_saleY6 -1.1746e+05  1.9593e+01 -5.9949e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ0 -1.3031e+05  4.8316e+01 -2.6969e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ1 -3.3428e+05  3.9923e+00 -8.3731e+04 < 2.2e-16 ***
building_class_at_time_of_saleZ2 -2.1807e+05  2.8958e+01 -7.5306e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ3 -1.2179e+05  3.0810e+00 -3.9531e+04 < 2.2e-16 ***
building_class_at_time_of_saleZ8 -2.0401e+05  1.4632e+00 -1.3942e+05 < 2.2e-16 ***
building_class_at_time_of_saleZ9  4.2683e+04  7.9866e+03  5.3443e+00 9.081e-08 ***
sale_year2003                    -1.0892e+05  1.1024e+03 -9.8802e+01 < 2.2e-16 ***
sale_year2004                    -5.1748e+04  1.0476e+03 -4.9395e+01 < 2.2e-16 ***
sale_year2005                     1.2800e+04  1.0659e+03  1.2008e+01 < 2.2e-16 ***
sale_year2006                     5.6075e+04  1.1214e+03  5.0002e+01 < 2.2e-16 ***
sale_year2007                     5.4919e+04  1.2462e+03  4.4070e+01 < 2.2e-16 ***
sale_year2008                     8.8428e+03  1.3830e+03  6.3940e+00 1.618e-10 ***
sale_year2009                    -3.6373e+04  1.4479e+03 -2.5120e+01 < 2.2e-16 ***
sale_year2010                    -3.7572e+04  1.4401e+03 -2.6089e+01 < 2.2e-16 ***
sale_year2011                    -3.6054e+04  1.5219e+03 -2.3690e+01 < 2.2e-16 ***
sale_year2012                    -2.8094e+04  1.4739e+03 -1.9061e+01 < 2.2e-16 ***
sale_year2013                    -8.9054e+03  1.3906e+03 -6.4040e+00 1.515e-10 ***
sale_year2014                     1.5791e+04  1.3950e+03  1.1320e+01 < 2.2e-16 ***
sale_year2015                     3.2799e+04  1.3994e+03  2.3437e+01 < 2.2e-16 ***
sale_year2016                     6.7075e+04  2.3071e+03  2.9073e+01 < 2.2e-16 ***
sale_year2017                     7.8186e+04  1.7508e+03  4.4658e+01 < 2.2e-16 ***
sale_year2019                     1.3103e+05  1.6770e+03  7.8134e+01 < 2.2e-16 ***
sale_year2020                     1.5247e+05  3.2118e+03  4.7472e+01 < 2.2e-16 ***
year_built0                      -1.3214e+04  4.1824e+03 -3.1593e+00 0.0015815 ** 
year_built1380                   -5.1417e+05  1.2867e+00 -3.9962e+05 < 2.2e-16 ***
year_built1800                   -2.3758e+04  7.1415e+01 -3.3268e+02 < 2.2e-16 ***
year_built1814                   -1.2841e+05  2.4966e+00 -5.1435e+04 < 2.2e-16 ***
year_built1825                   -5.7477e+05  2.6333e+00 -2.1828e+05 < 2.2e-16 ***
year_built1826                    9.3972e+04  4.4656e+00  2.1044e+04 < 2.2e-16 ***
year_built1827                    7.1977e+05  4.6640e+00  1.5432e+05 < 2.2e-16 ***
year_built1838                   -4.0203e+05  3.1975e+00 -1.2573e+05 < 2.2e-16 ***
year_built1839                   -5.6975e+05  2.6778e+00 -2.1277e+05 < 2.2e-16 ***
year_built1840                    2.8993e+04  1.3322e+00  2.1764e+04 < 2.2e-16 ***
 [ reached getOption("max.print") -- omitted 155 rows ]
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.963e+05 on 349122 degrees of freedom
Multiple R-squared: 0.2733, Adjusted R-squared: 0.2726
> #R-squared value is 25.63%
> 
> # Perform predictions on test set
> reg_pred1 <- predict(fit1, as.matrix(x_test))
> 
> # See the comparison
> results <- data.frame(cbind(y_test, reg_pred1))
> results$diff <- results$reg_pred1 - results$y_test
> results$diff_squared <- '^' (results$diff, 2)
> # RMS = 
> rms = sqrt(mean(results$diff_squared))
> rms
[1] 198239.1
> 
> # RMS Error = 
> rms_error = rms/mean(results$y_test)
> rms_error
[1] 0.4358773

5 Machine Learning Modeling

5.1 Preparing the dataframe for Machine Learning models

Add the required libraries. Also, load the file Data_sale_census_crime.rda (df_sale_census_crime) that was created in “04D: Data_Processing_Single_Dataframe”.


Remove rows that have zero value in any column. Also, make sure that the sale_price column is numeric.


Instead of looking at the entire data we will look at sale_price < $ 1M and 5M (90% and 97% of the data respectively). This value can be changed through this line of code.


Instead of using the year_built of a house as an input variable, create another column, “age” of the building. This is calculated as below. Also, remember to remove all zero rows again.


Instead of using the entire database, use 1% of the data for trial run. This can be incrementally increased to 20% and then 100% of the data.


Make sure that gross_square_feet is numeric and all NA values are removed from the dataframe.

5.2 Splitting the data into training and test data

Select data of sales from 2003 to 2013 for training and data from beyond 2013 for testing. Variable y is defined to include just the sale_price; variable y_test is the testing component of y. Variable is a matrix of all the input variables for the models; variable x_test is the testing component of x. Make sure that all the variables are formatted in the right format and all NA rows are removed before proceeding.

5.3 Running the ML models

Set the control method of the model as cross validation (cv) or none. It is advisable to do 5-fold or 10-fold cross validation. Cross validation takes a much longer time to run then just running without any cross calidation. Root Mean Square (RMSE) is being used here as the parameter to optimize the models. Mean Average Error (MAE) can also be used instead.


Run the neural network, deep neural net, random forest, bagged cart and k-nearest neighbours models as shown below.

5.4 Results of validation and training

Let’s see how our models performed in 5-fold cross validation with 1% data (sale_price < 1M), 1% data (sale_price < 5M) and 100% data (sale_price < 1M, only for neural network and bagged cart).

Validation results of 1% data with sale_price <5M
metric neural_net1 deep_neural_net1 random_forest1 bagged_cart1 k_nearest_neighbors1
RMSE 525659.600 410711.800 405804.900 437638.00 444834.500
MAE 280704.900 214190.900 213700.300 241710.50 227947.200
R-squared 0.067 0.422 0.442 0.35 0.334
Validation results of 1% data with sale_price <1M
metric neural_net2 deep_neural_net2 random_forest2 bagged_cart2 k_nearest_neighbors2
RMSE 211077.400 193911.100 193911.100 200277.800 194873.80
MAE 164014.600 142400.700 142400.700 151958.000 144403.90
R-squared 0.107 0.256 0.256 0.196 0.25
Validation results of 100% data with sale_price <1M
metric neural_net3 bagged_cart3
RMSE 208134.700 200819.900
MAE 161860.400 154242.500
R-squared 0.097 0.159

5.5 Results of predictions on testing data

Run the models on the x_test dataset to get the predicted sales price value.


Find the difference between the predicted values and the y_test values and calculate the RMSE and MAE values as shown below.


Calculate the MAE %error and RMSE %error by dividing the MAE and RMSE values by mean sales price in y_test. This gives a figure which can be easily compared across models. Let’s see what the results are.

Testing results of 1% data with sale_price <5M
metric4 neural_net4 deep_neural_net4 random_forest4 bagged_cart4 k_nearest_neighbors4
RMSE/mean 0.92 0.91 0.90 0.83 0.84
MAE/mean 0.53 0.61 0.61 0.53 0.48
Testing results of 1% data with sale_price <1M
metric4 neural_net4 deep_neural_net4 random_forest4 bagged_cart4 k_nearest_neighbors4
RMSE/mean 0.92 0.91 0.90 0.83 0.84
MAE/mean 0.53 0.61 0.61 0.53 0.48
Testing results of 100% (for neural_net & bagged_cart) data and 20% (for deep neural net & random forest and no cross validation) with sale_price <1M
metric4 neural_net6 deep_neural_net6 random_forest6 bagged_cart6
RMSE/mean 0.49 0.46 0.46 0.49
MAE/mean 0.40 0.37 0.37 0.39


As we can see, the model improves a lot when we only look at sale_prices < $ 1M. The final %MAE errors were in the range of 35-40% with the random forest and deep neural net performing the best.

6 Shiny App Visualization

This section describes the implmentation of our shiny app visalization, which can be viewed live on shinyapps.io.

6.1 Data Preparation

Before booting up the shiny app, we’ll perform a few more processing steps on the data and set up our polygon shapefile for drawing the zip code regions on a leaflet map.

We start by loading in the semi-processed crime data and then filtering out data before 2007, as upon closer inspection, the complaints from before that point seems to have been reported differently and would skew graphs we plan to show in the visualization.

Next, we normalize the crime score by mulitplying by 10000 (to make the minimum score greater than 1), taking it’s log, and then dividing by the maximum value to get an approximately normal distribution from 0 to 1.

To display the 2015 census data for each zip code, we grab all the unique zip codes from the score_by_time_and_rating data frame, and their assciated demographic and economic characteristics.

unique_census

zip_code PerCapitaIncome Unemployed TotalPop Hispanic White Black Native Asian weight
10001 100234.00 6.768366 54931 11.217202 59.960081 7.1104367 0.3122954 18.801769 0.0302743
10002 28315.73 9.336805 89904 27.622018 20.842063 7.8137391 0.1529887 40.790532 0.0141484
10003 83351.19 5.529082 78507 6.906509 72.975596 3.8197919 0.2961048 12.900415 0.0179220
10004 97170.22 3.899262 21681 9.655233 65.617859 1.9047138 0.0000000 20.412117 0.0024445
10005 86428.86 3.880739 17798 6.414541 68.356489 2.9669963 0.2072705 20.326306 0.0043825
10006 93787.00 3.300000 4374 10.600000 66.000000 1.9000000 0.0000000 19.500000 0.0182899
10007 119242.13 3.674118 45430 8.511321 60.119892 4.3140678 0.0501695 22.972307 0.0057451
10009 51209.57 6.516734 85886 20.749042 54.242942 6.9598060 0.1141618 14.408297 0.0096873
10010 92893.21 5.785252 69574 11.339020 66.640557 4.0195015 0.3870339 14.968094 0.0093857
10011 106752.14 4.910633 82322 10.392998 70.414786 5.0095126 0.3011091 11.063606 0.0158281
10012 100495.24 4.410209 49380 5.790992 68.621183 2.2355508 0.1364682 19.006819 0.0114621
10013 113579.10 4.277498 43974 6.965232 58.260233 3.7364693 0.1236640 27.022588 0.0166007
10014 104205.84 5.425317 46324 8.411761 78.552526 2.1064567 0.2773681 7.920266 0.0145281
10016 85612.76 5.328540 71412 10.220809 65.417396 4.2940822 0.0167325 17.410382 0.0085420
10017 110212.58 3.163597 40598 6.453774 72.890379 1.7914602 0.0782255 16.572893 0.0081777
10018 91959.49 6.145309 26529 13.087546 54.653455 5.5875118 0.0715368 23.273821 0.0339628
10019 90112.07 5.688635 61795 17.626635 58.933995 4.8051477 0.0796909 15.995618 0.0165386
10020 107136.40 4.164439 6389 11.773767 60.485976 3.0516669 0.3054938 22.833793 0.0075129
10021 115294.98 3.470039 72771 6.231300 81.083166 1.7386486 0.0469473 8.290485 0.0037790
10022 130688.86 4.342849 47254 5.793163 80.830304 2.1141660 0.1132920 9.954484 0.0140517
10023 107509.81 4.706067 73211 8.576637 75.611438 2.5184795 0.0421235 10.600240 0.0078540
10024 97532.70 4.252122 81434 13.985623 71.642281 4.5466660 0.0342228 7.268941 0.0075767
10025 58202.50 8.159557 106152 21.748584 55.391525 10.1917854 0.1149446 9.581334 0.0113422
10026 33252.99 8.963258 56175 24.506978 21.874485 45.3300970 0.0698709 5.677186 0.0151313
10027 28991.62 7.676539 88710 23.809475 23.575135 41.1918916 0.1541066 8.754981 0.0200654
10028 112773.22 3.862112 83249 6.478951 83.344795 1.2763817 0.1274238 6.597452 0.0047808
10029 27052.87 11.629428 91158 48.233006 16.363058 24.2480781 0.0000000 8.805455 0.0194497
10030 26133.35 13.623160 46266 23.605652 10.925371 59.4541218 0.3181429 2.453601 0.0180694
10031 22642.82 10.759977 77692 52.828847 13.299143 27.7892524 0.0684202 3.125435 0.0181228
10032 19268.96 13.613160 97708 70.033369 10.336216 14.4302462 0.0617043 3.075249 0.0125271
10033 23814.46 12.810512 79100 68.683254 21.558509 4.5760455 0.0746119 3.380865 0.0137674
10034 23260.78 15.452328 46763 74.138492 17.012326 4.3897590 0.0000000 2.229989 0.0105425
10035 22672.20 10.971860 62345 42.117809 11.437370 38.2822263 0.0805245 5.174718 0.0193761
10036 72432.87 6.843628 31652 19.769809 54.093994 6.1912991 0.0616643 17.260862 0.0325098
10037 25108.63 13.591314 54422 23.733512 8.588791 61.6120172 0.2243192 2.784892 0.0066150
10038 52062.64 6.983886 44669 10.349260 35.814330 5.2738163 1.0839464 44.768287 0.0049475
10039 23355.00 18.036523 46042 33.090024 9.100765 53.3986860 0.4099778 1.598367 0.0136614
10040 24461.43 13.019628 75427 69.108541 21.784732 5.0114588 0.0686094 2.716590 0.0070930
10044 48284.78 5.483773 11783 14.601986 43.112908 13.9744972 0.0000000 24.635772 0.0059408
10065 120842.61 3.969969 34215 5.147400 82.168511 2.8653251 0.0998509 8.221274 0.0132398
10069 89272.56 5.022828 26752 10.560728 71.359640 3.9002841 0.0000000 10.937840 0.0002243
10075 116356.48 4.586323 72466 6.518072 84.444395 0.9766125 0.0493142 5.592533 0.0021803
10111 118143.00 6.200000 982 7.100000 72.800000 2.2000000 0.0000000 17.800000 0.0020367
10115 45187.00 5.100000 5069 7.800000 68.200000 7.6000000 0.0000000 14.600000 0.0007891
10128 89075.02 4.399807 75267 11.503636 71.495084 4.0066762 0.0000000 9.897987 0.0064969
10280 111236.72 3.224324 15762 10.034234 64.906306 1.3563063 0.0000000 18.871171 0.0015226
10281 115333.00 2.700000 8875 7.500000 69.800000 1.4000000 0.0000000 15.900000 0.0012394
10282 125477.38 2.737017 14091 8.832595 70.318231 2.7696118 0.0000000 12.827628 0.0020581
10301 33413.80 7.340241 67183 22.126743 49.492686 19.5170013 0.1022818 6.106892 0.0107170
10302 27652.69 5.647956 33898 31.306803 43.054755 16.3130981 0.0431707 6.997079 0.0083781


Now that we’ve extracted the census data, we can remove all those unecessary columns from the score_by_time_and_rating and reformat month column as a character to match the style of the grouped_housing data frame. We save this as a new data frame that we’ll load into our Shiny app.

crime_scores

month zip_code weight_normalized month_char
2007-01-01 10001 0.7741732 2007-01
2007-01-01 10002 0.7007013 2007-01
2007-01-01 10003 0.7235363 2007-01
2007-01-01 10004 0.5311227 2007-01
2007-01-01 10005 0.5875057 2007-01
2007-01-01 10006 0.7254990 2007-01
2007-01-01 10007 0.6136537 2007-01
2007-01-01 10009 0.6641158 2007-01
2007-01-01 10010 0.6610612 2007-01
2007-01-01 10011 0.7115365 2007-01
2007-01-01 10012 0.6803649 2007-01
2007-01-01 10013 0.7161397 2007-01
2007-01-01 10014 0.7032591 2007-01
2007-01-01 10016 0.6519636 2007-01
2007-01-01 10017 0.6477547 2007-01
2007-01-01 10018 0.7852772 2007-01
2007-01-01 10019 0.7157773 2007-01
2007-01-01 10020 0.6395650 2007-01
2007-01-01 10021 0.5731948 2007-01
2007-01-01 10022 0.7000389 2007-01
2007-01-01 10023 0.6438535 2007-01
2007-01-01 10024 0.6403814 2007-01
2007-01-01 10025 0.6793492 2007-01
2007-01-01 10026 0.7071881 2007-01
2007-01-01 10027 0.7344474 2007-01
2007-01-01 10028 0.5959080 2007-01
2007-01-01 10029 0.7314376 2007-01
2007-01-01 10030 0.7243277 2007-01
2007-01-01 10031 0.7246128 2007-01
2007-01-01 10032 0.6889462 2007-01
2007-01-01 10033 0.6980645 2007-01
2007-01-01 10034 0.6722874 2007-01
2007-01-01 10035 0.7310709 2007-01
2007-01-01 10036 0.7810539 2007-01
2007-01-01 10037 0.6272710 2007-01
2007-01-01 10038 0.5992176 2007-01
2007-01-01 10039 0.6973183 2007-01
2007-01-01 10040 0.6340093 2007-01
2007-01-01 10044 0.6168883 2007-01
2007-01-01 10065 0.6942905 2007-01
2007-01-01 10069 0.3004099 2007-01
2007-01-01 10075 0.5200755 2007-01
2007-01-01 10111 0.5134916 2007-01
2007-01-01 10128 0.6255310 2007-01
2007-01-01 10280 0.4853990 2007-01
2007-01-01 10281 0.4655224 2007-01
2007-01-01 10282 0.5145008 2007-01
2007-01-01 10301 0.6738727 2007-01
2007-01-01 10302 0.6500923 2007-01
2007-01-01 10303 0.6575468 2007-01


To display the neighborhood names for each zip code (if they have one), we’ll pull them in from a csv file created using information from the New York State’s Department of Health.

neighborhoods

neighborhood zips
Central Bronx 10453
Central Bronx 10457
Central Bronx 10460
Bronx Park and Fordham 10458
Bronx Park and Fordham 10467
Bronx Park and Fordham 10468
High Bridge and Morrisania 10451
High Bridge and Morrisania 10452
High Bridge and Morrisania 10456
Hunts Point and Mott Haven 10454
Hunts Point and Mott Haven 10455
Hunts Point and Mott Haven 10459
Hunts Point and Mott Haven 10474
Kingsbridge and Riverdale 10463
Kingsbridge and Riverdale 10471
Northeast Bronx 10466
Northeast Bronx 10469
Northeast Bronx 10470
Northeast Bronx 10475
Southeast Bronx 10461
Southeast Bronx 10462
Southeast Bronx 10464
Southeast Bronx 10465
Southeast Bronx 10472
Southeast Bronx 10473
Central Brooklyn 11212
Central Brooklyn 11213
Central Brooklyn 11216
Central Brooklyn 11233
Central Brooklyn 11238
Southwest Brooklyn 11209
Southwest Brooklyn 11214
Southwest Brooklyn 11228
Borough Park 11204
Borough Park 11218
Borough Park 11219
Borough Park 11230
Canarsie and Flatlands 11234
Canarsie and Flatlands 11236
Canarsie and Flatlands 11239
Southern Brooklyn 11223
Southern Brooklyn 11224
Southern Brooklyn 11229
Southern Brooklyn 11235
Northwest Brooklyn 11201
Northwest Brooklyn 11205
Northwest Brooklyn 11215
Northwest Brooklyn 11217
Northwest Brooklyn 11231
Flatbush 11203


Finally, we’ll load in the shapefile that contains all of the zip code polygins (with st_read) and merge it with the neighborhoods key and the 2015 census data. Before we saving it as an RDA file also we reduce the number of polygon vertices using `ms_simplify()’. This will allow for better performance of our Shiny app.

## Reading layer `nyc-zip-code-tabulation-areas-polygons' from data source `https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson' using driver `GeoJSON'
## Simple feature collection with 262 features and 11 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -74.25576 ymin: 40.49584 xmax: -73.6996 ymax: 40.91517
## CRS:            4326

6.2 Running the App

In this sub-section, we review the specific files that are run each time we start up an instance of out Shiny app visualization.

6.2.1 global.R

This file is always executed before server.R and ui.R and allows us run preliminary code and create global variables used by both the server and UI. We begin the file by loading in the tidyverse package and creating a temporary directory where we can store the plots we’ll be generating when a user clicks a particular polygon.

Next, we set the location from where we’ll be loading our rda files, as this is different depending on whether we’re running it locally or on shinyapps.io (in which case the RDA files are stored on dropbox). Before pushing the app to shinyapps.io, one simply has to uncomment the commented out lines so that the app can pull from dropbox rather than just outside its local directory.

Now that we’ve set their location, we can load in the RDA files. We’ll also create a unique_zips data frame to hold popup titles for each zipcode polygon in the shapefile.

unique_zips

postalcode select_names
00083 Central Park - 00083
10001 Chelsea and Clinton - 10001
10002 Lower East Side - 10002
10003 Lower East Side - 10003
10004 Lower Manhattan - 10004
10005 Lower Manhattan - 10005
10006 Lower Manhattan - 10006
10007 Lower Manhattan - 10007
10009 Lower East Side - 10009
10010 Gramercy Park and Murray Hill - 10010
10011 Chelsea and Clinton - 10011
10012 Greenwich Village and Soho - 10012
10013 Greenwich Village and Soho - 10013
10014 Greenwich Village and Soho - 10014
10016 Gramercy Park and Murray Hill - 10016
10017 Gramercy Park and Murray Hill - 10017
10018 Chelsea and Clinton - 10018
10019 Chelsea and Clinton - 10019
10020 Chelsea and Clinton - 10020
10021 Upper East Side - 10021
10022 Gramercy Park and Murray Hill - 10022
10023 Upper West Side - 10023
10024 Upper West Side - 10024
10025 Upper West Side - 10025
10026 Central Harlem - 10026
10027 Central Harlem - 10027
10028 Upper East Side - 10028
10029 East Harlem - 10029
10030 Central Harlem - 10030
10031 Inwood and Washington Heights - 10031
10032 Inwood and Washington Heights - 10032
10033 Inwood and Washington Heights - 10033
10034 Inwood and Washington Heights - 10034
10035 East Harlem - 10035
10036 Chelsea and Clinton - 10036
10037 Central Harlem - 10037
10038 Lower Manhattan - 10038
10039 Central Harlem - 10039
10040 Inwood and Washington Heights - 10040
10041 10041
10043 10043
10044 Upper East Side - 10044
10045 10045
10047 10047
10048 10048
10055 10055
10065 Upper East Side - 10065
10069 10069
10075 Upper East Side - 10075
10080 10080


Next we build a list to hold the radio button choices to select which data the user wants to look at.

Finally, we create some custom CSS and HTML that will be used to fix a bug in the display positioning of the NA color within the leaflet legend. We found this solution in a Github issue thread.

6.2.2 server.R

Server is the part of shiny application where processing and updating of data, map, predicted values happen. We’ll gradually build the code for Sever in steps. The inputIds in ui.R script will be linked in Sever script.

6.2.2.2 Loading the prediction model files

We’ll be using Neural net and Random forest models to compute prices of real estate based on 4 inputs from user.

6.2.2.3 Creating reactive merged dataframe and mapping

We start coding inside shinyServer function. In next steps, we build chunks of code that will go inside shinyServer function. They execute different processes in the app. Final code of server script is given in the end.

In this first step, we create a reactive obeject that changes the merged dateframe as per the user inputs: time selected and property type selected. The merged dataframe has spatial data. renderLeaflet function creates the map and shows NYC as we have put the coordinates.

6.2.2.6 Adding pop up

In this step, we add pop up on each ZIP code area showing the graph of the attribute chosen over time. Pop us also shows demographic information from the 2015 census.

showPopup <- function(df, zip, id, lat, lng, suff_str, curr = FALSE) {
        
        # Get plot object and other popup data for the chosen zip code
        zip_data = zip_sf[which(zip_sf$postalcode == zip),]
        if (nrow(df) > 0) {
            y = pull(df, 2)
            x = as.yearmon(pull(df, 1))
            title = paste(names(which(radioButtonOptions == colnames(df)[2])), suff_str)
            ylabel = names(which(radioButtonOptions == colnames(df)[2]))
            
            if (curr == TRUE) {
                if (max(y, na.rm=TRUE) > 1e6) {
                    label = label_number(prefix = "$", suffix = "M", scale = 1e-6)
                } else if (max(y, na.rm = TRUE) > 1e4) {
                    label =  label_number(prefix = "$", suffix = "K", scale = 1e-3)
                } else {
                    label = label_number(prefix = "$")
                }
            } else {
                label = label_number()
            }
            plot = ggplot(data=NULL, aes(x, y)) +
                geom_line() +
                geom_smooth() +
                labs(title = title, y = ylabel, x = "Month")  +
                scale_y_continuous(label = label, expand = c(0,0), limits = c(0, ifelse(colnames(df)[2] == "weight_normalized", 1, max(y, na.rm=TRUE)))) +
                theme_classic() +
                theme(plot.title = element_text(size = 11, hjust = 0.5), axis.title = element_text(size = 10))
    
            # Write svg file to temporary folder
            svg(filename= paste(folder,"plot.svg", sep = "/"),
                width = 500 * 0.01, height = 300 * 0.01)
            print(plot)
            dev.off()
    
            content <- paste(readLines(paste(folder,"plot.svg",sep="/")), collapse = "")
        } else {
            content = ""
        }
        
        if (is.na(zip_data$neighborhood)) {
            neighborhood_str = ""
        } else {
            neighborhood_str = paste(zip_data$neighborhood, " - ")
        }
        # Create popup
        leafletProxy("map") %>%
            addPopups(
                lng,
                lat,
                popup = paste0(
                    "<h4>", neighborhood_str, zip, "</h4><br/>",
                    content, "<br/>",
                    "<b>2015 Census Data</b><br/>",
                    "Per capita income:    $", round(zip_data$PerCapitaIncome),"</b><br/>",
                    "Total Population: ", zip_data$TotalPop, "</b><br/>",
                    "Unemployment rate: ", round(zip_data$Unemployed), "%"
                ),
                layerId = id,
                options = popupOptions(maxWidth = 500)
            )
    }

6.2.2.7 Price predictor

In this step, we integrate the predictive models in the shiny app. It takes 4 inputs from the user: Age of the builing, area, sale year and ZIP code. ZIP code pulls demographic data from the 2015 cenus that’s in the dataframe. It feeds these inputs in the model chosen by the user and gives a predicted price.

    price_text = eventReactive(input$predict_price, {
        if (input$zip != "")
            zip = unique_zips[which(unique_zips$select_names==input$zip), 1]
        else {
            zip = NA
        }
        input_data = data.frame(gross_square_feet = input$sqft, age = input$age, sale_year = input$sale_year, zip_code = zip, stringsAsFactors = FALSE)
        bad_cols = colnames(input_data)[colSums(is.na(input_data)) > 0]
        if (length(bad_cols) == 0){
            zip_data = zip_sf %>%
                filter(postalcode == zip) %>%
                slice(1) %>%
                data.frame() %>%
                select(postalcode, PerCapitaIncome, White, Black, Asian, Hispanic, Native, TotalPop.x = TotalPop, Unemployed, weight) %>%
                inner_join(input_data, by = c("postalcode" = "zip_code")) %>%
                select(-postalcode)
            if (is.na(zip_data)) {return(paste0("<font color=\"#eb4034\">Error: No valid census data for this zip code!</font>"))}
            if (input$model_select == "fit.rf") {
                predicted_val = predict(fit.rf, newdata=zip_data)
            } else if (input$model_select == "fit.nnet") {
                predicted_val = predict(fit.nnet, newdata=zip_data)
            } else if (input$model_select == "fit.bgcrt") {
                predicted_val = predict(fit.bgcrt, newdata=zip_data)
            }
            return(paste0("<br/>", dollar(predicted_val)))
        } else {
            msg = ""
            if ("gross_square_feet" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a square footage!")
            }
            if ("age" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter building age!")
            }
            if ("sale_year" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a sale year!")
            }
            if ("zip_code" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a location!")
            }
            return(paste0("<font color=\"#eb4034\">", msg, "</font>"))
        }
    })

output$price = renderText({
        price_text()
    })

6.2.2.9 Final code of server.R

Now we combine all chunks of code that we saw in steps so far inside the shinyServer function. This is the final script for server.R. This script along with ui.R and Global.R runs the shiny app.

library(shiny)
library(leaflet)
library(sf)
library(RColorBrewer)
library(zoo)
library(scales)
library(ggplot2)
library(caret)
library(randomForest)

load("neural_net.rda")
# load(paste0(rda_loc, "bagged_cart.rda"))
load("random_forest.rda")

shinyServer(function(input, output) {
    
    output$housing_data_select = reactive({
        return(housing_data_select())
    })
    
    housing_data_select = reactive({
        if (input$data_select %in% colnames(grouped_housing)[-c(1:3)]) {
            return(TRUE)
        } else {
            return(FALSE)
        }
    })
   
    
    output$price = renderText({
        price_text()
    })
    
    price_text = eventReactive(input$predict_price, {
        if (input$zip != "")
            zip = unique_zips[which(unique_zips$select_names==input$zip), 1]
        else {
            zip = NA
        }
        input_data = data.frame(gross_square_feet = input$sqft, age = input$age, sale_year = input$sale_year, zip_code = zip, stringsAsFactors = FALSE)
        bad_cols = colnames(input_data)[colSums(is.na(input_data)) > 0]
        if (length(bad_cols) == 0){
            zip_data = zip_sf %>%
                filter(postalcode == zip) %>%
                slice(1) %>%
                data.frame() %>%
                select(postalcode, PerCapitaIncome, White, Black, Asian, Hispanic, Native, TotalPop.x = TotalPop, Unemployed, weight) %>%
                inner_join(input_data, by = c("postalcode" = "zip_code")) %>%
                select(-postalcode)
            if (is.na(zip_data)) {return(paste0("<font color=\"#eb4034\">Error: No valid census data for this zip code!</font>"))}
            if (input$model_select == "fit.rf") {
                predicted_val = predict(fit.rf, newdata=zip_data)
            } else if (input$model_select == "fit.nnet") {
                predicted_val = predict(fit.nnet, newdata=zip_data)
            } else if (input$model_select == "fit.bgcrt") {
                predicted_val = predict(fit.bgcrt, newdata=zip_data)
            }
            return(paste0("<br/>", dollar(predicted_val)))
        } else {
            msg = ""
            if ("gross_square_feet" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a square footage!")
            }
            if ("age" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter building age!")
            }
            if ("sale_year" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a sale year!")
            }
            if ("zip_code" %in% bad_cols) {
                msg = paste0(msg, "<br/>Error: Please enter a location!")
            }
            return(paste0("<font color=\"#eb4034\">", msg, "</font>"))
        }
    })

    df <- reactive({
        if (housing_data_select()) {
            data = grouped_housing %>% 
                filter(category == input$prop_category, month_char == input$date_select) %>%
                select(zip_code, disp_data = !!input$data_select)
        } else {
            data = crime_scores %>%
                filter(month_char == input$date_select) %>%
                select(zip_code, disp_data = !!input$data_select)
        }
        tmp = merge(zip_sf, data, by.x="postalcode", by.y="zip_code", all.x=TRUE, no.dups=FALSE)
        return(tmp)
    })
    
    output$map <- renderLeaflet({
        leaflet() %>%
            addTiles() %>%
            setView(-74.0060,40.7128, zoom=11)
    })
    
    observe({
        
        
        tmp = df()
        legend_labels = NULL
        
        pal = colorNumeric(
            palette = rev(brewer.pal(n=9, name = "RdYlGn")),
            domain = NULL
        )
        
        if (all(is.na(tmp$disp_data))) {
            func = labelFormat()
            tmp$disp_data = as.factor(rep("NA", nrow(tmp)))
            pal = colorFactor(palette = "#808080", domain = NULL)
        } else if (input$data_select == "avg_price_per_sqft") {
            func = labelFormat(prefix = " $")
        } else if (input$data_select == "total_proceeds") {
            func = labelFormat(prefix = " $", suffix = "M", transform=function(x) x/1E6)
        } else {
            func = labelFormat(prefix = " ")
        }
        
        leafletProxy("map", data = tmp) %>%
            clearControls() %>%
            clearShapes() %>%
            clearPopups() %>%
            addPolygons(
                fillColor = ~pal(disp_data),
                color = "#b2aeae", # you need to use hex colors
                fillOpacity = 0.7, 
                weight = 1, 
                smoothFactor = 0.2,
                # Highlight neighbourhoods upon mouseover
                highlight = highlightOptions(
                    weight = 3,
                    color = "black",
                    opacity = 1.0
                ),
                layerId = ~shape_id
            ) %>%
            addLegend(
                title=names(which(radioButtonOptions == input$data_select)),
                pal = pal, 
                values = ~disp_data, 
                position = "bottomright", 
                labFormat = func,
                labels = legend_labels
            )
    })
    
    
    # Observe shape clicks to display popup
    observeEvent(input$map_shape_click,{
        event = input$map_shape_click
        if (is.null(event))
            return()
        else {
            zip_clicked = zip_sf$postalcode[zip_sf$shape_id == event$id]
            curr = FALSE
            if (housing_data_select()) {
                data = grouped_housing %>%
                    filter(category == input$prop_category, zip_code == zip_clicked) %>%
                    select(month_char, !!input$data_select)
                title_suffix_str = paste("-", input$prop_category)
                if (input$prop_category == "All") {
                    title_suffix_str = paste(title_suffix_str, "Categories")
                }
                if (input$data_select %in% c("avg_price_per_sqft", "total_proceeds")) {
                    curr = TRUE
                }
            } else {
                data = crime_scores %>%
                    filter(zip_code == zip_clicked) %>%
                    select(month_char, weight_normalized)
                title_suffix_str = "over Time"
            }
            showPopup(data, zip_clicked, event$id, event$lat, event$lng, title_suffix_str, curr)
        }
    })
    
    showPopup <- function(df, zip, id, lat, lng, suff_str, curr = FALSE) {
        
        # Get plot object and other popup data for the chosen zip code
        zip_data = zip_sf[which(zip_sf$postalcode == zip),]
        if (nrow(df) > 0) {
            y = pull(df, 2)
            x = as.yearmon(pull(df, 1))
            title = paste(names(which(radioButtonOptions == colnames(df)[2])), suff_str)
            ylabel = names(which(radioButtonOptions == colnames(df)[2]))
            
            if (curr == TRUE) {
                if (max(y, na.rm=TRUE) > 1e6) {
                    label = label_number(prefix = "$", suffix = "M", scale = 1e-6)
                } else if (max(y, na.rm = TRUE) > 1e4) {
                    label =  label_number(prefix = "$", suffix = "K", scale = 1e-3)
                } else {
                    label = label_number(prefix = "$")
                }
            } else {
                label = label_number()
            }
            plot = ggplot(data=NULL, aes(x, y)) +
                geom_line() +
                geom_smooth() +
                labs(title = title, y = ylabel, x = "Month")  +
                scale_y_continuous(label = label, expand = c(0,0), limits = c(0, ifelse(colnames(df)[2] == "weight_normalized", 1, max(y, na.rm=TRUE)))) +
                theme_classic() +
                theme(plot.title = element_text(size = 11, hjust = 0.5), axis.title = element_text(size = 10))
    
            # Write svg file to temporary folder
            svg(filename= paste(folder,"plot.svg", sep = "/"),
                width = 500 * 0.01, height = 300 * 0.01)
            print(plot)
            dev.off()
    
            content <- paste(readLines(paste(folder,"plot.svg",sep="/")), collapse = "")
        } else {
            content = ""
        }
        
        if (is.na(zip_data$neighborhood)) {
            neighborhood_str = ""
        } else {
            neighborhood_str = paste(zip_data$neighborhood, " - ")
        }
        # Create popup
        leafletProxy("map") %>%
            addPopups(
                lng,
                lat,
                popup = paste0(
                    "<h4>", neighborhood_str, zip, "</h4><br/>",
                    content, "<br/>",
                    "<b>2015 Census Data</b><br/>",
                    "Per capita income:    $", round(zip_data$PerCapitaIncome),"</b><br/>",
                    "Total Population: ", zip_data$TotalPop, "</b><br/>",
                    "Unemployment rate: ", round(zip_data$Unemployed), "%"
                ),
                layerId = id,
                options = popupOptions(maxWidth = 500)
            )
    }
    
    outputOptions(output, "housing_data_select", suspendWhenHidden = FALSE)
})

6.2.3 ui.R

User interface is the user facing side of the Shiny application.It contains various display elements such as titlePanel, sidebarLayout, radioButtons, tabPanel etc. Following steps show how we step-wise buit up code for UI.

6.2.3.1 Importing necessary libraries

6.2.3.2 Creating basic layout of UI and display elements

Now that we have imported libraries, we start adding code inside fluidPage to create display elements. We’ll have user inputs in the sidebarPanel. We’ll add those in later steps. Final code of ui script is given in the end.

6.2.3.3 Creating display elements in sidebarLayout

Now we’ll start coding inside sidebarLayout to create display elements. We created 2 tabs, first tab for the attribute we want to see at a chosen time. sliderTextInput allows us to choose a specific time. radioButtons links to radioButtonOptions in Global.R file. It allows us to choose the attribute of the data we want to see. conditionalPanel shows real esate based on the “property type” we have chosen from the drop down list.

6.2.3.5 Final code of ui.R

Now we combine all chunks of code that we saw in steps so far inside the shinyUI function. This is the final script for ui.R. This script along with server.R and Global.R runs the shiny app.

library(shiny)
library(leaflet)
library(shinyWidgets)

# Define UI for application that draws a histogram
shinyUI(fluidPage(
    HTML(html_fix),
    titlePanel("NYC Real Estate and Crime"),
    sidebarLayout(
        sidebarPanel(
            tabsetPanel(
                tabPanel(
                    "Map Controls",
                    HTML("<br/>"),
                    sliderTextInput(
                        inputId = "date_select",
                        label="Month",
                        choices = sort(unique(c(unique(crime_scores$month_char),unique(grouped_housing$month_char)))),
                        animate=TRUE,
                        selected="2020-03"
                    ),
                    radioButtons(
                        inputId = "data_select",
                        label = "Mapped Data",
                        radioButtonOptions # set up in global.R
                    ),
                    conditionalPanel(
                        condition = "output.housing_data_select == true",
                        selectInput(
                            inputId = "prop_category",
                            label = "Property Category",
                            choices = sort(unique(grouped_housing$category))
                        )
                    )
                ),
                tabPanel(
                    "Price Predictor",
                    HTML("<br/>"),
                    numericInput("age", "Building age (years):", 50, min=1800, max = 2030),
                    numericInput("sqft", "Square footage:", 2000, min=0, max = 1000000),
                    numericInput("sale_year", "Sale Year:", 2010, min=1950, max = 2030),
                    selectizeInput("zip", "Location:",
                        selected = NULL,
                        choices = unique_zips$select_names,
                        options = list(
                            placeholder = "Search by Neighborhood or Zip Code",
                            onInitialize = I('function() { this.setValue(""); }')
                        )
                    ),
                    radioButtons(
                        "model_select",
                        label="Predictive Model:",
                        choiceNames = c(
                            # "Bagged cart",
                            "Neural net",
                            "Random forest"
                        ),
                        choiceValues = c(
                            # "fit.bgcrt", 
                            "fit.nnet",
                            "fit.rf"
                        )
                    ),
                    actionButton("predict_price", label = "Compute Price"),
                    HTML("<hr/><b>Predicted Price: </b>"),
                    htmlOutput("price")
                )
            )
        ),
        mainPanel(
            leafletOutput(
                "map",
                width="100%",
                height="90vh"
            )
        ),
        position="left"
    )
))

7 Conclusion and next steps

7.1 Challenges faced

  • Selecting relevant variables

    Many factors go into predicting the housing prices. So it’s difficult to select a set of variables from the data available.

  • Cleaning the data

    Cleaning the raw data is an important step. Often, the data collected is not in the required shape so a lot wrangling has to be done before it’s ready for further processing.

  • Missing data

    There were many data points that were missing. We couldn’t do much about them. There are many ways to deal with missing data. However, they were not within the scope of the project. Grey areas seen on the map are due to these missing data points.

  • Computing power limitations We experienced computing power limitations while working with machine learning models. In future, these models can be run with better computing.

7.2 What we learned

  • Data cleaning takes time

    Cleaning and preparing the data into required form is an important step and can be time consuming. Often, it takes more time to clean and prepare the date than processing it.

  • Higher computing or APIs needed to run the models

    To run the machine learning models, higher computing power would be useful. Othewise, using some API package like rcpp the code could be executed faster.

  • Lost insights due to missing data We may have missed out on many insights due to missing data. There was nothing much we could do about this. In future, missing data could be handled better.