This tutorial page describes the final project we conducted for Professor Geoffrey Parker’s Spring 2020 Data Analytics course (ENGM 182) at Dartmouth College’s Thayer School of Engineering. The goal of the project was to create a compelling and approachable a visualization of historical NYC real estate sales and crime data, and an accompanying model for predicting the price of a hypothetical real estate sale.
We chose to use the R programming language to complete this task. The secions below describe the code and methodologies for acquiring, cleaning, and processing the data, as well as the code that drives the Shiny app visualization.
The project’s source code is available on Github, and the Shiny app is hosted for viewing on shinyapps.io. Our data sources can be downloaded at the links below:
Data directory structure
Data
├── Census_2015
│ ├── census_block_loc.csv
│ └── nyc_census_tracts.csv
├── Crimes_2006-2017
│ └── NYPD_Complaint_Data_Historic.csv
├── Crimes_2014-2015
│ ├── Crime_Column_Description.csv
│ ├── NYPD_Complaint_Data_Historic.csv
│ └── Population_by_Borough_NYC.csv
├── RE_2003-2015
│ ├── 2009_bronx.xls
│ ├── 2009_brooklyn.xls
│ ├── 2009_manhattan.xls
│ ├── 2009_queens.xls
│ ├── 2009_statenisland.xls
│ ├── 2010_bronx.xls
│ ├── 2010_brooklyn.xls
│ ├── 2010_manhattan.xls
│ ├── 2010_queens.xls
│ ├── 2010_statenisland.xls
│ ├── 2011_bronx.xls
│ ├── 2011_brooklyn.xls
│ ├── 2011_manhattan.xls
│ ├── 2011_queens.xls
│ ├── 2011_statenisland.xls
│ ├── 2012_bronx.xls
│ ├── 2012_brooklyn.xls
│ ├── 2012_manhattan.xls
│ ├── 2012_queens.xls
│ ├── 2012_statenisland.xls
│ ├── 2013_bronx.xls
│ ├── 2013_brooklyn.xls
│ ├── 2013_manhattan.xls
│ ├── 2013_queens.xls
│ ├── 2013_statenisland.xls
│ ├── 2014_bronx.xls
│ ├── 2014_brooklyn.xls
│ ├── 2014_manhattan.xls
│ ├── 2014_queens.xls
│ ├── 2014_statenisland.xls
│ ├── 2015_bronx.xls
│ ├── 2015_brooklyn.xls
│ ├── 2015_manhattan.xls
│ ├── 2015_queens.xls
│ ├── 2015_statenisland.xls
│ ├── glossary_rsf071607.pdf
│ ├── sales_2007_bronx.xls
│ ├── sales_2007_brooklyn.xls
│ ├── sales_2007_manhattan.xls
│ ├── sales_2007_queens.xls
│ ├── sales_2007_statenisland.xls
│ ├── sales_2008_bronx.xls
│ ├── sales_2008_brooklyn.xls
│ ├── sales_2008_manhattan.xls
│ ├── sales_2008_queens.xls
│ ├── sales_2008_statenisland.xls
│ ├── sales_bronx_03.xls
│ ├── sales_bronx_04.xls
│ ├── sales_bronx_05.xls
│ ├── sales_bronx_06.xls
│ ├── sales_brooklyn_03.xls
│ ├── sales_brooklyn_04.xls
│ ├── sales_brooklyn_05.xls
│ ├── sales_brooklyn_06.xls
│ ├── sales_manhattan_03.xls
│ ├── sales_manhattan_04.xls
│ ├── sales_manhattan_05.xls
│ ├── sales_manhattan_06.xls
│ ├── sales_queens_03.xls
│ ├── sales_queens_04.xls
│ ├── sales_queens_05.xls
│ ├── sales_queens_06.xls
│ ├── sales_si_03.xls
│ ├── sales_si_04.xls
│ ├── sales_si_05.xls
│ └── sales_si_06.xls
├── RE_2016-2017
│ └── nyc-rolling-sales.csv
└── RE_Apr2019-Mar2020
├── rollingsales_bronx.xls
├── rollingsales_brooklyn.xls
├── rollingsales_manhattan.xls
├── rollingsales_queens.xls
└── rollingsales_statenisland.xls
The yearly real estate data we downloaded from the city of New York’s OpenData portal were formatted in Excel files (with the exception of that from 2016-2017, which were contained in a single CSV file, and 2018, which were missing entirely). Each Excel file contained all the real estate sales in a particular borough during a particular calendar year from 2003 to 2020.
To import the Excel file data into R, we start by creating a list of the all the filenames.
# Get list of all excel file names for 2003-2015 and 2019-2020 data
all_files = c(
list.files(path="../Data/RE_2003-2015", pattern="*.xls", full.names=TRUE, recursive=FALSE),
list.files(path="../Data/RE_Apr2019-Mar2020", pattern="*.xls", full.names=TRUE, recursive=FALSE)
)Next, we create a custom function to pass to lapply() that will import the data from each file into a dataframe, and store all those dataframes into a corresponding list called dfs_list. Although all the data across every file were consistently formatted with the same column headers, frustratingly, they did not always start on the same row (but always started on one of the first 10 rows). Because of this, we had to do two reads of each file—an initial read of the first 10 rows and columns just to find the index of the first column header (titled “BOROUGH” in our case), and a second read to actually import the data starting from that index. This stack overflow post was very helpful in setting up this ‘double-read’ functionality. We also use clean_names() from the janitor package to make the column names easier to work with later on.
library(tidyverse)
library(readxl)
library(data.table)
library(janitor)
# Import data from all files into a list of datarames
dfs_list = lapply(all_files, function(filename) {
# Setting up the first read
temp_read = suppressMessages(read_excel(filename))
desired_sheet = 1
skip_rows = NULL
col_skip = 0
search_string = "BOROUGH"
max_cols_to_search = 10
max_rows_to_search = 10
# First read to get the cell index of the first column header
while (length(skip_rows) == 0) {
col_skip = col_skip + 1
if (col_skip == max_cols_to_search) break
skip_rows = which(stringr::str_detect(temp_read[1:max_rows_to_search,col_skip][[1]],search_string)) - 0
}
# Second read to actually import the data
df = suppressMessages(read_excel(
filename,
sheet = desired_sheet,
skip = skip_rows,
.name_repair="universal"
) ) %>% clean_names()
return(df)
})The only data we still need to import are the 2016-2017 data from the CSV file. We append it to our list of dataframes before combining that list into a single dataframe with rbindlist() and saving it as in an RDA file for easy access later on.
# Append 2016-2017 data from csv file to list of data frames
dfs_list = append(dfs_list, list(read_csv("../Data/RE_2016-2017/nyc-rolling-sales.csv")[,-(1)] %>% clean_names()))
# Combine list of data frames into a single data frame and save as RDA file
df_housing = rbindlist(dfs_list)
save(df_housing, file="Data_Housing.rda")| borough | neighborhood | building_class_category | tax_class_at_present | block | lot | ease_ment | building_class_at_present | address | apartment_number | zip_code | residential_units | commercial_units | total_units | land_square_feet | gross_square_feet | year_built | tax_class_at_time_of_sale | building_class_at_time_of_sale | sale_price | sale_date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | BATHGATE | 01 ONE FAMILY HOMES | 1 | 3035 | 5 | NA | A9 | 451 EAST 178 STREET | NA | 10457 | 1 | 0 | 1 | 1287 | 1992 | 1899 | 1 | A9 | 0 | 2009-12-07 |
| 2 | BATHGATE | 01 ONE FAMILY HOMES | 1 | 3046 | 42 | NA | A1 | 2069 BATHGATE AVENUE | NA | 10457 | 1 | 0 | 1 | 1964 | 1424 | 1899 | 1 | A1 | 100 | 2009-04-28 |
| 2 | BATHGATE | 01 ONE FAMILY HOMES | 1 | 3046 | 42 | NA | A1 | 2069 BATHGATE AVE | NA | 10457 | 1 | 0 | 1 | 1964 | 1424 | 1899 | 1 | A1 | 100 | 2009-04-13 |
| 2 | BATHGATE | 01 ONE FAMILY HOMES | 1 | 3048 | 19 | NA | A5 | 2186 BATHGATE AVENUE | NA | 10457 | 1 | 0 | 1 | 1768 | 1188 | 1901 | 1 | A5 | 345752 | 2009-03-16 |
| 2 | BATHGATE | 01 ONE FAMILY HOMES | 1 | 3048 | 28 | NA | A1 | 540 EAST 182ND STREET | NA | 10457 | 1 | 0 | 1 | 1209 | 1048 | 1901 | 1 | A1 | 0 | 2009-05-14 |
| 2 | BATHGATE | 01 ONE FAMILY HOMES | 1 | 3053 | 105 | NA | A1 | 2323 BASSFORD AVENUE | NA | 10458 | 1 | 0 | 1 | 922 | 1248 | 1901 | 1 | A1 | 0 | 2009-01-22 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3030 | 71 | NA | B2 | 4443 PARK AVENUE | NA | 10457 | 2 | 0 | 2 | 2352 | 1632 | 1899 | 1 | B2 | 135000 | 2009-05-08 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3030 | 76 | NA | B2 | 4435 PARK AVENUE | NA | 10457 | 2 | 0 | 2 | 2444 | 2125 | 1899 | 1 | B2 | 325000 | 2009-03-30 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3035 | 131 | NA | B1 | 464 EAST 179TH | NA | 10457 | 2 | 0 | 2 | 1790 | 2340 | 1998 | 1 | B1 | 0 | 2009-10-07 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3036 | 33 | NA | B3 | 2091 WASHINGTON AVE | NA | 10457 | 2 | 0 | 2 | 2022 | 3854 | 1899 | 1 | B3 | 0 | 2009-05-08 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3036 | 62 | NA | B1 | 459 EAST 179 STREET | NA | 10457 | 2 | 0 | 2 | 2000 | 2340 | 1998 | 1 | B1 | 344500 | 2009-11-20 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3036 | 62 | NA | B1 | 459 EAST 179TH STREET | NA | 10457 | 2 | 0 | 2 | 2000 | 2340 | 1998 | 1 | B1 | 316474 | 2009-05-04 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3037 | 6 | NA | B2 | 4392 PARK AVENUE | NA | 10457 | 2 | 0 | 2 | 2275 | 3240 | 1899 | 1 | B2 | 359038 | 2009-01-19 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3039 | 12 | NA | B2 | 454 EAST 185TH STREET | NA | 10458 | 2 | 0 | 2 | 1667 | 1944 | 1910 | 1 | B2 | 183028 | 2009-11-02 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3039 | 45 | NA | B1 | 446 EAST 186 STREET | NA | 10458 | 2 | 0 | 2 | 2000 | 2060 | 1901 | 1 | B1 | 286000 | 2009-10-09 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3047 | 16 | NA | B2 | 2125 BATHGATE AVENUE | NA | 10457 | 2 | 0 | 2 | 3362 | 1744 | 1901 | 1 | B2 | 0 | 2009-04-03 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3050 | 94 | NA | B2 | 2240 BASSFORD AVENUE | NA | 10457 | 2 | 0 | 2 | 1300 | 2300 | 1901 | 1 | B2 | 389000 | 2009-07-28 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3053 | 22 | NA | S2 | 511 EAST 183 STREET | NA | 10458 | 2 | 1 | 3 | 2011 | 4280 | 1901 | 1 | S2 | 1 | 2009-11-18 |
| 2 | BATHGATE | 02 TWO FAMILY HOMES | 1 | 3053 | 22 | NA | S2 | 511 EAST 183RD STREET | NA | 10458 | 2 | 1 | 3 | 2011 | 4280 | 1901 | 1 | S2 | 415440 | 2009-08-24 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 2912 | 116 | NA | C0 | 1536 WASHINGTON AVENUE | NA | 10457 | 3 | 0 | 3 | 2000 | 2400 | 1993 | 1 | C0 | 319500 | 2009-10-07 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 2912 | 116 | NA | C0 | 1536 WASHINGTON AVENUE | NA | 10457 | 3 | 0 | 3 | 2000 | 2400 | 1993 | 1 | C0 | 100 | 2009-01-12 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3029 | 28 | NA | C0 | 2092 WEBSTER AVENUE | NA | 10457 | 3 | 0 | 3 | 2122 | 3510 | 1899 | 1 | C0 | 0 | 2009-11-18 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 67 | NA | C0 | 460 EAST 182 STREET | NA | 10457 | 3 | 0 | 3 | 2000 | 3744 | 1899 | 1 | C0 | 162240 | 2009-11-12 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 103 | NA | C0 | 2135 WASHINGTON AVENUE | NA | 10457 | 3 | 0 | 3 | 1701 | 4600 | 2006 | 1 | C0 | 630000 | 2009-07-17 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 104 | NA | C0 | 465 EAST 181 STREET | NA | 10457 | 3 | 0 | 3 | 1095 | 3706 | 2006 | 1 | C0 | 602500 | 2009-09-02 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 105 | NA | C0 | 463 EAST 181 STREET | NA | 10457 | 3 | 0 | 3 | 1095 | 3706 | 2006 | 1 | C0 | 602500 | 2009-04-24 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 106 | NA | C0 | 461 EAST 181 STREET | NA | 10457 | 3 | 0 | 3 | 1095 | 3706 | 2006 | 1 | C0 | 602500 | 2009-04-27 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 107 | NA | C0 | 459 EAST 181 STREET | NA | 10457 | 3 | 0 | 3 | 1095 | 3706 | 2006 | 1 | C0 | 640000 | 2009-05-20 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 108 | NA | C0 | 457 EAST 181 STREET | NA | 10457 | 3 | 0 | 3 | 1419 | 3706 | 2006 | 1 | C0 | 602500 | 2009-07-23 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 109 | NA | C0 | 455 EAST 181 STREET | NA | 10457 | 3 | 0 | 3 | 1687 | 4050 | 2006 | 1 | C0 | 602500 | 2009-07-15 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3037 | 110 | NA | C0 | 453 EAST 181 STREET | NA | 10457 | 3 | 0 | 3 | 1687 | 4050 | 2006 | 1 | C0 | 602500 | 2009-04-29 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3044 | 51 | NA | C0 | 1966 BATHGATE AVENUE | NA | 10457 | 3 | 0 | 3 | 2880 | 3132 | 1910 | 1 | C0 | 101400 | 2009-12-16 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3044 | 51 | NA | C0 | 1964 BATHGATE AVENUE | NA | 10457 | 3 | 0 | 3 | 2880 | 3132 | 1910 | 1 | C0 | 0 | 2009-02-23 |
| 2 | BATHGATE | 03 THREE FAMILY HOMES | 1 | 3050 | 3 | NA | C0 | 2190 WASHINGTON AVENUE | NA | 10457 | 3 | 0 | 3 | 1331 | 2244 | 1901 | 1 | C0 | 178500 | 2009-07-07 |
| 2 | BATHGATE | 05 TAX CLASS 1 VACANT LAND | 1B | 3050 | 12 | NA | V0 | 2246 WASHINGTON AVENUE | NA | 10457 | 0 | 0 | 0 | 2901 | 0 | 0 | 1 | V0 | 0 | 2009-09-30 |
| 2 | BATHGATE | 05 TAX CLASS 1 VACANT LAND | 1B | 3050 | 13 | NA | V0 | 2250 WASHINGTON AVENUE | NA | 10457 | 0 | 0 | 0 | 2668 | 0 | 0 | 1 | V0 | 0 | 2009-09-30 |
| 2 | BATHGATE | 05 TAX CLASS 1 VACANT LAND | 1B | 3050 | 15 | NA | V0 | 2254 WASHINGTON AVENUE | NA | 10457 | 0 | 0 | 0 | 2673 | 0 | 0 | 1 | V0 | 0 | 2009-09-30 |
| 2 | BATHGATE | 05 TAX CLASS 1 VACANT LAND | 2 | 3050 | 41 | NA | D1 | 491 FLETCHER PLACE | NA | 10457 | 0 | 0 | 0 | 1637 | 0 | 0 | 1 | V0 | 0 | 2009-09-30 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2 | 3027 | 29 | NA | C1 | 422 EAST 178TH STREET | NA | 10457 | 16 | 0 | 16 | 3344 | 11918 | 1909 | 2 | C1 | 1315000 | 2009-08-26 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2 | 3028 | 27 | NA | C7 | 4305 PARK AVENUE | NA | 10457 | 24 | 1 | 25 | 5568 | 16680 | 1909 | 2 | C7 | 0 | 2009-08-26 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2 | 3028 | 29 | NA | C1 | 4301 PARK AVENUE | NA | 10457 | 24 | 0 | 24 | 5189 | 15300 | 1909 | 2 | C1 | 0 | 2009-08-26 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2 | 3028 | 34 | NA | C1 | 4293 PARK AVENUE | NA | 10457 | 20 | 0 | 20 | 6045 | 15512 | 1909 | 2 | C1 | 0 | 2009-08-26 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2 | 3028 | 44 | NA | C1 | 4289 PARK AVENUE | NA | 10457 | 22 | 0 | 22 | 6098 | 16728 | 1908 | 2 | C1 | 6885000 | 2009-08-26 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2A | 3037 | 99 | NA | C3 | 2137 WASHINGTON AVENUE | NA | 10457 | 4 | 0 | 4 | 2719 | 3348 | 1901 | 2 | C3 | 10 | 2009-04-17 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2B | 3041 | 2 | NA | C1 | 4678 PARK AVENUE | NA | 10458 | 8 | 0 | 8 | 3417 | 9360 | 1910 | 2 | C1 | 0 | 2009-11-04 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2B | 3041 | 3 | NA | C1 | 4682 PARK AVENUE | NA | 10458 | 8 | 0 | 8 | 3425 | 9360 | 1910 | 2 | C1 | 0 | 2009-11-04 |
| 2 | BATHGATE | 07 RENTALS - WALKUP APARTMENTS | 2B | 3048 | 20 | NA | C1 | 2190 BATHGATE AVE | NA | 10457 | 7 | 0 | 7 | 2417 | 5400 | 1931 | 2 | C1 | 1e+05 | 2009-02-19 |
| 2 | BATHGATE | 08 RENTALS - ELEVATOR APARTMENTS | 2 | 3050 | 10 | NA | D1 | 2240 WASHINGTON AVENUE | NA | 10457 | 68 | 1 | 69 | 13424 | 54365 | 2010 | 2 | D1 | 0 | 2009-09-30 |
| 2 | BATHGATE | 10 COOPS - ELEVATOR APARTMENTS | 2 | 3035 | 48 | NA | D4 | 463 EAST 178TH STREET, 1E | NA | 10457 | 0 | 0 | 0 | 0 | 0 | 1935 | 2 | D4 | 24000 | 2009-01-21 |
| 2 | BATHGATE | 10 COOPS - ELEVATOR APARTMENTS | 2 | 3035 | 48 | NA | D4 | 463 EAST 178TH STREET, 6E | NA | 10457 | 0 | 0 | 0 | 0 | 0 | 1935 | 2 | D4 | 20000 | 2009-12-09 |
We use read.csv instead of read_csv to get a data.frame instead of tibble, which is more useful when doing analysis in other programs
Download the United States census bureau data estimates for 2015 for all 2,167 individual NYC census tracts. This data is in two parts. Import the first part which includes the block code/census tract, its latitude and longitude as well as the name of the county and state through the following code. After getting them in a dataframe (block_loc), remove the rows with ‘NA’. A basic ‘unique value check’ shows that there are 38,936 locations in 20,406 blocks in 15 counties. These longitude and latitude locations would help us later when we classify the demographic data by zip code.
#Code for census data
block_loc <- read.csv("../Data/Census_2015/census_block_loc.csv")
#This data frame has 5 columns (Latitude, longitude and block code for a block as well as its county and state) and 38,396 rows
#Checking if there are any missing or infinite
missing_val = any(is.na(block_loc))
#There are no missing values in the dataframe
#Checking for unique values
unique_state <- unique(block_loc$State)
unique_county <- unique(block_loc$County)
unique_block <- unique(block_loc$BlockCode)
#There are 38,936 locations in 20,406 blocks which are in 15 counties in NJ and NY in the data| Latitude | Longitude | BlockCode | County | State |
|---|---|---|---|---|
| 40.48 | -74.28000 | 3.402301e+14 | Middlesex | NJ |
| 40.48 | -74.27683 | 3.402301e+14 | Middlesex | NJ |
| 40.48 | -74.27367 | 3.402301e+14 | Middlesex | NJ |
| 40.48 | -74.27050 | 3.402301e+14 | Middlesex | NJ |
| 40.48 | -74.26734 | 3.402301e+14 | Middlesex | NJ |
| 40.48 | -74.26417 | 3.402301e+14 | Middlesex | NJ |
We divide this dataframe into different counties in NY and NJ to get a better sense about the data. This gives us information about the number of locations in each of the counties in our data.
#Splitting dataframes into different states and then different counties
library(dplyr)
block_loc_NY <- filter(block_loc, State == "NY")
#This has 24,240 locations
block_loc_NJ <- filter(block_loc, State == "NJ")
#This has 14,156 locations
unique(block_loc_NY$County)
block_loc_Richmond <- filter(block_loc, County == "Richmond") #3933 locations
block_loc_Queens <- filter(block_loc, County == "Queens") #6872 locations
block_loc_Nassau <- filter(block_loc, County == "Nassau") #4906 locations
block_loc_Kings <- filter(block_loc, County == "Kings") #3729 locations
block_loc_NewYork <- filter(block_loc, County == "New York") #1295 locations
block_loc_Bronx <- filter(block_loc, County == "Bronx") #2224 locations
block_loc_Westchester <- filter(block_loc, County == "Westchester") #1281 locations
unique(block_loc_NJ$County)
block_loc_Middlesex <- filter(block_loc, County == "Middlesex") #760 locations
block_loc_Monmouth <- filter(block_loc, County == "Monmouth") #1117 locations
block_loc_Union<- filter(block_loc, County == "Union") #1553 locations
block_loc_Hudson <- filter(block_loc, County == "Hudson") #2400 locations
block_loc_Essex <- filter(block_loc, County == "Essex") #3302 locations
block_loc_Bergen <- filter(block_loc, County == "Bergen") #3456 locations
block_loc_Passaic <- filter(block_loc, County == "Passaic") #1539 locations
block_loc_Morris <- filter(block_loc, County == "Morris") #29 locationsThe second part of the census data is the NYC census tracts data which is imported below. This is also split into different dataframes which contained data about sex, race, income, jobs, commute and employment, albeit these dataframes will not be used separately for analysis in the future.
#importing nyc census tracts file
nyc_tracts <- read.csv("../Data/Census_2015/nyc_census_tracts.csv")
#Splitting this into different dataframes
nyc_tracts_sex <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Men, Women) #Actual number of men and women in population
nyc_tracts_race <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Hispanic, White, Black, Native, Asian) #Percentage of different race in population
nyc_tracts_income <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Professional, Service, Office, Construction, Production) #Percentage of different jobs
nyc_tracts_jobs <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Income, IncomeErr, IncomePerCap, IncomePerCapErr, Poverty, ChildPoverty) #Income and Poverty data
nyc_tracts_commute <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Drive, Carpool, Transit, Walk, OtherTransp, WorkAtHome, MeanCommute) #Percentage of different ways that people use for commuting. Don't know MeanCommute, maybe it is the percentage of people commuting every day
nyc_tracts_employment <- select(nyc_tracts, CensusTract, County, Borough, TotalPop, Citizen, Employed, PrivateWork, PublicWork, SelfEmployed, FamilyWork, Unemployment)#Percentage of different kinds of employment| CensusTract | County | Borough | TotalPop | Men | Women | Hispanic | White | Black | Native | Asian | Citizen | Income | IncomeErr | IncomePerCap | IncomePerCapErr | Poverty | ChildPoverty | Professional | Service | Office | Construction | Production | Drive | Carpool | Transit | Walk | OtherTransp | WorkAtHome | MeanCommute | Employed | PrivateWork | PublicWork | SelfEmployed | FamilyWork | Unemployment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3.6005e+10 | Bronx | Bronx | 7703 | 7133 | 570 | 29.9 | 6.1 | 60.9 | 0.2 | 1.6 | 6476 | NA | NA | 2440 | 373 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | NA | NA | NA | NA | NA |
| 3.6005e+10 | Bronx | Bronx | 5403 | 2659 | 2744 | 75.8 | 2.3 | 16.0 | 0.0 | 4.2 | 3639 | 72034 | 13991 | 22180 | 2206 | 20.0 | 20.7 | 28.7 | 17.1 | 23.9 | 8.0 | 22.3 | 44.8 | 13.7 | 38.6 | 2.9 | 0.0 | 0.0 | 43.0 | 2308 | 80.8 | 16.2 | 2.9 | 0.0 | 7.7 |
| 3.6005e+10 | Bronx | Bronx | 5915 | 2896 | 3019 | 62.7 | 3.6 | 30.7 | 0.0 | 0.3 | 4100 | 74836 | 8407 | 27700 | 2449 | 13.2 | 23.6 | 32.2 | 23.4 | 24.9 | 9.0 | 10.5 | 41.3 | 10.0 | 44.6 | 1.4 | 0.5 | 2.1 | 45.0 | 2675 | 71.7 | 25.3 | 2.5 | 0.6 | 9.5 |
| 3.6005e+10 | Bronx | Bronx | 5879 | 2558 | 3321 | 65.1 | 1.6 | 32.4 | 0.0 | 0.0 | 3536 | 32312 | 6859 | 17526 | 2945 | 26.3 | 35.9 | 19.1 | 36.1 | 26.2 | 4.9 | 13.8 | 37.2 | 5.3 | 45.5 | 8.6 | 1.6 | 1.7 | 38.8 | 2120 | 75.0 | 21.3 | 3.8 | 0.0 | 8.7 |
| 3.6005e+10 | Bronx | Bronx | 2591 | 1206 | 1385 | 55.4 | 9.0 | 29.0 | 0.0 | 2.1 | 1557 | 37936 | 3771 | 17986 | 2692 | 37.1 | 31.5 | 35.4 | 20.9 | 26.2 | 6.6 | 11.0 | 19.2 | 5.3 | 63.9 | 3.0 | 2.4 | 6.2 | 45.4 | 1083 | 76.8 | 15.5 | 7.7 | 0.0 | 19.2 |
| 3.6005e+10 | Bronx | Bronx | 8516 | 3301 | 5215 | 61.1 | 1.6 | 31.1 | 0.3 | 3.3 | 5436 | 18086 | 3694 | 12023 | 2346 | 53.2 | 67.7 | 14.5 | 41.1 | 16.7 | 7.1 | 20.6 | 19.6 | 7.0 | 68.2 | 4.3 | 1.0 | 0.0 | 46.0 | 2508 | 71.0 | 21.3 | 7.7 | 0.0 | 17.2 |
After importing the data, we moved on to cleaning and processing. Cleaning the data involved removing the entries that were either irrelevant or incomplete for the purposes of our project. Processing involved geocoding the data, grouping it spatiotemporally, and then computing new metrics to help describe each group.
Grouping the data would allow us to more easily draw observe trends and draw meaningful conclusions, and we decided to group the real estate and crime data entries by zip code and month. The ~250 zip code areas in NYC would give us enough granularity to see the how the myriad of different regions in New York City differ from one another over time, but also enough courseness to guarantee acceptable response times of our visualization app, and ensure that there were usually multiple entries in each group to average across.
Generally, the crime data is cleaned as part of the other sections. When the crime scores are taken to classify the crimes and these scores are summed across zipcodes, crimes that could not be classified into defined categories are excluded. While excluding NAs could introduce bias, it is believed that the effect would be consistent across time and population for the analysis. However, it did appear that there was a difference in the reporting for crime scores after 2006, which was accounted for when representing the data visually.
The df_crime dataframe that contains data about all the crimes committed is first of all purged of all NA rows in the latitude and longitude columns. Then by reverse geocoding, the latitude and longitude values are mapped to a zip code in the following way.
# These should be taken care of with the higher level Import_All.R:
# source("../Engm182_Project/Import_Scripts/Import_Crime_Data.R")
# head(df_crime)
library(tidyr)
library(dplyr)
#Dropping the rows whcih have blank in latitude or longitude column
df_crime2 <- drop_na(df_crime, Latitude, Longitude)
#Reverse geocoding to find the zip code for each Lonlat
library("ggmap")
dset_crime2 <- as.data.frame(df_crime2[,29:28])
loc <- dset_crime2
res <- lapply(seq(nrow(location)), function(i){
revgeocode(location[i,],
output = c("all"),
messaging = FALSE,
sensor = FALSE,
override_limit = FALSE)
})Assign a weight to every crime based on the degree of severity. A violation which includes minor offenses like speeding ticket, public intoxication etc. is assigned a weight of 1. A misdemeanour which can include a simple assault or driving under influence is assigned a wight of 2. These were easy to identify as the LAW_CAT_CD column in df_crime gives the category of the crime (Violation, Misdemeanour or Felony).
#Assigning weights to each crime
#Violations are the most minor of offenses. They have been assigned a weight of 1
#source: https://patrickparrottalaw.com/differences-between-a-violation-a-misdemeanor-and-a-felony/
df_crime2$weight <- 0
df_crime2$weight[which(df_crime2$LAW_CAT_CD=="VIOLATION")]=1
#Misdemeanours are more serious than violations but less severe than felonies, they can carry up to a year in jail and have score of 2
#source: https://patrickparrottalaw.com/differences-between-a-violation-a-misdemeanor-and-a-felony/
df_crime2$weight[which(df_crime2$LAW_CAT_CD=="MISDEMEANOR")]=2
Assigning weights to felonies is a little more tricky. Since, felonies include a vast variety of criminal activity, we decided to first divide them into 4 categories, 1st to 4th degree. First-degree felonyincludes murder, rape, kidnapping, arson and fraud. Second-degree felony includes aggravated assault, felony assault, arson, manslaughter, possession of a controlled substance and child molestation. Third-degree felony includes assault and battery, elder abuse, transmission of pornography, driving under the influence, fraud and arson. Finally, fourth-degree felony includes involuntary manslaughter, burglary, larceny and resisting arrest. (These lists are just examples and not exhaustive and are based on a source: https://legaldictionary.net/felony/) From the OFNS_DESC column generate a list of unique keywords and group them into 4 lists pertaining to the type of feloniy. Then, the crime is assigned a weight of 6, 5, 4 or 3 for felonies of 1st degree, 2nd degree, 3rd degree and 4th degree respectively depending on which category the OFNS_DESC column of that crime suggests.
#Felonies are the most serious of offenses and will be sub-classified
#Common law and statutes in most states divide felonies into first through fourth degree felonies, each carrying decreasing penalties
#source: https://legaldictionary.net/felony/
df_crime2_felony <- subset(df_crime2, df_crime2$LAW_CAT_CD=="FELONY")
felonytypes <- unique(df_crime2_felony$OFNS_DESC)
felony_1stdegree <- c("RAPE", "THEFT-FRAUD", "ARSON", "KIDNAPPING & RELATED OFFENSES", "MURDER & NON-NEGL. MANSLAUGHTER", "KIDNAPPING", "KIDNAPPING AND RELATED OFFENSES", "NYS LAWS-UNCLASSIFIED FELONY" )
felony_2nddegree <- c("FELONY ASSAULT", "FORGERY", "SEX CRIMES", "DANGEROUS DRUGS", "DANGEROUS WEAPONS")
felony_3rddegree <- c("INTOXICATED/IMPAIRED DRIVING","ROBBERY", "CRIMINAL MISCHIEF & RELATED OF", "PROSTITUTION & RELATED OFFENSES")
felony_4thdegree <- c("GAMBLING","", "BURGLARY","GRAND LARCENY", "GRAND LARCENY OF MOTOR VEHICLE", "MISCELLANEOUS PENAL LAW", "POSSESSION OF STOLEN PROPERTY", "CHILD ABANDONMENT/NON SUPPORT", "HOMICIDE-NEGLIGENT-VEHICLE", "HOMICIDE-NEGLIGENT,UNCLASSIFIE", "ENDAN WELFARE INCOMP")
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_1stdegree == TRUE)]=6
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_2nddegree == TRUE)]=5
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_3rddegree == TRUE)]=4
df_crime2$weight[which(df_crime2$OFNS_DESC %in% felony_4thdegree == TRUE)]=3
df_crime <- df_crime2
save(df_crime, file='Data_Crime.rda')
#Extract month and year for each crime
df_crime3 <- subset(df_crime2, select = c(CMPLNT_FR_DT, OFNS_DESC, LAW_CAT_CD, Longitude, Latitude, weight))
head(df_crime3)First, the relevant datafiles are imported into the environment and the packages are loaded:
library(lubridate)
library(dplyr)
load(file = "Data_Crime_w_Zipcodes.rda")
load(file = "Data_Zip_Population.rda")For analysis, it is helpful to have the events grouped by month, year, and zipcode. Then, a field like the crime score can be aggregated and normalized for this criterion. In this case, the crime data is formatted in the desired way with the following command:
Source: https://stackoverflow.com/questions/33221425/how-do-i-group-my-date-variable-into-month-year-in-r
In addition, commands are run to normalize the crime score by the total population and only take data greater than the year 2000. There are extraneous entries in the data for crimes hundreds of years ago and 2000 is a good baseline. It also appears that the reporting method for these crimes changes in 2006, as evidenced by a dramatic change in crime scores at this time.
# Summing the scores
names(score_by_time_and_rating)[names(score_by_time_and_rating) == 'summary_variable'] <- 'sum_weight'
# Joining the zipcodes
score_by_time_and_rating <- left_join(score_by_time_and_rating, zip_population, by="zip_code")
# Dividing sum by the total population (at 2015 census levels)
score_by_time_and_rating$weight <- score_by_time_and_rating$sum_weight / score_by_time_and_rating$TotalPop
# Dipslaying the head
head(score_by_time_and_rating)
save(score_by_time_and_rating, file = "Data_Score_by_Time_and_Rating.rda")
#Group by year and rating (This is for the regression model for house sales)
score_by_zip_and_year <- df_crime %>% group_by(year=floor_date(CMPLNT_FR_DT, "year"), zip_code) %>% summarize(summary_variable=sum(weight))
#Summing the scores
names(score_by_zip_and_year)[names(score_by_zip_and_year) == 'summary_variable'] <- 'sum_weight'
#Extracting the year
score_by_zip_and_year$year <- year(score_by_zip_and_year$year)
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(year,zip_code,sum_weight))
# Joining the zipcodes with populations
score_by_zip_and_year <- left_join(score_by_zip_and_year, zip_population, by="zip_code")
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(year,zip_code,sum_weight, TotalPop))
# Dividing sum by the total population (at 2015 census levels)
score_by_zip_and_year$weight <- score_by_zip_and_year$sum_weight / score_by_zip_and_year$TotalPop
#Taking data only for years 2001 and above
score_by_zip_and_year <- subset(score_by_zip_and_year, year > 2000)
score_by_zip_and_year <- subset(score_by_zip_and_year, select = c(zip_code, year, sum_weight, weight, TotalPop))
save(score_by_zip_and_year, file = "Data_Score_by_year_and_zipcode.rda")There was difficulty with initially processing the location data into a format for analysis. For the crime and census data:
For the regressions and displaying the data visually, the data needs to be grouped by similar location. In this case, zipcodes were chosen to be the level to group by. By classifying the data entries in a zipcode, the effect of the zipcode can be studied.
At first, an API was used to convert the coordinates from the crime data to zip codes. The revgeocode function from the ggmap package was used and applied to each crime complaint. However, it was taking too long to make the requests to retrieve each of the zipcodes. Although this route was not pursued, it may be possible to perform the API requests in parallel and improve performance.
Instead, it was chosen to perform the reverse geocoding locally and avoid having to make requests to an API service for each zip code. Further, since the zip codes are limited to only New York, it was possible to only download the location data for this area. In this section, the same method will be applied to extract the zipcodes.
The source for the code to retrieve the zipcodes from latitude and longitude coordinates locally is: https://stackoverflow.com/questions/46267287/reverse-geocoding-speed
The source for the geospatial polygon data to process the coordinates is: https://jsspina.carto.com/tables/nyc_zip_code_tabulation_areas_polygons/public/map
First, the packages are loaded:
The crime data is read in with the following command:
Then the desired latitude and longitude fields are taken. the complaint number is also taken so that the crime-zipcode data can be joined again later. The NA values are omitted.
CMPLNT_NUM <- c(df_crime$CMPLNT_NUM)
latitude <- c(df_crime$Latitude)
longitude <- c(df_crime$Longitude)
x = data.frame(CMPLNT_NUM, longitude, latitude)
x <- na.omit(x)
sf_x <- sf::st_as_sf(x, coords = c("longitude", "latitude"))As stated in the sourced guide above for reverse geocoding this data, “To perform spatial operations, the coordinate reference system needs to match between the two geometries.”
The coordinates are processed with the st_within function and put into an output using sapply:
The output for each zipcode was a list of zipcodes even though there was just one. There were multiple outputs in the case that one was trying to link overlapping geometries with the data. The zipcodes are converted to the proper format:
out2 <- map(out, 1) # Take first element from each list
out3 <- lapply(out, function(x) ifelse(length(x) == 1, x[1], NA)) # Handle NAs
out4 <- vapply(out3, paste, collapse = ", ", character(1L)) # Flattten listst; source: https://stackoverflow.com/questions/24829027/unimplemented-type-list-when-trying-to-write-table
sf_x$zip_code <- out4Finally, the zipcodes are linked back with the crime data and saved:
# Putting it back with the sf_data
df_crime <- left_join(df_crime, sf_x, by = "CMPLNT_NUM")
# Format column as dates
df_crime$CMPLNT_FR_DT <- as.Date(df_crime$CMPLNT_FR_DT, format = "%m/%d/%Y")
# This column NEEDS to be removed because it causes errors
df_crime$geometry<-NULL
# Saving the data with thet important information
save(df_crime, file='Data_Crime_w_Zipcodes.rda')
After importing the census data in section “02C: Data_Import_Census”, no more data cleaning is required as the data was organized into a usable format there itself and the NA values were removed.
First, the packages are loaded:
The process for finding the zipcodes that corresponded to the census data is different. The census tract information is more granular than at the zipcode level. The mapping between tracts and zipcodes is imported and joined to the census data, so that each element has a corresponding zipcode. The Excel file can be downloaded from the source at end of this section.
tract_zips <- my_data <- read_excel("TRACT_ZIP_032020.xlsx")
names(tract_zips)[names(tract_zips) == 'ZIP'] <- 'zip_code'
names(tract_zips)[names(tract_zips) == 'TRACT'] <- 'CensusTract'
tract_zips$CensusTract <- as.numeric(tract_zips$CensusTract)
nyc_tracts <- left_join(nyc_tracts, tract_zips, by="CensusTract")
# Save updated nyc_tracs
save(nyc_tracts, file='Data_NYC_Tracts.rda')
#Adding all the parameters required for regression in zip_population file
zip_population <- subset(nyc_tracts, select = c(CensusTract,zip_code,TotalPop, Men, Women, Hispanic, White, Black, Native, Asian, IncomePerCap, Unemployment))The most important element was the corresponding populations for each zipcode. The population data for each demographic was summed across each zipcode. The demographic information was initially stored as percentages, so it was necessary to multiply those percentages by the total population for each census tract.
#Converting the demographic and unemployment percentages to numbers and income per capita to income
zip_population$Hispanic.number <- zip_population$Hispanic*zip_population$TotalPop/100
zip_population$White.number <- zip_population$White*zip_population$TotalPop/100
zip_population$Black.number <- zip_population$Black*zip_population$TotalPop/100
zip_population$Native.number <- zip_population$Native*zip_population$TotalPop/100
zip_population$Asian.number <- zip_population$Asian*zip_population$TotalPop/100
zip_population$IncomeTot <- zip_population$IncomePerCap*zip_population$TotalPop
#Percentage of people in the age group of 18-65 in NY is 65% source: https://www.census.gov/quickfacts/newyorkcitynewyork
zip_population$Unemployment.number <- zip_population$Unemployment*0.65*zip_population$TotalPop/100
#Aggregating the various census tract data into 1 zip code
zip_population <- subset(zip_population, select = c(zip_code,TotalPop, Men, Women, Hispanic.number, White.number, Black.number, Native.number, Asian.number, IncomeTot, Unemployment.number))
zip_population <- aggregate(.~zip_code, data = zip_population, FUN = sum)
#Convert back the demographics and unemployment to % and IncomeTot to per capita income
zip_population$Hispanic.number <- zip_population$Hispanic.number*100/zip_population$TotalPop
zip_population$White.number <- zip_population$White.number*100/zip_population$TotalPop
zip_population$Black.number <- zip_population$Black.number*100/zip_population$TotalPop
zip_population$Native.number <- zip_population$Native.number*100/zip_population$TotalPop
zip_population$Asian.number <- zip_population$Asian.number*100/zip_population$TotalPop
zip_population$IncomeTot <- zip_population$IncomeTot/zip_population$TotalPop
zip_population$Unemployment.number <- zip_population$Unemployment.number*100/(0.65*zip_population$TotalPop)Finally, the columns are renamed as needed. The data for zipcode to population demographic was saved for analysis later:
The following site includes the most up to date census tract to zipcode information: https://www.huduser.gov/portal/datasets/usps_crosswalk.html
In order to group the real estate data by zip code, we first need to attach a zip code value to each entry in our data (a process known as “geocoding”, or “reverse-geocoding” to be more precise). To do this, we use the Geoclient API available on the city’s Developer Portal. Before pulling from the API, we do some quick cleaning by converting the borough, sale_price, and gross_square_feet columns to the numeric data type (to make sure they’re not incorrectly stored as character or factor types). Then we filter out some of the data with bad sale_price and gross_square_feet values, as this makes for fewer API queries and we know we won’t want use those entries in our analysis anyway. The last step before pulling from the API is creating the bbl column, which is a 10 digit combination of the borough, block, and lot columns. We will pass each entry’s bbl number to the API, which will then return the zip code (among other data) for that specific plot of land.
library(tidyverse)
load("Data_Housing.rda")
# Convert chr variables to numerics
df_housing$borough = as.numeric(df_housing$borough)
df_housing$sale_price = as.numeric(df_housing$sale_price)
df_housing$gross_square_feet = as.numeric(df_housing$gross_square_feet)
# Get only data with valid sale prices and square footages
# Build borough-block-lot (BBL) ID for reverse geocoding
df_housing = df_housing %>%
filter(sale_price > 0) %>%
filter(gross_square_feet > 1) %>%
mutate(bbl = borough * 1e9 + block * 1e4 + lot)A nifty CRAN package appropriately titled geoclient allows us to pull data from the Geoclient API in just a single line of code. However, because the API is public and free, the rate of requests from a single account is limited and it took ~7 hours to query all 1.4 mllion properties. After the queries are complete, we bind the returned API data to the existing df_housing data frame in a new data frame called geocoded_df_housing.
In the next step, we modify data from a CSV file with all NYC building class codes to include a column labeling the building category of each class code (e.g., Residential, Commercial, Industrial, etc.).
building_class_key = read_csv("BUILDING_CLASS.csv") %>%
group_by(BUILDING_CODE_ID) %>%
summarize(category = str_to_title(TYPE[1])) %>%
rename(class_code = BUILDING_CODE_ID) %>%
mutate(category = case_when(
substring(class_code, 1,1) %in% c("A","B","C","D","R") ~ "Residential",
substring(class_code, 1,1) %in% c("S") ~ "Mixed",
substring(class_code, 1,1) %in% c("H","K","L","O") ~ "Commercial",
substring(class_code, 1,1) %in% c("E","F","G","T","U") ~ "Industrial",
substring(class_code, 1,1) %in% c("V") ~ "Vacant",
substring(class_code, 1,1) %in% c("I","J","M","N","P","Q","W") ~ "Civic",
substring(class_code, 1,1) %in% c("Y") ~ "Government",
substring(class_code, 1,1) %in% c("Z") ~ "Other"
))| class_code | category |
|---|---|
| A0 | Residential |
| A1 | Residential |
| A2 | Residential |
| A3 | Residential |
| A4 | Residential |
| A5 | Residential |
| A6 | Residential |
| A7 | Residential |
| A8 | Residential |
| A9 | Residential |
| B1 | Residential |
| B2 | Residential |
| B3 | Residential |
| B9 | Residential |
| C0 | Residential |
| C1 | Residential |
| C2 | Residential |
| C3 | Residential |
| C4 | Residential |
| C5 | Residential |
| C6 | Residential |
| C7 | Residential |
| C8 | Residential |
| C9 | Residential |
| CM | Residential |
| D0 | Residential |
| D1 | Residential |
| D2 | Residential |
| D3 | Residential |
| D4 | Residential |
| D5 | Residential |
| D6 | Residential |
| D7 | Residential |
| D8 | Residential |
| D9 | Residential |
| E1 | Industrial |
| E2 | Industrial |
| E3 | Industrial |
| E4 | Industrial |
| E7 | Industrial |
| E9 | Industrial |
| F1 | Industrial |
| F2 | Industrial |
| F4 | Industrial |
| F5 | Industrial |
| F8 | Industrial |
| F9 | Industrial |
| G0 | Industrial |
| G1 | Industrial |
| G2 | Industrial |
| G3 | Industrial |
| G4 | Industrial |
| G5 | Industrial |
| G6 | Industrial |
| G7 | Industrial |
| G8 | Industrial |
| G9 | Industrial |
| GU | Industrial |
| GW | Industrial |
| H1 | Commercial |
| H2 | Commercial |
| H3 | Commercial |
| H4 | Commercial |
| H5 | Commercial |
| H6 | Commercial |
| H7 | Commercial |
| H8 | Commercial |
| H9 | Commercial |
| HB | Commercial |
| HH | Commercial |
| HR | Commercial |
| HS | Commercial |
| I1 | Civic |
| I2 | Civic |
| I3 | Civic |
| I4 | Civic |
| I5 | Civic |
| I6 | Civic |
| I7 | Civic |
| I9 | Civic |
| J1 | Civic |
| J2 | Civic |
| J3 | Civic |
| J4 | Civic |
| J5 | Civic |
| J6 | Civic |
| J7 | Civic |
| J8 | Civic |
| J9 | Civic |
| K1 | Commercial |
| K2 | Commercial |
| K3 | Commercial |
| K4 | Commercial |
| K5 | Commercial |
| K6 | Commercial |
| K7 | Commercial |
| K8 | Commercial |
| K9 | Commercial |
| L1 | Commercial |
| L2 | Commercial |
| L3 | Commercial |
| L8 | Commercial |
| L9 | Commercial |
| M1 | Civic |
| M2 | Civic |
| M3 | Civic |
| M4 | Civic |
| M9 | Civic |
| N1 | Civic |
| N2 | Civic |
| N3 | Civic |
| N4 | Civic |
| N9 | Civic |
| O1 | Commercial |
| O2 | Commercial |
| O3 | Commercial |
| O4 | Commercial |
| O5 | Commercial |
| O6 | Commercial |
| O7 | Commercial |
| O8 | Commercial |
| O9 | Commercial |
| P1 | Civic |
| P2 | Civic |
| P3 | Civic |
| P4 | Civic |
| P5 | Civic |
| P6 | Civic |
| P7 | Civic |
| P8 | Civic |
| P9 | Civic |
| Q1 | Civic |
| Q2 | Civic |
| Q3 | Civic |
| Q4 | Civic |
| Q5 | Civic |
| Q6 | Civic |
| Q7 | Civic |
| Q8 | Civic |
| Q9 | Civic |
| R0 | Residential |
| R1 | Residential |
| R2 | Residential |
| R3 | Residential |
| R4 | Residential |
| R5 | Residential |
| R6 | Residential |
| R7 | Residential |
| R8 | Residential |
| R9 | Residential |
| RA | Residential |
| RB | Residential |
| RG | Residential |
| RH | Residential |
| RK | Residential |
| RP | Residential |
| RR | Residential |
| RS | Residential |
| RT | Residential |
| RW | Residential |
| S0 | Mixed |
| S1 | Mixed |
| S2 | Mixed |
| S3 | Mixed |
| S4 | Mixed |
| S5 | Mixed |
| S9 | Mixed |
| T1 | Industrial |
| T2 | Industrial |
| T9 | Industrial |
| U0 | Industrial |
| U1 | Industrial |
| U2 | Industrial |
| U3 | Industrial |
| U4 | Industrial |
| U5 | Industrial |
| U6 | Industrial |
| U7 | Industrial |
| U8 | Industrial |
| U9 | Industrial |
| V0 | Vacant |
| V1 | Vacant |
| V2 | Vacant |
| V3 | Vacant |
| V4 | Vacant |
| V5 | Vacant |
| V6 | Vacant |
| V7 | Vacant |
| V8 | Vacant |
| V9 | Vacant |
| W1 | Civic |
| W2 | Civic |
| W3 | Civic |
| W4 | Civic |
| W5 | Civic |
| W6 | Civic |
| W7 | Civic |
| W8 | Civic |
| W9 | Civic |
| Y1 | Government |
| Y2 | Government |
| Y3 | Government |
| Y4 | Government |
| Y5 | Government |
| Y6 | Government |
| Y7 | Government |
| Y8 | Government |
| Y9 | Government |
| Z0 | Other |
| Z1 | Other |
| Z2 | Other |
| Z3 | Other |
| Z4 | Other |
| Z5 | Other |
| Z7 | Other |
| Z8 | Other |
| Z9 | Other |
Now we select the columns from geocoded_df_housing that we want to work with moving forward, filter out missing values and high/low-priced outliers, and use the building class codes to assign a category to each building sale. This is all stored and saved new data frame called cleaned_df_housing, which we will use for grouping in the next sub-section.
cleaned_df_housing = geocoded_df_housing %>%
select(
building_class_at_present,
zip_code,
gross_square_feet,
sale_price,
sale_date,
) %>%
rename(lat = latitudeInternalLabel) %>%
rename(lng = longitudeInternalLabel) %>%
filter(!(is.na(lat) | is.na(lng))) %>%
filter(!(is.na(zip_code) | zip_code == 0)) %>%
filter(!(is.na(sale_price))) %>%
filter(!(is.na(gross_square_feet) | gross_square_feet <= 0)) %>%
mutate(price_per_sqft = sale_price/gross_square_feet) %>%
filter(price_per_sqft <= 2500) %>%
filter(price_per_sqft > 20) %>%
mutate(zip_code = sapply(zip_code, as.character)) %>%
left_join(building_class_key, by = c("building_class_at_present" = "class_code"))
save(cleaned_df_housing, file="cleaned_housing.rda")| building_class_at_present | zip_code | gross_square_feet | sale_price | sale_date | price_per_sqft | category |
|---|---|---|---|---|---|---|
| A5 | 10457 | 1188 | 345752 | 2009-03-16 | 291.03704 | Residential |
| B2 | 10457 | 1632 | 135000 | 2009-05-08 | 82.72059 | Residential |
| B2 | 10457 | 2125 | 325000 | 2009-03-30 | 152.94118 | Residential |
| B1 | 10457 | 2340 | 344500 | 2009-11-20 | 147.22222 | Residential |
| B1 | 10457 | 2340 | 316474 | 2009-05-04 | 135.24530 | Residential |
| B2 | 10457 | 3240 | 359038 | 2009-01-19 | 110.81420 | Residential |
| B2 | 10458 | 1944 | 183028 | 2009-11-02 | 94.15021 | Residential |
| B1 | 10458 | 2060 | 286000 | 2009-10-09 | 138.83495 | Residential |
| B2 | 10457 | 2300 | 389000 | 2009-07-28 | 169.13043 | Residential |
| S2 | 10458 | 4280 | 415440 | 2009-08-24 | 97.06542 | Mixed |
| C0 | 10457 | 2400 | 319500 | 2009-10-07 | 133.12500 | Residential |
| C0 | 10457 | 3744 | 162240 | 2009-11-12 | 43.33333 | Residential |
| C0 | 10457 | 4600 | 630000 | 2009-07-17 | 136.95652 | Residential |
| C0 | 10457 | 3706 | 602500 | 2009-09-02 | 162.57420 | Residential |
| C0 | 10457 | 3706 | 602500 | 2009-04-24 | 162.57420 | Residential |
| C0 | 10457 | 3706 | 602500 | 2009-04-27 | 162.57420 | Residential |
| C0 | 10457 | 3706 | 640000 | 2009-05-20 | 172.69293 | Residential |
| C0 | 10457 | 3706 | 602500 | 2009-07-23 | 162.57420 | Residential |
| C0 | 10457 | 4050 | 602500 | 2009-07-15 | 148.76543 | Residential |
| C0 | 10457 | 4050 | 602500 | 2009-04-29 | 148.76543 | Residential |
| C0 | 10457 | 3132 | 101400 | 2009-12-16 | 32.37548 | Residential |
| C0 | 10457 | 2244 | 178500 | 2009-07-07 | 79.54545 | Residential |
| C1 | 10457 | 11918 | 1315000 | 2009-08-26 | 110.33730 | Residential |
| C1 | 10457 | 16728 | 6885000 | 2009-08-26 | 411.58537 | Residential |
| K1 | 10457 | 1775 | 336000 | 2009-07-23 | 189.29577 | Commercial |
| S4 | 10457 | 3233 | 185000 | 2009-07-07 | 57.22239 | Mixed |
| M1 | 10458 | 9000 | 860000 | 2009-07-15 | 95.55556 | Civic |
| A5 | 10469 | 1512 | 370000 | 2009-05-04 | 244.70899 | Residential |
| A5 | 10469 | 1701 | 258570 | 2009-06-04 | 152.01058 | Residential |
| S1 | 10469 | 1889 | 325000 | 2009-09-04 | 172.04870 | Mixed |
| A1 | 10469 | 2232 | 349990 | 2009-10-09 | 156.80556 | Residential |
| A5 | 10469 | 1681 | 260000 | 2009-09-17 | 154.66984 | Residential |
| A5 | 10469 | 1920 | 312000 | 2009-06-12 | 162.50000 | Residential |
| A5 | 10469 | 1920 | 315000 | 2009-08-05 | 164.06250 | Residential |
| A5 | 10469 | 1920 | 308510 | 2009-10-28 | 160.68229 | Residential |
| A5 | 10469 | 1742 | 330000 | 2009-03-30 | 189.43743 | Residential |
| A5 | 10469 | 1440 | 310000 | 2009-04-17 | 215.27778 | Residential |
| A5 | 10469 | 1778 | 380000 | 2009-03-25 | 213.72328 | Residential |
| A5 | 10469 | 1280 | 300000 | 2009-10-05 | 234.37500 | Residential |
| A5 | 10469 | 1320 | 364000 | 2009-01-12 | 275.75758 | Residential |
| A5 | 10469 | 1890 | 423470 | 2009-06-18 | 224.05820 | Residential |
| A2 | 10469 | 1896 | 395000 | 2009-04-13 | 208.33333 | Residential |
| A5 | 10469 | 1534 | 250000 | 2009-06-08 | 162.97262 | Residential |
| A2 | 10469 | 779 | 125000 | 2009-08-24 | 160.46213 | Residential |
| A2 | 10469 | 396 | 113645 | 2009-07-27 | 286.98232 | Residential |
| A1 | 10469 | 1288 | 400000 | 2009-06-25 | 310.55901 | Residential |
| A2 | 10469 | 1140 | 135000 | 2009-09-30 | 118.42105 | Residential |
| A2 | 10469 | 2234 | 320000 | 2009-07-23 | 143.24082 | Residential |
| A2 | 10469 | 1000 | 325943 | 2009-09-03 | 325.94300 | Residential |
| B3 | 10469 | 4019 | 165000 | 2009-11-09 | 41.05499 | Residential |
Now that we have the real estate geocoded and cleaned, we can group it by aggregating all sales that took place in the same zip code and during the same month. We will create 3 new variables to describe each group: average price per square foot (avg_price_per_sqft), total number of sales (num_sales), and total proceeds from all sales (total_proceeds).
load("cleaned_housing.rda")
grouped_housing_all = cleaned_df_housing %>%
mutate(month_char = format(as.Date(sale_date), "%Y-%m")) %>%
group_by(zip_code, month_char) %>%
summarize(
category = "All",
avg_price_per_sqft = mean(price_per_sqft),
num_sales = n(),
total_proceeds = sum(sale_price)
)Next, we peform the same grouping function again, but with building category as an extra grouping variable. This will allow us to filter the displayed data by building category in the UI for our visualization. We combine the groups of individual categories with the groups of all categories to create a single data frame. Then we filter high-sales outliers and save this data frame for visualization in our Shiny app.
grouped_housing = cleaned_df_housing %>%
mutate(month_char = format(as.Date(sale_date), "%Y-%m")) %>%
group_by(zip_code, month_char, category) %>%
summarize(
avg_price_per_sqft = mean(price_per_sqft),
num_sales = n(),
total_proceeds = sum(sale_price)
) %>%
bind_rows(grouped_housing_all) %>%
filter(total_proceeds < 200000000) %>%
data.frame()
save(grouped_housing, file="grouped_housing.rda")| zip_code | month_char | category | avg_price_per_sqft | num_sales | total_proceeds |
|---|---|---|---|---|---|
| 10001 | 2003-01 | Commercial | 277.77778 | 1 | 2200000 |
| 10001 | 2003-01 | Mixed | 418.26420 | 1 | 2400000 |
| 10001 | 2003-01 | Residential | 201.08860 | 1 | 665000 |
| 10001 | 2003-02 | Commercial | 391.52249 | 7 | 65050000 |
| 10001 | 2003-03 | Commercial | 267.30755 | 1 | 10900000 |
| 10001 | 2003-04 | Commercial | 541.50438 | 2 | 4670000 |
| 10001 | 2003-04 | Residential | 295.63932 | 1 | 2000000 |
| 10001 | 2003-06 | Commercial | 258.90032 | 2 | 4780000 |
| 10001 | 2003-07 | Commercial | 298.33823 | 1 | 8025000 |
| 10001 | 2003-09 | Commercial | 506.40168 | 1 | 2650000 |
| 10001 | 2003-09 | Mixed | 377.17284 | 1 | 1150000 |
| 10001 | 2003-10 | Commercial | 148.92465 | 1 | 11550000 |
| 10001 | 2003-11 | Commercial | 108.53280 | 2 | 4263310 |
| 10001 | 2003-12 | Commercial | 77.25308 | 2 | 5234000 |
| 10001 | 2004-01 | Commercial | 84.60919 | 1 | 210000 |
| 10001 | 2004-01 | Industrial | 179.62312 | 1 | 38319000 |
| 10001 | 2004-02 | Commercial | 456.69442 | 3 | 17785074 |
| 10001 | 2004-02 | Mixed | 314.91003 | 1 | 1225000 |
| 10001 | 2004-03 | Commercial | 275.29213 | 2 | 3550000 |
| 10001 | 2004-04 | Commercial | 603.31827 | 10 | 58435000 |
| 10001 | 2004-04 | Residential | 143.24058 | 2 | 19900000 |
| 10001 | 2004-06 | Commercial | 132.41494 | 5 | 96970000 |
| 10001 | 2004-07 | Commercial | 177.95878 | 3 | 18380000 |
| 10001 | 2004-08 | Civic | 822.78481 | 1 | 11375000 |
| 10001 | 2004-08 | Residential | 468.70275 | 1 | 1550000 |
| 10001 | 2004-09 | Commercial | 154.36808 | 3 | 14033333 |
| 10001 | 2004-10 | Commercial | 200.30904 | 2 | 8350000 |
| 10001 | 2004-10 | Residential | 616.69386 | 2 | 4625000 |
| 10001 | 2005-01 | Commercial | 217.81244 | 4 | 12475000 |
| 10001 | 2005-01 | Mixed | 437.01799 | 1 | 1700000 |
| 10001 | 2005-01 | Residential | 321.61235 | 1 | 1875000 |
| 10001 | 2005-02 | Commercial | 133.75306 | 2 | 7700000 |
| 10001 | 2005-02 | Residential | 483.55932 | 2 | 9933590 |
| 10001 | 2005-03 | Commercial | 100.31802 | 2 | 16800000 |
| 10001 | 2005-03 | Residential | 134.52915 | 1 | 3000000 |
| 10001 | 2005-04 | Commercial | 285.02850 | 1 | 3800000 |
| 10001 | 2005-05 | Civic | 85.59236 | 1 | 940660 |
| 10001 | 2005-05 | Commercial | 139.86014 | 1 | 1000000 |
| 10001 | 2005-06 | Commercial | 667.11141 | 1 | 4000000 |
| 10001 | 2005-07 | Commercial | 602.09691 | 2 | 28500000 |
| 10001 | 2005-07 | Residential | 419.50293 | 2 | 8350000 |
| 10001 | 2005-09 | Commercial | 280.31495 | 4 | 120277465 |
| 10001 | 2005-10 | Commercial | 218.40519 | 1 | 31750000 |
| 10001 | 2005-11 | Commercial | 226.83706 | 1 | 35500000 |
| 10001 | 2005-12 | Commercial | 35.30153 | 1 | 3605875 |
| 10001 | 2005-12 | Mixed | 332.08247 | 2 | 3818622 |
| 10001 | 2006-01 | Commercial | 175.38672 | 2 | 32525000 |
| 10001 | 2006-01 | Residential | 132.40123 | 1 | 12500000 |
| 10001 | 2006-02 | Commercial | 638.95109 | 3 | 32350000 |
| 10001 | 2006-02 | Residential | 329.28586 | 1 | 17600000 |
First, load the 2 relevant files. Data_Housing.rda (df_housing) contains all the housing sales data. Data_Zip_Population.rda (zip_population) contains the census data for each zip code.
Next, remove rows which are either blank or have NA values in the sale_price or land_square_feet column of df_housing.
# Only sales prices > 0 for housing data
# Only take > 0; source: https://stackoverflow.com/questions/33535006/remove-rows-if-it-is-negative-number
df <- df_housing[df_housing$land_square_feet > 0, ]
df <- df[df$sale_price > 0, ]
# Get rid of NA values; source: https://stackoverflow.com/questions/48658832/how-to-remove-row-if-it-has-a-na-value-in-one-certain-column/48659101
df <- df[!is.na(df$sale_price), ]
df <- df[!is.na(df$land_square_feet), ]
Create a new dataframe, df_sale, which containse the relevant sales price prediction input variables from the df_housing dataframe. Add 2 more column, sale_year and sale_month, which contain the year and month of the sale. These values are extracted from the sale_date column by using the functions year() and month() respectively.
Add a new column, crime_score_year. If the sale was completed in the months of January, February or March, add the year just before the sale_year in the crime_score_year column. If the sale was completed in the rest of the months, crime_score_year = sale_year. This is done to map the relevant year’s crime score for each sale’s analysis. For e.g. if a sale happens in January of 2006, then it doesn’t make sense to add the crime score for 2006. Crimes happening in 2005 would have affected that sale’s price more than 2006.
library(lubridate)
#Identify the input variables for sales price prediction
#Zipcode (will be used to join census and crime data), land_square_feet, gross_square_feet, year built, tax_class_at_time_of_sale, building_class_at_time_of_sale, sale_date (will be used to link crime rating for that year)
#Subset new dataframe with relevant variables
df_sale <- subset(df, select = c(zip_code,land_square_feet, gross_square_feet, year_built, tax_class_at_time_of_sale, building_class_at_time_of_sale, sale_date, sale_price))
df_sale$sale_year <- year(df_sale$sale_date)
df_sale$sale_month <- month(df_sale$sale_date)
#If sale is in months 1, 2, 3 then crime score will be taken from previous year else crime score of that year
df_sale$crime_score_year <- ifelse(df_sale$sale_month == 1 | df_sale$sale_month == 2 | df_sale$sale_month == 3, df_sale$sale_year-1, df_sale$sale_year)Now, we will add the census data to the housing sales data. First find the zipcodes which are present in both df_sale and zip_population dataframes. Then add the df_sale and zip_population dataframes by these unique zipcodes through the following code. This merged dataframe is saved as df_sale_census.
#Adding the census data by linking through zipcode
#Taking a subset of df_sale for only the zip codes for which we have census data
zip_codes_inhousing <- unique(df_sale$zip_code)
zip_codes_incensus <- unique(zip_population$zip_code)
zip_codes_inhousing_incensus <- intersect(zip_codes_inhousing, zip_codes_incensus)
#Merging the sales and census data into one dataframe
df_sale$zip_code <- as.character(df_sale$zip_code)
df_sale_census <- merge(df_sale[df_sale$zip_code %in% zip_codes_inhousing_incensus, ], zip_population[zip_population$zip_code %in% zip_codes_inhousing_incensus, ], by = "zip_code")Load the file Data_Score_by_year_and_zipcode.rda (score_by_zip_and_year) that was created in section “04B: Data_Processing_Month_Year”. Find the zipcodes that are present in both df_sale_census and score_by_zip_and_year and store them in zip_codes_inhousing_incensus_incrime. Merge the df_sale_census and score_by_zip_year dataframes by zipcodes and store them in df_sale_census_crime. Remove all rows from this final dataframe which have a zero value and save this dataframe as Data_sale_census_crime.rda for future analysis.
load(file = "Data_Score_by_year_and_zipcode.rda")
#Adding the crime score data for each year
zip_codes_inhousing_incensus_incrime <- intersect(unique(df_sale_census$zip_code), unique(score_by_zip_and_year$zip_code))
colnames(score_by_zip_and_year) <- c("zip_code", "crime_score_year", "sum_weight", "weight", "TotalPop")
df_sale_census_crime <- merge(df_sale_census[df_sale_census$zip_code %in% zip_codes_inhousing_incensus_incrime, ], score_by_zip_and_year[score_by_zip_and_year$zip_code %in% zip_codes_inhousing_incensus_incrime, ], by = c("zip_code", "crime_score_year"))
df_sale_census_crime <- na.omit(df_sale_census_crime)
#any(is.na(df_sale_census_crime))
#Remove values of 0
#source: https://stackoverflow.com/questions/9977686/how-to-remove-rows-with-any-zero-value
zero_rows = apply(df_sale_census_crime, 1, function(row) all(row != 0))
df_sale_census_crime <- df_sale_census_crime[zero_rows, ]
save(df_sale_census_crime, file = "Data_sale_census_crime.rda")| zip_code | crime_score_year | land_square_feet | gross_square_feet | year_built | tax_class_at_time_of_sale | building_class_at_time_of_sale | sale_date | sale_price | sale_year | sale_month | TotalPop.x | Men | Women | Hispanic | White | Black | Native | Asian | PerCapitaIncome | Unemployed | sum_weight | weight | TotalPop.y |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10001 | 2002 | 4937 | 54000 | 1912 | 4 | L1 | 2003-02-12 | 5e+06 | 2003 | 2 | 127587350 | 61690470 | 65896880 | 3.824723 | 74.48506 | 2.424281 | 1.450839 | 15.4563 | 114905.6 | 4.76503 | 19 | 1e-07 | 127587350 |
| 10001 | 2002 | 4937 | 54000 | 1912 | 4 | L1 | 2003-02-12 | 13500000 | 2003 | 2 | 127587350 | 61690470 | 65896880 | 3.824723 | 74.48506 | 2.424281 | 1.450839 | 15.4563 | 114905.6 | 4.76503 | 19 | 1e-07 | 127587350 |
| 10001 | 2002 | 1480 | 7060 | 1920 | 4 | L9 | 2003-02-07 | 2150000 | 2003 | 2 | 127587350 | 61690470 | 65896880 | 3.824723 | 74.48506 | 2.424281 | 1.450839 | 15.4563 | 114905.6 | 4.76503 | 19 | 1e-07 | 127587350 |
| 10001 | 2002 | 2469 | 4494 | 1901 | 2 | S4 | 2003-01-03 | 24375 | 2003 | 1 | 127587350 | 61690470 | 65896880 | 3.824723 | 74.48506 | 2.424281 | 1.450839 | 15.4563 | 114905.6 | 4.76503 | 19 | 1e-07 | 127587350 |
| 10001 | 2002 | 2345 | 7035 | 1920 | 4 | K9 | 2003-01-03 | 46687 | 2003 | 1 | 127587350 | 61690470 | 65896880 | 3.824723 | 74.48506 | 2.424281 | 1.450839 | 15.4563 | 114905.6 | 4.76503 | 19 | 1e-07 | 127587350 |
| 10001 | 2002 | 1975 | 5738 | 1910 | 2 | S5 | 2003-01-17 | 2400000 | 2003 | 1 | 127587350 | 61690470 | 65896880 | 3.824723 | 74.48506 | 2.424281 | 1.450839 | 15.4563 | 114905.6 | 4.76503 | 19 | 1e-07 | 127587350 |
The regression model helps to establish a baseline. Unfortunately, when working with a large set of data, there are issues faced when running regressions:
Each of the issues can be respectively addressed:
The steps for setting up the regression, running the model, and making predictions are in the following sections.
First, the packages are loaded:
The desired data is imported into R for the regression.
The desired independent and dependent (sale_price) variables are selected with the code below and certain parameters are set to limit the amount of data.
# Make sure sale prices are numeric
df_sale_census_crime$sale_price <- as.numeric(df_sale_census_crime$sale_price)
# Remove large house sale prices; source: https://stackoverflow.com/questions/25764810/delete-rows-based-on-range-of-values-in-column
df_sale_census_crime <- df_sale_census_crime[with(df_sale_census_crime, sale_price <= 1000000), ]
#Regression 1
x <- df_sale_census_crime %>% select("land_square_feet", "PerCapitaIncome", "Unemployed", "TotalPop.x", "Men", "Women")
y <- df_sale_census_crime %>% select("sale_price")The variables are coerced into the proper types for the regression (e.g. factor variable).
x$land_square_feet <- as.numeric(x[[1]])
x$PerCapitaIncome <- as.numeric(x[[2]])
x$Unemployed <- as.numeric(x[[3]])
x$TotalPop.x <- as.numeric(x[[4]])
x$Men <- as.numeric(x[[5]])
x$Women <- as.numeric(x[[6]])
# Setting up factor variable
df_sale_census_crime$sale_year <- as.character(df_sale_census_crime$sale_year)
df_sale_census_crime$year_built <- as.character(df_sale_census_crime$year_built)
building_class <- model.matrix( ~ building_class_at_time_of_sale - 1, data=df_sale_census_crime )
sale_year <- model.matrix( ~ sale_year - 1, data=df_sale_census_crime )
year_built <- model.matrix( ~ year_built - 1, data=df_sale_census_crime )
# Binding factor variables
x<-cbind(x, building_class)
x<-cbind(x, sale_year)
x<-cbind(x, year_built)
# Add intercept column and renaming
x<-cbind(x, 1)
colnames(x)[dim(x)[2]] <- "Intercept" # Source: https://www.dummies.com/programming/r/how-to-name-matrix-rows-and-columns-in-r/
# Setup the y output variable
y <- as.numeric(y[[1]])
# Display the classes
sapply(x, class)
sapply(y, class)The data is partitioned into training and test sets for running regressions and determining the accuracy of a model on the test set respectively. For this initial regression analysis, a subset of 80% of the data is used for training and the rest is used for testing the accuracy of the predictions.
# ----- Split data into training and validation sets -----
# create a list of 80% of the rows in the original dataset we can use for training
validation_index <- sort(sample(nrow(x), nrow(x)*.8))
# select 20% of the data for validation
x_test <- x[-validation_index,]
y_test <- y[-validation_index]
# use the remaining 80% of data to training and testing the models
x_train <- x[validation_index,]
y_train <- y[validation_index]The regression is run using fastLm to expedite the process and remove the overhead from the normal regression functions.
The results are summarized for analysis.
summary(fit1)
#R-squared value is 25.63%
# Perform predictions on test set
reg_pred1 <- predict(fit1, as.matrix(x_test))
# See the comparison
results <- data.frame(cbind(y_test, reg_pred1))
results$diff <- results$reg_pred1 - results$y_test
results$diff_squared <- '^' (results$diff, 2)
# RMS =
rms = sqrt(mean(results$diff_squared))
rms
# RMS Error =
rms_error = rms/mean(results$y_test)
rms_errorThe results that were found when the regression was run for this project are summarized:
| Measure | Value |
|---|---|
| Adjusted R-squared value (\(R^2\)) | 27.26% |
| Root Mean Squared (RMS) | 198,239.1 |
| Root Mean Squared Error (RMSE) | 43.59% |
The more detailed output from the regression model is summarized below. While it may be possible to improve the regression results by running further regressions, the machine learning models explored next will likely help to capture the granularity in the data.
Regression output
> # Run regression
> fit1 <- fastLm(x_train, y_train)
> summary(fit1)
Call:
fastLm.default(X = x_train, y = y_train)
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-951240.0 -99101.0 2810.9 110820.0 1088100.0
Estimate StdErr t.value p.value
land_square_feet 2.8982e-02 7.3239e-03 3.9572e+00 7.583e-05 ***
PerCapitaIncome 1.5932e+00 5.5788e-02 2.8558e+01 < 2.2e-16 ***
Unemployed -1.2420e+04 1.6778e+02 -7.4027e+01 < 2.2e-16 ***
TotalPop.x -6.1569e+09 2.5369e-01 -2.4269e+10 < 2.2e-16 ***
Men 6.1569e+09 2.4254e-01 2.5385e+10 < 2.2e-16 ***
Women 6.1569e+09 2.9053e-01 2.1192e+10 < 2.2e-16 ***
building_class_at_time_of_saleA0 1.3952e+05 3.7534e+03 3.7171e+01 < 2.2e-16 ***
building_class_at_time_of_saleA1 1.0504e+05 1.4056e+03 7.4727e+01 < 2.2e-16 ***
building_class_at_time_of_saleA2 6.4862e+04 1.7563e+03 3.6931e+01 < 2.2e-16 ***
building_class_at_time_of_saleA3 3.3492e+05 5.4030e+03 6.1988e+01 < 2.2e-16 ***
building_class_at_time_of_saleA4 1.9749e+05 6.5229e+03 3.0276e+01 < 2.2e-16 ***
building_class_at_time_of_saleA5 3.9112e+04 1.4470e+03 2.7030e+01 < 2.2e-16 ***
building_class_at_time_of_saleA6 -1.5064e+05 7.0867e+03 -2.1257e+01 < 2.2e-16 ***
building_class_at_time_of_saleA7 1.8340e+05 5.3564e+01 3.4239e+03 < 2.2e-16 ***
building_class_at_time_of_saleA8 -1.1559e+05 1.4973e+01 -7.7196e+03 < 2.2e-16 ***
building_class_at_time_of_saleA9 6.3681e+04 2.0931e+03 3.0424e+01 < 2.2e-16 ***
building_class_at_time_of_saleB1 1.7355e+05 1.5154e+03 1.1453e+02 < 2.2e-16 ***
building_class_at_time_of_saleB2 1.5230e+05 1.4673e+03 1.0380e+02 < 2.2e-16 ***
building_class_at_time_of_saleB3 1.3706e+05 1.5737e+03 8.7097e+01 < 2.2e-16 ***
building_class_at_time_of_saleB9 1.2594e+05 2.0994e+03 5.9986e+01 < 2.2e-16 ***
building_class_at_time_of_saleC0 2.1164e+05 1.5671e+03 1.3505e+02 < 2.2e-16 ***
building_class_at_time_of_saleC1 2.3698e+05 4.3097e+03 5.4987e+01 < 2.2e-16 ***
building_class_at_time_of_saleC2 1.9560e+05 2.9009e+03 6.7429e+01 < 2.2e-16 ***
building_class_at_time_of_saleC3 2.1607e+05 2.6559e+03 8.1355e+01 < 2.2e-16 ***
building_class_at_time_of_saleC4 1.3695e+05 9.0729e+03 1.5095e+01 < 2.2e-16 ***
building_class_at_time_of_saleC5 1.9362e+05 6.7181e+03 2.8821e+01 < 2.2e-16 ***
building_class_at_time_of_saleC6 -1.3938e+05 3.8236e+02 -3.6451e+02 < 2.2e-16 ***
building_class_at_time_of_saleC7 1.6761e+05 7.7880e+03 2.1522e+01 < 2.2e-16 ***
building_class_at_time_of_saleC8 -3.9534e+05 3.8622e+00 -1.0236e+05 < 2.2e-16 ***
building_class_at_time_of_saleC9 -5.3525e+04 1.4023e+02 -3.8170e+02 < 2.2e-16 ***
building_class_at_time_of_saleCM -1.0373e+01 NA NA NA
building_class_at_time_of_saleD0 8.2579e+04 2.5786e+01 3.2024e+03 < 2.2e-16 ***
building_class_at_time_of_saleD1 -3.0391e+04 4.1447e+02 -7.3326e+01 < 2.2e-16 ***
building_class_at_time_of_saleD2 -4.5215e+05 7.4887e+00 -6.0378e+04 < 2.2e-16 ***
building_class_at_time_of_saleD3 -1.8254e+05 9.2733e+01 -1.9684e+03 < 2.2e-16 ***
building_class_at_time_of_saleD4 -1.6205e+05 7.7115e+03 -2.1014e+01 < 2.2e-16 ***
building_class_at_time_of_saleD5 9.2455e+04 7.8430e+00 1.1788e+04 < 2.2e-16 ***
building_class_at_time_of_saleD6 -1.5591e+05 6.7311e+01 -2.3162e+03 < 2.2e-16 ***
building_class_at_time_of_saleD7 -1.0718e+05 9.0214e+01 -1.1881e+03 < 2.2e-16 ***
building_class_at_time_of_saleD8 -2.3003e+04 1.2368e+01 -1.8598e+03 < 2.2e-16 ***
building_class_at_time_of_saleD9 -1.9105e+05 7.6408e+01 -2.5004e+03 < 2.2e-16 ***
building_class_at_time_of_saleE1 1.5457e+05 4.4341e+02 3.4859e+02 < 2.2e-16 ***
building_class_at_time_of_saleE2 4.3728e+05 1.0070e+01 4.3426e+04 < 2.2e-16 ***
building_class_at_time_of_saleE3 1.4641e+05 3.0501e+02 4.8002e+02 < 2.2e-16 ***
building_class_at_time_of_saleE4 1.0907e+05 5.2813e+01 2.0653e+03 < 2.2e-16 ***
building_class_at_time_of_saleE7 -2.9949e+05 1.9951e+01 -1.5011e+04 < 2.2e-16 ***
building_class_at_time_of_saleE9 1.1786e+05 8.8237e+03 1.3357e+01 < 2.2e-16 ***
building_class_at_time_of_saleF1 1.5106e+05 1.6581e+02 9.1102e+02 < 2.2e-16 ***
building_class_at_time_of_saleF2 2.1667e+05 3.0781e+01 7.0391e+03 < 2.2e-16 ***
building_class_at_time_of_saleF4 1.6584e+05 6.3631e+02 2.6062e+02 < 2.2e-16 ***
building_class_at_time_of_saleF5 9.9344e+04 1.5017e+02 6.6155e+02 < 2.2e-16 ***
building_class_at_time_of_saleF8 -2.8220e+05 1.3758e+00 -2.0512e+05 < 2.2e-16 ***
building_class_at_time_of_saleF9 1.5692e+05 8.5776e+03 1.8295e+01 < 2.2e-16 ***
building_class_at_time_of_saleG0 -7.4920e+04 9.3074e+02 -8.0494e+01 < 2.2e-16 ***
building_class_at_time_of_saleG1 1.0360e+05 1.7836e+02 5.8084e+02 < 2.2e-16 ***
building_class_at_time_of_saleG2 1.3491e+05 7.8585e+03 1.7168e+01 < 2.2e-16 ***
building_class_at_time_of_saleG3 2.1477e+05 1.5108e+01 1.4215e+04 < 2.2e-16 ***
building_class_at_time_of_saleG4 2.3700e+05 1.2326e+02 1.9228e+03 < 2.2e-16 ***
building_class_at_time_of_saleG5 1.4077e+05 5.4218e+01 2.5963e+03 < 2.2e-16 ***
building_class_at_time_of_saleG6 1.1741e+05 4.5981e+01 2.5535e+03 < 2.2e-16 ***
building_class_at_time_of_saleG7 -1.6497e+04 4.4057e+03 -3.7444e+00 0.0001808 ***
building_class_at_time_of_saleG8 3.1372e+03 2.6921e+01 1.1653e+02 < 2.2e-16 ***
building_class_at_time_of_saleG9 1.2263e+05 7.6113e+03 1.6111e+01 < 2.2e-16 ***
building_class_at_time_of_saleGU 2.8203e+05 2.6368e+00 1.0696e+05 < 2.2e-16 ***
building_class_at_time_of_saleGW 2.0901e+05 9.7865e+00 2.1357e+04 < 2.2e-16 ***
building_class_at_time_of_saleH1 -1.7314e+00 2.4147e-12 -7.1701e+11 < 2.2e-16 ***
building_class_at_time_of_saleH2 -5.1859e+05 4.9986e+03 -1.0375e+02 < 2.2e-16 ***
building_class_at_time_of_saleH3 -5.6200e+05 8.2590e+03 -6.8046e+01 < 2.2e-16 ***
building_class_at_time_of_saleH4 1.6690e+05 1.9417e+01 8.5958e+03 < 2.2e-16 ***
building_class_at_time_of_saleH6 1.1853e+04 5.5558e+00 2.1334e+03 < 2.2e-16 ***
building_class_at_time_of_saleH7 -2.1248e+05 1.0631e+01 -1.9987e+04 < 2.2e-16 ***
building_class_at_time_of_saleH8 7.7614e-01 NA NA NA
building_class_at_time_of_saleH9 2.4038e+04 2.7452e+01 8.7564e+02 < 2.2e-16 ***
building_class_at_time_of_saleHR 2.6907e+05 2.6293e+01 1.0234e+04 < 2.2e-16 ***
building_class_at_time_of_saleI1 -4.2960e+05 1.5147e+01 -2.8361e+04 < 2.2e-16 ***
building_class_at_time_of_saleI3 2.0919e+05 2.7345e+00 7.6501e+04 < 2.2e-16 ***
building_class_at_time_of_saleI4 -5.5951e+04 1.7755e+01 -3.1513e+03 < 2.2e-16 ***
building_class_at_time_of_saleI5 1.7474e+05 5.9982e+01 2.9132e+03 < 2.2e-16 ***
building_class_at_time_of_saleI6 1.1961e+05 2.4592e+01 4.8636e+03 < 2.2e-16 ***
building_class_at_time_of_saleI7 6.3836e+04 1.0070e+02 6.3393e+02 < 2.2e-16 ***
building_class_at_time_of_saleI9 8.6528e+04 4.7252e+01 1.8312e+03 < 2.2e-16 ***
building_class_at_time_of_saleJ1 -2.8717e+05 9.3349e+00 -3.0763e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ2 -4.9867e+05 5.4591e+00 -9.1348e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ4 -9.4846e+04 2.5902e+01 -3.6618e+03 < 2.2e-16 ***
building_class_at_time_of_saleJ6 -5.4367e+05 2.1236e+01 -2.5602e+04 < 2.2e-16 ***
building_class_at_time_of_saleJ7 -4.6257e+05 3.0157e+00 -1.5339e+05 < 2.2e-16 ***
building_class_at_time_of_saleJ9 -8.2693e-01 NA NA NA
building_class_at_time_of_saleK1 1.3066e+05 5.0570e+03 2.5838e+01 < 2.2e-16 ***
building_class_at_time_of_saleK2 1.8075e+05 8.3157e+03 2.1736e+01 < 2.2e-16 ***
building_class_at_time_of_saleK3 -4.7864e+05 1.6826e+00 -2.8447e+05 < 2.2e-16 ***
building_class_at_time_of_saleK4 2.0110e+05 7.0629e+03 2.8473e+01 < 2.2e-16 ***
building_class_at_time_of_saleK5 1.5896e+05 1.5770e+02 1.0080e+03 < 2.2e-16 ***
building_class_at_time_of_saleK6 5.7383e+04 1.8493e+01 3.1029e+03 < 2.2e-16 ***
building_class_at_time_of_saleK7 1.8497e+05 4.8547e+01 3.8101e+03 < 2.2e-16 ***
building_class_at_time_of_saleK8 -3.5744e+05 2.2884e+01 -1.5620e+04 < 2.2e-16 ***
building_class_at_time_of_saleK9 1.6681e+05 8.5161e+03 1.9588e+01 < 2.2e-16 ***
building_class_at_time_of_saleL1 -3.2798e+05 8.1924e+01 -4.0035e+03 < 2.2e-16 ***
building_class_at_time_of_saleL2 -4.6219e+05 2.0444e+01 -2.2608e+04 < 2.2e-16 ***
building_class_at_time_of_saleL3 -1.0495e+00 NA NA NA
building_class_at_time_of_saleL8 -7.9144e+04 1.0960e+02 -7.2209e+02 < 2.2e-16 ***
building_class_at_time_of_saleL9 4.3745e+04 3.5515e+01 1.2318e+03 < 2.2e-16 ***
building_class_at_time_of_saleM1 1.5553e+05 6.4322e+02 2.4180e+02 < 2.2e-16 ***
building_class_at_time_of_saleM2 4.7448e+05 1.3285e+00 3.5715e+05 < 2.2e-16 ***
building_class_at_time_of_saleM3 1.8350e+05 2.2997e+01 7.9795e+03 < 2.2e-16 ***
building_class_at_time_of_saleM4 3.5926e+05 1.0860e+01 3.3079e+04 < 2.2e-16 ***
building_class_at_time_of_saleM9 1.7297e+05 1.9606e+02 8.8223e+02 < 2.2e-16 ***
building_class_at_time_of_saleN1 -2.5565e+05 1.2544e+00 -2.0381e+05 < 2.2e-16 ***
building_class_at_time_of_saleN2 -1.0812e+05 3.7291e+01 -2.8993e+03 < 2.2e-16 ***
building_class_at_time_of_saleN3 3.8823e+05 1.4154e+00 2.7429e+05 < 2.2e-16 ***
building_class_at_time_of_saleN4 2.5708e+05 2.3806e+00 1.0799e+05 < 2.2e-16 ***
building_class_at_time_of_saleN9 7.9835e+04 4.0318e+01 1.9802e+03 < 2.2e-16 ***
building_class_at_time_of_saleO1 9.8493e+04 5.4205e+01 1.8170e+03 < 2.2e-16 ***
building_class_at_time_of_saleO2 5.6520e+04 8.4064e+01 6.7235e+02 < 2.2e-16 ***
building_class_at_time_of_saleO3 -2.5890e+05 7.4544e+01 -3.4732e+03 < 2.2e-16 ***
building_class_at_time_of_saleO4 -2.5858e+05 4.7554e+01 -5.4377e+03 < 2.2e-16 ***
building_class_at_time_of_saleO5 2.0678e+05 6.5123e+01 3.1753e+03 < 2.2e-16 ***
building_class_at_time_of_saleO6 -1.8394e+05 1.6759e+02 -1.0975e+03 < 2.2e-16 ***
building_class_at_time_of_saleO7 1.5411e+05 9.9111e+02 1.5549e+02 < 2.2e-16 ***
building_class_at_time_of_saleO8 1.8789e+05 2.4518e+02 7.6634e+02 < 2.2e-16 ***
building_class_at_time_of_saleO9 1.2536e+05 9.2309e+03 1.3580e+01 < 2.2e-16 ***
building_class_at_time_of_saleP1 -5.0139e+05 4.3587e+00 -1.1503e+05 < 2.2e-16 ***
building_class_at_time_of_saleP2 1.6676e+05 4.9800e+01 3.3487e+03 < 2.2e-16 ***
building_class_at_time_of_saleP3 4.0284e+05 9.5488e+00 4.2188e+04 < 2.2e-16 ***
building_class_at_time_of_saleP5 1.1493e+05 3.6320e+01 3.1643e+03 < 2.2e-16 ***
building_class_at_time_of_saleP6 4.5337e+05 1.9562e+01 2.3177e+04 < 2.2e-16 ***
building_class_at_time_of_saleP7 2.5193e+05 3.6547e+00 6.8933e+04 < 2.2e-16 ***
building_class_at_time_of_saleP8 -2.4175e+04 1.0560e+01 -2.2893e+03 < 2.2e-16 ***
building_class_at_time_of_saleP9 9.8652e+04 4.2028e+01 2.3473e+03 < 2.2e-16 ***
building_class_at_time_of_saleQ1 -6.3134e+05 2.0798e+01 -3.0355e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ2 -1.9214e+04 7.9179e+00 -2.4267e+03 < 2.2e-16 ***
building_class_at_time_of_saleQ3 -2.1168e+05 3.9467e+00 -5.3634e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ8 3.3311e+05 5.2588e+00 6.3344e+04 < 2.2e-16 ***
building_class_at_time_of_saleQ9 5.6611e+04 2.5971e+01 2.1798e+03 < 2.2e-16 ***
building_class_at_time_of_saleR0 -6.2189e+04 4.0096e+01 -1.5510e+03 < 2.2e-16 ***
building_class_at_time_of_saleR1 -2.1086e+05 8.5324e+00 -2.4713e+04 < 2.2e-16 ***
building_class_at_time_of_saleR4 1.0408e+05 2.3851e+00 4.3635e+04 < 2.2e-16 ***
building_class_at_time_of_saleR9 -1.3032e+05 7.4740e+01 -1.7436e+03 < 2.2e-16 ***
building_class_at_time_of_saleRR -2.9996e+05 4.0817e+01 -7.3490e+03 < 2.2e-16 ***
building_class_at_time_of_saleS0 1.3380e+05 1.0122e+02 1.3220e+03 < 2.2e-16 ***
building_class_at_time_of_saleS1 1.2850e+05 3.9465e+03 3.2560e+01 < 2.2e-16 ***
building_class_at_time_of_saleS2 1.7599e+05 2.8663e+03 6.1399e+01 < 2.2e-16 ***
building_class_at_time_of_saleS3 1.9130e+05 6.4023e+03 2.9880e+01 < 2.2e-16 ***
building_class_at_time_of_saleS4 2.2695e+05 6.7895e+03 3.3426e+01 < 2.2e-16 ***
building_class_at_time_of_saleS5 2.2889e+05 7.6276e+03 3.0009e+01 < 2.2e-16 ***
building_class_at_time_of_saleS9 2.2762e+05 6.1226e+03 3.7177e+01 < 2.2e-16 ***
building_class_at_time_of_saleT2 2.3426e+04 2.9571e+00 7.9217e+03 < 2.2e-16 ***
building_class_at_time_of_saleT9 -6.9349e+04 8.5254e+00 -8.1344e+03 < 2.2e-16 ***
building_class_at_time_of_saleU7 -1.1867e+05 3.5107e+00 -3.3803e+04 < 2.2e-16 ***
building_class_at_time_of_saleV0 -2.2715e+04 3.7162e+03 -6.1123e+00 9.828e-10 ***
building_class_at_time_of_saleV1 -6.4640e+04 7.6833e+03 -8.4130e+00 < 2.2e-16 ***
building_class_at_time_of_saleV2 -5.6667e+04 7.3045e+01 -7.7578e+02 < 2.2e-16 ***
building_class_at_time_of_saleV3 -6.8084e+04 9.0265e+01 -7.5427e+02 < 2.2e-16 ***
building_class_at_time_of_saleV5 3.5594e+05 4.0220e+00 8.8499e+04 < 2.2e-16 ***
building_class_at_time_of_saleV9 -1.1566e+05 6.3938e+01 -1.8090e+03 < 2.2e-16 ***
building_class_at_time_of_saleW1 -1.3223e+05 2.2746e+01 -5.8133e+03 < 2.2e-16 ***
building_class_at_time_of_saleW2 6.7435e+04 2.2155e+01 3.0438e+03 < 2.2e-16 ***
building_class_at_time_of_saleW3 -4.6763e+04 4.0345e+01 -1.1591e+03 < 2.2e-16 ***
building_class_at_time_of_saleW4 -7.7198e+04 2.8433e+00 -2.7151e+04 < 2.2e-16 ***
building_class_at_time_of_saleW6 -5.9114e+05 5.1179e+00 -1.1550e+05 < 2.2e-16 ***
building_class_at_time_of_saleW7 -3.1586e+05 8.8954e+00 -3.5508e+04 < 2.2e-16 ***
building_class_at_time_of_saleW8 2.2198e+05 2.7235e+01 8.1506e+03 < 2.2e-16 ***
building_class_at_time_of_saleW9 1.4626e+05 4.4200e+01 3.3089e+03 < 2.2e-16 ***
building_class_at_time_of_saleY1 -2.5078e+05 1.5060e+00 -1.6652e+05 < 2.2e-16 ***
building_class_at_time_of_saleY2 -3.4598e+05 7.8875e+00 -4.3864e+04 < 2.2e-16 ***
building_class_at_time_of_saleY4 -3.1975e+05 1.7551e+00 -1.8218e+05 < 2.2e-16 ***
building_class_at_time_of_saleY5 -4.7741e+05 3.0906e+00 -1.5447e+05 < 2.2e-16 ***
building_class_at_time_of_saleY6 -1.1746e+05 1.9593e+01 -5.9949e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ0 -1.3031e+05 4.8316e+01 -2.6969e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ1 -3.3428e+05 3.9923e+00 -8.3731e+04 < 2.2e-16 ***
building_class_at_time_of_saleZ2 -2.1807e+05 2.8958e+01 -7.5306e+03 < 2.2e-16 ***
building_class_at_time_of_saleZ3 -1.2179e+05 3.0810e+00 -3.9531e+04 < 2.2e-16 ***
building_class_at_time_of_saleZ8 -2.0401e+05 1.4632e+00 -1.3942e+05 < 2.2e-16 ***
building_class_at_time_of_saleZ9 4.2683e+04 7.9866e+03 5.3443e+00 9.081e-08 ***
sale_year2003 -1.0892e+05 1.1024e+03 -9.8802e+01 < 2.2e-16 ***
sale_year2004 -5.1748e+04 1.0476e+03 -4.9395e+01 < 2.2e-16 ***
sale_year2005 1.2800e+04 1.0659e+03 1.2008e+01 < 2.2e-16 ***
sale_year2006 5.6075e+04 1.1214e+03 5.0002e+01 < 2.2e-16 ***
sale_year2007 5.4919e+04 1.2462e+03 4.4070e+01 < 2.2e-16 ***
sale_year2008 8.8428e+03 1.3830e+03 6.3940e+00 1.618e-10 ***
sale_year2009 -3.6373e+04 1.4479e+03 -2.5120e+01 < 2.2e-16 ***
sale_year2010 -3.7572e+04 1.4401e+03 -2.6089e+01 < 2.2e-16 ***
sale_year2011 -3.6054e+04 1.5219e+03 -2.3690e+01 < 2.2e-16 ***
sale_year2012 -2.8094e+04 1.4739e+03 -1.9061e+01 < 2.2e-16 ***
sale_year2013 -8.9054e+03 1.3906e+03 -6.4040e+00 1.515e-10 ***
sale_year2014 1.5791e+04 1.3950e+03 1.1320e+01 < 2.2e-16 ***
sale_year2015 3.2799e+04 1.3994e+03 2.3437e+01 < 2.2e-16 ***
sale_year2016 6.7075e+04 2.3071e+03 2.9073e+01 < 2.2e-16 ***
sale_year2017 7.8186e+04 1.7508e+03 4.4658e+01 < 2.2e-16 ***
sale_year2019 1.3103e+05 1.6770e+03 7.8134e+01 < 2.2e-16 ***
sale_year2020 1.5247e+05 3.2118e+03 4.7472e+01 < 2.2e-16 ***
year_built0 -1.3214e+04 4.1824e+03 -3.1593e+00 0.0015815 **
year_built1380 -5.1417e+05 1.2867e+00 -3.9962e+05 < 2.2e-16 ***
year_built1800 -2.3758e+04 7.1415e+01 -3.3268e+02 < 2.2e-16 ***
year_built1814 -1.2841e+05 2.4966e+00 -5.1435e+04 < 2.2e-16 ***
year_built1825 -5.7477e+05 2.6333e+00 -2.1828e+05 < 2.2e-16 ***
year_built1826 9.3972e+04 4.4656e+00 2.1044e+04 < 2.2e-16 ***
year_built1827 7.1977e+05 4.6640e+00 1.5432e+05 < 2.2e-16 ***
year_built1838 -4.0203e+05 3.1975e+00 -1.2573e+05 < 2.2e-16 ***
year_built1839 -5.6975e+05 2.6778e+00 -2.1277e+05 < 2.2e-16 ***
year_built1840 2.8993e+04 1.3322e+00 2.1764e+04 < 2.2e-16 ***
[ reached getOption("max.print") -- omitted 155 rows ]
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.963e+05 on 349122 degrees of freedom
Multiple R-squared: 0.2733, Adjusted R-squared: 0.2726
> #R-squared value is 25.63%
>
> # Perform predictions on test set
> reg_pred1 <- predict(fit1, as.matrix(x_test))
>
> # See the comparison
> results <- data.frame(cbind(y_test, reg_pred1))
> results$diff <- results$reg_pred1 - results$y_test
> results$diff_squared <- '^' (results$diff, 2)
> # RMS =
> rms = sqrt(mean(results$diff_squared))
> rms
[1] 198239.1
>
> # RMS Error =
> rms_error = rms/mean(results$y_test)
> rms_error
[1] 0.4358773
Add the required libraries. Also, load the file Data_sale_census_crime.rda (df_sale_census_crime) that was created in “04D: Data_Processing_Single_Dataframe”.
#Attach packages
library(caret)
library(dplyr)
library(lubridate)
library(deepnet)
load(file = "Data_sale_census_crime.rda")
Remove rows that have zero value in any column. Also, make sure that the sale_price column is numeric.
#Remove values of 0
#source: https://stackoverflow.com/questions/9977686/how-to-remove-rows-with-any-zero-value
zero_rows = apply(df_sale_census_crime, 1, function(row) all(row != 0))
df_sale_census_crime <- df_sale_census_crime[zero_rows, ]
# Make sure sale prices are numeric
df_sale_census_crime$sale_price <- as.numeric(df_sale_census_crime$sale_price)
Instead of looking at the entire data we will look at sale_price < $ 1M and 5M (90% and 97% of the data respectively). This value can be changed through this line of code.
source: https://stackoverflow.com/questions/25764810/delete-rows-based-on-range-of-values-in-column
df_sale_census_crime <- df_sale_census_crime[with(df_sale_census_crime, sale_price <= 5000000), ]
Instead of using the year_built of a house as an input variable, create another column, “age” of the building. This is calculated as below. Also, remember to remove all zero rows again.
#Adding a column about age of building
df_sale_census_crime$age <- df_sale_census_crime$sale_year - df_sale_census_crime$year_built
#Remove negative age values
df_sale_census_crime <- df_sale_census_crime[age>0, ]
#Remove values of 0
#source: https://stackoverflow.com/questions/9977686/how-to-remove-rows-with-any-zero-value
zero_rows = apply(df_sale_census_crime, 1, function(row) all(row != 0))
df_sale_census_crime <- df_sale_census_crime[zero_rows, ]
Instead of using the entire database, use 1% of the data for trial run. This can be incrementally increased to 20% and then 100% of the data.
setup_stage <- sort(sample(nrow(df_sale_census_crime), nrow(df_sale_census_crime)*.01))
df_sale_census_crime <- df_sale_census_crime[setup_stage]
Make sure that gross_square_feet is numeric and all NA values are removed from the dataframe.
Select data of sales from 2003 to 2013 for training and data from beyond 2013 for testing. Variable y is defined to include just the sale_price; variable y_test is the testing component of y. Variable is a matrix of all the input variables for the models; variable x_test is the testing component of x. Make sure that all the variables are formatted in the right format and all NA rows are removed before proceeding.
summary(df_sale_census_crime)
# select sales from 2003 to 2013 for validation and >2013 for testing
df_sale_census_crime_test <- df_sale_census_crime[sale_year > 2013,]
# use data from 2003 to 2013 as training data
df_sale_census_crime_train <- df_sale_census_crime[sale_year <= 2013,]
# Data setup
# Setup the y output variable
y <- df_sale_census_crime_train %>% select("sale_price")
y <- as.numeric(y[[1]]) # Making sure it's numeric
y_test <- df_sale_census_crime_test %>% select("sale_price")
# Setup the input variable
x <- df_sale_census_crime_train %>% select("gross_square_feet","PerCapitaIncome", "Unemployed", "TotalPop.x", "Hispanic", "White", "Black", "Native", "Asian", "age", "weight", "sale_year")
x_test <- df_sale_census_crime_test %>% select("gross_square_feet","PerCapitaIncome", "Unemployed", "TotalPop.x", "Hispanic", "White", "Black", "Native", "Asian", "age", "weight", "sale_year")
# NEED to make sure each column is the right format (numeric / factor as appropriate)
# BEFORE
sapply(x, class)
x$gross_square_feet <- as.numeric(x$gross_square_feet)
x$sale_year <- as.numeric(x$sale_year)
# AFTER (CHECK that these all make sense)
sapply(x, class)
#Same for x_test
sapply(x_test, class)
x_test$sale_year <- as.numeric(x_test$sale_year)
any(is.na(x_test))Set the control method of the model as cross validation (cv) or none. It is advisable to do 5-fold or 10-fold cross validation. Cross validation takes a much longer time to run then just running without any cross calidation. Root Mean Square (RMSE) is being used here as the parameter to optimize the models. Mean Average Error (MAE) can also be used instead.
# Source: https://machinelearningmastery.com/k-fold-cross-validation/
control <- trainControl(method="cv", number=10)
metric <- "RMSE"
Run the neural network, deep neural net, random forest, bagged cart and k-nearest neighbours models as shown below.
# Neural Network model
set.seed(7)
fit.nnet <- train(x, y, method="nnet", metric=metric, trControl=control, linout=TRUE, preProcess="pca") # Need linout TRUE: https://stackoverflow.com/questions/21622975/how-to-model-a-neural-network-through-the-use-of-caret-r
# Deep Neural Net
set.seed(7)
fit.dnn <- train(x, y, output="linear", metric=metric, trControl=control, linout=TRUE, preProcess="pca") # Need linout TRUE: https://stackoverflow.com/questions/21622975/how-to-model-a-neural-network-through-the-use-of-caret-r
# Random Forest
set.seed(7)
fit.rf <- train(x, y, method="rf", metric=metric, trControl=control, preProcess="pca")
#Trees using genetic algorithm- This is taking extremely long amounts of time so aborting this model
#set.seed(7)
#fit.ga <- train(x, y, method="evtree", metric=metric, trControl=control, preProcess="pca")
#Adaptive- Network based fuzzy inference system-This is taking extremely long amounts of time so aborting this model
#install.packages("frbs")
#library(frbs)
#set.seed(7)
#fit.anfis <- train(x, y, method="ANFIS", metric=metric, trControl=control, preProcess="pca")
#Bagged CART
set.seed(7)
fit.bgcrt <- train(x, y, method="treebag", metric=metric, trControl=control, preProcess="pca")
#KNN
set.seed(7)
fit.knn <- train(x, y, method="knn", metric=metric, trControl=control, preProcess="pca")Let’s see how our models performed in 5-fold cross validation with 1% data (sale_price < 1M), 1% data (sale_price < 5M) and 100% data (sale_price < 1M, only for neural network and bagged cart).
results <- resamples(list(nnet=fit.nnet, dnn=fit.dnn, rf=fit.rf, cart=fit.bgcrt, knn=fit.knn))
summary(results)| metric | neural_net1 | deep_neural_net1 | random_forest1 | bagged_cart1 | k_nearest_neighbors1 |
|---|---|---|---|---|---|
| RMSE | 525659.600 | 410711.800 | 405804.900 | 437638.00 | 444834.500 |
| MAE | 280704.900 | 214190.900 | 213700.300 | 241710.50 | 227947.200 |
| R-squared | 0.067 | 0.422 | 0.442 | 0.35 | 0.334 |
| metric | neural_net2 | deep_neural_net2 | random_forest2 | bagged_cart2 | k_nearest_neighbors2 |
|---|---|---|---|---|---|
| RMSE | 211077.400 | 193911.100 | 193911.100 | 200277.800 | 194873.80 |
| MAE | 164014.600 | 142400.700 | 142400.700 | 151958.000 | 144403.90 |
| R-squared | 0.107 | 0.256 | 0.256 | 0.196 | 0.25 |
| metric | neural_net3 | bagged_cart3 |
|---|---|---|
| RMSE | 208134.700 | 200819.900 |
| MAE | 161860.400 | 154242.500 |
| R-squared | 0.097 | 0.159 |
Run the models on the x_test dataset to get the predicted sales price value.
# compare the predictions
nnet_pred <- predict(fit.nnet, x_test)
dnn_pred <- predict(fit.dnn, x_test)
rf_pred <- predict(fit.rf, x_test)
bgcrt_pred <- predict(fit.bgcrt, x_test)
knn_pred <- predict(fit.knn, x_test)
Find the difference between the predicted values and the y_test values and calculate the RMSE and MAE values as shown below.
results3 <- data.frame(cbind(y_test, nnet_pred, dnn_pred, rf_pred, bgcrt_pred, knn_pred))
#RMSE on test data for each model
#Calculating diff from actual values
results3$diff_nnet <- abs(results3$sale_price - results3$nnet_pred)
results3$diff_dnn <- abs(results3$sale_price - results3$dnn_pred)
results3$diff_rf <- abs(results3$sale_price - results3$rf_pred)
results3$diff_bgcrt <- abs(results3$sale_price - results3$bgcrt_pred)
results3$diff_knn <- abs(results3$sale_price - results3$knn_pred)
#Calculating RMSE
RMSE_nnet <- sqrt(mean('^'(results3$diff_nnet,2))) #705,330
RMSE_dnn <- sqrt(mean('^'(results3$diff_dnn,2))) #697,094
RMSE_rf <- sqrt(mean('^'(results3$diff_rf,2))) #689,789
RMSE_bgcrt <- sqrt(mean('^'(results3$diff_bgcrt,2))) #634,809
RMSE_knn <- sqrt(mean('^'(results3$diff_knn,2))) #641,702
#Calculating MAE
MAE_nnet <- mean(results3$diff_nnet) #401,590
MAE_dnn <- mean(results3$diff_dnn) #466,828
MAE_rf <- mean(results3$diff_rf) #462,907
MAE_bgcrt <- mean(results3$diff_bgcrt) #401,397
MAE_knn <- mean(results3$diff_knn) #363,609
Calculate the MAE %error and RMSE %error by dividing the MAE and RMSE values by mean sales price in y_test. This gives a figure which can be easily compared across models. Let’s see what the results are.
#Calculating %error (RMSE/Average price)
error_nnet <- RMSE_nnet/mean(results3$sale_price)
error_dnn <- RMSE_dnn/mean(results3$sale_price)
error_rf <- RMSE_rf/mean(results3$sale_price)
error_bgcrt <- RMSE_bgcrt/mean(results3$sale_price)
error_knn <- RMSE_knn/mean(results3$sale_price)
#Calculating %error (MAE/Average price)
error2_nnet <- MAE_nnet/mean(results3$sale_price)
error2_dnn <- MAE_dnn/mean(results3$sale_price)
error2_rf <- MAE_rf/mean(results3$sale_price)
error2_bgcrt <- MAE_bgcrt/mean(results3$sale_price)
error2_knn <- MAE_knn/mean(results3$sale_price) | metric4 | neural_net4 | deep_neural_net4 | random_forest4 | bagged_cart4 | k_nearest_neighbors4 |
|---|---|---|---|---|---|
| RMSE/mean | 0.92 | 0.91 | 0.90 | 0.83 | 0.84 |
| MAE/mean | 0.53 | 0.61 | 0.61 | 0.53 | 0.48 |
| metric4 | neural_net4 | deep_neural_net4 | random_forest4 | bagged_cart4 | k_nearest_neighbors4 |
|---|---|---|---|---|---|
| RMSE/mean | 0.92 | 0.91 | 0.90 | 0.83 | 0.84 |
| MAE/mean | 0.53 | 0.61 | 0.61 | 0.53 | 0.48 |
| metric4 | neural_net6 | deep_neural_net6 | random_forest6 | bagged_cart6 |
|---|---|---|---|---|
| RMSE/mean | 0.49 | 0.46 | 0.46 | 0.49 |
| MAE/mean | 0.40 | 0.37 | 0.37 | 0.39 |
As we can see, the model improves a lot when we only look at sale_prices < $ 1M. The final %MAE errors were in the range of 35-40% with the random forest and deep neural net performing the best.
This section describes the implmentation of our shiny app visalization, which can be viewed live on shinyapps.io.
Before booting up the shiny app, we’ll perform a few more processing steps on the data and set up our polygon shapefile for drawing the zip code regions on a leaflet map.
We start by loading in the semi-processed crime data and then filtering out data before 2007, as upon closer inspection, the complaints from before that point seems to have been reported differently and would skew graphs we plan to show in the visualization.
load("Data_Score_by_Time_and_Rating.rda")
# Remove NAs and outliers
score_by_time_and_rating = score_by_time_and_rating %>%
filter(!is.na(weight)) %>%
filter(month > "2006-12-31")Next, we normalize the crime score by mulitplying by 10000 (to make the minimum score greater than 1), taking it’s log, and then dividing by the maximum value to get an approximately normal distribution from 0 to 1.
score_by_time_and_rating$weight_transform = log((score_by_time_and_rating$weight)*100000)
score_by_time_and_rating$weight_normalized = score_by_time_and_rating$weight_transform/max(score_by_time_and_rating$weight_transform)
hist(score_by_time_and_rating$weight_normalized)To display the 2015 census data for each zip code, we grab all the unique zip codes from the score_by_time_and_rating data frame, and their assciated demographic and economic characteristics.
unique_census = score_by_time_and_rating %>%
group_by(zip_code) %>%
summarize(
PerCapitaIncome = PerCapitaIncome[1],
Unemployed = Unemployed[1],
TotalPop = TotalPop[1],
Hispanic = Hispanic[1],
White = White[1],
Black = Black[1],
Native = Native[1],
Asian = Asian[1],
weight = weight[1]
)| zip_code | PerCapitaIncome | Unemployed | TotalPop | Hispanic | White | Black | Native | Asian | weight |
|---|---|---|---|---|---|---|---|---|---|
| 10001 | 100234.00 | 6.768366 | 54931 | 11.217202 | 59.960081 | 7.1104367 | 0.3122954 | 18.801769 | 0.0302743 |
| 10002 | 28315.73 | 9.336805 | 89904 | 27.622018 | 20.842063 | 7.8137391 | 0.1529887 | 40.790532 | 0.0141484 |
| 10003 | 83351.19 | 5.529082 | 78507 | 6.906509 | 72.975596 | 3.8197919 | 0.2961048 | 12.900415 | 0.0179220 |
| 10004 | 97170.22 | 3.899262 | 21681 | 9.655233 | 65.617859 | 1.9047138 | 0.0000000 | 20.412117 | 0.0024445 |
| 10005 | 86428.86 | 3.880739 | 17798 | 6.414541 | 68.356489 | 2.9669963 | 0.2072705 | 20.326306 | 0.0043825 |
| 10006 | 93787.00 | 3.300000 | 4374 | 10.600000 | 66.000000 | 1.9000000 | 0.0000000 | 19.500000 | 0.0182899 |
| 10007 | 119242.13 | 3.674118 | 45430 | 8.511321 | 60.119892 | 4.3140678 | 0.0501695 | 22.972307 | 0.0057451 |
| 10009 | 51209.57 | 6.516734 | 85886 | 20.749042 | 54.242942 | 6.9598060 | 0.1141618 | 14.408297 | 0.0096873 |
| 10010 | 92893.21 | 5.785252 | 69574 | 11.339020 | 66.640557 | 4.0195015 | 0.3870339 | 14.968094 | 0.0093857 |
| 10011 | 106752.14 | 4.910633 | 82322 | 10.392998 | 70.414786 | 5.0095126 | 0.3011091 | 11.063606 | 0.0158281 |
| 10012 | 100495.24 | 4.410209 | 49380 | 5.790992 | 68.621183 | 2.2355508 | 0.1364682 | 19.006819 | 0.0114621 |
| 10013 | 113579.10 | 4.277498 | 43974 | 6.965232 | 58.260233 | 3.7364693 | 0.1236640 | 27.022588 | 0.0166007 |
| 10014 | 104205.84 | 5.425317 | 46324 | 8.411761 | 78.552526 | 2.1064567 | 0.2773681 | 7.920266 | 0.0145281 |
| 10016 | 85612.76 | 5.328540 | 71412 | 10.220809 | 65.417396 | 4.2940822 | 0.0167325 | 17.410382 | 0.0085420 |
| 10017 | 110212.58 | 3.163597 | 40598 | 6.453774 | 72.890379 | 1.7914602 | 0.0782255 | 16.572893 | 0.0081777 |
| 10018 | 91959.49 | 6.145309 | 26529 | 13.087546 | 54.653455 | 5.5875118 | 0.0715368 | 23.273821 | 0.0339628 |
| 10019 | 90112.07 | 5.688635 | 61795 | 17.626635 | 58.933995 | 4.8051477 | 0.0796909 | 15.995618 | 0.0165386 |
| 10020 | 107136.40 | 4.164439 | 6389 | 11.773767 | 60.485976 | 3.0516669 | 0.3054938 | 22.833793 | 0.0075129 |
| 10021 | 115294.98 | 3.470039 | 72771 | 6.231300 | 81.083166 | 1.7386486 | 0.0469473 | 8.290485 | 0.0037790 |
| 10022 | 130688.86 | 4.342849 | 47254 | 5.793163 | 80.830304 | 2.1141660 | 0.1132920 | 9.954484 | 0.0140517 |
| 10023 | 107509.81 | 4.706067 | 73211 | 8.576637 | 75.611438 | 2.5184795 | 0.0421235 | 10.600240 | 0.0078540 |
| 10024 | 97532.70 | 4.252122 | 81434 | 13.985623 | 71.642281 | 4.5466660 | 0.0342228 | 7.268941 | 0.0075767 |
| 10025 | 58202.50 | 8.159557 | 106152 | 21.748584 | 55.391525 | 10.1917854 | 0.1149446 | 9.581334 | 0.0113422 |
| 10026 | 33252.99 | 8.963258 | 56175 | 24.506978 | 21.874485 | 45.3300970 | 0.0698709 | 5.677186 | 0.0151313 |
| 10027 | 28991.62 | 7.676539 | 88710 | 23.809475 | 23.575135 | 41.1918916 | 0.1541066 | 8.754981 | 0.0200654 |
| 10028 | 112773.22 | 3.862112 | 83249 | 6.478951 | 83.344795 | 1.2763817 | 0.1274238 | 6.597452 | 0.0047808 |
| 10029 | 27052.87 | 11.629428 | 91158 | 48.233006 | 16.363058 | 24.2480781 | 0.0000000 | 8.805455 | 0.0194497 |
| 10030 | 26133.35 | 13.623160 | 46266 | 23.605652 | 10.925371 | 59.4541218 | 0.3181429 | 2.453601 | 0.0180694 |
| 10031 | 22642.82 | 10.759977 | 77692 | 52.828847 | 13.299143 | 27.7892524 | 0.0684202 | 3.125435 | 0.0181228 |
| 10032 | 19268.96 | 13.613160 | 97708 | 70.033369 | 10.336216 | 14.4302462 | 0.0617043 | 3.075249 | 0.0125271 |
| 10033 | 23814.46 | 12.810512 | 79100 | 68.683254 | 21.558509 | 4.5760455 | 0.0746119 | 3.380865 | 0.0137674 |
| 10034 | 23260.78 | 15.452328 | 46763 | 74.138492 | 17.012326 | 4.3897590 | 0.0000000 | 2.229989 | 0.0105425 |
| 10035 | 22672.20 | 10.971860 | 62345 | 42.117809 | 11.437370 | 38.2822263 | 0.0805245 | 5.174718 | 0.0193761 |
| 10036 | 72432.87 | 6.843628 | 31652 | 19.769809 | 54.093994 | 6.1912991 | 0.0616643 | 17.260862 | 0.0325098 |
| 10037 | 25108.63 | 13.591314 | 54422 | 23.733512 | 8.588791 | 61.6120172 | 0.2243192 | 2.784892 | 0.0066150 |
| 10038 | 52062.64 | 6.983886 | 44669 | 10.349260 | 35.814330 | 5.2738163 | 1.0839464 | 44.768287 | 0.0049475 |
| 10039 | 23355.00 | 18.036523 | 46042 | 33.090024 | 9.100765 | 53.3986860 | 0.4099778 | 1.598367 | 0.0136614 |
| 10040 | 24461.43 | 13.019628 | 75427 | 69.108541 | 21.784732 | 5.0114588 | 0.0686094 | 2.716590 | 0.0070930 |
| 10044 | 48284.78 | 5.483773 | 11783 | 14.601986 | 43.112908 | 13.9744972 | 0.0000000 | 24.635772 | 0.0059408 |
| 10065 | 120842.61 | 3.969969 | 34215 | 5.147400 | 82.168511 | 2.8653251 | 0.0998509 | 8.221274 | 0.0132398 |
| 10069 | 89272.56 | 5.022828 | 26752 | 10.560728 | 71.359640 | 3.9002841 | 0.0000000 | 10.937840 | 0.0002243 |
| 10075 | 116356.48 | 4.586323 | 72466 | 6.518072 | 84.444395 | 0.9766125 | 0.0493142 | 5.592533 | 0.0021803 |
| 10111 | 118143.00 | 6.200000 | 982 | 7.100000 | 72.800000 | 2.2000000 | 0.0000000 | 17.800000 | 0.0020367 |
| 10115 | 45187.00 | 5.100000 | 5069 | 7.800000 | 68.200000 | 7.6000000 | 0.0000000 | 14.600000 | 0.0007891 |
| 10128 | 89075.02 | 4.399807 | 75267 | 11.503636 | 71.495084 | 4.0066762 | 0.0000000 | 9.897987 | 0.0064969 |
| 10280 | 111236.72 | 3.224324 | 15762 | 10.034234 | 64.906306 | 1.3563063 | 0.0000000 | 18.871171 | 0.0015226 |
| 10281 | 115333.00 | 2.700000 | 8875 | 7.500000 | 69.800000 | 1.4000000 | 0.0000000 | 15.900000 | 0.0012394 |
| 10282 | 125477.38 | 2.737017 | 14091 | 8.832595 | 70.318231 | 2.7696118 | 0.0000000 | 12.827628 | 0.0020581 |
| 10301 | 33413.80 | 7.340241 | 67183 | 22.126743 | 49.492686 | 19.5170013 | 0.1022818 | 6.106892 | 0.0107170 |
| 10302 | 27652.69 | 5.647956 | 33898 | 31.306803 | 43.054755 | 16.3130981 | 0.0431707 | 6.997079 | 0.0083781 |
Now that we’ve extracted the census data, we can remove all those unecessary columns from the score_by_time_and_rating and reformat month column as a character to match the style of the grouped_housing data frame. We save this as a new data frame that we’ll load into our Shiny app.
crime_scores = score_by_time_and_rating %>%
mutate(month_char = format(as.Date(month), "%Y-%m"))%>%
data.frame() %>%
select(-c("weight","weight_transform", "sum_weight","Men", "Women", "Hispanic", "White", "Black", "Native", "Asian", "TotalPop", "PerCapitaIncome", "Unemployed"))
save(crime_scores, file = "crime_scores.rda")| month | zip_code | weight_normalized | month_char |
|---|---|---|---|
| 2007-01-01 | 10001 | 0.7741732 | 2007-01 |
| 2007-01-01 | 10002 | 0.7007013 | 2007-01 |
| 2007-01-01 | 10003 | 0.7235363 | 2007-01 |
| 2007-01-01 | 10004 | 0.5311227 | 2007-01 |
| 2007-01-01 | 10005 | 0.5875057 | 2007-01 |
| 2007-01-01 | 10006 | 0.7254990 | 2007-01 |
| 2007-01-01 | 10007 | 0.6136537 | 2007-01 |
| 2007-01-01 | 10009 | 0.6641158 | 2007-01 |
| 2007-01-01 | 10010 | 0.6610612 | 2007-01 |
| 2007-01-01 | 10011 | 0.7115365 | 2007-01 |
| 2007-01-01 | 10012 | 0.6803649 | 2007-01 |
| 2007-01-01 | 10013 | 0.7161397 | 2007-01 |
| 2007-01-01 | 10014 | 0.7032591 | 2007-01 |
| 2007-01-01 | 10016 | 0.6519636 | 2007-01 |
| 2007-01-01 | 10017 | 0.6477547 | 2007-01 |
| 2007-01-01 | 10018 | 0.7852772 | 2007-01 |
| 2007-01-01 | 10019 | 0.7157773 | 2007-01 |
| 2007-01-01 | 10020 | 0.6395650 | 2007-01 |
| 2007-01-01 | 10021 | 0.5731948 | 2007-01 |
| 2007-01-01 | 10022 | 0.7000389 | 2007-01 |
| 2007-01-01 | 10023 | 0.6438535 | 2007-01 |
| 2007-01-01 | 10024 | 0.6403814 | 2007-01 |
| 2007-01-01 | 10025 | 0.6793492 | 2007-01 |
| 2007-01-01 | 10026 | 0.7071881 | 2007-01 |
| 2007-01-01 | 10027 | 0.7344474 | 2007-01 |
| 2007-01-01 | 10028 | 0.5959080 | 2007-01 |
| 2007-01-01 | 10029 | 0.7314376 | 2007-01 |
| 2007-01-01 | 10030 | 0.7243277 | 2007-01 |
| 2007-01-01 | 10031 | 0.7246128 | 2007-01 |
| 2007-01-01 | 10032 | 0.6889462 | 2007-01 |
| 2007-01-01 | 10033 | 0.6980645 | 2007-01 |
| 2007-01-01 | 10034 | 0.6722874 | 2007-01 |
| 2007-01-01 | 10035 | 0.7310709 | 2007-01 |
| 2007-01-01 | 10036 | 0.7810539 | 2007-01 |
| 2007-01-01 | 10037 | 0.6272710 | 2007-01 |
| 2007-01-01 | 10038 | 0.5992176 | 2007-01 |
| 2007-01-01 | 10039 | 0.6973183 | 2007-01 |
| 2007-01-01 | 10040 | 0.6340093 | 2007-01 |
| 2007-01-01 | 10044 | 0.6168883 | 2007-01 |
| 2007-01-01 | 10065 | 0.6942905 | 2007-01 |
| 2007-01-01 | 10069 | 0.3004099 | 2007-01 |
| 2007-01-01 | 10075 | 0.5200755 | 2007-01 |
| 2007-01-01 | 10111 | 0.5134916 | 2007-01 |
| 2007-01-01 | 10128 | 0.6255310 | 2007-01 |
| 2007-01-01 | 10280 | 0.4853990 | 2007-01 |
| 2007-01-01 | 10281 | 0.4655224 | 2007-01 |
| 2007-01-01 | 10282 | 0.5145008 | 2007-01 |
| 2007-01-01 | 10301 | 0.6738727 | 2007-01 |
| 2007-01-01 | 10302 | 0.6500923 | 2007-01 |
| 2007-01-01 | 10303 | 0.6575468 | 2007-01 |
To display the neighborhood names for each zip code (if they have one), we’ll pull them in from a csv file created using information from the New York State’s Department of Health.
neighborhoods = read_csv("neighborhoods.csv") %>%
mutate(zips = sapply(zips, function(x) as.list(strsplit(x," ")))) %>%
unnest(zips) %>%
add_row(zips = "00083", neighborhood = "Central Park")| neighborhood | zips |
|---|---|
| Central Bronx | 10453 |
| Central Bronx | 10457 |
| Central Bronx | 10460 |
| Bronx Park and Fordham | 10458 |
| Bronx Park and Fordham | 10467 |
| Bronx Park and Fordham | 10468 |
| High Bridge and Morrisania | 10451 |
| High Bridge and Morrisania | 10452 |
| High Bridge and Morrisania | 10456 |
| Hunts Point and Mott Haven | 10454 |
| Hunts Point and Mott Haven | 10455 |
| Hunts Point and Mott Haven | 10459 |
| Hunts Point and Mott Haven | 10474 |
| Kingsbridge and Riverdale | 10463 |
| Kingsbridge and Riverdale | 10471 |
| Northeast Bronx | 10466 |
| Northeast Bronx | 10469 |
| Northeast Bronx | 10470 |
| Northeast Bronx | 10475 |
| Southeast Bronx | 10461 |
| Southeast Bronx | 10462 |
| Southeast Bronx | 10464 |
| Southeast Bronx | 10465 |
| Southeast Bronx | 10472 |
| Southeast Bronx | 10473 |
| Central Brooklyn | 11212 |
| Central Brooklyn | 11213 |
| Central Brooklyn | 11216 |
| Central Brooklyn | 11233 |
| Central Brooklyn | 11238 |
| Southwest Brooklyn | 11209 |
| Southwest Brooklyn | 11214 |
| Southwest Brooklyn | 11228 |
| Borough Park | 11204 |
| Borough Park | 11218 |
| Borough Park | 11219 |
| Borough Park | 11230 |
| Canarsie and Flatlands | 11234 |
| Canarsie and Flatlands | 11236 |
| Canarsie and Flatlands | 11239 |
| Southern Brooklyn | 11223 |
| Southern Brooklyn | 11224 |
| Southern Brooklyn | 11229 |
| Southern Brooklyn | 11235 |
| Northwest Brooklyn | 11201 |
| Northwest Brooklyn | 11205 |
| Northwest Brooklyn | 11215 |
| Northwest Brooklyn | 11217 |
| Northwest Brooklyn | 11231 |
| Flatbush | 11203 |
Finally, we’ll load in the shapefile that contains all of the zip code polygins (with st_read) and merge it with the neighborhoods key and the 2015 census data. Before we saving it as an RDA file also we reduce the number of polygon vertices using `ms_simplify()’. This will allow for better performance of our Shiny app.
library(jsonlite)
library(sf)
zip_sf = st_read("https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson", stringsAsFactors = FALSE) %>%
select(OBJECTID, postalCode, geometry) %>%
rename(postalcode = postalCode, shape_id = OBJECTID)## Reading layer `nyc-zip-code-tabulation-areas-polygons' from data source `https://raw.githubusercontent.com/fedhere/PUI2015_EC/master/mam1612_EC/nyc-zip-code-tabulation-areas-polygons.geojson' using driver `GeoJSON'
## Simple feature collection with 262 features and 11 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: -74.25576 ymin: 40.49584 xmax: -73.6996 ymax: 40.91517
## CRS: 4326
In this sub-section, we review the specific files that are run each time we start up an instance of out Shiny app visualization.
This file is always executed before server.R and ui.R and allows us run preliminary code and create global variables used by both the server and UI. We begin the file by loading in the tidyverse package and creating a temporary directory where we can store the plots we’ll be generating when a user clicks a particular polygon.
library(tidyverse)
# Create temp folder for popup plots and rdas
folder <- tempfile()
dir.create(folder)Next, we set the location from where we’ll be loading our rda files, as this is different depending on whether we’re running it locally or on shinyapps.io (in which case the RDA files are stored on dropbox). Before pushing the app to shinyapps.io, one simply has to uncomment the commented out lines so that the app can pull from dropbox rather than just outside its local directory.
rda_loc = "../"
# These lines are for hosting the Shiny app on shinyapps.io
# Leave them commented out when you're running on your local machine
# library(rdrop2)
# drop_auth(rdstoken = "token.rds")
# dir_info = drop_dir("Rstudio_cloud/engm_182_project_data")
# mapply(drop_get,
# path = dir_info$path_display,
# local_file = paste(folder,dir_info$name,sep="/"),
# overwrite = rep(TRUE, nrow(dir_info))
# )
# rda_loc = paste0(folder, "/")Now that we’ve set their location, we can load in the RDA files. We’ll also create a unique_zips data frame to hold popup titles for each zipcode polygon in the shapefile.
# Load data gropued by month and zip code
load(paste0(rda_loc, "grouped_housing.rda"))
load(paste0(rda_loc, "crime_scores.rda"))
load(paste0(rda_loc, "zip_polygons.rda"))
unique_zips = zip_sf %>% data.frame() %>%
select(postalcode, neighborhood) %>%
group_by(postalcode) %>%
summarize(neighborhood = neighborhood[ifelse(min(which(!is.na(neighborhood)))==Inf,1,min(which(!is.na(neighborhood))))]) %>%
mutate(select_names = paste0(ifelse(is.na(.data$neighborhood), "", paste0(.data$neighborhood, " - ")), .data$postalcode)) %>%
data.frame() %>%
select(-neighborhood)| postalcode | select_names |
|---|---|
| 00083 | Central Park - 00083 |
| 10001 | Chelsea and Clinton - 10001 |
| 10002 | Lower East Side - 10002 |
| 10003 | Lower East Side - 10003 |
| 10004 | Lower Manhattan - 10004 |
| 10005 | Lower Manhattan - 10005 |
| 10006 | Lower Manhattan - 10006 |
| 10007 | Lower Manhattan - 10007 |
| 10009 | Lower East Side - 10009 |
| 10010 | Gramercy Park and Murray Hill - 10010 |
| 10011 | Chelsea and Clinton - 10011 |
| 10012 | Greenwich Village and Soho - 10012 |
| 10013 | Greenwich Village and Soho - 10013 |
| 10014 | Greenwich Village and Soho - 10014 |
| 10016 | Gramercy Park and Murray Hill - 10016 |
| 10017 | Gramercy Park and Murray Hill - 10017 |
| 10018 | Chelsea and Clinton - 10018 |
| 10019 | Chelsea and Clinton - 10019 |
| 10020 | Chelsea and Clinton - 10020 |
| 10021 | Upper East Side - 10021 |
| 10022 | Gramercy Park and Murray Hill - 10022 |
| 10023 | Upper West Side - 10023 |
| 10024 | Upper West Side - 10024 |
| 10025 | Upper West Side - 10025 |
| 10026 | Central Harlem - 10026 |
| 10027 | Central Harlem - 10027 |
| 10028 | Upper East Side - 10028 |
| 10029 | East Harlem - 10029 |
| 10030 | Central Harlem - 10030 |
| 10031 | Inwood and Washington Heights - 10031 |
| 10032 | Inwood and Washington Heights - 10032 |
| 10033 | Inwood and Washington Heights - 10033 |
| 10034 | Inwood and Washington Heights - 10034 |
| 10035 | East Harlem - 10035 |
| 10036 | Chelsea and Clinton - 10036 |
| 10037 | Central Harlem - 10037 |
| 10038 | Lower Manhattan - 10038 |
| 10039 | Central Harlem - 10039 |
| 10040 | Inwood and Washington Heights - 10040 |
| 10041 | 10041 |
| 10043 | 10043 |
| 10044 | Upper East Side - 10044 |
| 10045 | 10045 |
| 10047 | 10047 |
| 10048 | 10048 |
| 10055 | 10055 |
| 10065 | Upper East Side - 10065 |
| 10069 | 10069 |
| 10075 | Upper East Side - 10075 |
| 10080 | 10080 |
Next we build a list to hold the radio button choices to select which data the user wants to look at.
# Build list of radio button choice names and values
radioButtonOptions = setNames(
as.list(c(colnames(grouped_housing)[-c(1:3)],colnames(crime_scores)[3])), # values
c( # choice names
"Average price per sq. ft.",
"Number of sales",
"Total proceeds from sales",
"Crime score"
)
)Finally, we create some custom CSS and HTML that will be used to fix a bug in the display positioning of the NA color within the leaflet legend. We found this solution in a Github issue thread.
Server is the part of shiny application where processing and updating of data, map, predicted values happen. We’ll gradually build the code for Sever in steps. The inputIds in ui.R script will be linked in Sever script.
We’ll be using Neural net and Random forest models to compute prices of real estate based on 4 inputs from user.
We start coding inside shinyServer function. In next steps, we build chunks of code that will go inside shinyServer function. They execute different processes in the app. Final code of server script is given in the end.
In this first step, we create a reactive obeject that changes the merged dateframe as per the user inputs: time selected and property type selected. The merged dataframe has spatial data. renderLeaflet function creates the map and shows NYC as we have put the coordinates.
shinyServer(function(input, output) {
df <- reactive({
if (housing_data_select()) {
data = grouped_housing %>%
filter(category == input$prop_category, month_char == input$date_select) %>%
select(zip_code, disp_data = !!input$data_select)
} else {
data = crime_scores %>%
filter(month_char == input$date_select) %>%
select(zip_code, disp_data = !!input$data_select)
}
tmp = merge(zip_sf, data, by.x="postalcode", by.y="zip_code", all.x=TRUE, no.dups=FALSE)
return(tmp)
})
output$map <- renderLeaflet({
leaflet() %>%
addTiles() %>%
setView(-74.0060,40.7128, zoom=11)
})
})Now, we add a color palette. Using reverse geocoding, we map the reactive data based on the ZIP code as per the attribute chosen by the user with radioButton and time chosen on sliderInput. We also highlight the ZIP code areas for mouse hover.
observe({
tmp = df()
legend_labels = NULL
pal = colorNumeric(
palette = rev(brewer.pal(n=9, name = "RdYlGn")),
domain = NULL
)
if (all(is.na(tmp$disp_data))) {
func = labelFormat()
tmp$disp_data = as.factor(rep("NA", nrow(tmp)))
pal = colorFactor(palette = "#808080", domain = NULL)
} else if (input$data_select == "avg_price_per_sqft") {
func = labelFormat(prefix = " $")
} else if (input$data_select == "total_proceeds") {
func = labelFormat(prefix = " $", suffix = "M", transform=function(x) x/1E6)
} else {
func = labelFormat(prefix = " ")
}
leafletProxy("map", data = tmp) %>%
clearControls() %>%
clearShapes() %>%
clearPopups() %>%
addPolygons(
fillColor = ~pal(disp_data),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
# Highlight neighbourhoods upon mouseover
highlight = highlightOptions(
weight = 3,
color = "black",
opacity = 1.0
),
layerId = ~shape_id
) %>%
addLegend(
title=names(which(radioButtonOptions == input$data_select)),
pal = pal,
values = ~disp_data,
position = "bottomright",
labFormat = func,
labels = legend_labels
)
})We create a function observeEvent that will check the click on a ZIP code area. It will show a pop up inside which we can see a graph of the chosen attribute over time. Code for the pop up in shown in the next step.
# Observe shape clicks to display popup
observeEvent(input$map_shape_click,{
event = input$map_shape_click
if (is.null(event))
return()
else {
zip_clicked = zip_sf$postalcode[zip_sf$shape_id == event$id]
curr = FALSE
if (housing_data_select()) {
data = grouped_housing %>%
filter(category == input$prop_category, zip_code == zip_clicked) %>%
select(month_char, !!input$data_select)
title_suffix_str = paste("-", input$prop_category)
if (input$prop_category == "All") {
title_suffix_str = paste(title_suffix_str, "Categories")
}
if (input$data_select %in% c("avg_price_per_sqft", "total_proceeds")) {
curr = TRUE
}
} else {
data = crime_scores %>%
filter(zip_code == zip_clicked) %>%
select(month_char, weight_normalized)
title_suffix_str = "over Time"
}
showPopup(data, zip_clicked, event$id, event$lat, event$lng, title_suffix_str, curr)
}
})In this step, we add pop up on each ZIP code area showing the graph of the attribute chosen over time. Pop us also shows demographic information from the 2015 census.
showPopup <- function(df, zip, id, lat, lng, suff_str, curr = FALSE) {
# Get plot object and other popup data for the chosen zip code
zip_data = zip_sf[which(zip_sf$postalcode == zip),]
if (nrow(df) > 0) {
y = pull(df, 2)
x = as.yearmon(pull(df, 1))
title = paste(names(which(radioButtonOptions == colnames(df)[2])), suff_str)
ylabel = names(which(radioButtonOptions == colnames(df)[2]))
if (curr == TRUE) {
if (max(y, na.rm=TRUE) > 1e6) {
label = label_number(prefix = "$", suffix = "M", scale = 1e-6)
} else if (max(y, na.rm = TRUE) > 1e4) {
label = label_number(prefix = "$", suffix = "K", scale = 1e-3)
} else {
label = label_number(prefix = "$")
}
} else {
label = label_number()
}
plot = ggplot(data=NULL, aes(x, y)) +
geom_line() +
geom_smooth() +
labs(title = title, y = ylabel, x = "Month") +
scale_y_continuous(label = label, expand = c(0,0), limits = c(0, ifelse(colnames(df)[2] == "weight_normalized", 1, max(y, na.rm=TRUE)))) +
theme_classic() +
theme(plot.title = element_text(size = 11, hjust = 0.5), axis.title = element_text(size = 10))
# Write svg file to temporary folder
svg(filename= paste(folder,"plot.svg", sep = "/"),
width = 500 * 0.01, height = 300 * 0.01)
print(plot)
dev.off()
content <- paste(readLines(paste(folder,"plot.svg",sep="/")), collapse = "")
} else {
content = ""
}
if (is.na(zip_data$neighborhood)) {
neighborhood_str = ""
} else {
neighborhood_str = paste(zip_data$neighborhood, " - ")
}
# Create popup
leafletProxy("map") %>%
addPopups(
lng,
lat,
popup = paste0(
"<h4>", neighborhood_str, zip, "</h4><br/>",
content, "<br/>",
"<b>2015 Census Data</b><br/>",
"Per capita income: $", round(zip_data$PerCapitaIncome),"</b><br/>",
"Total Population: ", zip_data$TotalPop, "</b><br/>",
"Unemployment rate: ", round(zip_data$Unemployed), "%"
),
layerId = id,
options = popupOptions(maxWidth = 500)
)
}In this step, we integrate the predictive models in the shiny app. It takes 4 inputs from the user: Age of the builing, area, sale year and ZIP code. ZIP code pulls demographic data from the 2015 cenus that’s in the dataframe. It feeds these inputs in the model chosen by the user and gives a predicted price.
price_text = eventReactive(input$predict_price, {
if (input$zip != "")
zip = unique_zips[which(unique_zips$select_names==input$zip), 1]
else {
zip = NA
}
input_data = data.frame(gross_square_feet = input$sqft, age = input$age, sale_year = input$sale_year, zip_code = zip, stringsAsFactors = FALSE)
bad_cols = colnames(input_data)[colSums(is.na(input_data)) > 0]
if (length(bad_cols) == 0){
zip_data = zip_sf %>%
filter(postalcode == zip) %>%
slice(1) %>%
data.frame() %>%
select(postalcode, PerCapitaIncome, White, Black, Asian, Hispanic, Native, TotalPop.x = TotalPop, Unemployed, weight) %>%
inner_join(input_data, by = c("postalcode" = "zip_code")) %>%
select(-postalcode)
if (is.na(zip_data)) {return(paste0("<font color=\"#eb4034\">Error: No valid census data for this zip code!</font>"))}
if (input$model_select == "fit.rf") {
predicted_val = predict(fit.rf, newdata=zip_data)
} else if (input$model_select == "fit.nnet") {
predicted_val = predict(fit.nnet, newdata=zip_data)
} else if (input$model_select == "fit.bgcrt") {
predicted_val = predict(fit.bgcrt, newdata=zip_data)
}
return(paste0("<br/>", dollar(predicted_val)))
} else {
msg = ""
if ("gross_square_feet" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter a square footage!")
}
if ("age" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter building age!")
}
if ("sale_year" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter a sale year!")
}
if ("zip_code" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter a location!")
}
return(paste0("<font color=\"#eb4034\">", msg, "</font>"))
}
})
output$price = renderText({
price_text()
})housing_data_select takes a boolean value of either TRUE or FALSE depending on whether the attribute chosen by the user is within the choices provided.
Now we combine all chunks of code that we saw in steps so far inside the shinyServer function. This is the final script for server.R. This script along with ui.R and Global.R runs the shiny app.
library(shiny)
library(leaflet)
library(sf)
library(RColorBrewer)
library(zoo)
library(scales)
library(ggplot2)
library(caret)
library(randomForest)
load("neural_net.rda")
# load(paste0(rda_loc, "bagged_cart.rda"))
load("random_forest.rda")
shinyServer(function(input, output) {
output$housing_data_select = reactive({
return(housing_data_select())
})
housing_data_select = reactive({
if (input$data_select %in% colnames(grouped_housing)[-c(1:3)]) {
return(TRUE)
} else {
return(FALSE)
}
})
output$price = renderText({
price_text()
})
price_text = eventReactive(input$predict_price, {
if (input$zip != "")
zip = unique_zips[which(unique_zips$select_names==input$zip), 1]
else {
zip = NA
}
input_data = data.frame(gross_square_feet = input$sqft, age = input$age, sale_year = input$sale_year, zip_code = zip, stringsAsFactors = FALSE)
bad_cols = colnames(input_data)[colSums(is.na(input_data)) > 0]
if (length(bad_cols) == 0){
zip_data = zip_sf %>%
filter(postalcode == zip) %>%
slice(1) %>%
data.frame() %>%
select(postalcode, PerCapitaIncome, White, Black, Asian, Hispanic, Native, TotalPop.x = TotalPop, Unemployed, weight) %>%
inner_join(input_data, by = c("postalcode" = "zip_code")) %>%
select(-postalcode)
if (is.na(zip_data)) {return(paste0("<font color=\"#eb4034\">Error: No valid census data for this zip code!</font>"))}
if (input$model_select == "fit.rf") {
predicted_val = predict(fit.rf, newdata=zip_data)
} else if (input$model_select == "fit.nnet") {
predicted_val = predict(fit.nnet, newdata=zip_data)
} else if (input$model_select == "fit.bgcrt") {
predicted_val = predict(fit.bgcrt, newdata=zip_data)
}
return(paste0("<br/>", dollar(predicted_val)))
} else {
msg = ""
if ("gross_square_feet" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter a square footage!")
}
if ("age" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter building age!")
}
if ("sale_year" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter a sale year!")
}
if ("zip_code" %in% bad_cols) {
msg = paste0(msg, "<br/>Error: Please enter a location!")
}
return(paste0("<font color=\"#eb4034\">", msg, "</font>"))
}
})
df <- reactive({
if (housing_data_select()) {
data = grouped_housing %>%
filter(category == input$prop_category, month_char == input$date_select) %>%
select(zip_code, disp_data = !!input$data_select)
} else {
data = crime_scores %>%
filter(month_char == input$date_select) %>%
select(zip_code, disp_data = !!input$data_select)
}
tmp = merge(zip_sf, data, by.x="postalcode", by.y="zip_code", all.x=TRUE, no.dups=FALSE)
return(tmp)
})
output$map <- renderLeaflet({
leaflet() %>%
addTiles() %>%
setView(-74.0060,40.7128, zoom=11)
})
observe({
tmp = df()
legend_labels = NULL
pal = colorNumeric(
palette = rev(brewer.pal(n=9, name = "RdYlGn")),
domain = NULL
)
if (all(is.na(tmp$disp_data))) {
func = labelFormat()
tmp$disp_data = as.factor(rep("NA", nrow(tmp)))
pal = colorFactor(palette = "#808080", domain = NULL)
} else if (input$data_select == "avg_price_per_sqft") {
func = labelFormat(prefix = " $")
} else if (input$data_select == "total_proceeds") {
func = labelFormat(prefix = " $", suffix = "M", transform=function(x) x/1E6)
} else {
func = labelFormat(prefix = " ")
}
leafletProxy("map", data = tmp) %>%
clearControls() %>%
clearShapes() %>%
clearPopups() %>%
addPolygons(
fillColor = ~pal(disp_data),
color = "#b2aeae", # you need to use hex colors
fillOpacity = 0.7,
weight = 1,
smoothFactor = 0.2,
# Highlight neighbourhoods upon mouseover
highlight = highlightOptions(
weight = 3,
color = "black",
opacity = 1.0
),
layerId = ~shape_id
) %>%
addLegend(
title=names(which(radioButtonOptions == input$data_select)),
pal = pal,
values = ~disp_data,
position = "bottomright",
labFormat = func,
labels = legend_labels
)
})
# Observe shape clicks to display popup
observeEvent(input$map_shape_click,{
event = input$map_shape_click
if (is.null(event))
return()
else {
zip_clicked = zip_sf$postalcode[zip_sf$shape_id == event$id]
curr = FALSE
if (housing_data_select()) {
data = grouped_housing %>%
filter(category == input$prop_category, zip_code == zip_clicked) %>%
select(month_char, !!input$data_select)
title_suffix_str = paste("-", input$prop_category)
if (input$prop_category == "All") {
title_suffix_str = paste(title_suffix_str, "Categories")
}
if (input$data_select %in% c("avg_price_per_sqft", "total_proceeds")) {
curr = TRUE
}
} else {
data = crime_scores %>%
filter(zip_code == zip_clicked) %>%
select(month_char, weight_normalized)
title_suffix_str = "over Time"
}
showPopup(data, zip_clicked, event$id, event$lat, event$lng, title_suffix_str, curr)
}
})
showPopup <- function(df, zip, id, lat, lng, suff_str, curr = FALSE) {
# Get plot object and other popup data for the chosen zip code
zip_data = zip_sf[which(zip_sf$postalcode == zip),]
if (nrow(df) > 0) {
y = pull(df, 2)
x = as.yearmon(pull(df, 1))
title = paste(names(which(radioButtonOptions == colnames(df)[2])), suff_str)
ylabel = names(which(radioButtonOptions == colnames(df)[2]))
if (curr == TRUE) {
if (max(y, na.rm=TRUE) > 1e6) {
label = label_number(prefix = "$", suffix = "M", scale = 1e-6)
} else if (max(y, na.rm = TRUE) > 1e4) {
label = label_number(prefix = "$", suffix = "K", scale = 1e-3)
} else {
label = label_number(prefix = "$")
}
} else {
label = label_number()
}
plot = ggplot(data=NULL, aes(x, y)) +
geom_line() +
geom_smooth() +
labs(title = title, y = ylabel, x = "Month") +
scale_y_continuous(label = label, expand = c(0,0), limits = c(0, ifelse(colnames(df)[2] == "weight_normalized", 1, max(y, na.rm=TRUE)))) +
theme_classic() +
theme(plot.title = element_text(size = 11, hjust = 0.5), axis.title = element_text(size = 10))
# Write svg file to temporary folder
svg(filename= paste(folder,"plot.svg", sep = "/"),
width = 500 * 0.01, height = 300 * 0.01)
print(plot)
dev.off()
content <- paste(readLines(paste(folder,"plot.svg",sep="/")), collapse = "")
} else {
content = ""
}
if (is.na(zip_data$neighborhood)) {
neighborhood_str = ""
} else {
neighborhood_str = paste(zip_data$neighborhood, " - ")
}
# Create popup
leafletProxy("map") %>%
addPopups(
lng,
lat,
popup = paste0(
"<h4>", neighborhood_str, zip, "</h4><br/>",
content, "<br/>",
"<b>2015 Census Data</b><br/>",
"Per capita income: $", round(zip_data$PerCapitaIncome),"</b><br/>",
"Total Population: ", zip_data$TotalPop, "</b><br/>",
"Unemployment rate: ", round(zip_data$Unemployed), "%"
),
layerId = id,
options = popupOptions(maxWidth = 500)
)
}
outputOptions(output, "housing_data_select", suspendWhenHidden = FALSE)
})User interface is the user facing side of the Shiny application.It contains various display elements such as titlePanel, sidebarLayout, radioButtons, tabPanel etc. Following steps show how we step-wise buit up code for UI.
Now that we have imported libraries, we start adding code inside fluidPage to create display elements. We’ll have user inputs in the sidebarPanel. We’ll add those in later steps. Final code of ui script is given in the end.
Now we create second tab for the predictive model integration in the app. We build radioButtons to choose between models. Price predictor takes 4 inputs to compute the price. Clicking on the actionButton shows predicted price.
tabPanel(
"Price Predictor",
HTML("<br/>"),
numericInput("age", "Building age (years):", 50, min=1800, max = 2030),
numericInput("sqft", "Square footage:", 2000, min=0, max = 1000000),
numericInput("sale_year", "Sale Year:", 2010, min=1950, max = 2030),
selectizeInput("zip", "Location:",
selected = NULL,
choices = unique_zips$select_names,
options = list(
placeholder = "Search by Neighborhood or Zip Code",
onInitialize = I('function() { this.setValue(""); }')
)
),
radioButtons(
"model_select",
label="Predictive Model:",
choiceNames = c(
# "Bagged cart",
"Neural net",
"Random forest"
),
choiceValues = c(
# "fit.bgcrt",
"fit.nnet",
"fit.rf"
)
),
actionButton("predict_price", label = "Compute Price"),
HTML("<hr/><b>Predicted Price: </b>"),
htmlOutput("price")
)Now we combine all chunks of code that we saw in steps so far inside the shinyUI function. This is the final script for ui.R. This script along with server.R and Global.R runs the shiny app.
library(shiny)
library(leaflet)
library(shinyWidgets)
# Define UI for application that draws a histogram
shinyUI(fluidPage(
HTML(html_fix),
titlePanel("NYC Real Estate and Crime"),
sidebarLayout(
sidebarPanel(
tabsetPanel(
tabPanel(
"Map Controls",
HTML("<br/>"),
sliderTextInput(
inputId = "date_select",
label="Month",
choices = sort(unique(c(unique(crime_scores$month_char),unique(grouped_housing$month_char)))),
animate=TRUE,
selected="2020-03"
),
radioButtons(
inputId = "data_select",
label = "Mapped Data",
radioButtonOptions # set up in global.R
),
conditionalPanel(
condition = "output.housing_data_select == true",
selectInput(
inputId = "prop_category",
label = "Property Category",
choices = sort(unique(grouped_housing$category))
)
)
),
tabPanel(
"Price Predictor",
HTML("<br/>"),
numericInput("age", "Building age (years):", 50, min=1800, max = 2030),
numericInput("sqft", "Square footage:", 2000, min=0, max = 1000000),
numericInput("sale_year", "Sale Year:", 2010, min=1950, max = 2030),
selectizeInput("zip", "Location:",
selected = NULL,
choices = unique_zips$select_names,
options = list(
placeholder = "Search by Neighborhood or Zip Code",
onInitialize = I('function() { this.setValue(""); }')
)
),
radioButtons(
"model_select",
label="Predictive Model:",
choiceNames = c(
# "Bagged cart",
"Neural net",
"Random forest"
),
choiceValues = c(
# "fit.bgcrt",
"fit.nnet",
"fit.rf"
)
),
actionButton("predict_price", label = "Compute Price"),
HTML("<hr/><b>Predicted Price: </b>"),
htmlOutput("price")
)
)
),
mainPanel(
leafletOutput(
"map",
width="100%",
height="90vh"
)
),
position="left"
)
))Selecting relevant variables
Many factors go into predicting the housing prices. So it’s difficult to select a set of variables from the data available.
Cleaning the data
Cleaning the raw data is an important step. Often, the data collected is not in the required shape so a lot wrangling has to be done before it’s ready for further processing.
Missing data
There were many data points that were missing. We couldn’t do much about them. There are many ways to deal with missing data. However, they were not within the scope of the project. Grey areas seen on the map are due to these missing data points.
Computing power limitations We experienced computing power limitations while working with machine learning models. In future, these models can be run with better computing.
Data cleaning takes time
Cleaning and preparing the data into required form is an important step and can be time consuming. Often, it takes more time to clean and prepare the date than processing it.
Higher computing or APIs needed to run the models
To run the machine learning models, higher computing power would be useful. Othewise, using some API package like rcpp the code could be executed faster.
Lost insights due to missing data We may have missed out on many insights due to missing data. There was nothing much we could do about this. In future, missing data could be handled better.