Predicting Home Sale Price in Philadelphia

MUSA 508 Midterm

Author

Akira Di Sandro and Nissim Lebovits

Published

October 11, 2023

Summary

Below, we implement an Ordinary Least Squares (OLS) regression to predict sale prices for homes in Philadelphia. We create a dataset of features taken from Zillow data and several public data sources and attempt to narrow them down to a list of predictors that will minimize the mean absolute percent error (MAPE) of predictions. We find that our model 1) has limited accuracy, with a MAPE of more than 35%, and 2) does not generalize well to lower income or non-white neighborhoods. We attribute these shortcomings to poor selection and cleaning of predictor variables, and the inadequacy of OLS regression when dealing with data of a fundamentally spatial nature. To improve our model, we recommend a more parsimonious selection and transformation of variables, and the consideration of the spatial nature of the data through the use of a method such as geographically-weighted regression.

Introduction

Housing market price estimators like Zillow’s Zestimate can give a general idea of a house’s value, but professional appraisers and real estate agents generally do not recommend relying on these tools because they tend to have high error rates. Beyond inconveniencing a home buyer or seller, the high error rates of automated valuation models (AVMs; e.g., Zillow’s Zestimate) may misrepresent low-income and majority-minority neighborhoods, especially Black-majority neighborhoods, where home price errors are higher than in majority-white neighborhoods1. Moreover, a home’s value is often used to determine a homeowner’s mortgage loan approval and to inform policymakers about the general state of the housing market2. Inaccurate housing estimates may therefore lead to misinformed housing and economic policies that disproportionately affect non-white neighborhoods, reifying racial and class disparities in housing.

There are many reasons why these price estimators may be inaccurate (e.g. data availability, inability to capture temporal housing market patterns, etc.), and among them lack of local intelligence stands out. Each urban area has its own culture and unique characteristics that need to be factored into AVMs in order to more accurately predict housing prices. As residents and spatial analysts of the city of Philadelphia, our team was able to apply our knowledge of the city and its neighborhoods to the creation of an improved prediction model.

Even with a deeper understanding of Philadelphia, creating an improved home price estimator for the city is still a challenging task. In addition to the traditional housing market and attribute data used in AVMs, all kinds of data are available, including health and human services, transportation, public safety, parks and recreation, and food. Though these data are interrelated and all inform home price, limiting the predictors to only those that are most informative can be daunting. We often create new features by redefining (and sometimes combining) variables in ways that improve our prediction of home price. Fortunately, tools such as stepwise feature selection make this decision-making process easier.

Another possible challenge when constructing a home price estimator is data availability, or rather unavailability. In some neighborhoods, especially low-income neighborhoods and areas where residents distrust the government, the data we use in our AVMs may be missing for various reasons. Some of the variables that are most informative to a model may describe resources that are simply unavailable in low-income neighborhoods. On the other hand, even if the resources are available and used by the community, residents of an area with high distrust of government may not respond to the Census and other data surveys, resulting in missing data. Ultimately, if a model relies on variables that are missing in neighborhoods with certain characteristics, it will predict home price with more error in those areas.

Keeping these potential pitfalls in mind, our overall modeling strategy was as follows. We first gathered open-source data about Philadelphia and examined and created variables describing Philadelphia houses and neighborhoods to assess whether it made sense to include them in our model. We then input a selection of predictors into an ordinary least squares (OLS) regression, specifically an OLS stepwise regression, to narrow down the combination of most informative predictors. After assessing model fit with mean absolute error (MAE) and mean absolute percentage error (MAPE), we revised our selection of predictors and repeated the process until we were satisfied with the model.
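
As a concrete illustration of the selection step, the sketch below (not our exact pipeline) uses the olsrr package to run stepwise selection by AIC on a hypothetical data frame called candidates containing sale_price and a pool of numeric predictors:

library(olsrr)

# fit a full model on every candidate predictor (candidates is a hypothetical data frame)
full_model <- lm(sale_price ~ ., data = candidates)

# stepwise (forward and backward) selection by AIC; $predictors lists the retained variables
step_fit  <- ols_step_both_aic(full_model)
keep_vars <- step_fit$predictors

# refit a leaner model on the retained predictors only
lean_model <- lm(reformulate(keep_vars, response = "sale_price"), data = candidates)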

Our final model uses 21 independent variables to predict housing prices in Philadelphia, with a mean MAE of $68,687.82 and a mean MAPE of 36.74%. Notable predictors in our OLS regression are the number of bathrooms, the percent of the population that is white (in the census tract containing the house), and the rating of a house’s interior condition. We observed clustering in home sale prices, with a Moran’s I of 0.615 (p < 0.001), and our model’s errors exhibit spatial autocorrelation as well. We also found that our model predicts with higher errors in majority non-white and low-income neighborhoods.

Show the code
library(tidyverse)
library(olsrr)
library(sf)
library(caret) # add dummy vars
library(tmap)
library(fastDummies)
library(tidycensus)
library(spdep)
library(sfdep)
library(curl)
library(zip)
library(rsgeo)
library(janitor)
library(spatstat)
library(maptools)
library(terra)
library(ggthemr)
library(ggcorrplot)
library(missForest)
library(psych)
library(kableExtra)
library(gridExtra)
library(ggpubr)

tmap_mode('plot')
options(tigris_use_cache = TRUE, scipen = 999)
ggthemr('flat')


pal_full <- c("#34495e", "#3498db", "#2ecc71", "#f1c40f", "#e74c3c", "#9b59b6", "#1abc9c", "#f39c12", "#d35400")
plt_pal <- c("#3498db", "#2ecc71", "#f1c40f")
corr_mat_pal <- c("#2ecc71", "white", "#e74c3c")
bi_pal <- c("#9b59b6", "#d6bddd")

mypalette1 <- colorRampPalette(c("#bbf0e1","#025e44"))(5)
mypalette2 <- colorRampPalette(c("#f0cab9","#6e2e10"))(5)

crs <- "epsg:2272"

set.seed(42)

# load functions from functions.R
source("https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/functions.r")

Methods

Data Sources

We gathered data about the boundaries of the city of Philadelphia and its neighborhoods from OpenData ArcGIS and Azavea’s GitHub repository3, respectively. We also pulled information about commercial centers from OpenData ArcGIS. Housing market and attribute data (including house features, sale price, ownership, etc.) were provided by Zillow (the studentData file), and census tract data (including total population, median income, etc.) were gathered from the 2020 5-year ACS. We note that we had previously gathered data on tree canopy cover and shootings from the American Forests website and Philadelphia’s Carto API, respectively, but our exploratory data analysis found that these data did not improve our model.

Data Wrangling

Show the code
phl_path <- "https://opendata.arcgis.com/datasets/405ec3da942d4e20869d4e1449a2be48_0.geojson"
phl <- st_read(phl_path, quiet = TRUE) %>%
          st_transform(crs = crs)

### base data-----------------------------------------------
data_path <- ("data/2023/studentData.geojson")
data <- st_read(data_path, quiet = TRUE)

drop <- c("objectid", "assessment_date", "beginning_point", "book_and_page", "category_code", 
          "cross_reference", "date_exterior_condition", "house_number", "location", "owner_1", 
          "owner_2", "parcel_number", "recording_date", "registry_number", "sale_date",
          "mailing_address_1", "mailing_address_2", "mailing_care_of", "mailing_zip", "mailing_street", 
          "mailing_city_state", "building_code", "geographic_ward", "state_code", "street_code", 
          "street_name", "street_designation", "street_direction", "census_tract", "suffix",
          "zip_code", "building_code_new", "year_built_estimate", "pin", "unit", "exempt_land",
          "building_code_description"
          )

to_cat <- c("category_code_description", "garage_type")

data <- data %>%
          mutate(non_resident_owner = mailing_address_1 == mailing_street) %>%
          select(-all_of(drop)) %>%
          mutate_at(to_cat, as.character) %>% 
          st_transform(crs = crs) %>%
          mutate(
            nb = st_knn(geometry, 5),
            wt = st_weights(nb),
            nb15 = st_knn(geometry, 15),
            wt15 = st_weights(nb15),
            total_area = ifelse(is.na(total_area), purrr::map_dbl(find_xj(total_area, nb), mean, na.rm = TRUE), total_area),
            total_livable_area = ifelse(is.na(total_livable_area), purrr::map_dbl(find_xj(total_livable_area, nb), mean, na.rm = TRUE), total_livable_area),
            number_of_bathrooms = round(ifelse(is.na(number_of_bathrooms), purrr::map_dbl(find_xj(number_of_bathrooms, nb), mean, na.rm = TRUE), number_of_bathrooms)),
            number_of_bedrooms = round(ifelse(is.na(number_of_bedrooms), purrr::map_dbl(find_xj(number_of_bedrooms, nb), mean, na.rm = TRUE), number_of_bedrooms)),
            number_of_rooms = ifelse(is.na(number_of_rooms), number_of_bedrooms + number_of_bathrooms, number_of_rooms),
            fireplaces = round(ifelse(is.na(fireplaces), purrr::map_dbl(find_xj(fireplaces, nb), mean, na.rm = TRUE), fireplaces)),
            interior_condition = round(ifelse(is.na(interior_condition), purrr::map_dbl(find_xj(interior_condition, nb), mean, na.rm = TRUE), interior_condition)),
            exterior_condition = round(ifelse(is.na(exterior_condition), purrr::map_dbl(find_xj(exterior_condition, nb), mean, na.rm = TRUE), exterior_condition)),
            number_stories = round(ifelse(is.na(number_stories), purrr::map_dbl(find_xj(number_stories, nb), mean, na.rm = TRUE), number_stories)),
            garage_spaces = round(ifelse(is.na(garage_spaces), 0, garage_spaces)), # per OPA's metadata, NA indicates 0 garage spaces
            sale_price_NA = ifelse(sale_price == 0, NA, sale_price),
                 price_lag = st_lag(sale_price_NA, nb15, wt15, na_ok = T)) %>% #spatial lag of sale price
          select(-c(nb, wt, sale_price_NA))

data <- data[phl, ]
keep_columns <- sapply(data, function(col) length(unique(col)) > 1) #drop columns with only one unique value (i.e., no variance)
data <- data[, keep_columns]


avg_sale_prices_x_building_type <- data %>%
  group_by(building_code_description_new) %>%
  summarize(avg_sale_price = mean(sale_price)) %>%
  mutate(building_type_price_class = case_when(
    avg_sale_price > 800000 ~ "most expensive",
    avg_sale_price <= 800000 & avg_sale_price > 600000 ~ "more expensive",
    avg_sale_price <= 600000 & avg_sale_price > 400000 ~ "expensive",
    avg_sale_price <= 400000 & avg_sale_price > 200000 ~ "less expensive",
    avg_sale_price <= 200000 ~ "least expensive",
  )) %>%
  select(building_code_description_new, building_type_price_class) %>%
  st_drop_geometry()

data <- left_join(data, avg_sale_prices_x_building_type, by = "building_code_description_new")

data <- data %>%
          mutate(quality_grade = case_when(
            quality_grade == "X" ~ "Highest",
            quality_grade %in% c("X-", "A+", "A", "A-") ~ "High",
            quality_grade %in% c("C", 'D', "D-", "D+", "6", "D-", "E", "C-", "E-", "E+") ~ "Lowest",
            TRUE ~ "Mid"
          ))


### neighborhoods-----------------------------------------------
hoods_path <- 'https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson'
hoods <- st_read(hoods_path, quiet = T) %>%
  st_transform(crs = crs) %>%
  select(mapname)

data <- st_join(data, hoods)

avg_sale_prices_x_hood <- data %>%
  group_by(mapname) %>%
  summarize(avg_sale_price = mean(sale_price)) %>%
  mutate(hood_price_class = case_when(
    avg_sale_price > 750000 ~ "most expensive",
    avg_sale_price <= 750000 & avg_sale_price > 500000 ~ "more expensive",
    avg_sale_price <= 500000 & avg_sale_price > 250000 ~ "expensive",
    avg_sale_price <= 250000 & avg_sale_price > 100000 ~ "less expensive",
    avg_sale_price <= 100000 ~ "least expensive",
  )) %>%
  select(mapname, hood_price_class) %>%
  st_drop_geometry()

data <- left_join(data, avg_sale_prices_x_hood, by = "mapname")


### ACS-----------------------------------------------
phl_acs <- readRDS("phl_acs.RData")
phl_acs <- phl_acs %>%
              select(-c(
                medRent,
                pctPov
              ))
data <- st_join(data, phl_acs) %>% 
  mutate(nb = st_knn(geometry, 5), # recompute the 5-nearest-neighbour list (nb was dropped earlier)
         pctWhite = ifelse(is.na(pctWhite), purrr::map_dbl(find_xj(pctWhite, nb), mean, na.rm = TRUE), pctWhite)) %>% # there were some values > 1 here. not sure what's happening
  select(-nb)

city_mean_inc <- round(mean(phl_acs$medHHInc,na.rm=T),2)

split_data <- phl_acs %>% 
  mutate(incomeContext = ifelse(medHHInc > city_mean_inc, "High Income", "Low Income"),  # really above/below avg inc
         raceContext = ifelse(pctWhite > .5, "Majority White", "Majority Non-White"))

### tree canopy cover-----------------------------------------------
# url <- "https://national-tes-data-share.s3.amazonaws.com/national_tes_share/pa.zip.zip"
# tmp_file <- tempfile(fileext = ".zip")
# curl_download(url, tmp_file)
# 
# unzipped_folder_1 <- tempfile()
# unzip(tmp_file, exdir = unzipped_folder_1)
# shp_files <- list.files(unzipped_folder_1, pattern = "\\.shp$", recursive = TRUE, full.names = TRUE)
# tree_canopy_gap <- st_read(shp_files[1], quiet = TRUE)  # assuming there's only one .shp file
# phl_tree_canopy <- tree_canopy_gap %>%
#   filter(state == "PA",
#          county == "Philadelphia County") %>%
#   transmute(tree_cover = 1 - tc_gap) %>%
#   st_transform(crs = crs)
# 
# data <- st_join(data, phl_tree_canopy)




### proximity to commmercial corridors-----------------------------------------------
corridors_path <- "https://opendata.arcgis.com/datasets/f43e5f92d34e41249e7a11f269792d11_0.geojson"
corridors <- st_read(corridors_path, quiet= TRUE) %>% st_transform(crs = crs)

nearest_fts <- sf::st_nearest_feature(data, corridors)

# convert to rsgeo geometries
x <- rsgeo::as_rsgeo(data)
y <- rsgeo::as_rsgeo(corridors)

# calculate distance
data$dist_to_commerce <- rsgeo::distance_euclidean_pairwise(x, y[nearest_fts])
data <- data %>%
          mutate(dist_to_commerce = log(dist_to_commerce + 1)) # log transform

### proximity to downtown-----------------------------------------------
# downtown <- st_sfc(st_point(c(-75.16408, 39.95266)), crs = 4326)
# downtown_sf <- st_sf(geometry = downtown)
# downtown_sf <- downtown_sf %>% st_transform(crs= crs)
# 
# nearest_fts <- sf::st_nearest_feature(data, downtown_sf)
# 
# # convert to rsgeo geometries
# x <- rsgeo::as_rsgeo(data)
# y <- rsgeo::as_rsgeo(downtown_sf)
# 
# # calculate distance
# data$dist_to_downtown <- rsgeo::distance_euclidean_pairwise(x, y[nearest_fts])

### shootings density-----------------------------------------------
# shootings_url <- 'https://phl.carto.com/api/v2/sql?q=SELECT+*+FROM+shootings+WHERE+year+%3E+2022&filename=shootings&format=geojson&skipfields=cartodb_id'
# shootings <- st_read(shootings_url, quiet = TRUE) %>% 
#                   st_transform(crs = crs) %>%
#                   filter(!st_is_empty(geometry))
# 
# sp_points <- as(shootings, "Spatial")
# ppp_object <- as.ppp(sp_points)
# kde <- densityAdaptiveKernel(ppp_object)
# kde_spatraster <- rast(kde)
# 
# guncrime <- terra::extract(kde_spatraster, data) %>%
#               transmute(ID = ID,
#                         guncrime_density = scale(lyr.1))
# 
# data <- cbind(data, guncrime) 

# data %>%
#   group_by(mapname) %>%
#   summarize(avg_sale_price = mean(sale_price)) %>%
#   arrange(desc(avg_sale_price)) %>%
#   head()

# m <- cor(numeric_only %>% na.omit())
# corrplot::corrplot(m, na.action = na.exclude)

# numeric_only <- data %>% st_drop_geometry() %>% select(where(is.numeric))
# ols_step_best_subset(model)
# model <- lm(sale_price ~ ., data = numeric_only)
# aic_preds <- ols_step_both_aic(model)$predictors

keep_vars <- c("price_lag",
               "garage_spaces",
               "total_livable_area",
               "medHHInc",
               "number_of_rooms",
               "number_of_bathrooms",        
               "depth",
               "off_street_open",
               "fireplaces",
               "interior_condition",
               "number_stories",
               "pctWhite",
              # "number_of_bedrooms",
               "exterior_condition",
               "dist_to_commerce",
               "year_built",
               "totPop",
            #  "total_area",
               "sale_price", 
               "hood_price_class", 
               "building_type_price_class", 
               "quality_grade",
               "musaID",
               "toPredict") # keeping toPredict to split data after creating dummy columns

data_to_dummy <- data %>%
                  select(all_of(keep_vars)) # for making summary statistics, keep categorical data as is and add imputed values

# count sum of missing values
# na_count <- c()
# for (i in 1:ncol(data_to_dummy)) {
#     row1 <- data_to_dummy[,i]
#     na_count <- c(na_count,sum(is.na(row1)))
# }
# 
# na_count <- data.frame(name = names(data_to_dummy), na_count) %>% 
#   filter(na_count > 0)
# columns with missing data: garage_spaces, medHHInc, number_of_rooms, number_of_bathrooms, depth, off_street_open, fireplaces, interior_condition, number_stories, number_of_bedrooms, exterior_condition, total_area

dummied_data <- data_to_dummy %>% 
                  dummy_cols(select_columns = c("hood_price_class",
                                                "building_type_price_class",
                                                "quality_grade")) %>%
                    clean_names() %>%
                    select(-c(hood_price_class,
                              building_type_price_class,
                              quality_grade))

reg_data <- dummied_data %>%
              filter(to_predict == "MODELLING") %>% # only keep modelling data
              select(-c(to_predict))

# geom and musaID to add back later
reg_data_geom <- data_to_dummy %>% 
  filter(toPredict == "MODELLING") %>%
  dplyr::select(musaID,geometry)

Exploratory Analysis

We first limited studentData to rows with realistic sale prices (sale_price > 1). After combining all of the data described above (matching by geographic location), we explored the relationships between our variables and sale price. The scatterplots in Figure 1 plot sale price as a function of several predictors of interest. The panels for total livable area (measured in square feet) and the spatial lag of home price (using the 15 nearest neighbors) show strong positive relationships with sale price. As expected, as median household income (measured at the census tract level) increases, so does the sale price of a home, though this positive relationship is not as strong as for the previously mentioned features. Another interesting relationship in Figure 1 is that between distance to the nearest commercial center and sale price: though one might expect a strong correlation between these two variables, the data show a rather weak relationship.

Show the code
# below is just a template of the code
data %>%
  st_drop_geometry() %>%
  rename(med_hh_inc = medHHInc) %>%
  dplyr::select(sale_price,
                dist_to_commerce,
                price_lag,
                total_livable_area,
                med_hh_inc) %>%
  rename(`Distance to Commercial Center (ft)` = dist_to_commerce,
         `Total Livable Area (sq ft)` = total_livable_area,
         `Median Household Income ($)` = med_hh_inc,
         `Spatial Lag of Home Sale Price` = price_lag) %>%
  gather(Variable, 
         Value, 
         -sale_price) %>%
  ggplot(aes(Value, 
             sale_price)) +
  ggplot2::geom_point(size = .5, alpha = 0.2) + 
  geom_smooth(method = "lm", se=F) +
  scale_color_manual(values = plt_pal) +
  facet_wrap(~Variable, ncol = 2, scales = "free") +
  labs(title = "Price as a function of 4 variables of interest",
       y = "Home Sale Price", x = "",
       caption = "Figure 1") +
  theme_minimal()

In general, this is not ideal, as one of the basic assumptions of OLS regression (see below) is a linear relationship between the dependent variable and the predictors. Due to time constraints, however, we did not transform these variables to address this, which is reflected in the limited accuracy of our model.

Similarly, we note that our dependent variable, sale price, is not normally distributed. Ideally, we would apply a log transformation to normalize it and improve model performance, but ran out of time before successfully implementing this transformation. This would be a path for improvement in a future iteration of this model.
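
As a rough sketch of what that future iteration could look like (illustrative only, and assuming sale prices have already been restricted to positive values), one would fit the model on the log of sale price and exponentiate the predictions back to dollars:

# illustrative sketch: regress log(sale_price) on a few predictors from reg_data
log_fit <- lm(log(sale_price) ~ total_livable_area + number_of_bathrooms + price_lag,
              data = reg_data)

# naive back-transformation to dollars; a bias correction (e.g., Duan's smearing)
# would be more rigorous than plain exponentiation
pred_dollars <- exp(predict(log_fit, newdata = reg_data))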

Show the code
untreated <- ggplot(data) +
  geom_histogram(aes(x = sale_price) )+
  labs(title = "Untreated Sale Price",
       x = "Price",
       y = "Count")

treated <- ggplot(reg_data) +
  geom_histogram(aes(x = sale_price)) +
  scale_x_log10() +
  labs(title = "Log-Transformed Sale Price",
       x = "Price",
       y = "Count")

ggarrange(untreated, treated, ncol = 2)

Feature Engineering

We observed some impossible or missing values in our data, namely values of zero or one for total_area and total_livable_area, missing values for the spatial lag of price, zeros for the numbers of rooms, bedrooms, and bathrooms, and zeros for year_built. Though we knew why some of these values were missing (e.g., the spatial lag of price is missing for houses that had a challenge-set house among their 15 nearest neighbors, since we treated the sale prices of those houses as missing for this calculation), we had to get creative for the other cases. The solutions we implemented were recoding impossible values as NA and imputing more reasonable values (for total_area and total_livable_area), and replacing zeros with the mean value of that variable (for year_built and the numbers of rooms, bathrooms, and bedrooms).
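
A simplified sketch of the zero-to-mean rule described above (our actual pipeline imputes several of these fields from the five nearest neighbors instead, as in the wrangling code earlier); data_clean is a hypothetical name:

# sketch: recode impossible zeros as NA, then impute the column mean
clean_zeros_to_mean <- function(x) {
  x[x == 0] <- NA
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

data_clean <- data %>%
  mutate(across(c(year_built, number_of_rooms, number_of_bathrooms, number_of_bedrooms),
                clean_zeros_to_mean))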

For categorical variables such as building type price class, quality grade, and neighborhood price class, we created dummy variables. Dummy variables break each category of a categorical variable into its own feature, coded as ones and zeros to indicate whether a given row belongs to that category; a small illustrative example follows. Below, we also provide summary and descriptive statistics of the final predictors we decided to include in our model.
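
For instance, a toy sketch of how fastDummies expands a categorical column into indicator columns (the toy data frame here is purely illustrative):

library(fastDummies)

toy <- data.frame(quality_grade = c("High", "Mid", "Lowest"))

# expands to quality_grade_High, quality_grade_Lowest, and quality_grade_Mid,
# each coded 1 where the row belongs to that category and 0 otherwise
dummy_cols(toy, select_columns = "quality_grade", remove_selected_columns = TRUE)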

Show the code
# # reorder numeric data
sum_stat_data <- reg_data %>% st_drop_geometry() %>% 
  dplyr::select(!matches("hood_price|building_type|quality|geometry|musa_id|sale_price")) %>% 
  dplyr::select(year_built,exterior_condition,number_stories,interior_condition,fireplaces,depth,number_of_bathrooms,
                number_of_rooms,total_livable_area,garage_spaces,
                off_street_open,dist_to_commerce,# amenities
                price_lag,tot_pop,pct_white,med_hh_inc) # spatial_process


# summary statistics for numeric variables
sum_stat_table <- sum_stat_data %>%
  select(where(is.numeric)) %>%
  psych::describe() %>% dplyr::select(mean:median,min,max) %>%
  mutate(Mean = as.character(round(mean)),
         SD = as.character(round(sd)),
         Median = as.character(round(median)),
         Min = as.character(min),
         Max = as.character(round(max))) %>%
  dplyr::select(-(mean:max)) %>%
  # filter(Max > 1) %>%
  as.data.frame()

# keep pctWhite rounded to nearest third
# sum_stat_pctwhite <- sum_stat_data %>%
#   dplyr::select(pct_white) %>%
#   psych::describe() %>% dplyr::select(mean:median,min,max) %>%
#   mutate(Mean = as.character(round(mean,3)),
#          SD = as.character(round(sd,3)),
#          Median = as.character(round(median,3)),
#          Min = as.character(min),
#          Max = as.character(round(max,3))) %>%
#   dplyr::select(-(mean:max)) %>%
#   as.data.frame()
# 
# sum_stat_table <- bind_rows(sum_stat_table[1:16,],
#                             sum_stat_pctwhite,
#                             sum_stat_table[17,])
# 
# 
sum_stat_table$Description <-
  c("",
    "Overall condition of the exterior",
    "",
    "Overall condition of the interior",
    "Number of fireplaces",
    "Depth, as measured from the principal street back to the rear property line or secondary street",
    "", "", 
    "Total liveable area in square feet",
    "Number of garage spaces",
    "", 
    "Distance to nearest commercial center",
    "Spatial lag of sale price, 15 nearest neighbors",
    "Total population, census tract level data",
    "Percent of population that is white, census tract level data",
    "Median household income, census tract level data")
# 
sum_stat_table %>%
  kbl(caption = "Table 1: Summary Statistics of numeric variables", align = rep("c", 8)) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 1: Summary Statistics of numeric variables
Mean SD Median Min Max Description
year_built 1937 51 1925 0 2024
exterior_condition 4 1 4 1 7 Overall condition of the exterior
number_stories 2 1 2 1 5
interior_condition 4 1 4 0 7 Overall condition of the interior
fireplaces 0 0 0 0 5 Number of fireplaces
depth 76 37 73 0 1115 Depth, as measured from the principal street back to the rear property line or secondary street
number_of_bathrooms 1 1 1 0 8
number_of_rooms 4 2 4 0 32
total_livable_area 1334 549 1208 0 15246 Total liveable area in square feet
garage_spaces 0 1 0 0 72 Number of garage spaces
off_street_open 2011 1709 1545 14 12685
dist_to_commerce 6 2 6 0 9 Distance to nearest commercial center
price_lag 269887 175766 228193 35440 1913110 Spatial lag of sale price, 15 nearest neighbors
tot_pop 4690 1689 4473 0 9146 Total population, census tract level data
pct_white 0 0 0 0 9 Percent of population that is white, census tract level data
med_hh_inc 54741 25363 50227 9276 155000 Median household income, census tract level data
Show the code
# 
# # descriptive statistics for categorical data
# # data$building_code_description_new <- 
# 
# data %>% 
#   st_drop_geometry() %>%
#   group_by(building_type_price_class) %>%
#   summarize(row_count = n()) %>%
#   mutate(building_type_price_class = factor(building_type_price_class, levels = c("least expensive", "less expensive", "expensive", "more expensive", "most expensive")),
#     Description = c("Building types with average sale price of under 200,000 USD",
#                          "Building types with average sale price between 200,000.01 and 400,000 USD",
#                          "Building types with average sale price between 400,000.01 and 600,000 USD",
#                          "Building types with average sale price between 600,000.01 and 800,000 USD",
#                          "Building types with average sale price of over 800,000.01 USD")) %>% 
#   kbl(caption = "Table 2: Descriptive statistics of sales price classes of building type", 
#       align = rep("c", 8), col.names = c("Building Type", "Row count", "Description")) %>%
#   kable_classic(full_width = F, html_font = "Cambria")
Show the code
data %>%
  st_drop_geometry() %>%
  dplyr::select(hood_price_class) %>%
  mutate(hood_price_class =
           factor(hood_price_class,
           levels = c("least expensive","less expensive","expensive","more expensive","most expensive"))) %>%
  group_by(hood_price_class) %>%
  summarize(count = n()) %>%
  mutate(Description = c("Houses in a neighborhood with average sale price of under 100,000 USD",
                         "Houses in a neighborhood with average sale price between 100,000.01 and 250,000 USD",
                         "Houses in a neighborhood with average sale price between 240,000.01 and 500,000 USD",
                         "Houses in a neighborhood with average sale price between 500,000.01 and 750,000 USD",
                         "Houses in a neighborhood with average sale price of over 750,000.01 USD")) %>%
  kbl(caption = "Table 3: Descriptive statistics of sales price classes of nieghborhoods",
      align = rep("c", 8), col.names = c("Neighborhood", "Row count", "Description")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 3: Descriptive statistics of sales price classes of neighborhoods
Neighborhood Row count Description
least expensive 1763 Houses in a neighborhood with average sale price of under 100,000 USD
less expensive 11746 Houses in a neighborhood with average sale price between 100,000.01 and 250,000 USD
expensive 8301 Houses in a neighborhood with average sale price between 250,000.01 and 500,000 USD
more expensive 1445 Houses in a neighborhood with average sale price between 500,000.01 and 750,000 USD
most expensive 626 Houses in a neighborhood with average sale price of over 750,000.01 USD
Show the code
data %>%
  st_drop_geometry() %>%
  dplyr::select(quality_grade)  %>%
  mutate(quality_grade =
           factor(quality_grade,
           levels = c("Lowest","Mid","High"))) %>%
  group_by(quality_grade) %>%
  summarize(count = n()) %>%
  mutate(Description = c("Houses with a quality grade of \"X\"",
                         "Houses with a quality grade of \"X-\", \"A+\", \"A\", or \"A-\"",
                         "Houses with a quality grade of \"C\", \"C-\", \"D+\", \"D\", \"D-\", \"E+\", \"E\", or \"E-\"")) %>%
  kbl(caption = "Table 4: Descriptive statistics of quality grade classes",
      align = rep("c", 8), col.names = c("Quality Grade", "Row count", "Description")) %>%
  kable_classic(full_width = F, html_font = "Cambria")
Table 4: Descriptive statistics of quality grade classes
Quality Grade Row count Description
Lowest 776 Houses with a quality grade of "C", "C-", "D+", "D", "D-", "E+", "E", or "E-"
Mid 23045 Houses with any other quality grade
High 60 Houses with a quality grade of "X-", "A+", "A", or "A-"

Checking for Collinearity

When using numerous predictors in a model, it is important to check for collinearity, that is, how strongly correlated the predictors are with each other. We explored these relationships by examining a correlation matrix. As shown in Figure 2, some variables, such as the numbers of rooms, bedrooms, and bathrooms, are strongly collinear. There is also a high correlation among median household income, the spatial lag of price, and the percent of the population that is white. Taking note of these patterns, we kept our subset of predictors, since our subsequent variable-selection step accounts for collinearity.
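
Beyond eyeballing the correlation matrix, collinearity can also be quantified with variance inflation factors (VIFs). A minimal sketch using olsrr, where the model object fit and its handful of predictors are illustrative assumptions rather than our final specification:

library(olsrr)

# hypothetical model on a subset of predictors from reg_data
fit <- lm(sale_price ~ number_of_rooms + number_of_bathrooms + med_hh_inc +
            pct_white + price_lag + total_livable_area,
          data = reg_data)

# VIFs well above roughly 5 flag predictors that are largely explained by the others
ols_vif_tol(fit)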

Show the code
numeric_vars <- reg_data %>% 
  select(where(is.numeric))

corr <- round(cor(numeric_vars), 1)
p.mat <- cor_pmat(numeric_vars)

ggcorrplot(corr,
           colors = corr_mat_pal,
           #p.mat = cor_pmat(numeric_vars),
  type="lower",
  insig = "blank",
  tl.cex = 6) +  
  labs(title = "Correlation across numeric variables",
       caption = "Figure 2") 

Spatial Process

In building this model, it is essential to account for the spatial process underlying our features. It is clear, for example, that prices are stratified geographically and are higher in some neighborhoods than others. We account for this by including neighborhood price class as a feature in our dataset via dummy variables.

Show the code
buildtype_data <- data_to_dummy %>% 
  mutate(buildtype = factor(building_type_price_class, 
           levels = c("least expensive","less expensive","expensive","more expensive","most expensive")))

price_x_hood <- data %>%
  group_by(mapname) %>%
  summarize(avg_sale_price = mean(sale_price)) %>%
  st_drop_geometry()

price_x_hood <- left_join(hoods, price_x_hood) %>%
                  st_as_sf()

avg_price_hood_map <- tm_shape(hoods) +
  tm_borders() +
tm_shape(price_x_hood) +
  tm_polygons(col = "avg_sale_price", 
             title = "Avg. Sale Price ($)",
              style = "jenks", 
             palette = "viridis", 
             size = 0.2, 
             alpha = 1,
             textNA = "No Data"
      ) +
tm_layout(frame = FALSE,
          main.title = "Avg. Price by Neighborhood",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 3", fontface = "bold", align = "right")

avg_price_hood_map

Much of this spatial process can be explained by proximity to amenities (as the adage in real estate goes, “location, location, location”). To this end, we incorporate proximity to amenities by calculating the distance of each home to the nearest commercial corridor and incorporating this into our dataset.

Show the code
# # add geometry back in
# imp_data_geom <- cbind(imp_data, data_to_dummy %>% 
#                          dplyr::select(geometry)) %>% st_as_sf() %>% 
#   mutate(num_rooms = as.numeric(number_of_rooms))

tm_shape(hoods) +
  tm_borders() +
tm_shape(data) +
  tm_dots(col = "dist_to_commerce", 
             title = "Distance",
             style = "jenks", 
             palette = "-viridis", 
             size = 0.1, 
             alpha = 0.3
      ) +
tm_layout(frame = FALSE,
          main.title = "Distance to Nearest Commercial Corridor",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 4", fontface = "bold", align = "right") +
  tm_credits("Distances are log-transformed from US feet", fontface = "italic", align = "right")

Non-Spatial Features

Not all features are inherently spatial; some, like the type of home, exert a major influence on the value of a property regardless of location. In some cases, such as home type, we account for these as dummy variables.

Show the code
data %>%
  st_drop_geometry() %>%
  mutate(building_code_description_new = fct_reorder(building_code_description_new, sale_price, .fun='mean')) %>%
  ggplot()+
  geom_boxplot(aes(x = sale_price, y = building_code_description_new), lwd = 0.3, fill = NA, outlier.alpha = 0.1) +
  labs(title = "Avg. Sale Price by Type of Home",
       subtitle = "Figure 5",
       x = "Sale Price ($)") +
  theme(axis.title.y = element_blank())

In other cases, when dealing with ordered integer values for things like interior quality of the home, we retain them as numeric data.

Show the code
data %>%
  st_drop_geometry() %>%
  mutate(interior_condition = fct_reorder(as.factor(interior_condition), sale_price, .fun='mean')) %>%
  ggplot()+
  geom_boxplot(aes(x = sale_price, y = interior_condition), lwd = 0.3, fill = NA, outlier.alpha = 0.1) +
  labs(title = "Avg. Sale Price by Interior Condition Score",
       subtitle = "Figure 6",
       x = "Sale Price ($)",
       y = "Interior Condition")

That said, it is worth noting that even some of these features are implicitly spatial. For example, while home type is not an explicitly spatial feature, sections of neighborhoods are often built around the same time and in the same style, so building types tend to cluster spatially in Philadelphia. Given the limitations of an OLS regression, the most we can do to account for this in our model is to use dummy variables.

Show the code
tm_shape(hoods) +
  tm_borders() +
tm_shape(data) +
  tm_dots(col = "building_code_description_new", 
             title = "Type",
             style = "cat", 
             size = 0.075, 
             alpha = 0.5
      ) +
tm_layout(frame = FALSE,
          main.title = "Spatial Distribution of Building Types",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 7", fontface = "bold", align = "right")

Similarly, we find that total liveable area is higher in some neighborhoods than others, likely due to the housing typology, available space, and so forth. We account for this variable in our dataset as a continuous variable, but, again, it highlights the inadequacies of a non-spatial regression model.

Show the code
area_x_hood <- data %>%
  group_by(mapname) %>%
  summarize(avg_area = mean(total_livable_area)) %>%
  st_drop_geometry()

area_x_hood <- left_join(hoods, area_x_hood) %>%
                  st_as_sf()

tm_shape(hoods) +
  tm_borders() +
tm_shape(area_x_hood) +
  tm_polygons(col = "avg_area", 
             title = "Avg. Total Area",
              style = "jenks", 
             palette = "viridis", 
             size = 0.2, 
             alpha = 1,
             textNA = "No Data",
             border.alpha = 0
      ) +
tm_layout(frame = FALSE,
          main.title = "Avg. Total Area by Neighborhood",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 8", fontface = "bold", align = "right")

OLS Model

We used an ordinary least squares (OLS) regression model as our predictive model. The components of an OLS model are the dependent variable, the independent variable(s), a coefficient for each independent variable, an intercept, and an error term. In our case, the home sale price of the ith house is the dependent variable (\(Y_i\)), i.e., what we are trying to predict, and the predictors we put into the model are the independent variables (\(X_{1_i},X_{2_i},...,X_{n_i}\), where there are n predictors). The model estimates a coefficient (\(\beta_1,\beta_2,...,\beta_n\)) for each independent variable and an intercept (\(\beta_0\)), where each coefficient is the amount of change in sale price we expect for every one-unit change in the independent variable it is paired with. The intercept is the expected sale price given no knowledge of any of the predictors. The error term (\(\epsilon_i\)) represents the difference between the actual home sale price and the model’s prediction, and it varies for every observation.

\[ Y_i = \beta_0 + \beta_1 X_{1_i} + \beta_2 X_{2_i} + ... + \beta_n X_{n_i} + \epsilon_i \]
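
The coefficients are estimated by minimizing the sum of squared errors; in matrix form, with design matrix \(X\) (a column of ones plus the predictors) and outcome vector \(Y\), the closed-form solution is:

\[ \hat{\beta} = (X^{\top}X)^{-1}X^{\top}Y \]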

When running an OLS regression, it is crucial to ensure that the following five assumptions are not violated:

  1. A linear relationship between the dependent variable and predictors

  2. Normality of residuals

  3. Homoscedasticity

  4. Independence of observations (no spatial, temporal or other forms of dependence in the data)

  5. No multicollinearity

In the context of predictive modeling, however, we have somewhat greater leeway, since our goal is accurate out-of-sample prediction rather than unbiased inference on individual coefficients.

Finally, it is essential to recall that an OLS model does not account for spatial processes in the data. As a result, it is inevitable that we will see spatial autocorrelation not only in our dataset but in the residuals of our regression as well.
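
A sketch of how this could be checked directly on the residuals, using the same sfdep helpers we apply to sale price in the Results section (here fit is a hypothetical lm() fit on reg_data from which no rows were dropped):

library(sfdep)

# sketch: global Moran's I on OLS residuals
resid_check <- reg_data %>%
  mutate(resid = resid(fit),               # residuals from the hypothetical fit
         nb    = st_knn(geometry, k = 5),  # 5 nearest neighbours
         wt    = st_weights(nb))

global_moran_perm(resid_check$resid, resid_check$nb, resid_check$wt, nsim = 999)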

Results

Regression Results

The results of our regression are underwhelming. Based on k-fold cross-validation with k = 100 (i.e., 100 folds), we find a mean absolute percent error of around 36%. We also note that, based on our single hold-out iteration, several of our predictors have p-values above 0.05, suggesting that they are not statistically significant and do not contribute much to the model. Although the R-squared value is relatively high (0.74), our purpose here is prediction rather than inference, and by that standard this is not a strong model.
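
For reference, the two error metrics are defined below, where \(Y_i\) is the observed sale price, \(\hat{Y}_i\) the predicted price, and \(n\) the number of test-set homes. (Note that the single hold-out split reported just below computes the percentage error relative to the predicted price, while the cross-validation summary divides by the observed price.)

\[ MAE = \frac{1}{n}\sum_{i=1}^{n} |Y_i - \hat{Y}_i|, \qquad MAPE = \frac{100}{n}\sum_{i=1}^{n} \frac{|Y_i - \hat{Y}_i|}{Y_i} \]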

Show the code
# book's method of doing this
inTrain <- createDataPartition(
              y = paste(reg_data$fireplaces, reg_data$number_of_rooms,
                        reg_data$exterior_condition), 
              p = .60, list = FALSE) # p = .60 = split to 60/40 ratio for train/test
phl.training <- reg_data[inTrain,] # defining training set
phl.test <- reg_data[-inTrain,]    # defining testing set

reg.training <- lm(sale_price ~ ., data = as.data.frame(phl.training) %>% dplyr::select(-c(musa_id, geometry)))

# plot(reg.training) # useful plots to look at model fit

phl.test <- phl.test %>%
              mutate(sale_price.Predict = predict(reg.training, phl.test), 
               # ^^ actually adding the predicted values from model defined in reg.training
               sale_price.Error = sale_price.Predict - sale_price,
               sale_price.AbsError = abs(sale_price.Predict - sale_price),
               sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict)

mape1 <- mean(phl.test$sale_price.APE, na.rm = TRUE) # 0.2826403
mae1 <- mean(phl.test$sale_price.AbsError, na.rm = TRUE) # 69203.8

summ_reg.training <- summary(reg.training)
summ_coeff1 <- as.data.frame(summary(reg.training)$coefficients) # same as summ_coeff

footnote1 <- paste0("R-squared: ",round(summ_reg.training$r.squared,3),
                    "; adjusted R-squared: ",round(summ_reg.training$adj.r.squared,3),
                    "; F-statistic: ", round(summ_reg.training$fstatistic[1],3), 
                    " (df=", round(summ_reg.training$fstatistic[2],3), 
                    "; ", round(summ_reg.training$fstatistic[3],3), 
                    "); N = ", length(summ_reg.training$residuals),
                    "; Residual Std. Err.:", round(summ_reg.training$sigma,3), 
                    " (df=",summ_reg.training$df[2],")")

footnote2 <- "* p < 0.1; ** p < 0.05; *** p < 0.01"

summ_coeff1 %>% 
  mutate(`Sale Price` = round(Estimate,3),
         SE = round(`Std. Error`,3),
         `Sale Price` = case_when(
           `Pr(>|t|)` >= 0.1 ~ as.character(`Sale Price`),
           `Pr(>|t|)` < 0.1 & `Pr(>|t|)` >= 0.05 ~ paste0(`Sale Price`,"*"),
           `Pr(>|t|)` < 0.05 & `Pr(>|t|)` >= 0.01 ~ paste0(`Sale Price`,"**"),
           `Pr(>|t|)` < 0.01 ~ paste0(`Sale Price`,"***")
         )) %>% 
  dplyr::select(`Sale Price`,SE) %>% 
  kbl(caption = "Table 2: Summary of Training set LM", align = rep("c", 8)) %>%
  footnote(c(footnote1,footnote2)) %>% 
  kable_classic(full_width = F, html_font = "Cambria")
Table 2: Summary of Training set LM
Sale Price SE
(Intercept) -21266.881 79966.968
price_lag 0.576*** 0.014
garage_spaces -9185.909*** 1522.746
total_livable_area 158.371*** 2.800
med_hh_inc -0.039 0.085
number_of_rooms -4723.232*** 1088.208
number_of_bathrooms 31883.481*** 3093.785
depth -30.716 37.443
off_street_open 0.272 0.761
fireplaces 35787.141*** 4101.616
interior_condition -22828.521*** 2025.730
number_stories -5786.781** 2329.335
pct_white 54637.747*** 5469.618
exterior_condition -14620.009*** 2010.045
dist_to_commerce -1626.735** 712.827
year_built 89.906*** 31.243
tot_pop 2.11*** 0.730
hood_price_class_expensive -179896.876*** 9691.631
hood_price_class_least_expensive -196992.526*** 12617.184
hood_price_class_less_expensive -184395.913*** 11113.786
hood_price_class_more_expensive -147556.406*** 9028.989
building_type_price_class_expensive 81899.215 50169.386
building_type_price_class_least_expensive 39598.922 49875.521
building_type_price_class_less_expensive 43548.472 49819.853
building_type_price_class_more_expensive 64962.137 49966.230
quality_grade_high 90577.86*** 20872.406
quality_grade_lowest 14047.109** 6503.387
Note:
R-squared: 0.732; adjusted R-squared: 0.731; F-statistic: 1380.442 (df=26; 13174); N = 13201; Residual Std. Err.:129340.157 (df=13174)
* p < 0.1; ** p < 0.05; *** p < 0.01
Show the code
data.frame(MAE = sprintf('%.2f',round(mae1,3)),
           MAPE = round(mape1,3)) %>% 
  kbl(caption = "Table 3: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of one test set",
      align = c("c","c")) %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "10em") %>% 
  column_spec(2, width = "10em")
Table 3: Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) of one test set
MAE MAPE
67146.55 0.376
Show the code
customSummary <- function(data, lev = NULL, model = NULL) {
  mape <- mean((abs(data$obs - data$pred)) / data$obs) * 100
  mae <- mean(abs(data$obs - data$pred))
  rmse <- sqrt(mean((data$obs - data$pred)^2))
  rsq <- cor(data$obs, data$pred)^2
  out <- c(MAE = mae, RMSE = rmse, Rsquared = rsq, MAPE = mape)
  out
}

train_control <- trainControl(method = "cv",
                              number = 100,
                              summaryFunction = customSummary)

model <- train(sale_price ~ .,
               data = reg_data %>% dplyr::select(-c(musa_id, geometry)),
               trControl = train_control,
               method = "lm",
               na.action = na.exclude)

print_model <- as.data.frame(print(model))
Linear Regression 

23781 samples
   29 predictor

No pre-processing
Resampling: Cross-Validated (100 fold) 
Summary of sample sizes: 21668, 21669, 21669, 21670, 21669, 21670, ... 
Resampling results:

  MAE       RMSE      Rsquared  MAPE    
  68264.89  120863.9  0.74275   36.44254

Tuning parameter 'intercept' was held constant at a value of TRUE
Show the code
model_resample <- model$resample

data.frame(MeanMAE = print_model$MAE,
           SDMAE = round(sd(model_resample$MAE),3)) %>% 
  kbl(caption = "Table 4: Results of 100-fold Cross-Validation",
      align = c("c","c"), 
      col.names = c("Mean MAE", "SD of MAE")) %>% 
  kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "10em") %>% 
  column_spec(2, width = "10em")
Table 4: Results of 100-fold Cross-Validation
Mean MAE SD of MAE
68264.89 6625.005
Show the code
ggplot(data = model$resample) +
  ggplot2::geom_histogram(aes(x = MAE), bins = 12) +
  labs(title = "Distribution of MAE",
       subtitle = "K-Fold Cross Validation; k = 100",
       x = "Mean Absolute Error",
       y = "Count",
       caption = "Figure 10")

Plotting predicted against observed sale prices, we notice that, as sale price increases, the absolute error increases as well.

Show the code
phl.test %>%
  dplyr::select(sale_price.Predict, sale_price) %>%
    ggplot(aes(sale_price, sale_price.Predict)) +
  ggplot2::geom_point(alpha = 0.4) +
  geom_abline(color = pal_full[3]) +
  geom_smooth(method = "lm", se = FALSE, color = pal_full[3], lwd = .5, lty = "dashed") +
  labs(title="Predicted sale price as a function of observed price",
       subtitle="Solid line indicates a hypothetical perfect prediction, \ndashed line indicates slope of actual predictions",
       caption = "Figure 11",
       x = "Observed Sale Price", y = "Predicted Sale Price") +
  theme_minimal()

Spatial error appears to cluster in parts of North and West Philadelphia.

Show the code
spatial_resids <- reg_data %>%
                    na.omit() %>%
                    st_as_sf()

spatial_resids$residuals <- model$finalModel$residuals
spatial_resids <- spatial_resids %>%
                        mutate(abs_residuals = abs(residuals),
                               pct_residuals = abs_residuals / sale_price * 100) %>%
                        st_as_sf() %>%
                        st_join(hoods)

tm_shape(hoods) +
  tm_borders() +
tm_shape(spatial_resids %>% mutate(pct_residuals = log(pct_residuals + 1))) +
  tm_dots(col = "pct_residuals", 
             title = "Pct. Residual",
             style = "jenks", 
             palette = "viridis", 
             size = 0.05, 
             alpha = 0.5
      ) +
tm_layout(frame = FALSE,
          main.title = "Pct. Residual per Observation",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 12", fontface = "bold", align = "right") +
  tm_credits("Pct. Residuals are log-transformed", fontface = "italic", align = "right")

Show the code
avg_pct_resid_x_hood <-  spatial_resids %>%
    group_by(mapname) %>%
    summarize(mean_resid = mean(pct_residuals)) %>%
  st_drop_geometry()

avg_pct_resid_x_hood <- left_join(hoods, avg_pct_resid_x_hood) %>%
                          st_as_sf()
  
tm_shape(hoods) +
  tm_borders() +
tm_shape(avg_pct_resid_x_hood) +
  tm_polygons(col = "mean_resid", 
             title = "Avg. Pct. Residual",
              style = "jenks", 
             palette = "viridis", 
             size = 0.2, 
             alpha = 1,
             textNA = "No Data",
             border.alpha = 0
      ) +
tm_layout(frame = FALSE,
          main.title = "Avg. Pct. Residual by Neighborhood",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 13", fontface = "bold", align = "right")

Running a global Moran’s I test, we find that sale price is spatially autocorrelated, confirming that spatial dependence affects the quality of our model.
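
For a variable \(x\) observed at \(n\) locations with spatial weights \(w_{ij}\), global Moran’s I is defined as:

\[ I = \frac{n}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}(x_i - \bar{x})(x_j - \bar{x})}{\sum_{i}(x_i - \bar{x})^2} \]

Values near zero indicate spatial randomness, while values approaching 1 indicate strong positive spatial autocorrelation.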

Show the code
# adapted from code chunk: autocorr
data_nb <- data %>% 
            mutate(nb = st_knn(geometry, k = 5),
                   wt = st_weights(nb),
                   .before = 1)


# calculate global moran's i for sale_price
glbl_moran <- global_moran(data_nb$sale_price, data_nb$nb, data_nb$wt)$`I` # 0.5687531

# moran monte carlo
moranMC = global_moran_perm(data_nb$sale_price, data_nb$nb, data_nb$wt, alternative = "two.sided", 999)
moranMC

    Monte-Carlo simulation of Moran I

data:  x 
weights: listw  
number of simulations + 1: 1000 

statistic = 0.61044, observed rank = 1000, p-value <
0.00000000000000022
alternative hypothesis: two.sided
Show the code
moranMCres = moranMC$res |>
  as.data.frame() %>% 
  rename(moranMCres = 1)

ggplot(moranMCres) +
  geom_histogram(aes(moranMCres), bins = 100) +
  geom_vline(xintercept = glbl_moran, color = pal_full[5]) +
  labs(title = "Observed and Permuted Moran's I",
       subtitle = "Red line indicates the observed Moran's I",
       x = "Moran's I",
       y = "Count",
       caption = "Figure 15")

Below, we plot observed sale price against its spatial lag (a Moran scatterplot).

Show the code
# this can be made prettier
spdep::moran.plot(data_nb$sale_price, nb2listw(data_nb$nb),
           xlab = "Sale Price", 
           ylab = "Spatial Lag")

Show the code
set.seed(40)

inTrain2 <- createDataPartition(
              y = paste(reg_data$fireplaces, reg_data$number_of_rooms,
                        reg_data$exterior_condition), 
              p = .60, list = FALSE) # p = .60 = split to 60/40 ratio for train/test
phl.training2 <- reg_data[inTrain2,] # defining training set
phl.test2 <- reg_data[-inTrain2,]    # defining testing set

reg.training2 <- 
  lm(sale_price ~ ., data = as.data.frame(phl.training2) %>% 
                             dplyr::select(-c(musa_id, geometry)))

# plot(reg.training2) # useful plots to look at model fit

phl.test2 <-
  phl.test2 %>%
  mutate(sale_price.Predict = predict(reg.training2, phl.test2), 
         # ^^ actually adding the predicted values from model defined in reg.training2
         sale_price.Error = sale_price.Predict - sale_price,
         sale_price.AbsError = abs(sale_price.Predict - sale_price),
         sale_price.APE = (abs(sale_price.Predict - sale_price)) / sale_price.Predict) %>% 
  left_join(dummied_data %>% dplyr::select(musa_id, geometry), by = "musa_id") %>% st_sf()

mape2 <- mean(phl.test2$sale_price.APE) # 0.3139695
mae2 <- mean(phl.test2$sale_price.AbsError) # 69252.47

summ_reg.training2 <- summary(reg.training2)
summ_coeff2 <- as.data.frame(summary(reg.training2)$coefficients)

ggplot() +
  geom_sf(data = hoods, color = "white", fill = "#f0cab9") + 
  geom_sf(data = phl.test2, aes(colour = q5(sale_price.Predict)), 
          show.legend = "point", size = .25) +
  scale_colour_manual(values = mypalette1,
                      labels=qBr(phl.test2,"sale_price.Predict"),
                      name="Predicted Sale Price") +
  labs(title = "Map of Predicted Sale Price for Test Set",
       subtitle="Test Set includes both modelling and challenge data; Philadelphia, PA", 
       caption = "Figure 17") +
  mapTheme()

Compared to the map of observed sale prices, this map of predicted sale prices (including both modelling and challenge data) shows that our model generally predicts prices well but over-predicts sale price in Northeast Philadelphia and the city’s northwestern tip. Indeed, mean absolute percent error appears to cluster in these neighborhoods, suggesting spatial autocorrelation in our residuals.

Show the code
# code adapted from resids code chunk + book
# by "test set" do they mean the new test set with both modelling and challenge data?
# proceeding with old test set (only modelling) for now

mape_data <- phl.test2 %>%
              dplyr::select(sale_price,sale_price.Predict,sale_price.APE,musa_id) %>%
              left_join(dummied_data %>% dplyr::select(musa_id, geometry), by = "musa_id") %>%
              st_sf()

mape_data <- st_join(mape_data, hoods)

mape_x_hood <- st_drop_geometry(mape_data) %>%
  group_by(mapname) %>%
  summarize(mean.MAPE = mean(sale_price.APE, na.rm = T),
            mean.price = mean(sale_price, na.rm = T),
            mean.pred.price = mean(sale_price.Predict, na.rm = T)) 

mapXHood <- mape_x_hood%>%
  ungroup() %>% 
  left_join(hoods) %>%
    st_sf()

tm_shape(hoods) +
  tm_borders() +
tm_shape(mapXHood) +
  tm_polygons(col = "mean.MAPE", 
             title = "MAPE",
             style = "jenks", 
             palette = "viridis", 
             size = 0.1, 
             alpha = 0.5
      ) +
tm_layout(frame = FALSE,
          main.title = "Mean test set MAPE by neighborhood",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 18 ", fontface = "bold", align = "right")

Discussion

Accuracy

As discussed above in the results section, the model is not particularly accurate. With a MAPE of around 36%, this would not be a credible tool for predicting home prices in Philadelphia. Much of this is likely due to the inadequacy of a non-spatial regression model, but it is also greatly affected by poor variable selection. Transforming variables to achieve normality and dropping variables with p-values above 0.05 would likely improve model performance.

Generalizability

In addition to its low accuracy, we find that our model generalizes poorly across different kinds of neighborhoods. It is evident, for example, that the model performs better in wealthier neighborhoods: mean absolute percent error decreases as a function of price.

Show the code
st_drop_geometry(mape_x_hood) %>%
  ggplot(aes(mean.price, mean.MAPE)) +
  ggplot2::geom_point(alpha = 0.4) + 
  geom_smooth(method = "lm", formula = y ~ log(x+1), se=FALSE, colour = pal_full[5], lwd = 0.5) +
  labs(title = "Mean MAPE by neighborhood as a function of mean price by neighborhood",
       y = "Mean MAPE by neighborhood", 
       x = "Mean price by neighborhood",
       caption = "Figure 19")

Show the code
# possibly add neighborhood names of extreme outliers

Indeed, when mean absolute percent error is compared between high-income and low-income neighborhoods, we see a difference of roughly 23 percentage points.

Show the code
st_join(mape_data, split_data) %>% 
  filter(!is.na(incomeContext)) %>% 
  group_by(incomeContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  # spread(incomeContext, mean.MAPE) %>%
  kbl(caption = "Table 5: Test set MAPE by neighborhood income context", align = rep("c", 8),
      col.names = c("Income Context", "Average MAPE")) %>%
    # footnote("70 rows of data were missing median household income and are therefore excluded from this table.") %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "12em") %>% column_spec(2, width = "12em")
Table 5: Test set MAPE by neighborhood income context
Income Context Average MAPE
High Income 22%
Low Income 45%

Given the strongly spatially autocorrelated nature of income in Philadelphia, this is unsurprising, but nevertheless necessary to account for.

Show the code
income <- tm_shape(na.omit(split_data)) +
  tm_polygons(col = "incomeContext", 
             title = "",
             style = "jenks", 
             palette = "-viridis", 
             size = 0.1, 
             alpha = 0.6,
             border.alpha = 0,
             border.col = NA
      ) + 
 tm_shape(hoods) +
 tm_borders(lwd = 1) +
tm_layout(frame = FALSE,
          main.title = "Income Across Philadelphia",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 20 ", fontface = "bold", align = "right")

income

Additionally, we find that the model generalizes poorly across racial lines.

Show the code
st_join(mape_data, split_data) %>% 
  group_by(raceContext) %>%
  summarize(mean.MAPE = scales::percent(mean(sale_price.APE, na.rm = T))) %>%
  st_drop_geometry() %>%
  kbl(caption = "Table 6: Test set MAPE by neighborhood race context", align = rep("c", 8),
      col.names = c("Race Context", "Average MAPE")) %>%
    # footnote("70 rows of data were missing median household income and are therefore excluded from this table.") %>% 
    kable_classic(full_width = F, html_font = "Cambria") %>% 
  column_spec(1, width = "12em") %>% column_spec(2, width = "12em")
Table 6: Test set MAPE by neighborhood race context
Race Context Average MAPE
Majority Non-White 46%
Majority White 22%

And once again, this is unsurprising given the strong spatial autocorrelation of race in Philadelphia.

Show the code
race <- tm_shape(na.omit(split_data)) +
  tm_polygons(col = "raceContext", 
             title = "",
             style = "jenks", 
             palette = "viridis", 
             size = 0.1, 
             alpha = 0.6,
             border.alpha = 0
      ) + 
 tm_shape(hoods) +
 tm_borders(lwd = 1) +
tm_layout(frame = FALSE,
          main.title = "Race Across Philadelphia",
          legend.outside = TRUE
          ) +
  tm_credits("Philadelphia, PA", fontface = "italic", align = "right") +
  tm_credits("Figure 21", fontface = "bold", align = "right")

race

In fact, comparing race, income, and average sale price in Philadelphia, one finds that all three appear to be correlated.

Show the code
tmap_arrange(avg_price_hood_map, income, race, ncol = 2)

Possible Sources of Flaws

As mentioned above in the methods section, OLS regression entails several assumptions, including (but not only) 1) a linear relationship between the dependent variable and its predictors, 2) no multicollinearity, and 3) no spatial dependence in the data. Frankly, it is evident that our model violates several of these assumptions. Much of this can be attributed to two sources:

  1. First, our data wrangling did not sufficiently ensure a linear relationship between the dependent variable and its predictors. Clearly, some predictors, such as distance to the nearest commercial corridor, had a non-linear relationship with sale price. We also failed to adequately address multicollinearity and to select the most parsimonious set of predictors.

  2. Second, and more fundamentally, there are limits to the predictive power of OLS when dealing with spatial data. Given the evident spatial autocorrelation among all of our predictors, we believe that, without switching to a regression model that accounts for spatial processes, there is a limit on how much error can be eliminated from the model.

Thus, to improve the accuracy of predictions, we recommend 1) better selection and manipulation of input variables, and 2) the use of a predictive model that accounts for spatial dependencies in the data.

Conclusion

This model is neither accurate nor generalizable. Although there is no specific threshold for accuracy, per se, a MAPE of roughly 36% indicates that this model does not predict home prices well in any neighborhood. Furthermore, given our analysis of the divergent accuracy across neighborhoods with different racial or economic characteristics, we find that this model does not generalize well, either.

In sum, we would not recommend this model to Zillow. Even discounting the issue of spatial dependence, it predicts poorly. Improving it would require first and foremost a more parsimonious choice of predictor variables. Additionally, it would be essential to incorporate the spatial dependence of variables through the use of something like spatial lag regression, spatial error regression, or geographically-weighted regression.
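
As one possible direction, the sketch below fits a spatial lag model with the spdep and spatialreg packages; the formula, the object name reg_data_sf (an sf point layer with these columns), and the choice of five nearest neighbours are all illustrative assumptions rather than a tested specification:

library(spdep)
library(spatialreg)
library(sf)

# build a spatial weights list from the 5 nearest neighbours of each home
coords <- st_coordinates(reg_data_sf)
listw  <- nb2listw(knn2nb(knearneigh(coords, k = 5)))

# spatial lag model: sale price depends on covariates plus the spatially lagged price
lag_fit <- lagsarlm(sale_price ~ total_livable_area + number_of_bathrooms + pct_white,
                    data = reg_data_sf, listw = listw)
summary(lag_fit)

Note that maximum-likelihood estimation of spatial lag models can be slow for tens of thousands of points; a geographically-weighted regression (e.g., via the GWmodel package) would be another option.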

Footnotes

  1. Michael Neal, Sarah Strochak, Linna Zhu, and Caitlin Young, “How Automated valuation models can disproportionately affect majority-Black neighborhoods,” Urban Institute, December, 2020.↩︎

  2. Alan S. Dornfest (Chair), Bill Marchand, Doug Warr, August Dettbarn, Wayne Forde, Joshua Myers, and Carol Neihardt, Standard on Automated Valuation Models (AVMs), International Association of Assessing Officers, July 2018.↩︎

  3. Azavea, geo-data, (2012), GitHub repository, https://raw.githubusercontent.com/azavea/geo-data/master/Neighborhoods_Philadelphia/↩︎