Introduction

For this real-estate price prediction project, our team was given a dataset scraped from zillow.com that included features of apartment/condo listings such as the number of bedrooms, the square footage, the address, and some less relevant information. We were tasked with cleaning this dataset, which had many missing values, and building the most accurate machine learning model possible from it.

pacman::p_load(dplyr, tidyr, ggplot2, magrittr, stringr, mlr)
housing_data = read.csv("housing_data_2016_2017.csv")

###Delete features that are irrelevant to sale price

From a quick look we identified many features that were going to be irrelevant to the model so we removed them from the start. Listing price was removed because we believed that would be an unfair feature if we’re looking to build a model to value properties. The model could be used to get an idea of a fair listing price for a property.

housing_data <- housing_data[29:57] %>%
  select(-c(date_of_sale, model_type, listing_price_to_nearest_1000, url))

Cleaning the Data

In this block of code we cleaned the addresses of the listings, converted some features to binary variables, and performed other general cleaning of the dataset.

housing_data %<>%
  mutate( zip_code = str_extract(full_address_or_zip_code, "[0-9]{5}")) %>%
  mutate(dogs_allowed = ifelse(substr(housing_data$dogs_allowed, 1, 3) == "yes", 1, 0)) %>%
  mutate(cats_allowed = ifelse(substr(housing_data$cats_allowed, 1, 3) == "yes", 1, 0)) %>%
  mutate(coop = as.integer(coop_condo == 'co-op')) %>%
  mutate(condo = as.integer(coop_condo == 'condo')) %>%
  select(-coop_condo)

d = housing_data
d %<>%
  mutate(maintenance_cost = sjmisc::rec(maintenance_cost, rec = "NA = 0 ; else = copy")) %<>%
  mutate(common_charges = sjmisc::rec(common_charges, rec = "NA = 0 ; else = copy"))##recode from NA to 0.
# combine maintaince cost and common charges
d %<>% 
  mutate( monthly_cost = common_charges + maintenance_cost)
d %<>%
  mutate(monthly_cost = sjmisc::rec(monthly_cost, rec = "0 = NA ; else = copy"))
## convert garage_exists feature to binary
d %<>%
  mutate(garage_exists = sjmisc::rec(garage_exists, rec = "NA = 0 ; else = copy")) ##recode from NA to 0. 
d %<>%
  mutate(garage_exists = sjmisc::rec(garage_exists, rec = " eys = 1; UG = 1 ; Underground = 1; yes = 1 ; Yes = 1 ; else = copy")) ##recode from NA to 0.
## One or more of the old values are recoded into identical new values.
## Please check if you correctly specified the recode-pattern,
## else separate multiple values with comma, e.g. rec="a,b,c=1; d,e,f=2".
d %<>%
  select(-c(maintenance_cost , common_charges))

Added latitude and longitude features using ggmap

We had the idea that latitudes and longitudes would be a better measure of location than zip codes or addresses, so we engineered those features using ggmap. We could then remove the original location related features. We also engineered a feature that calculates the distance from each coop/condo to the nearest LIRR station, as properties nearest to LIRR stations might be more desirable to those working in Manhattan.

#Already run and included in the data
#pacman::p_load(ggmap)
#d %<>%
#  mutate(lat = geocode(full_address_or_zip_code)$lat, lon = #geocode(full_address_or_zip_code)$lon )
#geocoordinates for relevant LIRR stations
lirr_coord = read.csv("coord.csv")
RAD_EARTH = 3958.8
degrees_to_radians = function(angle_degrees){
  for(i in 1:length(angle_degrees))
    angle_degrees[i] = angle_degrees[i]*pi/180
  return(angle_degrees)
}
compute_globe_distance = function(destination, origin){
  destination_rad = degrees_to_radians(destination)
  origin_rad = degrees_to_radians(origin)
  delta_lat = destination_rad[1] - origin_rad[1]
  delta_lon = destination_rad[2] - origin_rad[2]
  h = (sin(delta_lat/2))^2 + cos(origin_rad[1]) * cos(destination_rad[1]) * (sin(delta_lon/2))^2
  central_angle = 2 * asin(sqrt(h))
  return(RAD_EARTH * central_angle)
}
#find the closest LIRR station and compute distance
shortest_lirr_distance = function(all_lirr_coords, house_coords){
  shortest_dist = Inf
  for (i in 1: nrow(all_lirr_coords)){
    ith_lirr = c(all_lirr_coords$lat[i], all_lirr_coords$lon[i])
    new_dist = compute_globe_distance(ith_lirr, house_coords)
    if( new_dist < shortest_dist){
      shortest_dist = new_dist
    }
  }
  return(shortest_dist)
}
d %<>%
  rowwise() %>%
  mutate(shortest_dist = shortest_lirr_distance(lirr_coord, c(lat, lon)) ) %>%
  select(-c(zip_code, full_address_or_zip_code, community_district_num))

str(d)
## Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame':  2230 obs. of  24 variables:
##  $ approx_year_built     : int  1955 1955 2004 2002 1949 1938 1950 1960 1960 2005 ...
##  $ cats_allowed          : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ dining_room_type      : Factor w/ 5 levels "combo","dining area",..: 1 3 1 1 1 1 1 NA NA 5 ...
##  $ dogs_allowed          : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ fuel_type             : Factor w/ 6 levels "electric","gas",..: 2 4 NA 2 2 4 2 2 4 NA ...
##  $ garage_exists         : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ kitchen_type          : Factor w/ 4 levels "combo","eat in",..: 2 2 3 2 2 2 3 3 2 2 ...
##  $ num_bedrooms          : int  2 1 1 3 2 2 1 0 1 1 ...
##  $ num_floors_in_building: int  6 7 1 NA 2 6 NA 2 NA 4 ...
##  $ num_full_bathrooms    : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ num_half_bathrooms    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ num_total_rooms       : int  5 4 3 5 4 4 3 2 4 3 ...
##  $ parking_charges       : Factor w/ 90 levels " NA ","100","105",..: 1 1 1 1 1 1 1 1 41 1 ...
##  $ pct_tax_deductibl     : int  NA NA NA NA 39 NA NA NA NA NA ...
##  $ sale_price            : Factor w/ 316 levels " NA ","100000",..: 107 113 33 252 119 126 38 8 94 250 ...
##  $ sq_footage            : int  NA 890 550 NA 675 1000 NA 375 NA 681 ...
##  $ total_taxes           : Factor w/ 294 levels " NA ","100","1024",..: 1 1 255 68 1 1 1 1 1 19 ...
##  $ walk_score            : int  82 89 90 94 71 90 72 93 70 98 ...
##  $ lat                   : num  40.7 40.8 40.7 40.8 40.7 ...
##  $ lon                   : num  -73.8 -73.8 -73.9 -73.8 -73.7 ...
##  $ coop                  : int  1 1 0 0 1 1 1 1 1 0 ...
##  $ condo                 : int  0 0 1 1 0 0 0 0 0 1 ...
##  $ monthly_cost          : num  767 604 167 275 660 932 660 514 781 NA ...
##  $ shortest_dist         : num  0.796 0.694 1.096 0.303 2.778 ...

From an overview we identified features that needed the variable types to be changed

d$garage_exists = as.character(d$garage_exists)
d$garage_exists = as.integer(d$garage_exists)
d$parking_charges = as.character(d$parking_charges) 
d$parking_charges = as.numeric(d$parking_charges)
## Warning: NAs introduced by coercion
d$sale_price = as.character(d$sale_price)
d$sale_price = as.numeric(d$sale_price)
## Warning: NAs introduced by coercion
d$total_taxes = as.character(d$total_taxes) 
d$total_taxes = as.numeric(d$total_taxes)
## Warning: NAs introduced by coercion
str(d)
## Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame':  2230 obs. of  24 variables:
##  $ approx_year_built     : int  1955 1955 2004 2002 1949 1938 1950 1960 1960 2005 ...
##  $ cats_allowed          : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ dining_room_type      : Factor w/ 5 levels "combo","dining area",..: 1 3 1 1 1 1 1 NA NA 5 ...
##  $ dogs_allowed          : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ fuel_type             : Factor w/ 6 levels "electric","gas",..: 2 4 NA 2 2 4 2 2 4 NA ...
##  $ garage_exists         : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ kitchen_type          : Factor w/ 4 levels "combo","eat in",..: 2 2 3 2 2 2 3 3 2 2 ...
##  $ num_bedrooms          : int  2 1 1 3 2 2 1 0 1 1 ...
##  $ num_floors_in_building: int  6 7 1 NA 2 6 NA 2 NA 4 ...
##  $ num_full_bathrooms    : int  1 1 1 2 1 1 1 1 1 1 ...
##  $ num_half_bathrooms    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ num_total_rooms       : int  5 4 3 5 4 4 3 2 4 3 ...
##  $ parking_charges       : num  NA NA NA NA NA NA NA NA 20 NA ...
##  $ pct_tax_deductibl     : int  NA NA NA NA 39 NA NA NA NA NA ...
##  $ sale_price            : num  228000 235500 137550 545000 241700 ...
##  $ sq_footage            : int  NA 890 550 NA 675 1000 NA 375 NA 681 ...
##  $ total_taxes           : num  NA NA 5500 2260 NA NA NA NA NA 1320 ...
##  $ walk_score            : int  82 89 90 94 71 90 72 93 70 98 ...
##  $ lat                   : num  40.7 40.8 40.7 40.8 40.7 ...
##  $ lon                   : num  -73.8 -73.8 -73.9 -73.8 -73.7 ...
##  $ coop                  : int  1 1 0 0 1 1 1 1 1 0 ...
##  $ condo                 : int  0 0 1 1 0 0 0 0 0 1 ...
##  $ monthly_cost          : num  767 604 167 275 660 932 660 514 781 NA ...
##  $ shortest_dist         : num  0.796 0.694 1.096 0.303 2.778 ...

Exploratory Data Analysis

We now may be able to convert the entire dataset to numeric, so that it will perform better with regression models. We will examine the remaining categorical variables as the first step in our exploratory data analysis.

table(d$dining_room_type, useNA = 'always')
## 
##       combo dining area      formal        none       other        <NA> 
##         957           2         620           2         201         448

We can combine dining area, none, and the NAs into the other category.

other <- d %>%
  filter(is.na(dining_room_type) | dining_room_type == 'none' | dining_room_type == 'dining area' | dining_room_type == 'other') %>%
  mutate(dining_room_type = "other")

non_na <- d %>%
  filter(dining_room_type == 'combo' | dining_room_type == 'formal')

d <- rbind(other, non_na)
d$dining_room_type <- as.factor(d$dining_room_type)
table(d$fuel_type, useNA = 'always')
## 
## electric      gas     none      oil    other    Other     <NA> 
##       62     1348        3      664       40        1      112

We can do a similar process for fuel type that we did for dining room type, compressing the number of categories.

other <- d %>%
  filter(is.na(fuel_type) | fuel_type == 'none' | fuel_type == 'other' | fuel_type == 'Other') %>%
  mutate(fuel_type = "other")

non_na <- d %>%
  filter(fuel_type == 'electric' | fuel_type == 'gas' | fuel_type == 'oil')

d <- rbind(other, non_na)
d$fuel_type <- as.factor(d$fuel_type)
table(d$kitchen_type, useNA = 'always')
## 
##      combo     eat in efficiency       none       <NA> 
##        399        942        849         23         17

We are going to assume a combo kitchen and an eat in kitchen are one in the same, and will go with the mode for missing values (combo)

combo <- d %>%
  filter(is.na(kitchen_type) | kitchen_type == 'none' | kitchen_type == 'combo' | kitchen_type == 'eat in') %>%
  mutate(kitchen_type = 'combo')

efficiency <- d %>%
  filter(kitchen_type == 'efficiency')

d <- rbind(combo, efficiency)
d$kitchen_type <- as.factor(d$kitchen_type)

Now to convert the categorical variables to dummy variables, and combine them to the dataset:

d <- cbind(model.matrix( ~ dining_room_type - 1, d), model.matrix( ~ fuel_type - 1, d), model.matrix( ~ kitchen_type - 1, d), d)

d %<>%
  select(-c(dining_room_type, fuel_type, kitchen_type))

Now we will take a look at the missingness of the remaining features

cols <- colnames(d)
for (i in 1:length(d)) {
  cat(cols[i])
  cat('\n')
  cat(sum(is.na(d[i])))
  cat('\n')
}
## dining_room_typecombo
## 0
## dining_room_typeformal
## 0
## dining_room_typeother
## 0
## fuel_typeelectric
## 0
## fuel_typegas
## 0
## fuel_typeoil
## 0
## fuel_typeother
## 0
## kitchen_typecombo
## 0
## kitchen_typeefficiency
## 0
## approx_year_built
## 39
## cats_allowed
## 0
## dogs_allowed
## 0
## garage_exists
## 0
## num_bedrooms
## 115
## num_floors_in_building
## 650
## num_full_bathrooms
## 0
## num_half_bathrooms
## 2058
## num_total_rooms
## 2
## parking_charges
## 1671
## pct_tax_deductibl
## 1754
## sale_price
## 1702
## sq_footage
## 1210
## total_taxes
## 1646
## walk_score
## 0
## lat
## 0
## lon
## 0
## coop
## 0
## condo
## 0
## monthly_cost
## 100
## shortest_dist
## 0

A quick overview of the missingness in the dataset showed us a few things. A majority of the sale_price feature, our target variable, was missing. These rows can still be useful though for imputation of the other missing values. There was a very high amount of missing values in the parking_charges, pct_tax_deductibl, and num_half_bathrooms, and we decided these features were not extremely important to the model, so we went ahead and removed those.

d %<>%
  select(-c(parking_charges, pct_tax_deductibl, num_half_bathrooms))

There is also a significant amount of missing values in the total taxes column, so we may be able to handle that manually as well.

table(d$total_taxes)
## 
##   11   13   14   16   17   19   20   22   25   28   29   30   31   36   39   45 
##    1   13    1    3    5    1    2    1    1    3    1    4    4    1    2    4 
##   51   54   65   66   70   79   89   91   93   99  100  104  106  111  120  142 
##    1    6    2    2    2    1    1    1    1    1    1    2    1    1    5    2 
##  150  153  159  163  184  189  196  200  210  216  219  220  225  227  231  232 
##    5    4    1    1    1    2    7    3    2    2    2    2    2    1    3    4 
##  234  240  241  245  248  250  254  258  260  269  285  288  289  291  300  307 
##    1    1    1    2    1   12    1    4    2    2    1    2    2    3    2    1 
##  350  357  360  372  373  394  395  396  449  450  480  492  500  537  550  554 
##    2    2    1    2    1    2    2    1    2    2    2    4    6    2    2    2 
##  580  616  625  631  632  641  663  678  686  699  744  750  829  850  892  911 
##    2    2    1    4    2    1    1    2    1    1    1    1    2    2    2    2 
##  945 1024 1042 1127 1149 1152 1200 1216 1283 1300 1311 1320 1350 1400 1421 1471 
##    1    1    1    2    2    1    2    1    1    1    2    1    2    1    1    1 
## 1500 1521 1582 1660 1748 1783 1800 1829 1830 1860 1900 1948 2013 2037 2092 2096 
##    3    1    2    1    1    1    4    2    2    2    2    1    2    1    2    2 
## 2100 2138 2148 2178 2220 2224 2255 2256 2260 2287 2312 2333 2382 2400 2408 2411 
##    2    1    1    1    1    1    2    2    1    1    2    1    3    2    1    2 
## 2413 2423 2424 2425 2433 2436 2450 2460 2475 2490 2499 2500 2556 2567 2580 2584 
##    2    1    2    2    1    1    2    2    1    2    2    2    2    2    1    2 
## 2595 2600 2617 2634 2663 2700 2726 2765 2766 2775 2776 2780 2800 2808 2833 2850 
##    2    2    1    2    2    3    1    2    1    2    1    3   10    2    1    4 
## 2884 2894 2922 2936 2950 2961 2982 3000 3047 3048 3084 3108 3127 3144 3158 3186 
##    1    2    2    2    1    1    1    3    2    1    1    2    2    1    2    1 
## 3200 3220 3230 3241 3242 3268 3280 3284 3305 3312 3316 3325 3329 3333 3359 3370 
##    8    2    1    2    2    2    2    2    2    2    1    2    1    1    1    1 
## 3386 3388 3400 3409 3470 3476 3477 3478 3490 3491 3500 3524 3550 3579 3588 3600 
##    2    2    6    1    2    1    1    2    1    3    8    1    1    2    2    2 
## 3606 3609 3630 3673 3682 3721 3737 3740 3785 3800 3802 3858 3866 3936 3940 3945 
##    2    2    1    2    2    2    2    2    2    2    2    2    1    1    1    2 
## 3972 3987 4000 4012 4057 4104 4200 4225 4227 4300 4317 4415 4471 4474 4500 4506 
##    1    2    5    1    2    2    2    2    2    2    1    2    2    1    3    2 
## 4524 4546 4549 4613 4638 4700 4738 4775 4800 4802 4838 4920 4958 4962 5100 5217 
##    2    3    1    1    2    2    2    1   11    2    2    1    1    2    2    2 
## 5241 5300 5323 5359 5363 5400 5415 5500 5700 5776 5797 5807 5920 6224 6486 6533 
##    2    2    2    1    2    1    1    3    1    2    2    1    1    1    2    2 
## 6700 6988 7774 8178 9300 
##    2    2    2    1    1

Aside from the missing values, there are many values that are impossible/incorrect as they are much too low to make sense. We decided to handle those manually by replacing them with the mean.

d %<>%
  mutate(total_taxes = ifelse(total_taxes < 1000, mean(d$total_taxes), total_taxes))
str(d)
## 'data.frame':    2230 obs. of  27 variables:
##  $ dining_room_typecombo : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dining_room_typeformal: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ dining_room_typeother : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ fuel_typeelectric     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fuel_typegas          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fuel_typeoil          : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fuel_typeother        : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ kitchen_typecombo     : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ kitchen_typeefficiency: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ approx_year_built     : int  2005 1950 NA 1951 1961 1955 1952 1928 1940 1940 ...
##  $ cats_allowed          : num  0 0 0 1 0 0 1 1 1 1 ...
##  $ dogs_allowed          : num  0 0 0 1 0 0 1 0 1 1 ...
##  $ garage_exists         : int  0 0 0 1 0 0 1 0 0 0 ...
##  $ num_bedrooms          : int  1 1 2 1 1 2 2 2 1 2 ...
##  $ num_floors_in_building: int  4 NA NA 6 6 NA 2 6 6 6 ...
##  $ num_full_bathrooms    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ num_total_rooms       : int  3 3 4 3 3 6 5 4 3 4 ...
##  $ sale_price            : num  535000 185000 235000 280000 225000 ...
##  $ sq_footage            : int  681 NA NA NA NA NA NA 900 918 980 ...
##  $ total_taxes           : num  1320 NA NA NA NA ...
##  $ walk_score            : int  98 85 97 96 93 97 82 97 97 97 ...
##  $ lat                   : num  40.8 40.7 40.8 40.8 40.7 ...
##  $ lon                   : num  -73.9 -73.8 -73.8 -73.9 -73.9 ...
##  $ coop                  : int  0 1 1 1 1 1 0 1 1 1 ...
##  $ condo                 : int  1 0 0 0 0 0 1 0 0 0 ...
##  $ monthly_cost          : num  NA 499 NA 647 648 ...
##  $ shortest_dist         : num  1.047 0.662 0.548 0.864 0.793 ...

To handle the remaining missing values, we believe the best method is to use missForest to impute them, essentially predicting what the missing values probably would be or would be close to based on other features of the row. Before we do that though, we need to make sure we remember which rows did not have a value in sale price as those data points should not be used for the model.

df <- d %>%
  mutate(sale_miss = ifelse(is.na(sale_price), 1, 0))
pacman::p_load(missForest)

df_imp = missForest(df, ntree = 300,)$ximp
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
##   missForest iteration 7 in progress...done!
##   missForest iteration 8 in progress...done!
##   missForest iteration 9 in progress...done!

All the missing values have now been imputed, so we can now remove the section of the dataset where sale_price was missing.

df <- df_imp %>%
  filter(sale_miss == 0) %>%
  select(-sale_miss)

Regression

summary(lm(sale_price ~ ., df))
## 
## Call:
## lm(formula = sale_price ~ ., data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -311557  -41301   -1315   44830  324624 
## 
## Coefficients: (4 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -5.995e+07  9.103e+06  -6.585 1.14e-10 ***
## dining_room_typecombo  -1.030e+04  8.048e+03  -1.280 0.201060    
## dining_room_typeformal  1.471e+04  9.886e+03   1.488 0.137328    
## dining_room_typeother          NA         NA      NA       NA    
## fuel_typeelectric       2.782e+03  2.763e+04   0.101 0.919832    
## fuel_typegas           -4.533e+03  1.408e+04  -0.322 0.747560    
## fuel_typeoil           -1.451e+04  1.459e+04  -0.994 0.320639    
## fuel_typeother                 NA         NA      NA       NA    
## kitchen_typecombo       1.363e+04  7.251e+03   1.880 0.060725 .  
## kitchen_typeefficiency         NA         NA      NA       NA    
## approx_year_built      -7.766e+01  2.666e+02  -0.291 0.770940    
## cats_allowed           -3.774e+03  9.727e+03  -0.388 0.698219    
## dogs_allowed            2.955e+04  1.077e+04   2.744 0.006288 ** 
## garage_exists           7.094e+03  9.554e+03   0.743 0.458102    
## num_bedrooms            4.637e+04  8.255e+03   5.617 3.21e-08 ***
## num_floors_in_building  3.999e+03  7.281e+02   5.492 6.29e-08 ***
## num_full_bathrooms     -4.248e+03  5.576e+04  -0.076 0.939306    
## num_total_rooms         4.871e+03  5.363e+03   0.908 0.364154    
## sq_footage              3.879e+01  1.291e+01   3.005 0.002790 ** 
## total_taxes             3.894e+01  5.236e+00   7.438 4.41e-13 ***
## walk_score              7.051e+02  3.902e+02   1.807 0.071321 .  
## lat                     9.062e+05  1.275e+05   7.106 4.10e-12 ***
## lon                    -3.149e+05  8.756e+04  -3.596 0.000354 ***
## coop                   -2.260e+05  1.433e+04 -15.770  < 2e-16 ***
## condo                          NA         NA      NA       NA    
## monthly_cost            1.194e+02  1.489e+01   8.020 7.44e-15 ***
## shortest_dist           7.458e+02  6.197e+03   0.120 0.904258    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 77090 on 505 degrees of freedom
## Multiple R-squared:  0.8233, Adjusted R-squared:  0.8156 
## F-statistic:   107 on 22 and 505 DF,  p-value: < 2.2e-16

A simple linear model is able to predict the sale_price with an R-squared of .824. Let’s see how this model performs out-of-sample.

##Train-Test Split
pacman::p_load(caTools)
set.seed(77)
spl = sample.split(df$sale_price, 0.75)
train = subset(df, spl == TRUE)
test = subset(df, spl == FALSE)
simple <- lm(sale_price ~ ., train)
yhat <- predict(simple, test)
## Warning in predict.lm(simple, test): prediction from a rank-deficient fit may be
## misleading
rsq <- function (preds, actual) {
  rss <- sum((preds - actual) ^ 2)  ## residual sum of squares
  tss <- sum((actual - mean(actual)) ^ 2)  ## total sum of squares
  r2 <- 1 - rss/tss
  r2
}

rsq(yhat, test$sale_price)
## [1] 0.6795935

The simple linear model has an out-of-sample R-squared of .69, which is to be expected from a simple model from such a small dataset. Let’s see what we can do with forward step-wise regression to select the best predictors based on p-value.

pacman::p_load(olsrr)
forward <- ols_step_forward_p(simple, penter = .05, details = FALSE, progress = TRUE)
## Forward Selection Method    
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. dining_room_typecombo 
## 2. dining_room_typeformal 
## 3. dining_room_typeother 
## 4. fuel_typeelectric 
## 5. fuel_typegas 
## 6. fuel_typeoil 
## 7. fuel_typeother 
## 8. kitchen_typecombo 
## 9. kitchen_typeefficiency 
## 10. approx_year_built 
## 11. cats_allowed 
## 12. dogs_allowed 
## 13. garage_exists 
## 14. num_bedrooms 
## 15. num_floors_in_building 
## 16. num_full_bathrooms 
## 17. num_total_rooms 
## 18. sq_footage 
## 19. total_taxes 
## 20. walk_score 
## 21. lat 
## 22. lon 
## 23. coop 
## 24. condo 
## 25. monthly_cost 
## 26. shortest_dist 
## 
## We are selecting variables based on p value...
## 
## Variables Entered: 
## 
## - sq_footage 
## - condo
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - total_taxes
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - num_floors_in_building
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - lat
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - num_bedrooms
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - lon
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - monthly_cost
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - dogs_allowed
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## - walk_score
## Note: model has aliased coefficients
##       sums of squares computed by model comparison
## 
## No more variables to be added.
## 
## Final Model Output 
## ------------------
## 
##                             Model Summary                              
## ----------------------------------------------------------------------
## R                       0.910       RMSE                    76453.491 
## R-Squared               0.828       Coef. Var                  24.191 
## Adj. R-Squared          0.823       MSE                5845136242.437 
## Pred R-Squared          0.809       MAE                     55379.402 
## ----------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                     ANOVA                                     
## -----------------------------------------------------------------------------
##                     Sum of                                                   
##                    Squares         DF       Mean Square       F         Sig. 
## -----------------------------------------------------------------------------
## Regression    1.079998e+13         10      1.079998e+12    184.769    0.0000 
## Residual      2.250377e+12        385    5845136242.437                      
## Total         1.305036e+13        395                                        
## -----------------------------------------------------------------------------
## 
##                                                    Parameter Estimates                                                    
## -------------------------------------------------------------------------------------------------------------------------
##                  model             Beta     Std. Error    Std. Beta      t        Sig             lower            upper 
## -------------------------------------------------------------------------------------------------------------------------
##            (Intercept)    -60617604.488    9943201.750                 -6.096    0.000    -80167379.052    -41067829.923 
##             sq_footage          167.005         24.878        0.261     6.713    0.000          118.091          215.919 
##                  condo       207189.014      13113.421        0.494    15.800    0.000       181406.130       232971.898 
##            total_taxes           27.365          5.533        0.141     4.946    0.000           16.488           38.243 
## num_floors_in_building         4299.206        764.351        0.152     5.625    0.000         2796.382         5802.031 
##                    lat       991449.598     140942.019        0.163     7.034    0.000       714337.178      1268562.017 
##           num_bedrooms        38031.318       7694.065        0.157     4.943    0.000        22903.672        53158.963 
##                    lon      -271213.131      96638.547       -0.080    -2.806    0.005      -461218.509       -81207.753 
##           monthly_cost           88.103         18.188        0.185     4.844    0.000           52.343          123.862 
##           dogs_allowed        30190.746       8938.341        0.075     3.378    0.001        12616.674        47764.818 
##             walk_score          937.899        365.430        0.070     2.567    0.011          219.410         1656.388 
## -------------------------------------------------------------------------------------------------------------------------
yhat <- predict(forward$model, test)
rsq(yhat, test$sale_price)
## [1] 0.6399033

Using only the best 10 predictors, we had essentially the same R-squared from our model. Let’s see what we can do if we include these 10 predictors, as well as associations between them.

top10 <- forward$predictors
f <- df %>%
  select(top10, sale_price)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(top10)` instead of `top10` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
set.seed(777)
spl = sample.split(f$sale_price, 0.75)
ftrain = subset(f, spl == TRUE)
ftest = subset(f, spl == FALSE)

associations <- lm(sale_price ~ . * ., ftrain)
forward <- ols_step_forward_p(associations, penter = .025, details = FALSE, progress = TRUE)
## Forward Selection Method    
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. sq_footage 
## 2. condo 
## 3. total_taxes 
## 4. num_floors_in_building 
## 5. lat 
## 6. num_bedrooms 
## 7. lon 
## 8. monthly_cost 
## 9. dogs_allowed 
## 10. walk_score 
## 11. sq_footage:condo 
## 12. sq_footage:total_taxes 
## 13. sq_footage:num_floors_in_building 
## 14. sq_footage:lat 
## 15. sq_footage:num_bedrooms 
## 16. sq_footage:lon 
## 17. sq_footage:monthly_cost 
## 18. sq_footage:dogs_allowed 
## 19. sq_footage:walk_score 
## 20. condo:total_taxes 
## 21. condo:num_floors_in_building 
## 22. condo:lat 
## 23. condo:num_bedrooms 
## 24. condo:lon 
## 25. condo:monthly_cost 
## 26. condo:dogs_allowed 
## 27. condo:walk_score 
## 28. total_taxes:num_floors_in_building 
## 29. total_taxes:lat 
## 30. total_taxes:num_bedrooms 
## 31. total_taxes:lon 
## 32. total_taxes:monthly_cost 
## 33. total_taxes:dogs_allowed 
## 34. total_taxes:walk_score 
## 35. num_floors_in_building:lat 
## 36. num_floors_in_building:num_bedrooms 
## 37. num_floors_in_building:lon 
## 38. num_floors_in_building:monthly_cost 
## 39. num_floors_in_building:dogs_allowed 
## 40. num_floors_in_building:walk_score 
## 41. lat:num_bedrooms 
## 42. lat:lon 
## 43. lat:monthly_cost 
## 44. lat:dogs_allowed 
## 45. lat:walk_score 
## 46. num_bedrooms:lon 
## 47. num_bedrooms:monthly_cost 
## 48. num_bedrooms:dogs_allowed 
## 49. num_bedrooms:walk_score 
## 50. lon:monthly_cost 
## 51. lon:dogs_allowed 
## 52. lon:walk_score 
## 53. monthly_cost:dogs_allowed 
## 54. monthly_cost:walk_score 
## 55. dogs_allowed:walk_score 
## 
## We are selecting variables based on p value...
## 
## Variables Entered: 
## 
## - total_taxes:num_bedrooms 
## - condo 
## - num_floors_in_building 
## - lon 
## - dogs_allowed 
## - walk_score 
## - lat 
## - sq_footage 
## - monthly_cost 
## - total_taxes 
## - lon:walk_score 
## - num_bedrooms 
## - lat:monthly_cost 
## - total_taxes:dogs_allowed 
## - sq_footage:condo 
## - num_bedrooms:monthly_cost 
## - total_taxes:monthly_cost 
## - total_taxes:lon 
## - condo:lat 
## - lat:lon 
## - num_floors_in_building:lat 
## 
## No more variables to be added.
## 
## Final Model Output 
## ------------------
## 
##                             Model Summary                              
## ----------------------------------------------------------------------
## R                       0.948       RMSE                    60244.330 
## R-Squared               0.898       Coef. Var                  19.097 
## Adj. R-Squared          0.892       MSE                3629379268.003 
## Pred R-Squared          0.864       MAE                     44527.994 
## ----------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                      ANOVA                                      
## -------------------------------------------------------------------------------
##                     Sum of                                                     
##                    Squares         DF         Mean Square       F         Sig. 
## -------------------------------------------------------------------------------
## Regression    1.197833e+13         21    570396529546.032    157.161    0.0000 
## Residual      1.357388e+12        374      3629379268.003                      
## Total         1.333571e+13        395                                          
## -------------------------------------------------------------------------------
## 
##                                                      Parameter Estimates                                                       
## ------------------------------------------------------------------------------------------------------------------------------
##                      model             Beta      Std. Error    Std. Beta      t        Sig             lower            upper 
## ------------------------------------------------------------------------------------------------------------------------------
##                (Intercept)    -26217403996.743    9183511110.364                 -2.855    0.005    -4.427519e+10    -8.159616e+09 
##                      condo    -40766326.922    12513968.522      -95.867    -3.258    0.001    -6.537288e+07    -1.615977e+07 
##     num_floors_in_building     -2449958.827      866429.924      -90.786    -2.828    0.005    -4.153644e+06    -7.462741e+05 
##                        lon    -354065685.052    124413085.938     -102.808    -2.846    0.005    -5.987025e+08    -1.094289e+08 
##               dogs_allowed       -72558.034       28911.912       -0.175    -2.510    0.013    -1.294083e+05    -1.570776e+04 
##                 walk_score     -2629240.555      374394.412     -186.352    -7.023    0.000    -3.365422e+06    -1.893059e+06 
##                        lat    650461223.258    225719272.456      107.150     2.882    0.004     2.066233e+08     1.094299e+09 
##                 sq_footage          226.144          29.663        0.476     7.624    0.000     1.678170e+02     2.844710e+02 
##               monthly_cost        92973.512       17036.925      198.547     5.457    0.000     5.947334e+04     1.264737e+05 
##                total_taxes       -30004.616        6434.903     -144.894    -4.663    0.000    -4.265774e+04    -1.735149e+04 
##               num_bedrooms       -18853.585       19373.788       -0.077    -0.973    0.331    -5.694879e+04     1.924162e+04 
##   total_taxes:num_bedrooms           39.772           6.029        0.705     6.596    0.000     2.791600e+01     5.162700e+01 
##             lon:walk_score       -35638.296        5073.867     -186.917    -7.024    0.000    -4.561518e+04    -2.566141e+04 
##           lat:monthly_cost        -2281.080         418.308     -198.523    -5.453    0.000    -3.103610e+03    -1.458550e+03 
##   total_taxes:dogs_allowed           32.901           9.928        0.237     3.314    0.001     1.337900e+01     5.242400e+01 
##           condo:sq_footage         -215.665          30.261       -0.628    -7.127    0.000    -2.751680e+02    -1.561620e+02 
##  num_bedrooms:monthly_cost          -98.258          15.350       -0.658    -6.401    0.000    -1.284410e+02    -6.807600e+01 
##   total_taxes:monthly_cost            0.067           0.012        0.696     5.431    0.000     4.300000e-02     9.200000e-02 
##            total_taxes:lon         -405.200          87.129     -144.474    -4.651    0.000    -5.765250e+02    -2.338760e+02 
##                  condo:lat      1010825.928      307250.846       96.856     3.290    0.001     4.066702e+05     1.614982e+06 
##                    lon:lat      8784776.976     3057922.190      136.102     2.873    0.004     2.771901e+06     1.479765e+07 
## num_floors_in_building:lat        60201.655       21268.097       90.896     2.831    0.005     1.838162e+04     1.020217e+05 
## ------------------------------------------------------------------------------------------------------------------------------
yhat <- predict(forward$model, ftest)
rsq(yhat, ftest$sale_price)
## [1] 0.7937756

That’s a significant improvement, the out-of-sample R-squared jumped to .79 with this method. Just to check let’s see how this method would work if we trained on the whole dataset (including the imputed sale_price points).

f <- df_imp %>%
  select(top10, sale_price)

set.seed(77)
spl = sample.split(f$sale_price, 0.75)
ftrain = subset(f, spl == TRUE)
ftest = subset(f, spl == FALSE)

associations <- lm(sale_price ~ . * ., ftrain)
forward <- ols_step_forward_p(associations, penter = .005, details = FALSE, progress = TRUE)
## Forward Selection Method    
## ---------------------------
## 
## Candidate Terms: 
## 
## 1. sq_footage 
## 2. condo 
## 3. total_taxes 
## 4. num_floors_in_building 
## 5. lat 
## 6. num_bedrooms 
## 7. lon 
## 8. monthly_cost 
## 9. dogs_allowed 
## 10. walk_score 
## 11. sq_footage:condo 
## 12. sq_footage:total_taxes 
## 13. sq_footage:num_floors_in_building 
## 14. sq_footage:lat 
## 15. sq_footage:num_bedrooms 
## 16. sq_footage:lon 
## 17. sq_footage:monthly_cost 
## 18. sq_footage:dogs_allowed 
## 19. sq_footage:walk_score 
## 20. condo:total_taxes 
## 21. condo:num_floors_in_building 
## 22. condo:lat 
## 23. condo:num_bedrooms 
## 24. condo:lon 
## 25. condo:monthly_cost 
## 26. condo:dogs_allowed 
## 27. condo:walk_score 
## 28. total_taxes:num_floors_in_building 
## 29. total_taxes:lat 
## 30. total_taxes:num_bedrooms 
## 31. total_taxes:lon 
## 32. total_taxes:monthly_cost 
## 33. total_taxes:dogs_allowed 
## 34. total_taxes:walk_score 
## 35. num_floors_in_building:lat 
## 36. num_floors_in_building:num_bedrooms 
## 37. num_floors_in_building:lon 
## 38. num_floors_in_building:monthly_cost 
## 39. num_floors_in_building:dogs_allowed 
## 40. num_floors_in_building:walk_score 
## 41. lat:num_bedrooms 
## 42. lat:lon 
## 43. lat:monthly_cost 
## 44. lat:dogs_allowed 
## 45. lat:walk_score 
## 46. num_bedrooms:lon 
## 47. num_bedrooms:monthly_cost 
## 48. num_bedrooms:dogs_allowed 
## 49. num_bedrooms:walk_score 
## 50. lon:monthly_cost 
## 51. lon:dogs_allowed 
## 52. lon:walk_score 
## 53. monthly_cost:dogs_allowed 
## 54. monthly_cost:walk_score 
## 55. dogs_allowed:walk_score 
## 
## We are selecting variables based on p value...
## 
## Variables Entered: 
## 
## - total_taxes:lat 
## - dogs_allowed 
## - walk_score 
## - lon 
## - condo 
## - lat 
## - sq_footage:monthly_cost 
## - total_taxes 
## - num_floors_in_building 
## - num_bedrooms 
## - sq_footage 
## - sq_footage:condo 
## - monthly_cost 
## - lon:walk_score 
## - total_taxes:num_floors_in_building 
## - condo:lat 
## - condo:total_taxes 
## - total_taxes:lon 
## - condo:num_bedrooms 
## - sq_footage:total_taxes 
## - monthly_cost:walk_score 
## - num_floors_in_building:walk_score 
## - num_bedrooms:lon 
## - sq_footage:dogs_allowed 
## - num_bedrooms:walk_score 
## 
## No more variables to be added.
## 
## Final Model Output 
## ------------------
## 
##                             Model Summary                              
## ----------------------------------------------------------------------
## R                       0.963       RMSE                    43057.977 
## R-Squared               0.928       Coef. Var                  12.989 
## Adj. R-Squared          0.927       MSE                1853989393.030 
## Pred R-Squared          0.917       MAE                     30379.483 
## ----------------------------------------------------------------------
##  RMSE: Root Mean Square Error 
##  MSE: Mean Square Error 
##  MAE: Mean Absolute Error 
## 
##                                     ANOVA                                      
## ------------------------------------------------------------------------------
##                     Sum of                                                    
##                    Squares          DF       Mean Square       F         Sig. 
## ------------------------------------------------------------------------------
## Regression    3.937827e+13          25      1.575131e+12     849.59    0.0000 
## Residual      3.051667e+12        1646    1853989393.030                      
## Total         4.242993e+13        1671                                        
## ------------------------------------------------------------------------------
## 
##                                                           Parameter Estimates                                                           
## ---------------------------------------------------------------------------------------------------------------------------------------
##                              model             Beta      Std. Error    Std. Beta       t        Sig             lower            upper 
## ---------------------------------------------------------------------------------------------------------------------------------------
##                        (Intercept)    114601766.960    15874177.051                   7.219    0.000     83466056.706    145737477.214 
##                       dogs_allowed       -16244.660        7976.539       -0.044     -2.037    0.042       -31889.894         -599.427 
##                         walk_score     -1367222.664      127449.577     -126.901    -10.728    0.000     -1617203.062     -1117242.267 
##                                lon      1742446.457      178303.373        0.546      9.772    0.000      1392721.105      2092171.810 
##                              condo    -31805036.411     3755571.289      -87.853     -8.469    0.000    -39171237.450    -24438835.371 
##                                lat       340126.658      151005.726        0.061      2.252    0.024        43943.082       636310.234 
##                        total_taxes       -17025.713        3507.257      -91.171     -4.854    0.000       -23904.870       -10146.557 
##             num_floors_in_building         -577.604        1310.419       -0.024     -0.441    0.659        -3147.868         1992.659 
##                       num_bedrooms     14073100.741     2900092.398       66.263      4.853    0.000      8384841.355     19761360.128 
##                         sq_footage          232.807          16.246        0.479     14.330    0.000          200.943          264.672 
##                       monthly_cost          -11.809          25.304       -0.030     -0.467    0.641          -61.442           37.823 
##                    total_taxes:lat           12.737          50.168        2.781      0.254    0.800          -85.663          111.137 
##            sq_footage:monthly_cost            0.002           0.009        0.008      0.222    0.824           -0.015            0.019 
##                   condo:sq_footage         -129.019          14.413       -0.401     -8.951    0.000         -157.289         -100.748 
##                     walk_score:lon       -18519.177        1726.947     -127.134    -10.724    0.000       -21906.421       -15131.933 
## total_taxes:num_floors_in_building            2.218           0.217        0.369     10.212    0.000            1.792            2.644 
##                          lat:condo       789807.414       92223.395       88.889      8.564    0.000       608919.870       970694.958 
##                  total_taxes:condo          -35.450           4.318       -0.362     -8.210    0.000          -43.920          -26.981 
##                    total_taxes:lon         -224.464          33.412      -88.752     -6.718    0.000         -289.998         -158.931 
##                 condo:num_bedrooms        28312.242        4526.601        0.160      6.255    0.000        19433.739        37190.745 
##             total_taxes:sq_footage           -0.016           0.004       -0.192     -3.613    0.000           -0.024           -0.007 
##            walk_score:monthly_cost            0.919           0.227        0.180      4.054    0.000            0.474            1.364 
##  walk_score:num_floors_in_building          -45.833          12.352       -0.151     -3.710    0.000          -70.061          -21.605 
##                   lon:num_bedrooms       190837.663       39348.576       66.337      4.850    0.000       113659.120       268016.205 
##            dogs_allowed:sq_footage           33.056           8.242        0.091      4.010    0.000           16.889           49.222 
##            walk_score:num_bedrooms          387.834         114.503        0.155      3.387    0.001          163.247          612.420 
## ---------------------------------------------------------------------------------------------------------------------------------------
yhat <- predict(forward$model, test)
rsq(yhat, test$sale_price)
## [1] 0.8655408

Even though this model was trained on essentially fake data, it still performed very well when predicting on the real data points. This shows that this method at least has promise if we are able to scrape more data points that include the sale_price.

##RANDOM FOREST

Let’s see how a random forest regression model performs with our real dataset.

pacman::p_load(randomForest)
rf <- randomForest(sale_price ~ ., train, mtry=3, ntree = 500, importance=TRUE)


yhat <- predict(rf, test)
rsq(yhat, test$sale_price)
## [1] 0.8920029

The random forest regressor model had an R-squared of .89 out-of-sample, a huge improvement over the simple linear model. This is also a very acceptable R-squared in general, considering we only had 528 data points. Let’s take a look at how a sample tree functioned in our random forest:

pacman::p_load(rpart, rpart.plot)
tree <- rpart(
    formula = sale_price ~ .,
    data    = df,
    method  = "anova"
)

rpart.plot(tree)

What if we trained the random forest on the larger dataset:

rf <- randomForest(sale_price ~ ., ftrain, ntry = 3, ntree = 500, importance = TRUE)

yhat <- predict(rf, test)
rsq(yhat, test$sale_price)
## [1] 0.9599098

Conclusion

An amazing .96 R-squared using the larger dataset to train the model, then predicting on the real data points. There is definitely some over fitting going on when we train on the larger dataset though, because those missing sale prices were generated by the rest of the dataset’s information. It would definitely be interesting though to try these models on a larger, real dataset.