For this real-estate price prediction project, our team was given a dataset scraped from zillow.com that included features of apartment/condo listings such as the number of bedrooms, the square footage, the address, and some less relevant information. We were tasked with cleaning this dataset, which had many missing values, and building the most accurate machine learning model possible from it.
pacman::p_load(dplyr, tidyr, ggplot2, magrittr, stringr, mlr)
housing_data = read.csv("housing_data_2016_2017.csv")
### Delete features that are irrelevant to sale price
From a quick look we identified several features that were clearly irrelevant to the model, so we removed them from the start. Listing price was also removed: since the goal is a model that values properties on their own merits, training on the seller's asking price would be an unfair shortcut. (Indeed, one use of the finished model is to suggest a fair listing price for a property.)
# keep only columns 29-57 (the listing features), then drop features irrelevant to sale price
housing_data <- housing_data[, 29:57] %>%
  select(-c(date_of_sale, model_type, listing_price_to_nearest_1000, url))
In this block of code we cleaned the addresses of the listings, converted some features to binary variables, and performed other general cleaning of the dataset.
housing_data %<>%
  mutate(zip_code = str_extract(full_address_or_zip_code, "[0-9]{5}")) %>%
  mutate(dogs_allowed = ifelse(substr(dogs_allowed, 1, 3) == "yes", 1, 0)) %>%
  mutate(cats_allowed = ifelse(substr(cats_allowed, 1, 3) == "yes", 1, 0)) %>%
  mutate(coop = as.integer(coop_condo == 'co-op')) %>%
  mutate(condo = as.integer(coop_condo == 'condo')) %>%
  select(-coop_condo)
d = housing_data
# recode NA to 0 so the two cost columns can be summed
d %<>%
  mutate(maintenance_cost = sjmisc::rec(maintenance_cost, rec = "NA = 0 ; else = copy")) %>%
  mutate(common_charges = sjmisc::rec(common_charges, rec = "NA = 0 ; else = copy"))
# combine maintenance cost and common charges into one monthly_cost feature;
# rows where both were missing sum to 0, so recode those back to NA
d %<>%
  mutate(monthly_cost = common_charges + maintenance_cost) %>%
  mutate(monthly_cost = sjmisc::rec(monthly_cost, rec = "0 = NA ; else = copy"))
## convert garage_exists feature to binary
d %<>%
  mutate(garage_exists = sjmisc::rec(garage_exists, rec = "NA = 0 ; else = copy")) ## recode NA to 0
d %<>%
  mutate(garage_exists = sjmisc::rec(garage_exists, rec = "eys = 1; UG = 1 ; Underground = 1; yes = 1 ; Yes = 1 ; else = copy")) ## map the messy affirmative values (including the raw-data typo "eys") to 1
## One or more of the old values are recoded into identical new values.
## Please check if you correctly specified the recode-pattern,
## else separate multiple values with comma, e.g. rec="a,b,c=1; d,e,f=2".
d %<>%
  select(-c(maintenance_cost, common_charges))
We suspected that latitude and longitude would be a better measure of location than zip codes or addresses, so we engineered those features using ggmap and could then remove the original location-related features. We also engineered a feature measuring the distance from each co-op/condo to the nearest LIRR station, since properties near LIRR stations may be more desirable to those commuting to Manhattan.
#Already run and included in the data
#pacman::p_load(ggmap)
#d %<>%
#  mutate(lat = geocode(full_address_or_zip_code)$lat,
#         lon = geocode(full_address_or_zip_code)$lon)
#geocoordinates for relevant LIRR stations
lirr_coord = read.csv("coord.csv")
RAD_EARTH = 3958.8  # Earth's mean radius in miles

# convert degrees to radians (vectorized; no loop needed in R)
degrees_to_radians = function(angle_degrees){
  angle_degrees * pi / 180
}
# great-circle distance in miles between two (lat, lon) pairs, via the haversine formula
compute_globe_distance = function(destination, origin){
  destination_rad = degrees_to_radians(destination)
  origin_rad = degrees_to_radians(origin)
  delta_lat = destination_rad[1] - origin_rad[1]
  delta_lon = destination_rad[2] - origin_rad[2]
  h = (sin(delta_lat / 2))^2 +
    cos(origin_rad[1]) * cos(destination_rad[1]) * (sin(delta_lon / 2))^2
  central_angle = 2 * asin(sqrt(h))
  RAD_EARTH * central_angle
}
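As a quick sanity check (not part of the original pipeline; the coordinates below are approximate values we looked up), the straight-line distance from Penn Station to Jamaica Station should come out to roughly ten miles:
# approximate (lat, lon) pairs; illustrative values only
penn_station = c(40.7506, -73.9935)
jamaica_station = c(40.6995, -73.8088)
compute_globe_distance(jamaica_station, penn_station)  # roughly 10 miles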
#find the closest LIRR station and compute distance
shortest_lirr_distance = function(all_lirr_coords, house_coords){
  shortest_dist = Inf
  for (i in 1:nrow(all_lirr_coords)){
    ith_lirr = c(all_lirr_coords$lat[i], all_lirr_coords$lon[i])
    new_dist = compute_globe_distance(ith_lirr, house_coords)
    if (new_dist < shortest_dist){
      shortest_dist = new_dist
    }
  }
  shortest_dist
}
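An equivalent loop-free version (a sketch; it assumes lirr_coord keeps its lat and lon columns) computes every station distance at once and takes the minimum:
shortest_lirr_distance_v = function(all_lirr_coords, house_coords){
  dists = mapply(function(lat, lon) compute_globe_distance(c(lat, lon), house_coords),
                 all_lirr_coords$lat, all_lirr_coords$lon)
  min(dists)
}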
d %<>%
rowwise() %>%
mutate(shortest_dist = shortest_lirr_distance(lirr_coord, c(lat, lon)) ) %>%
select(-c(zip_code, full_address_or_zip_code, community_district_num))
str(d)
## Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame': 2230 obs. of 24 variables:
## $ approx_year_built : int 1955 1955 2004 2002 1949 1938 1950 1960 1960 2005 ...
## $ cats_allowed : num 0 0 0 0 1 1 0 0 0 0 ...
## $ dining_room_type : Factor w/ 5 levels "combo","dining area",..: 1 3 1 1 1 1 1 NA NA 5 ...
## $ dogs_allowed : num 0 0 0 0 1 1 0 0 0 0 ...
## $ fuel_type : Factor w/ 6 levels "electric","gas",..: 2 4 NA 2 2 4 2 2 4 NA ...
## $ garage_exists : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ kitchen_type : Factor w/ 4 levels "combo","eat in",..: 2 2 3 2 2 2 3 3 2 2 ...
## $ num_bedrooms : int 2 1 1 3 2 2 1 0 1 1 ...
## $ num_floors_in_building: int 6 7 1 NA 2 6 NA 2 NA 4 ...
## $ num_full_bathrooms : int 1 1 1 2 1 1 1 1 1 1 ...
## $ num_half_bathrooms : int NA NA NA NA NA NA NA NA NA NA ...
## $ num_total_rooms : int 5 4 3 5 4 4 3 2 4 3 ...
## $ parking_charges : Factor w/ 90 levels " NA ","100","105",..: 1 1 1 1 1 1 1 1 41 1 ...
## $ pct_tax_deductibl : int NA NA NA NA 39 NA NA NA NA NA ...
## $ sale_price : Factor w/ 316 levels " NA ","100000",..: 107 113 33 252 119 126 38 8 94 250 ...
## $ sq_footage : int NA 890 550 NA 675 1000 NA 375 NA 681 ...
## $ total_taxes : Factor w/ 294 levels " NA ","100","1024",..: 1 1 255 68 1 1 1 1 1 19 ...
## $ walk_score : int 82 89 90 94 71 90 72 93 70 98 ...
## $ lat : num 40.7 40.8 40.7 40.8 40.7 ...
## $ lon : num -73.8 -73.8 -73.9 -73.8 -73.7 ...
## $ coop : int 1 1 0 0 1 1 1 1 1 0 ...
## $ condo : int 0 0 1 1 0 0 0 0 0 1 ...
## $ monthly_cost : num 767 604 167 275 660 932 660 514 781 NA ...
## $ shortest_dist : num 0.796 0.694 1.096 0.303 2.778 ...
From this overview we identified several features whose types needed to be changed: garage_exists should be an integer flag, while parking_charges, sale_price, and total_taxes were read in as factors and need to be numeric.
d$garage_exists = as.character(d$garage_exists)
d$garage_exists = as.integer(d$garage_exists)
d$parking_charges = as.character(d$parking_charges)
d$parking_charges = as.numeric(d$parking_charges)
## Warning: NAs introduced by coercion
d$sale_price = as.character(d$sale_price)
d$sale_price = as.numeric(d$sale_price)
## Warning: NAs introduced by coercion
d$total_taxes = as.character(d$total_taxes)
d$total_taxes = as.numeric(d$total_taxes)
## Warning: NAs introduced by coercion
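These "NAs introduced by coercion" warnings are expected: the factor columns contain the literal string " NA " as a level (visible in the str output above), which as.numeric cannot parse. A minimal illustration:
as.numeric(as.character(factor(c("250000", " NA "))))
## Warning: NAs introduced by coercion
## [1] 250000     NA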
str(d)
## Classes 'rowwise_df', 'tbl_df', 'tbl' and 'data.frame': 2230 obs. of 24 variables:
## $ approx_year_built : int 1955 1955 2004 2002 1949 1938 1950 1960 1960 2005 ...
## $ cats_allowed : num 0 0 0 0 1 1 0 0 0 0 ...
## $ dining_room_type : Factor w/ 5 levels "combo","dining area",..: 1 3 1 1 1 1 1 NA NA 5 ...
## $ dogs_allowed : num 0 0 0 0 1 1 0 0 0 0 ...
## $ fuel_type : Factor w/ 6 levels "electric","gas",..: 2 4 NA 2 2 4 2 2 4 NA ...
## $ garage_exists : int 0 0 0 0 0 0 0 0 0 0 ...
## $ kitchen_type : Factor w/ 4 levels "combo","eat in",..: 2 2 3 2 2 2 3 3 2 2 ...
## $ num_bedrooms : int 2 1 1 3 2 2 1 0 1 1 ...
## $ num_floors_in_building: int 6 7 1 NA 2 6 NA 2 NA 4 ...
## $ num_full_bathrooms : int 1 1 1 2 1 1 1 1 1 1 ...
## $ num_half_bathrooms : int NA NA NA NA NA NA NA NA NA NA ...
## $ num_total_rooms : int 5 4 3 5 4 4 3 2 4 3 ...
## $ parking_charges : num NA NA NA NA NA NA NA NA 20 NA ...
## $ pct_tax_deductibl : int NA NA NA NA 39 NA NA NA NA NA ...
## $ sale_price : num 228000 235500 137550 545000 241700 ...
## $ sq_footage : int NA 890 550 NA 675 1000 NA 375 NA 681 ...
## $ total_taxes : num NA NA 5500 2260 NA NA NA NA NA 1320 ...
## $ walk_score : int 82 89 90 94 71 90 72 93 70 98 ...
## $ lat : num 40.7 40.8 40.7 40.8 40.7 ...
## $ lon : num -73.8 -73.8 -73.9 -73.8 -73.7 ...
## $ coop : int 1 1 0 0 1 1 1 1 1 0 ...
## $ condo : int 0 0 1 1 0 0 0 0 0 1 ...
## $ monthly_cost : num 767 604 167 275 660 932 660 514 781 NA ...
## $ shortest_dist : num 0.796 0.694 1.096 0.303 2.778 ...
We may now be able to convert the entire dataset to numeric, which plays better with regression models. As a first step in our exploratory data analysis, we examine the remaining categorical variables.
table(d$dining_room_type, useNA = 'always')
##
## combo dining area formal none other <NA>
## 957 2 620 2 201 448
Since dining area and none are nearly empty, we can fold them, along with the NAs, into the other category.
other <- d %>%
filter(is.na(dining_room_type) | dining_room_type == 'none' | dining_room_type == 'dining area' | dining_room_type == 'other') %>%
mutate(dining_room_type = "other")
non_na <- d %>%
filter(dining_room_type == 'combo' | dining_room_type == 'formal')
d <- rbind(other, non_na)
d$dining_room_type <- as.factor(d$dining_room_type)
table(d$fuel_type, useNA = 'always')
##
## electric gas none oil other Other <NA>
## 62 1348 3 664 40 1 112
We apply the same process to fuel type, compressing the number of categories.
other <- d %>%
filter(is.na(fuel_type) | fuel_type == 'none' | fuel_type == 'other' | fuel_type == 'Other') %>%
mutate(fuel_type = "other")
non_na <- d %>%
filter(fuel_type == 'electric' | fuel_type == 'gas' | fuel_type == 'oil')
d <- rbind(other, non_na)
d$fuel_type <- as.factor(d$fuel_type)
table(d$kitchen_type, useNA = 'always')
##
## combo eat in efficiency none <NA>
## 399 942 849 23 17
We will assume a combo kitchen and an eat-in kitchen are one and the same, and impute missing values with the mode of the merged column (combo).
combo <- d %>%
filter(is.na(kitchen_type) | kitchen_type == 'none' | kitchen_type == 'combo' | kitchen_type == 'eat in') %>%
mutate(kitchen_type = 'combo')
efficiency <- d %>%
filter(kitchen_type == 'efficiency')
d <- rbind(combo, efficiency)
d$kitchen_type <- as.factor(d$kitchen_type)
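As an aside, the filter-and-rbind pattern above could be written more compactly (and without reordering rows) with the forcats package; a sketch we did not run:
pacman::p_load(forcats)
d$kitchen_type <- fct_collapse(fct_explicit_na(d$kitchen_type, na_level = "combo"),
                               combo = c("combo", "eat in", "none"))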
Now we convert the remaining categorical variables to dummy variables (the -1 in each formula drops the intercept so every level gets its own column) and bind them onto the dataset:
d <- cbind(model.matrix(~ dining_room_type - 1, d),
           model.matrix(~ fuel_type - 1, d),
           model.matrix(~ kitchen_type - 1, d),
           d)
d %<>%
  select(-c(dining_room_type, fuel_type, kitchen_type))
Now we will take a look at the missingness of the remaining features:
cols <- colnames(d)
for (i in seq_along(d)) {
  cat(cols[i])
  cat('\n')
  cat(sum(is.na(d[i])))
  cat('\n')
}
## dining_room_typecombo
## 0
## dining_room_typeformal
## 0
## dining_room_typeother
## 0
## fuel_typeelectric
## 0
## fuel_typegas
## 0
## fuel_typeoil
## 0
## fuel_typeother
## 0
## kitchen_typecombo
## 0
## kitchen_typeefficiency
## 0
## approx_year_built
## 39
## cats_allowed
## 0
## dogs_allowed
## 0
## garage_exists
## 0
## num_bedrooms
## 115
## num_floors_in_building
## 650
## num_full_bathrooms
## 0
## num_half_bathrooms
## 2058
## num_total_rooms
## 2
## parking_charges
## 1671
## pct_tax_deductibl
## 1754
## sale_price
## 1702
## sq_footage
## 1210
## total_taxes
## 1646
## walk_score
## 0
## lat
## 0
## lon
## 0
## coop
## 0
## condo
## 0
## monthly_cost
## 100
## shortest_dist
## 0
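For reference, the same counts can be produced in one idiomatic line:
colSums(is.na(d))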
A quick overview of the missingness in the dataset showed us a few things. The majority of sale_price, our target variable, is missing; those rows are still useful for imputing other missing values, so we keep them for now. parking_charges, pct_tax_deductibl, and num_half_bathrooms had very high missing counts, and since we judged these features not especially important to the model, we removed them.
d %<>%
select(-c(parking_charges, pct_tax_deductibl, num_half_bathrooms))
There is also a significant number of missing values in the total_taxes column, so we take a closer look at it.
table(d$total_taxes)
##
## 11 13 14 16 17 19 20 22 25 28 29 30 31 36 39 45
## 1 13 1 3 5 1 2 1 1 3 1 4 4 1 2 4
## 51 54 65 66 70 79 89 91 93 99 100 104 106 111 120 142
## 1 6 2 2 2 1 1 1 1 1 1 2 1 1 5 2
## 150 153 159 163 184 189 196 200 210 216 219 220 225 227 231 232
## 5 4 1 1 1 2 7 3 2 2 2 2 2 1 3 4
## 234 240 241 245 248 250 254 258 260 269 285 288 289 291 300 307
## 1 1 1 2 1 12 1 4 2 2 1 2 2 3 2 1
## 350 357 360 372 373 394 395 396 449 450 480 492 500 537 550 554
## 2 2 1 2 1 2 2 1 2 2 2 4 6 2 2 2
## 580 616 625 631 632 641 663 678 686 699 744 750 829 850 892 911
## 2 2 1 4 2 1 1 2 1 1 1 1 2 2 2 2
## 945 1024 1042 1127 1149 1152 1200 1216 1283 1300 1311 1320 1350 1400 1421 1471
## 1 1 1 2 2 1 2 1 1 1 2 1 2 1 1 1
## 1500 1521 1582 1660 1748 1783 1800 1829 1830 1860 1900 1948 2013 2037 2092 2096
## 3 1 2 1 1 1 4 2 2 2 2 1 2 1 2 2
## 2100 2138 2148 2178 2220 2224 2255 2256 2260 2287 2312 2333 2382 2400 2408 2411
## 2 1 1 1 1 1 2 2 1 1 2 1 3 2 1 2
## 2413 2423 2424 2425 2433 2436 2450 2460 2475 2490 2499 2500 2556 2567 2580 2584
## 2 1 2 2 1 1 2 2 1 2 2 2 2 2 1 2
## 2595 2600 2617 2634 2663 2700 2726 2765 2766 2775 2776 2780 2800 2808 2833 2850
## 2 2 1 2 2 3 1 2 1 2 1 3 10 2 1 4
## 2884 2894 2922 2936 2950 2961 2982 3000 3047 3048 3084 3108 3127 3144 3158 3186
## 1 2 2 2 1 1 1 3 2 1 1 2 2 1 2 1
## 3200 3220 3230 3241 3242 3268 3280 3284 3305 3312 3316 3325 3329 3333 3359 3370
## 8 2 1 2 2 2 2 2 2 2 1 2 1 1 1 1
## 3386 3388 3400 3409 3470 3476 3477 3478 3490 3491 3500 3524 3550 3579 3588 3600
## 2 2 6 1 2 1 1 2 1 3 8 1 1 2 2 2
## 3606 3609 3630 3673 3682 3721 3737 3740 3785 3800 3802 3858 3866 3936 3940 3945
## 2 2 1 2 2 2 2 2 2 2 2 2 1 1 1 2
## 3972 3987 4000 4012 4057 4104 4200 4225 4227 4300 4317 4415 4471 4474 4500 4506
## 1 2 5 1 2 2 2 2 2 2 1 2 2 1 3 2
## 4524 4546 4549 4613 4638 4700 4738 4775 4800 4802 4838 4920 4958 4962 5100 5217
## 2 3 1 1 2 2 2 1 11 2 2 1 1 2 2 2
## 5241 5300 5323 5359 5363 5400 5415 5500 5700 5776 5797 5807 5920 6224 6486 6533
## 2 2 2 1 2 1 1 3 1 2 2 1 1 1 2 2
## 6700 6988 7774 8178 9300
## 2 2 2 1 1
Aside from the missing values, many recorded values are implausibly low for annual taxes (e.g., $11), most likely data-entry errors. We decided to handle those manually by replacing them with the column mean.
d %<>%
  mutate(total_taxes = ifelse(total_taxes < 1000, mean(d$total_taxes, na.rm = TRUE), total_taxes)) # na.rm = TRUE is essential: with NAs present, mean() would otherwise return NA
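One refinement worth noting: the mean above is computed over all values, including the implausible ones being replaced, which biases it downward. A less biased variant (a sketch, not what we ran) would average only the plausible values:
plausible_mean <- mean(d$total_taxes[d$total_taxes >= 1000], na.rm = TRUE)
d %<>%
  mutate(total_taxes = ifelse(total_taxes < 1000, plausible_mean, total_taxes))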
str(d)
## 'data.frame': 2230 obs. of 27 variables:
## $ dining_room_typecombo : num 0 0 0 0 0 0 0 0 0 0 ...
## $ dining_room_typeformal: num 0 0 0 0 0 0 0 0 0 0 ...
## $ dining_room_typeother : num 1 1 1 1 1 1 1 1 1 1 ...
## $ fuel_typeelectric : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fuel_typegas : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fuel_typeoil : num 0 0 0 0 0 0 0 0 0 0 ...
## $ fuel_typeother : num 1 1 1 1 1 1 1 1 1 1 ...
## $ kitchen_typecombo : num 1 1 1 1 1 1 1 1 1 1 ...
## $ kitchen_typeefficiency: num 0 0 0 0 0 0 0 0 0 0 ...
## $ approx_year_built : int 2005 1950 NA 1951 1961 1955 1952 1928 1940 1940 ...
## $ cats_allowed : num 0 0 0 1 0 0 1 1 1 1 ...
## $ dogs_allowed : num 0 0 0 1 0 0 1 0 1 1 ...
## $ garage_exists : int 0 0 0 1 0 0 1 0 0 0 ...
## $ num_bedrooms : int 1 1 2 1 1 2 2 2 1 2 ...
## $ num_floors_in_building: int 4 NA NA 6 6 NA 2 6 6 6 ...
## $ num_full_bathrooms : int 1 1 1 1 1 1 1 1 1 1 ...
## $ num_total_rooms : int 3 3 4 3 3 6 5 4 3 4 ...
## $ sale_price : num 535000 185000 235000 280000 225000 ...
## $ sq_footage : int 681 NA NA NA NA NA NA 900 918 980 ...
## $ total_taxes : num 1320 NA NA NA NA ...
## $ walk_score : int 98 85 97 96 93 97 82 97 97 97 ...
## $ lat : num 40.8 40.7 40.8 40.8 40.7 ...
## $ lon : num -73.9 -73.8 -73.8 -73.9 -73.9 ...
## $ coop : int 0 1 1 1 1 1 0 1 1 1 ...
## $ condo : int 1 0 0 0 0 0 1 0 0 0 ...
## $ monthly_cost : num NA 499 NA 647 648 ...
## $ shortest_dist : num 1.047 0.662 0.548 0.864 0.793 ...
To handle the remaining missing values, we believe the best method is missForest, which imputes each missing value by predicting it from the other features of its row. Before we do that, though, we need to flag the rows that had no sale_price, as those data points should not be used to fit the model.
df <- d %>%
mutate(sale_miss = ifelse(is.na(sale_price), 1, 0))
pacman::p_load(missForest)
df_imp = missForest(df, ntree = 300)$ximp
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
## missForest iteration 7 in progress...done!
## missForest iteration 8 in progress...done!
## missForest iteration 9 in progress...done!
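missForest also returns an out-of-bag estimate of its own imputation error, which is worth inspecting before trusting the imputed values (a sketch; above we kept only $ximp):
imp <- missForest(df, ntree = 300)
imp$OOBerror  # NRMSE, the normalized RMSE of the imputations (all columns here are numeric)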
All the missing values have now been imputed, so we can drop the rows where sale_price was originally missing.
df <- df_imp %>%
filter(sale_miss == 0) %>%
select(-sale_miss)
summary(lm(sale_price ~ ., df))
##
## Call:
## lm(formula = sale_price ~ ., data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -311557 -41301 -1315 44830 324624
##
## Coefficients: (4 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.995e+07 9.103e+06 -6.585 1.14e-10 ***
## dining_room_typecombo -1.030e+04 8.048e+03 -1.280 0.201060
## dining_room_typeformal 1.471e+04 9.886e+03 1.488 0.137328
## dining_room_typeother NA NA NA NA
## fuel_typeelectric 2.782e+03 2.763e+04 0.101 0.919832
## fuel_typegas -4.533e+03 1.408e+04 -0.322 0.747560
## fuel_typeoil -1.451e+04 1.459e+04 -0.994 0.320639
## fuel_typeother NA NA NA NA
## kitchen_typecombo 1.363e+04 7.251e+03 1.880 0.060725 .
## kitchen_typeefficiency NA NA NA NA
## approx_year_built -7.766e+01 2.666e+02 -0.291 0.770940
## cats_allowed -3.774e+03 9.727e+03 -0.388 0.698219
## dogs_allowed 2.955e+04 1.077e+04 2.744 0.006288 **
## garage_exists 7.094e+03 9.554e+03 0.743 0.458102
## num_bedrooms 4.637e+04 8.255e+03 5.617 3.21e-08 ***
## num_floors_in_building 3.999e+03 7.281e+02 5.492 6.29e-08 ***
## num_full_bathrooms -4.248e+03 5.576e+04 -0.076 0.939306
## num_total_rooms 4.871e+03 5.363e+03 0.908 0.364154
## sq_footage 3.879e+01 1.291e+01 3.005 0.002790 **
## total_taxes 3.894e+01 5.236e+00 7.438 4.41e-13 ***
## walk_score 7.051e+02 3.902e+02 1.807 0.071321 .
## lat 9.062e+05 1.275e+05 7.106 4.10e-12 ***
## lon -3.149e+05 8.756e+04 -3.596 0.000354 ***
## coop -2.260e+05 1.433e+04 -15.770 < 2e-16 ***
## condo NA NA NA NA
## monthly_cost 1.194e+02 1.489e+01 8.020 7.44e-15 ***
## shortest_dist 7.458e+02 6.197e+03 0.120 0.904258
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 77090 on 505 degrees of freedom
## Multiple R-squared: 0.8233, Adjusted R-squared: 0.8156
## F-statistic: 107 on 22 and 505 DF, p-value: < 2.2e-16
A simple linear model predicts sale_price with an in-sample R-squared of .82. Let’s see how this model performs out-of-sample.
## Train-Test Split
pacman::p_load(caTools)
set.seed(77)
spl = sample.split(df$sale_price, 0.75)
train = subset(df, spl == TRUE)
test = subset(df, spl == FALSE)
simple <- lm(sale_price ~ ., train)
yhat <- predict(simple, test)
## Warning in predict.lm(simple, test): prediction from a rank-deficient fit may be
## misleading
rsq <- function (preds, actual) {
  rss <- sum((preds - actual) ^ 2)        ## residual sum of squares
  tss <- sum((actual - mean(actual)) ^ 2) ## total sum of squares
  1 - rss / tss                           ## out-of-sample R-squared; can be negative
}
rsq(yhat, test$sale_price)
## [1] 0.6795935
The simple linear model has an out-of-sample R-squared of .68, which is to be expected from a simple model fit on such a small dataset. Let’s see what we can do with forward stepwise regression, selecting the best predictors by p-value.
pacman::p_load(olsrr)
forward <- ols_step_forward_p(simple, penter = .05, details = FALSE, progress = TRUE)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. dining_room_typecombo
## 2. dining_room_typeformal
## 3. dining_room_typeother
## 4. fuel_typeelectric
## 5. fuel_typegas
## 6. fuel_typeoil
## 7. fuel_typeother
## 8. kitchen_typecombo
## 9. kitchen_typeefficiency
## 10. approx_year_built
## 11. cats_allowed
## 12. dogs_allowed
## 13. garage_exists
## 14. num_bedrooms
## 15. num_floors_in_building
## 16. num_full_bathrooms
## 17. num_total_rooms
## 18. sq_footage
## 19. total_taxes
## 20. walk_score
## 21. lat
## 22. lon
## 23. coop
## 24. condo
## 25. monthly_cost
## 26. shortest_dist
##
## We are selecting variables based on p value...
##
## Variables Entered:
##
## - sq_footage
## - condo
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - total_taxes
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - num_floors_in_building
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - lat
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - num_bedrooms
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - lon
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - monthly_cost
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - dogs_allowed
## Note: model has aliased coefficients
## sums of squares computed by model comparison
## - walk_score
## Note: model has aliased coefficients
## sums of squares computed by model comparison
##
## No more variables to be added.
##
## Final Model Output
## ------------------
##
## Model Summary
## ----------------------------------------------------------------------
## R 0.910 RMSE 76453.491
## R-Squared 0.828 Coef. Var 24.191
## Adj. R-Squared 0.823 MSE 5845136242.437
## Pred R-Squared 0.809 MAE 55379.402
## ----------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -----------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -----------------------------------------------------------------------------
## Regression 1.079998e+13 10 1.079998e+12 184.769 0.0000
## Residual 2.250377e+12 385 5845136242.437
## Total 1.305036e+13 395
## -----------------------------------------------------------------------------
##
## Parameter Estimates
## -------------------------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## -------------------------------------------------------------------------------------------------------------------------
## (Intercept) -60617604.488 9943201.750 -6.096 0.000 -80167379.052 -41067829.923
## sq_footage 167.005 24.878 0.261 6.713 0.000 118.091 215.919
## condo 207189.014 13113.421 0.494 15.800 0.000 181406.130 232971.898
## total_taxes 27.365 5.533 0.141 4.946 0.000 16.488 38.243
## num_floors_in_building 4299.206 764.351 0.152 5.625 0.000 2796.382 5802.031
## lat 991449.598 140942.019 0.163 7.034 0.000 714337.178 1268562.017
## num_bedrooms 38031.318 7694.065 0.157 4.943 0.000 22903.672 53158.963
## lon -271213.131 96638.547 -0.080 -2.806 0.005 -461218.509 -81207.753
## monthly_cost 88.103 18.188 0.185 4.844 0.000 52.343 123.862
## dogs_allowed 30190.746 8938.341 0.075 3.378 0.001 12616.674 47764.818
## walk_score 937.899 365.430 0.070 2.567 0.011 219.410 1656.388
## -------------------------------------------------------------------------------------------------------------------------
yhat <- predict(forward$model, test)
rsq(yhat, test$sale_price)
## [1] 0.6399033
Using only the 10 selected predictors, the in-sample fit was essentially unchanged, though the out-of-sample R-squared slipped to .64. Let’s see what we can do if we include these 10 predictors along with the interactions between them.
top10 <- forward$predictors
f <- df %>%
  select(all_of(top10), sale_price)
set.seed(777)
spl = sample.split(f$sale_price, 0.75)
ftrain = subset(f, spl == TRUE)
ftest = subset(f, spl == FALSE)
associations <- lm(sale_price ~ . * ., ftrain)
forward <- ols_step_forward_p(associations, penter = .025, details = FALSE, progress = TRUE)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. sq_footage
## 2. condo
## 3. total_taxes
## 4. num_floors_in_building
## 5. lat
## 6. num_bedrooms
## 7. lon
## 8. monthly_cost
## 9. dogs_allowed
## 10. walk_score
## 11. sq_footage:condo
## 12. sq_footage:total_taxes
## 13. sq_footage:num_floors_in_building
## 14. sq_footage:lat
## 15. sq_footage:num_bedrooms
## 16. sq_footage:lon
## 17. sq_footage:monthly_cost
## 18. sq_footage:dogs_allowed
## 19. sq_footage:walk_score
## 20. condo:total_taxes
## 21. condo:num_floors_in_building
## 22. condo:lat
## 23. condo:num_bedrooms
## 24. condo:lon
## 25. condo:monthly_cost
## 26. condo:dogs_allowed
## 27. condo:walk_score
## 28. total_taxes:num_floors_in_building
## 29. total_taxes:lat
## 30. total_taxes:num_bedrooms
## 31. total_taxes:lon
## 32. total_taxes:monthly_cost
## 33. total_taxes:dogs_allowed
## 34. total_taxes:walk_score
## 35. num_floors_in_building:lat
## 36. num_floors_in_building:num_bedrooms
## 37. num_floors_in_building:lon
## 38. num_floors_in_building:monthly_cost
## 39. num_floors_in_building:dogs_allowed
## 40. num_floors_in_building:walk_score
## 41. lat:num_bedrooms
## 42. lat:lon
## 43. lat:monthly_cost
## 44. lat:dogs_allowed
## 45. lat:walk_score
## 46. num_bedrooms:lon
## 47. num_bedrooms:monthly_cost
## 48. num_bedrooms:dogs_allowed
## 49. num_bedrooms:walk_score
## 50. lon:monthly_cost
## 51. lon:dogs_allowed
## 52. lon:walk_score
## 53. monthly_cost:dogs_allowed
## 54. monthly_cost:walk_score
## 55. dogs_allowed:walk_score
##
## We are selecting variables based on p value...
##
## Variables Entered:
##
## - total_taxes:num_bedrooms
## - condo
## - num_floors_in_building
## - lon
## - dogs_allowed
## - walk_score
## - lat
## - sq_footage
## - monthly_cost
## - total_taxes
## - lon:walk_score
## - num_bedrooms
## - lat:monthly_cost
## - total_taxes:dogs_allowed
## - sq_footage:condo
## - num_bedrooms:monthly_cost
## - total_taxes:monthly_cost
## - total_taxes:lon
## - condo:lat
## - lat:lon
## - num_floors_in_building:lat
##
## No more variables to be added.
##
## Final Model Output
## ------------------
##
## Model Summary
## ----------------------------------------------------------------------
## R 0.948 RMSE 60244.330
## R-Squared 0.898 Coef. Var 19.097
## Adj. R-Squared 0.892 MSE 3629379268.003
## Pred R-Squared 0.864 MAE 44527.994
## ----------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## -------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## -------------------------------------------------------------------------------
## Regression 1.197833e+13 21 570396529546.032 157.161 0.0000
## Residual 1.357388e+12 374 3629379268.003
## Total 1.333571e+13 395
## -------------------------------------------------------------------------------
##
## Parameter Estimates
## ------------------------------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ------------------------------------------------------------------------------------------------------------------------------
## (Intercept) -26217403996.743 9183511110.364 -2.855 0.005 -4.427519e+10 -8.159616e+09
## condo -40766326.922 12513968.522 -95.867 -3.258 0.001 -6.537288e+07 -1.615977e+07
## num_floors_in_building -2449958.827 866429.924 -90.786 -2.828 0.005 -4.153644e+06 -7.462741e+05
## lon -354065685.052 124413085.938 -102.808 -2.846 0.005 -5.987025e+08 -1.094289e+08
## dogs_allowed -72558.034 28911.912 -0.175 -2.510 0.013 -1.294083e+05 -1.570776e+04
## walk_score -2629240.555 374394.412 -186.352 -7.023 0.000 -3.365422e+06 -1.893059e+06
## lat 650461223.258 225719272.456 107.150 2.882 0.004 2.066233e+08 1.094299e+09
## sq_footage 226.144 29.663 0.476 7.624 0.000 1.678170e+02 2.844710e+02
## monthly_cost 92973.512 17036.925 198.547 5.457 0.000 5.947334e+04 1.264737e+05
## total_taxes -30004.616 6434.903 -144.894 -4.663 0.000 -4.265774e+04 -1.735149e+04
## num_bedrooms -18853.585 19373.788 -0.077 -0.973 0.331 -5.694879e+04 1.924162e+04
## total_taxes:num_bedrooms 39.772 6.029 0.705 6.596 0.000 2.791600e+01 5.162700e+01
## lon:walk_score -35638.296 5073.867 -186.917 -7.024 0.000 -4.561518e+04 -2.566141e+04
## lat:monthly_cost -2281.080 418.308 -198.523 -5.453 0.000 -3.103610e+03 -1.458550e+03
## total_taxes:dogs_allowed 32.901 9.928 0.237 3.314 0.001 1.337900e+01 5.242400e+01
## condo:sq_footage -215.665 30.261 -0.628 -7.127 0.000 -2.751680e+02 -1.561620e+02
## num_bedrooms:monthly_cost -98.258 15.350 -0.658 -6.401 0.000 -1.284410e+02 -6.807600e+01
## total_taxes:monthly_cost 0.067 0.012 0.696 5.431 0.000 4.300000e-02 9.200000e-02
## total_taxes:lon -405.200 87.129 -144.474 -4.651 0.000 -5.765250e+02 -2.338760e+02
## condo:lat 1010825.928 307250.846 96.856 3.290 0.001 4.066702e+05 1.614982e+06
## lon:lat 8784776.976 3057922.190 136.102 2.873 0.004 2.771901e+06 1.479765e+07
## num_floors_in_building:lat 60201.655 21268.097 90.896 2.831 0.005 1.838162e+04 1.020217e+05
## ------------------------------------------------------------------------------------------------------------------------------
yhat <- predict(forward$model, ftest)
rsq(yhat, ftest$sale_price)
## [1] 0.7937756
That’s a significant improvement: the out-of-sample R-squared jumped to .79 with this method. As a check, let’s see how this method performs if we train on the whole dataset (including the rows whose sale_price was itself imputed).
f <- df_imp %>%
  select(all_of(top10), sale_price)
set.seed(77)
spl = sample.split(f$sale_price, 0.75)
ftrain = subset(f, spl == TRUE)
ftest = subset(f, spl == FALSE)
associations <- lm(sale_price ~ . * ., ftrain)
forward <- ols_step_forward_p(associations, penter = .005, details = FALSE, progress = TRUE)
## Forward Selection Method
## ---------------------------
##
## Candidate Terms:
##
## 1. sq_footage
## 2. condo
## 3. total_taxes
## 4. num_floors_in_building
## 5. lat
## 6. num_bedrooms
## 7. lon
## 8. monthly_cost
## 9. dogs_allowed
## 10. walk_score
## 11. sq_footage:condo
## 12. sq_footage:total_taxes
## 13. sq_footage:num_floors_in_building
## 14. sq_footage:lat
## 15. sq_footage:num_bedrooms
## 16. sq_footage:lon
## 17. sq_footage:monthly_cost
## 18. sq_footage:dogs_allowed
## 19. sq_footage:walk_score
## 20. condo:total_taxes
## 21. condo:num_floors_in_building
## 22. condo:lat
## 23. condo:num_bedrooms
## 24. condo:lon
## 25. condo:monthly_cost
## 26. condo:dogs_allowed
## 27. condo:walk_score
## 28. total_taxes:num_floors_in_building
## 29. total_taxes:lat
## 30. total_taxes:num_bedrooms
## 31. total_taxes:lon
## 32. total_taxes:monthly_cost
## 33. total_taxes:dogs_allowed
## 34. total_taxes:walk_score
## 35. num_floors_in_building:lat
## 36. num_floors_in_building:num_bedrooms
## 37. num_floors_in_building:lon
## 38. num_floors_in_building:monthly_cost
## 39. num_floors_in_building:dogs_allowed
## 40. num_floors_in_building:walk_score
## 41. lat:num_bedrooms
## 42. lat:lon
## 43. lat:monthly_cost
## 44. lat:dogs_allowed
## 45. lat:walk_score
## 46. num_bedrooms:lon
## 47. num_bedrooms:monthly_cost
## 48. num_bedrooms:dogs_allowed
## 49. num_bedrooms:walk_score
## 50. lon:monthly_cost
## 51. lon:dogs_allowed
## 52. lon:walk_score
## 53. monthly_cost:dogs_allowed
## 54. monthly_cost:walk_score
## 55. dogs_allowed:walk_score
##
## We are selecting variables based on p value...
##
## Variables Entered:
##
## - total_taxes:lat
## - dogs_allowed
## - walk_score
## - lon
## - condo
## - lat
## - sq_footage:monthly_cost
## - total_taxes
## - num_floors_in_building
## - num_bedrooms
## - sq_footage
## - sq_footage:condo
## - monthly_cost
## - lon:walk_score
## - total_taxes:num_floors_in_building
## - condo:lat
## - condo:total_taxes
## - total_taxes:lon
## - condo:num_bedrooms
## - sq_footage:total_taxes
## - monthly_cost:walk_score
## - num_floors_in_building:walk_score
## - num_bedrooms:lon
## - sq_footage:dogs_allowed
## - num_bedrooms:walk_score
##
## No more variables to be added.
##
## Final Model Output
## ------------------
##
## Model Summary
## ----------------------------------------------------------------------
## R 0.963 RMSE 43057.977
## R-Squared 0.928 Coef. Var 12.989
## Adj. R-Squared 0.927 MSE 1853989393.030
## Pred R-Squared 0.917 MAE 30379.483
## ----------------------------------------------------------------------
## RMSE: Root Mean Square Error
## MSE: Mean Square Error
## MAE: Mean Absolute Error
##
## ANOVA
## ------------------------------------------------------------------------------
## Sum of
## Squares DF Mean Square F Sig.
## ------------------------------------------------------------------------------
## Regression 3.937827e+13 25 1.575131e+12 849.59 0.0000
## Residual 3.051667e+12 1646 1853989393.030
## Total 4.242993e+13 1671
## ------------------------------------------------------------------------------
##
## Parameter Estimates
## ---------------------------------------------------------------------------------------------------------------------------------------
## model Beta Std. Error Std. Beta t Sig lower upper
## ---------------------------------------------------------------------------------------------------------------------------------------
## (Intercept) 114601766.960 15874177.051 7.219 0.000 83466056.706 145737477.214
## dogs_allowed -16244.660 7976.539 -0.044 -2.037 0.042 -31889.894 -599.427
## walk_score -1367222.664 127449.577 -126.901 -10.728 0.000 -1617203.062 -1117242.267
## lon 1742446.457 178303.373 0.546 9.772 0.000 1392721.105 2092171.810
## condo -31805036.411 3755571.289 -87.853 -8.469 0.000 -39171237.450 -24438835.371
## lat 340126.658 151005.726 0.061 2.252 0.024 43943.082 636310.234
## total_taxes -17025.713 3507.257 -91.171 -4.854 0.000 -23904.870 -10146.557
## num_floors_in_building -577.604 1310.419 -0.024 -0.441 0.659 -3147.868 1992.659
## num_bedrooms 14073100.741 2900092.398 66.263 4.853 0.000 8384841.355 19761360.128
## sq_footage 232.807 16.246 0.479 14.330 0.000 200.943 264.672
## monthly_cost -11.809 25.304 -0.030 -0.467 0.641 -61.442 37.823
## total_taxes:lat 12.737 50.168 2.781 0.254 0.800 -85.663 111.137
## sq_footage:monthly_cost 0.002 0.009 0.008 0.222 0.824 -0.015 0.019
## condo:sq_footage -129.019 14.413 -0.401 -8.951 0.000 -157.289 -100.748
## walk_score:lon -18519.177 1726.947 -127.134 -10.724 0.000 -21906.421 -15131.933
## total_taxes:num_floors_in_building 2.218 0.217 0.369 10.212 0.000 1.792 2.644
## lat:condo 789807.414 92223.395 88.889 8.564 0.000 608919.870 970694.958
## total_taxes:condo -35.450 4.318 -0.362 -8.210 0.000 -43.920 -26.981
## total_taxes:lon -224.464 33.412 -88.752 -6.718 0.000 -289.998 -158.931
## condo:num_bedrooms 28312.242 4526.601 0.160 6.255 0.000 19433.739 37190.745
## total_taxes:sq_footage -0.016 0.004 -0.192 -3.613 0.000 -0.024 -0.007
## walk_score:monthly_cost 0.919 0.227 0.180 4.054 0.000 0.474 1.364
## walk_score:num_floors_in_building -45.833 12.352 -0.151 -3.710 0.000 -70.061 -21.605
## lon:num_bedrooms 190837.663 39348.576 66.337 4.850 0.000 113659.120 268016.205
## dogs_allowed:sq_footage 33.056 8.242 0.091 4.010 0.000 16.889 49.222
## walk_score:num_bedrooms 387.834 114.503 0.155 3.387 0.001 163.247 612.420
## ---------------------------------------------------------------------------------------------------------------------------------------
yhat <- predict(forward$model, test)  # note: we evaluate on the earlier hold-out of real (non-imputed) sale prices
rsq(yhat, test$sale_price)
## [1] 0.8655408
Even though this model was trained largely on imputed (essentially fake) sale prices, it still performed very well when predicting on the real data points. This suggests the method has promise, at least if we can scrape more data points that include sale_price.
## Random Forest
Let’s see how a random forest regression model performs on our real dataset.
pacman::p_load(randomForest)
rf <- randomForest(sale_price ~ ., train, mtry=3, ntree = 500, importance=TRUE)
yhat <- predict(rf, test)
rsq(yhat, test$sale_price)
## [1] 0.8920029
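Since we fit the forest with importance = TRUE, we can also inspect which features drive its predictions (output omitted):
importance(rf)  # permutation importance (%IncMSE) and node-purity importance
varImpPlot(rf)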
The random forest regression model had an out-of-sample R-squared of .89, a huge improvement over the simple linear model. This is also a very acceptable R-squared in general, considering we had only 528 real data points. To build intuition for how the forest’s constituent trees work, let’s fit and plot a single regression tree:
pacman::p_load(rpart, rpart.plot)
tree <- rpart(
formula = sale_price ~ .,
data = df,
method = "anova"
)
rpart.plot(tree)
What if we trained the random forest on the larger, partially imputed dataset?
rf <- randomForest(sale_price ~ ., ftrain, mtry = 3, ntree = 500, importance = TRUE)
yhat <- predict(rf, test)
rsq(yhat, test$sale_price)
## [1] 0.9599098
An amazing .96 R-squared using the larger dataset to train the model, then predicting on the real data points. There is definitely some overfitting going on when we train on the larger dataset, though: the missing sale prices were generated from the rest of the dataset’s information, and since this larger split was drawn independently, some of the real test rows may even appear in that training set. It would still be interesting to try these models on a larger, fully real dataset.