Predicting home prices

It is widely known in real estate that the two most important factors determining the price of a house are its size (SqFt) and its location (Zip code, for example). So it is practically important to be able to determine how much a house at a certain location would cost relative to comparable houses (“comps” as they call them) based on differences in square footage and other information.

Given a dataset with the prices and other characteristics of homes, a quick regression can give us a ballpark estimate of how each factor, on average, contributes to home prices in the dataset:

# import dataset house
house <- read.csv("/cloud/project/house.csv")

Exercise: Use the command summary(house) to explore variables that could potentially contribute to the price of a home:

summary(house)
##        X              Obs           BedRooms         Baths      
##  Min.   :  2.0   Min.   :  2.0   Min.   :2.000   Min.   :1.000  
##  1st Qu.:128.2   1st Qu.:128.2   1st Qu.:3.000   1st Qu.:1.500  
##  Median :253.5   Median :253.5   Median :3.000   Median :2.000  
##  Mean   :253.4   Mean   :253.4   Mean   :3.313   Mean   :2.118  
##  3rd Qu.:378.8   3rd Qu.:378.8   3rd Qu.:4.000   3rd Qu.:2.500  
##  Max.   :504.0   Max.   :504.0   Max.   :5.000   Max.   :6.500  
##      Garage           Zip            Price             SqFt     
##  Min.   :0.000   Min.   :4.000   Min.   : 52000   Min.   : 672  
##  1st Qu.:2.000   1st Qu.:5.000   1st Qu.:104925   1st Qu.:1300  
##  Median :2.000   Median :6.000   Median :129900   Median :1680  
##  Mean   :1.817   Mean   :6.458   Mean   :157889   Mean   :1826  
##  3rd Qu.:2.000   3rd Qu.:9.000   3rd Qu.:184400   3rd Qu.:2208  
##  Max.   :3.000   Max.   :9.000   Max.   :830000   Max.   :8805  
##       pps        
##  Min.   : 34.28  
##  1st Qu.: 73.73  
##  Median : 84.39  
##  Mean   : 85.09  
##  3rd Qu.: 94.45  
##  Max.   :160.61

Variables that could potentially affect the price of a house: BedRooms, Baths, Garage, Zip, and SqFt.

A simple regression to predict Price based on SqFt

summary(lm(Price~SqFt, data=house))
## 
## Call:
## lm(formula = Price ~ SqFt, data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -188158  -21004    -161   17388  226790 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -24431.933   4680.314   -5.22 2.63e-07 ***
## SqFt            99.863      2.361   42.29  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 40820 on 500 degrees of freedom
## Multiple R-squared:  0.7815, Adjusted R-squared:  0.7811 
## F-statistic:  1788 on 1 and 500 DF,  p-value: < 2.2e-16

Regression jargon: In this regression, Price is said to be the response variable. The variable on the right-hand side of the regression formula, SqFt, is called the predictor variable.

An alternative terminology, more popular but also misleading, calls them the dependent and independent variables, respectively.

Interpretation: On average, a house that is 1 sq ft larger is expected to cost $99.86 more.
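
As a quick check on this interpretation, the fitted line can be evaluated by hand from the coefficients printed above. The sketch below is only an illustration: the helper name predict_price_sqft and the 2,000 sq ft example house are assumptions, not part of the original analysis.

```r
# Coefficients copied from the summary(lm(Price ~ SqFt)) output above
b0 <- -24431.933   # intercept
b1 <- 99.863       # slope: dollars per additional square foot

# Hypothetical helper: predicted price for a house of a given size
predict_price_sqft <- function(sqft) b0 + b1 * sqft

p2000 <- predict_price_sqft(2000)
p2001 <- predict_price_sqft(2001)

print(p2000)           # predicted price of a 2,000 sq ft house
print(p2001 - p2000)   # one extra square foot adds exactly the slope
```

If the fitted model were stored in an object (say fit, a name assumed here), predict(fit, newdata = data.frame(SqFt = 2000)) should give the same number up to rounding of the printed coefficients.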

Regressing Price on Zip codes: Regression also works with factor (categorical) variables.

# first we use table() and aggregate() to describe the distribution of Zip and average price by Zip
table(house$Zip)
## 
##   4   5   6   9 
##  43 158 143 158
aggregate(Price~factor(Zip), FUN = "mean", data=house)
##   factor(Zip)     Price
## 1           4  95876.74
## 2           5 172757.28
## 3           6 194158.74
## 4           9 127071.52
#now run the regression
summary(lm(Price~factor(Zip), data=house)) 
## 
## Call:
## lm(formula = Price ~ factor(Zip), data = house)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -130259  -47172  -17172   25741  657243 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     95877      12365   7.754 5.06e-14 ***
## factor(Zip)5    76880      13946   5.513 5.68e-08 ***
## factor(Zip)6    98282      14102   6.970 1.01e-11 ***
## factor(Zip)9    31195      13946   2.237   0.0257 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 81080 on 498 degrees of freedom
## Multiple R-squared:  0.1412, Adjusted R-squared:  0.1361 
## F-statistic:  27.3 on 3 and 498 DF,  p-value: 2.319e-16
# Note: it is really important to wrap Zip in factor()
# What happens if you just regress Price on Zip?

Interpretation: The regression sets the first Zip code (Zip==4) as the baseline and compares the average home price in each of the other Zip codes with it. Thus, the average home price in Zip==4 is stored in the intercept estimate, which is 95,877. The regression results indicate that the average home price in Zip==6 is 98,282 more than in Zip==4. Therefore, we expect the average home price in Zip==6 to be 95,877 + 98,282 = 194,159, which matches the average reported by aggregate() above.
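
The same arithmetic works for every Zip code. The sketch below rebuilds the four group averages from the printed coefficients and compares them with the aggregate() output shown earlier. The variable names are chosen for illustration, and the coefficients are the rounded values from the printout, so agreement is expected only to within about a dollar.

```r
# Coefficients copied from the summary(lm(Price ~ factor(Zip))) output above
intercept <- 95877                                     # mean price in the baseline, Zip == 4
offsets   <- c("5" = 76880, "6" = 98282, "9" = 31195)  # differences from the baseline

# Mean price per Zip code = baseline + offset (the baseline's offset is 0)
zip_means <- c("4" = intercept, intercept + offsets)
print(zip_means)

# Group means copied from the aggregate() output above
agg_means <- c("4" = 95876.74, "5" = 172757.28, "6" = 194158.74, "9" = 127071.52)
print(round(zip_means - agg_means, 2))   # differences are rounding error only
```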

There are many issues with interpreting these ballpark estimates. One of them is that we are not able to account for other factors that could potentially affect home prices. A visualization of the data will illustrate some of these issues:

library(ggplot2)
ggplot(house, aes(x=SqFt, y=Price, color = factor(Zip))) +
  geom_point(alpha=1/3) +
  scale_color_brewer(palette = 'Set1') +
  theme_bw()

Based on our visual exploration of the price-per-sqft relationship, we note that:

  1. There is one outlier that may “bias” our estimate.
  2. Price-per-sqft could vary across zip codes.
  3. It is not clear how the three-way relationship among Price, SqFt, and Zip code may change when we also consider other factors that affect home price.
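
One way to check the second point numerically is to fit a separate Price ~ SqFt line within each zip code and compare the slopes. The sketch below uses a tiny made-up data frame (toy, not the house dataset) so that it runs on its own. With the real data, the same split() and sapply() pattern could be applied to house.

```r
# Toy data, made up for illustration only (NOT the house dataset)
toy <- data.frame(
  Zip   = c(4, 4, 4, 6, 6, 6),
  SqFt  = c(1000, 1500, 2000, 1000, 1500, 2000),
  Price = c(100000, 150000, 200000, 120000, 195000, 270000)
)

# Fit one simple regression per zip code and keep only the SqFt slope
slopes <- sapply(split(toy, toy$Zip), function(d) {
  unname(coef(lm(Price ~ SqFt, data = d))["SqFt"])
})

print(slopes)   # dollars per square foot, one value per zip code
```

Clearly different slopes across groups suggest that a single price-per-square-foot number for all zip codes would be misleading.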

An optimal linear model for predictive purpose

It turns out that a good model for predicting house prices from the information available in the dataset is the one below. (I would like to call it “the optimal” model, but unfortunately there is no single optimal model. My choice was partially subjective, as you will see.)

summary(lm(Price~SqFt+Baths+SqFt*(Zip=="6"), 
           data = house[house$Price<6e+5&house$SqFt<5000,]))
## 
## Call:
## lm(formula = Price ~ SqFt + Baths + SqFt * (Zip == "6"), data = house[house$Price < 
##     6e+05 & house$SqFt < 5000, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -161221  -19156    -106   14596  215388 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -24375.754   5753.210  -4.237 2.70e-05 ***
## SqFt                    67.163      3.900  17.220  < 2e-16 ***
## Baths                25598.878   3461.690   7.395 6.11e-13 ***
## Zip == "6"TRUE      -63679.569  11577.065  -5.500 6.09e-08 ***
## SqFt:Zip == "6"TRUE     39.339      5.591   7.036 6.62e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36060 on 494 degrees of freedom
## Multiple R-squared:  0.7901, Adjusted R-squared:  0.7884 
## F-statistic: 464.7 on 4 and 494 DF,  p-value: < 2.2e-16

As we will see later, this regression is “optimal” in the sense that it balances an accurate estimate of the Price-per-SqFt relationship we care about against simplicity.
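
Note that this model is fitted on a subset of the data (Price below 600,000 and SqFt below 5,000). A small sketch of how to read the number of excluded houses off the printed outputs, assuming no rows were dropped for missing values (the summary(house) output shows none): for a linear model, the number of observations equals the residual degrees of freedom plus the number of estimated coefficients.

```r
# Observations = residual degrees of freedom + number of estimated coefficients
n_full   <- 500 + 2   # Price ~ SqFt on the full data: 500 residual df, 2 coefficients
n_subset <- 494 + 5   # filtered model: 494 residual df, 5 coefficients

n_dropped <- n_full - n_subset
print(n_dropped)      # houses excluded by the Price and SqFt filter
```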

A back-of-the-envelope interpretation:

We need a few more conceptual skills to fully interpret this regression result. For now I just want to highlight one important way in which it differs from the simplistic model Price ~ SqFt or the model Price~factor(Zip).

In this model, the coefficient for SqFt is about 67, and the coefficient for SqFt:Zip == "6"TRUE is about 39. Combined, these coefficients indicate that:

  1. For houses in all zip codes other than zip code 6, a one-square-foot difference in size predicts a $67 increase in price.

  2. For houses in zip code 6, a one-square-foot difference in size predicts an additional $39 in price. Thus for zip code 6, a one-square-foot difference in size predicts a total increase of roughly $67 + $39 = $106 in price.
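
The two slopes can be reproduced directly from the printed coefficients. The sketch below also evaluates the fitted equation for one example house. The helper predict_price and the example house (2,000 sq ft, 2 baths) are assumptions made for illustration only.

```r
# Coefficients copied from the model output above
b <- c(intercept = -24375.754, sqft = 67.163, baths = 25598.878,
       zip6 = -63679.569, sqft_zip6 = 39.339)

# Hypothetical helper: predicted price given size, bathrooms, and a zip-6 flag
predict_price <- function(sqft, baths, in_zip6) {
  z <- as.numeric(in_zip6)   # 1 for zip code 6, 0 otherwise
  unname(b["intercept"] + b["sqft"] * sqft + b["baths"] * baths +
           b["zip6"] * z + b["sqft_zip6"] * sqft * z)
}

slope_other <- unname(b["sqft"])                    # price per sq ft outside zip code 6
slope_zip6  <- unname(b["sqft"] + b["sqft_zip6"])   # price per sq ft in zip code 6

# Size at which the zip-6 intercept penalty and the extra slope cancel out
breakeven_sqft <- unname(-b["zip6"] / b["sqft_zip6"])

print(c(slope_other = slope_other, slope_zip6 = slope_zip6))
print(breakeven_sqft)
print(predict_price(2000, 2, in_zip6 = FALSE))
print(predict_price(2000, 2, in_zip6 = TRUE))
```

Because the Zip == "6"TRUE coefficient is negative while the interaction is positive, the model predicts a zip code 6 premium only for houses larger than the break-even size. Smaller houses are predicted to be cheaper there than comparable houses elsewhere.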

The *** next to each of these estimates indicates that, if there were no relationship in the population, estimates this large would be very unlikely to arise from random sampling error alone. In effect, our estimates look statistically reliable. The more stars an estimate has, the smaller its p-value and the more statistically significant it is.
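
To make the link between the columns of the coefficient table concrete, the sketch below recomputes the t value and two-sided p-value for the SqFt coefficient from its estimate and standard error, and maps p-values to stars using the cutoffs listed in the "Signif. codes" line. The function name stars is an illustrative choice and not something R provides under that name.

```r
# SqFt row and residual degrees of freedom copied from the model output above
est      <- 67.163
se       <- 3.900
df_resid <- 494

t_value <- est / se                                # the "t value" column
p_value <- 2 * pt(-abs(t_value), df = df_resid)    # the two-sided "Pr(>|t|)" column

# Hypothetical helper: map p-values to the significance codes used by summary()
stars <- function(p) {
  as.character(cut(p,
                   breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
                   labels = c("***", "**", "*", ".", " "),
                   include.lowest = TRUE))
}

print(t_value)
print(p_value)
print(stars(c(p_value, 0.0257, 0.206338)))
```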

For the practical concern of pricing homes by their square footage, the “optimal” model yields a drastically different result from the simple one-variable regressions. It tells us that we need to account for other amenities such as Baths and that we need to allow the price per square foot to differ between zip code 6 and the other zip codes. Failing to do so could lead to severely mispricing houses on the market.

We can incorporate other factors into the model without changing the basic estimates above by much:

Exercise: Rerun the regression above but add Garage and BedRooms as predictor variables. Use the regression results to estimate the average value of an additional bathroom and an additional garage, respectively.

summary(lm(Price~SqFt+Baths+Garage+BedRooms+SqFt*(Zip=="6"), 
           data = house[house$Price<600000&house$SqFt<5000,]))
## 
## Call:
## lm(formula = Price ~ SqFt + Baths + Garage + BedRooms + SqFt * 
##     (Zip == "6"), data = house[house$Price < 6e+05 & house$SqFt < 
##     5000, ])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -158998  -18297   -1398   15383  213432 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -11712.973   9256.573  -1.265 0.206338    
## SqFt                    69.709      4.311  16.171  < 2e-16 ***
## Baths                22482.502   3663.617   6.137 1.73e-09 ***
## Garage                9244.201   2749.936   3.362 0.000835 ***
## BedRooms             -8404.993   3448.068  -2.438 0.015138 *  
## Zip == "6"TRUE      -63935.156  11426.652  -5.595 3.66e-08 ***
## SqFt:Zip == "6"TRUE     40.047      5.508   7.271 1.41e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35450 on 492 degrees of freedom
## Multiple R-squared:  0.7979, Adjusted R-squared:  0.7954 
## F-statistic: 323.7 on 6 and 492 DF,  p-value: < 2.2e-16
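
For the exercise, the Baths and Garage coefficients are the point estimates of the value of one additional unit, holding the other predictors fixed. As a hedged sketch, approximate 95% confidence intervals can be built by hand from the printed estimates and standard errors. The object names below are illustrative.

```r
# Estimates and standard errors copied from the model output above
est <- c(Baths = 22482.502, Garage = 9244.201)
se  <- c(Baths = 3663.617,  Garage = 2749.936)

# Critical value of the t distribution with the model's residual degrees of freedom
t_crit <- qt(0.975, df = 492)

# 95% confidence interval: estimate +/- t_crit * standard error
ci <- cbind(lower = est - t_crit * se,
            estimate = est,
            upper = est + t_crit * se)
print(round(ci))
```

With the fitted model stored in an object, confint() applied to that object should report the same intervals without the manual arithmetic.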

Plan for the next steps:

Getting from the simplistic regression to an optimal model for our needs is a multi-step journey. These steps are:

  1. Visually explore how house characteristics may affect their prices.
  2. Learn how to run multiple-variable regressions and how to interpret the results.
  3. Learn the basic criteria of model evaluation to guide our choice of models.
  4. Apply a popular procedure in data science to select an optimal regression model.