It is widely known in real estate that the two most important factors determining the price of a house are its size (SqFt) and its location (Zip code, for example). So it is practically important to be able to determine how much a house at a certain location would cost relative to comparable houses (“comps,” as they are called) based on differences in square footage and other information.
Given a dataset on the prices and other characteristics of homes, a quick regression can give us a ballpark estimate of how much each factor, on average, contributes to home prices in the dataset:
# import dataset house
house <- read.csv("/cloud/project/house.csv")
Exercise: Use the command summary(house) to explore variables that could potentially contribute to the price of a home:
summary(house)
##        X              Obs           BedRooms         Baths
##  Min.   :  2.0   Min.   :  2.0   Min.   :2.000   Min.   :1.000
##  1st Qu.:128.2   1st Qu.:128.2   1st Qu.:3.000   1st Qu.:1.500
##  Median :253.5   Median :253.5   Median :3.000   Median :2.000
##  Mean   :253.4   Mean   :253.4   Mean   :3.313   Mean   :2.118
##  3rd Qu.:378.8   3rd Qu.:378.8   3rd Qu.:4.000   3rd Qu.:2.500
##  Max.   :504.0   Max.   :504.0   Max.   :5.000   Max.   :6.500
##      Garage           Zip            Price             SqFt
##  Min.   :0.000   Min.   :4.000   Min.   : 52000   Min.   : 672
##  1st Qu.:2.000   1st Qu.:5.000   1st Qu.:104925   1st Qu.:1300
##  Median :2.000   Median :6.000   Median :129900   Median :1680
##  Mean   :1.817   Mean   :6.458   Mean   :157889   Mean   :1826
##  3rd Qu.:2.000   3rd Qu.:9.000   3rd Qu.:184400   3rd Qu.:2208
##  Max.   :3.000   Max.   :9.000   Max.   :830000   Max.   :8805
##       pps
##  Min.   : 34.28
##  1st Qu.: 73.73
##  Median : 84.39
##  Mean   : 85.09
##  3rd Qu.: 94.45
##  Max.   :160.61
#### List of variables that could potentially affect the price of a house: BedRooms, Baths, Garage, Zip, and SqFt.
A simple regression to predict Price from SqFt:
summary(lm(Price~SqFt, data=house))
##
## Call:
## lm(formula = Price ~ SqFt, data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -188158 -21004 -161 17388 226790
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24431.933 4680.314 -5.22 2.63e-07 ***
## SqFt 99.863 2.361 42.29 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 40820 on 500 degrees of freedom
## Multiple R-squared: 0.7815, Adjusted R-squared: 0.7811
## F-statistic: 1788 on 1 and 500 DF, p-value: < 2.2e-16
Regression jargon: In this regression, Price is said to be the response variable. The variable on the right-hand side of the regression formula, SqFt, is called the predictor variable.
An alternative terminology, more popular but also misleading, calls them the dependent and independent variables, respectively.
Interpretation: On average, a house that is 1 sq ft larger is expected to cost $99.86 more.
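To make this interpretation concrete, here is a minimal sketch that plugs the rounded coefficients printed above into the fitted line. The helper name predict_price and the two example sizes are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied (rounded) from the Price ~ SqFt output above
intercept  <- -24431.933
slope_sqft <- 99.863

# Hypothetical helper: fitted price for a given size in square feet
predict_price <- function(sqft) intercept + slope_sqft * sqft

# Two illustrative houses that differ by 100 square feet
p1 <- predict_price(1700)
p2 <- predict_price(1800)

# Difference in predicted price: 100 times the SqFt coefficient
price_gap <- p2 - p1
print(price_gap)
```

If the fitted model were stored in an object, say fit <- lm(Price~SqFt, data=house), then predict(fit, newdata = data.frame(SqFt = c(1700, 1800))) should return the same predictions up to rounding of the coefficients.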
Regressing Price on Zip codes: Regression also works with factor (categorical) variables.
# first we use table() and aggregate() to describe the distribution of Zip and average price by Zip
table(house$Zip)
##
## 4 5 6 9
## 43 158 143 158
aggregate(Price~factor(Zip), FUN = "mean", data=house)
## factor(Zip) Price
## 1 4 95876.74
## 2 5 172757.28
## 3 6 194158.74
## 4 9 127071.52
#now run the regression
summary(lm(Price~factor(Zip), data=house))
##
## Call:
## lm(formula = Price ~ factor(Zip), data = house)
##
## Residuals:
## Min 1Q Median 3Q Max
## -130259 -47172 -17172 25741 657243
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95877 12365 7.754 5.06e-14 ***
## factor(Zip)5 76880 13946 5.513 5.68e-08 ***
## factor(Zip)6 98282 14102 6.970 1.01e-11 ***
## factor(Zip)9 31195 13946 2.237 0.0257 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81080 on 498 degrees of freedom
## Multiple R-squared: 0.1412, Adjusted R-squared: 0.1361
## F-statistic: 27.3 on 3 and 498 DF, p-value: 2.319e-16
# Note: it is really important to wrap Zip in factor(); otherwise R treats Zip as a number
# what if you just regress Price on Zip?
Interpretation: The regression sets the first Zip code (Zip==4) as the baseline and compares the average home price in each of the other Zip codes with it. Thus, the average home price in Zip==4 is stored in the intercept estimate, which is 95,877. The regression results indicate that the average home price in Zip==6 is 98,282 higher than in Zip==4. Therefore, we expect the average home price in Zip==6 to be 95877 + 98282 = 194159, which matches the average computed with aggregate() above (194158.74).
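The following sketch illustrates both points on a small made-up dataset (the toy values are assumptions for illustration and do not come from house.csv): with factor(), the intercept plus each offset reproduces the group means, while without factor() Zip is treated as a number and receives a single slope.

```r
# Toy data, made up for illustration only
toy <- data.frame(
  Zip   = c(4, 4, 5, 5, 6, 6, 9, 9),
  Price = c(90, 100, 170, 180, 190, 200, 120, 130)
)

# With factor(): one intercept (baseline Zip 4) plus one offset per other Zip
fit_factor  <- lm(Price ~ factor(Zip), data = toy)
group_means <- tapply(toy$Price, toy$Zip, mean)

# Intercept + offsets rebuild the average price in each Zip code
rebuilt <- unname(coef(fit_factor)[1] + c(0, coef(fit_factor)[-1]))

# Without factor(): Zip is numeric, so the model fits a single straight line
fit_numeric <- lm(Price ~ Zip, data = toy)
n_coef_numeric <- length(coef(fit_numeric))  # intercept and one slope

print(rbind(group_means, rebuilt))
print(n_coef_numeric)
```

The numeric version assumes that price changes by the same amount for every one-unit step in the Zip label, which has no meaning for an arbitrary code.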
There are many issues when interpreting these ballpark estimates. One of them is that we are not able to account for other factors that could potentially affect home prices. A visualization of the data illustrates some of these issues:
library(ggplot2)
ggplot(house, aes(x = SqFt, y = Price, color = factor(Zip))) +
geom_point(alpha=1/3) +
scale_color_brewer(palette = 'Set1') +
theme_bw()
Based on our visual exploration of the price-per-sqft relationship, we note that:
It turns out that an optimal model for predicting house prices from the information available in the dataset is the one below. I would like to call it “the optimal” model, but unfortunately there is no single optimal model; my choice was partially subjective, as you will see:
summary(lm(Price~SqFt+Baths+SqFt*(Zip=="6"),
data = house[house$Price<6e+5&house$SqFt<5000,]))
##
## Call:
## lm(formula = Price ~ SqFt + Baths + SqFt * (Zip == "6"), data = house[house$Price <
## 6e+05 & house$SqFt < 5000, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -161221 -19156 -106 14596 215388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -24375.754 5753.210 -4.237 2.70e-05 ***
## SqFt 67.163 3.900 17.220 < 2e-16 ***
## Baths 25598.878 3461.690 7.395 6.11e-13 ***
## Zip == "6"TRUE -63679.569 11577.065 -5.500 6.09e-08 ***
## SqFt:Zip == "6"TRUE 39.339 5.591 7.036 6.62e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36060 on 494 degrees of freedom
## Multiple R-squared: 0.7901, Adjusted R-squared: 0.7884
## F-statistic: 464.7 on 4 and 494 DF, p-value: < 2.2e-16
As we will see later, this regression is “optimal” in the sense that it balances an accurate estimate of the Price-per-SqFt relationship we care about against simplicity.
A back-of-the-envelope interpretation:
We need a few more conceptual skills to fully interpret this regression result. For now I just want to highlight one important way in which it differs from the simplistic model Price ~ SqFt and from the model Price~factor(Zip).
In this model, the coefficient for SqFt is 67, and the coefficient for SqFt:Zip == "6"TRUE is 39. Combined, these coefficients indicate that:
For houses in all zip codes other than zip code 6, a one-square-foot difference in size predicts a $67 increase in price.
For houses in zip code 6, a one-square-foot difference in size predicts an additional $39 in price. Thus, for zip code 6, a one-square-foot difference in size predicts a total increase of $39+$67=$106 in price.
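The arithmetic above can be checked with a short sketch that rebuilds the fitted equation from the rounded coefficients in the output. The helper fitted_price and the example house (2,000 square feet, 2 bathrooms) are hypothetical choices made for illustration.

```r
# Rounded coefficients copied from the output above
b <- c(intercept = -24375.754, sqft = 67.163, baths = 25598.878,
       zip6 = -63679.569, sqft_zip6 = 39.339)

# Hypothetical helper: fitted price given size, bathrooms, and a zip-6 flag
fitted_price <- function(sqft, baths, in_zip6) {
  z <- as.numeric(in_zip6)
  b[["intercept"]] + b[["sqft"]] * sqft + b[["baths"]] * baths +
    b[["zip6"]] * z + b[["sqft_zip6"]] * sqft * z
}

# Price change for one extra square foot, holding Baths fixed
slope_other <- fitted_price(2001, 2, FALSE) - fitted_price(2000, 2, FALSE)
slope_zip6  <- fitted_price(2001, 2, TRUE)  - fitted_price(2000, 2, TRUE)

print(c(other_zips = slope_other, zip6 = slope_zip6))
```

Note that the zip code 6 line also starts lower (the Zip == "6"TRUE coefficient is about -63,680), so under this model a zip code 6 house is predicted to cost more than an otherwise identical house elsewhere only above roughly 63679.569 / 39.339 ≈ 1,619 square feet.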
The *** next to each of these estimates indicates that, if there were no relationship in the population, we would be very unlikely to obtain estimates this large from random sampling error alone. In effect, our estimates look reliable. The more stars an estimate has, the more statistically significant it is.
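As a rough illustration of where the stars come from, the sketch below recomputes the t value and two-sided p-value for the SqFt row from the rounded numbers in the output, then maps the p-value onto the thresholds in the “Signif. codes” legend. The use of cut() here is only an approximation of how summary() assigns the stars, and the variable names are assumptions.

```r
# Values copied (rounded) from the SqFt row of the output above
estimate  <- 67.163
std_error <- 3.900
df_resid  <- 494   # residual degrees of freedom reported in the output

# t value: how many standard errors the estimate lies from zero
t_value <- estimate / std_error

# Two-sided p-value under the null hypothesis of no relationship
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# Map the p-value onto the thresholds from the "Signif. codes" legend
stars <- cut(p_value, breaks = c(0, 0.001, 0.01, 0.05, 0.1, 1),
             labels = c("***", "**", "*", ".", " "), include.lowest = TRUE)

print(t_value)
print(as.character(stars))
```

The recomputed t value differs slightly from the printed 17.220 only because the inputs here are rounded.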
For the practical concern at hand, namely pricing homes by their square footage, the “optimal” model yields a drastically different result from the simple one-variable regressions. It tells us that we need to account for other amenities such as Baths and that we need to allow the price per square foot to differ between zip code 6 and the other zip codes. Failing to do so could lead to severely mispricing houses on the market.
We can incorporate other factors into the model, and the basic estimates above are not affected much:
Exercise: Rerun the regression above but add Garage and BedRooms as predictor variables. Use the regression result to predict the average value of an additional bathroom and an additional garage, respectively.
summary(lm(Price~SqFt+Baths+Garage+BedRooms+SqFt*(Zip=="6"),
data = house[house$Price<600000&house$SqFt<5000,]))
##
## Call:
## lm(formula = Price ~ SqFt + Baths + Garage + BedRooms + SqFt *
## (Zip == "6"), data = house[house$Price < 6e+05 & house$SqFt <
## 5000, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -158998 -18297 -1398 15383 213432
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11712.973 9256.573 -1.265 0.206338
## SqFt 69.709 4.311 16.171 < 2e-16 ***
## Baths 22482.502 3663.617 6.137 1.73e-09 ***
## Garage 9244.201 2749.936 3.362 0.000835 ***
## BedRooms -8404.993 3448.068 -2.438 0.015138 *
## Zip == "6"TRUE -63935.156 11426.652 -5.595 3.66e-08 ***
## SqFt:Zip == "6"TRUE 40.047 5.508 7.271 1.41e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 35450 on 492 degrees of freedom
## Multiple R-squared: 0.7979, Adjusted R-squared: 0.7954
## F-statistic: 323.7 on 6 and 492 DF, p-value: < 2.2e-16
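One way to read this output against the previous one is sketched below, using the rounded coefficients copied from the two tables. The vector names and the percent-change comparison are illustrative assumptions rather than part of the original exercise.

```r
# Rounded coefficients copied from the two outputs above
base_model     <- c(SqFt = 67.163, Baths = 25598.878, SqFt_Zip6 = 39.339)
extended_model <- c(SqFt = 69.709, Baths = 22482.502, SqFt_Zip6 = 40.047,
                    Garage = 9244.201, BedRooms = -8404.993)

# Predicted average price difference for one more bathroom or one more unit
# of Garage, holding the other predictors fixed
added_value <- extended_model[c("Baths", "Garage")]

# Percent change in the shared coefficients after adding Garage and BedRooms
shared     <- names(base_model)
pct_change <- 100 * (extended_model[shared] - base_model[shared]) / base_model[shared]

print(added_value)
print(round(pct_change, 1))
```

The SqFt coefficients move by only a few percent, while the Baths coefficient shifts somewhat more once Garage and BedRooms are included.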
Getting from the simplistic regression to an optimal model for our needs is a multi-step journey. These steps are: