Housing Prices

Real estate agents know that the three most important factors in determining the price of a house are location, location, and location. But what other factors help determine the price at which a house should be listed? We’ve drawn a random sample of 1057 home sales from the public records of sales in upstate New York, in the region around the city of Saratoga Springs. The variables we will use are the price of the house as sold in 2002 (in dollars), the total living area (in square feet), the number of bathrooms, the number of bedrooms, the number of fireplaces, and the age of the house (in years). (The data frame also contains the lot size and a derived logical fireplace indicator, which we will not use here.)

Getting to know my data
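The analyses below assume the sample has already been read into a data frame named x. A minimal sketch, assuming the data sit in a CSV file (the file name saratoga.csv and the construction of the logical Fireplace indicator are illustrative assumptions):

x <- read.csv('saratoga.csv')    # hypothetical file name
x$Fireplace <- x$Fireplaces > 0  # assumed: TRUE if the house has at least one fireplace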

str(x)
## 'data.frame':    1057 obs. of  8 variables:
##  $ Price      : int  142212 134865 118007 138297 129470 206512 50709 108794 68353 123266 ...
##  $ Living.Area: int  1982 1676 1694 1800 2088 1456 960 1464 1216 1632 ...
##  $ Bathrooms  : num  1 1.5 2 1 1 2 1.5 1 1 1.5 ...
##  $ Bedrooms   : int  3 3 3 2 3 3 2 2 2 3 ...
##  $ Fireplaces : int  0 1 1 2 1 0 0 0 0 0 ...
##  $ Lot.Size   : num  2 0.38 0.96 0.48 1.84 0.98 NA 0.11 0.61 0.23 ...
##  $ Age        : int  133 14 15 49 29 10 12 87 101 14 ...
##  $ Fireplace  : logi  FALSE TRUE TRUE TRUE TRUE FALSE ...

We will check the linearity assumption by plotting Price against each predictor.

The number of bedrooms is a quantitative variable, but it takes only a few values (from 1 to 5). So a scatterplot may not be the best way to examine the relationship between Price and Bedrooms; we will use side-by-side boxplots instead. They show a general increase in price with the number of bedrooms, and the growth is approximately linear.

par(mfrow = c(1, 2))
plot(x$Living.Area, x$Price, main = 'Living Area vs. Price', xlab = 'Living Area', ylab = 'Price')  # scatterplot of Price against living area
plot(Price ~ factor(Bedrooms), data = x, main = 'Bedrooms vs. Price', xlab = 'Bedrooms', ylab = 'Price')  # boxplots of Price by number of bedrooms

par(mfrow = c(1, 2))
plot(Price ~ factor(Bathrooms), data = x, main = 'Bathrooms vs. Price', xlab = 'Bathrooms', ylab = 'Price')  # boxplots of Price by number of bathrooms
plot(Price ~ factor(Fireplaces), data = x, main = 'Fireplaces vs. Price', xlab = 'Fireplaces', ylab = 'Price')  # boxplots of Price by number of fireplaces

The scatterplot of Price against Living.Area shows a strong, positive, linear association. We use the side-by-side boxplots of Price to check linearity for Bedrooms, Bathrooms, and Fireplaces, because each of those variables takes only a few distinct values. The boxplots for Bedrooms show approximately linear growth of price with the number of bedrooms. There seem to be two slopes between Price and Bathrooms: one from 1 to 2 bathrooms and then a steeper one from 2 to 4. For now, we’ll proceed cautiously, realizing that any slope we find will average the two. The plot for Fireplaces shows a positive association, but with an outlier: an expensive house with 4 fireplaces. Let’s keep this outlier in the dataset for now; it may disappear in the residuals of the multiple regression model.
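As a quick numeric check on these impressions, we can compute the correlation between Price and Living.Area and look up the suspected outlier (a small base-R sketch; the exact output depends on the data):

cor(x$Price, x$Living.Area)  # strength of the linear association
x[x$Fireplaces == 4, ]       # inspect the house(s) with 4 fireplaces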

The plot of Price against Age suggests a curvilinear relationship, so we use a log10 transformation to improve the linearity. Note, however, that Age contains some 0s (newly built houses), and the logarithm of 0 is undefined. We therefore add a small positive number, 1, to every age to create a new variable, Age_new, and put it in the dataset.
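Before transforming, we can count how many houses were sold new (a quick check):

sum(x$Age == 0)  # number of houses with age 0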

par(mfrow = c(1,2))
plot(x$Age, x$Price, xlab = 'Age', ylab = 'Price')

x$Age_new <- x$Age + 1  # shift by 1 so the log is defined for age 0
plot(log10(x$Age_new), x$Price, xlab = 'Log10(Age + 1)', ylab = 'Price')

Multiple Linear Regression Model

m <- lm(Price ~ Living.Area + Bedrooms + Bathrooms + Fireplaces + log10(Age_new), data = x)

We then check the residual plot:

plot(m$fitted.values, m$residuals, xlab = 'Fitted', ylab = 'Residuals')  # residuals against fitted values
abline(h = 0)  # horizontal reference line at 0

The plot shows no bends or other nonlinearities, so the Linearity Assumption is met. The residuals also spread about equally across the fitted values, so the Equal Variance Assumption is satisfied. There is no obvious outlier in the plot either.

Finally, we check the Normality Assumption by plotting a histogram and a normal Q-Q plot of the residuals.

par(mfrow = c(1,2))
hist(m$residuals, main = 'Histogram of Residuals', xlab = 'Residuals')
qqnorm(m$residuals)
qqline(m$residuals)

The histogram looks unimodal and symmetric. The Q-Q plot bends at both ends, indicating that the residuals in the tails straggle away from the center more than Normally distributed data would. But there is no skewness and we have more than 1000 cases, so the Normality Assumption is not critical here: the Central Limit Theorem protects the inference on the coefficients.
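For a formal (if strict) supplement to the visual diagnostics, a Shapiro-Wilk test can be run on the residuals; with over 1000 cases it will flag even minor departures from Normality, so the plots remain the primary tool:

shapiro.test(m$residuals)  # H0: the residuals are Normally distributed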

Now let’s take a look at the summary of this model:

summary(m)
## 
## Call:
## lm(formula = Price ~ Living.Area + Bedrooms + Bathrooms + Fireplaces + 
##     log10(Age_new), data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -214161  -24688   -4590   16851  391928 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    22482.830   9149.597   2.457  0.01416 *  
## Living.Area       72.069      4.043  17.824  < 2e-16 ***
## Bedrooms       -6742.076   2746.543  -2.455  0.01426 *  
## Bathrooms      19901.227   3699.526   5.379 9.21e-08 ***
## Fireplaces      9772.241   3194.151   3.059  0.00227 ** 
## log10(Age_new) -7467.264   3181.206  -2.347  0.01909 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 48690 on 1051 degrees of freedom
## Multiple R-squared:  0.6037, Adjusted R-squared:  0.6018 
## F-statistic: 320.2 on 5 and 1051 DF,  p-value: < 2.2e-16

The \(R^{2}\) = 0.6037 and the adjusted \(R^{2}_{adj}\) = 0.6018. The two statistics are very close because there are many more observations than predictors. 60.37% of the variation in Price can be explained by the five predictors through this multiple regression model.
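In fact, the adjusted value can be computed from \(R^{2}\) via \(R^{2}_{adj} = 1 - (1 - R^{2})\frac{n-1}{n-p-1}\); with n = 1057 observations and p = 5 predictors we can verify the reported value:

1 - (1 - 0.6037) * (1057 - 1) / (1057 - 5 - 1)  # = 0.6018, the adjusted R-squared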

The F test statistic is 320.2 and it is highly significant, as indicated by its tiny p-value (< 2.2e-16). The model is therefore statistically useful: at least one predictor is useful for predicting Price given the other predictors already in the model.
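The F statistic itself can be reconstructed from \(R^{2}\) via \(F = \frac{R^{2}/p}{(1 - R^{2})/(n - p - 1)}\):

(0.6037 / 5) / ((1 - 0.6037) / 1051)  # = 320.2, the F statistic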

To identify useful predictors, we look at the t test on the coefficient of each predictor, reported in the summary output. As we can see, all five predictors in the model are useful (i.e., we reject \(H_{0}: \beta = 0\)) at the 5% significance level. At the 1% level, however, Bedrooms and the transformed Age are no longer significant given the other predictors in the model. But that does not mean they are not useful predictors by themselves. In fact, both have highly significant t tests in their own simple regression models:

summary(lm(Price ~ Bedrooms, data = x))
## 
## Call:
## lm(formula = Price ~ Bedrooms, data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -199627  -42434   -9250   27927  419850 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    14350       9298   1.543    0.123    
## Bedrooms       48219       2844  16.955   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 68430 on 1055 degrees of freedom
## Multiple R-squared:  0.2141, Adjusted R-squared:  0.2134 
## F-statistic: 287.5 on 1 and 1055 DF,  p-value: < 2.2e-16
summary(lm(Price ~ log10(Age_new), data = x))
## 
## Call:
## lm(formula = Price ~ log10(Age_new), data = x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -162762  -46204   -9533   27711  486024 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      234409       4906   47.78   <2e-16 ***
## log10(Age_new)   -56933       3773  -15.09   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 70010 on 1055 degrees of freedom
## Multiple R-squared:  0.1775, Adjusted R-squared:  0.1767 
## F-statistic: 227.7 on 1 and 1055 DF,  p-value: < 2.2e-16

However, their contributions to predicting Price are reduced by the other predictors in the model, because of the correlations among the predictors. The coefficient of Bedrooms even flips sign in the multiple regression model: for a given living area, the fewer bedrooms a house has, the larger they are, and the more attractive (and so more expensive) the house tends to be.
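We can inspect these correlations among the predictors directly (a quick base-R look; log10(Age_new) could be added to the list in the same way):

cor(x[, c('Living.Area', 'Bedrooms', 'Bathrooms', 'Fireplaces')], use = 'complete.obs')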

The 95% and 99% confidence intervals for the slopes are given, respectively, by:

confint(m)
##                       2.5 %      97.5 %
## (Intercept)      4529.27393 40436.38677
## Living.Area        64.13504    80.00274
## Bedrooms       -12131.40691 -1352.74470
## Bathrooms       12641.93059 27160.52437
## Fireplaces       3504.60206 16039.88043
## log10(Age_new) -13709.50233 -1225.02571
confint(m, level = 0.99)
##                      0.5 %      99.5 %
## (Intercept)     -1127.8450 46093.50572
## Living.Area        61.6351    82.50267
## Bedrooms       -13829.5711   345.41952
## Bathrooms       10354.5450 29447.90994
## Fireplaces       1529.6852 18014.79724
## log10(Age_new) -15676.4154   741.88735

The 95% confidence interval for Fireplaces is (3504.60, 16039.88): with 95% confidence, the price will on average increase by between 3,504.60 USD and 16,039.88 USD for each additional fireplace, after accounting for the other predictors in the model.
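This interval can be reproduced by hand from the summary output as estimate ± t* × SE, using the t critical value on 1051 degrees of freedom:

9772.241 + c(-1, 1) * qt(0.975, df = 1051) * 3194.151  # (3504.60, 16039.88)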

As we can see, at the 95% level none of the confidence intervals covers 0, which means that all the coefficients are significantly different from 0. At the 99% level, however, the confidence intervals for Bedrooms and log10(Age_new) contain 0, so the coefficients of those two variables are no longer significantly different from 0. These results coincide with those of the t tests.

Finally, we would like to predict the price of a house that has a 3000-square-foot living area, 4 bedrooms, 2.5 bathrooms, and 1 fireplace, and was built in 1998. Note that the data were collected in 2002, so the house is 4 years old. Let’s create a new data frame for prediction:

data_pred <- data.frame(Living.Area = 3000, Bedrooms = 4, Bathrooms = 2.5, Fireplaces = 1, Age_new = 4 + 1)  # remember Age_new = Age + 1

We’ll use our model (m) to predict the price:

predict(m, newdata = data_pred, interval = 'prediction')
##        fit      lwr      upr
## 1 266027.1 170273.7 361780.5

The predicted price is 266,027 USD and its 95% prediction interval is (170,274 USD, 361,781 USD): with 95% confidence, the price of an individual house with these characteristics will fall in this interval. This is not a very precise prediction, though; its margin of error is almost 100,000 USD, because the \(R^{2}\) of the model is only about 60%, a bit too low for precise predictions.
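For comparison, the 95% confidence interval for the mean price of all houses with these characteristics is much narrower; the prediction interval is wide because it must also account for the house-to-house variation around the regression surface:

predict(m, newdata = data_pred, interval = 'confidence')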