Mysteries of Real Estate Appraisal

Intro

This is the first part of Capstone project for Coursera’s Statistics with R Specialization.

The Capstone Project aims to develop a model to predict the selling price of a given house in Ames, Iowa. This information helps assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment.

The data for the project comes from the Ames Assessor’s Office information used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010. More information can be found in the Codebook. The data have been randomly divided into three separate data sets: a training data set, a test data set, and a validation data set.

The goal of this first part of Capstone is to explore the training data set to use it then in analysis of the housing market, viz:

create some exploration graphs
look at whether there are missing values in the variables
select a regression model

Code chunks can be displayed by clicking Code button

Load the data and necessary packages

load("./data/1220_SR-CS_RealEstate/ames_train.Rdata")
library(MASS); library(dplyr); library(ggplot2)
library(knitr); library(plotly); library(BAS)
library(GGally); library(gridExtra)

1. Distribution of the ages of the houses

ames_train <- ames_train %>% mutate(age = 2020-Year.Built)
summ<-summary(ames_train$age)
ggplot(ames_train, aes(x = age, y =..density..)) +
        geom_histogram(bins = 30, fill = "cornflowerblue", colour = "darkgrey")+
        geom_density(size = 1, colour = "brown") +
        geom_vline(aes(xintercept=mean(age,na.rm=TRUE), color="mean"), size=1)+
        geom_vline(aes(
                xintercept=median(age,na.rm=TRUE), color="median"), size=1) +
        scale_color_manual(name=NULL, values=c(mean="purple", median="violet"))+
        theme(legend.position=c(0.9, 0.9))+
        labs(title = "House Age Histogram")

distribution of age is right-skewed with mean = 47.8, median = 45
- showed multimodal behaviour indicating high counts for ages about 10 and 50-60
number of houses decreases as the their age increases

2. Relation between price and neighborhood

Summary statistics:

q2 <- ames_train %>% 
  group_by(Neighborhood) %>% 
  summarise(mean.price = mean(price),
            median.price = median(price),
            min.price = min(price),
            max.price = max(price),
            IQR = IQR(price),
            sd.price = sd(price),
            total = n()) 
most.expensive <- q2[which(q2$median.price == max(q2$median.price)),]
least.expensive <- q2[which(q2$median.price == min(q2$median.price)),]
most.hetero <- q2[which(q2$sd.price == max(q2$sd.price)),]

kable(most.expensive[1:8], caption = 'Most expensive neighborhood')

Most expensive neighborhood
Neighborhood	mean.price	median.price	min.price	max.price	IQR	sd.price	total
StoneBr	339316	340691.5	180000	591587	151358	123459.1	20

kable(least.expensive[1:8], caption = 'Least expensive neighborhood')

Least expensive neighborhood
Neighborhood	mean.price	median.price	min.price	max.price	IQR	sd.price	total
MeadowV	92946.88	85750	73000	129500	20150	18939.78	16

kable(most.hetero[1:8], caption = 'Most heterogeneous neighborhood')

Most heterogeneous neighborhood
Neighborhood	mean.price	median.price	min.price	max.price	IQR	sd.price	total
StoneBr	339316	340691.5	180000	591587	151358	123459.1	20

gg<- ggplot(ames_train, aes(x=reorder(Neighborhood, -price, FUN = median),
                            y = price/1000)) +
  geom_boxplot(colour = 'darkgrey', fill = 'purple') +
  labs(title = "House prices per Neighborhood (interactive)",
       caption = "Ordered by Median",
       x = 'Neighborhood', y = 'Price ($, thsnd)') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(gg)

StoneBr is the most expensive and most heterogeneous neighborhood
- with median = 340691.5, variance = 123459.1
the least expensive neighborhood is MeadowV
- with median = 85750

3. Missing values

Variable with the largest number of missing values:

na <- colSums(is.na(ames_train))
head(sort(na, decreasing = TRUE), n = 1)

Pool.QC 
    997

the highest number of NA’s ($=997$) has Pool Quality (Pool.QC)
- apparently, because the absence of pool is coded as NA

4. Searching for the best model

Strategy:

start with a full model all.model predicting log(price) based on variables:
- lot size in square feet (Lot.Area),
- slope of property (Land.Slope),
- original construction date (Year.Built),
- remodel date (Year.Remod.Add),
- number of bedrooms above grade (Bedroom.AbvGr)
use backwards elimination to pick up significant predictors
- drop one predictor at a time until the parsimonious model is reached
choose the best model by comparison tools
- Adjusted R-squared (variance in log(price) around its mean explained by model)
- AIC (Akaike information criterion: both overfitting/ underfitting risk),
- BIC (Bayesian information criterion: similar to AIC, but with a different penalty for the number of parameters)
verify findings with Bayesian model averaging
model diagnostics: check linear regression conditions

Backwards elimination/ comparison

data <- ames_train %>% transmute(
  Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, log(price))
data <- data[complete.cases(data),]
all.model <- lm(`log(price)` ~ Lot.Area + Land.Slope + Year.Built +
                  Year.Remod.Add + Bedroom.AbvGr, data = data)
`-Lot.Area` <- lm(`log(price)` ~ (.-Lot.Area), data = data)
`-Land.Slope` <- lm(`log(price)` ~ (.-Land.Slope), data = data)
`-Year.Built` <- lm(`log(price)` ~ (.-Year.Built), data = data)
`-Year.Remod.Add` <- lm(`log(price)` ~ (.-Year.Remod.Add), data = data)
`-Bedroom.AbvGr` <- lm(`log(price)` ~ (.-Bedroom.AbvGr), data = data)
rsq.all <- round(summary(all.model)$adj.r.squared,4)
rsq1 <- round(summary(`-Lot.Area`)$adj.r.squared,4)
rsq2 <- round(summary(`-Land.Slope`)$adj.r.squared,4)
rsq3 <- round(summary(`-Year.Built`)$adj.r.squared,4)
rsq4 <- round(summary(`-Year.Remod.Add`)$adj.r.squared,4)
rsq5 <- round(summary(`-Bedroom.AbvGr`)$adj.r.squared,4)
aic<- round(c(AIC(all.model), AIC(`-Lot.Area`), AIC(`-Land.Slope`),
              AIC(`-Year.Built`), AIC(`-Year.Remod.Add`),
              AIC(`-Bedroom.AbvGr`)),4)
bic<- round(c(BIC(all.model), BIC(`-Lot.Area`), BIC(`-Land.Slope`),
              BIC(`-Year.Built`), BIC(`-Year.Remod.Add`),
              BIC(`-Bedroom.AbvGr`)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(
  rsq.all, rsq1, rsq2, rsq3, rsq4, rsq5), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('all.model','-Lot.Area', '-Land.Slope',
                           '-Year.Built', '-Year.Remod.Add', '-Bedroom.AbvGr')
kable(compar.mdls, caption = 'Models Comparison')

Models Comparison
	Adj.R-sq	AIC	BIC
all.model	0.5598	294.1022	333.3642
-Lot.Area	0.5220	375.5497	409.9040
-Land.Slope	0.5526	308.4026	337.8492
-Year.Built	0.4474	520.6532	555.0075
-Year.Remod.Add	0.4922	435.9700	470.3243
-Bedroom.AbvGr	0.5315	355.5240	389.8783

all.model has
- the best (highest) Adjusted R-squared = 0.5598
- the best (lowest) AIC = 294.1022
- the best (lowest) BIC = 333.3642
so, best.model1 is all.model:

best.model1 <- all.model

Model diagnostics

To best.model1 (= all.model) provide valid results, the following conditions should be met:

LINEARITY - linear relationships between explanatory and response variables
NORMALITY - nearly normal residuals with mean$=0$
CONSTANT VARIABILITY of residuals
INDEPENDENCY of residuals

LINEARITY: can be checked with correlogram

ggpairs(data,
        upper = list(continuous = wrap("cor", size = 8)),
        diag = list(continuous = wrap("barDiag", colour = "cornflowerblue")),
        lower = list(continuous = wrap("smooth", colour = 'cornflowerblue')))+
  ggplot2::theme(axis.text.x = element_text(angle = 45, vjust = 1))

there is no too strong relationship between any variables
relationships between price and other variables seems to be somewhat linear

NORMALITY: can be checked with histogram and/ or Q-Q plot of the residuals

hist<-ggplot(data = best.model1, aes(x = best.model1$residuals, y = ..density..))+
        geom_histogram(color= "darkgray",fill = "cornflowerblue")+
        geom_density(size = 1, colour = "brown")+
        xlab("residuals")+
        ggtitle("Distribution of Model Residuals")
qq<- ggplot(data = best.model1, aes(sample = .resid))+
        stat_qq(colour="cornflowerblue") + stat_qq_line(colour="brown", size=1)+
        xlab("theoretical quantiles")+
        ylab("sample quantiles")+
        ggtitle("Normal Q-Q Plot")
grid.arrange(hist, qq, ncol=2)

histogram: shows a nearly normal distribution with slight left-skewness
- QQ-plot: no significant deviation from the mean, most of the points fit the qq-line except for tail areas

$=>$ residuals almost normally distributed

CONSTANT VARIABILITY: can be checked by plotting residuals against predicted values

INDEPENDENCY: residuals can be plotted as they appear in the data set, which would reveal any time series dependence

var<-ggplot(data = best.model1, aes(x = .fitted, y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("prediction") +ylab("residuals")+
        ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model1, aes(x = 1:nrow(data), y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("index") +ylab("residuals")+
        ggtitle("Model Residuals (independence)")
grid.arrange(var, indep, ncol=2)

variability: there is some level of homoscedasticity $=>$ constant variability
independence: residuals are distributed randomly $=>$ independent

the final best.model1 turned out to be the initial full all.model, which predicts prices (log(price)) based on
- lot size in square feet (Lot.Area),
- slope of property (Land.Slope),
- original construction date (Year.Built),
- remodel date (Year.Remod.Add),
- number of bedrooms above grade (Bedroom.AbvGr)

lm(formula = `log(price)` ~ Lot.Area + Land.Slope + Year.Built + 
    Year.Remod.Add + Bedroom.AbvGr, data = data)

all diagnostic plots show some of the same outlier

5. The farthest outlier

Find outlier index and look at the house characteristics:

index<- which.max(abs(resid(best.model1)))
out<-stack(ames_train[index,])
out

      values             ind
1  902207130             PID
2        832            area
3      12789           price
4         30     MS.SubClass
5         68    Lot.Frontage
6       9656        Lot.Area
7          2    Overall.Qual
8          2    Overall.Cond
9       1923      Year.Built
10      1970  Year.Remod.Add
11         0    Mas.Vnr.Area
12         0    BsmtFin.SF.1
13         0    BsmtFin.SF.2
14       678     Bsmt.Unf.SF
15       678   Total.Bsmt.SF
16       832     X1st.Flr.SF
17         0     X2nd.Flr.SF
18         0 Low.Qual.Fin.SF
19         0  Bsmt.Full.Bath
20         0  Bsmt.Half.Bath
21         1       Full.Bath
22         0       Half.Bath
23         2   Bedroom.AbvGr
24         1   Kitchen.AbvGr
25         5   TotRms.AbvGrd
26         1      Fireplaces
27      1928   Garage.Yr.Blt
28         2     Garage.Cars
29       780     Garage.Area
30         0    Wood.Deck.SF
31         0   Open.Porch.SF
32         0  Enclosed.Porch
33         0     X3Ssn.Porch
34         0    Screen.Porch
35         0       Pool.Area
36         0        Misc.Val
37         6         Mo.Sold
38      2010         Yr.Sold
39        97             age

the largest squared residual of -2.0879 has house #428 (PID = 902207130)
- this residual is over 3 times the standard deviation $=>$ outlier
actual price of this house $=\$$ 12789 is the lowest in the entire ames_train
- predicted price $=\$$ 103176.21, i.e. almost 10 times larger
factors contributing to high squared residual of the house could be
- age of the house (Year built: 1923),
- year when it was remodelled (Remodel date: 1970),
- overall quality (Quality: 2)
- overall condition (Condition: 2)
the last two factors mentioned (Overall.Qual and Overall.Cond) are not present in the final best.model1
- these factors could contribute to the price of the house is very far from the mean and median, what leads to the largest squared residual

6. Replacing `Lot.Area` with `log(Lot.Area)`

Take the same steps as in Question 4

data1 <- data %>% transmute(log(Lot.Area), Land.Slope,
                           Year.Built, Year.Remod.Add, Bedroom.AbvGr, `log(price)`)
all.model <- lm(`log(price)` ~ ., data = data1)
`-log(Lot.Area)` <- lm(`log(price)` ~ (.-`log(Lot.Area)`), data = data1)
`-Land.Slope` <- lm(`log(price)` ~ `log(Lot.Area)` + Year.Built + Year.Remod.Add +
           Bedroom.AbvGr, data = data1)
`-Year.Built` <- lm(`log(price)` ~ (.-Year.Built), data = data1)
`-Year.Remod.Add` <- lm(`log(price)` ~ (.-Year.Remod.Add), data = data1)
`-Bedroom.AbvGr` <- lm(`log(price)` ~ (.-Bedroom.AbvGr), data = data1)
rsq.all <- round(summary(all.model)$adj.r.squared,4)
rsq1 <- round(summary(`-log(Lot.Area)`)$adj.r.squared,4)
rsq2 <- round(summary(`-Land.Slope`)$adj.r.squared,4)
rsq3 <- round(summary(`-Year.Built`)$adj.r.squared,4)
rsq4 <- round(summary(`-Year.Remod.Add`)$adj.r.squared,4)
rsq5 <- round(summary(`-Bedroom.AbvGr`)$adj.r.squared,4)
aic<- round(c(AIC(all.model), AIC(`-log(Lot.Area)`), AIC(`-Land.Slope`),
              AIC(`-Year.Built`), AIC(`-Year.Remod.Add`),
              AIC(`-Bedroom.AbvGr`)),4)
bic<- round(c(BIC(all.model), BIC(`-log(Lot.Area)`), BIC(`-Land.Slope`),
              BIC(`-Year.Built`), BIC(`-Year.Remod.Add`),
              BIC(`-Bedroom.AbvGr`)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(
  rsq.all, rsq1, rsq2, rsq3, rsq4, rsq5), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('all.model','-Lot.Area', '-Land.Slope',
                           '-Year.Built', '-Year.Remod.Add', '-Bedroom.AbvGr')
kable(compar.mdls, caption = 'Models Comparison')

Models Comparison
	Adj.R-sq	AIC	BIC
all.model	0.6032	190.3791	229.6411
-Lot.Area	0.5220	375.5497	409.9040
-Land.Slope	0.6015	192.7134	222.1599
-Year.Built	0.4932	434.0186	468.3729
-Year.Remod.Add	0.5364	344.9971	379.3513
-Bedroom.AbvGr	0.5911	219.4560	253.8102

all.model has
- the best (highest) Adjusted R-squared = 0.6032
- the best (lowest) AIC = 190.3791
-Land.Slope model is
- slightly inferior to all.model in Adjusted R-squared (-0.0017)
- slightly inferior to all.model in AIC (2.3343)
- noticeably superior to all.model in BIC (-7.4812)
so, choosing Bayesian approach, best.model2 is -Land.Slope model:

best.model2 <- `-Land.Slope`

the final best.model2 turned out to be -Land.Slope model, which predicts prices (log(price)) based on
- lot size in square feet (log(Lot.Area)),
- original construction date (Year.Built),
- remodel date (Year.Remod.Add),
- number of bedrooms above grade (Bedroom.AbvGr)

lm(formula = `log(price)` ~ `log(Lot.Area)` + Year.Built + Year.Remod.Add + 
    Bedroom.AbvGr, data = data1)

7. Log-transformed model diagnostics

LINEARITY

best1 <- tibble(Observed = data$`log(price)`, Predicted = fitted(best.model1))
best2 <- tibble(Observed = data1$`log(price)`, Predicted = fitted(best.model2))
gg1 <- ggplot(best1, aes(x=Observed, y=Predicted)) +
  geom_point(col="cornflowerblue") +
  stat_smooth(method = "lm", col = "violet") +
  ggtitle("Lot.Area")
gg2 <- ggplot(best2, aes(x=Observed, y=Predicted)) +
  geom_point(col="cornflowerblue") +
  stat_smooth(method = "lm", col = "purple") +
  ggtitle("log(Lot.Area)")
grid.arrange(gg1, gg2, nrow = 1,
             top = "Observed log(price) vs. Predicted log(price)")

Comparing to plot for Lot.Area, plot for log-transformed log(Lot.Area) gives a better model fit as it:

looks more linear
- dots are closer to the fitted line, and better distributed along it
has less outliers $=>$ lower prediction error

Model diagnostics, other conditions:

NORMALITY/ CONSTANT VARIABILITY/ INDEPENDENCY (the same procedure as in Question 4)

hist<-ggplot(data = best.model2, aes(x = best.model2$residuals, y = ..density..))+
        geom_histogram(color= "darkgray",fill = "cornflowerblue")+
        geom_density(size = 1, colour = "brown")+
        xlab("residuals")+
        ggtitle("Distribution of Model Residuals")
qq<- ggplot(data = best.model2, aes(sample = .resid))+
        stat_qq(colour="cornflowerblue") + stat_qq_line(colour="brown", size=1)+
        xlab("theoretical quantiles")+
        ylab("sample quantiles")+
        ggtitle("Normal Q-Q Plot")
grid.arrange(hist, qq, ncol=2)

var<-ggplot(data = best.model2, aes(x = .fitted, y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("prediction") +ylab("residuals")+
        ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model2, aes(x = 1:nrow(data), y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("index") +ylab("residuals")+
        ggtitle("Model Residuals (independence)")
grid.arrange(var, indep, ncol=2)

Overall, other log-transformed log(Lot.Area) diagnostic plots are almost similar to Lot.Area ones and show the same outlier (# $428$); meanwhile:

histograms - distribution for log(Lot.Area) looks slightly less skewed and more centered
variability: residuals for log(Lot.Area) have less variance

rsq.best1 <- round(summary(best.model1)$adj.r.squared,4)
rsq.best2 <- round(summary(best.model2)$adj.r.squared,4)
aic<- round(c(AIC(best.model1), AIC(best.model2)),4)
bic<- round(c(BIC(best.model1), BIC(best.model2)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(rsq.best1, rsq.best2), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('best.model1','best.model2')
kable(compar.mdls, caption = 'Best Models Comparison')

Best Models Comparison
	Adj.R-sq	AIC	BIC
best.model1	0.5598	294.1022	333.3642
best.model2	0.6015	192.7134	222.1599

log-transformation to log(Lot.Area) has led to advances in model accuracy:
- improvement in Adjusted R-squared by increasing it from 0.56 to 0.6,
- improvement in AIC by decreasing it from 294.1 to 192.71,
- improvement in BIC by decreasing it from 333.36 to 222.16

Mysteries of Real Estate Appraisal

Statistics with R, Milestone: Basic Model Selection

Kuba

December 13, 2020

Intro

1. Distribution of the ages of the houses

2. Relation between price and neighborhood

3. Missing values

4. Searching for the best model

5. The farthest outlier

6. Replacing `Lot.Area` with `log(Lot.Area)`

7. Log-transformed model diagnostics

Mysteries of Real Estate Appraisal

Statistics with R, Milestone: Basic Model Selection

Kuba

December 13, 2020

Intro

1. Distribution of the ages of the houses

2. Relation between price and neighborhood

3. Missing values

4. Searching for the best model

5. The farthest outlier

6. Replacing Lot.Area with log(Lot.Area)

7. Log-transformed model diagnostics

6. Replacing `Lot.Area` with `log(Lot.Area)`