Intro

This is the first part of Capstone project for Coursera’s Statistics with R Specialization.

The Capstone Project aims to develop a model to predict the selling price of a given house in Ames, Iowa. This information helps assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment.

The data for the project comes from the Ames Assessor’s Office information used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010. More information can be found in the Codebook. The data have been randomly divided into three separate data sets: a training data set, a test data set, and a validation data set.

The goal of this first part of Capstone is to explore the training data set to use it then in analysis of the housing market, viz:

  • create some exploration graphs
  • look at whether there are missing values in the variables
  • select a regression model

  • Code chunks can be displayed by clicking Code button

Load the data and necessary packages

load("./data/1220_SR-CS_RealEstate/ames_train.Rdata")
library(MASS); library(dplyr); library(ggplot2)
library(knitr); library(plotly); library(BAS)
library(GGally); library(gridExtra)

1. Distribution of the ages of the houses

ames_train <- ames_train %>% mutate(age = 2020-Year.Built)
summ<-summary(ames_train$age)
ggplot(ames_train, aes(x = age, y =..density..)) +
        geom_histogram(bins = 30, fill = "cornflowerblue", colour = "darkgrey")+
        geom_density(size = 1, colour = "brown") +
        geom_vline(aes(xintercept=mean(age,na.rm=TRUE), color="mean"), size=1)+
        geom_vline(aes(
                xintercept=median(age,na.rm=TRUE), color="median"), size=1) +
        scale_color_manual(name=NULL, values=c(mean="purple", median="violet"))+
        theme(legend.position=c(0.9, 0.9))+
        labs(title = "House Age Histogram")


  • distribution of age is right-skewed with mean = 47.8, median = 45
    • showed multimodal behaviour indicating high counts for ages about 10 and 50-60
  • number of houses decreases as the their age increases

2. Relation between price and neighborhood

Summary statistics:

q2 <- ames_train %>% 
  group_by(Neighborhood) %>% 
  summarise(mean.price = mean(price),
            median.price = median(price),
            min.price = min(price),
            max.price = max(price),
            IQR = IQR(price),
            sd.price = sd(price),
            total = n()) 
most.expensive <- q2[which(q2$median.price == max(q2$median.price)),]
least.expensive <- q2[which(q2$median.price == min(q2$median.price)),]
most.hetero <- q2[which(q2$sd.price == max(q2$sd.price)),]

kable(most.expensive[1:8], caption = 'Most expensive neighborhood')
Most expensive neighborhood
Neighborhood mean.price median.price min.price max.price IQR sd.price total
StoneBr 339316 340691.5 180000 591587 151358 123459.1 20
kable(least.expensive[1:8], caption = 'Least expensive neighborhood')
Least expensive neighborhood
Neighborhood mean.price median.price min.price max.price IQR sd.price total
MeadowV 92946.88 85750 73000 129500 20150 18939.78 16
kable(most.hetero[1:8], caption = 'Most heterogeneous neighborhood')
Most heterogeneous neighborhood
Neighborhood mean.price median.price min.price max.price IQR sd.price total
StoneBr 339316 340691.5 180000 591587 151358 123459.1 20
gg<- ggplot(ames_train, aes(x=reorder(Neighborhood, -price, FUN = median),
                            y = price/1000)) +
  geom_boxplot(colour = 'darkgrey', fill = 'purple') +
  labs(title = "House prices per Neighborhood (interactive)",
       caption = "Ordered by Median",
       x = 'Neighborhood', y = 'Price ($, thsnd)') +
  theme(axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(gg)

  • StoneBr is the most expensive and most heterogeneous neighborhood
    • with median = 340691.5, variance = 123459.1
  • the least expensive neighborhood is MeadowV
    • with median = 85750

3. Missing values

Variable with the largest number of missing values:

na <- colSums(is.na(ames_train))
head(sort(na, decreasing = TRUE), n = 1)
Pool.QC 
    997 

  • the highest number of NA’s (\(=997\)) has Pool Quality (Pool.QC)
    • apparently, because the absence of pool is coded as NA

4. Searching for the best model

Strategy:

  • start with a full model all.model predicting log(price) based on variables:
    • lot size in square feet (Lot.Area),
    • slope of property (Land.Slope),
    • original construction date (Year.Built),
    • remodel date (Year.Remod.Add),
    • number of bedrooms above grade (Bedroom.AbvGr)
  • use backwards elimination to pick up significant predictors
    • drop one predictor at a time until the parsimonious model is reached
  • choose the best model by comparison tools
    • Adjusted R-squared (variance in log(price) around its mean explained by model)
    • AIC (Akaike information criterion: both overfitting/ underfitting risk),
    • BIC (Bayesian information criterion: similar to AIC, but with a different penalty for the number of parameters)
  • verify findings with Bayesian model averaging
  • model diagnostics: check linear regression conditions

Backwards elimination/ comparison

data <- ames_train %>% transmute(
  Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, log(price))
data <- data[complete.cases(data),]
all.model <- lm(`log(price)` ~ Lot.Area + Land.Slope + Year.Built +
                  Year.Remod.Add + Bedroom.AbvGr, data = data)
`-Lot.Area` <- lm(`log(price)` ~ (.-Lot.Area), data = data)
`-Land.Slope` <- lm(`log(price)` ~ (.-Land.Slope), data = data)
`-Year.Built` <- lm(`log(price)` ~ (.-Year.Built), data = data)
`-Year.Remod.Add` <- lm(`log(price)` ~ (.-Year.Remod.Add), data = data)
`-Bedroom.AbvGr` <- lm(`log(price)` ~ (.-Bedroom.AbvGr), data = data)
rsq.all <- round(summary(all.model)$adj.r.squared,4)
rsq1 <- round(summary(`-Lot.Area`)$adj.r.squared,4)
rsq2 <- round(summary(`-Land.Slope`)$adj.r.squared,4)
rsq3 <- round(summary(`-Year.Built`)$adj.r.squared,4)
rsq4 <- round(summary(`-Year.Remod.Add`)$adj.r.squared,4)
rsq5 <- round(summary(`-Bedroom.AbvGr`)$adj.r.squared,4)
aic<- round(c(AIC(all.model), AIC(`-Lot.Area`), AIC(`-Land.Slope`),
              AIC(`-Year.Built`), AIC(`-Year.Remod.Add`),
              AIC(`-Bedroom.AbvGr`)),4)
bic<- round(c(BIC(all.model), BIC(`-Lot.Area`), BIC(`-Land.Slope`),
              BIC(`-Year.Built`), BIC(`-Year.Remod.Add`),
              BIC(`-Bedroom.AbvGr`)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(
  rsq.all, rsq1, rsq2, rsq3, rsq4, rsq5), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('all.model','-Lot.Area', '-Land.Slope',
                           '-Year.Built', '-Year.Remod.Add', '-Bedroom.AbvGr')
kable(compar.mdls, caption = 'Models Comparison')
Models Comparison
Adj.R-sq AIC BIC
all.model 0.5598 294.1022 333.3642
-Lot.Area 0.5220 375.5497 409.9040
-Land.Slope 0.5526 308.4026 337.8492
-Year.Built 0.4474 520.6532 555.0075
-Year.Remod.Add 0.4922 435.9700 470.3243
-Bedroom.AbvGr 0.5315 355.5240 389.8783
  • all.model has
    • the best (highest) Adjusted R-squared = 0.5598
    • the best (lowest) AIC = 294.1022
    • the best (lowest) BIC = 333.3642
  • so, best.model1 is all.model:
best.model1 <- all.model

Model diagnostics

To best.model1 (= all.model) provide valid results, the following conditions should be met:

  • LINEARITY - linear relationships between explanatory and response variables
  • NORMALITY - nearly normal residuals with mean\(=0\)
  • CONSTANT VARIABILITY of residuals
  • INDEPENDENCY of residuals

LINEARITY: can be checked with correlogram

ggpairs(data,
        upper = list(continuous = wrap("cor", size = 8)),
        diag = list(continuous = wrap("barDiag", colour = "cornflowerblue")),
        lower = list(continuous = wrap("smooth", colour = 'cornflowerblue')))+
  ggplot2::theme(axis.text.x = element_text(angle = 45, vjust = 1))

  • there is no too strong relationship between any variables
  • relationships between price and other variables seems to be somewhat linear

NORMALITY: can be checked with histogram and/ or Q-Q plot of the residuals

hist<-ggplot(data = best.model1, aes(x = best.model1$residuals, y = ..density..))+
        geom_histogram(color= "darkgray",fill = "cornflowerblue")+
        geom_density(size = 1, colour = "brown")+
        xlab("residuals")+
        ggtitle("Distribution of Model Residuals")
qq<- ggplot(data = best.model1, aes(sample = .resid))+
        stat_qq(colour="cornflowerblue") + stat_qq_line(colour="brown", size=1)+
        xlab("theoretical quantiles")+
        ylab("sample quantiles")+
        ggtitle("Normal Q-Q Plot")
grid.arrange(hist, qq, ncol=2)

  • histogram: shows a nearly normal distribution with slight left-skewness
    • QQ-plot: no significant deviation from the mean, most of the points fit the qq-line except for tail areas

\(=>\) residuals almost normally distributed

CONSTANT VARIABILITY: can be checked by plotting residuals against predicted values

INDEPENDENCY: residuals can be plotted as they appear in the data set, which would reveal any time series dependence

var<-ggplot(data = best.model1, aes(x = .fitted, y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("prediction") +ylab("residuals")+
        ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model1, aes(x = 1:nrow(data), y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("index") +ylab("residuals")+
        ggtitle("Model Residuals (independence)")
grid.arrange(var, indep, ncol=2)

  • variability: there is some level of homoscedasticity \(=>\) constant variability
  • independence: residuals are distributed randomly \(=>\) independent

  • the final best.model1 turned out to be the initial full all.model, which predicts prices (log(price)) based on
    • lot size in square feet (Lot.Area),
    • slope of property (Land.Slope),
    • original construction date (Year.Built),
    • remodel date (Year.Remod.Add),
    • number of bedrooms above grade (Bedroom.AbvGr)
lm(formula = `log(price)` ~ Lot.Area + Land.Slope + Year.Built + 
    Year.Remod.Add + Bedroom.AbvGr, data = data)
  • all diagnostic plots show some of the same outlier

5. The farthest outlier

Find outlier index and look at the house characteristics:

index<- which.max(abs(resid(best.model1)))
out<-stack(ames_train[index,])
out
      values             ind
1  902207130             PID
2        832            area
3      12789           price
4         30     MS.SubClass
5         68    Lot.Frontage
6       9656        Lot.Area
7          2    Overall.Qual
8          2    Overall.Cond
9       1923      Year.Built
10      1970  Year.Remod.Add
11         0    Mas.Vnr.Area
12         0    BsmtFin.SF.1
13         0    BsmtFin.SF.2
14       678     Bsmt.Unf.SF
15       678   Total.Bsmt.SF
16       832     X1st.Flr.SF
17         0     X2nd.Flr.SF
18         0 Low.Qual.Fin.SF
19         0  Bsmt.Full.Bath
20         0  Bsmt.Half.Bath
21         1       Full.Bath
22         0       Half.Bath
23         2   Bedroom.AbvGr
24         1   Kitchen.AbvGr
25         5   TotRms.AbvGrd
26         1      Fireplaces
27      1928   Garage.Yr.Blt
28         2     Garage.Cars
29       780     Garage.Area
30         0    Wood.Deck.SF
31         0   Open.Porch.SF
32         0  Enclosed.Porch
33         0     X3Ssn.Porch
34         0    Screen.Porch
35         0       Pool.Area
36         0        Misc.Val
37         6         Mo.Sold
38      2010         Yr.Sold
39        97             age

  • the largest squared residual of -2.0879 has house #428 (PID = 902207130)
    • this residual is over 3 times the standard deviation \(=>\) outlier
  • actual price of this house \(=\$\) 12789 is the lowest in the entire ames_train
    • predicted price \(=\$\) 103176.21, i.e. almost 10 times larger
  • factors contributing to high squared residual of the house could be
    • age of the house (Year built: 1923),
    • year when it was remodelled (Remodel date: 1970),
    • overall quality (Quality: 2)
    • overall condition (Condition: 2)
  • the last two factors mentioned (Overall.Qual and Overall.Cond) are not present in the final best.model1
    • these factors could contribute to the price of the house is very far from the mean and median, what leads to the largest squared residual

6. Replacing Lot.Area with log(Lot.Area)

Take the same steps as in Question 4

data1 <- data %>% transmute(log(Lot.Area), Land.Slope,
                           Year.Built, Year.Remod.Add, Bedroom.AbvGr, `log(price)`)
all.model <- lm(`log(price)` ~ ., data = data1)
`-log(Lot.Area)` <- lm(`log(price)` ~ (.-`log(Lot.Area)`), data = data1)
`-Land.Slope` <- lm(`log(price)` ~ `log(Lot.Area)` + Year.Built + Year.Remod.Add +
           Bedroom.AbvGr, data = data1)
`-Year.Built` <- lm(`log(price)` ~ (.-Year.Built), data = data1)
`-Year.Remod.Add` <- lm(`log(price)` ~ (.-Year.Remod.Add), data = data1)
`-Bedroom.AbvGr` <- lm(`log(price)` ~ (.-Bedroom.AbvGr), data = data1)
rsq.all <- round(summary(all.model)$adj.r.squared,4)
rsq1 <- round(summary(`-log(Lot.Area)`)$adj.r.squared,4)
rsq2 <- round(summary(`-Land.Slope`)$adj.r.squared,4)
rsq3 <- round(summary(`-Year.Built`)$adj.r.squared,4)
rsq4 <- round(summary(`-Year.Remod.Add`)$adj.r.squared,4)
rsq5 <- round(summary(`-Bedroom.AbvGr`)$adj.r.squared,4)
aic<- round(c(AIC(all.model), AIC(`-log(Lot.Area)`), AIC(`-Land.Slope`),
              AIC(`-Year.Built`), AIC(`-Year.Remod.Add`),
              AIC(`-Bedroom.AbvGr`)),4)
bic<- round(c(BIC(all.model), BIC(`-log(Lot.Area)`), BIC(`-Land.Slope`),
              BIC(`-Year.Built`), BIC(`-Year.Remod.Add`),
              BIC(`-Bedroom.AbvGr`)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(
  rsq.all, rsq1, rsq2, rsq3, rsq4, rsq5), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('all.model','-Lot.Area', '-Land.Slope',
                           '-Year.Built', '-Year.Remod.Add', '-Bedroom.AbvGr')
kable(compar.mdls, caption = 'Models Comparison')
Models Comparison
Adj.R-sq AIC BIC
all.model 0.6032 190.3791 229.6411
-Lot.Area 0.5220 375.5497 409.9040
-Land.Slope 0.6015 192.7134 222.1599
-Year.Built 0.4932 434.0186 468.3729
-Year.Remod.Add 0.5364 344.9971 379.3513
-Bedroom.AbvGr 0.5911 219.4560 253.8102
  • all.model has
    • the best (highest) Adjusted R-squared = 0.6032
    • the best (lowest) AIC = 190.3791
  • -Land.Slope model is
    • slightly inferior to all.model in Adjusted R-squared (-0.0017)
    • slightly inferior to all.model in AIC (2.3343)
    • noticeably superior to all.model in BIC (-7.4812)
  • so, choosing Bayesian approach, best.model2 is -Land.Slope model:
best.model2 <- `-Land.Slope`

  • the final best.model2 turned out to be -Land.Slope model, which predicts prices (log(price)) based on
    • lot size in square feet (log(Lot.Area)),
    • original construction date (Year.Built),
    • remodel date (Year.Remod.Add),
    • number of bedrooms above grade (Bedroom.AbvGr)
lm(formula = `log(price)` ~ `log(Lot.Area)` + Year.Built + Year.Remod.Add + 
    Bedroom.AbvGr, data = data1)

7. Log-transformed model diagnostics

LINEARITY

best1 <- tibble(Observed = data$`log(price)`, Predicted = fitted(best.model1))
best2 <- tibble(Observed = data1$`log(price)`, Predicted = fitted(best.model2))
gg1 <- ggplot(best1, aes(x=Observed, y=Predicted)) +
  geom_point(col="cornflowerblue") +
  stat_smooth(method = "lm", col = "violet") +
  ggtitle("Lot.Area")
gg2 <- ggplot(best2, aes(x=Observed, y=Predicted)) +
  geom_point(col="cornflowerblue") +
  stat_smooth(method = "lm", col = "purple") +
  ggtitle("log(Lot.Area)")
grid.arrange(gg1, gg2, nrow = 1,
             top = "Observed log(price) vs. Predicted log(price)")

Comparing to plot for Lot.Area, plot for log-transformed log(Lot.Area) gives a better model fit as it:

  • looks more linear
    • dots are closer to the fitted line, and better distributed along it
  • has less outliers \(=>\) lower prediction error

Model diagnostics, other conditions:

NORMALITY/ CONSTANT VARIABILITY/ INDEPENDENCY (the same procedure as in Question 4)

hist<-ggplot(data = best.model2, aes(x = best.model2$residuals, y = ..density..))+
        geom_histogram(color= "darkgray",fill = "cornflowerblue")+
        geom_density(size = 1, colour = "brown")+
        xlab("residuals")+
        ggtitle("Distribution of Model Residuals")
qq<- ggplot(data = best.model2, aes(sample = .resid))+
        stat_qq(colour="cornflowerblue") + stat_qq_line(colour="brown", size=1)+
        xlab("theoretical quantiles")+
        ylab("sample quantiles")+
        ggtitle("Normal Q-Q Plot")
grid.arrange(hist, qq, ncol=2)

var<-ggplot(data = best.model2, aes(x = .fitted, y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("prediction") +ylab("residuals")+
        ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model2, aes(x = 1:nrow(data), y = .resid))+
        geom_point(colour= "cornflowerblue") +
        geom_smooth(method=lm, colour="brown") +
        xlab("index") +ylab("residuals")+
        ggtitle("Model Residuals (independence)")
grid.arrange(var, indep, ncol=2)

Overall, other log-transformed log(Lot.Area) diagnostic plots are almost similar to Lot.Area ones and show the same outlier (# \(428\)); meanwhile:

  • histograms - distribution for log(Lot.Area) looks slightly less skewed and more centered
  • variability: residuals for log(Lot.Area) have less variance
rsq.best1 <- round(summary(best.model1)$adj.r.squared,4)
rsq.best2 <- round(summary(best.model2)$adj.r.squared,4)
aic<- round(c(AIC(best.model1), AIC(best.model2)),4)
bic<- round(c(BIC(best.model1), BIC(best.model2)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(rsq.best1, rsq.best2), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('best.model1','best.model2')
kable(compar.mdls, caption = 'Best Models Comparison')
Best Models Comparison
Adj.R-sq AIC BIC
best.model1 0.5598 294.1022 333.3642
best.model2 0.6015 192.7134 222.1599

  • log-transformation to log(Lot.Area) has led to advances in model accuracy:
    • improvement in Adjusted R-squared by increasing it from 0.56 to 0.6,
    • improvement in AIC by decreasing it from 294.1 to 192.71,
    • improvement in BIC by decreasing it from 333.36 to 222.16