This is the first part of Capstone project for Coursera’s Statistics with R Specialization.
The Capstone Project aims to develop a model to predict the selling price of a given house in Ames, Iowa. This information helps assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment.
The data for the project comes from the Ames Assessor’s Office information used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010. More information can be found in the Codebook. The data have been randomly divided into three separate data sets: a training data set, a test data set, and a validation data set.
The goal of this first part of Capstone is to explore the training data set to use it then in analysis of the housing market, viz:
Code buttonLoad the data and necessary packages
load("./data/1220_SR-CS_RealEstate/ames_train.Rdata")
library(MASS); library(dplyr); library(ggplot2)
library(knitr); library(plotly); library(BAS)
library(GGally); library(gridExtra)ames_train <- ames_train %>% mutate(age = 2020-Year.Built)
summ<-summary(ames_train$age)
ggplot(ames_train, aes(x = age, y =..density..)) +
geom_histogram(bins = 30, fill = "cornflowerblue", colour = "darkgrey")+
geom_density(size = 1, colour = "brown") +
geom_vline(aes(xintercept=mean(age,na.rm=TRUE), color="mean"), size=1)+
geom_vline(aes(
xintercept=median(age,na.rm=TRUE), color="median"), size=1) +
scale_color_manual(name=NULL, values=c(mean="purple", median="violet"))+
theme(legend.position=c(0.9, 0.9))+
labs(title = "House Age Histogram")age is right-skewed with mean = 47.8, median = 45
Summary statistics:
q2 <- ames_train %>%
group_by(Neighborhood) %>%
summarise(mean.price = mean(price),
median.price = median(price),
min.price = min(price),
max.price = max(price),
IQR = IQR(price),
sd.price = sd(price),
total = n())
most.expensive <- q2[which(q2$median.price == max(q2$median.price)),]
least.expensive <- q2[which(q2$median.price == min(q2$median.price)),]
most.hetero <- q2[which(q2$sd.price == max(q2$sd.price)),]
kable(most.expensive[1:8], caption = 'Most expensive neighborhood')| Neighborhood | mean.price | median.price | min.price | max.price | IQR | sd.price | total |
|---|---|---|---|---|---|---|---|
| StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 20 |
kable(least.expensive[1:8], caption = 'Least expensive neighborhood')| Neighborhood | mean.price | median.price | min.price | max.price | IQR | sd.price | total |
|---|---|---|---|---|---|---|---|
| MeadowV | 92946.88 | 85750 | 73000 | 129500 | 20150 | 18939.78 | 16 |
kable(most.hetero[1:8], caption = 'Most heterogeneous neighborhood')| Neighborhood | mean.price | median.price | min.price | max.price | IQR | sd.price | total |
|---|---|---|---|---|---|---|---|
| StoneBr | 339316 | 340691.5 | 180000 | 591587 | 151358 | 123459.1 | 20 |
gg<- ggplot(ames_train, aes(x=reorder(Neighborhood, -price, FUN = median),
y = price/1000)) +
geom_boxplot(colour = 'darkgrey', fill = 'purple') +
labs(title = "House prices per Neighborhood (interactive)",
caption = "Ordered by Median",
x = 'Neighborhood', y = 'Price ($, thsnd)') +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
ggplotly(gg)StoneBr is the most expensive and most heterogeneous neighborhood
MeadowV
Variable with the largest number of missing values:
na <- colSums(is.na(ames_train))
head(sort(na, decreasing = TRUE), n = 1)Pool.QC
997
Pool.QC)
NAStrategy:
all.model predicting log(price) based on variables:
Lot.Area),Land.Slope),Year.Built),Year.Remod.Add),Bedroom.AbvGr)Backwards elimination/ comparison
data <- ames_train %>% transmute(
Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, log(price))
data <- data[complete.cases(data),]
all.model <- lm(`log(price)` ~ Lot.Area + Land.Slope + Year.Built +
Year.Remod.Add + Bedroom.AbvGr, data = data)
`-Lot.Area` <- lm(`log(price)` ~ (.-Lot.Area), data = data)
`-Land.Slope` <- lm(`log(price)` ~ (.-Land.Slope), data = data)
`-Year.Built` <- lm(`log(price)` ~ (.-Year.Built), data = data)
`-Year.Remod.Add` <- lm(`log(price)` ~ (.-Year.Remod.Add), data = data)
`-Bedroom.AbvGr` <- lm(`log(price)` ~ (.-Bedroom.AbvGr), data = data)
rsq.all <- round(summary(all.model)$adj.r.squared,4)
rsq1 <- round(summary(`-Lot.Area`)$adj.r.squared,4)
rsq2 <- round(summary(`-Land.Slope`)$adj.r.squared,4)
rsq3 <- round(summary(`-Year.Built`)$adj.r.squared,4)
rsq4 <- round(summary(`-Year.Remod.Add`)$adj.r.squared,4)
rsq5 <- round(summary(`-Bedroom.AbvGr`)$adj.r.squared,4)
aic<- round(c(AIC(all.model), AIC(`-Lot.Area`), AIC(`-Land.Slope`),
AIC(`-Year.Built`), AIC(`-Year.Remod.Add`),
AIC(`-Bedroom.AbvGr`)),4)
bic<- round(c(BIC(all.model), BIC(`-Lot.Area`), BIC(`-Land.Slope`),
BIC(`-Year.Built`), BIC(`-Year.Remod.Add`),
BIC(`-Bedroom.AbvGr`)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(
rsq.all, rsq1, rsq2, rsq3, rsq4, rsq5), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('all.model','-Lot.Area', '-Land.Slope',
'-Year.Built', '-Year.Remod.Add', '-Bedroom.AbvGr')
kable(compar.mdls, caption = 'Models Comparison')| Adj.R-sq | AIC | BIC | |
|---|---|---|---|
| all.model | 0.5598 | 294.1022 | 333.3642 |
| -Lot.Area | 0.5220 | 375.5497 | 409.9040 |
| -Land.Slope | 0.5526 | 308.4026 | 337.8492 |
| -Year.Built | 0.4474 | 520.6532 | 555.0075 |
| -Year.Remod.Add | 0.4922 | 435.9700 | 470.3243 |
| -Bedroom.AbvGr | 0.5315 | 355.5240 | 389.8783 |
all.model has
best.model1 is all.model:best.model1 <- all.modelModel diagnostics
To best.model1 (= all.model) provide valid results, the following conditions should be met:
LINEARITY: can be checked with correlogram
ggpairs(data,
upper = list(continuous = wrap("cor", size = 8)),
diag = list(continuous = wrap("barDiag", colour = "cornflowerblue")),
lower = list(continuous = wrap("smooth", colour = 'cornflowerblue')))+
ggplot2::theme(axis.text.x = element_text(angle = 45, vjust = 1))price and other variables seems to be somewhat linearNORMALITY: can be checked with histogram and/ or Q-Q plot of the residuals
hist<-ggplot(data = best.model1, aes(x = best.model1$residuals, y = ..density..))+
geom_histogram(color= "darkgray",fill = "cornflowerblue")+
geom_density(size = 1, colour = "brown")+
xlab("residuals")+
ggtitle("Distribution of Model Residuals")
qq<- ggplot(data = best.model1, aes(sample = .resid))+
stat_qq(colour="cornflowerblue") + stat_qq_line(colour="brown", size=1)+
xlab("theoretical quantiles")+
ylab("sample quantiles")+
ggtitle("Normal Q-Q Plot")
grid.arrange(hist, qq, ncol=2)\(=>\) residuals almost normally distributed
CONSTANT VARIABILITY: can be checked by plotting residuals against predicted values
INDEPENDENCY: residuals can be plotted as they appear in the data set, which would reveal any time series dependence
var<-ggplot(data = best.model1, aes(x = .fitted, y = .resid))+
geom_point(colour= "cornflowerblue") +
geom_smooth(method=lm, colour="brown") +
xlab("prediction") +ylab("residuals")+
ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model1, aes(x = 1:nrow(data), y = .resid))+
geom_point(colour= "cornflowerblue") +
geom_smooth(method=lm, colour="brown") +
xlab("index") +ylab("residuals")+
ggtitle("Model Residuals (independence)")
grid.arrange(var, indep, ncol=2)best.model1 turned out to be the initial full all.model, which predicts prices (log(price)) based on
Lot.Area),Land.Slope),Year.Built),Year.Remod.Add),Bedroom.AbvGr)lm(formula = `log(price)` ~ Lot.Area + Land.Slope + Year.Built +
Year.Remod.Add + Bedroom.AbvGr, data = data)
Find outlier index and look at the house characteristics:
index<- which.max(abs(resid(best.model1)))
out<-stack(ames_train[index,])
out values ind
1 902207130 PID
2 832 area
3 12789 price
4 30 MS.SubClass
5 68 Lot.Frontage
6 9656 Lot.Area
7 2 Overall.Qual
8 2 Overall.Cond
9 1923 Year.Built
10 1970 Year.Remod.Add
11 0 Mas.Vnr.Area
12 0 BsmtFin.SF.1
13 0 BsmtFin.SF.2
14 678 Bsmt.Unf.SF
15 678 Total.Bsmt.SF
16 832 X1st.Flr.SF
17 0 X2nd.Flr.SF
18 0 Low.Qual.Fin.SF
19 0 Bsmt.Full.Bath
20 0 Bsmt.Half.Bath
21 1 Full.Bath
22 0 Half.Bath
23 2 Bedroom.AbvGr
24 1 Kitchen.AbvGr
25 5 TotRms.AbvGrd
26 1 Fireplaces
27 1928 Garage.Yr.Blt
28 2 Garage.Cars
29 780 Garage.Area
30 0 Wood.Deck.SF
31 0 Open.Porch.SF
32 0 Enclosed.Porch
33 0 X3Ssn.Porch
34 0 Screen.Porch
35 0 Pool.Area
36 0 Misc.Val
37 6 Mo.Sold
38 2010 Yr.Sold
39 97 age
ames_train
Overall.Qual and Overall.Cond) are not present in the final best.model1
Lot.Area with log(Lot.Area)Take the same steps as in Question 4
data1 <- data %>% transmute(log(Lot.Area), Land.Slope,
Year.Built, Year.Remod.Add, Bedroom.AbvGr, `log(price)`)
all.model <- lm(`log(price)` ~ ., data = data1)
`-log(Lot.Area)` <- lm(`log(price)` ~ (.-`log(Lot.Area)`), data = data1)
`-Land.Slope` <- lm(`log(price)` ~ `log(Lot.Area)` + Year.Built + Year.Remod.Add +
Bedroom.AbvGr, data = data1)
`-Year.Built` <- lm(`log(price)` ~ (.-Year.Built), data = data1)
`-Year.Remod.Add` <- lm(`log(price)` ~ (.-Year.Remod.Add), data = data1)
`-Bedroom.AbvGr` <- lm(`log(price)` ~ (.-Bedroom.AbvGr), data = data1)
rsq.all <- round(summary(all.model)$adj.r.squared,4)
rsq1 <- round(summary(`-log(Lot.Area)`)$adj.r.squared,4)
rsq2 <- round(summary(`-Land.Slope`)$adj.r.squared,4)
rsq3 <- round(summary(`-Year.Built`)$adj.r.squared,4)
rsq4 <- round(summary(`-Year.Remod.Add`)$adj.r.squared,4)
rsq5 <- round(summary(`-Bedroom.AbvGr`)$adj.r.squared,4)
aic<- round(c(AIC(all.model), AIC(`-log(Lot.Area)`), AIC(`-Land.Slope`),
AIC(`-Year.Built`), AIC(`-Year.Remod.Add`),
AIC(`-Bedroom.AbvGr`)),4)
bic<- round(c(BIC(all.model), BIC(`-log(Lot.Area)`), BIC(`-Land.Slope`),
BIC(`-Year.Built`), BIC(`-Year.Remod.Add`),
BIC(`-Bedroom.AbvGr`)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(
rsq.all, rsq1, rsq2, rsq3, rsq4, rsq5), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('all.model','-Lot.Area', '-Land.Slope',
'-Year.Built', '-Year.Remod.Add', '-Bedroom.AbvGr')
kable(compar.mdls, caption = 'Models Comparison')| Adj.R-sq | AIC | BIC | |
|---|---|---|---|
| all.model | 0.6032 | 190.3791 | 229.6411 |
| -Lot.Area | 0.5220 | 375.5497 | 409.9040 |
| -Land.Slope | 0.6015 | 192.7134 | 222.1599 |
| -Year.Built | 0.4932 | 434.0186 | 468.3729 |
| -Year.Remod.Add | 0.5364 | 344.9971 | 379.3513 |
| -Bedroom.AbvGr | 0.5911 | 219.4560 | 253.8102 |
all.model has
-Land.Slope model is
all.model in Adjusted R-squared (-0.0017)all.model in AIC (2.3343)all.model in BIC (-7.4812)best.model2 is -Land.Slope model:best.model2 <- `-Land.Slope`best.model2 turned out to be -Land.Slope model, which predicts prices (log(price)) based on
log(Lot.Area)),Year.Built),Year.Remod.Add),Bedroom.AbvGr)lm(formula = `log(price)` ~ `log(Lot.Area)` + Year.Built + Year.Remod.Add +
Bedroom.AbvGr, data = data1)
LINEARITY
best1 <- tibble(Observed = data$`log(price)`, Predicted = fitted(best.model1))
best2 <- tibble(Observed = data1$`log(price)`, Predicted = fitted(best.model2))
gg1 <- ggplot(best1, aes(x=Observed, y=Predicted)) +
geom_point(col="cornflowerblue") +
stat_smooth(method = "lm", col = "violet") +
ggtitle("Lot.Area")
gg2 <- ggplot(best2, aes(x=Observed, y=Predicted)) +
geom_point(col="cornflowerblue") +
stat_smooth(method = "lm", col = "purple") +
ggtitle("log(Lot.Area)")
grid.arrange(gg1, gg2, nrow = 1,
top = "Observed log(price) vs. Predicted log(price)")Comparing to plot for Lot.Area, plot for log-transformed log(Lot.Area) gives a better model fit as it:
Model diagnostics, other conditions:
NORMALITY/ CONSTANT VARIABILITY/ INDEPENDENCY (the same procedure as in Question 4)
hist<-ggplot(data = best.model2, aes(x = best.model2$residuals, y = ..density..))+
geom_histogram(color= "darkgray",fill = "cornflowerblue")+
geom_density(size = 1, colour = "brown")+
xlab("residuals")+
ggtitle("Distribution of Model Residuals")
qq<- ggplot(data = best.model2, aes(sample = .resid))+
stat_qq(colour="cornflowerblue") + stat_qq_line(colour="brown", size=1)+
xlab("theoretical quantiles")+
ylab("sample quantiles")+
ggtitle("Normal Q-Q Plot")
grid.arrange(hist, qq, ncol=2)var<-ggplot(data = best.model2, aes(x = .fitted, y = .resid))+
geom_point(colour= "cornflowerblue") +
geom_smooth(method=lm, colour="brown") +
xlab("prediction") +ylab("residuals")+
ggtitle("Residuals vs Prediction (variability)")
indep<-ggplot(data = best.model2, aes(x = 1:nrow(data), y = .resid))+
geom_point(colour= "cornflowerblue") +
geom_smooth(method=lm, colour="brown") +
xlab("index") +ylab("residuals")+
ggtitle("Model Residuals (independence)")
grid.arrange(var, indep, ncol=2)Overall, other log-transformed log(Lot.Area) diagnostic plots are almost similar to Lot.Area ones and show the same outlier (# \(428\)); meanwhile:
log(Lot.Area) looks slightly less skewed and more centeredlog(Lot.Area) have less variancersq.best1 <- round(summary(best.model1)$adj.r.squared,4)
rsq.best2 <- round(summary(best.model2)$adj.r.squared,4)
aic<- round(c(AIC(best.model1), AIC(best.model2)),4)
bic<- round(c(BIC(best.model1), BIC(best.model2)),4)
compar.mdls<- cbind(`Adj.R-sq`=c(rsq.best1, rsq.best2), AIC=aic, BIC=bic)
row.names(compar.mdls)<- c('best.model1','best.model2')
kable(compar.mdls, caption = 'Best Models Comparison')| Adj.R-sq | AIC | BIC | |
|---|---|---|---|
| best.model1 | 0.5598 | 294.1022 | 333.3642 |
| best.model2 | 0.6015 | 192.7134 | 222.1599 |
log(Lot.Area) has led to advances in model accuracy: