Background for the task

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set.

load("ames_train.Rdata")
load("ames_test.Rdata")
load("ames_validation.Rdata")

library(statsr)
library(dplyr)
library(BAS)
library(MASS)
library(ggplot2)
library(gridExtra)
library(data.table)
library(pander) 

Part 1 - Exploratory Data Analysis (EDA)

Data

The data set contains information from the Ames Assessor’s Office used in computing assessed values for individual residential properties sold in Ames, IA from 2006 to 2010. It contains 1000 observations, each identified by a sale ID (PID), with 80 variables describing the property sales. The study is observational, so causality cannot be claimed.

The parameters include house characteristics of the following types and counts:

  • 20 continuous, like house area or lot area
  • 14 discrete, like the number of rooms or fireplaces in the house
  • 23 nominal (categorical), like neighborhood or type of dwelling (one story, two story, …)
  • 23 ordinal (categorical), like the quality of the house

More details are available through this link: https://ww2.amstat.org/publications/jse/v19n3/decock/datadocumentation.txt

First, the data are checked for missing values; there are 4711 NA’s in total.

panderOptions("table.split.table",120)
options(width=120)

# count the NA per column
ames.na <- as.data.frame(colSums(is.na(ames_train)))

# use data.table library to convert row names into column
setDT(ames.na, keep.rownames = TRUE)

# rename columns
colnames(ames.na) <- c("Parameter", "na.n")

pander(ames.na %>% arrange(desc(na.n)) %>% filter(na.n>0))
Parameter        na.n
---------------  ----
Pool.QC           997
Misc.Feature      971
Alley             933
Fence             798
Fireplace.Qu      491
Lot.Frontage      167
Garage.Yr.Blt      48
Garage.Qual        47
Garage.Cond        47
Garage.Type        46
Garage.Finish      46
Bsmt.Qual          21
Bsmt.Cond          21
Bsmt.Exposure      21
BsmtFin.Type.1     21
BsmtFin.Type.2     21
Mas.Vnr.Area        7
BsmtFin.SF.1        1
BsmtFin.SF.2        1
Bsmt.Unf.SF         1
Total.Bsmt.SF       1
Bsmt.Full.Bath      1
Bsmt.Half.Bath      1
Garage.Cars         1
Garage.Area         1

A closer look shows that most of the NA’s in the data are not actually missing values; they simply indicate that the property does not have the feature in question. This is particularly true for the categorical parameters; the others need further investigation. These “missing” data are corrected by converting NA to a “NotFeat” category for all categorical parameters. NA in Lot.Frontage is converted to 0 square feet, and so is Mas.Vnr.Area. Garage.Yr.Blt is set to the year the house was built, even when there is no garage. This corrects the vast majority of NA’s in the data, leaving only a handful of true NA’s.

Cases flagged with an abnormal sale condition possibly do not reflect the true value of the property; those are removed. A new parameter indicating the age of the house is also added.

# add NotFeat as a factor level and remove NA's
ames_train[] <- lapply(ames_train, function(x){
  if(!is.factor(x)) return(x)
  x <- factor(x, exclude=NULL)
  levels(x)[is.na(levels(x))] <- "NotFeat"
  return(x)
})

# Replace NA with 0 in selected columns
ames_train$Lot.Frontage[is.na(ames_train$Lot.Frontage)] <- 0
ames_train$Mas.Vnr.Area[is.na(ames_train$Mas.Vnr.Area)] <- 0

# add new parameter is.garage (TRUE if the property has a garage)
ames_train$is.garage <- !is.na(ames_train$Garage.Yr.Blt)

# change the Garage.Yr.Blt to Year.Built where NA
ames_train$Garage.Yr.Blt[is.na(ames_train$Garage.Yr.Blt)] <- ames_train$Year.Built[is.na(ames_train$Garage.Yr.Blt)]

## MS.SubClass identifies the type of dwelling involved in the sale and should be categorical.
ames_train$MS.SubClass <- as.factor(ames_train$MS.SubClass)

## Utilities is removed as it contains only one factor level. PID is removed as it is the ID of the sale and should not provide any additional information from a statistical point of view
ames_train$Utilities <- NULL
ames_train$PID <- NULL

# only include cases where sale condition is normal
ames_train = ames_train %>% filter(Sale.Condition == "Normal")

# add house age parameter
ames_train$house.age <- 2018-ames_train$Year.Built
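
A quick sanity check (a small sketch) confirms the claim above that only a handful of true NA’s remain after these corrections:

# count the remaining NA's in the cleaned training data
sum(is.na(ames_train))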

Analysing the response variable, price

The median sale price in Ames is $155,500 and the IQR is $76,000; the summary table below provides further information:

pander(ames_train %>% 
         summarize(price.mean = mean(price), 
                   price.median = median(price), 
                   price.IQR = IQR(price), 
                   price.sd = sd(price)))
price.mean   price.median   price.IQR   price.sd
----------   ------------   ---------   --------
    174622         155500       76000      72270

The histogram of prices is somewhat skewed to the right. Transforming price to a logarithmic scale improves the distribution; the impact will be verified later in the document. The graphs below show the dollar-value distribution and the log-transformed distribution. The orange vertical line indicates the median price and the red one the mean.

g1 <- ggplot(ames_train)+
  geom_histogram(aes(price),bins=30, color="darkblue",fill="blue")+
  geom_vline(aes(xintercept=median(price)),size=1,color="orange")+
    geom_vline(aes(xintercept=mean(price)),size=1,color="red")+
  labs(x="price",y="Count of properties sold")

g2 <- ggplot(ames_train)+
  geom_histogram(aes(log(price)),bins=30, color="darkblue",fill="blue")+
    geom_vline(aes(xintercept=median(log(price))),size=1,color="orange")+
      geom_vline(aes(xintercept=mean(log(price))),size=1,color="red")+
  labs(x="log(price)",y="Count of properties sold")

grid.arrange(g1,g2,ncol=2)
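
As a numeric complement to the histograms, a rough moment-based skewness estimate (a small sketch, using no extra packages) shows the improvement from the log transform:

# simple skewness estimate: third standardized moment
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skew(ames_train$price), logged = skew(log(ames_train$price)))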

Explanatory variables

There are 79 explanatory parameters. The most interesting are possibly the ones real estate companies use in their advertising material. Those parameters are used to attract potential buyers and are also easily available for predicting the values of newly listed houses. They consist of items like living area, neighborhood, number of bedrooms, garage, an indication of the age of the house, etc.

Location

Location is expected to be one of the most important factors with respect to price. This can be seen in the box plot below, which shows the price medians and distributions in each neighborhood, sorted by median price. Small dots indicate the individual sales, giving an indication of the number of sales in each neighborhood. Most records are from NAmes; the most expensive neighborhood is NridgHt and the cheapest is MeadowV. At this stage it is unknown whether a neighborhood is more expensive because of bigger houses in some areas or because people are happy to pay more in certain areas.

ggplot(ames_train)+
  geom_boxplot(aes(reorder(Neighborhood,price,median),price),color="darkblue", fill="lightblue")+
  geom_jitter(aes(x=Neighborhood, y = price),alpha= 0.32, height = 0, width = 0.29,size=0.5) +
  coord_flip() +
  labs(x="Neighborhood", y="Price")

The main statistics about the price in each neighborhood are the median price, the IQR and the number of sales. Those are provided in the summary table below.

pander(ames_train %>% group_by(Neighborhood) %>% summarise(price.median=median(price),
                                                           price.IQR=IQR(price),
                                                           n=n()) %>%
         arrange(desc(price.median)))
Neighborhood   price.median   price.IQR     n
------------   ------------   ---------   ----
NridgHt              305500      177813     36
NoRidge              290000       50313     28
GrnHill              280000       50000      2
StoneBr              255500      116875     12
Timber               232500      120000     17
Somerst              220000       77500     45
Veenker              217500       68000      9
Greens               212625       16438      4
Crawfor              198000       76100     25
CollgCr              195000       58500     75
Blmngtn              192000       20130      7
NWAmes               190000       59000     35
ClearCr              187500       68000     11
Gilbert              184000       22900     36
SawyerW              181000       52250     39
Mitchel              160000       33000     41
NPkVill              142100       13025      4
NAmes                140000       25250    140
Sawyer               135000       19000     57
SWISU                129500       35000     10
Edwards              125400       40125     50
BrkSide              125250       41394     36
Blueste              123900       10250      3
OldTown              120000       33900     62
IDOTRR               103000       44250     27
BrDale               100500       11450      7
MeadowV               85750       20150     16

Other explanatory variables of interest

Other factors, like the overall quality of the finishing materials and the overall condition of the house, are also expected to impact the price. Interestingly, the overall quality of the finishing materials seems to have a much clearer impact on the price than the overall condition does, as long as the overall condition is reasonable. The box plots below show the price distributions per overall quality and overall condition class (1-10). Black dots are individual deals per class, indicating the number of sales in each class.

g1 <- ggplot(ames_train)+
  geom_boxplot(aes(reorder(Overall.Qual,price), price),color="darkblue", fill="lightblue")+
  geom_jitter(aes(x=Overall.Qual, y = price),alpha= 0.25, height = 0, width = 0.29,size=0.45) +
  guides(fill=FALSE, color=FALSE) +
  labs(x="Overall Quality", y="Price")

g2 <- ggplot(ames_train)+
  geom_boxplot(aes(reorder(Overall.Cond,Overall.Cond), price),color="darkblue", fill="lightblue")+
  geom_jitter(aes(x=Overall.Cond, y = price),alpha= 0.25, height = 0, width = 0.29,size=0.5) +
  guides(fill=FALSE, color=FALSE) +
  labs(x="Overall Condition", y="Price")

grid.arrange(g1,g2, ncol=2)

Most houses are scored between 4 and 8 in both quality and condition, which should make predictions more accurate in the middle of the range than at the extremes.

There are a number of other interesting explanatory parameters; however, the above are perhaps the most interesting findings from the data, showing the importance of location and the quality of the finishing materials, while the overall condition surprisingly had a much smaller impact on price.


Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


The selected parameters are such that they are often found in real estate marketing materials or can be guessed from a picture. They are also expected to be the most important factors people consider when buying a house and hence important parameters for determining the price. Such items are neighborhood, living area, lot area, number of bedrooms and bathrooms, quality and condition, position of the lot, and whether there is a garage on the property. Below, a simple linear model is created using these variables. At this stage neighborhood is excluded, but the final model will contain it.

Here the above-mentioned parameters are compared against each other by fitting a simple linear model using the lm function. Based on the p-values, most of the variables seem significant. Lot configurations other than cul-de-sac seem to provide little input to the model.

# new data frame for further analysis 
ames_train.selected.for.model.1 <- ames_train[,c("price", "area", "Lot.Area", "Bedroom.AbvGr", "Overall.Qual", "Overall.Cond","is.garage","Full.Bath","house.age" )]

# first model
model.init.lm <- lm( price ~ . , data=ames_train.selected.for.model.1)
summary(model.init.lm)
## 
## Call:
## lm(formula = price ~ ., data = ames_train.selected.for.model.1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -94009 -18897   -735  15789 217224 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.191e+04  1.020e+04  -1.168    0.243    
## area           9.058e+01  3.798e+00  23.851  < 2e-16 ***
## Lot.Area       1.032e+00  1.034e-01   9.974  < 2e-16 ***
## Bedroom.AbvGr -1.532e+04  1.682e+03  -9.108  < 2e-16 ***
## Overall.Qual   1.853e+04  1.214e+03  15.257  < 2e-16 ***
## Overall.Cond   5.666e+03  1.039e+03   5.455 6.49e-08 ***
## is.garageTRUE -1.971e+03  5.308e+03  -0.371    0.711    
## Full.Bath     -1.276e+04  2.883e+03  -4.427 1.08e-05 ***
## house.age     -7.140e+02  5.053e+01 -14.130  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29860 on 825 degrees of freedom
## Multiple R-squared:  0.8309, Adjusted R-squared:  0.8293 
## F-statistic: 506.7 on 8 and 825 DF,  p-value: < 2.2e-16

The Q-Q and residual plots for the initial model are shown below and indicate that the model is not very linear. Price was later transformed to log scale which resulted in better linearity. Details of the impact are shown later in the document.

par(mfrow=c(1,2))
plot(model.init.lm, which=1)
plot(model.init.lm, which=2)

pred.temp <- predict(model.init.lm,ames_train)
model.init.RMSE <- sqrt(mean((ames_train$price-(pred.temp))^2))

paste("The initial RMSE is  $", round(round(model.init.RMSE)))
## [1] "The initial RMSE is  $ 29700"

The RMSE for the initial model is roughly the standard deviation of the prediction error. Assuming approximately normal errors, about 68% of predictions fall within ± RMSE of the actual price and about 95% within ± 2 RMSE.
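
These percentages rest on the normality assumption; a quick empirical check of the coverage (a sketch reusing pred.temp and model.init.RMSE from the chunk above) could look like this:

# fraction of training observations within 1 and 2 RMSE of their prediction
resid.init <- ames_train$price - pred.temp
mean(abs(resid.init) <= model.init.RMSE)      # expect about 0.68
mean(abs(resid.init) <= 2 * model.init.RMSE)  # expect about 0.95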


Section 2.2 Model Selection

Now, either using BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


Here the BAS library is used to calculate the posterior probabilities for each parameter, and the BMA, HPM, MPM and BPM methods for selecting the explanatory variables are compared. Price is predicted using each of the models, and the resulting scatter plots are shown below.

model.bas.train.df = as.data.frame(matrix(NA, ncol=5, nrow(ames_train)))
colnames(model.bas.train.df) = c("price", "BMA", "BPM", "HPM","MPM")

model.bas <- bas.lm(log(price) ~ . , 
                              data=ames_train.selected.for.model.1, 
                              prior="BIC")

model.bas.train.df$price = ames_train$price
model.bas.train.df$BMA = exp(predict(model.bas, ames_train, estimator="BMA")$fit)
model.bas.train.df$HPM = exp(predict(model.bas, ames_train, estimator="HPM")$fit)
model.bas.train.df$BPM = exp(predict(model.bas, ames_train, estimator="BPM")$fit)
model.bas.train.df$MPM = exp(predict(model.bas, ames_train, estimator="MPM")$fit)

g1 <- ggplot(model.bas.train.df)+
  geom_point(aes(price,BMA),color="darkblue")
g2 <- ggplot(model.bas.train.df)+
  geom_point(aes(price,HPM),color="darkblue")
g3 <- ggplot(model.bas.train.df)+
  geom_point(aes(price,BPM),color="darkblue")
g4 <- ggplot(model.bas.train.df)+
  geom_point(aes(price,MPM),color="darkblue")

grid.arrange(g1,g2, g3,g4,ncol=4)

The scatter plots look very similar and the differences are difficult to spot; it is possible that the HPM, BPM and MPM methods arrive at the same model. At this stage we assume BMA provides the best RMSE and select it as the preferred model for further analysis.

pred.temp <- predict(model.bas,ames_train,estimator = "BMA")$fit
model.bas.train.df.RMSE.BMA <- sqrt(mean((ames_train$price-exp(pred.temp))^2))

paste("The RMSE for BMA is $", round(round(model.bas.train.df.RMSE.BMA)),"while the initial model RMSE was $", round(round(model.init.RMSE)))
## [1] "The RMSE for BMA is $ 26339 while the initial model RMSE was $ 29700"

The RMSE for BMA improved on the initial RMSE. The summary table shows the inclusion probabilities for each parameter; low probabilities seem to correlate quite well with the p-values of the first model.

summary(model.bas)
##               P(B != 0 | Y)    model 1       model 2       model 3       model 4       model 5
## Intercept         1.0000000     1.0000     1.0000000  1.000000e+00  1.000000e+00  1.000000e+00
## area              1.0000000     1.0000     1.0000000  1.000000e+00  1.000000e+00  1.000000e+00
## Lot.Area          1.0000000     1.0000     1.0000000  1.000000e+00  1.000000e+00  1.000000e+00
## Bedroom.AbvGr     0.9998797     1.0000     1.0000000  1.000000e+00  0.000000e+00  1.000000e+00
## Overall.Qual      1.0000000     1.0000     1.0000000  1.000000e+00  1.000000e+00  1.000000e+00
## Overall.Cond      1.0000000     1.0000     1.0000000  1.000000e+00  1.000000e+00  1.000000e+00
## is.garageTRUE     0.9998754     1.0000     1.0000000  0.000000e+00  1.000000e+00  0.000000e+00
## Full.Bath         0.9586165     1.0000     0.0000000  1.000000e+00  1.000000e+00  0.000000e+00
## house.age         1.0000000     1.0000     1.0000000  1.000000e+00  1.000000e+00  1.000000e+00
## BF                       NA     1.0000     0.3454117  1.018603e-03  9.894107e-04  7.470850e-05
## PostProbs                NA     0.9584     0.0414000  1.000000e-04  1.000000e-04  0.000000e+00
## R2                       NA     0.8697     0.8683000  8.664000e-01  8.664000e-01  8.645000e-01
## dim                      NA     9.0000     8.0000000  8.000000e+00  8.000000e+00  7.000000e+00
## logmarg                  NA -1182.0417 -1183.1046949 -1.188931e+03 -1.188960e+03 -1.191544e+03
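
As a visual complement to the summary table, the BAS package also provides built-in diagnostics (a quick sketch, not part of the graded output):

# posterior inclusion probabilities and the model space ranked by posterior probability
plot(model.bas, which = 4)
image(model.bas, rotate = FALSE)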

Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


BMA was selected as the preferred model above, and its residual plot is created below. The points seem randomly dispersed around the horizontal axis and the distribution looks normal. There is one potential outlier (611), a nearly 100-year-old house with an unfinished basement and garage, indicating that it may well be quite an unusual property that fits the model poorly.

plot(model.bas, which=1)


Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


model.bas.train.df["BMA.d"] = (model.bas.train.df$BMA - model.bas.train.df$price) 
model.bas.train.df["BPM.d"] = (model.bas.train.df$BPM - model.bas.train.df$price)
model.bas.train.df["HPM.d"] = (model.bas.train.df$HPM - model.bas.train.df$price)
model.bas.train.df["MPM.d"] = (model.bas.train.df$MPM - model.bas.train.df$price)

model.bas.train.df.RMSE.BMA = sqrt(mean(model.bas.train.df$BMA.d^2))
model.bas.train.df.RMSE.HPM = sqrt(mean(model.bas.train.df$HPM.d^2))
model.bas.train.df.RMSE.BPM = sqrt(mean(model.bas.train.df$BPM.d^2))
model.bas.train.df.RMSE.MPM = sqrt(mean(model.bas.train.df$MPM.d^2))

paste("RMSE's for BMA: $",round(model.bas.train.df.RMSE.BMA),", HPM: $",round(model.bas.train.df.RMSE.HPM),", MPM: $",round(model.bas.train.df.RMSE.MPM)," and BPM: $", round(model.bas.train.df.RMSE.BPM))
## [1] "RMSE's for BMA: $ 26339 , HPM: $ 26330 , MPM: $ 26330  and BPM: $ 26330"

The RMSEs for each selection method are shown above. The exactly equal RMSE for BPM, MPM and HPM suggests that those three methods arrived at the same model. The difference between all the models is very small. HPM is selected. All RMSEs here are reasonable, indicating that the model provides some accuracy. Interestingly, however, contrary to the previous assumption, BMA did not provide the highest accuracy when measured by RMSE.
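
One way to verify that the three estimators indeed picked the same predictors is to compare the variable lists of the prediction objects (a sketch; variable.names() on a BAS prediction object returns the variables in the selected model):

pred.HPM <- predict(model.bas, ames_train, estimator = "HPM")
pred.MPM <- predict(model.bas, ames_train, estimator = "MPM")
pred.BPM <- predict(model.bas, ames_train, estimator = "BPM")
identical(variable.names(pred.HPM), variable.names(pred.MPM))
identical(variable.names(pred.HPM), variable.names(pred.BPM))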


Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a data set other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


Comparing the actual prices to the predicted prices using the test data shows a picture very similar to the one created using the training data. The RMSE is very similar in both cases, and the mean predicted value is close to the actual mean. Based on this, the risk of overfitting with this model can be considered low.

# Clean the test data as was done before with training data

ames_test[] <- lapply(ames_test, function(x){
  if(!is.factor(x)) return(x)
  x <- factor(x, exclude=NULL)
  levels(x)[is.na(levels(x))] <- "NotFeat"
  return(x)
})

ames_test$Lot.Frontage[is.na(ames_test$Lot.Frontage)] <- 0
ames_test$Mas.Vnr.Area[is.na(ames_test$Mas.Vnr.Area)] <- 0
ames_test$is.garage <- !is.na(ames_test$Garage.Yr.Blt)
ames_test$Garage.Yr.Blt[is.na(ames_test$Garage.Yr.Blt)] <- ames_test$Year.Built[is.na(ames_test$Garage.Yr.Blt)]
ames_test$MS.SubClass <- as.factor(ames_test$MS.SubClass)
ames_test$Utilities <- NULL
ames_test$PID <- NULL
ames_test = ames_test %>% filter(Sale.Condition == "Normal")
ames_test$house.age <- 2018-ames_test$Year.Built

# create df for results
model.bas.test.df = as.data.frame(matrix(NA, ncol=3, nrow(ames_test)))
colnames(model.bas.test.df) = c("price", "BMA", "delta")

model.bas.test.df$price <-ames_test$price
model.bas.test.df$BMA = exp(predict(model.bas, ames_test, estimator="BMA")$fit)
model.bas.test.df$delta=model.bas.test.df$price-model.bas.test.df$BMA
model.bas.test.df.RMSE.BMA = sqrt(mean(model.bas.test.df$delta^2))
model.bas.test.df.mean.BMA = mean(model.bas.test.df$BMA)

ggplot(model.bas.test.df)+
  geom_point(aes(price,BMA),color="darkblue")+
  geom_smooth(aes(price,BMA),color="red", method="lm", se=F)+
  labs(x="price", y="BMA predicted price")

The graph above shows the observed price on the x-axis and the predicted price on the y-axis. The correlation looks good with the test data as well. Below, the key numbers from the test data are shown; the RMSE is almost equal between the training and testing data.

paste("Mean for actual out-of-sample price is $", round(mean(model.bas.test.df$price)), " while mean for predicted out-of-sample price is $", round(mean(model.bas.test.df$BMA)),". RMSE for training sample data is $", round(model.bas.train.df.RMSE.BMA,2), "and RMSE for out-of-sample data is $", round(model.bas.test.df.RMSE.BMA,2))
## [1] "Mean for actual out-of-sample price is $ 179640  while mean for predicted out-of-sample price is $ 177737 . RMSE for training sample data is $ 26338.77 and RMSE for out-of-sample data is $ 26356.32"

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the data set and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

Section 3.1 Final Model

Provide the summary table for your model.


For the final model, the Neighborhood, the lot configuration (Lot.Config) and the number of car spaces in the garage (Garage.Cars) were also added as explanatory variables. Neighborhood alone comes with 27 categorical levels; otherwise the parameters are as in the initial model.

HPM provided the best results for the initial model, so HPM was also chosen for the final model. The summary table of the BAS models is shown below. The selected model is HPM, and a Bayes factor of 1 indicates the highest probability model. The parameters marked in the column “model 1” are included in the final model.

Note that price, area and Lot.Area are transformed to a logarithmic scale.

# create final model
ames_train.selected.for.model.final <- ames_train[,c("price", "area", "Lot.Area", "Bedroom.AbvGr", "Overall.Qual", "Overall.Cond","is.garage","Full.Bath","house.age","Lot.Config","Garage.Cars","Neighborhood")]

model.final.bas <- bas.lm(log(price) ~  log(area) + log(Lot.Area) + Bedroom.AbvGr+ Overall.Qual+ Overall.Cond+is.garage+Full.Bath+house.age+Lot.Config+Garage.Cars+Neighborhood,
                          data=ames_train.selected.for.model.final,
                          prior="BIC")
summary(model.final.bas)
##                     P(B != 0 | Y)    model 1       model 2       model 3       model 4       model 5
## Intercept              1.00000000     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## log(area)              1.00000000     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## log(Lot.Area)          1.00000000     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## Bedroom.AbvGr          0.99991121     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## Overall.Qual           1.00000000     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## Overall.Cond           1.00000000     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## is.garageTRUE          0.12309830     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## Full.Bath              0.02829711     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## house.age              1.00000000     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## Lot.ConfigCulDSac      0.03277592     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## Lot.ConfigFR2          0.46673057     1.0000     0.0000000     0.0000000     1.0000000     1.0000000
## Lot.ConfigFR3          0.08584472     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## Lot.ConfigInside       0.05579068     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## Garage.Cars            0.99999986     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodBlueste    0.02785843     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodBrDale     0.26669872     0.0000     0.0000000     0.0000000     0.0000000     1.0000000
## NeighborhoodBrkSide    0.06643146     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodClearCr    0.02557859     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodCollgCr    0.95399037     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodCrawfor    0.99527934     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodEdwards    0.84288255     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodGilbert    0.99794065     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodGreens     0.25741804     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodGrnHill    0.99245749     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodIDOTRR     0.74991748     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodMeadowV    0.12235862     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodMitchel    0.05274199     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodNAmes      0.10855391     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodNoRidge    0.09450191     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodNPkVill    0.02782395     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodNridgHt    0.93096668     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodNWAmes     0.82352455     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodOldTown    0.47853590     1.0000     1.0000000     0.0000000     0.0000000     1.0000000
## NeighborhoodSawyer     0.05045458     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodSawyerW    0.99995689     1.0000     1.0000000     1.0000000     1.0000000     1.0000000
## NeighborhoodSomerst    0.05089309     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodStoneBr    0.13982498     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodSWISU      0.02800621     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodTimber     0.02682322     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## NeighborhoodVeenker    0.02763737     0.0000     0.0000000     0.0000000     0.0000000     0.0000000
## BF                             NA     1.0000     0.7788507     0.4851377     0.4548721     0.6038529
## PostProbs                      NA     0.0255     0.0242000     0.0204000     0.0142000     0.0139000
## R2                             NA     0.9089     0.9081000     0.9073000     0.9080000     0.9096000
## dim                            NA    19.0000    18.0000000    17.0000000    18.0000000    20.0000000
## logmarg                        NA -1066.2373 -1066.4872660 -1066.9606526 -1067.0250692 -1066.7417549
# calculate means and RMSE for final model using the training data
model.final.train.df = as.data.frame(matrix(NA, ncol=3, nrow(ames_train)))
colnames(model.final.train.df) = c("price", "HPM", "delta")
model.final.train.df$price <-ames_train$price
model.final.train.df$HPM = as.numeric(exp(predict(model.final.bas, ames_train, estimator="HPM")$fit))
model.final.train.df$delta=model.final.train.df$price-model.final.train.df$HPM
model.final.train.df.RMSE = sqrt(mean(model.final.train.df$delta^2))
model.final.train.df.mean = mean(model.final.train.df$HPM)

paste("Mean price of the actual training data is", round(mean(model.final.train.df$price))," while mean price of the predicted price using the final model is", round(mean(model.final.train.df$HPM)),". RMSE of the final model with training data is", round(model.final.train.df.RMSE))
## [1] "Mean price of the actual training data is 174622  while mean price of the predicted price using the final model is 173258 . RMSE of the final model with training data is 22071"

Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


As was shown in the EDA, transforming price to a logarithmic scale reduced the RMSE. The other parameters were also examined, and converting the areas to a logarithmic scale also helped correct skewness; these parameters were area and Lot.Area. The example graphs below show the impact of the transformation on the initial model, first with price on a linear scale and then with price on a logarithmic scale, with clearly better results.

r1 <- lm( price ~ . , data=ames_train.selected.for.model.1)
r2 <- lm( log(price) ~ . , data=ames_train.selected.for.model.1)

par(mfrow=c(2,2))
plot(r1, which=2)
plot(r2, which=2)
plot(r1, which=1)
plot(r2, which=1)


Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


No interaction terms were added. The non-categorical variables in the final model were checked for collinearity against the other parameters in the model. No obvious collinearity was found apart from some between the number of rooms and the area; however, even there there seems to be quite a bit of variance, so both were left in the model.

ames_train.selected.for.model.final <- ames_train[,c("price", "area", "Lot.Area", "Bedroom.AbvGr", "Overall.Qual", "Overall.Cond","is.garage","Full.Bath","house.age","Neighborhood")]

pairs(~area+Lot.Area+Bedroom.AbvGr+Overall.Qual+Overall.Cond+is.garage+Full.Bath+house.age,
      data=ames_train.selected.for.model.final)
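
As a numeric complement to the pairs plot, the correlations between the non-categorical predictors can be tabulated (a sketch; is.garage is excluded as it is logical):

# correlation matrix of the numeric predictors used in the model
num.vars <- c("area", "Lot.Area", "Bedroom.AbvGr", "Overall.Qual",
              "Overall.Cond", "Full.Bath", "house.age")
round(cor(ames_train[, num.vars]), 2)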


Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


As mentioned before, the final model contains all the same parameters as the initial model, with neighborhood (and the other additions noted above) as extra explanatory parameters. All these parameters are thought to have a significant impact on price and are also easily found in real estate marketing materials for future evaluation of new houses on the market. The remaining selection is done using BAS: it calculates posterior probabilities for including parameters in the model, and the highest probability model (HPM) was used to select the parameters for the final model.

BAS was selected due to a personal preference for Bayesian methods and the additional information they provide in comparison to the frequentist approach.

The posterior means, standard deviations and inclusion probabilities of the coefficients with a positive posterior mean are shown in the summary below; coefficients in the highest probability model (for which the Bayes factor = 1) have inclusion probabilities near 1:

a <- coef(model.final.bas)
coefs_final <- data.frame(parameter = a$namesx, pos.mean = a$postmean, pos.sd = a$postsd, pos.p = a$probne0) %>% filter(pos.mean>0)
pander( coefs_final)
parameter             pos.mean    pos.sd     pos.p
-------------------   ---------   --------   --------
Intercept             12          0.004058   1
log(area)             0.4741      0.0233     1
log(Lot.Area)         0.1537      0.009931   1
Overall.Qual          0.08509     0.005351   1
Overall.Cond          0.06019     0.004186   1
Lot.ConfigCulDSac     0.0003807   0.003713   0.03278
Lot.ConfigInside      0.0006639   0.003624   0.05579
Garage.Cars           0.05013     0.008377   1
NeighborhoodBrkSide   0.002073    0.01037    0.06643
NeighborhoodCrawfor   0.1118      0.02879    0.9953
NeighborhoodGreens    0.03453     0.06637    0.2574
NeighborhoodGrnHill   0.3398      0.08927    0.9925
NeighborhoodNAmes     0.002695    0.00947    0.1086
NeighborhoodNoRidge   0.004058    0.01533    0.0945
NeighborhoodNPkVill   0.0007003   0.01115    0.02782
NeighborhoodNridgHt   0.07416     0.03053    0.931
NeighborhoodSomerst   0.001351    0.008352   0.05089
NeighborhoodStoneBr   0.009385    0.02716    0.1398
NeighborhoodSWISU     0.0003361   0.007265   0.02801
NeighborhoodTimber    0.0001994   0.005409   0.02682
NeighborhoodVeenker   0.0004411   0.007542   0.02764

Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


The out-of-sample data did not change the model here, as the RMSE was only slightly higher for the out-of-sample data than for the training data. The proportions of prices falling into the prediction intervals were also very similar, both near 95%.

RMSE

The RMSE of the final model for the training data and for the test data are calculated below.

ames_test = ames_test %>% filter(Neighborhood != "Landmrk") # remove neighborhood level not present in the training data

model.final.test.df = as.data.frame(matrix(NA, ncol=3, nrow(ames_test)))
colnames(model.final.test.df) = c("price", "HPM", "delta")

model.final.test.df$price <-ames_test$price
model.final.test.df$HPM = exp(predict(model.final.bas, ames_test, estimator="HPM")$fit)
model.final.test.df$delta=model.final.test.df$price-model.final.test.df$HPM
model.final.test.df.RMSE = sqrt(mean(model.final.test.df$delta^2))
model.final.test.df.mean = mean(model.final.test.df$HPM)

paste("RMSE for training sample data is", round(model.final.train.df.RMSE)," and RMSE for out-of-sample data is", round(model.final.test.df.RMSE))
## [1] "RMSE for training sample data is 22071  and RMSE for out-of-sample data is 24334"

Prediction Interval

The percentage of sales in the training data falling into the 95% prediction interval is 94.8%:

model.final.bas.pred <- predict(model.final.bas, ames_train, 
                                estimator="HPM", 
                                prediction=TRUE, se.fit=TRUE)

# Get dataset of predictions and confidence intervals
out = as.data.frame(cbind(exp(confint(model.final.bas.pred)),
                          price = ames_train$price))

# Fix names in dataset
colnames(out)[1:2] <- c("lwr", "upr")  #fix names

# Get Coverage
pred.test.HPM.coverage <- out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
pred.test.HPM.coverage
##       cover
## 1 0.9484412

The percentage of sales in the testing data falling into the 95% prediction interval is 94.5%:

model.final.bas.pred <- predict(model.final.bas, ames_test, 
                                estimator="HPM", 
                                prediction=TRUE, se.fit=TRUE)

# Get dataset of predictions and confidence intervals
out = as.data.frame(cbind(exp(confint(model.final.bas.pred)),
                          price = ames_test$price))

# Fix names in dataset
colnames(out)[1:2] <- c("lwr", "upr")  #fix names

# Get Coverage
pred.test.HPM.coverage <- out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
pred.test.HPM.coverage
##       cover
## 1 0.9448529

Part 4 Final Model Assessment

Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


Based on the residual and Q-Q plots, the residuals of the final model seem close to normally distributed. Looking at the Q-Q plot, the residuals follow the normal line out to about 2 standard deviations.

The orange lines in the residual plot indicate potential outlier boundaries at 3 standard deviations. However, the probability of at least one observation falling more than 3 sd away in a sample of 834 is rather high, 1 - (1 - 2*pnorm(-3))^834 ≈ 0.895, hence it cannot be inferred that these points are actual outliers.

The density plot of (price - prediction) shows that the errors are centered around 0. The orange vertical lines show the 2.5% and 97.5% quantiles of the difference.

#calculate sd for residuals
r.sd <- sd(model.final.train.df$HPM-model.final.train.df$price)
k=3

# Residuals plot
g1 <- ggplot(model.final.train.df)+
  geom_hline(yintercept = r.sd*k, size=1,color="orange")+
  geom_hline(yintercept = -r.sd*k, size=1,color="orange")+
  geom_point(aes(HPM,HPM-price))+
  stat_smooth(aes(HPM,HPM-price),se=F)+
  geom_hline(yintercept = 0, size=1,color="red")+
  labs(x="Fitted Values",y="Residuals", title="Final model residuals")

p.at.least.one.outlier <- 1 - (1 - (2*pnorm(-3)))^nrow(model.final.train.df)

# Quantile-Quantile Plot 
resid.m <- mean(model.final.train.df$HPM-model.final.train.df$price, na.rm=TRUE)
resid.sd <- sd(model.final.train.df$HPM-model.final.train.df$price, na.rm=TRUE)
std_resid <- data.frame(z = ((model.final.train.df$HPM-model.final.train.df$price)-resid.m)/resid.sd)

g2 <- ggplot(std_resid)+
  geom_abline(aes(intercept = 0,slope=1), size=1,color="orange")+
  stat_qq(aes(sample=z))+
  labs(title="Final model Q-Q plot", y="Standardized Residuals", x="Standardized Quantiles")+
  xlim(-4,4)+
  ylim(-4,4)

g3 <- ggplot(model.final.train.df)+
  geom_density(aes(delta), color="blue", size=1)+
  geom_vline(xintercept = quantile(model.final.train.df$delta, probs = 0.025),color="orange",size=1)+
    geom_vline(xintercept = quantile(model.final.train.df$delta, probs = 0.975),color="orange",size=1)+
  labs(x="(price - prediction)")

grid.arrange(g1,g2,g3, ncol=2)


Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


The RMSE (the standard deviation of the difference between the predicted and the actually observed values) of the final model is much better than it was for the initial models. This indicates that the additional parameters and the log transformation of price did improve the accuracy of the model.

Assuming roughly normal errors, the RMSE can be translated as follows: about 68% of the observed house prices are within one RMSE of the predicted price and about 95% within 2 x RMSE.

# These were already calculated earlier in the model testing chapter
paste("RMSE of the final model with training data is ", round(model.final.train.df.RMSE), " and RMSE of the final model with test data is ",round(model.final.test.df.RMSE))
## [1] "RMSE of the final model with training data is  22071  and RMSE of the final model with test data is  24334"

The RMSE here seems reasonable; it is roughly 14-16% of the median house price.


Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


The model works well with the training, testing and validation data, but it does not account for fluctuations in value after the data collection period. If the year of sale were also included as an explanatory parameter, a steady increase in price could possibly be captured.

The model is also limited to this area; however, the same method can be applied to other areas by training the model with new, local data.


Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” data set to do some additional assessment of your final model. Discuss your findings; be sure to mention:

  • What is the RMSE of your final model when applied to the validation data?
  • How does this value compare to that of the training data and/or testing data?
  • What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
  • From this result, does your final model properly reflect uncertainty?

Below the predicted and actually observed values of the final model are compared.

# Clean the validation data as was done before with training data

ames_validation[] <- lapply(ames_validation, function(x){
  if(!is.factor(x)) return(x)
  x <- factor(x, exclude=NULL)
  levels(x)[is.na(levels(x))] <- "NotFeat"
  return(x)
})

ames_validation$Lot.Frontage[is.na(ames_validation$Lot.Frontage)] <- 0
ames_validation$Mas.Vnr.Area[is.na(ames_validation$Mas.Vnr.Area)] <- 0
ames_validation$is.garage <- !is.na(ames_validation$Garage.Yr.Blt)
ames_validation$Garage.Yr.Blt[is.na(ames_validation$Garage.Yr.Blt)] <- ames_validation$Year.Built[is.na(ames_validation$Garage.Yr.Blt)]
ames_validation$MS.SubClass <- as.factor(ames_validation$MS.SubClass)
ames_validation$Utilities <- NULL
ames_validation$PID <- NULL
ames_validation = ames_validation %>% filter(Sale.Condition == "Normal")
ames_validation$house.age <- 2018-ames_validation$Year.Built

# create df for results
model.final.validation.df = as.data.frame(matrix(NA, ncol=3, nrow(ames_validation)))
colnames(model.final.validation.df) = c("price", "HPM", "delta")

model.final.validation.df$price <-ames_validation$price
model.final.validation.df$HPM = exp(predict(model.final.bas, ames_validation, estimator="HPM")$fit)
model.final.validation.df$delta=model.final.validation.df$price-model.final.validation.df$HPM
model.final.validation.df.RMSE = sqrt(mean(model.final.validation.df$delta^2))
model.final.validation.df.mean = mean(model.final.validation.df$HPM)

model.final.bas.pred <- predict(model.final.bas, ames_validation, 
                                estimator="HPM", 
                                prediction=TRUE, se.fit=TRUE)

# Get dataset of predictions and credible intervals
out = as.data.frame(cbind(exp(confint(model.final.bas.pred)),
                          price = ames_validation$price))

# Fix names in dataset
colnames(out)[1:2] <- c("lwr", "upr")  #fix names

ggplot(out)+
  geom_point(aes(price,pred),color="darkblue")+
  geom_smooth(aes(price,pred),color="red", method="lm", se=F)+
  labs(x="price", y="HPM predicted price")


paste("Mean for actual validation price is", round(mean(model.final.validation.df$price)),". Mean for predicted validation price is", round(mean(model.final.validation.df$HPM))," and RMSE for validation data is", round(model.final.validation.df.RMSE))
## [1] "Mean for actual validation price is 172264 . Mean for predicted validation price is 171199  and RMSE for validation data is 21917"

When the model is applied to the validation data, the RMSE is even lower than it was for the training or testing data.

model.final.bas.pred <- predict(model.final.bas, ames_validation, 
                                estimator="HPM", 
                                prediction=TRUE, se.fit=TRUE)

# Get dataset of predictions and credible intervals
out = as.data.frame(cbind(exp(confint(model.final.bas.pred)),
                          price = ames_validation$price))

# Fix names in dataset
colnames(out)[1:2] <- c("lwr", "upr")  #fix names

# Get Coverage
pred.test.HPM.coverage <- out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
pred.test.HPM.coverage
##       cover
## 1 0.9448529

The percentage of sales in the validation data falling into the 95% prediction interval is about 94.5%. This indicates that the model correctly reflects the uncertainty in the predictions.


Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


The aim was to create a model to predict house prices based on Bayesian methods. The BAS library was used to calculate the posterior probabilities for each parameter, and the highest probability model (HPM) was selected. Checked against the validation data, the model provides an accuracy of about ±$23,000 for 68% of the houses, or about ±$45,000 for 95% of the houses. The median price in the area is about $155,000, so this error of ±$23,000 corresponds to about 15% of the median house price. The model works well for normal sale cases; however, a sale classified as abnormal can be anything and potentially does not reflect the true value of the house, hence those cases were removed from the model training data.

The learning from the data was that the most important inputs were the quality, age, size and location of the house. Location, however, proved important only in some of the areas; in others it plays less of a role, and a low price may simply mean that other qualities dominate. For example, the average living area in some neighborhoods may be smaller than in more expensive areas while the price per square foot is the same. In the future it could be interesting to do a similar analysis from the price-per-square-foot point of view. The learning from the process was that it is truly worth playing with the data and the models, as there are many ways of improving on what initially seemed like a good model.