2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(dplyr)
library(BAS)
library(ggplot2)
library(lubridate)
library(MASS)
library(gridExtra)
library(plotly)
library(cowplot)

## Warning: package 'cowplot' was built under R version 3.4.3

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).

Before even starting, we have to re-estate the questions we would like to answer with this data set. The question we would like to answer here is “what variables of this dataset as useful to predict a house price in IOWA and what is the contribution of each of the predictors to the overall fitted model if any?”

Based on the provided information and the request analysis from the firm, we want first to explore the data and few potential predictors of house pricing that comes to mind are the house age, the Neighborhood, the quality of the house, the size of the house and more. With that in mind, we will begin by plotting the house count by Year built and see if there is any trend observed.

## Frequency plot of house count by Year.Built

ggplot(data = ames_train, aes(x=year(today())-Year.Built))+
  geom_histogram(bins = 30, fill="blue", colour="white")+labs(title="House counts by age", x="House age", y="House count")+
  
  geom_vline(xintercept = median(year(today())-ames_train$Year.Built), colour="red")+
  
  annotate("text", x= median(year(today())-ames_train$Year.Built)-4, y=c(100,-10)
           , label=c("Median", median(year(today())-ames_train$Year.Built))
           , colour="red", angle=c(90,0))+
  
  geom_vline(xintercept = mean(year(today())-ames_train$Year.Built), colour="#41c42f")+
  
  annotate("text", x= mean(year(today())-ames_train$Year.Built)+4, y=c(100,-10)
           , label=c("Mean", round(mean(year(today())-ames_train$Year.Built), 1))
           , colour="#41c42f", angle=c(-90,0))+
  
  theme(plot.title = element_text(hjust=0.5)  
        ,title = element_text(hjust = 0.5, colour = "red", size = rel(2))
        , axis.title =element_text(size = rel(1), colour = "blue", hjust = .5)
        , legend.position = "none"
        ,panel.background = element_rect(fill = "grey", colour = "black")
        , panel.grid = element_line(color = "blue"))

Plotting the above house count by Year.Built shows a multimodel distribution of houses with at least 3 modes. The plot shows also the distribution is extremely right skewed. This tells us that the usage of robust statistics would be more appropriate to discribe this distribution. Let us use below appropriate summary statistics to describe the distribution.

ames_train %>% summarise(Min=min(price), Mean=mean(price), Median=median(price), IQR=IQR(price), Std.dev=sd(price), Max=max(price))

## # A tibble: 1 x 6
##     Min     Mean Median     IQR  Std.dev    Max
##   <dbl>    <dbl>  <dbl>   <dbl>    <dbl>  <dbl>
## 1 12789 181190.1 159467 83237.5 81909.79 615000

ames_train %>% summarise(Min=min(year(today())-Year.Built), Mean=mean(year(today())-Year.Built), Median=median(year(today())-Year.Built), IQR=IQR(year(today())-Year.Built), Std.dev=sd(year(today())-Year.Built), Max=max(year(today())-Year.Built))

## # A tibble: 1 x 6
##     Min   Mean Median   IQR  Std.dev   Max
##   <dbl>  <dbl>  <dbl> <dbl>    <dbl> <dbl>
## 1     7 44.797     42    46 29.63741   145

Above summary statistics shows that median age of this house distribution is 42 Years which means 50% of houses were built less than 42 years ago. The IQR reveals more details by showing that the middle 50% of houses have an age difference of 46 Years.

We can also see on the other summary statistics that the minimum house price is 12789 which is too low fora house built and sold in normal condition. The max price of 615000 may need to be investigated as well.

a1 <- ames_train %>% group_by(Neighborhood) %>% ggplot(aes(x=Neighborhood, y=log(price), fill=factor(Neighborhood))) + geom_boxplot()+labs(title="Log(price) by Neighborhood", xlab="Neighborhood", ylab="Log(Price)", fill="By Neighborhood")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))

a1

a2 <- ames_train %>% ggplot(aes(x=price/10^5))+geom_histogram(aes(fill=factor(ames_train$Pool.QC)))+labs(title="Price distribution by Pool Quality", xlab="Price in 100K", ylab="House Count", fill="Pool Quality")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))

a2

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

a3 <- ames_train %>% ggplot(aes(x=price/10^5))+geom_histogram(aes(fill=factor(ames_train$Sale.Condition)))+labs(title="Price distribution by Sale.Condition", xlab="Price in 100K", ylab="House Count", fill="Sale.Condition")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))

a3

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

a4 <- ames_train %>% filter(Sale.Condition=="Normal")  %>% ggplot(aes(x=price/10^5, fill=Sale.Condition))+geom_histogram()+labs(title="Price distribution by Normal Sale.Condition", xlab="Price in 100K", ylab="House Count", fill="Normal Sale.Condition")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5))

a4

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The boxplot above shows the pricing variation per neighborhood and appears from it that StoneBr is the Neighborhood in IOWA with the highest Inter-Quartile Range (IQR). The plot also shows that the cheapest house in the dataset is in OldTown and the related house price is quite dispersed as compared to other houses in the same neighborhood. This may require further investigations.

This plot also shows that the house price skewness is influence partially by Sales not done in Normal condition.

The most expensive house in the data set is one that ha a swimming pool in good quality.

With this in mind, we will see how to subset the dataset to normallize the house distribution to improve modelling.

## counting all missing values

missing_val <- ames_train %>% summarise_all(funs(sum(is.na(.))))

## sorting the top 3 missing values. Unlist helps converting to vector with variable names as names
head(sort(unlist(missing_val[1,]), decreasing = TRUE), 20)

##        Pool.QC   Misc.Feature          Alley          Fence   Fireplace.Qu 
##            997            971            933            798            491 
##   Lot.Frontage  Garage.Yr.Blt    Garage.Qual    Garage.Cond    Garage.Type 
##            167             48             47             47             46 
##  Garage.Finish      Bsmt.Qual      Bsmt.Cond  Bsmt.Exposure BsmtFin.Type.1 
##             46             21             21             21             21 
## BsmtFin.Type.2   Mas.Vnr.Area   BsmtFin.SF.1   BsmtFin.SF.2    Bsmt.Unf.SF 
##             21              7              1              1              1

ames_train %>% filter(is.na(Garage.Qual) | is.na(Bsmt.Qual)) %>% summarise(Count=n())

## # A tibble: 1 x 1
##   Count
##   <int>
## 1    64

##

Investigating missing values count and their significance is a critical step in understanding this dataset. The above dataset summary shows that in this train data set of 1000 houses, Only 3 houses are equipped with a swimming pool. This can be justified by the cost of having a swimming pool an operating it. This tells us that the presence of a swimmingpool is likely not a relevant indicator of the house price and might actually not be a relevant predictor of house price. The same analysis applies to Miscellaneous features variable Misc.Feature, Alley, Fence and Fireplace.Qu.

Above summary also shows that there are 47 houses without a garage of with an unfinished garage and 21 houses without basement, filtering both criteria returns 64 houses with NA values related to those variables to be dealt with.

In order to avoid discarding all NAs while fitting the model, all NAs of categorical variables of this dataset will be investigated individually and based on their meaning will be set to another categorical variable that we will call “None”.

P1 <- ames_train %>% ggplot(aes(x=Lot.Area ,y=price))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of house  Price by Lot.Area Log Transformed")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P1

P2 <- ames_train %>% ggplot(aes(x=log(Lot.Area) ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of house  Price by Lot.Area Log Transformed")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P2

P3 <- ames_train %>% ggplot(aes(x=Overall.Qual ,y=price))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of house Price by Overall quality")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P3

P4 <- ames_train %>% ggplot(aes(x=Overall.Qual ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of Log of house Price by Overall quality")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P4

P5 <- ames_train %>% ggplot(aes(x=Bedroom.AbvGr ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of Log of house Price by Bedroom.AvgGr")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P5

P6 <- ames_train %>% ggplot(aes(x=log(area) ,y=log(price)))+geom_point(color="red")+geom_smooth(method = "lm", fill="green")+ labs(title="Scatter plot of Log of house Price by Log of area")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P6

P7 <- ames_train %>% ggplot(aes(x=Sale.Condition, y=log(price), fill=Sale.Condition))+geom_boxplot()+labs(title="Scatter plot of Log of house Price by Sale Condition")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P7

P8 <- ames_train %>% ggplot(aes(x=Bldg.Type, y=log(price), fill=Bldg.Type))+geom_boxplot()+labs(title="Scatter plot of Log of house Price by Building Type")+theme(axis.text.x = element_text(angle = 90, hjust = 1), title = element_text(colour = "blue") ,plot.title = element_text(hjust = 0.5, size = 8))

P8

Below two (02) plots show as well analysis of the relationship of house price and both Lot.Area and Oevrall.Qual. on the left scatter plots are done without log transformation of the price and it shows a high concentration of points around zero (0) making it difficult to really perceive the relationship of the Lot.Area to the price. on the right price and Lot.Area are log transformed, while the overal quality remains the same. we immediately see better the linear relationship between the price and Lot.Area and the line fitted for the Overall.Qual seems to be better as it crosses right in the middle of the dots. These two (02) variables seems to describe some linearity in their relationship to price.

2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.2.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from âames_trainâ and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).

Based on the exploratory data analysis performed, we believe the following 10 variables are the most significant predictor to the house price in this dataset:

Lot.Area, Neighborhood, Area, overall.Qual, Bedroom.AbvGr, Year.Built (that may be transformed into age), Presence of Basement or not (Bsmt.Qual), Garage.Cond, Sale.Condition, Bldg.Type

Some variables selected as part of the following model are based on result of data exploration performed above. For instance, from plotting performed above, it appears that Lot.Area, Neighborhood, Area, Overall.Qual, Year.Built, Sale.Condition seem to form a linear relationship with house price. The remaining variables were added as best guess to this model based on assumptions.

Given also the NA values analysis perfomed above, to avoid a silent discard of the NA values and based on the data dictionary definition of those NA, we decide to convert all NAs in categorical variables to a new factor level called “None” which will be considered while executing the model selection. This is only applicable to the Bsmt.Qual and the Garage.Cond variables which are the only categorical variables with NAs in this model.

NA values on Garage.Cond represents the fact that there is no Garage in the house

NA values on Bsmt.Qual represents the fact that there is no Basement in the house

prepare_data <- function(data=ames_train){
 
  
  # identifying variables with same values not to be considered in modeling since not significant to the study
  var_to_use <- data %>% summarise_all(funs(length(unique(.))))
  var_to_use <- names(data.frame(var_to_use)[,var_to_use>1])
  
  # select userful variables only for the model
  data <- data %>% select(var_to_use)
  
  # Deal with NA values 
  
  ## Seperate Numerical from Categorical variables
  ames_var_types <- split(names(data), sapply(data, function(x) paste(class(x), collapse = " ")))
  
  ## Deal with numerical variables NA values

  ames_train_int <- data %>% select(ames_var_types$integer)
  #ames_train_int[is.na(ames_train_int)] <- 0
  
  ## Dealing with categorical variables NA values
  ames_train_fac <- data %>% select(ames_var_types$factor)
  ames_train_fac <- sapply(ames_train_fac, as.character)
  ames_train_fac[is.na(ames_train_fac)] <- c("None")
  ames_train_fac[ames_train_fac==""] <- c("None")
  ames_train_fac <- data.frame(ames_train_fac)
  
  ## Merging both numerical and categoricals
  data <- cbind(ames_train_int, ames_train_fac)
  
  data <- data %>% mutate(House.Age= year(today()) - Year.Built)
  
  data$Year.Built <- NULL
  
  return(data)
}

ames_train <- prepare_data(data=ames_train)

The above script:

creates a new factor level “none” by setting all NA and empty values of all categorical variables in the dataset ames_train to the value “none”.

It also removes all categorical variables with only a single level.

Finally creates a new variable called House.Age which is numerical and removes the variable Year.Built variable

intuitive_model <- lm(formula = log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond  + Sale.Condition + Bldg.Type, data = ames_train)

summary(intuitive_model)

## 
## Call:
## lm(formula = log(price) ~ log(Lot.Area) + log(area) + Neighborhood + 
##     Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond + 
##     Sale.Condition + Bldg.Type, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37652 -0.07388  0.00082  0.07745  0.59956 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.5863353  0.2612093  29.043  < 2e-16 ***
## log(Lot.Area)          0.0881312  0.0159873   5.513 4.56e-08 ***
## log(area)              0.5151260  0.0272478  18.905  < 2e-16 ***
## NeighborhoodBlueste   -0.0171610  0.1014559  -0.169 0.865717    
## NeighborhoodBrDale    -0.1169465  0.0741136  -1.578 0.114913    
## NeighborhoodBrkSide   -0.0546212  0.0609345  -0.896 0.370271    
## NeighborhoodClearCr    0.0307487  0.0690191   0.446 0.656053    
## NeighborhoodCollgCr   -0.0696761  0.0528234  -1.319 0.187475    
## NeighborhoodCrawfor    0.0753758  0.0599847   1.257 0.209213    
## NeighborhoodEdwards   -0.1093875  0.0550902  -1.986 0.047365 *  
## NeighborhoodGilbert   -0.1342566  0.0548214  -2.449 0.014506 *  
## NeighborhoodGreens     0.1115216  0.0902853   1.235 0.217057    
## NeighborhoodGrnHill    0.4105934  0.1189072   3.453 0.000579 ***
## NeighborhoodIDOTRR    -0.2110818  0.0628662  -3.358 0.000817 ***
## NeighborhoodMeadowV   -0.0986541  0.0639489  -1.543 0.123237    
## NeighborhoodMitchel   -0.0339515  0.0559659  -0.607 0.544231    
## NeighborhoodNAmes     -0.0391561  0.0541415  -0.723 0.469725    
## NeighborhoodNoRidge    0.0300313  0.0582792   0.515 0.606463    
## NeighborhoodNPkVill    0.0017566  0.0924858   0.019 0.984851    
## NeighborhoodNridgHt    0.1128681  0.0528356   2.136 0.032917 *  
## NeighborhoodNWAmes    -0.0831152  0.0564491  -1.472 0.141246    
## NeighborhoodOldTown   -0.1047175  0.0604668  -1.732 0.083631 .  
## NeighborhoodSawyer    -0.0433053  0.0561918  -0.771 0.441094    
## NeighborhoodSawyerW   -0.1169313  0.0546608  -2.139 0.032673 *  
## NeighborhoodSomerst   -0.0006721  0.0505021  -0.013 0.989384    
## NeighborhoodStoneBr    0.1347873  0.0588866   2.289 0.022302 *  
## NeighborhoodSWISU     -0.1101904  0.0714777  -1.542 0.123503    
## NeighborhoodTimber     0.0185630  0.0616569   0.301 0.763427    
## NeighborhoodVeenker   -0.0428562  0.0701149  -0.611 0.541195    
## Overall.Qual           0.0845770  0.0065190  12.974  < 2e-16 ***
## Bedroom.AbvGr         -0.0441271  0.0086307  -5.113 3.84e-07 ***
## House.Age             -0.0031231  0.0003989  -7.830 1.30e-14 ***
## Bsmt.CondFa           -0.2072388  0.1130394  -1.833 0.067066 .  
## Bsmt.CondGd           -0.0698351  0.1101118  -0.634 0.526090    
## Bsmt.CondNone         -0.2799906  0.1130547  -2.477 0.013438 *  
## Bsmt.CondPo            0.0563993  0.1892885   0.298 0.765803    
## Bsmt.CondTA           -0.0834522  0.1077999  -0.774 0.439041    
## Garage.CondFa         -0.5011045  0.1561708  -3.209 0.001378 ** 
## Garage.CondGd         -0.2283402  0.1661071  -1.375 0.169562    
## Garage.CondNone       -0.3507137  0.1550106  -2.263 0.023891 *  
## Garage.CondPo         -0.3831565  0.1657232  -2.312 0.020989 *  
## Garage.CondTA         -0.2934632  0.1539917  -1.906 0.056990 .  
## Sale.ConditionAdjLand  0.2268656  0.1160550   1.955 0.050899 .  
## Sale.ConditionAlloca   0.2985431  0.0801409   3.725 0.000207 ***
## Sale.ConditionFamily  -0.0596883  0.0423659  -1.409 0.159199    
## Sale.ConditionNormal   0.1053826  0.0205939   5.117 3.75e-07 ***
## Sale.ConditionPartial  0.1331717  0.0286052   4.656 3.69e-06 ***
## Bldg.Type2fmCon        0.0038722  0.0362826   0.107 0.915030    
## Bldg.TypeDuplex       -0.0757817  0.0298963  -2.535 0.011410 *  
## Bldg.TypeTwnhs        -0.1454875  0.0398270  -3.653 0.000273 ***
## Bldg.TypeTwnhsE       -0.0804046  0.0255017  -3.153 0.001667 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.151 on 949 degrees of freedom
## Multiple R-squared:  0.8776, Adjusted R-squared:  0.8712 
## F-statistic: 136.1 on 50 and 949 DF,  p-value: < 2.2e-16

The model summary shows that all variables of the intuitive model are to the model based on the P-Value criteria. This is backed up by the R-squared value which is quite high 0.8723.

2.2.2 Section 2.2 Model Selection

Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?

We start on the basis of equi-probability of belonging to the final model for each variable of the initial model.

2.2.2.1 1. First model selection method: BIC with a uniform prior.

# Using the BIC

bic_model <- bas.lm(log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond  + Sale.Condition + Bldg.Type, data = ames_train, prior = "BIC", modelprior = uniform())

summary(bic_model)

##                       P(B != 0 | Y)    model 1       model 2       model 3
## Intercept                1.00000000     1.0000     1.0000000     1.0000000
## log(Lot.Area)            1.00000000     1.0000     1.0000000     1.0000000
## log(area)                1.00000000     1.0000     1.0000000     1.0000000
## NeighborhoodBlueste      0.02039545     0.0000     0.0000000     0.0000000
## NeighborhoodBrDale       0.06094873     0.0000     0.0000000     0.0000000
## NeighborhoodBrkSide      0.03184298     0.0000     0.0000000     0.0000000
## NeighborhoodClearCr      0.10514497     0.0000     0.0000000     0.0000000
## NeighborhoodCollgCr      0.42304918     1.0000     1.0000000     0.0000000
## NeighborhoodCrawfor      0.99176547     1.0000     1.0000000     1.0000000
## NeighborhoodEdwards      0.72424873     1.0000     1.0000000     1.0000000
## NeighborhoodGilbert      0.99821304     1.0000     1.0000000     1.0000000
## NeighborhoodGreens       0.09011289     0.0000     0.0000000     0.0000000
## NeighborhoodGrnHill      0.97251381     1.0000     1.0000000     1.0000000
## NeighborhoodIDOTRR       0.99793365     1.0000     1.0000000     1.0000000
## NeighborhoodMeadowV      0.07608799     0.0000     0.0000000     0.0000000
## NeighborhoodMitchel      0.02036033     0.0000     0.0000000     0.0000000
## NeighborhoodNAmes        0.05492988     0.0000     0.0000000     0.0000000
## NeighborhoodNoRidge      0.20083919     0.0000     0.0000000     0.0000000
## NeighborhoodNPkVill      0.02339286     0.0000     0.0000000     0.0000000
## NeighborhoodNridgHt      0.99961186     1.0000     1.0000000     1.0000000
## NeighborhoodNWAmes       0.20339322     0.0000     0.0000000     0.0000000
## NeighborhoodOldTown      0.20675032     0.0000     0.0000000     0.0000000
## NeighborhoodSawyer       0.02100306     0.0000     0.0000000     0.0000000
## NeighborhoodSawyerW      0.95281915     1.0000     1.0000000     1.0000000
## NeighborhoodSomerst      0.05676248     0.0000     0.0000000     0.0000000
## NeighborhoodStoneBr      0.97453751     1.0000     1.0000000     1.0000000
## NeighborhoodSWISU        0.02282877     0.0000     0.0000000     0.0000000
## NeighborhoodTimber       0.04166532     0.0000     0.0000000     0.0000000
## NeighborhoodVeenker      0.02700343     0.0000     0.0000000     0.0000000
## Overall.Qual             1.00000000     1.0000     1.0000000     1.0000000
## Bedroom.AbvGr            0.99999175     1.0000     1.0000000     1.0000000
## House.Age                1.00000000     1.0000     1.0000000     1.0000000
## Bsmt.CondFa              0.95993576     1.0000     1.0000000     1.0000000
## Bsmt.CondGd              0.02180072     0.0000     0.0000000     0.0000000
## Bsmt.CondNone            0.99997311     1.0000     1.0000000     1.0000000
## Bsmt.CondPo              0.02386114     0.0000     0.0000000     0.0000000
## Bsmt.CondTA              0.02667441     0.0000     0.0000000     0.0000000
## Garage.CondFa            0.99981400     1.0000     1.0000000     1.0000000
## Garage.CondGd            0.06148967     0.0000     0.0000000     0.0000000
## Garage.CondNone          0.27741195     0.0000     0.0000000     0.0000000
## Garage.CondPo            0.05564299     0.0000     0.0000000     0.0000000
## Garage.CondTA            0.09129333     0.0000     0.0000000     0.0000000
## Sale.ConditionAdjLand    0.05288864     0.0000     0.0000000     0.0000000
## Sale.ConditionAlloca     0.95959351     1.0000     1.0000000     1.0000000
## Sale.ConditionFamily     0.08199673     0.0000     0.0000000     0.0000000
## Sale.ConditionNormal     0.99999846     1.0000     1.0000000     1.0000000
## Sale.ConditionPartial    0.99998607     1.0000     1.0000000     1.0000000
## Bldg.Type2fmCon          0.02044057     0.0000     0.0000000     0.0000000
## Bldg.TypeDuplex          0.34352837     0.0000     1.0000000     0.0000000
## Bldg.TypeTwnhs           0.94048871     1.0000     1.0000000     1.0000000
## Bldg.TypeTwnhsE          0.68216647     1.0000     1.0000000     1.0000000
## BF                               NA     1.0000     0.7988378     0.7216875
## PostProbs                        NA     0.0115     0.0092000     0.0083000
## R2                               NA     0.8709     0.8717000     0.8699000
## dim                              NA    23.0000    24.0000000    22.0000000
## logmarg                          NA -1643.3158 -1643.5404087 -1643.6419743
##                             model 4       model 5
## Intercept                 1.0000000     1.0000000
## log(Lot.Area)             1.0000000     1.0000000
## log(area)                 1.0000000     1.0000000
## NeighborhoodBlueste       0.0000000     0.0000000
## NeighborhoodBrDale        0.0000000     0.0000000
## NeighborhoodBrkSide       0.0000000     0.0000000
## NeighborhoodClearCr       0.0000000     0.0000000
## NeighborhoodCollgCr       1.0000000     1.0000000
## NeighborhoodCrawfor       1.0000000     1.0000000
## NeighborhoodEdwards       1.0000000     1.0000000
## NeighborhoodGilbert       1.0000000     1.0000000
## NeighborhoodGreens        0.0000000     0.0000000
## NeighborhoodGrnHill       1.0000000     1.0000000
## NeighborhoodIDOTRR        1.0000000     1.0000000
## NeighborhoodMeadowV       0.0000000     0.0000000
## NeighborhoodMitchel       0.0000000     0.0000000
## NeighborhoodNAmes         0.0000000     0.0000000
## NeighborhoodNoRidge       0.0000000     0.0000000
## NeighborhoodNPkVill       0.0000000     0.0000000
## NeighborhoodNridgHt       1.0000000     1.0000000
## NeighborhoodNWAmes        1.0000000     1.0000000
## NeighborhoodOldTown       0.0000000     0.0000000
## NeighborhoodSawyer        0.0000000     0.0000000
## NeighborhoodSawyerW       1.0000000     1.0000000
## NeighborhoodSomerst       0.0000000     0.0000000
## NeighborhoodStoneBr       1.0000000     1.0000000
## NeighborhoodSWISU         0.0000000     0.0000000
## NeighborhoodTimber        0.0000000     0.0000000
## NeighborhoodVeenker       0.0000000     0.0000000
## Overall.Qual              1.0000000     1.0000000
## Bedroom.AbvGr             1.0000000     1.0000000
## House.Age                 1.0000000     1.0000000
## Bsmt.CondFa               1.0000000     1.0000000
## Bsmt.CondGd               0.0000000     0.0000000
## Bsmt.CondNone             1.0000000     1.0000000
## Bsmt.CondPo               0.0000000     0.0000000
## Bsmt.CondTA               0.0000000     0.0000000
## Garage.CondFa             1.0000000     1.0000000
## Garage.CondGd             0.0000000     0.0000000
## Garage.CondNone           0.0000000     0.0000000
## Garage.CondPo             0.0000000     0.0000000
## Garage.CondTA             0.0000000     0.0000000
## Sale.ConditionAdjLand     0.0000000     0.0000000
## Sale.ConditionAlloca      1.0000000     1.0000000
## Sale.ConditionFamily      0.0000000     0.0000000
## Sale.ConditionNormal      1.0000000     1.0000000
## Sale.ConditionPartial     1.0000000     1.0000000
## Bldg.Type2fmCon           0.0000000     0.0000000
## Bldg.TypeDuplex           1.0000000     0.0000000
## Bldg.TypeTwnhs            1.0000000     1.0000000
## Bldg.TypeTwnhsE           1.0000000     1.0000000
## BF                        0.5883681     0.5055826
## PostProbs                 0.0068000     0.0058000
## R2                        0.8725000     0.8716000
## dim                      25.0000000    24.0000000
## logmarg               -1643.8462137 -1643.9978551

image(bic_model, rotate = FALSE)

Using the BIC model selection, all variables from the initial model are included in the best model selected except the Garage.Cond variable. Selected variables are indicated by the value 1 of the Bayesian Factor and the posterior probability of belonging to the model is as well displayed. on average, the variable Bldg.Type appears to have the lowest posterior probability in the best model selected.

2.2.2.2 2. Second model selection method: AIC with a uniform prior

bas_model_AIC <- bas.lm(log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond + Garage.Cond  + Sale.Condition + Bldg.Type, data = ames_train, prior = "AIC", modelprior = uniform())

summary(bas_model_AIC)

##                       P(B != 0 | Y)    model 1       model 2       model 3
## Intercept                1.00000000     1.0000     1.0000000     1.0000000
## log(Lot.Area)            1.00000000     1.0000     1.0000000     1.0000000
## log(area)                1.00000000     1.0000     1.0000000     1.0000000
## NeighborhoodBlueste      0.12846933     0.0000     0.0000000     0.0000000
## NeighborhoodBrDale       0.67297799     1.0000     1.0000000     1.0000000
## NeighborhoodBrkSide      0.14410069     0.0000     0.0000000     0.0000000
## NeighborhoodClearCr      0.47423069     1.0000     1.0000000     1.0000000
## NeighborhoodCollgCr      0.99010283     1.0000     1.0000000     1.0000000
## NeighborhoodCrawfor      0.99959374     1.0000     1.0000000     1.0000000
## NeighborhoodEdwards      0.99895678     1.0000     1.0000000     1.0000000
## NeighborhoodGilbert      0.99999529     1.0000     1.0000000     1.0000000
## NeighborhoodGreens       0.70807406     1.0000     1.0000000     1.0000000
## NeighborhoodGrnHill      0.99981682     1.0000     1.0000000     1.0000000
## NeighborhoodIDOTRR       0.99999765     1.0000     1.0000000     1.0000000
## NeighborhoodMeadowV      0.69025188     1.0000     1.0000000     1.0000000
## NeighborhoodMitchel      0.14986329     0.0000     0.0000000     0.0000000
## NeighborhoodNAmes        0.19839750     0.0000     0.0000000     0.0000000
## NeighborhoodNoRidge      0.46099683     1.0000     0.0000000     1.0000000
## NeighborhoodNPkVill      0.07347804     0.0000     0.0000000     0.0000000
## NeighborhoodNridgHt      0.99999582     1.0000     1.0000000     1.0000000
## NeighborhoodNWAmes       0.95471751     1.0000     1.0000000     1.0000000
## NeighborhoodOldTown      0.92324730     1.0000     1.0000000     1.0000000
## NeighborhoodSawyer       0.12904437     0.0000     0.0000000     0.0000000
## NeighborhoodSawyerW      0.99986213     1.0000     1.0000000     1.0000000
## NeighborhoodSomerst      0.20027242     0.0000     0.0000000     0.0000000
## NeighborhoodStoneBr      0.99980097     1.0000     1.0000000     1.0000000
## NeighborhoodSWISU        0.36458743     1.0000     1.0000000     0.0000000
## NeighborhoodTimber       0.15043017     0.0000     0.0000000     0.0000000
## NeighborhoodVeenker      0.15642752     0.0000     0.0000000     0.0000000
## Overall.Qual             1.00000000     1.0000     1.0000000     1.0000000
## Bedroom.AbvGr            0.99999982     1.0000     1.0000000     1.0000000
## House.Age                1.00000000     1.0000     1.0000000     1.0000000
## Bsmt.CondFa              0.99885843     1.0000     1.0000000     1.0000000
## Bsmt.CondGd              0.15158859     0.0000     0.0000000     0.0000000
## Bsmt.CondNone            0.99995206     1.0000     1.0000000     1.0000000
## Bsmt.CondPo              0.19458851     0.0000     0.0000000     0.0000000
## Bsmt.CondTA              0.21737386     0.0000     0.0000000     0.0000000
## Garage.CondFa            0.99961403     1.0000     1.0000000     1.0000000
## Garage.CondGd            0.31311583     0.0000     0.0000000     0.0000000
## Garage.CondNone          0.98189573     1.0000     1.0000000     1.0000000
## Garage.CondPo            0.95173539     1.0000     1.0000000     1.0000000
## Garage.CondTA            0.80772557     1.0000     1.0000000     1.0000000
## Sale.ConditionAdjLand    0.79540444     1.0000     1.0000000     1.0000000
## Sale.ConditionAlloca     0.99982165     1.0000     1.0000000     1.0000000
## Sale.ConditionFamily     0.68989393     1.0000     1.0000000     1.0000000
## Sale.ConditionNormal     0.99999978     1.0000     1.0000000     1.0000000
## Sale.ConditionPartial    0.99999804     1.0000     1.0000000     1.0000000
## Bldg.Type2fmCon          0.05890929     0.0000     0.0000000     0.0000000
## Bldg.TypeDuplex          0.99082333     1.0000     1.0000000     1.0000000
## Bldg.TypeTwnhs           0.99964138     1.0000     1.0000000     1.0000000
## Bldg.TypeTwnhsE          0.99764979     1.0000     1.0000000     1.0000000
## BF                               NA     1.0000     0.9891761     0.9888365
## PostProbs                        NA     0.0002     0.0002000     0.0002000
## R2                               NA     0.8767     0.8764000     0.8764000
## dim                              NA    37.0000    36.0000000    36.0000000
## logmarg                          NA -1577.7603 -1577.7711587 -1577.7715020
##                             model 4       model 5
## Intercept                 1.0000000     1.0000000
## log(Lot.Area)             1.0000000     1.0000000
## log(area)                 1.0000000     1.0000000
## NeighborhoodBlueste       0.0000000     0.0000000
## NeighborhoodBrDale        1.0000000     1.0000000
## NeighborhoodBrkSide       0.0000000     0.0000000
## NeighborhoodClearCr       1.0000000     1.0000000
## NeighborhoodCollgCr       1.0000000     1.0000000
## NeighborhoodCrawfor       1.0000000     1.0000000
## NeighborhoodEdwards       1.0000000     1.0000000
## NeighborhoodGilbert       1.0000000     1.0000000
## NeighborhoodGreens        1.0000000     1.0000000
## NeighborhoodGrnHill       1.0000000     1.0000000
## NeighborhoodIDOTRR        1.0000000     1.0000000
## NeighborhoodMeadowV       1.0000000     1.0000000
## NeighborhoodMitchel       0.0000000     0.0000000
## NeighborhoodNAmes         0.0000000     0.0000000
## NeighborhoodNoRidge       0.0000000     1.0000000
## NeighborhoodNPkVill       0.0000000     0.0000000
## NeighborhoodNridgHt       1.0000000     1.0000000
## NeighborhoodNWAmes        1.0000000     1.0000000
## NeighborhoodOldTown       1.0000000     1.0000000
## NeighborhoodSawyer        0.0000000     0.0000000
## NeighborhoodSawyerW       1.0000000     1.0000000
## NeighborhoodSomerst       0.0000000     0.0000000
## NeighborhoodStoneBr       1.0000000     1.0000000
## NeighborhoodSWISU         0.0000000     0.0000000
## NeighborhoodTimber        0.0000000     0.0000000
## NeighborhoodVeenker       0.0000000     0.0000000
## Overall.Qual              1.0000000     1.0000000
## Bedroom.AbvGr             1.0000000     1.0000000
## House.Age                 1.0000000     1.0000000
## Bsmt.CondFa               1.0000000     1.0000000
## Bsmt.CondGd               0.0000000     0.0000000
## Bsmt.CondNone             1.0000000     1.0000000
## Bsmt.CondPo               0.0000000     0.0000000
## Bsmt.CondTA               0.0000000     0.0000000
## Garage.CondFa             1.0000000     1.0000000
## Garage.CondGd             0.0000000     0.0000000
## Garage.CondNone           1.0000000     1.0000000
## Garage.CondPo             1.0000000     1.0000000
## Garage.CondTA             1.0000000     1.0000000
## Sale.ConditionAdjLand     1.0000000     1.0000000
## Sale.ConditionAlloca      1.0000000     1.0000000
## Sale.ConditionFamily      1.0000000     0.0000000
## Sale.ConditionNormal      1.0000000     1.0000000
## Sale.ConditionPartial     1.0000000     1.0000000
## Bldg.Type2fmCon           0.0000000     0.0000000
## Bldg.TypeDuplex           1.0000000     1.0000000
## Bldg.TypeTwnhs            1.0000000     1.0000000
## Bldg.TypeTwnhsE           1.0000000     1.0000000
## BF                        0.9562441     0.9562177
## PostProbs                 0.0002000     0.0002000
## R2                        0.8762000     0.8762000
## dim                      35.0000000    35.0000000
## logmarg               -1577.8050178 -1577.8050454

image(bas_model_AIC, rotate = FALSE)

Using the AIC model selection, the model 1 selected contains all the variables from the initial model and looking at the image plot, we can see that the step AIC is more inclusive of the new variables created as a result of the categorical variables and their levels combinations. This is justified by the fact that the AIC model penalty for additional \(n\) parameters is set to \(k=2\) which in most cases is less than the one of the BIC for which \(k=log(n)\) with \(n\) being the number of observations. BIC is more parsimonious. We will stick to the BIC method during this analysis.

2.2.3 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.

Let us first refit the model selected by the BIC model and proceed with residuals analysis.

Plotting residuals helps verifying conditions for MLR. We will go throught the conditions for Multilinear Regressions.

Linearity condition

Nearly normality condition

selected_model_bic <- lm(formula = log(price) ~ log(Lot.Area) + log(area) +Neighborhood + Overall.Qual + Bedroom.AbvGr + House.Age + Bsmt.Cond  + Sale.Condition + Bldg.Type, data = ames_train)

hist(selected_model_bic$residuals, breaks = 100)

plot(selected_model_bic$residuals, type="p")

The scatter plot of residuals shows that observations are scattered around 0. We can also see that the histogram of residuals is centered around zero (0) which shows nearly normality. We can however see a left skewness of the plot showing presence of potential outliers. The theoretical quantiles plot shows some irregular linearlity showing as well the presence of outliers.

Constant variability condition

Independence condition

dev.off()

## null device 
##           1

plot(selected_model_bic)

## Warning: not plotting observations with leverage one:
##   516

## Warning: not plotting observations with leverage one:
##   516

plot(abs(selected_model_bic$residuals) ~ selected_model_bic$fitted.values)

q1 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Sale.Condition)) + geom_smooth(method = "lm", aes(color="red"))

q1

q2 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Pool.QC)) + geom_smooth(method = "lm", aes(color="red"))

q2

q3 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Bsmt.Cond)) + geom_smooth(method = "lm", aes(color="red"))

q3

q4 <- ggplot(data = ames_train, aes(x=selected_model_bic$fitted.values, y=log(ames_train$price)))+geom_point(aes(color=ames_train$Garage.Cond)) + geom_smooth(method = "lm", aes(color="red"))

q4

ames_train[names(head(sort(selected_model_bic$residuals), 10)),c("price","House.Age", "Sale.Condition", "Neighborhood", "Bsmt.Cond", "Garage.Cond", "Overall.Qual", "Overall.Cond", "Lot.Area", "Lot.Frontage")]

##      price House.Age Sale.Condition Neighborhood Bsmt.Cond Garage.Cond
## 428  12789        94        Abnorml      OldTown        Fa          Fa
## 310 184750        10        Partial      Edwards        TA          TA
## 741  40000        97         Normal       IDOTRR        TA          Fa
## 276 150000        40         Family      Veenker        TA          TA
## 181  82500        40         Family       NWAmes        TA          TA
## 559  34900        97        Abnorml       IDOTRR        TA        None
## 511  63000        83         Normal      Edwards        TA          Po
## 80   67500        97         Normal      SawyerW        TA          TA
## 233  97500        69        Abnorml        NAmes        TA          TA
## 820 104000        66         Normal      Crawfor        TA          TA
##     Overall.Qual Overall.Cond Lot.Area Lot.Frontage
## 428            2            2     9656           68
## 310           10            5    40094          130
## 741            4            4     8500           50
## 276            9            3    24572           NA
## 181            7            5    11900           85
## 559            4            5     7879           60
## 511            5            3     9780           60
## 80             5            5     9800           70
## 233            6            5    17503           78
## 820            6            5     7801           79

From the above plot, we can see that several observations are appearing to be outliers to the rest of observation. Investigating them deeper taking into account their relationship to variables within the current model and we observed that among the top 10 observations with the highest residuals:

6/10 of them were not sold in normal Sale Condition including the cheapest house of the dataset

2/10 have an inexistant or a poor Garage condition

5/10 of them have an Overall Quality rating of Average or lower

All of them have an overal condition of Average or lower

7/10 are more than 65 years old and the 5 cheapest are more than 80 years old.

2.2.4 Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).

The RMSE is calculated below in dollar for our model. We use the predict function for in-sample prediction based on our initial inituive model. Since the price was log transformed, we use exponential function to return the prediction to dollar values before calculating the residuals and then calculate the RMSE of the model.

predict_intuitive <- exp(predict(selected_model_bic))
resid_intuitive <- ames_train$price - predict_intuitive

rmse_intuitive <- sqrt(mean(resid_intuitive^2))
rmse_intuitive

## [1] 29984.45

2.2.5 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called âoverfitting.â To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?

While attempting to calculate the RMSE, we noticed that the test dataset provided has some differences which prevented the calculation to be straight forward. For instance, all observations in the ames_test dataset have:

All sale Condition set to normal which simply causes the variable Sale.Condition to become an insignificant predictor

Categorical variables of observations selected do not contain all levels as in the train dataset.

The new variable House.Age is not present in the test dataset and the Year.Built Variable was still present

To address all the above we:

We reconciled the levels of all categorical variables by using a union statement on the levels of the model and the one of the ames_test dataset.

We treated the variables with the function prepare_data above which removes NA and replaced with th new level none, replaced the variable Year.Built by the variable House.Age as in the trian dataset

The out-sample RMSE was then calculated for this linear regression and the value \(191810.5\) was found for it. Comparing it to the in-sample RMSE, we can clearly see that the out-sample prediction RMSE & the in-sample prediction RMSE \(29984.45\) difference is too large for this model.

This may be due to possible outliers in the dataset. Several potential outliying observations were taken note of in the previous model analysis.

Additionally, the model generally fits better within the sample it was built on (in-sample) and poorly with a different sample (out-sample). Further investigations and finetuning are required to improve this model.

# Calculating the RMSE value of the intuitive model based on the ames_test dataset
ames_test_new <- prepare_data(data=ames_test)

# restoring the Sale.Condition removed in the variable treatment
ames_test_new$Sale.Condition <- ames_test$Sale.Condition

# Reconcilling the levels of the model with those of the train dataset to avoid errors while running the predict function. 
var_list_bic <- names(selected_model_bic$xlevels)
selected_model_bic$xlevels <- sapply(var_list_bic, function(x) union(selected_model_bic$xlevels[[x]], levels(ames_test_new[[x]])))

predict_outsample <- exp(predict(selected_model_bic, newdata=ames_test_new))

## Warning in predict.lm(selected_model_bic, newdata = ames_test_new):
## prediction from a rank-deficient fit may be misleading

resid_outsample <- ames_test$price - predict_outsample

rmse_outsample <- sqrt(mean(resid_outsample^2, na.rm = TRUE))
rmse_outsample

## [1] 191810.5

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

2.3 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

2.3.1 Section 3.1 Final Model

Provide the summary table for your model.

The final model extracted from the full model using the BIC method returns a single variable Overall.Qual with an R-square of roughtly 0.70 which would mean the Overall.Qual on its own explains 70% of the house price in this dataset. However, based on our EDA, we know that other variables such as the Neiborhood, the Lot.Area and others are statistically significant predictors of the house price. Hence we will use the AIC method in this case which is more inclusive to selecte a model with a little bit more variables than the currently returned one.

full_model_lm <- lm(formula = log(price)~., data = na.omit(ames_train))

n <- nrow(na.omit(ames_train))

final_model_BIC <- stepAIC(full_model_lm, direction = "backward", k=n, trace = 0, steps = 1000)

print("Final model based on BIC")

## [1] "Final model based on BIC"

summary(final_model_BIC)

## 
## Call:
## lm(formula = log(price) ~ Overall.Qual, data = na.omit(ames_train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.53757 -0.11989  0.01186  0.13492  0.94285 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.494269   0.037090  282.94   <2e-16 ***
## Overall.Qual  0.249820   0.005859   42.64   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2338 on 779 degrees of freedom
## Multiple R-squared:  0.7001, Adjusted R-squared:  0.6997 
## F-statistic:  1818 on 1 and 779 DF,  p-value: < 2.2e-16

# This function will do the automatic model selection and yield the best model based on selection criteria

final_model_AIC <- stepAIC(full_model_lm, direction = "backward", k=2, trace = 0, steps = 1000)

print("Final model based on AIC")

## [1] "Final model based on AIC"

summary(final_model_AIC)

## 
## Call:
## lm(formula = log(price) ~ area + Lot.Frontage + Overall.Qual + 
##     Overall.Cond + Year.Remod.Add + BsmtFin.SF.1 + BsmtFin.SF.2 + 
##     X1st.Flr.SF + Bsmt.Full.Bath + Kitchen.AbvGr + Fireplaces + 
##     Garage.Cars + Enclosed.Porch + X3Ssn.Porch + Screen.Porch + 
##     MS.Zoning + Lot.Shape + Lot.Config + Land.Slope + Neighborhood + 
##     Condition.1 + Condition.2 + Bldg.Type + Exterior.2nd + Exter.Qual + 
##     Exter.Cond + Foundation + Bsmt.Qual + Bsmt.Exposure + Heating + 
##     Central.Air + Kitchen.Qual + Functional + Garage.Type + Garage.Qual + 
##     Garage.Cond + Sale.Type + Sale.Condition + House.Age, data = na.omit(ames_train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.81141 -0.05332  0.00198  0.04965  0.66353 
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.930e+00  7.274e-01  10.901  < 2e-16 ***
## area                   2.903e-04  1.553e-05  18.688  < 2e-16 ***
## Lot.Frontage          -5.980e-04  2.732e-04  -2.189 0.028980 *  
## Overall.Qual           4.060e-02  6.312e-03   6.431 2.47e-10 ***
## Overall.Cond           4.451e-02  5.639e-03   7.892 1.28e-14 ***
## Year.Remod.Add         8.696e-04  3.512e-04   2.476 0.013529 *  
## BsmtFin.SF.1           8.113e-05  1.497e-05   5.420 8.45e-08 ***
## BsmtFin.SF.2           4.458e-05  2.926e-05   1.523 0.128129    
## X1st.Flr.SF            7.318e-05  1.895e-05   3.863 0.000124 ***
## Bsmt.Full.Bath         2.586e-02  1.166e-02   2.217 0.026979 *  
## Kitchen.AbvGr          1.304e-01  4.717e-02   2.765 0.005860 ** 
## Fireplaces             1.656e-02  8.909e-03   1.859 0.063511 .  
## Garage.Cars            3.261e-02  1.036e-02   3.147 0.001728 ** 
## Enclosed.Porch         1.414e-04  8.342e-05   1.694 0.090656 .  
## X3Ssn.Porch            2.846e-04  1.285e-04   2.214 0.027152 *  
## Screen.Porch           1.704e-04  7.717e-05   2.209 0.027553 *  
## MS.ZoningFV            3.616e-01  6.987e-02   5.175 3.05e-07 ***
## MS.ZoningRH            4.993e-01  7.965e-02   6.269 6.68e-10 ***
## MS.ZoningRL            4.566e-01  6.044e-02   7.553 1.46e-13 ***
## MS.ZoningRM            3.617e-01  5.548e-02   6.520 1.42e-10 ***
## Lot.ShapeIR2           2.876e-02  2.793e-02   1.030 0.303529    
## Lot.ShapeIR3           3.369e-01  7.783e-02   4.329 1.73e-05 ***
## Lot.ShapeReg          -3.381e-03  1.068e-02  -0.317 0.751656    
## Lot.ConfigCulDSac     -3.945e-02  2.542e-02  -1.552 0.121121    
## Lot.ConfigFR2         -5.461e-02  2.646e-02  -2.064 0.039412 *  
## Lot.ConfigFR3         -1.396e-01  6.661e-02  -2.095 0.036536 *  
## Lot.ConfigInside      -3.436e-02  1.183e-02  -2.904 0.003815 ** 
## Land.SlopeMod         -1.417e-02  2.464e-02  -0.575 0.565489    
## Land.SlopeSev         -2.065e-01  9.578e-02  -2.156 0.031482 *  
## NeighborhoodBlueste    6.961e-02  8.468e-02   0.822 0.411377    
## NeighborhoodBrDale     7.097e-02  7.033e-02   1.009 0.313254    
## NeighborhoodBrkSide    1.326e-01  6.125e-02   2.165 0.030721 *  
## NeighborhoodClearCr    1.007e-01  7.317e-02   1.377 0.169015    
## NeighborhoodCollgCr   -5.692e-03  4.752e-02  -0.120 0.904687    
## NeighborhoodCrawfor    1.815e-01  5.721e-02   3.172 0.001583 ** 
## NeighborhoodEdwards   -4.248e-03  5.157e-02  -0.082 0.934381    
## NeighborhoodGilbert    5.458e-03  4.960e-02   0.110 0.912414    
## NeighborhoodGreens     9.118e-02  8.161e-02   1.117 0.264313    
## NeighborhoodIDOTRR     7.521e-02  6.823e-02   1.102 0.270766    
## NeighborhoodMeadowV   -4.861e-02  7.184e-02  -0.677 0.498854    
## NeighborhoodMitchel    3.263e-02  5.206e-02   0.627 0.530971    
## NeighborhoodNAmes      2.700e-03  5.119e-02   0.053 0.957945    
## NeighborhoodNoRidge    6.638e-02  5.465e-02   1.215 0.224984    
## NeighborhoodNPkVill    1.007e-01  1.185e-01   0.849 0.395919    
## NeighborhoodNridgHt    1.170e-01  4.685e-02   2.496 0.012795 *  
## NeighborhoodNWAmes    -2.596e-02  5.336e-02  -0.487 0.626707    
## NeighborhoodOldTown    1.619e-02  6.243e-02   0.259 0.795415    
## NeighborhoodSawyer     7.070e-02  5.316e-02   1.330 0.183993    
## NeighborhoodSawyerW   -1.085e-02  4.948e-02  -0.219 0.826487    
## NeighborhoodSomerst    1.796e-01  5.343e-02   3.361 0.000823 ***
## NeighborhoodStoneBr    1.730e-01  5.168e-02   3.348 0.000862 ***
## NeighborhoodSWISU      3.746e-02  6.515e-02   0.575 0.565494    
## NeighborhoodTimber     5.207e-02  5.404e-02   0.964 0.335627    
## NeighborhoodVeenker    1.007e-01  6.162e-02   1.634 0.102739    
## Condition.1Feedr      -3.234e-02  3.992e-02  -0.810 0.418148    
## Condition.1Norm        5.170e-02  3.352e-02   1.543 0.123437    
## Condition.1PosA        9.648e-02  5.832e-02   1.654 0.098573 .  
## Condition.1PosN        1.059e-01  5.822e-02   1.819 0.069392 .  
## Condition.1RRAe        1.244e-02  5.371e-02   0.232 0.816940    
## Condition.1RRAn       -2.675e-02  5.184e-02  -0.516 0.606011    
## Condition.2Feedr       2.724e-01  1.478e-01   1.843 0.065856 .  
## Condition.2Norm        3.870e-01  1.395e-01   2.774 0.005703 ** 
## Condition.2PosA        3.196e-01  1.868e-01   1.711 0.087596 .  
## Condition.2PosN       -5.597e-01  1.745e-01  -3.207 0.001407 ** 
## Condition.2RRNn        3.974e-01  1.830e-01   2.172 0.030241 *  
## Bldg.Type2fmCon       -1.569e-02  4.244e-02  -0.370 0.711766    
## Bldg.TypeDuplex       -1.703e-01  4.793e-02  -3.553 0.000409 ***
## Bldg.TypeTwnhs        -1.589e-01  3.275e-02  -4.852 1.54e-06 ***
## Bldg.TypeTwnhsE       -9.328e-02  2.261e-02  -4.126 4.18e-05 ***
## Exterior.2ndBrk Cmn    2.037e-01  1.382e-01   1.474 0.141029    
## Exterior.2ndBrkFace    3.468e-01  5.114e-02   6.781 2.71e-11 ***
## Exterior.2ndCmentBd    2.493e-01  5.390e-02   4.625 4.53e-06 ***
## Exterior.2ndHdBoard    2.538e-01  4.651e-02   5.458 6.87e-08 ***
## Exterior.2ndImStucc    2.195e-01  6.187e-02   3.548 0.000416 ***
## Exterior.2ndMetalSd    3.001e-01  4.487e-02   6.687 4.95e-11 ***
## Exterior.2ndPlywood    2.429e-01  4.804e-02   5.058 5.55e-07 ***
## Exterior.2ndStucco     2.968e-01  5.731e-02   5.178 3.00e-07 ***
## Exterior.2ndVinylSd    2.798e-01  4.586e-02   6.102 1.81e-09 ***
## Exterior.2ndWd Sdng    3.021e-01  4.597e-02   6.572 1.03e-10 ***
## Exterior.2ndWd Shng    2.706e-01  4.989e-02   5.424 8.26e-08 ***
## Exter.QualFa           1.002e-01  5.848e-02   1.713 0.087264 .  
## Exter.QualGd          -3.819e-02  2.893e-02  -1.320 0.187315    
## Exter.QualTA          -6.860e-02  3.345e-02  -2.051 0.040715 *  
## Exter.CondFa          -1.436e-01  7.869e-02  -1.825 0.068393 .  
## Exter.CondGd           3.875e-04  6.688e-02   0.006 0.995379    
## Exter.CondTA           2.412e-02  6.584e-02   0.366 0.714229    
## FoundationCBlock       1.469e-02  2.012e-02   0.730 0.465493    
## FoundationPConc        5.204e-02  2.168e-02   2.400 0.016679 *  
## FoundationSlab         4.752e-02  7.795e-02   0.610 0.542357    
## FoundationStone        2.274e-02  6.921e-02   0.328 0.742646    
## Bsmt.QualFa           -1.756e-01  3.913e-02  -4.487 8.54e-06 ***
## Bsmt.QualGd           -5.573e-02  2.061e-02  -2.704 0.007041 ** 
## Bsmt.QualNone         -1.492e-01  1.261e-01  -1.184 0.237041    
## Bsmt.QualPo            9.760e-01  1.844e-01   5.293 1.65e-07 ***
## Bsmt.QualTA           -8.146e-02  2.665e-02  -3.057 0.002330 ** 
## Bsmt.ExposureGd        5.316e-02  1.922e-02   2.766 0.005831 ** 
## Bsmt.ExposureMn       -3.019e-02  1.842e-02  -1.639 0.101605    
## Bsmt.ExposureNo       -2.574e-02  1.284e-02  -2.004 0.045448 *  
## Bsmt.ExposureNone     -7.072e-02  1.083e-01  -0.653 0.513844    
## HeatingGasW            1.701e-01  5.564e-02   3.056 0.002334 ** 
## HeatingGrav            3.124e-01  1.576e-01   1.983 0.047818 *  
## HeatingWall            1.269e-01  1.252e-01   1.014 0.311036    
## Central.AirY           1.692e-01  2.992e-02   5.656 2.34e-08 ***
## Kitchen.QualFa        -9.575e-02  4.358e-02  -2.197 0.028382 *  
## Kitchen.QualGd        -3.295e-02  2.350e-02  -1.402 0.161397    
## Kitchen.QualPo         2.561e-01  1.451e-01   1.764 0.078136 .  
## Kitchen.QualTA        -5.048e-02  2.663e-02  -1.896 0.058441 .  
## FunctionalMaj2         2.617e-02  1.247e-01   0.210 0.833866    
## FunctionalMin1         4.461e-02  8.103e-02   0.550 0.582197    
## FunctionalMin2         8.289e-02  7.833e-02   1.058 0.290368    
## FunctionalMod          5.267e-02  8.112e-02   0.649 0.516385    
## FunctionalSal         -2.650e-01  1.558e-01  -1.701 0.089496 .  
## FunctionalTyp          1.222e-01  7.334e-02   1.666 0.096249 .  
## Garage.TypeAttchd      3.516e-02  4.443e-02   0.791 0.429055    
## Garage.TypeBasment    -2.603e-02  6.703e-02  -0.388 0.697870    
## Garage.TypeBuiltIn    -3.429e-02  4.833e-02  -0.710 0.478217    
## Garage.TypeDetchd      1.337e-02  4.555e-02   0.293 0.769302    
## Garage.QualFa         -1.400e-01  1.149e-01  -1.219 0.223333    
## Garage.QualGd         -9.882e-02  1.274e-01  -0.776 0.438054    
## Garage.QualPo         -7.601e-01  1.756e-01  -4.328 1.75e-05 ***
## Garage.QualTA         -1.485e-01  1.126e-01  -1.318 0.187845    
## Garage.CondFa         -1.207e-01  3.542e-02  -3.407 0.000698 ***
## Garage.CondGd          8.017e-02  7.719e-02   1.039 0.299404    
## Garage.CondPo          2.909e-01  6.710e-02   4.336 1.69e-05 ***
## Garage.CondTA                 NA         NA      NA       NA    
## Sale.TypeCon           3.059e-02  5.939e-02   0.515 0.606642    
## Sale.TypeConLD         1.235e-01  6.102e-02   2.024 0.043427 *  
## Sale.TypeConLI        -7.701e-02  7.441e-02  -1.035 0.301046    
## Sale.TypeConLw         1.622e-02  5.447e-02   0.298 0.766008    
## Sale.TypeCWD           6.867e-02  7.013e-02   0.979 0.327828    
## Sale.TypeNew          -6.809e-02  7.743e-02  -0.879 0.379531    
## Sale.TypeOth           1.952e-01  8.477e-02   2.303 0.021583 *  
## Sale.TypeVWD          -2.004e-02  1.181e-01  -0.170 0.865330    
## Sale.TypeWD           -4.421e-03  2.731e-02  -0.162 0.871457    
## Sale.ConditionAlloca   1.446e-01  8.545e-02   1.692 0.091059 .  
## Sale.ConditionFamily   5.949e-03  3.708e-02   0.160 0.872575    
## Sale.ConditionNormal   7.821e-02  1.881e-02   4.157 3.66e-05 ***
## Sale.ConditionPartial  1.608e-01  7.317e-02   2.198 0.028320 *  
## House.Age             -2.179e-03  4.728e-04  -4.608 4.90e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1064 on 643 degrees of freedom
## Multiple R-squared:  0.9487, Adjusted R-squared:  0.9378 
## F-statistic: 86.81 on 137 and 643 DF,  p-value: < 2.2e-16

From the StepAIC model we can see that the following variables Condition.1, Exter.Qual, Kitchen.Qual, Functional, Garage.Type, Lot.Frontage are statiscally insignificant based on their P-value and the default used significant level of 5%.

By removing the insignificant variables and refitting the model several times we get to the following model we end up with the below model that has an R-squared of 0.8947 which tells us all variables included explains 89.47% of the house price in this datase.

full_model_lm2 <- lm(formula = log(price) ~ log(area) + Overall.Qual + 
    X1st.Flr.SF + Kitchen.AbvGr +  
    Garage.Cars + Enclosed.Porch +  Screen.Porch + 
    MS.Zoning + Land.Slope + Neighborhood + 
    Bldg.Type + Exterior.2nd +  
    Bsmt.Qual + Central.Air + Sale.Condition + House.Age, data = na.omit(ames_train))

final_model_AIC2 <- stepAIC(full_model_lm2, direction = "backward", k=2, trace = 0, steps = 1000)

summary(final_model_AIC2)

## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + X1st.Flr.SF + 
##     Kitchen.AbvGr + Garage.Cars + Enclosed.Porch + Screen.Porch + 
##     MS.Zoning + Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + 
##     Central.Air + Sale.Condition + House.Age, data = na.omit(ames_train))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.19931 -0.06467 -0.00153  0.07681  0.60390 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            8.015e+00  1.954e-01  41.019  < 2e-16 ***
## log(area)              3.581e-01  2.581e-02  13.877  < 2e-16 ***
## Overall.Qual           6.522e-02  7.468e-03   8.733  < 2e-16 ***
## X1st.Flr.SF            1.350e-04  2.058e-05   6.559 1.04e-10 ***
## Kitchen.AbvGr          8.835e-02  5.786e-02   1.527 0.127204    
## Garage.Cars            4.299e-02  1.234e-02   3.484 0.000524 ***
## Enclosed.Porch         2.745e-04  1.034e-04   2.654 0.008141 ** 
## Screen.Porch           3.121e-04  9.708e-05   3.214 0.001366 ** 
## MS.ZoningFV            2.511e-01  8.501e-02   2.954 0.003238 ** 
## MS.ZoningRH            2.246e-01  9.905e-02   2.268 0.023631 *  
## MS.ZoningRL            2.660e-01  7.325e-02   3.632 0.000302 ***
## MS.ZoningRM            1.947e-01  6.743e-02   2.887 0.004001 ** 
## NeighborhoodBlueste    8.588e-02  1.086e-01   0.791 0.429272    
## NeighborhoodBrDale     7.253e-02  8.923e-02   0.813 0.416611    
## NeighborhoodBrkSide    1.105e-01  7.559e-02   1.462 0.144207    
## NeighborhoodClearCr    1.523e-01  8.993e-02   1.693 0.090863 .  
## NeighborhoodCollgCr    4.454e-02  6.156e-02   0.724 0.469558    
## NeighborhoodCrawfor    2.423e-01  7.139e-02   3.395 0.000725 ***
## NeighborhoodEdwards    2.050e-02  6.480e-02   0.316 0.751883    
## NeighborhoodGilbert    2.608e-02  6.356e-02   0.410 0.681688    
## NeighborhoodGreens     2.453e-01  1.026e-01   2.392 0.016995 *  
## NeighborhoodIDOTRR     1.559e-02  8.320e-02   0.187 0.851395    
## NeighborhoodMeadowV   -3.530e-02  9.255e-02  -0.381 0.703044    
## NeighborhoodMitchel    8.910e-02  6.597e-02   1.351 0.177276    
## NeighborhoodNAmes      5.776e-02  6.432e-02   0.898 0.369470    
## NeighborhoodNoRidge    1.672e-01  6.951e-02   2.405 0.016404 *  
## NeighborhoodNPkVill    7.785e-02  1.570e-01   0.496 0.620156    
## NeighborhoodNridgHt    1.884e-01  6.027e-02   3.126 0.001844 ** 
## NeighborhoodNWAmes     1.324e-02  6.748e-02   0.196 0.844460    
## NeighborhoodOldTown    7.102e-02  7.753e-02   0.916 0.360008    
## NeighborhoodSawyer     8.171e-02  6.727e-02   1.215 0.224849    
## NeighborhoodSawyerW    2.703e-02  6.366e-02   0.425 0.671259    
## NeighborhoodSomerst    1.218e-01  6.812e-02   1.788 0.074248 .  
## NeighborhoodStoneBr    2.382e-01  6.626e-02   3.595 0.000347 ***
## NeighborhoodSWISU      3.565e-02  8.204e-02   0.435 0.663973    
## NeighborhoodTimber     1.280e-01  6.966e-02   1.837 0.066595 .  
## NeighborhoodVeenker    1.860e-01  7.797e-02   2.385 0.017333 *  
## Bldg.Type2fmCon       -1.973e-03  5.174e-02  -0.038 0.969597    
## Bldg.TypeDuplex       -1.791e-01  6.008e-02  -2.981 0.002968 ** 
## Bldg.TypeTwnhs        -1.732e-01  3.854e-02  -4.494 8.14e-06 ***
## Bldg.TypeTwnhsE       -1.019e-01  2.550e-02  -3.997 7.09e-05 ***
## Exterior.2ndBrk Cmn    2.602e-01  1.812e-01   1.436 0.151531    
## Exterior.2ndBrkFace    3.553e-01  6.463e-02   5.497 5.37e-08 ***
## Exterior.2ndCmentBd    3.020e-01  6.674e-02   4.526 7.05e-06 ***
## Exterior.2ndHdBoard    2.788e-01  5.794e-02   4.811 1.83e-06 ***
## Exterior.2ndImStucc    2.708e-01  7.892e-02   3.432 0.000634 ***
## Exterior.2ndMetalSd    3.243e-01  5.583e-02   5.809 9.46e-09 ***
## Exterior.2ndPlywood    2.471e-01  5.962e-02   4.144 3.82e-05 ***
## Exterior.2ndStucco     4.174e-01  7.059e-02   5.913 5.19e-09 ***
## Exterior.2ndVinylSd    2.996e-01  5.688e-02   5.267 1.84e-07 ***
## Exterior.2ndWd Sdng    3.110e-01  5.752e-02   5.407 8.75e-08 ***
## Exterior.2ndWd Shng    3.173e-01  6.212e-02   5.108 4.18e-07 ***
## Bsmt.QualFa           -2.077e-01  4.643e-02  -4.473 8.96e-06 ***
## Bsmt.QualGd           -9.612e-02  2.366e-02  -4.062 5.41e-05 ***
## Bsmt.QualNone         -2.993e-01  5.225e-02  -5.729 1.49e-08 ***
## Bsmt.QualPo            2.508e-01  1.618e-01   1.550 0.121482    
## Bsmt.QualTA           -1.466e-01  3.158e-02  -4.644 4.07e-06 ***
## Central.AirY           2.411e-01  2.998e-02   8.041 3.66e-15 ***
## Sale.ConditionAlloca   1.238e-01  1.096e-01   1.130 0.258966    
## Sale.ConditionFamily  -6.395e-02  4.560e-02  -1.403 0.161192    
## Sale.ConditionNormal   7.469e-02  2.294e-02   3.255 0.001187 ** 
## Sale.ConditionPartial  9.074e-02  3.026e-02   2.998 0.002807 ** 
## House.Age             -2.582e-03  4.994e-04  -5.170 3.04e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1431 on 718 degrees of freedom
## Multiple R-squared:  0.8965, Adjusted R-squared:  0.8875 
## F-statistic: 100.3 on 62 and 718 DF,  p-value: < 2.2e-16

The above is the summary table of the final model selected

2.3.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.

Variables price, area and Lot.Area were log transformed to allow a better fit of the model.

This is well illustrated below using simple linear regression model between the response variable price and the transformed variable and see the goodness of fit of the models (Adjusted R-squared) compaired to the model without transformation.

Looking at the plots below, we clearly see that for the relationship between price & area, the linearity of the relationship is more visible when calculating the log of both the price and the area. The same applies to the linearity between price and Lot.Area.

Plot variation from left to right shows linearity improvement. This means the Lot.Area that was removed from the automaticmodel slection will be reinserted during the next step of this analysis.

# This is done by testing the Adjusted R-squared value of linear models with and without the log transformation of area
# Justifying plotting of price, are and Lot.Area

dev.off()

## null device 
##           1

plot(price ~area, data = ames_train, col="blue")
plot(log(price) ~area, data = ames_train, col="red")
plot(log(price) ~log(area), data = ames_train, col="green")
plot(price ~Lot.Area, data = ames_train, col="blue")
plot(log(price) ~Lot.Area, data = ames_train, col="red")
plot(log(price) ~log(Lot.Area), data = ames_train, col="green")


summary(lm(formula= log(price)~area, data = ames_train))$adj.r.squared < summary(lm(formula= log(price)~log(area), data = ames_train))$adj.r.squared

## [1] TRUE

print("TRUE. It shows that log transformation of area is necessary for both price and area.")

## [1] "TRUE. It shows that log transformation of area is necessary for both price and area."

summary(lm(formula= log(price)~Lot.Area, data = ames_train))$adj.r.squared < summary(lm(formula= log(price)~log(Lot.Area), data = ames_train))$adj.r.squared

## [1] TRUE

print("TRUE. It shows that log transformation of area is necessary for both price and Lot.Area.")

## [1] "TRUE. It shows that log transformation of area is necessary for both price and Lot.Area."

ames_train %>% filter(price>500000)

##         PID area  price MS.SubClass Lot.Frontage Lot.Area Overall.Qual
## 1 528164060 2470 615000          20          106    12720           10
## 2 528118050 2290 500067          20           59    17169           10
## 3 527214060 2698 535000          60           82    16052           10
## 4 528150070 2364 611657          20          100    12919            9
## 5 527216080 2338 591587          20           52    51974            9
## 6 527216070 3279 538000          60           47    53504            8
##   Overall.Cond Year.Remod.Add Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2
## 1            5           2003          680         2257            0
## 2            5           2007          970         1684            0
## 3            5           2006          734         1206            0
## 4            5           2010          760         2188            0
## 5            5           2007          710         1101            0
## 6            5           2003          603         1416            0
##   Bsmt.Unf.SF Total.Bsmt.SF X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
## 1         278          2535        2470           0               0
## 2         636          2320        2290           0               0
## 3         644          1850        1850         848               0
## 4         142          2330        2364           0               0
## 5        1559          2660        2338           0               0
## 6         234          1650        1690        1589               0
##   Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath Bedroom.AbvGr
## 1              2              0         1         1             1
## 2              2              0         2         1             2
## 3              1              0         2         1             4
## 4              1              0         2         1             2
## 5              1              0         2         1             4
## 6              1              0         3         1             4
##   Kitchen.AbvGr TotRms.AbvGrd Fireplaces Garage.Yr.Blt Garage.Cars
## 1             1             7          2          2003           3
## 2             1             7          1          2007           3
## 3             1            11          1          2006           3
## 4             1            11          2          2009           3
## 5             1             8          2          2005           3
## 6             1            12          1          2003           3
##   Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
## 1         789          154            65              0           0
## 2        1174          192            30              0           0
## 3         736          250             0              0           0
## 4         820            0            67              0           0
## 5        1110            0           135              0           0
## 6         841          503            36              0           0
##   Screen.Porch Pool.Area Misc.Val Mo.Sold Yr.Sold MS.Zoning Street Alley
## 1          216       144        0       2    2008        RL   Pave  None
## 2            0         0        0       8    2007        RL   Pave  None
## 3            0         0        0       7    2006        RL   Pave  None
## 4            0         0        0       3    2010        RL   Pave  None
## 5          322         0        0       6    2007        RL   Pave  None
## 6          210         0        0       6    2010        RL   Pave  None
##   Lot.Shape Land.Contour Lot.Config Land.Slope Neighborhood Condition.1
## 1       Reg          HLS     Inside        Mod      NridgHt        Norm
## 2       IR2          Lvl    CulDSac        Gtl      NridgHt        Norm
## 3       IR1          Lvl    CulDSac        Gtl      StoneBr        Norm
## 4       IR1          Lvl     Inside        Gtl      NridgHt        Norm
## 5       IR1          Lvl    CulDSac        Gtl      StoneBr        PosN
## 6       IR2          HLS    CulDSac        Mod      StoneBr        Norm
##   Condition.2 Bldg.Type House.Style Roof.Style Roof.Matl Exterior.1st
## 1        Norm      1Fam      1Story        Hip   CompShg      MetalSd
## 2        Norm      1Fam      1Story        Hip   CompShg      CemntBd
## 3        Norm      1Fam      2Story        Hip   CompShg      VinylSd
## 4        Norm      1Fam      1Story        Hip   CompShg      VinylSd
## 5        Norm      1Fam      1Story        Hip   CompShg      VinylSd
## 6        Norm      1Fam      2Story        Hip   CompShg      CemntBd
##   Exterior.2nd Mas.Vnr.Type Exter.Qual Exter.Cond Foundation Bsmt.Qual
## 1      MetalSd        Stone         Ex         TA      PConc        Ex
## 2      CmentBd      BrkFace         Ex         TA      PConc        Ex
## 3      VinylSd        Stone         Ex         TA      PConc        Ex
## 4      VinylSd        Stone         Ex         TA      PConc        Ex
## 5      VinylSd      BrkFace         Ex         TA      PConc        Ex
## 6      Wd Shng      BrkFace         Ex         TA      PConc        Gd
##   Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.Type.2 Heating Heating.QC
## 1        TA            Gd            GLQ            Unf    GasA         Ex
## 2        TA            Av            GLQ            Unf    GasA         Ex
## 3        TA            No            GLQ            Unf    GasA         Ex
## 4        TA            Gd            GLQ            Unf    GasA         Ex
## 5        TA            Av            GLQ            Unf    GasA         Ex
## 6        TA            Gd            ALQ            Unf    GasA         Ex
##   Central.Air Electrical Kitchen.Qual Functional Fireplace.Qu Garage.Type
## 1           Y      SBrkr           Ex        Typ           Gd      Attchd
## 2           Y      SBrkr           Ex        Typ           Gd      Attchd
## 3           Y      SBrkr           Ex        Typ           Gd      Attchd
## 4           Y      SBrkr           Ex        Typ           Gd      Attchd
## 5           Y      SBrkr           Gd        Typ           Gd      Attchd
## 6           Y      SBrkr           Ex        Mod           Gd     BuiltIn
##   Garage.Finish Garage.Qual Garage.Cond Paved.Drive Pool.QC Fence
## 1           Fin          TA          TA           Y      Ex  None
## 2           Fin          TA          TA           Y    None  None
## 3           RFn          TA          TA           Y    None  None
## 4           Fin          TA          TA           Y    None  None
## 5           Fin          Gd          TA           Y    None  None
## 6           Fin          TA          TA           Y    None  None
##   Misc.Feature Sale.Type Sale.Condition House.Age
## 1         None       WD          Normal        14
## 2         None       New        Partial        10
## 3         None       New        Partial        11
## 4         None       New        Partial         8
## 5         None       New        Partial        11
## 6         None       WD          Normal        14

2.3.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.

Instead of dealing with the Year.Built as a categorical variable, it was found easier to correlate between the price and another numerical variable called House.Age which would compute the age of the house today base on the Year.Built. Since both variable would be representing the same phenomenon, we decided to discard the variable Year.Built from the dataset. This interaction is constructed within the function prepare_data().

We recognize limitation of the variable House.Age as it was supposed to be calculated with the Year.Sold instead of using Today date. However, the year sold varies between 2006-2010 and a difference of 4 years would not significantly change the current House-Age variation.

# Below is the formula that was used to perform this interaction as part of the prepare_data formula.
# ames_train <- ames_train %>% mutate(House.Age = year(today())-Year.Built)

summary(lm(formula=log(price)~House.Age, data = ames_train))

## 
## Call:
## lm(formula = log(price) ~ House.Age, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.12310 -0.18669 -0.03868  0.16837  1.29541 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.4181869  0.0187605  661.93   <2e-16 ***
## House.Age   -0.0089228  0.0003493  -25.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3272 on 998 degrees of freedom
## Multiple R-squared:  0.3953, Adjusted R-squared:  0.3947 
## F-statistic: 652.5 on 1 and 998 DF,  p-value: < 2.2e-16

ggplot(data=ames_train, aes(x=price, y=House.Age))+geom_point(aes(colour="red"))+geom_smooth(method = "lm")

The negative relationship is expressed year by the descending slope from left to right which means that the older the house, the lower the price as per this relationship.

2.3.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.

The method choosen to select variables is AIC to allow inclusion of more variables in the initial model before proceeding to finetuning. As demonstrate above, the BIC model would only return the Overall.Qual as significant predictor, completely ignoring all other variables. Hence we decided to use the AIC and proceed by elimination based on the return P-values until we reach this final model.

This is due to the fact that AIC calculates the likelihood of the model and penalize it by a fixed values of K=2 for penalty calculation for each an every new parameter added.

We will also take into account the fact that several observations distorting house price should be removed from the dataet in order to improve model prediction. In this case we will filter in only observation for which the Sale.Condition was Normal and without a Swimming pool.

# Model selection was performed above. But for the purposed of responding to this question, we are refitting the model in the same variable.

ames_train_noout <- ames_train %>% filter(is.na(Pool.QC) | tolower(Sale.Condition) =="normal")

full_model_lm3 <- lm(formula = log(price) ~ log(area) + log(Lot.Area) + Overall.Qual + 
    X1st.Flr.SF + Kitchen.AbvGr +  
    Garage.Cars + Enclosed.Porch +  Screen.Porch + 
    MS.Zoning + Land.Slope + Neighborhood + 
    Bldg.Type + Exterior.2nd +  
    Bsmt.Qual + Central.Air + House.Age, data = na.omit(ames_train_noout))

final_model_AIC3 <- stepAIC(full_model_lm3, direction = "backward", k=2, trace = 0, steps = 1000)

summary(final_model_AIC3)

## 
## Call:
## lm(formula = log(price) ~ log(area) + log(Lot.Area) + Overall.Qual + 
##     X1st.Flr.SF + Garage.Cars + Screen.Porch + MS.Zoning + Land.Slope + 
##     Neighborhood + Bldg.Type + Exterior.2nd + Bsmt.Qual + Central.Air + 
##     House.Age, data = na.omit(ames_train_noout))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.56808 -0.05691  0.00061  0.06547  0.54444 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.620e+00  2.266e-01  33.632  < 2e-16 ***
## log(area)            3.814e-01  2.265e-02  16.840  < 2e-16 ***
## log(Lot.Area)        5.953e-02  1.886e-02   3.156 0.001681 ** 
## Overall.Qual         6.455e-02  6.694e-03   9.643  < 2e-16 ***
## X1st.Flr.SF          1.248e-04  1.957e-05   6.379 3.63e-10 ***
## Garage.Cars          5.582e-02  1.063e-02   5.253 2.11e-07 ***
## Screen.Porch         2.123e-04  8.564e-05   2.478 0.013476 *  
## MS.ZoningFV          3.753e-01  8.151e-02   4.604 5.10e-06 ***
## MS.ZoningRH          2.390e-01  8.974e-02   2.663 0.007948 ** 
## MS.ZoningRL          3.480e-01  6.745e-02   5.160 3.40e-07 ***
## MS.ZoningRM          2.549e-01  6.334e-02   4.025 6.44e-05 ***
## Land.SlopeMod        2.581e-02  2.550e-02   1.013 0.311719    
## Land.SlopeSev        2.643e-01  1.336e-01   1.979 0.048308 *  
## NeighborhoodBlueste  1.659e-01  1.021e-01   1.625 0.104607    
## NeighborhoodBrDale   1.070e-01  9.202e-02   1.163 0.245400    
## NeighborhoodBrkSide  1.568e-01  8.243e-02   1.902 0.057659 .  
## NeighborhoodClearCr  1.491e-01  9.148e-02   1.630 0.103689    
## NeighborhoodCollgCr  4.402e-02  7.324e-02   0.601 0.548088    
## NeighborhoodCrawfor  2.485e-01  8.042e-02   3.090 0.002099 ** 
## NeighborhoodEdwards  3.850e-02  7.560e-02   0.509 0.610784    
## NeighborhoodGilbert  2.393e-02  7.592e-02   0.315 0.752707    
## NeighborhoodGreens   2.647e-01  9.625e-02   2.750 0.006140 ** 
## NeighborhoodIDOTRR   1.316e-01  8.797e-02   1.496 0.135291    
## NeighborhoodMeadowV -3.925e-02  9.459e-02  -0.415 0.678363    
## NeighborhoodMitchel  9.939e-02  7.602e-02   1.308 0.191543    
## NeighborhoodNAmes    9.382e-02  7.509e-02   1.249 0.211992    
## NeighborhoodNoRidge  1.665e-01  7.696e-02   2.164 0.030904 *  
## NeighborhoodNPkVill  1.199e-01  1.349e-01   0.889 0.374596    
## NeighborhoodNridgHt  1.609e-01  7.260e-02   2.216 0.027082 *  
## NeighborhoodNWAmes   5.394e-02  7.714e-02   0.699 0.484680    
## NeighborhoodOldTown  1.271e-01  8.392e-02   1.515 0.130330    
## NeighborhoodSawyer   9.464e-02  7.699e-02   1.229 0.219463    
## NeighborhoodSawyerW  3.620e-02  7.466e-02   0.485 0.627986    
## NeighborhoodSomerst  9.884e-02  8.060e-02   1.226 0.220565    
## NeighborhoodStoneBr  1.475e-01  7.913e-02   1.864 0.062866 .  
## NeighborhoodSWISU    4.366e-02  8.661e-02   0.504 0.614395    
## NeighborhoodTimber   8.091e-02  7.869e-02   1.028 0.304223    
## NeighborhoodVeenker  1.700e-01  8.392e-02   2.025 0.043312 *  
## Bldg.Type2fmCon     -2.499e-03  3.547e-02  -0.070 0.943868    
## Bldg.TypeDuplex     -1.214e-01  3.195e-02  -3.799 0.000161 ***
## Bldg.TypeTwnhs      -9.335e-02  3.823e-02  -2.441 0.014924 *  
## Bldg.TypeTwnhsE     -4.119e-02  2.666e-02  -1.545 0.122914    
## Exterior.2ndBrk Cmn  4.782e-02  1.474e-01   0.324 0.745723    
## Exterior.2ndBrkFace  1.364e-01  5.885e-02   2.318 0.020816 *  
## Exterior.2ndCmentBd  1.912e-01  6.437e-02   2.970 0.003105 ** 
## Exterior.2ndHdBoard  1.028e-01  5.250e-02   1.959 0.050567 .  
## Exterior.2ndImStucc  9.243e-02  6.896e-02   1.340 0.180660    
## Exterior.2ndMetalSd  1.245e-01  5.096e-02   2.443 0.014853 *  
## Exterior.2ndPlywood  5.421e-02  5.377e-02   1.008 0.313820    
## Exterior.2ndStucco   2.122e-01  6.194e-02   3.426 0.000655 ***
## Exterior.2ndVinylSd  1.134e-01  5.204e-02   2.179 0.029744 *  
## Exterior.2ndWd Sdng  1.229e-01  5.174e-02   2.375 0.017854 *  
## Exterior.2ndWd Shng  1.374e-01  5.618e-02   2.447 0.014717 *  
## Bsmt.QualFa         -1.577e-01  4.305e-02  -3.664 0.000271 ***
## Bsmt.QualGd         -9.764e-02  2.377e-02  -4.107 4.58e-05 ***
## Bsmt.QualNone       -2.849e-01  4.493e-02  -6.342 4.55e-10 ***
## Bsmt.QualPo          2.276e-01  1.349e-01   1.687 0.092124 .  
## Bsmt.QualTA         -1.320e-01  2.943e-02  -4.485 8.79e-06 ***
## Central.AirY         1.700e-01  2.514e-02   6.763 3.29e-11 ***
## House.Age           -2.728e-03  4.306e-04  -6.336 4.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1142 on 582 degrees of freedom
## Multiple R-squared:  0.9184, Adjusted R-squared:  0.9101 
## F-statistic:   111 on 59 and 582 DF,  p-value: < 2.2e-16

By doing all the changes described above and by adding back the log transformed log(Lot.Area) in the model, we see a jump of Adjusted R-squared from \(0.8856\) to \(0.9101\) which is a great improvement of the goodness of fit of our model. Also we see the R-square value goes from 0.8947 to 0.9184 which means out model explanatory variables are able to explain nearly 92% of the house price.

2.3.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.

While testing the model on the out-of-sample data, the RMSE value returned was \(191810.5\) which shows that the model predicts very poorly the house price. Issues noticed while calculatingRMSE value were the following:

All sale Condition set to normal which simply causes the variable Sale.Condition to become an insignificant predictor

Categorical variables of observations selected do not contain all levels as in the train dataset.

The new variable House.Age is not present in the test dataset and the Year.Built Variable was still present

To address all the above we:

We reconciled the levels of all categorical variables by using a union statement on the levels of the model and the one of the ames_test dataset.

We treated the variables with the function prepare_data above which removes NA and replaced with th new level none, replaced the variable Year.Built by the variable House.Age as in the trian dataset

# Reconcilling the levels of the model with those of the train dataset to avoid errors while running the predict function. 
cat_var_list <- names(final_model_AIC3$xlevels)
final_model_AIC3$xlevels <- sapply(cat_var_list, function(x) union(final_model_AIC3$xlevels[[x]], levels(ames_test_new[[x]])))

# Performing out-sample prediction
predict_out_final <- suppressWarnings(exp(predict(final_model_AIC3, newdata=ames_test_new)))
resid_out_final <- ames_test_new$price - predict_out_final
rmse_out_final <- sqrt(mean(resid_out_final^2, na.rm = TRUE))
rmse_out_final

## [1] 38401.83

We can also clearly see an improvement of the RMSE value after subsetting the dataset to Sale.Condition=“Normal” only and reintroduing the log(Lot.Area). RMSE goes from \(191810.5\) in our initial model out-of-sample test to \(38401.83\) for this final model. The final model predicts way better the house price based on this output.

2.4 Part 4 Final Model Assessment

2.4.1 Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.

Below plots show the model plot including the Theoretical quantile plot that shows that:

The nearly normality condition is not fully respected as we can see the right skewness of the residual histogram. This can be explained by the

The residual vs fitted plot shows as well that the constant variability of residuals around zero is nearly respected.

The dot plot of actual vs fitted price values shows quite reliable fit which means that the model is quite good at predicting the out-of-sample data.

suppressWarnings(plot(final_model_AIC3))

residual_data <- data.frame(price=ames_test_new$price, predicted=predict_out_final, residuals=resid_out_final)

ggplot()+geom_point(data = residual_data, aes(x=price, y=predicted, color="blue")) + geom_smooth(data = residual_data, aes(x=price, y=predicted), method = "lm")

ggplot(data = residual_data, aes(x=residuals))+geom_histogram(fill="blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

2.4.2 Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.

# Performing out-sample prediction
predict_out_final2 <- suppressWarnings(exp(predict(final_model_AIC3, interval = "prediction"  ,newdata=ames_test_new)))
resid_out_final2 <- ames_test_new$price - predict_out_final2[,"fit"]
rmse_out_final2 <- sqrt(mean(resid_out_final2^2, na.rm = TRUE))

paste("Final model RMSE is :", round(rmse_out_final2,2))

## [1] "Final model RMSE is : 38401.83"

# Comparing both out-sample rmse for both intuitive model and refined final model
paste("Final model RMSE is lower than intuitive model:", rmse_out_final2 < rmse_outsample)

## [1] "Final model RMSE is lower than intuitive model: TRUE"

coverage.prob.final <- mean(ames_test_new$price > predict_out_final2[,"lwr"] &
                            ames_test_new$price < predict_out_final2[,"upr"], na.rm = TRUE)

paste("This model covers up to:", round(coverage.prob.final*100, 2), ("% of the data in the modified test dataset"))

## [1] "This model covers up to: 82.5 % of the data in the modified test dataset"

RMSE for the intuitive model compared to the final model indicates that the final model predicts better the price than the initial model. However, the high value of the RMSE shows that some high leverage points are influencing the entire calculation.

2.4.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?

2.4.3.1 Strenghts

The model provides accurate prediction accurately for 82.5% of the test dataset which is out-of-sample.

The model has a strong R-squared value 0.9184 which represents 91.84% of explanation provided by the predictors to the house price.

The model deals with NA values better than the preivous model.

The model was built by using the combination of automated model selection and manual.

2.4.3.2 Weakness

The model needs to better handle leverage points and possible outliers

It still has a high RMSE value which indicates inaccuracy in out-of-sample prediction

Coverage of the modelon test data is equal to 82.5% which is still to improve

2.4.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the âames_validationâ dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

The RMSE of the final model ran on the validation data is \(46344.48\) Dollars

The final model RMSE is higher on the validation data than on training and on test data as demonstrated below

The coverage of the final model 75.88% which is lower than in the test dataset

From this result we can conclude that even though the final model accuracy is not as high as expected, it does an acceptable approximation of the house price based on the provided dataset.

## Ensuring that the NA are dealt with and that the new variable Houe.Age is created
ames_validation_new <- prepare_data(ames_validation)

# Reconcilling the levels of the model with those of the train dataset to avoid errors while running the predict function. 
cat_var_list3 <- names(final_model_AIC3$xlevels)
final_model_AIC3$xlevels <- sapply(cat_var_list3, function(x) union(final_model_AIC3$xlevels[[x]], levels(ames_validation_new[[x]])))

# Performing out-sample prediction
predict_out_final3 <- suppressWarnings(exp(predict(final_model_AIC3 ,newdata=ames_validation_new)))
resid_out_final3 <- ames_validation_new$price - predict_out_final3["fit"]
rmse_out_final3 <- sqrt(mean(resid_out_final3^2, na.rm = TRUE))

paste("Final model RMSE on validation data is:", round(rmse_out_final3,2))

## [1] "Final model RMSE on validation data is: NaN"

paste("Compared to the initial model RMSE ", round(rmse_intuitive,2) ,"is", ifelse(rmse_out_final3 < rmse_intuitive, "lower", "higher"))

## [1] "Compared to the initial model RMSE  29984.45 is NA"

paste("Compared to the test model RMSE ", round(rmse_out_final,2), "is" ,ifelse(rmse_out_final3 < rmse_out_final, "lower", " higher"))

## [1] "Compared to the test model RMSE  38401.83 is NA"

## Calculating the coverage
predict_out_final4 <- suppressWarnings(exp(predict(final_model_AIC3, interval = "prediction" ,newdata=ames_validation_new)))
resid_out_final4 <- ames_validation_new$price - predict_out_final4["fit"]
rmse_out_final4 <- sqrt(mean(resid_out_final4^2, na.rm = TRUE))

coverage.prob.validation <- mean(ames_validation_new$price > predict_out_final4[,"lwr"] &
                            ames_validation_new$price < predict_out_final4[,"upr"], na.rm = TRUE)

paste("This model covers up to:", round(coverage.prob.validation*100, 2), ("% of the observations in the validation data"))

## [1] "This model covers up to: 75.88 % of the observations in the validation data"

2.5 Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.

This analysis has been full of lessons and through it I learn several concepts and methods including the concept of model testing and validation both in-smple and out-sample. I also learned that the that house price of analysing and validating my analysis using seperate portions of the dataset.

House price main 3 predictors per this model are area, Lot.Area * Overall quality.

If not transformed, some variables such as “area” and “Lot.Area” linear relationship to the house “price” is simply discarded during the model selection and marked as statistically insignificant. However, when log transformed they start presenting linearity.

Conditions for inference were unfortunately not fully respected as the nearly normallity and the linearity conditions were not completely verified due to the skeness of the residual distribution. Hence, performance of the predictions yield may vary greatly from one dataset to another.

Peer Assessment II

1 Background