1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(dplyr)
library(BAS)
library(MASS)
library(corrplot)
library(ggplot2)
library(knitr)
library(GGally)

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


dim(ames_train)
## [1] 1000   81

The dataset has 1000 observations and 81 variables. We will look at the first 6 rows of data to view the data before doing analysis

head(ames_train)
## # A tibble: 6 x 81
##      PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
##    <int> <int>  <int>       <int> <fct>            <int>    <int> <fct> 
## 1 9.09e8   856 126000          30 RL                  NA     7890 Pave  
## 2 9.05e8  1049 139500         120 RL                  42     4235 Pave  
## 3 9.11e8  1001 124900          30 C (all)             60     6060 Pave  
## 4 5.35e8  1039 114000          70 RL                  80     8146 Pave  
## 5 5.34e8  1665 227000          60 RL                  70     8400 Pave  
## 6 9.08e8  1922 198500          85 RL                  64     7301 Pave  
## # ... with 73 more variables: Alley <fct>, Lot.Shape <fct>,
## #   Land.Contour <fct>, Utilities <fct>, Lot.Config <fct>,
## #   Land.Slope <fct>, Neighborhood <fct>, Condition.1 <fct>,
## #   Condition.2 <fct>, Bldg.Type <fct>, House.Style <fct>,
## #   Overall.Qual <int>, Overall.Cond <int>, Year.Built <int>,
## #   Year.Remod.Add <int>, Roof.Style <fct>, Roof.Matl <fct>,
## #   Exterior.1st <fct>, Exterior.2nd <fct>, Mas.Vnr.Type <fct>,
## #   Mas.Vnr.Area <int>, Exter.Qual <fct>, Exter.Cond <fct>,
## #   Foundation <fct>, Bsmt.Qual <fct>, Bsmt.Cond <fct>,
## #   Bsmt.Exposure <fct>, BsmtFin.Type.1 <fct>, BsmtFin.SF.1 <int>,
## #   BsmtFin.Type.2 <fct>, BsmtFin.SF.2 <int>, Bsmt.Unf.SF <int>,
## #   Total.Bsmt.SF <int>, Heating <fct>, Heating.QC <fct>,
## #   Central.Air <fct>, Electrical <fct>, X1st.Flr.SF <int>,
## #   X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>, Bsmt.Full.Bath <int>,
## #   Bsmt.Half.Bath <int>, Full.Bath <int>, Half.Bath <int>,
## #   Bedroom.AbvGr <int>, Kitchen.AbvGr <int>, Kitchen.Qual <fct>,
## #   TotRms.AbvGrd <int>, Functional <fct>, Fireplaces <int>,
## #   Fireplace.Qu <fct>, Garage.Type <fct>, Garage.Yr.Blt <int>,
## #   Garage.Finish <fct>, Garage.Cars <int>, Garage.Area <int>,
## #   Garage.Qual <fct>, Garage.Cond <fct>, Paved.Drive <fct>,
## #   Wood.Deck.SF <int>, Open.Porch.SF <int>, Enclosed.Porch <int>,
## #   X3Ssn.Porch <int>, Screen.Porch <int>, Pool.Area <int>, Pool.QC <fct>,
## #   Fence <fct>, Misc.Feature <fct>, Misc.Val <int>, Mo.Sold <int>,
## #   Yr.Sold <int>, Sale.Type <fct>, Sale.Condition <fct>

Next, we will look at the structure of the data

str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  81 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...

There are two categorical variables are coded in R as having type int. We need to change them to factors when conducting analysis.

ames_train$Overall.Qual <- factor(ames_train$Overall.Qual,ordered = TRUE)
ames_train$Overall.Cond <- factor(ames_train$Overall.Cond,ordered = TRUE)

From str of the data, we saw that there are some variables having missing data and three variables with the highest number of missing observations are Misc.Feature, Alley and Pool.QC

na_count <- colSums(is.na(ames_train))
head(sort(na_count, decreasing = TRUE), 3)
##      Pool.QC Misc.Feature        Alley 
##          997          971          933

In our case, removing all NA from the dataset will affect our analysis. For example, Pool.QC has the higest number of NA’s. This number is likely high as “NA” is coded as “No Pool” in the data and not all of houses must have a pool. Another example is that NA values for Basement.Qual and Garage.Qual correspond to houses that do not have a basement or a garage respectively. Thus, the best way to deal with these NA values when fitting the linear model with these variables is recoding all NA values as a separate category.

2.1.1 Distribution of price

The purpose of this project is building a model to predict the price for all houses in Ames. So we will perform summary statistics and look at the distribution of the price first

ames_train %>% 
    summarize(Q1 = quantile(price, 0.25), MEAN = mean(price), 
              MEDIAN = median(price),Q3 = quantile(price, 0.75),               IQR = IQR(price), STDEV = sd(price)) %>%
              mutate(SKEW = ifelse(MEAN > MEDIAN, 
                                   "RIGHT", "LEFT")) 
## # A tibble: 1 x 7
##        Q1    MEAN MEDIAN     Q3    IQR  STDEV SKEW 
##     <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>
## 1 129762. 181190. 159467 213000 83238. 81910. RIGHT

Average price for all houses in Ames is $181,190 which is higher than the median of the house

ggplot(ames_train, aes(x= price)) +
  geom_histogram(bins = 30, fill = 'steelblue') 

The distribution of price is right-skewed so median will be recommended to be used over the mean for analysis. Also, as price is right-skewed and it will be used as a dependent variable in a linear regression later, so we should log-transform price

2.1.1.1 The ages of the houses

ames_train$age <- sapply(ames_train$Year.Built, function(x) 2019 - x)

ggplot(ames_train, aes(x = age, y =..density..)) +
  geom_histogram(bins = 30, fill = 'blue', colour = 'black') +
  geom_density(size = 1, colour = 'red') +
  labs(title = 'Distribution of House Age', x = 'Age', y = 'Number of Houses') +
  geom_vline(xintercept = mean(ames_train$age), colour = 'green', size = 1) +
  geom_vline(xintercept = median(ames_train$age), colour = 'brown', size = 1) 

summary_age <- ames_train %>%
  summarize(Mean_age = mean(age),
            Median_age = median(age),
            Sd_age = sd(age),
            Q1 = quantile(age, 0.25),
            Q3 = quantile(age, 0.75),
            IQR = IQR(age),
            Total = n())
summary_age
## # A tibble: 1 x 7
##   Mean_age Median_age Sd_age    Q1    Q3   IQR Total
##      <dbl>      <dbl>  <dbl> <dbl> <dbl> <dbl> <int>
## 1     46.8         44   29.6    18    64    46  1000

The distribution of `age’ is righ-skewed (the mean is 46.797, median is 44) and showed multimodal behaviour indicating that there are some high counts of houses for certain ages (such as 50 and 90). There is an decrease in number of houses as the age of houses increase.

2.1.2 Home price to its neighborhood

The mantra in real estate is “Location, Location, Location!” We will do summary statistics about home price to its neighborhood for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods

# Perform summary statistics

q2 <- ames_train %>% 
  group_by(Neighborhood) %>% 
  summarise(mean_price = mean(price),
            median_price = median(price),
            min_price = min(price),
            max_price = max(price),
            IQR = IQR(price),
            sd_price = sd(price),
            var_price = var(price),
            total = n()) 

# Determine most expensive, lease expensive and most heterogeneous

most_expensive <- q2[which(q2$median_price == max(q2$median_price)),]
least_expensive <- q2[which(q2$median_price == min(q2$median_price)),]
most_heter <- q2[which(q2$sd_price == max(q2$sd_price)),]

kable(most_expensive[1:9], caption = 'Most expensive houses')
Most expensive houses
Neighborhood mean_price median_price min_price max_price IQR sd_price var_price total
StoneBr 339316 340691.5 180000 591587 151358 123459.1 15242150036 20
kable(least_expensive[1:9], caption = 'Least expensive houses')
Least expensive houses
Neighborhood mean_price median_price min_price max_price IQR sd_price var_price total
MeadowV 92946.88 85750 73000 129500 20150 18939.78 358715156 16
kable(most_heter[1:9], caption = 'Most heterogeneous houses')
Most heterogeneous houses
Neighborhood mean_price median_price min_price max_price IQR sd_price var_price total
StoneBr 339316 340691.5 180000 591587 151358 123459.1 15242150036 20
# Plot
ggplot(ames_train, aes(x = Neighborhood, y = price/1000)) +
  geom_boxplot(colour = 'black', fill = 'orange') +
  labs(title = "Housing prices per Neighborhood", x = 'Neighborhood', y = "Price in [$ 1000]") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

According to the chart, StoneBr is the most expensive and most heterogeneous neighborhood, follow by NridgHt while the least expensive Neighborhood is MeadowV. We can create different plots for StoneBr and NridgHt

stone <- ames_train %>%
  filter(Neighborhood == 'StoneBr')

p1 <- ggplot(stone, aes(x = price/1000, y = ..density..)) +
  geom_histogram(bins = 30, colour = 'black', fill = 'brown') +
  geom_density(size = 1, colour = 'blue') +
  labs(title = 'Neigborhood - StoneBr: Price distribution', x = 'Price(in $1000)', y = 'Density')

nrid <- ames_train %>%
  filter(Neighborhood == 'NridgHt') 
 
p2 <- ggplot(nrid, aes(x = price/1000, y = ..density..)) +
  geom_histogram(bins = 30, colour = 'black', fill = 'brown') +
  geom_density(size = 1, colour = 'blue') +
  labs(title = 'Neigborhood - NridgHt: Price distribution', x = 'Price(in $1000)', y = 'Density')
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
grid.arrange(p1, p2, ncol = 2)

2.1.3 Overal.Qual in association with price

To know if Lot.Area, Bedroom.AbvGr, Overall.Qual and Year.Built affect the price of houses in Ames or not, we will look at the following charts:

# Lot.Area
p1 <- ggplot(ames_train, aes(x = Lot.Area, y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')

# Bedroom.AbvGr
p2 <- ggplot(ames_train, aes(x = Bedroom.AbvGr, y = price)) +
  geom_jitter() +
  stat_smooth(method = 'lm')

# Overall.Qual
p3 <- ggplot(ames_train, aes(x = Overall.Qual, y = price)) +
  geom_jitter()+
  stat_smooth(method = 'lm')

# Year.Built
p4 <- ggplot(ames_train, aes(x = Year.Built, y = price)) +
  geom_point()+
  stat_smooth(method = 'lm')

grid.arrange(p1, p2, p3, p4, ncol = 2)

From the chart, Overall.Qual appears to be the best single predictor of price.

2.1.4 Variable correlation matrix

We will perform correlation matrix to identify top of the most infuencial variables to the price. The correlation coefficient measures the linearly between 2 variables.

# Convert entire dataframe to numeric
data.corr <- as.data.frame(sapply(ames_train, as.numeric))
correlations = cor(data.corr, method = "s")
# Show variables that have strong correlations with price, focus on coefficient > 0.5 or < -0.5
corr.price = as.matrix(sort(correlations[,'price'], decreasing = TRUE))
corr.id = names(which(apply(corr.price, 1, function(x) (x > 0.5 | x < -0.5))))
corrplot(as.matrix(correlations[corr.id,corr.id]), type = 'upper', method='color', addCoef.col = 'black', tl.cex = 1,cl.cex = 1, number.cex=1)

The plot shows 14 variables having strong relationship with price.


2.2 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.2.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


There are a lot of factors that affect the price of a house. In this section, We will pick 10 variables to be predictors for price and log-transform price and area (mentioned in EAD)

  • Overall.Qual: people will look at this information first when choosing a house or deciding to buy a house
  • Bedroom.AbvGr: number of bedroom may be one of concern when people go shopping for a house
  • Area: most often, a house with large area will cost more
  • Lot.Area: the position of a house will affect the price
  • Year.Built: in general, new house will be more expensive than old house
  • Garage.Area, Total.Bsmt.SF, Garage.Cars, Full.Bath and X1st.Flr.SF will affct the price too
# Full Model 
model.full <- lm(log(price) ~ Overall.Qual + Garage.Area +
                   Total.Bsmt.SF + Garage.Cars + log(area) +
                   Full.Bath + Bedroom.AbvGr + Year.Built +
                   X1st.Flr.SF + Lot.Area, data = ames_train)
summary(model.full)
## 
## Call:
## lm(formula = log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + 
##     Garage.Cars + log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + 
##     X1st.Flr.SF + Lot.Area, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.42804 -0.07357  0.00684  0.08424  0.65793 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     2.447e+00  5.828e-01   4.198 2.93e-05 ***
## Overall.Qual.L  9.959e-01  1.036e-01   9.610  < 2e-16 ***
## Overall.Qual.Q -9.926e-02  9.280e-02  -1.070 0.285061    
## Overall.Qual.C -8.705e-02  8.333e-02  -1.045 0.296462    
## Overall.Qual^4  8.612e-02  6.709e-02   1.284 0.199557    
## Overall.Qual^5 -2.295e-01  5.032e-02  -4.562 5.72e-06 ***
## Overall.Qual^6  1.149e-01  4.150e-02   2.768 0.005739 ** 
## Overall.Qual^7 -1.218e-01  3.575e-02  -3.407 0.000683 ***
## Overall.Qual^8  9.394e-02  2.658e-02   3.534 0.000429 ***
## Overall.Qual^9 -5.486e-02  1.555e-02  -3.528 0.000437 ***
## Garage.Area     6.486e-05  5.725e-05   1.133 0.257545    
## Total.Bsmt.SF   1.241e-04  2.408e-05   5.154 3.08e-07 ***
## Garage.Cars     4.055e-02  1.722e-02   2.355 0.018727 *  
## log(area)       4.573e-01  3.203e-02  14.276  < 2e-16 ***
## Full.Bath      -3.163e-02  1.482e-02  -2.134 0.033091 *  
## Bedroom.AbvGr  -3.160e-02  8.587e-03  -3.680 0.000246 ***
## Year.Built      3.041e-03  2.582e-04  11.779  < 2e-16 ***
## X1st.Flr.SF     3.098e-05  2.768e-05   1.119 0.263455    
## Lot.Area        3.138e-06  5.681e-07   5.524 4.24e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1647 on 979 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8492, Adjusted R-squared:  0.8465 
## F-statistic: 306.4 on 18 and 979 DF,  p-value: < 2.2e-16

Adjusted \(R^2\) is pretty high which is 0.8465 showing strong relationship between those predictors to the price


2.2.2 Section 2.2 Model Selection

Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


NOTE: Write your written response to section 2.2 here. Delete this note before you submit your work.

# Model selection using AIC
model.AIC <- stepAIC(model.full, k = 2)
## Start:  AIC=-3581.17
## log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + Garage.Cars + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + X1st.Flr.SF + 
##     Lot.Area
## 
##                 Df Sum of Sq    RSS     AIC
## - X1st.Flr.SF    1    0.0340 26.592 -3581.9
## - Garage.Area    1    0.0348 26.593 -3581.9
## <none>                       26.558 -3581.2
## - Full.Bath      1    0.1235 26.682 -3578.5
## - Garage.Cars    1    0.1504 26.708 -3577.5
## - Bedroom.AbvGr  1    0.3674 26.925 -3569.5
## - Total.Bsmt.SF  1    0.7207 27.279 -3556.4
## - Lot.Area       1    0.8278 27.386 -3552.5
## - Year.Built     1    3.7638 30.322 -3450.9
## - log(area)      1    5.5284 32.086 -3394.4
## - Overall.Qual   9    9.1660 35.724 -3303.3
## 
## Step:  AIC=-3581.89
## log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + Garage.Cars + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + Lot.Area
## 
##                 Df Sum of Sq    RSS     AIC
## - Garage.Area    1    0.0422 26.634 -3582.3
## <none>                       26.592 -3581.9
## - Full.Bath      1    0.1193 26.711 -3579.4
## - Garage.Cars    1    0.1456 26.738 -3578.4
## - Bedroom.AbvGr  1    0.4035 26.996 -3568.9
## - Lot.Area       1    0.8608 27.453 -3552.1
## - Total.Bsmt.SF  1    1.9763 28.568 -3512.3
## - Year.Built     1    3.7310 30.323 -3452.9
## - log(area)      1    6.3710 32.963 -3369.5
## - Overall.Qual   9    9.1334 35.725 -3305.2
## 
## Step:  AIC=-3582.31
## log(price) ~ Overall.Qual + Total.Bsmt.SF + Garage.Cars + log(area) + 
##     Full.Bath + Bedroom.AbvGr + Year.Built + Lot.Area
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       26.634 -3582.3
## - Full.Bath      1    0.1405 26.775 -3579.1
## - Bedroom.AbvGr  1    0.4083 27.042 -3569.1
## - Lot.Area       1    0.8756 27.510 -3552.0
## - Garage.Cars    1    0.9445 27.579 -3549.5
## - Total.Bsmt.SF  1    2.0983 28.733 -3508.6
## - Year.Built     1    3.7708 30.405 -3452.2
## - log(area)      1    6.4726 33.107 -3367.2
## - Overall.Qual   9    9.1192 35.753 -3306.4
model.AIC$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + Garage.Cars + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + X1st.Flr.SF + 
##     Lot.Area
## 
## Final Model:
## log(price) ~ Overall.Qual + Total.Bsmt.SF + Garage.Cars + log(area) + 
##     Full.Bath + Bedroom.AbvGr + Year.Built + Lot.Area
## 
## 
##            Step Df   Deviance Resid. Df Resid. Dev       AIC
## 1                                   979   26.55801 -3581.169
## 2 - X1st.Flr.SF  1 0.03396243       980   26.59197 -3581.894
## 3 - Garage.Area  1 0.04219861       981   26.63417 -3582.311
# Model selection using BIC
model.BIC <- stepAIC(model.full, k = log(nrow(ames_train)))
## Start:  AIC=-3487.92
## log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + Garage.Cars + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + X1st.Flr.SF + 
##     Lot.Area
## 
##                 Df Sum of Sq    RSS     AIC
## - X1st.Flr.SF    1    0.0340 26.592 -3493.6
## - Garage.Area    1    0.0348 26.593 -3493.5
## - Full.Bath      1    0.1235 26.682 -3490.2
## - Garage.Cars    1    0.1504 26.708 -3489.2
## <none>                       26.558 -3487.9
## - Bedroom.AbvGr  1    0.3674 26.925 -3481.1
## - Total.Bsmt.SF  1    0.7207 27.279 -3468.1
## - Lot.Area       1    0.8278 27.386 -3464.2
## - Year.Built     1    3.7638 30.322 -3362.6
## - log(area)      1    5.5284 32.086 -3306.1
## - Overall.Qual   9    9.1660 35.724 -3254.2
## 
## Step:  AIC=-3493.55
## log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + Garage.Cars + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + Lot.Area
## 
##                 Df Sum of Sq    RSS     AIC
## - Garage.Area    1    0.0422 26.634 -3498.9
## - Full.Bath      1    0.1193 26.711 -3496.0
## - Garage.Cars    1    0.1456 26.738 -3495.0
## <none>                       26.592 -3493.6
## - Bedroom.AbvGr  1    0.4035 26.996 -3485.4
## - Lot.Area       1    0.8608 27.453 -3468.7
## - Total.Bsmt.SF  1    1.9763 28.568 -3428.9
## - Year.Built     1    3.7310 30.323 -3369.4
## - log(area)      1    6.3710 32.963 -3286.1
## - Overall.Qual   9    9.1334 35.725 -3261.1
## 
## Step:  AIC=-3498.88
## log(price) ~ Overall.Qual + Total.Bsmt.SF + Garage.Cars + log(area) + 
##     Full.Bath + Bedroom.AbvGr + Year.Built + Lot.Area
## 
##                 Df Sum of Sq    RSS     AIC
## - Full.Bath      1    0.1405 26.775 -3500.5
## <none>                       26.634 -3498.9
## - Bedroom.AbvGr  1    0.4083 27.042 -3490.6
## - Lot.Area       1    0.8756 27.510 -3473.5
## - Garage.Cars    1    0.9445 27.579 -3471.0
## - Total.Bsmt.SF  1    2.0983 28.733 -3430.1
## - Year.Built     1    3.7708 30.405 -3373.6
## - log(area)      1    6.4726 33.107 -3288.7
## - Overall.Qual   9    9.1192 35.753 -3267.2
## 
## Step:  AIC=-3500.54
## log(price) ~ Overall.Qual + Total.Bsmt.SF + Garage.Cars + log(area) + 
##     Bedroom.AbvGr + Year.Built + Lot.Area
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       26.775 -3500.5
## - Bedroom.AbvGr  1    0.4925 27.267 -3489.3
## - Lot.Area       1    0.8827 27.657 -3475.1
## - Garage.Cars    1    0.9325 27.707 -3473.3
## - Total.Bsmt.SF  1    2.1059 28.881 -3431.9
## - Year.Built     1    3.6744 30.449 -3379.1
## - log(area)      1    6.6827 33.457 -3285.1
## - Overall.Qual   9    9.2468 36.021 -3266.6
model.BIC$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + Garage.Cars + 
##     log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + X1st.Flr.SF + 
##     Lot.Area
## 
## Final Model:
## log(price) ~ Overall.Qual + Total.Bsmt.SF + Garage.Cars + log(area) + 
##     Bedroom.AbvGr + Year.Built + Lot.Area
## 
## 
##            Step Df   Deviance Resid. Df Resid. Dev       AIC
## 1                                   979   26.55801 -3487.922
## 2 - X1st.Flr.SF  1 0.03396243       980   26.59197 -3493.554
## 3 - Garage.Area  1 0.04219861       981   26.63417 -3498.879
## 4   - Full.Bath  1 0.14053582       982   26.77470 -3500.535
# Model selection using BAM
model.bas <- bas.lm(log(price) ~ Overall.Qual + Garage.Area +
                   Total.Bsmt.SF + Garage.Cars + log(area) +
                   Full.Bath + Bedroom.AbvGr + Year.Built +
                   X1st.Flr.SF + Lot.Area, data = ames_train,
                   prior = "AIC", modelprior=uniform())
image(model.bas, rotate = FALSE)

# Initial Model: log(price) ~ Overall.Qual + Garage.Area + Total.Bsmt.SF + Garage.Cars + log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + X1st.Flr.SF + Lot.Area

# Model AIC: log(price) ~ Overall.Qual + Total.Bsmt.SF + Garage.Cars + log(area) + Full.Bath + Bedroom.AbvGr + Year.Built + Lot.Area
    
# Model BIC : log(price) ~ Overall.Qual + Total.Bsmt.SF + Garage.Cars + log(area) + Bedroom.AbvGr + Year.Built + Lot.Area

2.2.3 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


2.2.3.1 Residuals plot

par(mfrow = c(2,2))
plot(model.bas)

The Residuals vs Fitted plot is used to check the linear relationship assumptions. A horizontal line, without distinct patterns is an indication for a linear relationship, what is good. In our example, there is no pattern in the residual plot. This suggests that we can assume linear relationship between the predictors and the outcome variables. However, the model overpredicted certain houses such as house #428,181 and 310. We can find this information as follows

pred_train <- predict(model.bas,ames_train,estimator = "BMA")
resid_train <- na.omit(ames_train$price - exp(pred_train$fit))
data.fit.resid <- data.frame(fitted = na.omit(exp(pred_train$fit)), resid = resid_train)
overprice <- ames_train %>% dplyr::select(Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr, price)
overprice$predicted <- exp(pred_train$fit)
overprice[c(428,181,310),]
## # A tibble: 3 x 7
##   Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr  price
##      <int> <fct>           <int>          <int>         <int>  <int>
## 1     9656 Gtl              1923           1970             2  12789
## 2    11900 Gtl              1977           1977             3  82500
## 3    40094 Gtl              2007           2008             3 184750
## # ... with 1 more variable: predicted <dbl>

2.2.3.2 Q-Q plot

The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.

# Quantile-Quantile Plot of Residuals

mu_resid <- mean(resid_train, na.rm=TRUE)
sd_resid <- sd(resid_train, na.rm=TRUE)
std_resid <- (resid_train-mu_resid)/sd_resid
par(mfrow=c(1,2))
qqnorm(std_resid, lty = 2)
qqline(std_resid)
plot(density(std_resid), main="Probability Density of Std. Residuals", xlab="Std. Residuals", ylab="P(Std. Residuals)")

We see that the distribution is right-skewed and fairly normal indicating the model overestimating some houses.

2.2.3.3 Scale plot

sqrt_std_resid <- sqrt(abs(std_resid))
plot_dat <- data.frame(fitted = na.omit(exp(pred_train$fit)), resid = resid_train, sqrt_std_resid = sqrt_std_resid)
ggplot(plot_dat, aes(x = fitted, y = sqrt_std_resid)) +
  geom_point(colour = 'darkgreen') + 
  geom_smooth(method = 'loess', color= "blue", lwd = 0.5) + 
  labs(title = "Scale-Location Plot", y = "Sqrt(Std. Residuals)", x = "Fitted values") 

The scale plot is used to check the homogeneity of variance of the residuals (homoscedasticity). Horizontal line with equally spread points is a good indication of homoscedasticity. At higher fitted values, the variance and bias in model predictions diverges as the predicted price increases. To overcome this issue, we can either remove all houses that the model overestimated or add correction to the model.


2.2.4 Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


model.bas <- bas.lm(log(price) ~ Overall.Qual + Garage.Area +
                   Total.Bsmt.SF + Garage.Cars + log(area) +
                   Full.Bath + Bedroom.AbvGr + Year.Built +
                   X1st.Flr.SF + Lot.Area, data = ames_train,
                   prior = "AIC", modelprior=uniform())

pred_train <- predict(model.bas,ames_train,estimator = "BMA")
resid_train <- na.omit(ames_train$price - exp(pred_train$fit))
rmse_train <- sqrt(mean(resid_train^2))
paste('RMSE for BMA model under ames_train = ', format(rmse_train, digit = 7))
## [1] "RMSE for BMA model under ames_train =  32422.83"

2.2.5 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


pred_test <- predict(model.bas,ames_test,estimator = "BMA")
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev
## = object$xlevels): variable 'Overall.Qual' is not a factor
## Error in predict.bas(model.bas, ames_test, estimator = "BMA"): Dimension of newdata does not match orginal model
resid_test <- na.omit(ames_test$price - exp(pred_test$fit))
## Error in na.omit(ames_test$price - exp(pred_test$fit)): object 'pred_test' not found
rmse_test <- sqrt(mean(resid_test^2))
## Error in mean(resid_test^2): object 'resid_test' not found
paste('RMSE for BMA model under ames_test = ', format(rmse_test, digit = 7))
## Error in format(rmse_test, digit = 7): object 'rmse_test' not found

In general, the RMSE for predictions on a training data set will be lower than that for predictions on a testing data set. Because the model is built on the training data so it will fit to the training data more than to the testing data. Yet, in this case, it’s the opposit showing no overfitting


Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

2.3 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

2.3.1 Section 3.1 Final Model

Provide the summary table for your model.


ames_train = ames_train[-310,]
model.final <- bas.lm(log(price) ~ Overall.Qual + Garage.Area +
                   Total.Bsmt.SF + Garage.Cars + log(area) +
                   Full.Bath + Bedroom.AbvGr + Year.Built +
                   X1st.Flr.SF + Lot.Area + Kitchen.Qual+ Neighborhood, data = ames_train,
                   prior = "AIC", modelprior=uniform())
image(model.final, rotate = FALSE)


2.3.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


In this model, we transform price and area because log-tranform both price and area provide the most linear relationship between price and area.

# No log transform
p51 <- ggplot(ames_train, aes(x = area, y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')

#Log area transform
p52 <- ggplot(ames_train, aes(x = log(area), y = price)) +
  geom_point() +
  stat_smooth(method = 'lm')

#Log price transform
p53 <- ggplot(ames_train, aes(x = area, y = log(price))) +
  geom_point() +
  stat_smooth(method = 'lm')

# Log transform both
p54 <- ggplot(ames_train, aes(x = log(area), y = log(price))) +
  geom_point() +
  stat_smooth(method = 'lm')

grid.arrange(p51, p52, p53, p54, ncol = 2)

We see that log-transform both price and area makes the relationship appear to be the most linear


2.3.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


Use the vif function in the R package car to test for multicollinearity. - VIF = 1 : not correlated - 1< VIF < 5: moderatedly correlated - 5< VIF < 10: highly correlated

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
Final.Model <- as.formula(price ~ Overall.Qual + Garage.Area +
                   Total.Bsmt.SF + Garage.Cars + area +
                   Full.Bath + Bedroom.AbvGr + Year.Built +
                   X1st.Flr.SF + Lot.Area + Kitchen.Qual)
vif(lm(Final.Model, ames_train))
##                   GVIF Df GVIF^(1/(2*Df))
## Overall.Qual  7.366945  9        1.117333
## Garage.Area   5.765524  1        2.401151
## Total.Bsmt.SF 3.804495  1        1.950511
## Garage.Cars   6.353370  1        2.520589
## area          4.080910  1        2.020126
## Full.Bath     2.386011  1        1.544672
## Bedroom.AbvGr 1.949106  1        1.396104
## Year.Built    2.320623  1        1.523359
## X1st.Flr.SF   4.021882  1        2.005463
## Lot.Area      1.144870  1        1.069986
## Kitchen.Qual  4.041765  4        1.190752

We see that all features in our model are moderatedly or highly correlated so we will not include variable interation


2.3.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


We peformed corrplot to check the correlation between predictors and price. Then we have top 14 variables that have strong relationship with price. Also, we will put some assumption about reality factor that may affect the price of a house. When we have all variables we want to analyze, we will perform BIC, AIC and BMA model to see the differences. Then we pick BMA model to perform analysis


2.3.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


In general, RMSE from training data will be lower than the testing data. Yet, in this analysis, when seeing RMSE from testing data lower than training data, we pick different variables to build another model and test and in all cases the testing RMSE was lower than the training RMSE and we still see that RMSE under testing data is still lower than under training data. Thus, we can conclude that there is no overfitting


2.4 Part 4 Final Model Assessment

2.4.1 Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


par(mfrow = c(2,2))
plot(model.final)

For Residuals vs Fitted plot, we do not see house #310 and there is no pattern in the residual plot. This suggests that we can assume linear relationship between the predictors and the outcome variables. There are houses #424, 736, 181 that the model overpredicted but they are less than 3sd so it’s acceptable.

pred_train2 <- predict(model.final,ames_train,estimator = "BMA")
resid_train2 <- na.omit(ames_train$price - exp(pred_train2$fit))
mu_resid2 <- mean(resid_train2, na.rm=TRUE)
sd_resid2 <- sd(resid_train2, na.rm=TRUE)
std_resid2 <- (resid_train2-mu_resid2)/sd_resid2
# Quantile-Quantile Plot of Residuals
par(mfrow=c(1,2))
qqnorm(std_resid2, lty = 2)
qqline(std_resid2)
plot(density(std_resid2), main="Probability Density of Std. Residuals", 

    xlab="Std. Residuals", ylab="P(Std. Residuals)")

The residuals are normally distributed out to at least two standard deviations.

#Scale plot
sqrt_std_resid2 <- sqrt(abs(std_resid2))
plot_dat2 <- data.frame(fitted2 = na.omit(exp(pred_train2$fit)), resid2 = resid_train2, sqrt_std_resid2 = sqrt_std_resid2)
ggplot(plot_dat2, aes(x = fitted2, y = sqrt_std_resid2)) +
  geom_point(color = 'darkgreen') + 
  geom_smooth(method = 'loess', color= "blue", lwd = 0.5) + 
  labs(title = "Scale-Location Plot for Adj. Model", y = "Sqrt(Std. Residuals)", x = "Fitted values") 

The variability (variances) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variances in the residuals errors (or heteroscedasticity).


2.4.2 Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


sqrt(mean(resid_train2^2))
## [1] 25431.84
ames_test = ames_test %>% filter(Neighborhood != "Landmrk")
pred_test2 <- predict(model.final, newdata = ames_test, estimator="HPM")
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev
## = object$xlevels): variable 'Overall.Qual' is not a factor
## Error in predict.bas(model.final, newdata = ames_test, estimator = "HPM"): Dimension of newdata does not match orginal model
test2_rmse <- sqrt(mean((exp(pred_test2$fit) - ames_test$price)^2))
## Error in mean((exp(pred_test2$fit) - ames_test$price)^2): object 'pred_test2' not found
test2_rmse
## Error in eval(expr, envir, enclos): object 'test2_rmse' not found

RMSE from testing data is still lower than that of training data but the difference is not as much as the old model is.


2.4.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


The strength of the model is that it can predict most of the houses’ price in Ames. But there are some certain houses, it overpredicted the price such as house #428,310,181. After removing the house #310, the accuracy of the model increases but when we look at the scale plot, tt can be seen that the variability (variances) of the residual points increases with the value of the fitted outcome variable, suggesting non-constant variances in the residuals errors. Thus, we should consider to use a log or square root transformation of the outcome variable (y) and some predictors besides area


2.4.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")

We will use the ames_validation dataset to do some additional assessment of your final model especially to find the RMSE and compare to that of the training data and/or testing data and the coverage

# RMSE
pred.v.HPM <- predict(model.final, ames_validation, 
                    estimator="HPM", 
                    prediction=TRUE, se.fit=TRUE)
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev
## = object$xlevels): variable 'Overall.Qual' is not a factor
## Error in predict.bas(model.final, ames_validation, estimator = "HPM", : Dimension of newdata does not match orginal model
v_rmse <- sqrt(mean((exp(pred.v.HPM$fit) - ames_validation$price)^2))
## Error in mean((exp(pred.v.HPM$fit) - ames_validation$price)^2): object 'pred.v.HPM' not found
v_rmse
## Error in eval(expr, envir, enclos): object 'v_rmse' not found
# Get dataset of predictions and confidence intervals
out = as.data.frame(cbind(exp(confint(pred.v.HPM)),
                          price = ames_validation$price))
## Error in confint(pred.v.HPM): object 'pred.v.HPM' not found
# Fix names in dataset
colnames(out)[1:2] <- c("lwr", "upr")  #fix names
## Error in colnames(out)[1:2] <- c("lwr", "upr"): object 'out' not found
# Get Coverage
pred.v.HPM.coverage <- out %>% summarize(cover = sum(price >= lwr & price <= upr)/n())
## Error in eval(lhs, parent, parent): object 'out' not found
pred.v.HPM.coverage
## Error in eval(expr, envir, enclos): object 'pred.v.HPM.coverage' not found

Using the credible intervals from the validation predictions, 97.38% of all the credible intervals contain the true price of the house in the validation set. Using the median probability model to generate out-of-sample predictions and a 97.38% prediction interval, the proportion of observations (rows) in ames_validation have sales prices that fall outside the prediction intervals is about 2% which is lower than 5% so the model handles well uncertainty.


2.5 Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


We can predict the house price from using the BMA model under ames_train data. Next, using teting data and validation data to check the accuracy of the model is important when performing diagnostic test of a model. In general, the model is built on the training data so overfitting often occurs. However, there is some exceptions. The testing data using for our project tends to perform better in predicting the price.

In case that we have a lot of variables to choose as predictors, we should not only base on variables that show strong quantative information. Like in this model, corrplot showing more than 10 variables having strong relationship with price but there are some variables which actually affected the price of a house are excluded.

Also, the model overpredicted certain houses such as house #310,428 and 181 so we need to do deep analysis about these houses. After looking all information about house #310, we see that excluding this house from our analysis will not affect our model because the actual price reflects exactly the status of the house. Moreover, when building a model, we should always diagnotic the model about residuals vs fitted, scales plot, the Q-Q plot to see how our model behaves and adjust the model to have a better result.

In short, to have a good model we must obtain a good data, explore the data to understand well about data, build model, diagnotic the model, test the model and validate the model * * *