Background

Cory Torrella is hired as a statistical consultant working for a real estate investment firm.

Your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(dplyr)
library(BAS)
library(tidyverse)
library(gridExtra)
library(MASS)
library(knitr)

Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set. Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


We begin by exploring a summary of the ames_train data:

summary(ames_train)
##       PID                 area          price         MS.SubClass    
##  Min.   :5.263e+08   Min.   : 334   Min.   : 12789   Min.   : 20.00  
##  1st Qu.:5.285e+08   1st Qu.:1092   1st Qu.:129763   1st Qu.: 20.00  
##  Median :5.354e+08   Median :1411   Median :159467   Median : 50.00  
##  Mean   :7.059e+08   Mean   :1477   Mean   :181190   Mean   : 57.15  
##  3rd Qu.:9.071e+08   3rd Qu.:1743   3rd Qu.:213000   3rd Qu.: 70.00  
##  Max.   :1.007e+09   Max.   :4676   Max.   :615000   Max.   :190.00  
##                                                                      
##    MS.Zoning    Lot.Frontage       Lot.Area       Street     Alley    
##  A (agr):  0   Min.   : 21.00   Min.   :  1470   Grvl:  3   Grvl: 33  
##  C (all):  9   1st Qu.: 57.00   1st Qu.:  7314   Pave:997   Pave: 34  
##  FV     : 56   Median : 69.00   Median :  9317              NA's:933  
##  I (all):  1   Mean   : 69.21   Mean   : 10352                        
##  RH     :  7   3rd Qu.: 80.00   3rd Qu.: 11650                        
##  RL     :772   Max.   :313.00   Max.   :215245                        
##  RM     :155   NA's   :167                                            
##  Lot.Shape Land.Contour  Utilities      Lot.Config  Land.Slope  Neighborhood
##  IR1:338   Bnk: 33      AllPub:1000   Corner :173   Gtl:962    NAmes  :155  
##  IR2: 30   HLS: 38      NoSeWa:   0   CulDSac: 76   Mod: 33    CollgCr: 85  
##  IR3:  3   Low: 20      NoSewr:   0   FR2    : 36   Sev:  5    Somerst: 74  
##  Reg:629   Lvl:909                    FR3    :  5              OldTown: 71  
##                                       Inside :710              Sawyer : 61  
##                                                                Edwards: 60  
##                                                                (Other):494  
##   Condition.1   Condition.2   Bldg.Type    House.Style   Overall.Qual   
##  Norm   :875   Norm   :988   1Fam  :823   1Story :521   Min.   : 1.000  
##  Feedr  : 53   Feedr  :  6   2fmCon: 20   2Story :286   1st Qu.: 5.000  
##  Artery : 23   Artery :  2   Duplex: 35   1.5Fin : 98   Median : 6.000  
##  RRAn   : 14   PosN   :  2   Twnhs : 38   SLvl   : 41   Mean   : 6.095  
##  PosN   : 11   PosA   :  1   TwnhsE: 84   SFoyer : 36   3rd Qu.: 7.000  
##  RRAe   : 11   RRNn   :  1                2.5Unf : 10   Max.   :10.000  
##  (Other): 13   (Other):  0                (Other):  8                   
##   Overall.Cond     Year.Built   Year.Remod.Add   Roof.Style    Roof.Matl  
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  9   CompShg:984  
##  1st Qu.:5.000   1st Qu.:1955   1st Qu.:1966   Gable  :775   Tar&Grv: 11  
##  Median :5.000   Median :1975   Median :1992   Gambrel:  8   WdShake:  2  
##  Mean   :5.559   Mean   :1972   Mean   :1984   Hip    :204   WdShngl:  2  
##  3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004   Mansard:  4   Metal  :  1  
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :  0   ClyTile:  0  
##                                                              (Other):  0  
##   Exterior.1st  Exterior.2nd  Mas.Vnr.Type  Mas.Vnr.Area    Exter.Qual
##  VinylSd:349   VinylSd:345          :  7   Min.   :   0.0   Ex: 39    
##  HdBoard:164   HdBoard:150   BrkCmn :  8   1st Qu.:   0.0   Fa: 11    
##  MetalSd:147   MetalSd:148   BrkFace:317   Median :   0.0   Gd:337    
##  Wd Sdng:138   Wd Sdng:130   CBlock :  0   Mean   : 104.1   TA:613    
##  Plywood: 74   Plywood: 96   None   :593   3rd Qu.: 160.0             
##  CemntBd: 40   CmentBd: 40   Stone  : 75   Max.   :1290.0             
##  (Other): 88   (Other): 91                 NA's   :7                  
##  Exter.Cond  Foundation  Bsmt.Qual  Bsmt.Cond  Bsmt.Exposure BsmtFin.Type.1
##  Ex:  4     BrkTil:102       :  1       :  1       :  2      GLQ    :294   
##  Fa: 19     CBlock:430   Ex  : 87   Ex  :  2   Av  :157      Unf    :279   
##  Gd:116     PConc :453   Fa  : 28   Fa  : 23   Gd  : 98      ALQ    :163   
##  Po:  0     Slab  : 12   Gd  :424   Gd  : 44   Mn  : 87      Rec    :107   
##  TA:861     Stone :  3   Po  :  1   Po  :  1   No  :635      BLQ    : 87   
##             Wood  :  0   TA  :438   TA  :908   NA's: 21      (Other): 49   
##                          NA's: 21   NA's: 21                 NA's   : 21   
##   BsmtFin.SF.1    BsmtFin.Type.2  BsmtFin.SF.2      Bsmt.Unf.SF    
##  Min.   :   0.0   Unf    :863    Min.   :   0.00   Min.   :   0.0  
##  1st Qu.:   0.0   LwQ    : 31    1st Qu.:   0.00   1st Qu.: 223.5  
##  Median : 400.0   Rec    : 29    Median :   0.00   Median : 461.0  
##  Mean   : 464.1   BLQ    : 24    Mean   :  48.07   Mean   : 547.0  
##  3rd Qu.: 773.0   ALQ    : 20    3rd Qu.:   0.00   3rd Qu.: 783.0  
##  Max.   :2260.0   (Other): 12    Max.   :1526.00   Max.   :2336.0  
##  NA's   :1        NA's   : 21    NA's   :1         NA's   :1       
##  Total.Bsmt.SF     Heating    Heating.QC Central.Air Electrical 
##  Min.   :   0.0   Floor:  0   Ex:516     N: 55            :  0  
##  1st Qu.: 797.5   GasA :988   Fa: 22     Y:945       FuseA: 54  
##  Median : 998.0   GasW :  8   Gd:157                 FuseF: 12  
##  Mean   :1059.2   Grav :  2   Po:  1                 FuseP:  2  
##  3rd Qu.:1301.0   OthW :  1   TA:304                 Mix  :  0  
##  Max.   :3138.0   Wall :  1                          SBrkr:932  
##  NA's   :1                                                      
##   X1st.Flr.SF      X2nd.Flr.SF     Low.Qual.Fin.SF   Bsmt.Full.Bath  
##  Min.   : 334.0   Min.   :   0.0   Min.   :   0.00   Min.   :0.0000  
##  1st Qu.: 876.2   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.:0.0000  
##  Median :1080.5   Median :   0.0   Median :   0.00   Median :0.0000  
##  Mean   :1157.1   Mean   : 315.2   Mean   :   4.32   Mean   :0.4474  
##  3rd Qu.:1376.2   3rd Qu.: 688.2   3rd Qu.:   0.00   3rd Qu.:1.0000  
##  Max.   :3138.0   Max.   :1836.0   Max.   :1064.00   Max.   :3.0000  
##                                                      NA's   :1       
##  Bsmt.Half.Bath      Full.Bath       Half.Bath     Bedroom.AbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.000   Median :3.000  
##  Mean   :0.06106   Mean   :1.541   Mean   :0.378   Mean   :2.806  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :4.000   Max.   :2.000   Max.   :6.000  
##  NA's   :1                                                        
##  Kitchen.AbvGr   Kitchen.Qual TotRms.AbvGrd     Functional    Fireplaces   
##  Min.   :0.000   Ex: 67       Min.   : 2.00   Typ    :935   Min.   :0.000  
##  1st Qu.:1.000   Fa: 20       1st Qu.: 5.00   Min2   : 24   1st Qu.:0.000  
##  Median :1.000   Gd:403       Median : 6.00   Min1   : 18   Median :1.000  
##  Mean   :1.039   Po:  1       Mean   : 6.34   Mod    : 16   Mean   :0.597  
##  3rd Qu.:1.000   TA:509       3rd Qu.: 7.00   Maj1   :  4   3rd Qu.:1.000  
##  Max.   :2.000                Max.   :13.00   Maj2   :  2   Max.   :4.000  
##                                               (Other):  1                  
##  Fireplace.Qu  Garage.Type  Garage.Yr.Blt  Garage.Finish  Garage.Cars   
##  Ex  : 16     2Types : 10   Min.   :1900       :  2      Min.   :0.000  
##  Fa  : 24     Attchd :610   1st Qu.:1961   Fin :247      1st Qu.:1.000  
##  Gd  :232     Basment: 11   Median :1979   RFn :278      Median :2.000  
##  Po  : 18     BuiltIn: 56   Mean   :1978   Unf :427      Mean   :1.767  
##  TA  :219     CarPort:  1   3rd Qu.:2002   NA's: 46      3rd Qu.:2.000  
##  NA's:491     Detchd :266   Max.   :2010                 Max.   :5.000  
##               NA's   : 46   NA's   :48                   NA's   :1      
##   Garage.Area     Garage.Qual Garage.Cond Paved.Drive  Wood.Deck.SF   
##  Min.   :   0.0       :  1        :  1    N: 67       Min.   :  0.00  
##  1st Qu.: 312.0   Ex  :  1    Ex  :  1    P: 29       1st Qu.:  0.00  
##  Median : 480.0   Fa  : 37    Fa  : 21    Y:904       Median :  0.00  
##  Mean   : 475.4   Gd  :  7    Gd  :  6                Mean   : 93.84  
##  3rd Qu.: 576.0   Po  :  3    Po  :  6                3rd Qu.:168.00  
##  Max.   :1390.0   TA  :904    TA  :918                Max.   :857.00  
##  NA's   :1        NA's: 47    NA's: 47                                
##  Open.Porch.SF    Enclosed.Porch    X3Ssn.Porch       Screen.Porch   
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.00  
##  Median : 28.00   Median :  0.00   Median :  0.000   Median :  0.00  
##  Mean   : 48.93   Mean   : 23.48   Mean   :  3.118   Mean   : 14.77  
##  3rd Qu.: 74.00   3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:  0.00  
##  Max.   :742.00   Max.   :432.00   Max.   :508.000   Max.   :440.00  
##                                                                      
##    Pool.Area       Pool.QC      Fence     Misc.Feature    Misc.Val       
##  Min.   :  0.000   Ex  :  1   GdPrv: 43   Elev:  0     Min.   :    0.00  
##  1st Qu.:  0.000   Fa  :  1   GdWo : 37   Gar2:  2     1st Qu.:    0.00  
##  Median :  0.000   Gd  :  1   MnPrv:120   Othr:  1     Median :    0.00  
##  Mean   :  1.463   TA  :  0   MnWw :  2   Shed: 25     Mean   :   45.81  
##  3rd Qu.:  0.000   NA's:997   NA's :798   TenC:  1     3rd Qu.:    0.00  
##  Max.   :800.000                          NA's:971     Max.   :15500.00  
##                                                                          
##     Mo.Sold          Yr.Sold       Sale.Type   Sale.Condition
##  Min.   : 1.000   Min.   :2006   WD     :863   Abnorml: 61   
##  1st Qu.: 4.000   1st Qu.:2007   New    : 79   AdjLand:  2   
##  Median : 6.000   Median :2008   COD    : 27   Alloca :  4   
##  Mean   : 6.243   Mean   :2008   ConLD  :  7   Family : 17   
##  3rd Qu.: 8.000   3rd Qu.:2009   ConLw  :  6   Normal :834   
##  Max.   :12.000   Max.   :2010   Con    :  5   Partial: 82   
##                                  (Other): 13

Initial Data Response

It is important that we take note of the values that may need to be cleaned in order to prove useful. Here are some of our first actions:

  • Filter the data set to contain only the sale conditions that are normal, as the houses with abnormal selling conditions exhibit atypical behavior and can disproportionately influence the model.

  • Avoiding risk of incurring bias in the the modelling by transforming NA’s to their perspective category and acknowledging some continuous variables (such as Lot.Frontage) have a great number of NA’s which means we are missing potential valuable data.

  • Categorical variables such as Fence, Garage.Qual, and Garage.Cond have NA’s that are not missing data, but actually another category such as “Not having a fence”/“Not having a garage”.

With this information, we explore relationships and considerations of different factors:

Quantify Neighborhood Value

n.value <- ames_train %>% 
                        group_by(Neighborhood) %>% 
                        summarize(homes.sold = n(),
                                avg.price = mean(price, na.rm = T),
                                med.price = median(price, na.rm = T),
                                iqr.price = IQR(price, na.rm = T),
                                avg.area = mean(area, na.rm = T),
                                med.area = median(area, na.rm = T),
                                iqr.area = IQR(area, na.rm = T)) %>% 
                        arrange(desc(med.price))
n.value
## # A tibble: 27 x 8
##    Neighborhood homes.sold avg.price med.price iqr.price avg.area med.area
##    <fct>             <int>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
##  1 StoneBr              20   339316.   340692.   151358     1950.    1870 
##  2 NridgHt              57   333647.   336860    148800     2017.    1980 
##  3 NoRidge              28   295845.   290000     50312.    2290     2282.
##  4 GrnHill               2   280000    280000     50000     1398.    1398.
##  5 Timber               19   265192.   232500    151200     1686.    1717 
##  6 Somerst              74   234596.   221650     72684.    1603.    1569 
##  7 Greens                4   198562.   212625     16438.    1136     1230.
##  8 Veenker              10   233650    205750     68125     1766.    1576.
##  9 Crawfor              29   204197.   205000     80100     1720.    1646 
## 10 CollgCr              85   196951.   195800     58836     1479.    1541 
## # ... with 17 more rows, and 1 more variable: iqr.area <dbl>

The variance in the average home price and the range of home prices as a result of the neighborhood they were built in is useful in creating a model for predicting home prices.

Now, we’ll explore the distribution of age of the houses in each neighborhood:

data.value <- ames_train
data.value$age <- sapply(data.value$Year.Built, function(x) 2019-x)
p1 <- ggplot(data.value, aes(x = age, y = ..density..)) + 
    geom_histogram(bins = 30, fill = 'dark gray', color = 'white') +
    geom_density(size = 0.8,color = 'dark green')+
    labs(title = "Distribution of ages of the houses", x = "Age", y = 'Count') +
    geom_vline(xintercept = mean(data.value$age), color = "dark red", size = 1.2) +
    geom_vline(xintercept = median(data.value$age), color = "dark blue", size = 1.2) +
    theme_bw() 

p1

Visualize Neighborhood Value

df_price_neighbor <- data.value[,c("price","Neighborhood")]
p2 <- ggplot(df_price_neighbor, aes(y = price, x = reorder(Neighborhood, price, median), fill = Neighborhood), main = "Price among Neighborhoods", xlim= "Neighborhood", ylim="price" )+geom_boxplot() + coord_flip()
p2

Examine Overall Quality vs Age

p3 <- ggplot(data.value, aes(x = Year.Built, y = log(price), col=Overall.Qual)) + geom_point() + theme_bw() + labs(title='Year.Built Analysis', x = 'Year house was built', y = 'Logarithm of Price')
p3

As expected, more modern houses are sold for higher prices than the older ones.

However, if a given neighborhood consists mostly of new homes and higher-than-average prices, we observe that a home’s relative age is a moderating factor for prediction modeling.

Analyze Lot Size

lot.data <- ames_train[which(ames_train$Lot.Area < 50000),]
p4 <- ggplot(lot.data, aes(x = Lot.Area, y = price)) +
  geom_point() +
  geom_smooth(se = TRUE, method="lm", formula = y ~ splines::bs(x, 3)) +
  geom_jitter(height = 0.05) +
  ggtitle("Lot Size Analysis")

p4

Homes built on large lots typically sell for higher prices, all else being equal.

However, there are diminishing returns to additional acreage over 25000 square feet. As Lot.Area shows to be a useful variable, we need to use its logarithm as well as the logarithm of price to get a proper relation.

It’s essential to explore the structure of the data and the relationships between the variables in the data set. While we only have 3-4 plots above, it’s only a small fraction of the variables and values we explored.


Part 2 - Development and Assessment of an Initial Model

Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from “ames_train” and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


We will pick 10 variables to be predictors for price and log-transform price and area (discussed previously in EAD) to build our multi-linear model. In the exploratory summary table above, we could see which variables have the lowest p-values. We’ll use them for our initial model.

We may remove variables with a high p-value, because they reduce significant values to the model, or worse, increase standard error.

i.model <- lm(log(price) ~ log(area) + Neighborhood + Lot.Config + Overall.Qual + Overall.Cond + Year.Built + TotRms.AbvGrd + Full.Bath + Half.Bath + log(Lot.Area), 
                  data = ames_train)

summary(i.model)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Neighborhood + Lot.Config + 
##     Overall.Qual + Overall.Cond + Year.Built + TotRms.AbvGrd + 
##     Full.Bath + Half.Bath + log(Lot.Area), data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44760 -0.06961  0.00200  0.08105  0.50723 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.2290996  0.8553047  -4.945    9e-07 ***
## log(area)            0.5660212  0.0340768  16.610  < 2e-16 ***
## NeighborhoodBlueste -0.1177475  0.0980573  -1.201 0.230124    
## NeighborhoodBrDale  -0.1890690  0.0679204  -2.784 0.005480 ** 
## NeighborhoodBrkSide -0.0714155  0.0577511  -1.237 0.216535    
## NeighborhoodClearCr -0.0609548  0.0667644  -0.913 0.361480    
## NeighborhoodCollgCr -0.1127382  0.0497513  -2.266 0.023671 *  
## NeighborhoodCrawfor  0.0503747  0.0578107   0.871 0.383768    
## NeighborhoodEdwards -0.1577755  0.0526820  -2.995 0.002816 ** 
## NeighborhoodGilbert -0.1424677  0.0526795  -2.704 0.006963 ** 
## NeighborhoodGreens   0.0531447  0.0891940   0.596 0.551427    
## NeighborhoodGrnHill  0.2656380  0.1147967   2.314 0.020879 *  
## NeighborhoodIDOTRR  -0.2127837  0.0591085  -3.600 0.000335 ***
## NeighborhoodMeadowV -0.1798387  0.0614107  -2.928 0.003487 ** 
## NeighborhoodMitchel -0.0916415  0.0534282  -1.715 0.086626 .  
## NeighborhoodNAmes   -0.1013528  0.0515447  -1.966 0.049550 *  
## NeighborhoodNoRidge  0.0018169  0.0554183   0.033 0.973853    
## NeighborhoodNPkVill -0.0066369  0.0884910  -0.075 0.940229    
## NeighborhoodNridgHt  0.0894957  0.0509240   1.757 0.079162 .  
## NeighborhoodNWAmes  -0.1542766  0.0538771  -2.863 0.004281 ** 
## NeighborhoodOldTown -0.1306623  0.0576586  -2.266 0.023665 *  
## NeighborhoodSawyer  -0.1127886  0.0534113  -2.112 0.034970 *  
## NeighborhoodSawyerW -0.1648820  0.0522396  -3.156 0.001648 ** 
## NeighborhoodSomerst -0.0158120  0.0489374  -0.323 0.746685    
## NeighborhoodStoneBr  0.1156330  0.0575870   2.008 0.044926 *  
## NeighborhoodSWISU   -0.0709978  0.0677891  -1.047 0.295209    
## NeighborhoodTimber  -0.0146329  0.0583969  -0.251 0.802195    
## NeighborhoodVeenker -0.0868019  0.0691527  -1.255 0.209704    
## Lot.ConfigCulDSac    0.0108483  0.0215361   0.504 0.614570    
## Lot.ConfigFR2       -0.0540617  0.0280801  -1.925 0.054491 .  
## Lot.ConfigFR3       -0.1371237  0.0680649  -2.015 0.044224 *  
## Lot.ConfigInside    -0.0092697  0.0128464  -0.722 0.470728    
## Overall.Qual         0.0834528  0.0062670  13.316  < 2e-16 ***
## Overall.Cond         0.0710217  0.0048984  14.499  < 2e-16 ***
## Year.Built           0.0052037  0.0003946  13.187  < 2e-16 ***
## TotRms.AbvGrd       -0.0155060  0.0054003  -2.871 0.004178 ** 
## Full.Bath           -0.0496426  0.0145387  -3.415 0.000666 ***
## Half.Bath           -0.0411220  0.0121895  -3.374 0.000772 ***
## log(Lot.Area)        0.1393604  0.0136129  10.237  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1477 on 961 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8766 
## F-statistic: 187.8 on 38 and 961 DF,  p-value: < 2.2e-16

Adjusted R-squared is 0.8766, showing a valuable relationship between those predictors to the price.


Section 2.2 Model Selection

Using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


Now we will run 2 different model selection methods:

1. BIC with Backward Selection

to find the TRUE model among the set of candidates, we start with the full array of regressors and then remove them at each step until the BIC for the model is at its highest.

2. AIC with Stepwise Regression

to consider adding or subtracting regressors at each step. When we consider all variables at each step, it gives us confidence that the strongest model will be selected.

k <- log(nrow(ames_train))

model.bic <- stepAIC(object = i.model, 
                     k = k, 
                     direction = "backward",
                     trace = F)
summary(model.bic)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Qual + Overall.Cond + 
##     Year.Built + TotRms.AbvGrd + Full.Bath + Half.Bath + log(Lot.Area), 
##     data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.37898 -0.08691  0.00075  0.09235  0.59197 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.9807949  0.5854912  -8.507  < 2e-16 ***
## log(area)      0.5906024  0.0358864  16.458  < 2e-16 ***
## Overall.Qual   0.1145357  0.0057951  19.764  < 2e-16 ***
## Overall.Cond   0.0665850  0.0050939  13.072  < 2e-16 ***
## Year.Built     0.0053850  0.0002558  21.050  < 2e-16 ***
## TotRms.AbvGrd -0.0151798  0.0056370  -2.693   0.0072 ** 
## Full.Bath     -0.0669000  0.0142865  -4.683 3.22e-06 ***
## Half.Bath     -0.0670495  0.0122563  -5.471 5.68e-08 ***
## log(Lot.Area)  0.1384414  0.0106491  13.000  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1596 on 991 degrees of freedom
## Multiple R-squared:  0.8571, Adjusted R-squared:  0.8559 
## F-statistic: 742.9 on 8 and 991 DF,  p-value: < 2.2e-16
model.aic <- stepAIC(i.model, 
                     k = 2,
                     direction = "both",
                     trace = F)
summary(model.aic)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Neighborhood + Lot.Config + 
##     Overall.Qual + Overall.Cond + Year.Built + TotRms.AbvGrd + 
##     Full.Bath + Half.Bath + log(Lot.Area), data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.44760 -0.06961  0.00200  0.08105  0.50723 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -4.2290996  0.8553047  -4.945    9e-07 ***
## log(area)            0.5660212  0.0340768  16.610  < 2e-16 ***
## NeighborhoodBlueste -0.1177475  0.0980573  -1.201 0.230124    
## NeighborhoodBrDale  -0.1890690  0.0679204  -2.784 0.005480 ** 
## NeighborhoodBrkSide -0.0714155  0.0577511  -1.237 0.216535    
## NeighborhoodClearCr -0.0609548  0.0667644  -0.913 0.361480    
## NeighborhoodCollgCr -0.1127382  0.0497513  -2.266 0.023671 *  
## NeighborhoodCrawfor  0.0503747  0.0578107   0.871 0.383768    
## NeighborhoodEdwards -0.1577755  0.0526820  -2.995 0.002816 ** 
## NeighborhoodGilbert -0.1424677  0.0526795  -2.704 0.006963 ** 
## NeighborhoodGreens   0.0531447  0.0891940   0.596 0.551427    
## NeighborhoodGrnHill  0.2656380  0.1147967   2.314 0.020879 *  
## NeighborhoodIDOTRR  -0.2127837  0.0591085  -3.600 0.000335 ***
## NeighborhoodMeadowV -0.1798387  0.0614107  -2.928 0.003487 ** 
## NeighborhoodMitchel -0.0916415  0.0534282  -1.715 0.086626 .  
## NeighborhoodNAmes   -0.1013528  0.0515447  -1.966 0.049550 *  
## NeighborhoodNoRidge  0.0018169  0.0554183   0.033 0.973853    
## NeighborhoodNPkVill -0.0066369  0.0884910  -0.075 0.940229    
## NeighborhoodNridgHt  0.0894957  0.0509240   1.757 0.079162 .  
## NeighborhoodNWAmes  -0.1542766  0.0538771  -2.863 0.004281 ** 
## NeighborhoodOldTown -0.1306623  0.0576586  -2.266 0.023665 *  
## NeighborhoodSawyer  -0.1127886  0.0534113  -2.112 0.034970 *  
## NeighborhoodSawyerW -0.1648820  0.0522396  -3.156 0.001648 ** 
## NeighborhoodSomerst -0.0158120  0.0489374  -0.323 0.746685    
## NeighborhoodStoneBr  0.1156330  0.0575870   2.008 0.044926 *  
## NeighborhoodSWISU   -0.0709978  0.0677891  -1.047 0.295209    
## NeighborhoodTimber  -0.0146329  0.0583969  -0.251 0.802195    
## NeighborhoodVeenker -0.0868019  0.0691527  -1.255 0.209704    
## Lot.ConfigCulDSac    0.0108483  0.0215361   0.504 0.614570    
## Lot.ConfigFR2       -0.0540617  0.0280801  -1.925 0.054491 .  
## Lot.ConfigFR3       -0.1371237  0.0680649  -2.015 0.044224 *  
## Lot.ConfigInside    -0.0092697  0.0128464  -0.722 0.470728    
## Overall.Qual         0.0834528  0.0062670  13.316  < 2e-16 ***
## Overall.Cond         0.0710217  0.0048984  14.499  < 2e-16 ***
## Year.Built           0.0052037  0.0003946  13.187  < 2e-16 ***
## TotRms.AbvGrd       -0.0155060  0.0054003  -2.871 0.004178 ** 
## Full.Bath           -0.0496426  0.0145387  -3.415 0.000666 ***
## Half.Bath           -0.0411220  0.0121895  -3.374 0.000772 ***
## log(Lot.Area)        0.1393604  0.0136129  10.237  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1477 on 961 degrees of freedom
## Multiple R-squared:  0.8813, Adjusted R-squared:  0.8766 
## F-statistic: 187.8 on 38 and 961 DF,  p-value: < 2.2e-16

We will use the AIC Stepwise Regression Model, because it has a higher adjusted R-squared (0.8766) and better residual standard error (0.1477); which means it would perform more robust prediction values.


Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


plot(model.aic)

Residual impressions:

  • The model’s residuals are normally distributed with a center around 0.

  • The shape of distribution is normal and lays close to the qq line.

  • We will track outliers for potential implications.


Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


predict <- exp(predict(model.aic, ames_train))

model.resids <- ames_train$price - predict

model.rmse <- sqrt(mean(model.resids^2))

model.rmse
## [1] 29362.77

The model RMSE is $29362.77.

This value may be interpreted as the standard deviation of prediction errors in the prediction model.


Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test.Rdata")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


ames_test <- ames_test %>% 
        filter(Neighborhood != "Landmrk")

initmodel_test <- exp(predict(model.aic, ames_test))

resid.test <- ames_test$price - initmodel_test

rmse.test <- sqrt(mean(resid.test^2, na.rm = TRUE))

rmse.test
## [1] 25278.85
model.rmse > rmse.test
## [1] TRUE

The lower the RMSE, the better the fit.

The test RMSE is $25278.85.

This means that the prediction for the train model proved to be even more accurate for the test data.


Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.

Section 3.1 Final Model

Provide the summary table for your model.


full.model1 = lm(price ~ area + Overall.Cond + Year.Built + Year.Remod.Add + Heating + Central.Air +  Bedroom.AbvGr+ Kitchen.AbvGr +  Kitchen.Qual  + TotRms.AbvGrd+ Fireplaces + Garage.Yr.Blt + Garage.Area + Pool.Area , data = ames_test)

summary(full.model1)
## 
## Call:
## lm(formula = price ~ area + Overall.Cond + Year.Built + Year.Remod.Add + 
##     Heating + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + 
##     TotRms.AbvGrd + Fireplaces + Garage.Yr.Blt + Garage.Area + 
##     Pool.Area, data = ames_test)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -98742 -17287     86  14816 221661 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -9.623e+05  1.599e+05  -6.019 2.73e-09 ***
## area            9.122e+01  4.550e+00  20.048  < 2e-16 ***
## Overall.Cond    7.087e+03  1.291e+03   5.490 5.47e-08 ***
## Year.Built      8.151e+02  7.558e+01  10.785  < 2e-16 ***
## Year.Remod.Add -2.787e+01  8.150e+01  -0.342   0.7324    
## HeatingGasW    -4.941e+03  1.125e+04  -0.439   0.6608    
## HeatingWall     6.226e+03  1.899e+04   0.328   0.7431    
## Central.AirY   -8.820e+03  6.536e+03  -1.349   0.1776    
## Bedroom.AbvGr  -1.291e+04  2.001e+03  -6.453 1.95e-10 ***
## Kitchen.AbvGr  -4.625e+04  6.971e+03  -6.635 6.16e-11 ***
## Kitchen.QualFa -7.942e+04  1.081e+04  -7.345 5.30e-13 ***
## Kitchen.QualGd -7.146e+04  5.602e+03 -12.757  < 2e-16 ***
## Kitchen.QualTA -7.875e+04  6.074e+03 -12.966  < 2e-16 ***
## TotRms.AbvGrd   1.342e+02  1.476e+03   0.091   0.9276    
## Fireplaces      1.113e+04  1.988e+03   5.600 3.00e-08 ***
## Garage.Yr.Blt  -2.383e+02  8.902e+01  -2.676   0.0076 ** 
## Garage.Area     8.670e+01  8.673e+00   9.997  < 2e-16 ***
## Pool.Area       1.834e+01  3.743e+01   0.490   0.6243    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30350 on 761 degrees of freedom
##   (37 observations deleted due to missingness)
## Multiple R-squared:  0.8266, Adjusted R-squared:  0.8227 
## F-statistic: 213.3 on 17 and 761 DF,  p-value: < 2.2e-16

Having residual standard error: 30350 on 761 degrees of freedom is high, which means the chosen model is not a good fit for our prediction needs. (On our first model, we had residual standard error of 0.1477).


Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


After inspecting residuals in the previous model, we found that 3 houses were sold under abnormal conditions:

  • 610 (PID = 531375070)
  • 428 (PID = 902207130)
  • 310 (PID = 908154205)

We remove these to clean our data, leaving the rest of the data with more normal prediction conditions.

ames_train_clean <- ames_train%>%
        filter(Sale.Condition == "Normal" & !(PID %in% c("531375070", "902207130", "908154205")))%>%
        dplyr::select(price, area, Overall.Cond, Year.Built, Year.Remod.Add, Heating, Central.Air, Bedroom.AbvGr, Kitchen.AbvGr, Kitchen.Qual, TotRms.AbvGrd,Fireplaces, Garage.Yr.Blt,Garage.Area, Pool.Area) 
        
full.model2 = lm(log(price) ~ log(area)  + Overall.Cond + Year.Built + Year.Remod.Add + Heating + Central.Air +  Bedroom.AbvGr+ Kitchen.AbvGr +  Kitchen.Qual  + TotRms.AbvGrd+ Fireplaces + Garage.Yr.Blt + log(Garage.Area) + Pool.Area , data = ames_train_clean)


summary(full.model2)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Cond + Year.Built + 
##     Year.Remod.Add + Heating + Central.Air + Bedroom.AbvGr + 
##     Kitchen.AbvGr + Kitchen.Qual + TotRms.AbvGrd + Fireplaces + 
##     Garage.Yr.Blt + log(Garage.Area) + Pool.Area, data = ames_train_clean)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74726 -0.08643  0.00512  0.08990  0.50311 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.5675243  0.7640547  -2.052 0.040544 *  
## log(area)         0.5488495  0.0330454  16.609  < 2e-16 ***
## Overall.Cond      0.0524961  0.0059707   8.792  < 2e-16 ***
## Year.Built        0.0042442  0.0003880  10.938  < 2e-16 ***
## Year.Remod.Add    0.0006107  0.0003910   1.562 0.118773    
## HeatingGasW       0.2119427  0.0608877   3.481 0.000528 ***
## HeatingGrav       0.3390236  0.1446595   2.344 0.019350 *  
## HeatingWall       0.1171465  0.1471950   0.796 0.426357    
## Central.AirY      0.1703173  0.0312530   5.450 6.78e-08 ***
## Bedroom.AbvGr    -0.0439392  0.0093640  -4.692 3.19e-06 ***
## Kitchen.AbvGr    -0.0988982  0.0324339  -3.049 0.002372 ** 
## Kitchen.QualFa   -0.2413396  0.0512914  -4.705 3.00e-06 ***
## Kitchen.QualGd   -0.2314970  0.0247071  -9.370  < 2e-16 ***
## Kitchen.QualPo    0.0923343  0.1480338   0.624 0.532982    
## Kitchen.QualTA   -0.2999732  0.0279207 -10.744  < 2e-16 ***
## TotRms.AbvGrd     0.0115864  0.0065618   1.766 0.077833 .  
## Fireplaces        0.0774477  0.0094210   8.221 8.44e-16 ***
## Garage.Yr.Blt    -0.0004732  0.0004587  -1.032 0.302506    
## log(Garage.Area)  0.1429367  0.0185184   7.719 3.61e-14 ***
## Pool.Area        -0.0001527  0.0001517  -1.006 0.314718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1417 on 778 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.8558, Adjusted R-squared:  0.8523 
## F-statistic:   243 on 19 and 778 DF,  p-value: < 2.2e-16
  • The Residual standard error: 0.1417 on 778 degrees of freedom

  • Adjusted R-sqaured: 0.8523


Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


The present model does not include any variable interactions. During the process of exploring variable interaction data, we could not achieve as strong of an R-squared without creating unnecessary complexity for such a minor fraction of model improvement. As a result, we continue with your present model.


Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


We arrived at the final.model using AIC Stepwise Regression, in which each regressor is considered at every step. This method does not permanently remove a regressor from consideration, and uses the model.aic model developed in section two as a foundation. We added and removed variables stepwise until only the strongest model remains.

full.model3.bas = bas.lm(log(price) ~ log(area)+ Overall.Cond+ Year.Built+Central.Air+ Kitchen.Qual+ Fireplaces, prior = "BIC",
                  method = "MCMC",
                  modelprior = uniform(), data = ames_train_clean)
                  
summary(full.model3.bas)
##                P(B != 0 | Y)    model 1       model 2       model 3
## Intercept         1.00000000     1.0000  1.000000e+00  1.000000e+00
## log(area)         0.99960938     1.0000  1.000000e+00  1.000000e+00
## Overall.Cond      0.99775391     1.0000  1.000000e+00  1.000000e+00
## Year.Built        0.99902344     1.0000  1.000000e+00  1.000000e+00
## Central.AirY      0.99687500     1.0000  1.000000e+00  1.000000e+00
## Kitchen.QualFa    0.99482422     1.0000  1.000000e+00  0.000000e+00
## Kitchen.QualGd    0.99775391     1.0000  1.000000e+00  1.000000e+00
## Kitchen.QualPo    0.05195313     0.0000  1.000000e+00  0.000000e+00
## Kitchen.QualTA    0.99804688     1.0000  1.000000e+00  1.000000e+00
## Fireplaces        1.00000000     1.0000  1.000000e+00  1.000000e+00
## BF                        NA     1.0000  3.781668e-02  4.298596e-12
## PostProbs                 NA     0.9402  5.190000e-02  3.400000e-03
## R2                        NA     0.8368  8.368000e-01  8.248000e-01
## dim                       NA     9.0000  1.000000e+01  8.000000e+00
## logmarg                   NA -1275.9448 -1.279220e+03 -1.302118e+03
##                      model 4       model 5
## Intercept       1.000000e+00  1.000000e+00
## log(area)       1.000000e+00  1.000000e+00
## Overall.Cond    1.000000e+00  0.000000e+00
## Year.Built      1.000000e+00  1.000000e+00
## Central.AirY    0.000000e+00  0.000000e+00
## Kitchen.QualFa  1.000000e+00  1.000000e+00
## Kitchen.QualGd  1.000000e+00  0.000000e+00
## Kitchen.QualPo  0.000000e+00  0.000000e+00
## Kitchen.QualTA  1.000000e+00  0.000000e+00
## Fireplaces      1.000000e+00  1.000000e+00
## BF              3.992883e-06  1.387492e-61
## PostProbs       1.300000e-03  9.000000e-04
## R2              8.305000e-01  7.641000e-01
## dim             8.000000e+00  5.000000e+00
## logmarg        -1.288376e+03 -1.416075e+03
image(full.model3.bas, top.models = 7)

final.model = lm(log(price) ~ log(area) + Overall.Cond + Year.Built + Central.Air + Kitchen.Qual + Fireplaces, data = ames_train_clean)

Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


Here, we calculate the proportion of observations that fall within prediction intervals to see the coverage difference between train and test environments.

model.train<-mean(exp(predict(final.model, ames_train)))
model.test<-mean(exp(predict(final.model, ames_test)))
model.train>model.test
## [1] TRUE
# Predict prices train
predict.train <- exp(predict(final.model, ames_train, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage.train <- mean(ames_train$price > predict.train[,"lwr"] &
                            ames_train$price < predict.train[,"upr"])
coverage.train
## [1] 0.921
predict.full <- exp(predict(final.model, ames_test, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage.test <- mean(ames_test$price > predict.full[,"lwr"] &
                            ames_test$price < predict.full[,"upr"])
coverage.test
## [1] 0.9497549

The test model falls within the prediction interval

with results of 94.9%, while the original has results of 92.1%.


Part 4 Final Model Assessment

Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


plot(final.model)

hist(final.model$residuals)

qqnorm(final.model$residuals)
qqline(final.model$residuals)

The residual plots allow us to see that there is no obvious or major assumption violation outside of the normal distribution tails. A few points show high leverage in the Cook’s plot, but this is to be expected as the linear model uses so many categorical variables.


Section 4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


Here we are going to extract the prediction values and residuals to calculate the RMSE.

predict.model.train <- exp(predict(final.model, ames_train))
resid.train.final <- ames_train$price - predict.model.train
rmse.train.final <- sqrt(mean(resid.train.final^2))

predict.model.test <- exp(predict(final.model, ames_test))
resid.test.final <- ames_test$price - predict.model.test
rmse.test.final <- sqrt(mean(resid.test.final^2))

rmse.train.final
## [1] 37910.31
rmse.test.final
## [1] 32361.88
model.rmse
## [1] 29362.77
rmse.test
## [1] 25278.85

Initial train vs test: - Initial train model’s RMSE is $29362.77 - Initial test model’s RMSE is $25278.85

Final train vs test: - Final train model’s RMSE is $37910.31 - Final test model’s RMSE is $32361.88

We can see that the model fits well, because both test models resulted in lower RMSE values.


Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


Our strength is that the final model performs well with a high posterior probability. Our weakness is that we cannot 100% guarantee our delivery of the absolute best predicted house values (even with high confidence and normalized distribution).


Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the “ames_validation” dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:

  • What is the RMSE of your final model when applied to the validation data?
  • How does this value compare to that of the training data and/or testing data?
  • What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
  • From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")

paste(nrow(ames_validation), "observations")
## [1] "763 observations"
paste(nrow(ames_train_clean), "observations")
## [1] "834 observations"
predict.model.validation <- exp(predict(final.model, ames_validation))

resid.validation <- ames_validation$price - predict.model.validation 

rmse.validation <- sqrt(mean(resid.validation^2))

rmse.validation
## [1] 29329.01
set.seed(555664)

confint_ames = exp(confint(predict(full.model3.bas, ames_validation, se.fit = TRUE), parm="pred"))

length(which(ifelse(ames_validation$price > confint_ames[,"2.5%"] & ames_validation$price < confint_ames[,"97.5%"], "yes", "no") == "yes"))/nrow(confint_ames)*100
## [1] 95.80603

We observe that there is less than a 4.2% chance that predicted values derived from the final.model fall outside of a 95% confidence interval.


Part 5 Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


Our final.model provided 95.8% accuracy in predicting the selling price of homes in Ames, Iowa.

The model can be applied to out-of-case data as depicted by the results of our model. We also learned that Ames, Iowa is a diverse market for real estate investing, with homes varying greatly in price, age, square footage, and much more. Given this wide variability, it makes the efficacy of our prediction model that much more useful.


Capstone Project: Peer Assessment II by Cory Torrella Statistics with R Specialization, Duke University