getwd()
## [1] "/Users/marinazub/Desktop/Study/statistics and R/capstone project/capstone"

1 Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

2 Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

load("ames_train.Rdata")

Use the code block below to load any necessary packages

library(statsr)
library(dplyr)
library(BAS)
library(corrplot)
library(GGally)
library(ggplot2)
library(MASS)

2.1 Part 1 - Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


Because the goal of the analysis is to find out which variables are the most predictable to form a price of a house, we will plot different variables against price. For the purpose of deep exploration of the dataset, we picked:
Overall.Qual + Garage.Area +
Total.Bsmt.SF + Garage.Cars + area+ Full.Bath + Half.Bath + Bedroom.AbvGr + Year.Built + X1st.Flr.SF + X2nd.Flr.SF +
Lot.Area + Central.Air + Overall.Cond
variables. According to the plotting results, we kept only 3 the most correlated results.

3 Familarize yourself with the dataset

head(ames_train)
## # A tibble: 6 x 81
##         PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area
##       <int> <int>  <int>       <int>    <fctr>        <int>    <int>
## 1 909176150   856 126000          30        RL           NA     7890
## 2 905476230  1049 139500         120        RL           42     4235
## 3 911128020  1001 124900          30   C (all)           60     6060
## 4 535377150  1039 114000          70        RL           80     8146
## 5 534177230  1665 227000          60        RL           70     8400
## 6 908128060  1922 198500          85        RL           64     7301
## # ... with 74 more variables: Street <fctr>, Alley <fctr>,
## #   Lot.Shape <fctr>, Land.Contour <fctr>, Utilities <fctr>,
## #   Lot.Config <fctr>, Land.Slope <fctr>, Neighborhood <fctr>,
## #   Condition.1 <fctr>, Condition.2 <fctr>, Bldg.Type <fctr>,
## #   House.Style <fctr>, Overall.Qual <int>, Overall.Cond <int>,
## #   Year.Built <int>, Year.Remod.Add <int>, Roof.Style <fctr>,
## #   Roof.Matl <fctr>, Exterior.1st <fctr>, Exterior.2nd <fctr>,
## #   Mas.Vnr.Type <fctr>, Mas.Vnr.Area <int>, Exter.Qual <fctr>,
## #   Exter.Cond <fctr>, Foundation <fctr>, Bsmt.Qual <fctr>,
## #   Bsmt.Cond <fctr>, Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>,
## #   BsmtFin.SF.1 <int>, BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>,
## #   Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Heating <fctr>,
## #   Heating.QC <fctr>, Central.Air <fctr>, Electrical <fctr>,
## #   X1st.Flr.SF <int>, X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>,
## #   Bsmt.Full.Bath <int>, Bsmt.Half.Bath <int>, Full.Bath <int>,
## #   Half.Bath <int>, Bedroom.AbvGr <int>, Kitchen.AbvGr <int>,
## #   Kitchen.Qual <fctr>, TotRms.AbvGrd <int>, Functional <fctr>,
## #   Fireplaces <int>, Fireplace.Qu <fctr>, Garage.Type <fctr>,
## #   Garage.Yr.Blt <int>, Garage.Finish <fctr>, Garage.Cars <int>,
## #   Garage.Area <int>, Garage.Qual <fctr>, Garage.Cond <fctr>,
## #   Paved.Drive <fctr>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
## #   Enclosed.Porch <int>, X3Ssn.Porch <int>, Screen.Porch <int>,
## #   Pool.Area <int>, Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>,
## #   Misc.Val <int>, Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>,
## #   Sale.Condition <fctr>
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1000 obs. of  81 variables:
##  $ PID            : int  909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
##  $ area           : int  856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
##  $ price          : int  126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
##  $ MS.SubClass    : int  30 120 30 70 60 85 20 20 20 180 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
##  $ Lot.Frontage   : int  NA 42 60 80 70 64 60 53 74 35 ...
##  $ Lot.Area       : int  7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
##  $ Overall.Qual   : int  6 5 5 4 8 7 4 7 5 6 ...
##  $ Overall.Cond   : int  6 5 9 8 6 5 4 5 6 5 ...
##  $ Year.Built     : int  1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
##  $ Year.Remod.Add : int  1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
##  $ Mas.Vnr.Area   : int  0 149 0 0 0 500 0 20 0 76 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
##  $ BsmtFin.SF.1   : int  238 552 737 0 643 0 0 0 647 467 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
##  $ BsmtFin.SF.2   : int  0 393 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  618 104 100 405 167 0 936 1146 217 80 ...
##  $ Total.Bsmt.SF  : int  856 1049 837 405 810 0 936 1146 864 547 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  856 1049 1001 717 810 495 936 1246 889 1072 ...
##  $ X2nd.Flr.SF    : int  0 0 0 322 855 1427 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  1 1 0 0 1 0 0 0 0 1 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 2 1 1 2 3 1 2 1 1 ...
##  $ Half.Bath      : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 2 2 2 3 4 2 2 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
##  $ TotRms.AbvGrd  : int  4 5 5 6 6 7 4 5 6 5 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
##  $ Garage.Yr.Blt  : int  1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
##  $ Garage.Cars    : int  2 1 1 1 2 2 2 2 2 2 ...
##  $ Garage.Area    : int  399 266 216 281 528 672 576 428 484 525 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 154 0 0 0 0 100 0 0 ...
##  $ Open.Porch.SF  : int  0 105 0 0 45 0 32 24 0 44 ...
##  $ Enclosed.Porch : int  0 0 42 168 0 177 112 0 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 86 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  166 0 0 111 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  3 2 11 5 11 7 2 3 4 5 ...
##  $ Yr.Sold        : int  2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
head(ames_train)
## # A tibble: 6 x 81
##         PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area
##       <int> <int>  <int>       <int>    <fctr>        <int>    <int>
## 1 909176150   856 126000          30        RL           NA     7890
## 2 905476230  1049 139500         120        RL           42     4235
## 3 911128020  1001 124900          30   C (all)           60     6060
## 4 535377150  1039 114000          70        RL           80     8146
## 5 534177230  1665 227000          60        RL           70     8400
## 6 908128060  1922 198500          85        RL           64     7301
## # ... with 74 more variables: Street <fctr>, Alley <fctr>,
## #   Lot.Shape <fctr>, Land.Contour <fctr>, Utilities <fctr>,
## #   Lot.Config <fctr>, Land.Slope <fctr>, Neighborhood <fctr>,
## #   Condition.1 <fctr>, Condition.2 <fctr>, Bldg.Type <fctr>,
## #   House.Style <fctr>, Overall.Qual <int>, Overall.Cond <int>,
## #   Year.Built <int>, Year.Remod.Add <int>, Roof.Style <fctr>,
## #   Roof.Matl <fctr>, Exterior.1st <fctr>, Exterior.2nd <fctr>,
## #   Mas.Vnr.Type <fctr>, Mas.Vnr.Area <int>, Exter.Qual <fctr>,
## #   Exter.Cond <fctr>, Foundation <fctr>, Bsmt.Qual <fctr>,
## #   Bsmt.Cond <fctr>, Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>,
## #   BsmtFin.SF.1 <int>, BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>,
## #   Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Heating <fctr>,
## #   Heating.QC <fctr>, Central.Air <fctr>, Electrical <fctr>,
## #   X1st.Flr.SF <int>, X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>,
## #   Bsmt.Full.Bath <int>, Bsmt.Half.Bath <int>, Full.Bath <int>,
## #   Half.Bath <int>, Bedroom.AbvGr <int>, Kitchen.AbvGr <int>,
## #   Kitchen.Qual <fctr>, TotRms.AbvGrd <int>, Functional <fctr>,
## #   Fireplaces <int>, Fireplace.Qu <fctr>, Garage.Type <fctr>,
## #   Garage.Yr.Blt <int>, Garage.Finish <fctr>, Garage.Cars <int>,
## #   Garage.Area <int>, Garage.Qual <fctr>, Garage.Cond <fctr>,
## #   Paved.Drive <fctr>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
## #   Enclosed.Porch <int>, X3Ssn.Porch <int>, Screen.Porch <int>,
## #   Pool.Area <int>, Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>,
## #   Misc.Val <int>, Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>,
## #   Sale.Condition <fctr>
tail(ames_train)
## # A tibble: 6 x 81
##         PID  area  price MS.SubClass MS.Zoning Lot.Frontage Lot.Area
##       <int> <int>  <int>       <int>    <fctr>        <int>    <int>
## 1 528354110  2398 315750          60        RL           95    11787
## 2 907290250   848 145000         120        RM           NA     4426
## 3 528480150  1576 197000          60        FV           65     8125
## 4 534427010  1728  84900          90        RL           98    13260
## 5 905106140  1352 158000          60        RL           80     9364
## 6 914452120   912 156000          85        RL           NA     7540
## # ... with 74 more variables: Street <fctr>, Alley <fctr>,
## #   Lot.Shape <fctr>, Land.Contour <fctr>, Utilities <fctr>,
## #   Lot.Config <fctr>, Land.Slope <fctr>, Neighborhood <fctr>,
## #   Condition.1 <fctr>, Condition.2 <fctr>, Bldg.Type <fctr>,
## #   House.Style <fctr>, Overall.Qual <int>, Overall.Cond <int>,
## #   Year.Built <int>, Year.Remod.Add <int>, Roof.Style <fctr>,
## #   Roof.Matl <fctr>, Exterior.1st <fctr>, Exterior.2nd <fctr>,
## #   Mas.Vnr.Type <fctr>, Mas.Vnr.Area <int>, Exter.Qual <fctr>,
## #   Exter.Cond <fctr>, Foundation <fctr>, Bsmt.Qual <fctr>,
## #   Bsmt.Cond <fctr>, Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>,
## #   BsmtFin.SF.1 <int>, BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>,
## #   Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Heating <fctr>,
## #   Heating.QC <fctr>, Central.Air <fctr>, Electrical <fctr>,
## #   X1st.Flr.SF <int>, X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>,
## #   Bsmt.Full.Bath <int>, Bsmt.Half.Bath <int>, Full.Bath <int>,
## #   Half.Bath <int>, Bedroom.AbvGr <int>, Kitchen.AbvGr <int>,
## #   Kitchen.Qual <fctr>, TotRms.AbvGrd <int>, Functional <fctr>,
## #   Fireplaces <int>, Fireplace.Qu <fctr>, Garage.Type <fctr>,
## #   Garage.Yr.Blt <int>, Garage.Finish <fctr>, Garage.Cars <int>,
## #   Garage.Area <int>, Garage.Qual <fctr>, Garage.Cond <fctr>,
## #   Paved.Drive <fctr>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
## #   Enclosed.Porch <int>, X3Ssn.Porch <int>, Screen.Porch <int>,
## #   Pool.Area <int>, Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>,
## #   Misc.Val <int>, Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>,
## #   Sale.Condition <fctr>
summary(ames_train)
##       PID                 area          price         MS.SubClass    
##  Min.   :5.263e+08   Min.   : 334   Min.   : 12789   Min.   : 20.00  
##  1st Qu.:5.285e+08   1st Qu.:1092   1st Qu.:129762   1st Qu.: 20.00  
##  Median :5.354e+08   Median :1411   Median :159467   Median : 50.00  
##  Mean   :7.059e+08   Mean   :1477   Mean   :181190   Mean   : 57.15  
##  3rd Qu.:9.071e+08   3rd Qu.:1743   3rd Qu.:213000   3rd Qu.: 70.00  
##  Max.   :1.007e+09   Max.   :4676   Max.   :615000   Max.   :190.00  
##                                                                      
##    MS.Zoning    Lot.Frontage       Lot.Area       Street     Alley    
##  A (agr):  0   Min.   : 21.00   Min.   :  1470   Grvl:  3   Grvl: 33  
##  C (all):  9   1st Qu.: 57.00   1st Qu.:  7314   Pave:997   Pave: 34  
##  FV     : 56   Median : 69.00   Median :  9317              NA's:933  
##  I (all):  1   Mean   : 69.21   Mean   : 10352                        
##  RH     :  7   3rd Qu.: 80.00   3rd Qu.: 11650                        
##  RL     :772   Max.   :313.00   Max.   :215245                        
##  RM     :155   NA's   :167                                            
##  Lot.Shape Land.Contour  Utilities      Lot.Config  Land.Slope
##  IR1:338   Bnk: 33      AllPub:1000   Corner :173   Gtl:962   
##  IR2: 30   HLS: 38      NoSeWa:   0   CulDSac: 76   Mod: 33   
##  IR3:  3   Low: 20      NoSewr:   0   FR2    : 36   Sev:  5   
##  Reg:629   Lvl:909                    FR3    :  5             
##                                       Inside :710             
##                                                               
##                                                               
##   Neighborhood  Condition.1   Condition.2   Bldg.Type    House.Style 
##  NAmes  :155   Norm   :875   Norm   :988   1Fam  :823   1Story :521  
##  CollgCr: 85   Feedr  : 53   Feedr  :  6   2fmCon: 20   2Story :286  
##  Somerst: 74   Artery : 23   Artery :  2   Duplex: 35   1.5Fin : 98  
##  OldTown: 71   RRAn   : 14   PosN   :  2   Twnhs : 38   SLvl   : 41  
##  Sawyer : 61   PosN   : 11   PosA   :  1   TwnhsE: 84   SFoyer : 36  
##  Edwards: 60   RRAe   : 11   RRNn   :  1                2.5Unf : 10  
##  (Other):494   (Other): 13   (Other):  0                (Other):  8  
##   Overall.Qual     Overall.Cond     Year.Built   Year.Remod.Add
##  Min.   : 1.000   Min.   :1.000   Min.   :1872   Min.   :1950  
##  1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1955   1st Qu.:1966  
##  Median : 6.000   Median :5.000   Median :1975   Median :1992  
##  Mean   : 6.095   Mean   :5.559   Mean   :1972   Mean   :1984  
##  3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2001   3rd Qu.:2004  
##  Max.   :10.000   Max.   :9.000   Max.   :2010   Max.   :2010  
##                                                                
##    Roof.Style    Roof.Matl    Exterior.1st  Exterior.2nd  Mas.Vnr.Type
##  Flat   :  9   CompShg:984   VinylSd:349   VinylSd:345          :  7  
##  Gable  :775   Tar&Grv: 11   HdBoard:164   HdBoard:150   BrkCmn :  8  
##  Gambrel:  8   WdShake:  2   MetalSd:147   MetalSd:148   BrkFace:317  
##  Hip    :204   WdShngl:  2   Wd Sdng:138   Wd Sdng:130   CBlock :  0  
##  Mansard:  4   Metal  :  1   Plywood: 74   Plywood: 96   None   :593  
##  Shed   :  0   ClyTile:  0   CemntBd: 40   CmentBd: 40   Stone  : 75  
##                (Other):  0   (Other): 88   (Other): 91                
##   Mas.Vnr.Area    Exter.Qual Exter.Cond  Foundation  Bsmt.Qual  Bsmt.Cond 
##  Min.   :   0.0   Ex: 39     Ex:  4     BrkTil:102       :  1       :  1  
##  1st Qu.:   0.0   Fa: 11     Fa: 19     CBlock:430   Ex  : 87   Ex  :  2  
##  Median :   0.0   Gd:337     Gd:116     PConc :453   Fa  : 28   Fa  : 23  
##  Mean   : 104.1   TA:613     Po:  0     Slab  : 12   Gd  :424   Gd  : 44  
##  3rd Qu.: 160.0              TA:861     Stone :  3   Po  :  1   Po  :  1  
##  Max.   :1290.0                         Wood  :  0   TA  :438   TA  :908  
##  NA's   :7                                           NA's: 21   NA's: 21  
##  Bsmt.Exposure BsmtFin.Type.1  BsmtFin.SF.1    BsmtFin.Type.2
##      :  2      GLQ    :294    Min.   :   0.0   Unf    :863   
##  Av  :157      Unf    :279    1st Qu.:   0.0   LwQ    : 31   
##  Gd  : 98      ALQ    :163    Median : 400.0   Rec    : 29   
##  Mn  : 87      Rec    :107    Mean   : 464.1   BLQ    : 24   
##  No  :635      BLQ    : 87    3rd Qu.: 773.0   ALQ    : 20   
##  NA's: 21      (Other): 49    Max.   :2260.0   (Other): 12   
##                NA's   : 21    NA's   :1        NA's   : 21   
##   BsmtFin.SF.2      Bsmt.Unf.SF     Total.Bsmt.SF     Heating   
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Floor:  0  
##  1st Qu.:   0.00   1st Qu.: 223.5   1st Qu.: 797.5   GasA :988  
##  Median :   0.00   Median : 461.0   Median : 998.0   GasW :  8  
##  Mean   :  48.07   Mean   : 547.0   Mean   :1059.2   Grav :  2  
##  3rd Qu.:   0.00   3rd Qu.: 783.0   3rd Qu.:1301.0   OthW :  1  
##  Max.   :1526.00   Max.   :2336.0   Max.   :3138.0   Wall :  1  
##  NA's   :1         NA's   :1        NA's   :1                   
##  Heating.QC Central.Air Electrical   X1st.Flr.SF      X2nd.Flr.SF    
##  Ex:516     N: 55            :  0   Min.   : 334.0   Min.   :   0.0  
##  Fa: 22     Y:945       FuseA: 54   1st Qu.: 876.2   1st Qu.:   0.0  
##  Gd:157                 FuseF: 12   Median :1080.5   Median :   0.0  
##  Po:  1                 FuseP:  2   Mean   :1157.1   Mean   : 315.2  
##  TA:304                 Mix  :  0   3rd Qu.:1376.2   3rd Qu.: 688.2  
##                         SBrkr:932   Max.   :3138.0   Max.   :1836.0  
##                                                                      
##  Low.Qual.Fin.SF   Bsmt.Full.Bath   Bsmt.Half.Bath      Full.Bath    
##  Min.   :   0.00   Min.   :0.0000   Min.   :0.00000   Min.   :0.000  
##  1st Qu.:   0.00   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000  
##  Median :   0.00   Median :0.0000   Median :0.00000   Median :2.000  
##  Mean   :   4.32   Mean   :0.4474   Mean   :0.06106   Mean   :1.541  
##  3rd Qu.:   0.00   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000  
##  Max.   :1064.00   Max.   :3.0000   Max.   :2.00000   Max.   :4.000  
##                    NA's   :1        NA's   :1                        
##    Half.Bath     Bedroom.AbvGr   Kitchen.AbvGr   Kitchen.Qual
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Ex: 67      
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:1.000   Fa: 20      
##  Median :0.000   Median :3.000   Median :1.000   Gd:403      
##  Mean   :0.378   Mean   :2.806   Mean   :1.039   Po:  1      
##  3rd Qu.:1.000   3rd Qu.:3.000   3rd Qu.:1.000   TA:509      
##  Max.   :2.000   Max.   :6.000   Max.   :2.000               
##                                                              
##  TotRms.AbvGrd     Functional    Fireplaces    Fireplace.Qu  Garage.Type 
##  Min.   : 2.00   Typ    :935   Min.   :0.000   Ex  : 16     2Types : 10  
##  1st Qu.: 5.00   Min2   : 24   1st Qu.:0.000   Fa  : 24     Attchd :610  
##  Median : 6.00   Min1   : 18   Median :1.000   Gd  :232     Basment: 11  
##  Mean   : 6.34   Mod    : 16   Mean   :0.597   Po  : 18     BuiltIn: 56  
##  3rd Qu.: 7.00   Maj1   :  4   3rd Qu.:1.000   TA  :219     CarPort:  1  
##  Max.   :13.00   Maj2   :  2   Max.   :4.000   NA's:491     Detchd :266  
##                  (Other):  1                                NA's   : 46  
##  Garage.Yr.Blt  Garage.Finish  Garage.Cars     Garage.Area     Garage.Qual
##  Min.   :1900       :  2      Min.   :0.000   Min.   :   0.0       :  1   
##  1st Qu.:1961   Fin :247      1st Qu.:1.000   1st Qu.: 312.0   Ex  :  1   
##  Median :1979   RFn :278      Median :2.000   Median : 480.0   Fa  : 37   
##  Mean   :1978   Unf :427      Mean   :1.767   Mean   : 475.4   Gd  :  7   
##  3rd Qu.:2002   NA's: 46      3rd Qu.:2.000   3rd Qu.: 576.0   Po  :  3   
##  Max.   :2010                 Max.   :5.000   Max.   :1390.0   TA  :904   
##  NA's   :48                   NA's   :1       NA's   :1        NA's: 47   
##  Garage.Cond Paved.Drive  Wood.Deck.SF    Open.Porch.SF   
##      :  1    N: 67       Min.   :  0.00   Min.   :  0.00  
##  Ex  :  1    P: 29       1st Qu.:  0.00   1st Qu.:  0.00  
##  Fa  : 21    Y:904       Median :  0.00   Median : 28.00  
##  Gd  :  6                Mean   : 93.84   Mean   : 48.93  
##  Po  :  6                3rd Qu.:168.00   3rd Qu.: 74.00  
##  TA  :918                Max.   :857.00   Max.   :742.00  
##  NA's: 47                                                 
##  Enclosed.Porch    X3Ssn.Porch       Screen.Porch      Pool.Area      
##  Min.   :  0.00   Min.   :  0.000   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.000   Median :  0.00   Median :  0.000  
##  Mean   : 23.48   Mean   :  3.118   Mean   : 14.77   Mean   :  1.463  
##  3rd Qu.:  0.00   3rd Qu.:  0.000   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :432.00   Max.   :508.000   Max.   :440.00   Max.   :800.000  
##                                                                       
##  Pool.QC      Fence     Misc.Feature    Misc.Val           Mo.Sold      
##  Ex  :  1   GdPrv: 43   Elev:  0     Min.   :    0.00   Min.   : 1.000  
##  Fa  :  1   GdWo : 37   Gar2:  2     1st Qu.:    0.00   1st Qu.: 4.000  
##  Gd  :  1   MnPrv:120   Othr:  1     Median :    0.00   Median : 6.000  
##  TA  :  0   MnWw :  2   Shed: 25     Mean   :   45.81   Mean   : 6.243  
##  NA's:997   NA's :798   TenC:  1     3rd Qu.:    0.00   3rd Qu.: 8.000  
##                         NA's:971     Max.   :15500.00   Max.   :12.000  
##                                                                         
##     Yr.Sold       Sale.Type   Sale.Condition
##  Min.   :2006   WD     :863   Abnorml: 61   
##  1st Qu.:2007   New    : 79   AdjLand:  2   
##  Median :2008   COD    : 27   Alloca :  4   
##  Mean   :2008   ConLD  :  7   Family : 17   
##  3rd Qu.:2009   ConLw  :  6   Normal :834   
##  Max.   :2010   Con    :  5   Partial: 82   
##                 (Other): 13

4 Finding missing values

k<-colSums(is.na(ames_train))
missingval<-sort(k, decreasing = TRUE)[1:20]
barplot(missingval, main = "Missing values", las = 2 )

Out of this summary we can see that the Pool QC. Alles. Misc.Feature. Fence and Fireplace are the valuses with top-5 missing data, it explains that not a lot of houses in that area has these features.

5 Next, we would like to know something about price range and how much is the average price for a house in Ames, Iowa.

hist(ames_train$price, main = " House price", xlab = "price", ylab = "amount", col = "green")

summary(ames_train$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12790  129800  159500  181200  213000  615000

… here we see, because of the median is the most appropriate tool to mesure that, meadian of the house price is 159500. A As we can see, the diagram is right skewed which makes us conclude that there are more cheap houses than expensive one.

6 Now we can start our exploration what’s exactly afffects the price.

hist(ames_train$area, main = "Area", xlab = "area", col = "red")

hist(log(ames_train$area), main = "Log:Area", xlab = "area", col = "red")

hist(ames_train$Year.Built, main = "Year of Building", col = "blue")

boxplot(ames_train$price~ames_train$Overall.Cond, main = "Overall Condition", xlab = "condition", col = "yellow")

hist(ames_train$Overall.Cond)

#adjusting overall.cond variable
overallcond<- ames_train %>% 
        filter(Overall.Cond != 5)%>%
        group_by(Overall.Cond)%>%
        summarise(median = median(price), mean = mean(price))

table(ames_train$Overall.Cond)
## 
##   1   2   3   4   5   6   7   8   9 
##   3   3  14  36 561 193 131  47  12
plot(overallcond, col = "red")

#checking the relationships of the price and the variables

plot(ames_train$price~ames_train$area, main = "Area vs Price", xlab = "area", col = "red")

plot(log(ames_train$price)~log(ames_train$area), main = "Log:Area vs Price", xlab = "area", col = "red")

plot(ames_train$price~ames_train$Year.Built, main = "Year of Building vs Price", col = "blue")

plot(ames_train$price~ames_train$Overall.Cond, main = "Overall Condition vs Price", xlab = "condition", col = "yellow")


After examining different variables, we stopped on three of them which appeared to be the most linear to the price: Year.Built, area, Overall.Cond. After log transgormation the price and area we could see almost linear relationship of the scatters on the plot which tells us about correlation of both of the vriables. Ploting the Year of built we see, also, correltaion betwen the price and year; also, it gives us sence that the newer the house the higher the price and we could see the Ames experienced building boom from the last decade of 20 century. Also we can observe a linear relationship between the Overall condition and price. Applying boxplot, we notice that the higher the overall condition, the higher the house price is, except the condition #5 which is quite interesting to look at. On the diagram we can see the houses with condition = 5 has higher average price and the higher price range, too. Putting that on the histogram we found the explanation, the total number of the houses with quality = 5 is prevalient of the market. After excluding the condition #5 and proting the results, we observing quite linear situation of price and Overall.Cond. variables. We included log transformed variables of price and area because,based on EDA, they showed skeweness and non-symmetrical distribution which could affect the final results.

6.1 Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

6.1.1 Section 2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


Running the multilinear model we included all variables from the previous step and in the summary table we could see that the variables, which were discussed above in EDA, show the lowest p-values (marked with triple stars). At this moment it’s interesting to see the other variables which could be also included in the model which show the same low values: Total.Bsmt.SF, Bedroom.AbvGr, Lot.Area, Central.AirY which are quite corrsponded to the are variable (the bigger the area is, the more bedrooms the house has, etc) This model does not have a good prediction potential that’s why it will accept some transformations later. Even though current model has a low residual standart error = 0.1635 and quite firm Adjusted r-squared = 0.8487 , some of the variabels has quite high p-value which are suggested to be removed because they do not provide significant values to the model, even worse, make it heavier, increase standart error.

model.lm=lm(log(price) ~  Overall.Cond+ log(area)+ Year.Built+ Garage.Area +   
                  Total.Bsmt.SF + Garage.Cars + 
                  Full.Bath + Half.Bath + 
                  Bedroom.AbvGr +  X1st.Flr.SF + 
                  X2nd.Flr.SF +  
                  log(Lot.Area) +  Central.Air    , data=ames_train )
summary(model.lm)
## 
## Call:
## lm(formula = log(price) ~ Overall.Cond + log(area) + Year.Built + 
##     Garage.Area + Total.Bsmt.SF + Garage.Cars + Full.Bath + Half.Bath + 
##     Bedroom.AbvGr + X1st.Flr.SF + X2nd.Flr.SF + log(Lot.Area) + 
##     Central.Air, data = ames_train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.57637 -0.08183 -0.00030  0.09023  0.56662 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.699e+00  7.253e-01  -6.479 1.46e-10 ***
## Overall.Cond   8.236e-02  5.480e-03  15.030  < 2e-16 ***
## log(area)      6.120e-01  6.416e-02   9.539  < 2e-16 ***
## Year.Built     5.491e-03  2.828e-04  19.418  < 2e-16 ***
## Garage.Area    4.053e-06  5.686e-05   0.071 0.943194    
## Total.Bsmt.SF  2.100e-04  2.330e-05   9.015  < 2e-16 ***
## Garage.Cars    5.317e-02  1.685e-02   3.155 0.001655 ** 
## Full.Bath     -2.598e-02  1.559e-02  -1.666 0.095961 .  
## Half.Bath     -2.698e-02  1.513e-02  -1.783 0.074967 .  
## Bedroom.AbvGr -7.746e-02  8.129e-03  -9.528  < 2e-16 ***
## X1st.Flr.SF    4.398e-05  4.876e-05   0.902 0.367222    
## X2nd.Flr.SF    8.054e-05  4.241e-05   1.899 0.057809 .  
## log(Lot.Area)  8.562e-02  1.173e-02   7.297 6.06e-13 ***
## Central.AirY   9.941e-02  2.685e-02   3.703 0.000225 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1635 on 984 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.8507, Adjusted R-squared:  0.8487 
## F-statistic: 431.2 on 13 and 984 DF,  p-value: < 2.2e-16

6.1.2 Section 2.2 Model Selection

Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


At that point we will run 2 different model selection methods: BIC with Markov Chain Monte Carlo and AIC with Bayesian Adaptive Sampling .

When AIC tries to select the model that most adequately describes an unknown, high dimensional reality. This means that reality is never in the set of candidate models that are being considered. On the contrary, BIC tries to find the TRUE model among the set of candidates.

MCMC are a class of algorithms for sampling from a probability distribution based on constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample of the desired distribution. The quality of the sample improves as a function of the number of steps.

BAS algorithm samples models without replacement from the space of models. For problems that permit enumeration of all models, BAS is guaranteed to enumerate the model space in 2p iterations where p is the number of potential variables under consideration. For larger problems where sampling is required, we provide conditions under which BAS provides perfect samples without replacement.

Trying Bayesian model averaging, with BIC steps, we see the model suggests to exclude Garage.Area, Full.Bath, Half.Bath, X1st.Flr.SF, X2nd.Flr.SF variables when AIC suggest to remove Garage.Area and X1st.Flr.SF only. Posterior probability in the first case is 0.7501 , in case of AIC = 0.0894, which tells us the model with higher posterior probability has the better predistion quality. At the same time, the aic model still has its right to be because its probability is higher then the prior - uniform. R2s stay aproximate the same in both cases.

model.bic = bas.lm(log(price) ~  Overall.Cond+ log(area)+ Year.Built+ Garage.Area +   
                  Total.Bsmt.SF + Garage.Cars + 
                  Full.Bath + Half.Bath + 
                  Bedroom.AbvGr +  X1st.Flr.SF + 
                  X2nd.Flr.SF +  
                  log(Lot.Area) +  Central.Air, 
                  prior = "BIC",
                  method = "MCMC",
                  modelprior = uniform(),
                  data=ames_train
                  )
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Garage.Area + : dropping 2 rows due to missing data
image(model.bic)

model.aic = bas.lm(log(price) ~  Overall.Cond+ log(area)+ Year.Built+ Garage.Area +   
                  Total.Bsmt.SF + Garage.Cars + 
                  Full.Bath + Half.Bath + 
                  Bedroom.AbvGr +  X1st.Flr.SF + 
                  X2nd.Flr.SF +  
                  log(Lot.Area) +  Central.Air, 
                  prior = "AIC",
                  method = "BAS",
                  modelprior = uniform(),
                  data=ames_train
                  )
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Garage.Area + : dropping 2 rows due to missing data
image(model.aic)

summary(model.bic)
##               P(B != 0 | Y)    model 1       model 2       model 3
## Intercept        1.00000000     1.0000  1.000000e+00  1.000000e+00
## Overall.Cond     0.99996948     1.0000  1.000000e+00  1.000000e+00
## log(area)        0.99994507     1.0000  1.000000e+00  1.000000e+00
## Year.Built       0.99987793     1.0000  1.000000e+00  1.000000e+00
## Garage.Area      0.04313354     0.0000  0.000000e+00  0.000000e+00
## Total.Bsmt.SF    0.99998779     1.0000  1.000000e+00  1.000000e+00
## Garage.Cars      0.99197388     1.0000  1.000000e+00  1.000000e+00
## Full.Bath        0.05531616     0.0000  0.000000e+00  1.000000e+00
## Half.Bath        0.04059448     0.0000  0.000000e+00  0.000000e+00
## Bedroom.AbvGr    0.99992065     1.0000  1.000000e+00  1.000000e+00
## X1st.Flr.SF      0.03625488     0.0000  0.000000e+00  0.000000e+00
## X2nd.Flr.SF      0.06660156     0.0000  1.000000e+00  0.000000e+00
## log(Lot.Area)    0.99982910     1.0000  1.000000e+00  1.000000e+00
## Central.AirY     0.95845337     1.0000  1.000000e+00  1.000000e+00
## BF                       NA     1.0000  7.122177e-02  5.370529e-02
## PostProbs                NA     0.7514  5.230000e-02  4.090000e-02
## R2                       NA     0.8496  8.499000e-01  8.498000e-01
## dim                      NA     9.0000  1.000000e+01  1.000000e+01
## logmarg                  NA -1666.2341 -1.668876e+03 -1.669158e+03
##                     model 4       model 5
## Intercept      1.000000e+00     1.0000000
## Overall.Cond   1.000000e+00     1.0000000
## log(area)      1.000000e+00     1.0000000
## Year.Built     1.000000e+00     1.0000000
## Garage.Area    0.000000e+00     0.0000000
## Total.Bsmt.SF  1.000000e+00     1.0000000
## Garage.Cars    1.000000e+00     1.0000000
## Full.Bath      0.000000e+00     0.0000000
## Half.Bath      0.000000e+00     1.0000000
## Bedroom.AbvGr  1.000000e+00     1.0000000
## X1st.Flr.SF    0.000000e+00     0.0000000
## X2nd.Flr.SF    0.000000e+00     0.0000000
## log(Lot.Area)  1.000000e+00     1.0000000
## Central.AirY   0.000000e+00     1.0000000
## BF             3.818177e-02     0.0404895
## PostProbs      3.240000e-02     0.0288000
## R2             8.476000e-01     0.8497000
## dim            8.000000e+00    10.0000000
## logmarg       -1.669500e+03 -1669.4408194
summary(model.aic)
##               P(B != 0 | Y)    model 1      model 2       model 3
## Intercept         1.0000000     1.0000     1.000000     1.0000000
## Overall.Cond      1.0000000     1.0000     1.000000     1.0000000
## log(area)         1.0000000     1.0000     1.000000     1.0000000
## Year.Built        1.0000000     1.0000     1.000000     1.0000000
## Garage.Area       0.2828088     0.0000     0.000000     0.0000000
## Total.Bsmt.SF     1.0000000     1.0000     1.000000     1.0000000
## Garage.Cars       0.9926658     1.0000     1.000000     1.0000000
## Full.Bath         0.4733494     1.0000     0.000000     0.0000000
## Half.Bath         0.4821001     1.0000     0.000000     0.0000000
## Bedroom.AbvGr     1.0000000     1.0000     1.000000     1.0000000
## X1st.Flr.SF       0.3445712     0.0000     0.000000     0.0000000
## X2nd.Flr.SF       0.5722852     1.0000     0.000000     1.0000000
## log(Lot.Area)     1.0000000     1.0000     1.000000     1.0000000
## Central.AirY      0.9963427     1.0000     1.000000     1.0000000
## BF                       NA     1.0000     0.922009     0.7631655
## PostProbs                NA     0.0894     0.082400     0.0682000
## R2                       NA     0.8505     0.849600     0.8499000
## dim                      NA    12.0000     9.000000    10.0000000
## logmarg                  NA -1644.0770 -1644.158217 -1644.3472971
##                     model 4       model 5
## Intercept         1.0000000     1.0000000
## Overall.Cond      1.0000000     1.0000000
## log(area)         1.0000000     1.0000000
## Year.Built        1.0000000     1.0000000
## Garage.Area       0.0000000     0.0000000
## Total.Bsmt.SF     1.0000000     1.0000000
## Garage.Cars       1.0000000     1.0000000
## Full.Bath         0.0000000     1.0000000
## Half.Bath         1.0000000     0.0000000
## Bedroom.AbvGr     1.0000000     1.0000000
## X1st.Flr.SF       0.0000000     0.0000000
## X2nd.Flr.SF       1.0000000     0.0000000
## log(Lot.Area)     1.0000000     1.0000000
## Central.AirY      1.0000000     1.0000000
## BF                0.6501771     0.5754705
## PostProbs         0.0581000     0.0514000
## R2                0.8501000     0.8498000
## dim              11.0000000    10.0000000
## logmarg       -1644.5075272 -1644.6295841

7 Let’s transorm the bic suggested model, dropping any extra variables.

model.bic.full=bas.lm(log(price) ~  Overall.Cond+ log(area)+ Year.Built+   
                  Total.Bsmt.SF + Garage.Cars + 
                  Bedroom.AbvGr + 
                  log(Lot.Area) +  Central.Air,
                  prior = "BIC",
                  method = "MCMC",
                  modelprior = uniform(),
                  data=ames_train
                  )
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Total.Bsmt.SF + : dropping 2 rows due to missing data
summary(model.bic.full)
##               P(B != 0 | Y)    model 1       model 2       model 3
## Intercept         1.0000000     1.0000  1.000000e+00  1.000000e+00
## Overall.Cond      0.9996094     1.0000  1.000000e+00  1.000000e+00
## log(area)         0.9998047     1.0000  1.000000e+00  1.000000e+00
## Year.Built        0.9994141     1.0000  1.000000e+00  1.000000e+00
## Total.Bsmt.SF     0.9972656     1.0000  1.000000e+00  1.000000e+00
## Garage.Cars       0.9919922     1.0000  1.000000e+00  0.000000e+00
## Bedroom.AbvGr     0.9958984     1.0000  1.000000e+00  0.000000e+00
## log(Lot.Area)     0.9953125     1.0000  1.000000e+00  0.000000e+00
## Central.AirY      0.9705078     1.0000  0.000000e+00  1.000000e+00
## BF                       NA     1.0000  3.818177e-02  1.894233e-36
## PostProbs                NA     0.9660  2.600000e-02  2.000000e-03
## R2                       NA     0.8496  8.476000e-01  8.189000e-01
## dim                      NA     9.0000  8.000000e+00  6.000000e+00
## logmarg                  NA -1666.2341 -1.669500e+03 -1.748488e+03
##                     model 4       model 5
## Intercept      1.000000e+00  1.000000e+00
## Overall.Cond   1.000000e+00  1.000000e+00
## log(area)      1.000000e+00  1.000000e+00
## Year.Built     1.000000e+00  1.000000e+00
## Total.Bsmt.SF  1.000000e+00  0.000000e+00
## Garage.Cars    0.000000e+00  0.000000e+00
## Bedroom.AbvGr  0.000000e+00  1.000000e+00
## log(Lot.Area)  1.000000e+00  0.000000e+00
## Central.AirY   1.000000e+00  0.000000e+00
## BF             4.621375e-28  2.206854e-71
## PostProbs      1.800000e-03  1.400000e-03
## R2             8.270000e-01  7.858000e-01
## dim            7.000000e+00  5.000000e+00
## logmarg       -1.729176e+03 -1.828926e+03
lm.bic = lm(log(price) ~  Overall.Cond+ log(area)+ Year.Built+   
                  Total.Bsmt.SF + Garage.Cars + 
                  Bedroom.AbvGr + 
                  log(Lot.Area) +  Central.Air,data=ames_train
                  )



model.aic.full=bas.lm(log(price) ~  Overall.Cond+ log(area)+ Year.Built+   
                  Total.Bsmt.SF + Garage.Cars + 
                  Full.Bath + Half.Bath + 
                  Bedroom.AbvGr +  
                  X2nd.Flr.SF +  
                  log(Lot.Area) +  Central.Air,
                  prior = "AIC",
                  method = "BAS",
                  modelprior = uniform(),
                  data=ames_train
                  )
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Total.Bsmt.SF + : dropping 2 rows due to missing data
summary(model.aic.full)
##               P(B != 0 | Y)    model 1      model 2       model 3
## Intercept         1.0000000     1.0000     1.000000     1.0000000
## Overall.Cond      1.0000000     1.0000     1.000000     1.0000000
## log(area)         1.0000000     1.0000     1.000000     1.0000000
## Year.Built        1.0000000     1.0000     1.000000     1.0000000
## Total.Bsmt.SF     1.0000000     1.0000     1.000000     1.0000000
## Garage.Cars       0.9999997     1.0000     1.000000     1.0000000
## Full.Bath         0.4718752     1.0000     0.000000     0.0000000
## Half.Bath         0.4744016     1.0000     0.000000     0.0000000
## Bedroom.AbvGr     1.0000000     1.0000     1.000000     1.0000000
## X2nd.Flr.SF       0.5541969     1.0000     0.000000     1.0000000
## log(Lot.Area)     1.0000000     1.0000     1.000000     1.0000000
## Central.AirY      0.9962521     1.0000     1.000000     1.0000000
## BF                       NA     1.0000     0.922009     0.7631655
## PostProbs                NA     0.1902     0.175400     0.1452000
## R2                       NA     0.8505     0.849600     0.8499000
## dim                      NA    12.0000     9.000000    10.0000000
## logmarg                  NA -1644.0770 -1644.158217 -1644.3472971
##                     model 4       model 5
## Intercept         1.0000000     1.0000000
## Overall.Cond      1.0000000     1.0000000
## log(area)         1.0000000     1.0000000
## Year.Built        1.0000000     1.0000000
## Total.Bsmt.SF     1.0000000     1.0000000
## Garage.Cars       1.0000000     1.0000000
## Full.Bath         0.0000000     1.0000000
## Half.Bath         1.0000000     0.0000000
## Bedroom.AbvGr     1.0000000     1.0000000
## X2nd.Flr.SF       1.0000000     0.0000000
## log(Lot.Area)     1.0000000     1.0000000
## Central.AirY      1.0000000     1.0000000
## BF                0.6501771     0.5754705
## PostProbs         0.1237000     0.1095000
## R2                0.8501000     0.8498000
## dim              11.0000000    10.0000000
## logmarg       -1644.5075272 -1644.6295841
lm.aic = lm(log(price) ~  Overall.Cond+ log(area)+ Year.Built+   
                  Total.Bsmt.SF + Garage.Cars + 
                  Full.Bath + Half.Bath + 
                  Bedroom.AbvGr +  
                  X2nd.Flr.SF +  
                  log(Lot.Area) +  Central.Air,data=ames_train
                  )

After dropping suggested variables, we see that the bic model has quite high PostProbs = 0.9850 At the same time, after dropping suggested varibales by aic Full.Bath, Half.Bath and X2nd.Flr.SF, posterior probabilities shoe only 0.192 which is still goo because it’s higher than 0.0 While we are interested in the first model as the parsimonian one, we will get the right answer only after the RMSE testings. * * *

7.0.1 Section 2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


The model’s residuals are normally distributed with a center around 0, the shape of distribution is normal (not a fun or skewed) and lays close to the qq line. But, we could observe some strong outliers # 428, 310 and #181 which we can track on all of 3 plots.

plot(model.bic.full)

abline(h=0)


7.0.2 Section 2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


Calculated RMSE = $35826.36 represents the sample (ames_test) standart deviation of the differences between predicted values and observed values by using the model.full multilinear model, which is quite high.

# Extract Predictions for bic
predict.full.train.bic <- exp(predict(lm.bic, ames_train))

# Extract Residuals
resid.full.train.bic <- ames_train$price - predict.full.train.bic

# Calculate RMSE
rmse.full.train.bic <- sqrt(mean(resid.full.train.bic^2, na.rm = TRUE))
rmse.full.train.bic
## [1] 35826.36
# Extract Predictions for aic
predict.full.train.aic <- exp(predict(lm.aic, ames_train))

# Extract Residuals
resid.full.train.aic <- ames_train$price - predict.full.train.aic

# Calculate RMSE
rmse.full.train.aic <- sqrt(mean(resid.full.train.aic^2, na.rm = TRUE))
rmse.full.train.aic
## [1] 36112.87

Running RMSE test, we found that the RMSE is lower for bic model which is the sign to pick the bic model. Because, in general, the lower the RMSE the better. * * *

7.0.3 Section 2.5 Overfitting

The process of building a model generally involves starting with an initial model, identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called overfitting. To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

load("ames_test")

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


In general, the better the model fit, the lower the RMSE. The prediction for the train model is more accurate for the test data, even though the model was build for the training data. This cunclusion is done by comparing the RMSE results which are lower for the test data.

# Extract Predictions
predict.full.test <- exp(predict(lm.bic, ames_test))

# Extract Residuals
resid.full.test <- ames_test$price - predict.full.test

# Calculate RMSE
rmse.full.test.bic <- sqrt(mean(resid.full.test^2, na.rm = TRUE))
rmse.full.test.bic
## [1] 26864.08
rmse.full.train.bic > rmse.full.test.bic
## [1] TRUE

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

7.1 Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

7.1.1 Section 3.1 Final Model

Provide the summary table for your model.


Out of this model with huge amount of variables (13) we can see that many of them does not provide any sence to the model and could be either transformed (next step) or excluded. Residual standard error: 36600 on 932 degrees of freedom is quite high and shows that the chosen model does not fit to predict the house price. To compare 36600 with 0.1637 on 986 degrees of freedom on the previous model.

allvar.model.step1 = lm(price ~ area + Overall.Cond + Year.Built + Year.Remod.Add + Heating + Central.Air +  Bedroom.AbvGr+ Kitchen.AbvGr +  Kitchen.Qual  + TotRms.AbvGrd+ Fireplaces + Garage.Yr.Blt + Garage.Area + Pool.Area , data = ames_train)
summary(allvar.model.step1)
## 
## Call:
## lm(formula = price ~ area + Overall.Cond + Year.Built + Year.Remod.Add + 
##     Heating + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual + 
##     TotRms.AbvGrd + Fireplaces + Garage.Yr.Blt + Garage.Area + 
##     Pool.Area, data = ames_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -353341  -19602   -2026   16953  269110 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.480e+06  1.798e+05  -8.230 6.28e-16 ***
## area            6.368e+01  5.071e+00  12.557  < 2e-16 ***
## Overall.Cond    7.556e+03  1.389e+03   5.442 6.74e-08 ***
## Year.Built      8.073e+02  9.458e+01   8.535  < 2e-16 ***
## Year.Remod.Add  1.656e+02  9.442e+01   1.754 0.079708 .  
## HeatingGasW     2.863e+04  1.469e+04   1.949 0.051568 .  
## HeatingGrav     4.354e+04  3.726e+04   1.169 0.242814    
## HeatingWall     2.890e+04  3.788e+04   0.763 0.445612    
## Central.AirY    9.627e+03  7.541e+03   1.276 0.202100    
## Bedroom.AbvGr  -1.481e+04  2.159e+03  -6.859 1.26e-11 ***
## Kitchen.AbvGr  -2.706e+04  7.760e+03  -3.487 0.000512 ***
## Kitchen.QualFa -7.411e+04  1.222e+04  -6.065 1.92e-09 ***
## Kitchen.QualGd -7.082e+04  5.300e+03 -13.364  < 2e-16 ***
## Kitchen.QualPo -2.335e+04  3.793e+04  -0.615 0.538376    
## Kitchen.QualTA -8.269e+04  6.251e+03 -13.228  < 2e-16 ***
## TotRms.AbvGrd   4.745e+03  1.569e+03   3.025 0.002551 ** 
## Fireplaces      1.558e+04  2.199e+03   7.082 2.80e-12 ***
## Garage.Yr.Blt  -1.731e+02  1.090e+02  -1.587 0.112764    
## Garage.Area     8.051e+01  8.463e+00   9.513  < 2e-16 ***
## Pool.Area      -3.673e+01  3.896e+01  -0.943 0.346003    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36600 on 932 degrees of freedom
##   (48 observations deleted due to missingness)
## Multiple R-squared:  0.8038, Adjusted R-squared:  0.7998 
## F-statistic:   201 on 19 and 932 DF,  p-value: < 2.2e-16

7.1.2 Section 3.2 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


After the expection of residuals in the previous model, we found that 3 houses were sold under the abnormal condiitons 610 (PID = 531375070), 428 (PID = 902207130) and 310 (PID = 908154205), it’s better to drop them and leave those which were sold under normal conditions, so our prediction will be more delicate to regular houses. Also, we need to log transform area variables. After those actions, we could see significan drop on the Residual standard error: 0.1417 on 778 degrees of freedom from 36600, as well as growing of Adjusted R-squared from 0.7998 to 0.8523 That means, we are on the right track!

ames_train_normal <- ames_train%>%
        filter(Sale.Condition == "Normal" & !(PID %in% c("531375070", "902207130", "908154205")))%>%
        dplyr::select(price, area, Overall.Cond, Year.Built, Year.Remod.Add, Heating, Central.Air, Bedroom.AbvGr, Kitchen.AbvGr, Kitchen.Qual, TotRms.AbvGrd,Fireplaces, Garage.Yr.Blt,Garage.Area, Pool.Area) 
        
allvar.model.step2 = lm(log(price) ~ log(area)  + Overall.Cond + Year.Built + Year.Remod.Add + Heating + Central.Air +  Bedroom.AbvGr+ Kitchen.AbvGr +  Kitchen.Qual  + TotRms.AbvGrd+ Fireplaces + Garage.Yr.Blt + log(Garage.Area) + Pool.Area , data = ames_train_normal)


summary(allvar.model.step2)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Cond + Year.Built + 
##     Year.Remod.Add + Heating + Central.Air + Bedroom.AbvGr + 
##     Kitchen.AbvGr + Kitchen.Qual + TotRms.AbvGrd + Fireplaces + 
##     Garage.Yr.Blt + log(Garage.Area) + Pool.Area, data = ames_train_normal)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.74726 -0.08643  0.00512  0.08990  0.50311 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -1.5675243  0.7640547  -2.052 0.040544 *  
## log(area)         0.5488495  0.0330454  16.609  < 2e-16 ***
## Overall.Cond      0.0524961  0.0059707   8.792  < 2e-16 ***
## Year.Built        0.0042442  0.0003880  10.938  < 2e-16 ***
## Year.Remod.Add    0.0006107  0.0003910   1.562 0.118773    
## HeatingGasW       0.2119427  0.0608877   3.481 0.000528 ***
## HeatingGrav       0.3390236  0.1446595   2.344 0.019350 *  
## HeatingWall       0.1171465  0.1471950   0.796 0.426357    
## Central.AirY      0.1703173  0.0312530   5.450 6.78e-08 ***
## Bedroom.AbvGr    -0.0439392  0.0093640  -4.692 3.19e-06 ***
## Kitchen.AbvGr    -0.0988982  0.0324339  -3.049 0.002372 ** 
## Kitchen.QualFa   -0.2413396  0.0512914  -4.705 3.00e-06 ***
## Kitchen.QualGd   -0.2314970  0.0247071  -9.370  < 2e-16 ***
## Kitchen.QualPo    0.0923343  0.1480338   0.624 0.532982    
## Kitchen.QualTA   -0.2999732  0.0279207 -10.744  < 2e-16 ***
## TotRms.AbvGrd     0.0115864  0.0065618   1.766 0.077833 .  
## Fireplaces        0.0774477  0.0094210   8.221 8.44e-16 ***
## Garage.Yr.Blt    -0.0004732  0.0004587  -1.032 0.302506    
## log(Garage.Area)  0.1429367  0.0185184   7.719 3.61e-14 ***
## Pool.Area        -0.0001527  0.0001517  -1.006 0.314718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1417 on 778 degrees of freedom
##   (36 observations deleted due to missingness)
## Multiple R-squared:  0.8558, Adjusted R-squared:  0.8523 
## F-statistic:   243 on 19 and 778 DF,  p-value: < 2.2e-16

7.1.3 Section 3.3 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


Before assesing variable interaction, we will drop the variables wihh the highes p-value based on the previous summary: TotRms.AbvGrd, Pool.Area , Garage.Yr.Blt

ames_train_normal_2 <- ames_train%>%
        filter(Sale.Condition == "Normal" & !(PID %in% c("531375070", "902207130", "908154205")))%>%
        dplyr::select(price, area, Overall.Cond, Year.Built, Year.Remod.Add, Heating, Central.Air, Bedroom.AbvGr, Kitchen.AbvGr, Kitchen.Qual, Fireplaces, Garage.Area) 
        
allvar.model.step3 = lm(log(price) ~ ., data = ames_train_normal_2)


summary(allvar.model.step3)
## 
## Call:
## lm(formula = log(price) ~ ., data = ames_train_normal_2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.71928 -0.08462  0.00294  0.09094  0.52087 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.695e+00  7.079e-01   2.394 0.016874 *  
## area            3.780e-04  1.846e-05  20.470  < 2e-16 ***
## Overall.Cond    4.831e-02  5.714e-03   8.453  < 2e-16 ***
## Year.Built      4.148e-03  2.847e-04  14.571  < 2e-16 ***
## Year.Remod.Add  6.765e-04  3.774e-04   1.792 0.073426 .  
## HeatingGasW     2.202e-01  5.789e-02   3.805 0.000153 ***
## HeatingGrav     2.713e-01  1.464e-01   1.853 0.064252 .  
## HeatingOthW     2.331e-01  1.490e-01   1.564 0.118198    
## HeatingWall     8.680e-02  1.479e-01   0.587 0.557349    
## Central.AirY    1.740e-01  2.878e-02   6.048 2.23e-09 ***
## Bedroom.AbvGr  -3.119e-02  8.449e-03  -3.691 0.000238 ***
## Kitchen.AbvGr  -5.861e-02  2.951e-02  -1.986 0.047375 *  
## Kitchen.QualFa -2.882e-01  4.605e-02  -6.259 6.25e-10 ***
## Kitchen.QualGd -2.047e-01  2.502e-02  -8.179 1.09e-15 ***
## Kitchen.QualPo  1.105e-01  1.497e-01   0.738 0.460507    
## Kitchen.QualTA -2.753e-01  2.829e-02  -9.732  < 2e-16 ***
## Fireplaces      7.807e-02  9.239e-03   8.451  < 2e-16 ***
## Garage.Area     2.949e-04  3.145e-05   9.378  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1441 on 816 degrees of freedom
## Multiple R-squared:  0.8605, Adjusted R-squared:  0.8576 
## F-statistic: 296.2 on 17 and 816 DF,  p-value: < 2.2e-16

In this part we are interested to check if our variables have any internal linear correlations. If they are, we will need to exclude one of them. We check them in pairs: Year.Built and Year.Remod.Add, Heating and Central.Air; Bedroom.AbvGr, Kitchen.AbvGr, area and garage area.

ggpairs(ames_train_normal_2, columns = c(5,6), main = "Years of Remodeling and Building")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(ames_train_normal_2, columns = c(7,8), main = "Heating and Air")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggpairs(ames_train_normal_2, columns = c(2, 8,9, 12), main = "Areas")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.

ggpairs(ames_train_normal_2, columns = c(10, 2), main = "Kitchen vs area")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After running the plot we can seee that most of them in the pair are correlated. Year.Built and Year.Remod.Add, is 0.5775 which is quite high and we have to take off one - Year.Remod.Add, Heating and Central.Air shows no correlation, but we will remove Heating as a factor with high p-values and as a meaningfullu correlated with Central.Air variable; Bedroom.AbvGr, Kitchen.AbvGr, area and garage area: area and Bedroom.AbvGr is 0.584, so we need to drop Bedroom.AbvGr; area and Garage.Area is 0.474 which is close to 0.5, so we could drop the last one and will do that.

As the result of this step, our model shrunk to

ames_train_normal_3 <- ames_train%>%
        filter(Sale.Condition == "Normal" & !(PID %in% c("531375070", "902207130", "908154205")))%>%
        dplyr::select(price, area, Overall.Cond, Year.Built, Central.Air, Kitchen.AbvGr, Kitchen.Qual, Fireplaces) 
        
allvar.model.step4 = lm(log(price) ~ log(area)+ Overall.Cond+ Year.Built+Central.Air+Kitchen.AbvGr+ Kitchen.Qual+ Fireplaces, data = ames_train_normal_3)


summary(allvar.model.step4)
## 
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Cond + Year.Built + 
##     Central.Air + Kitchen.AbvGr + Kitchen.Qual + Fireplaces, 
##     data = ames_train_normal_3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.82054 -0.08550  0.00456  0.09613  0.59119 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.7218825  0.5608928  -3.070  0.00221 ** 
## log(area)       0.5801915  0.0211289  27.460  < 2e-16 ***
## Overall.Cond    0.0520614  0.0057668   9.028  < 2e-16 ***
## Year.Built      0.0047945  0.0002638  18.173  < 2e-16 ***
## Central.AirY    0.1350342  0.0292066   4.623 4.38e-06 ***
## Kitchen.AbvGr  -0.0831529  0.0308234  -2.698  0.00712 ** 
## Kitchen.QualFa -0.3556534  0.0475486  -7.480 1.91e-13 ***
## Kitchen.QualGd -0.2674320  0.0261879 -10.212  < 2e-16 ***
## Kitchen.QualPo -0.0820961  0.1596924  -0.514  0.60733    
## Kitchen.QualTA -0.3590976  0.0284771 -12.610  < 2e-16 ***
## Fireplaces      0.0867521  0.0095348   9.098  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1545 on 823 degrees of freedom
## Multiple R-squared:  0.8382, Adjusted R-squared:  0.8363 
## F-statistic: 426.5 on 10 and 823 DF,  p-value: < 2.2e-16

7.1.4 Section 3.4 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


During the selecting procedure, I was interested to pick variables which show off year of build, condition of the house and size, as well as heating part of the house. Those variables are low in missing data, based on EDA, show quite low p-values and show no correlation within variable, so it makes the model quite light for prediciton. It was quite interesting to observe, but after all of those modification, the BIC MCMC model average suggests to drop Kitchen.AbvGr, so we do.

model.step4.bas = bas.lm(log(price) ~ log(area)+ Overall.Cond+ Year.Built+Central.Air+ Kitchen.Qual+ Fireplaces, prior = "BIC",
                  method = "MCMC",
                  modelprior = uniform(), data = ames_train_normal_3)
summary(model.step4.bas)
##                P(B != 0 | Y)    model 1       model 2       model 3
## Intercept         1.00000000     1.0000  1.000000e+00  1.000000e+00
## log(area)         0.99960938     1.0000  1.000000e+00  1.000000e+00
## Overall.Cond      0.99707031     1.0000  1.000000e+00  0.000000e+00
## Year.Built        0.99863281     1.0000  1.000000e+00  1.000000e+00
## Central.AirY      0.99746094     1.0000  1.000000e+00  0.000000e+00
## Kitchen.QualFa    0.99853516     1.0000  1.000000e+00  0.000000e+00
## Kitchen.QualGd    0.99736328     1.0000  1.000000e+00  0.000000e+00
## Kitchen.QualPo    0.03671875     0.0000  1.000000e+00  0.000000e+00
## Kitchen.QualTA    0.99892578     1.0000  1.000000e+00  1.000000e+00
## Fireplaces        0.99960938     1.0000  1.000000e+00  1.000000e+00
## BF                        NA     1.0000  3.781668e-02  1.957066e-50
## PostProbs                 NA     0.9602  3.670000e-02  7.000000e-04
## R2                        NA     0.8368  8.368000e-01  7.782000e-01
## dim                       NA     9.0000  1.000000e+01  5.000000e+00
## logmarg                   NA -1275.9448 -1.279220e+03 -1.390403e+03
##                       model 4        model 5
## Intercept        1.000000e+00   1.000000e+00
## log(area)        1.000000e+00   0.000000e+00
## Overall.Cond     0.000000e+00   0.000000e+00
## Year.Built       0.000000e+00   0.000000e+00
## Central.AirY     0.000000e+00   0.000000e+00
## Kitchen.QualFa   1.000000e+00   0.000000e+00
## Kitchen.QualGd   0.000000e+00   0.000000e+00
## Kitchen.QualPo   0.000000e+00   0.000000e+00
## Kitchen.QualTA   0.000000e+00   0.000000e+00
## Fireplaces       1.000000e+00   1.000000e+00
## BF              7.987519e-149  1.511276e-258
## PostProbs        6.000000e-04   4.000000e-04
## R2               6.150000e-01   2.829000e-01
## dim              4.000000e+00   2.000000e+00
## logmarg         -1.616952e+03  -1.869599e+03
image(model.step4.bas, top.models = 5)

final.model=lm(log(price) ~ log(area)+ Overall.Cond+ Year.Built+Central.Air+ Kitchen.Qual+ Fireplaces, data = ames_train_normal_3)

As we could see, the model also suggests to drop some variables which are the groups of a single variable, just to kepp the marginal probability safe, we would not drop any of them.


7.1.5 Section 3.5 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


Applying the model to the test dataset, we found the model results are close enough to the inicial one. Test model falls within the prediction interval with results of 95% when the original one has results 0f 92%

model.train<-mean(exp(predict(final.model, ames_train)))
model.test<-mean(exp(predict(final.model, ames_test)))
model.train>model.test
## [1] TRUE
# Predict prices train
predict.train <- exp(predict(final.model, ames_train, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage.prob.train <- mean(ames_train$price > predict.train[,"lwr"] &
                            ames_train$price < predict.train[,"upr"])
coverage.prob.train
## [1] 0.921
# Predict prices test
predict.full <- exp(predict(final.model, ames_test, interval = "prediction"))

# Calculate proportion of observations that fall within prediction intervals
coverage.prob.full <- mean(ames_test$price > predict.full[,"lwr"] &
                            ames_test$price < predict.full[,"upr"])
coverage.prob.full
## [1] 0.9498164

7.2 Part 4 Final Model Assessment

7.2.1 Section 4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.

Testing residuals we can see the residuals are normal distributed with founded earlier outliers 428 and 310 which appearance we could track on all three graphs.


plot(final.model, which = 1)

hist(final.model$residuals)

qqnorm(final.model$residuals)
qqline(final.model$residuals)

The scatter plot has a very light curve which is a good sing of residual spreading. REsiduals has normal distribution with a center around 0 and show slight curve on the edges of QQ plot * * *

7.2.2 Section 4.2 Final Model RMSE

Ruuning the test model, we found the created model has the lower RMSE values, so we could conclude the created model fits well.

#extract prediction
predict.model.train <- exp(predict(final.model, ames_train))

# Extract Residuals
resid.train <- ames_train$price - predict.model.train

# Calculate RMSE
rmse.train <- sqrt(mean(resid.train^2))


#extract prediction

predict.model.test <- exp(predict(final.model, ames_test))

# Extract Residuals
resid.test <- ames_test$price - predict.model.test

# Calculate RMSE
rmse.test <- sqrt(mean(resid.test^2))


#final model
rmse.train
## [1] 37910.31
rmse.test 
## [1] 32348
#initial model
rmse.full.train.bic
## [1] 35826.36
rmse.full.test.bic
## [1] 26864.08

Summarising, initial train model’s RMSE is $35826.36
initial test model’s RMSE is $26864.08
final train model’s RMSE is $37910.31
final test model’s RMSE is $32348

    Initial test model shoew the lowest RMSE which speaking of better fitiings of the model to compare with the final test model. Nevertheless, the final testing model has RMSE lower than the training one, which is a good sign.     
    As an option, we can come back to the model and analyse the summary of the model again and drop a few more variables which were questionated during the selection process. because we are interested in those variables, it's our decision to keep them on.

7.2.3 Section 4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


Strength The final model has high posterior probability - 0.60 which points to the well performing model. Also, the model helps to observe the house pricing formation from the different point of view, not only are year of build and condition quality, but also, from a heating point and quality of kitchen.

Weakness. posterior probability = 0.60 points to the well performing model and do not guarantee the best results. The model took in charge of normal condidion houses, eliminantin any out-of-case houses.


7.2.4 Section 4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the ames_validation dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention: * What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data? * What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?

load("ames_validation.Rdata")
str(ames_validation)
## Classes 'tbl_df', 'tbl' and 'data.frame':    763 obs. of  81 variables:
##  $ PID            : int  909276070 534278190 534204120 535425040 907202160 903227070 531375050 535375140 534201230 906380100 ...
##  $ area           : int  1717 894 1073 1773 1026 1348 1175 869 1051 1800 ...
##  $ price          : int  194000 117600 156000 183000 147000 117000 146000 89900 142000 239000 ...
##  $ MS.SubClass    : int  50 20 20 20 80 50 20 20 20 20 ...
##  $ MS.Zoning      : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 6 6 6 7 6 6 6 6 ...
##  $ Lot.Frontage   : int  80 70 79 80 NA 50 63 60 80 122 ...
##  $ Lot.Area       : int  12400 8680 10289 10800 10970 6000 13072 10122 9600 11923 ...
##  $ Street         : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley          : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ Lot.Shape      : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 1 4 4 4 4 1 ...
##  $ Land.Contour   : Factor w/ 4 levels "Bnk","HLS","Low",..: 2 4 4 4 3 4 4 4 4 4 ...
##  $ Utilities      : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Lot.Config     : Factor w/ 5 levels "Corner","CulDSac",..: 5 5 5 5 5 1 5 5 1 1 ...
##  $ Land.Slope     : Factor w/ 3 levels "Gtl","Mod","Sev": 2 1 1 1 2 1 1 1 1 1 ...
##  $ Neighborhood   : Factor w/ 28 levels "Blmngtn","Blueste",..: 7 16 16 16 6 4 23 21 16 6 ...
##  $ Condition.1    : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 4 3 3 6 3 2 3 ...
##  $ Condition.2    : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ Bldg.Type      : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ House.Style    : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 1 3 3 3 8 1 3 3 3 3 ...
##  $ Overall.Qual   : int  5 5 5 6 6 5 6 4 5 9 ...
##  $ Overall.Cond   : int  6 7 7 6 6 8 5 6 5 5 ...
##  $ Year.Built     : int  1940 1960 1965 1961 1978 1925 2005 1948 1968 2007 ...
##  $ Year.Remod.Add : int  1950 1960 1965 1992 1978 1997 2006 1950 1968 2007 ...
##  $ Roof.Style     : Factor w/ 6 levels "Flat","Gable",..: 2 4 4 2 2 2 2 2 2 2 ...
##  $ Roof.Matl      : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior.1st   : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 6 9 7 10 15 14 9 9 14 ...
##  $ Exterior.2nd   : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 6 9 7 7 16 15 9 9 15 ...
##  $ Mas.Vnr.Type   : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 5 3 3 5 5 3 5 5 5 ...
##  $ Mas.Vnr.Area   : int  0 0 168 104 0 0 126 0 0 0 ...
##  $ Exter.Qual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 4 4 4 4 4 4 3 ...
##  $ Exter.Cond     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation     : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 2 1 3 4 2 3 ...
##  $ Bsmt.Qual      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 4 6 4 NA 6 2 ...
##  $ Bsmt.Cond      : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 NA 6 6 ...
##  $ Bsmt.Exposure  : Factor w/ 5 levels "","Av","Gd","Mn",..: 4 5 5 5 3 5 5 NA 5 5 ...
##  $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 3 7 2 6 4 7 4 NA 2 7 ...
##  $ BsmtFin.SF.1   : int  602 0 836 913 505 0 80 0 758 0 ...
##  $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 7 7 7 5 7 7 NA 7 7 ...
##  $ BsmtFin.SF.2   : int  0 0 0 0 435 0 0 0 0 0 ...
##  $ Bsmt.Unf.SF    : int  299 894 237 400 0 884 1095 0 293 1800 ...
##  $ Total.Bsmt.SF  : int  901 894 1073 1313 940 884 1175 0 1051 1800 ...
##  $ Heating        : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Heating.QC     : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 1 1 5 3 1 ...
##  $ Central.Air    : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 1 2 2 ...
##  $ Electrical     : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ X1st.Flr.SF    : int  1125 894 1073 1773 1026 884 1175 869 1051 1800 ...
##  $ X2nd.Flr.SF    : int  592 0 0 0 0 464 0 0 0 0 ...
##  $ Low.Qual.Fin.SF: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Bsmt.Full.Bath : int  0 0 1 1 1 1 1 0 1 0 ...
##  $ Bsmt.Half.Bath : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Full.Bath      : int  1 1 1 2 1 1 1 1 1 2 ...
##  $ Half.Bath      : int  1 0 1 0 0 0 0 0 0 0 ...
##  $ Bedroom.AbvGr  : int  2 3 3 3 3 3 3 1 3 2 ...
##  $ Kitchen.AbvGr  : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Kitchen.Qual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 3 5 5 1 ...
##  $ TotRms.AbvGrd  : int  7 5 6 6 5 5 6 3 6 7 ...
##  $ Functional     : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 4 8 8 8 8 8 8 ...
##  $ Fireplaces     : int  1 0 0 2 0 1 0 0 0 0 ...
##  $ Fireplace.Qu   : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 NA 2 NA NA NA NA ...
##  $ Garage.Type    : Factor w/ 6 levels "2Types","Attchd",..: 2 6 2 2 6 6 NA 6 2 2 ...
##  $ Garage.Yr.Blt  : int  1940 1965 1965 1961 1981 1960 NA 1948 1968 2007 ...
##  $ Garage.Finish  : Factor w/ 4 levels "","Fin","RFn",..: 4 4 3 3 4 4 NA 4 3 2 ...
##  $ Garage.Cars    : int  1 1 2 2 2 1 0 1 2 2 ...
##  $ Garage.Area    : int  410 312 515 418 576 216 0 390 504 702 ...
##  $ Garage.Qual    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 NA 3 6 6 ...
##  $ Garage.Cond    : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 3 6 NA 6 6 6 ...
##  $ Paved.Drive    : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 1 3 1 3 3 ...
##  $ Wood.Deck.SF   : int  0 0 0 355 0 0 0 0 0 288 ...
##  $ Open.Porch.SF  : int  0 0 0 98 0 0 90 0 0 136 ...
##  $ Enclosed.Porch : int  0 0 0 0 34 208 0 66 0 0 ...
##  $ X3Ssn.Porch    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Screen.Porch   : int  113 0 0 144 0 0 0 0 0 0 ...
##  $ Pool.Area      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Pool.QC        : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence          : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA 3 NA NA 1 NA NA ...
##  $ Misc.Feature   : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ Misc.Val       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Mo.Sold        : int  2 1 6 8 10 5 6 8 7 5 ...
##  $ Yr.Sold        : int  2006 2007 2007 2006 2008 2009 2008 2008 2006 2009 ...
##  $ Sale.Type      : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
##  $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...

RMSE: final train =37910.31
final test = 32348
final validation = 29329.01
The values of RMSE go down as expected.

The model test shows that 97% of 95% prediction interval contain the true price of the house in the validation data set. As the result we could say that the model properly reflect uncertainty. Using the median probability model to generate out-of-sample predictions and a 95% prediction interval, 0.02 proportion of observations (rows) in ames_validation have sales prices that fall outside the prediction intervals. I assume, it’s a very good result for the testing model.

#prediction
predict.model.validation <- exp(predict(final.model, ames_validation))

# Extract Residuals
resid.validation <- ames_validation$price - predict.model.validation 

# Calculate RMSE
rmse.validation <- sqrt(mean(resid.validation^2))

rmse.validation
## [1] 29329.01
rmse.test
## [1] 32348
rmse.train
## [1] 37910.31

7.3 Getting confidence

ci_ames = exp(confint(predict(model.step4.bas, ames_validation, se.fit = TRUE), parm="pred"))

length(which(ifelse(ames_validation$price > ci_ames[,"2.5%"] & ames_validation$price < ci_ames[,"97.5%"], "yes", "no") == "yes"))/nrow(ci_ames)*100
## [1] 95.54391

95,4% of the house price lay within 95% confidence interval


7.4 Part 5 Conclusion


We started the research with EDA where we found that a few variables should be transformed, also, we found variables with missing data and did not use them to build a regression model.

By applying different model selcetions, we found that BIC MCMC was better model than AIC BAS in this case. Thus, we run the test for both model and obtained the speaking results.

After that we used the BIC MCMC algorithm to extend our model and add new varibales which we considered to be significant for price estimation or house evaluation. Running BMA we dropped a few variable in order to reach pasimonial model.

The initial model included 3 variables: area, Overall.Cond,Year.Built And we added Central.Air, Kitchen.Qual and Fireplaces.

We found out that the model could be applied for out-of case data and it gives 95,4% predicted prices which stays in 95% confidence interval.