As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.
load("ames_train.Rdata")
Use the code block below to load any necessary packages.
library(statsr)
library(dplyr)
library(BAS)
library(corrplot)
library(GGally)
library(ggplot2)
library(MASS)
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
Because the goal of the analysis is to identify which variables are most predictive of a house's price, we plot candidate variables against price. For a deeper exploration of the data set, we examined the following variables: Overall.Qual, Garage.Area, Total.Bsmt.SF, Garage.Cars, area, Full.Bath, Half.Bath, Bedroom.AbvGr, Year.Built, X1st.Flr.SF, X2nd.Flr.SF, Lot.Area, Central.Air, and Overall.Cond. Based on the plots, we kept only the three variables most strongly related to price.
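As a numeric complement to the plots, a small sketch like the one below ranks the numeric candidates by their correlation with price (a rough guide only, since some variables contain missing values):
# Rough ranking of the numeric candidate predictors by correlation with price.
# Variable names follow the list above; complete observations only.
num_vars <- c("Overall.Qual", "Garage.Area", "Total.Bsmt.SF", "Garage.Cars",
              "area", "Full.Bath", "Half.Bath", "Bedroom.AbvGr", "Year.Built",
              "X1st.Flr.SF", "X2nd.Flr.SF", "Lot.Area", "Overall.Cond")
cor_with_price <- sapply(num_vars, function(v)
  cor(ames_train$price, ames_train[[v]], use = "complete.obs"))
sort(round(cor_with_price, 2), decreasing = TRUE)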
head(ames_train)
## # A tibble: 6 x 81
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area
## <int> <int> <int> <int> <fctr> <int> <int>
## 1 909176150 856 126000 30 RL NA 7890
## 2 905476230 1049 139500 120 RL 42 4235
## 3 911128020 1001 124900 30 C (all) 60 6060
## 4 535377150 1039 114000 70 RL 80 8146
## 5 534177230 1665 227000 60 RL 70 8400
## 6 908128060 1922 198500 85 RL 64 7301
## # ... with 74 more variables: Street <fctr>, Alley <fctr>,
## # Lot.Shape <fctr>, Land.Contour <fctr>, Utilities <fctr>,
## # Lot.Config <fctr>, Land.Slope <fctr>, Neighborhood <fctr>,
## # Condition.1 <fctr>, Condition.2 <fctr>, Bldg.Type <fctr>,
## # House.Style <fctr>, Overall.Qual <int>, Overall.Cond <int>,
## # Year.Built <int>, Year.Remod.Add <int>, Roof.Style <fctr>,
## # Roof.Matl <fctr>, Exterior.1st <fctr>, Exterior.2nd <fctr>,
## # Mas.Vnr.Type <fctr>, Mas.Vnr.Area <int>, Exter.Qual <fctr>,
## # Exter.Cond <fctr>, Foundation <fctr>, Bsmt.Qual <fctr>,
## # Bsmt.Cond <fctr>, Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>,
## # BsmtFin.SF.1 <int>, BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>,
## # Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Heating <fctr>,
## # Heating.QC <fctr>, Central.Air <fctr>, Electrical <fctr>,
## # X1st.Flr.SF <int>, X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>,
## # Bsmt.Full.Bath <int>, Bsmt.Half.Bath <int>, Full.Bath <int>,
## # Half.Bath <int>, Bedroom.AbvGr <int>, Kitchen.AbvGr <int>,
## # Kitchen.Qual <fctr>, TotRms.AbvGrd <int>, Functional <fctr>,
## # Fireplaces <int>, Fireplace.Qu <fctr>, Garage.Type <fctr>,
## # Garage.Yr.Blt <int>, Garage.Finish <fctr>, Garage.Cars <int>,
## # Garage.Area <int>, Garage.Qual <fctr>, Garage.Cond <fctr>,
## # Paved.Drive <fctr>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
## # Enclosed.Porch <int>, X3Ssn.Porch <int>, Screen.Porch <int>,
## # Pool.Area <int>, Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>,
## # Misc.Val <int>, Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>,
## # Sale.Condition <fctr>
str(ames_train)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1000 obs. of 81 variables:
## $ PID : int 909176150 905476230 911128020 535377150 534177230 908128060 902135020 528228540 923426010 908186050 ...
## $ area : int 856 1049 1001 1039 1665 1922 936 1246 889 1072 ...
## $ price : int 126000 139500 124900 114000 227000 198500 93000 187687 137500 140000 ...
## $ MS.SubClass : int 30 120 30 70 60 85 20 20 20 180 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 2 6 6 6 7 6 6 7 ...
## $ Lot.Frontage : int NA 42 60 80 70 64 60 53 74 35 ...
## $ Lot.Area : int 7890 4235 6060 8146 8400 7301 6000 3710 12395 3675 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA 2 NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 1 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 1 5 5 1 5 1 5 5 1 5 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 2 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 26 8 12 21 20 8 21 1 15 8 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 5 1 1 1 1 2 1 1 5 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 3 3 3 6 6 7 3 3 3 7 ...
## $ Overall.Qual : int 6 5 5 4 8 7 4 7 5 6 ...
## $ Overall.Cond : int 6 5 9 8 6 5 4 5 6 5 ...
## $ Year.Built : int 1939 1984 1930 1900 2001 2003 1953 2007 1984 2005 ...
## $ Year.Remod.Add : int 1950 1984 2007 2003 2001 2003 1953 2008 1984 2005 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 2 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 7 9 9 14 7 9 16 7 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 7 9 9 15 7 9 17 11 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 3 5 5 5 3 5 3 5 6 ...
## $ Mas.Vnr.Area : int 0 149 0 0 0 500 0 20 0 76 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 3 3 3 3 3 2 3 4 4 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 3 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 1 1 3 4 2 3 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 4 6 3 4 NA 3 4 6 4 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 NA 6 6 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 5 4 5 5 5 NA 5 3 5 3 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 6 4 2 7 4 NA 7 7 2 4 ...
## $ BsmtFin.SF.1 : int 238 552 737 0 643 0 0 0 647 467 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 2 7 7 7 NA 7 7 7 7 ...
## $ BsmtFin.SF.2 : int 0 393 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 618 104 100 405 167 0 936 1146 217 80 ...
## $ Total.Bsmt.SF : int 856 1049 837 405 810 0 936 1146 864 547 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 1 3 1 1 5 1 5 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 1 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 856 1049 1001 717 810 495 936 1246 889 1072 ...
## $ X2nd.Flr.SF : int 0 0 0 322 855 1427 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 1 1 0 0 1 0 0 0 0 1 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 2 1 1 2 3 1 2 1 1 ...
## $ Half.Bath : int 0 0 0 0 1 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 2 2 2 3 4 2 2 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 3 3 5 3 3 5 3 5 3 ...
## $ TotRms.AbvGrd : int 4 5 5 6 6 7 4 5 6 5 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 8 8 8 4 8 8 8 ...
## $ Fireplaces : int 1 0 0 0 0 1 0 1 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA NA NA 1 NA 3 NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 6 2 6 6 2 4 6 2 2 3 ...
## $ Garage.Yr.Blt : int 1939 1984 1930 1940 2001 2003 1974 2007 1984 2005 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 2 4 4 2 3 4 2 4 2 ...
## $ Garage.Cars : int 2 1 1 1 2 2 2 2 2 2 ...
## $ Garage.Area : int 399 266 216 281 528 672 576 428 484 525 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 5 6 6 6 6 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 1 1 3 3 3 3 3 3 ...
## $ Wood.Deck.SF : int 0 0 154 0 0 0 0 100 0 0 ...
## $ Open.Porch.SF : int 0 105 0 0 45 0 32 24 0 44 ...
## $ Enclosed.Porch : int 0 0 42 168 0 177 112 0 0 0 ...
## $ X3Ssn.Porch : int 0 0 86 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 166 0 0 111 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 3 2 11 5 11 7 2 3 4 5 ...
## $ Yr.Sold : int 2010 2009 2007 2009 2009 2009 2009 2008 2008 2007 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 3 10 7 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 6 5 5 ...
tail(ames_train)
## # A tibble: 6 x 81
## PID area price MS.SubClass MS.Zoning Lot.Frontage Lot.Area
## <int> <int> <int> <int> <fctr> <int> <int>
## 1 528354110 2398 315750 60 RL 95 11787
## 2 907290250 848 145000 120 RM NA 4426
## 3 528480150 1576 197000 60 FV 65 8125
## 4 534427010 1728 84900 90 RL 98 13260
## 5 905106140 1352 158000 60 RL 80 9364
## 6 914452120 912 156000 85 RL NA 7540
## # ... with 74 more variables: Street <fctr>, Alley <fctr>,
## # Lot.Shape <fctr>, Land.Contour <fctr>, Utilities <fctr>,
## # Lot.Config <fctr>, Land.Slope <fctr>, Neighborhood <fctr>,
## # Condition.1 <fctr>, Condition.2 <fctr>, Bldg.Type <fctr>,
## # House.Style <fctr>, Overall.Qual <int>, Overall.Cond <int>,
## # Year.Built <int>, Year.Remod.Add <int>, Roof.Style <fctr>,
## # Roof.Matl <fctr>, Exterior.1st <fctr>, Exterior.2nd <fctr>,
## # Mas.Vnr.Type <fctr>, Mas.Vnr.Area <int>, Exter.Qual <fctr>,
## # Exter.Cond <fctr>, Foundation <fctr>, Bsmt.Qual <fctr>,
## # Bsmt.Cond <fctr>, Bsmt.Exposure <fctr>, BsmtFin.Type.1 <fctr>,
## # BsmtFin.SF.1 <int>, BsmtFin.Type.2 <fctr>, BsmtFin.SF.2 <int>,
## # Bsmt.Unf.SF <int>, Total.Bsmt.SF <int>, Heating <fctr>,
## # Heating.QC <fctr>, Central.Air <fctr>, Electrical <fctr>,
## # X1st.Flr.SF <int>, X2nd.Flr.SF <int>, Low.Qual.Fin.SF <int>,
## # Bsmt.Full.Bath <int>, Bsmt.Half.Bath <int>, Full.Bath <int>,
## # Half.Bath <int>, Bedroom.AbvGr <int>, Kitchen.AbvGr <int>,
## # Kitchen.Qual <fctr>, TotRms.AbvGrd <int>, Functional <fctr>,
## # Fireplaces <int>, Fireplace.Qu <fctr>, Garage.Type <fctr>,
## # Garage.Yr.Blt <int>, Garage.Finish <fctr>, Garage.Cars <int>,
## # Garage.Area <int>, Garage.Qual <fctr>, Garage.Cond <fctr>,
## # Paved.Drive <fctr>, Wood.Deck.SF <int>, Open.Porch.SF <int>,
## # Enclosed.Porch <int>, X3Ssn.Porch <int>, Screen.Porch <int>,
## # Pool.Area <int>, Pool.QC <fctr>, Fence <fctr>, Misc.Feature <fctr>,
## # Misc.Val <int>, Mo.Sold <int>, Yr.Sold <int>, Sale.Type <fctr>,
## # Sale.Condition <fctr>
summary(ames_train)
## PID area price MS.SubClass
## Min. :5.263e+08 Min. : 334 Min. : 12789 Min. : 20.00
## 1st Qu.:5.285e+08 1st Qu.:1092 1st Qu.:129762 1st Qu.: 20.00
## Median :5.354e+08 Median :1411 Median :159467 Median : 50.00
## Mean :7.059e+08 Mean :1477 Mean :181190 Mean : 57.15
## 3rd Qu.:9.071e+08 3rd Qu.:1743 3rd Qu.:213000 3rd Qu.: 70.00
## Max. :1.007e+09 Max. :4676 Max. :615000 Max. :190.00
##
## MS.Zoning Lot.Frontage Lot.Area Street Alley
## A (agr): 0 Min. : 21.00 Min. : 1470 Grvl: 3 Grvl: 33
## C (all): 9 1st Qu.: 57.00 1st Qu.: 7314 Pave:997 Pave: 34
## FV : 56 Median : 69.00 Median : 9317 NA's:933
## I (all): 1 Mean : 69.21 Mean : 10352
## RH : 7 3rd Qu.: 80.00 3rd Qu.: 11650
## RL :772 Max. :313.00 Max. :215245
## RM :155 NA's :167
## Lot.Shape Land.Contour Utilities Lot.Config Land.Slope
## IR1:338 Bnk: 33 AllPub:1000 Corner :173 Gtl:962
## IR2: 30 HLS: 38 NoSeWa: 0 CulDSac: 76 Mod: 33
## IR3: 3 Low: 20 NoSewr: 0 FR2 : 36 Sev: 5
## Reg:629 Lvl:909 FR3 : 5
## Inside :710
##
##
## Neighborhood Condition.1 Condition.2 Bldg.Type House.Style
## NAmes :155 Norm :875 Norm :988 1Fam :823 1Story :521
## CollgCr: 85 Feedr : 53 Feedr : 6 2fmCon: 20 2Story :286
## Somerst: 74 Artery : 23 Artery : 2 Duplex: 35 1.5Fin : 98
## OldTown: 71 RRAn : 14 PosN : 2 Twnhs : 38 SLvl : 41
## Sawyer : 61 PosN : 11 PosA : 1 TwnhsE: 84 SFoyer : 36
## Edwards: 60 RRAe : 11 RRNn : 1 2.5Unf : 10
## (Other):494 (Other): 13 (Other): 0 (Other): 8
## Overall.Qual Overall.Cond Year.Built Year.Remod.Add
## Min. : 1.000 Min. :1.000 Min. :1872 Min. :1950
## 1st Qu.: 5.000 1st Qu.:5.000 1st Qu.:1955 1st Qu.:1966
## Median : 6.000 Median :5.000 Median :1975 Median :1992
## Mean : 6.095 Mean :5.559 Mean :1972 Mean :1984
## 3rd Qu.: 7.000 3rd Qu.:6.000 3rd Qu.:2001 3rd Qu.:2004
## Max. :10.000 Max. :9.000 Max. :2010 Max. :2010
##
## Roof.Style Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type
## Flat : 9 CompShg:984 VinylSd:349 VinylSd:345 : 7
## Gable :775 Tar&Grv: 11 HdBoard:164 HdBoard:150 BrkCmn : 8
## Gambrel: 8 WdShake: 2 MetalSd:147 MetalSd:148 BrkFace:317
## Hip :204 WdShngl: 2 Wd Sdng:138 Wd Sdng:130 CBlock : 0
## Mansard: 4 Metal : 1 Plywood: 74 Plywood: 96 None :593
## Shed : 0 ClyTile: 0 CemntBd: 40 CmentBd: 40 Stone : 75
## (Other): 0 (Other): 88 (Other): 91
## Mas.Vnr.Area Exter.Qual Exter.Cond Foundation Bsmt.Qual Bsmt.Cond
## Min. : 0.0 Ex: 39 Ex: 4 BrkTil:102 : 1 : 1
## 1st Qu.: 0.0 Fa: 11 Fa: 19 CBlock:430 Ex : 87 Ex : 2
## Median : 0.0 Gd:337 Gd:116 PConc :453 Fa : 28 Fa : 23
## Mean : 104.1 TA:613 Po: 0 Slab : 12 Gd :424 Gd : 44
## 3rd Qu.: 160.0 TA:861 Stone : 3 Po : 1 Po : 1
## Max. :1290.0 Wood : 0 TA :438 TA :908
## NA's :7 NA's: 21 NA's: 21
## Bsmt.Exposure BsmtFin.Type.1 BsmtFin.SF.1 BsmtFin.Type.2
## : 2 GLQ :294 Min. : 0.0 Unf :863
## Av :157 Unf :279 1st Qu.: 0.0 LwQ : 31
## Gd : 98 ALQ :163 Median : 400.0 Rec : 29
## Mn : 87 Rec :107 Mean : 464.1 BLQ : 24
## No :635 BLQ : 87 3rd Qu.: 773.0 ALQ : 20
## NA's: 21 (Other): 49 Max. :2260.0 (Other): 12
## NA's : 21 NA's :1 NA's : 21
## BsmtFin.SF.2 Bsmt.Unf.SF Total.Bsmt.SF Heating
## Min. : 0.00 Min. : 0.0 Min. : 0.0 Floor: 0
## 1st Qu.: 0.00 1st Qu.: 223.5 1st Qu.: 797.5 GasA :988
## Median : 0.00 Median : 461.0 Median : 998.0 GasW : 8
## Mean : 48.07 Mean : 547.0 Mean :1059.2 Grav : 2
## 3rd Qu.: 0.00 3rd Qu.: 783.0 3rd Qu.:1301.0 OthW : 1
## Max. :1526.00 Max. :2336.0 Max. :3138.0 Wall : 1
## NA's :1 NA's :1 NA's :1
## Heating.QC Central.Air Electrical X1st.Flr.SF X2nd.Flr.SF
## Ex:516 N: 55 : 0 Min. : 334.0 Min. : 0.0
## Fa: 22 Y:945 FuseA: 54 1st Qu.: 876.2 1st Qu.: 0.0
## Gd:157 FuseF: 12 Median :1080.5 Median : 0.0
## Po: 1 FuseP: 2 Mean :1157.1 Mean : 315.2
## TA:304 Mix : 0 3rd Qu.:1376.2 3rd Qu.: 688.2
## SBrkr:932 Max. :3138.0 Max. :1836.0
##
## Low.Qual.Fin.SF Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath
## Min. : 0.00 Min. :0.0000 Min. :0.00000 Min. :0.000
## 1st Qu.: 0.00 1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000
## Median : 0.00 Median :0.0000 Median :0.00000 Median :2.000
## Mean : 4.32 Mean :0.4474 Mean :0.06106 Mean :1.541
## 3rd Qu.: 0.00 3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000
## Max. :1064.00 Max. :3.0000 Max. :2.00000 Max. :4.000
## NA's :1 NA's :1
## Half.Bath Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual
## Min. :0.000 Min. :0.000 Min. :0.000 Ex: 67
## 1st Qu.:0.000 1st Qu.:2.000 1st Qu.:1.000 Fa: 20
## Median :0.000 Median :3.000 Median :1.000 Gd:403
## Mean :0.378 Mean :2.806 Mean :1.039 Po: 1
## 3rd Qu.:1.000 3rd Qu.:3.000 3rd Qu.:1.000 TA:509
## Max. :2.000 Max. :6.000 Max. :2.000
##
## TotRms.AbvGrd Functional Fireplaces Fireplace.Qu Garage.Type
## Min. : 2.00 Typ :935 Min. :0.000 Ex : 16 2Types : 10
## 1st Qu.: 5.00 Min2 : 24 1st Qu.:0.000 Fa : 24 Attchd :610
## Median : 6.00 Min1 : 18 Median :1.000 Gd :232 Basment: 11
## Mean : 6.34 Mod : 16 Mean :0.597 Po : 18 BuiltIn: 56
## 3rd Qu.: 7.00 Maj1 : 4 3rd Qu.:1.000 TA :219 CarPort: 1
## Max. :13.00 Maj2 : 2 Max. :4.000 NA's:491 Detchd :266
## (Other): 1 NA's : 46
## Garage.Yr.Blt Garage.Finish Garage.Cars Garage.Area Garage.Qual
## Min. :1900 : 2 Min. :0.000 Min. : 0.0 : 1
## 1st Qu.:1961 Fin :247 1st Qu.:1.000 1st Qu.: 312.0 Ex : 1
## Median :1979 RFn :278 Median :2.000 Median : 480.0 Fa : 37
## Mean :1978 Unf :427 Mean :1.767 Mean : 475.4 Gd : 7
## 3rd Qu.:2002 NA's: 46 3rd Qu.:2.000 3rd Qu.: 576.0 Po : 3
## Max. :2010 Max. :5.000 Max. :1390.0 TA :904
## NA's :48 NA's :1 NA's :1 NA's: 47
## Garage.Cond Paved.Drive Wood.Deck.SF Open.Porch.SF
## : 1 N: 67 Min. : 0.00 Min. : 0.00
## Ex : 1 P: 29 1st Qu.: 0.00 1st Qu.: 0.00
## Fa : 21 Y:904 Median : 0.00 Median : 28.00
## Gd : 6 Mean : 93.84 Mean : 48.93
## Po : 6 3rd Qu.:168.00 3rd Qu.: 74.00
## TA :918 Max. :857.00 Max. :742.00
## NA's: 47
## Enclosed.Porch X3Ssn.Porch Screen.Porch Pool.Area
## Min. : 0.00 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.00 1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 0.000
## Median : 0.00 Median : 0.000 Median : 0.00 Median : 0.000
## Mean : 23.48 Mean : 3.118 Mean : 14.77 Mean : 1.463
## 3rd Qu.: 0.00 3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 0.000
## Max. :432.00 Max. :508.000 Max. :440.00 Max. :800.000
##
## Pool.QC Fence Misc.Feature Misc.Val Mo.Sold
## Ex : 1 GdPrv: 43 Elev: 0 Min. : 0.00 Min. : 1.000
## Fa : 1 GdWo : 37 Gar2: 2 1st Qu.: 0.00 1st Qu.: 4.000
## Gd : 1 MnPrv:120 Othr: 1 Median : 0.00 Median : 6.000
## TA : 0 MnWw : 2 Shed: 25 Mean : 45.81 Mean : 6.243
## NA's:997 NA's :798 TenC: 1 3rd Qu.: 0.00 3rd Qu.: 8.000
## NA's:971 Max. :15500.00 Max. :12.000
##
## Yr.Sold Sale.Type Sale.Condition
## Min. :2006 WD :863 Abnorml: 61
## 1st Qu.:2007 New : 79 AdjLand: 2
## Median :2008 COD : 27 Alloca : 4
## Mean :2008 ConLD : 7 Family : 17
## 3rd Qu.:2009 ConLw : 6 Normal :834
## Max. :2010 Con : 5 Partial: 82
## (Other): 13
k<-colSums(is.na(ames_train))
missingval<-sort(k, decreasing = TRUE)[1:20]
barplot(missingval, main = "Missing values", las = 2 )
From this summary we can see that Pool.QC, Alley, Misc.Feature, Fence, and Fireplace.Qu are the five variables with the most missing values, which suggests that few houses in this area have these features.
hist(ames_train$price, main = " House price", xlab = "price", ylab = "amount", col = "green")
summary(ames_train$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12790 129800 159500 181200 213000 615000
Because the distribution is skewed, the median is the most appropriate measure of center: the median house price is $159,500. The histogram is right-skewed, which tells us that inexpensive houses are more common than expensive ones.
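Given this right skew, it is worth previewing price on the log scale as well; the short sketch below is a supplementary check and motivates the log transformation of price used later in the modeling.
# Supplementary check: the log transformation makes the price distribution far more symmetric.
hist(log(ames_train$price), main = "Log: House price", xlab = "log(price)", col = "green")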
hist(ames_train$area, main = "Area", xlab = "area", col = "red")
hist(log(ames_train$area), main = "Log:Area", xlab = "area", col = "red")
hist(ames_train$Year.Built, main = "Year of Building", col = "blue")
boxplot(ames_train$price~ames_train$Overall.Cond, main = "Overall Condition", xlab = "condition", col = "yellow")
hist(ames_train$Overall.Cond)
#adjusting overall.cond variable
overallcond<- ames_train %>%
filter(Overall.Cond != 5)%>%
group_by(Overall.Cond)%>%
summarise(median = median(price), mean = mean(price))
table(ames_train$Overall.Cond)
##
## 1 2 3 4 5 6 7 8 9
## 3 3 14 36 561 193 131 47 12
plot(overallcond, col = "red")
#checking the relationships of the price and the variables
plot(ames_train$price~ames_train$area, main = "Area vs Price", xlab = "area", col = "red")
plot(log(ames_train$price)~log(ames_train$area), main = "Log:Area vs Price", xlab = "area", col = "red")
plot(ames_train$price~ames_train$Year.Built, main = "Year of Building vs Price", col = "blue")
plot(ames_train$price~ames_train$Overall.Cond, main = "Overall Condition vs Price", xlab = "condition", col = "yellow")
After examining the different variables, we settled on three that show the most linear relationship with price: Year.Built, area, and Overall.Cond. After log-transforming both price and area, the scatter plot shows a nearly linear pattern, indicating a strong association between the two variables. Plotting Year.Built against price also reveals an association: the newer the house, the higher the price, and Ames clearly experienced a building boom in the last decade of the 20th century. We can also observe a relationship between overall condition and price. The boxplot shows that a higher overall condition generally corresponds to a higher house price, with the exception of condition 5, which is interesting to examine: houses with condition 5 have a higher average price and a wider price range. The histogram provides the explanation: condition 5 is by far the most common rating on the market. After excluding condition 5 and re-plotting, the relationship between price and Overall.Cond looks roughly linear. We use log-transformed versions of price and area because the EDA showed that both are skewed and asymmetric, which could distort the final results.
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price
(or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
Running the multiple linear regression, we included all of the variables from the previous step. In the summary table, the variables discussed in the EDA show the lowest p-values (marked with three stars). It is also interesting to note other variables with similarly low p-values that could be included in the model: Total.Bsmt.SF, Bedroom.AbvGr, Lot.Area, and Central.AirY, all of which correspond closely to the area variable (the bigger the area, the more bedrooms the house has, and so on). This model does not yet have strong predictive potential, so it will undergo some transformations later. Even though the current model has a low residual standard error (0.1635) and a fairly solid adjusted R-squared (0.8487), some of the variables have high p-values and should be removed, because they add little to the model and only make it heavier and inflate the standard error.
model.lm=lm(log(price) ~ Overall.Cond+ log(area)+ Year.Built+ Garage.Area +
Total.Bsmt.SF + Garage.Cars +
Full.Bath + Half.Bath +
Bedroom.AbvGr + X1st.Flr.SF +
X2nd.Flr.SF +
log(Lot.Area) + Central.Air , data=ames_train )
summary(model.lm)
##
## Call:
## lm(formula = log(price) ~ Overall.Cond + log(area) + Year.Built +
## Garage.Area + Total.Bsmt.SF + Garage.Cars + Full.Bath + Half.Bath +
## Bedroom.AbvGr + X1st.Flr.SF + X2nd.Flr.SF + log(Lot.Area) +
## Central.Air, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.57637 -0.08183 -0.00030 0.09023 0.56662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.699e+00 7.253e-01 -6.479 1.46e-10 ***
## Overall.Cond 8.236e-02 5.480e-03 15.030 < 2e-16 ***
## log(area) 6.120e-01 6.416e-02 9.539 < 2e-16 ***
## Year.Built 5.491e-03 2.828e-04 19.418 < 2e-16 ***
## Garage.Area 4.053e-06 5.686e-05 0.071 0.943194
## Total.Bsmt.SF 2.100e-04 2.330e-05 9.015 < 2e-16 ***
## Garage.Cars 5.317e-02 1.685e-02 3.155 0.001655 **
## Full.Bath -2.598e-02 1.559e-02 -1.666 0.095961 .
## Half.Bath -2.698e-02 1.513e-02 -1.783 0.074967 .
## Bedroom.AbvGr -7.746e-02 8.129e-03 -9.528 < 2e-16 ***
## X1st.Flr.SF 4.398e-05 4.876e-05 0.902 0.367222
## X2nd.Flr.SF 8.054e-05 4.241e-05 1.899 0.057809 .
## log(Lot.Area) 8.562e-02 1.173e-02 7.297 6.06e-13 ***
## Central.AirY 9.941e-02 2.685e-02 3.703 0.000225 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1635 on 984 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.8507, Adjusted R-squared: 0.8487
## F-statistic: 431.2 on 13 and 984 DF, p-value: < 2.2e-16
Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
At this point we will run two different model selection approaches: BIC with Markov chain Monte Carlo (MCMC) sampling and AIC with Bayesian Adaptive Sampling (BAS).
AIC tries to select the model that most adequately describes an unknown, high-dimensional reality; this means the true data-generating process is never assumed to be in the set of candidate models. BIC, in contrast, tries to find the true model among the set of candidates.
MCMC methods are a class of algorithms for sampling from a probability distribution by constructing a Markov chain that has the desired distribution as its equilibrium distribution. The state of the chain after a number of steps is then used as a sample from the desired distribution, and the quality of the sample improves with the number of steps.
The BAS algorithm samples models without replacement from the model space. For problems that permit enumeration of all models, BAS is guaranteed to enumerate the model space in 2^p iterations, where p is the number of candidate variables. For larger problems where sampling is required, BAS can, under certain conditions, provide perfect samples without replacement.
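As an additional, purely illustrative comparison, a classical stepwise search could be run on the same starting model before turning to the Bayesian results below. The sketch uses stepAIC() from the MASS package (loaded earlier) on a complete-case copy of the data, since stepwise refits can fail if the number of usable rows changes between steps; its selected variables may differ from the bas.lm() results.
# Illustrative alternative: classical stepwise selection by AIC (MASS::stepAIC)
# on a complete-case copy of the variables used in the initial model.
ames_cc <- na.omit(ames_train[, c("price", "area", "Overall.Cond", "Year.Built",
                                  "Garage.Area", "Total.Bsmt.SF", "Garage.Cars",
                                  "Full.Bath", "Half.Bath", "Bedroom.AbvGr",
                                  "X1st.Flr.SF", "X2nd.Flr.SF", "Lot.Area",
                                  "Central.Air")])
step.model <- stepAIC(lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
                           Garage.Area + Total.Bsmt.SF + Garage.Cars +
                           Full.Bath + Half.Bath + Bedroom.AbvGr +
                           X1st.Flr.SF + X2nd.Flr.SF + log(Lot.Area) +
                           Central.Air, data = ames_cc),
                      direction = "both", trace = FALSE)
formula(step.model)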
Using Bayesian model averaging with the BIC prior, the top model suggests excluding Garage.Area, Full.Bath, Half.Bath, X1st.Flr.SF, and X2nd.Flr.SF, whereas AIC suggests removing only Garage.Area and X1st.Flr.SF. The posterior probability of the top model is 0.7514 in the BIC case and 0.0894 in the AIC case; the model with the higher posterior probability should have the better predictive quality. At the same time, the AIC model is still worth considering, because its posterior probability is higher than the uniform prior probability of any single model. The R-squared values stay approximately the same (about 0.85) in both cases.
model.bic = bas.lm(log(price) ~ Overall.Cond+ log(area)+ Year.Built+ Garage.Area +
Total.Bsmt.SF + Garage.Cars +
Full.Bath + Half.Bath +
Bedroom.AbvGr + X1st.Flr.SF +
X2nd.Flr.SF +
log(Lot.Area) + Central.Air,
prior = "BIC",
method = "MCMC",
modelprior = uniform(),
data=ames_train
)
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Garage.Area + : dropping 2 rows due to missing data
image(model.bic)
model.aic = bas.lm(log(price) ~ Overall.Cond+ log(area)+ Year.Built+ Garage.Area +
Total.Bsmt.SF + Garage.Cars +
Full.Bath + Half.Bath +
Bedroom.AbvGr + X1st.Flr.SF +
X2nd.Flr.SF +
log(Lot.Area) + Central.Air,
prior = "AIC",
method = "BAS",
modelprior = uniform(),
data=ames_train
)
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Garage.Area + : dropping 2 rows due to missing data
image(model.aic)
summary(model.bic)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000e+00 1.000000e+00
## Overall.Cond 0.99996948 1.0000 1.000000e+00 1.000000e+00
## log(area) 0.99994507 1.0000 1.000000e+00 1.000000e+00
## Year.Built 0.99987793 1.0000 1.000000e+00 1.000000e+00
## Garage.Area 0.04313354 0.0000 0.000000e+00 0.000000e+00
## Total.Bsmt.SF 0.99998779 1.0000 1.000000e+00 1.000000e+00
## Garage.Cars 0.99197388 1.0000 1.000000e+00 1.000000e+00
## Full.Bath 0.05531616 0.0000 0.000000e+00 1.000000e+00
## Half.Bath 0.04059448 0.0000 0.000000e+00 0.000000e+00
## Bedroom.AbvGr 0.99992065 1.0000 1.000000e+00 1.000000e+00
## X1st.Flr.SF 0.03625488 0.0000 0.000000e+00 0.000000e+00
## X2nd.Flr.SF 0.06660156 0.0000 1.000000e+00 0.000000e+00
## log(Lot.Area) 0.99982910 1.0000 1.000000e+00 1.000000e+00
## Central.AirY 0.95845337 1.0000 1.000000e+00 1.000000e+00
## BF NA 1.0000 7.122177e-02 5.370529e-02
## PostProbs NA 0.7514 5.230000e-02 4.090000e-02
## R2 NA 0.8496 8.499000e-01 8.498000e-01
## dim NA 9.0000 1.000000e+01 1.000000e+01
## logmarg NA -1666.2341 -1.668876e+03 -1.669158e+03
## model 4 model 5
## Intercept 1.000000e+00 1.0000000
## Overall.Cond 1.000000e+00 1.0000000
## log(area) 1.000000e+00 1.0000000
## Year.Built 1.000000e+00 1.0000000
## Garage.Area 0.000000e+00 0.0000000
## Total.Bsmt.SF 1.000000e+00 1.0000000
## Garage.Cars 1.000000e+00 1.0000000
## Full.Bath 0.000000e+00 0.0000000
## Half.Bath 0.000000e+00 1.0000000
## Bedroom.AbvGr 1.000000e+00 1.0000000
## X1st.Flr.SF 0.000000e+00 0.0000000
## X2nd.Flr.SF 0.000000e+00 0.0000000
## log(Lot.Area) 1.000000e+00 1.0000000
## Central.AirY 0.000000e+00 1.0000000
## BF 3.818177e-02 0.0404895
## PostProbs 3.240000e-02 0.0288000
## R2 8.476000e-01 0.8497000
## dim 8.000000e+00 10.0000000
## logmarg -1.669500e+03 -1669.4408194
summary(model.aic)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.000000 1.0000000
## Overall.Cond 1.0000000 1.0000 1.000000 1.0000000
## log(area) 1.0000000 1.0000 1.000000 1.0000000
## Year.Built 1.0000000 1.0000 1.000000 1.0000000
## Garage.Area 0.2828088 0.0000 0.000000 0.0000000
## Total.Bsmt.SF 1.0000000 1.0000 1.000000 1.0000000
## Garage.Cars 0.9926658 1.0000 1.000000 1.0000000
## Full.Bath 0.4733494 1.0000 0.000000 0.0000000
## Half.Bath 0.4821001 1.0000 0.000000 0.0000000
## Bedroom.AbvGr 1.0000000 1.0000 1.000000 1.0000000
## X1st.Flr.SF 0.3445712 0.0000 0.000000 0.0000000
## X2nd.Flr.SF 0.5722852 1.0000 0.000000 1.0000000
## log(Lot.Area) 1.0000000 1.0000 1.000000 1.0000000
## Central.AirY 0.9963427 1.0000 1.000000 1.0000000
## BF NA 1.0000 0.922009 0.7631655
## PostProbs NA 0.0894 0.082400 0.0682000
## R2 NA 0.8505 0.849600 0.8499000
## dim NA 12.0000 9.000000 10.0000000
## logmarg NA -1644.0770 -1644.158217 -1644.3472971
## model 4 model 5
## Intercept 1.0000000 1.0000000
## Overall.Cond 1.0000000 1.0000000
## log(area) 1.0000000 1.0000000
## Year.Built 1.0000000 1.0000000
## Garage.Area 0.0000000 0.0000000
## Total.Bsmt.SF 1.0000000 1.0000000
## Garage.Cars 1.0000000 1.0000000
## Full.Bath 0.0000000 1.0000000
## Half.Bath 1.0000000 0.0000000
## Bedroom.AbvGr 1.0000000 1.0000000
## X1st.Flr.SF 0.0000000 0.0000000
## X2nd.Flr.SF 1.0000000 0.0000000
## log(Lot.Area) 1.0000000 1.0000000
## Central.AirY 1.0000000 1.0000000
## BF 0.6501771 0.5754705
## PostProbs 0.0581000 0.0514000
## R2 0.8501000 0.8498000
## dim 11.0000000 10.0000000
## logmarg -1644.5075272 -1644.6295841
model.bic.full=bas.lm(log(price) ~ Overall.Cond+ log(area)+ Year.Built+
Total.Bsmt.SF + Garage.Cars +
Bedroom.AbvGr +
log(Lot.Area) + Central.Air,
prior = "BIC",
method = "MCMC",
modelprior = uniform(),
data=ames_train
)
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Total.Bsmt.SF + : dropping 2 rows due to missing data
summary(model.bic.full)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.000000e+00 1.000000e+00
## Overall.Cond 0.9996094 1.0000 1.000000e+00 1.000000e+00
## log(area) 0.9998047 1.0000 1.000000e+00 1.000000e+00
## Year.Built 0.9994141 1.0000 1.000000e+00 1.000000e+00
## Total.Bsmt.SF 0.9972656 1.0000 1.000000e+00 1.000000e+00
## Garage.Cars 0.9919922 1.0000 1.000000e+00 0.000000e+00
## Bedroom.AbvGr 0.9958984 1.0000 1.000000e+00 0.000000e+00
## log(Lot.Area) 0.9953125 1.0000 1.000000e+00 0.000000e+00
## Central.AirY 0.9705078 1.0000 0.000000e+00 1.000000e+00
## BF NA 1.0000 3.818177e-02 1.894233e-36
## PostProbs NA 0.9660 2.600000e-02 2.000000e-03
## R2 NA 0.8496 8.476000e-01 8.189000e-01
## dim NA 9.0000 8.000000e+00 6.000000e+00
## logmarg NA -1666.2341 -1.669500e+03 -1.748488e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## Overall.Cond 1.000000e+00 1.000000e+00
## log(area) 1.000000e+00 1.000000e+00
## Year.Built 1.000000e+00 1.000000e+00
## Total.Bsmt.SF 1.000000e+00 0.000000e+00
## Garage.Cars 0.000000e+00 0.000000e+00
## Bedroom.AbvGr 0.000000e+00 1.000000e+00
## log(Lot.Area) 1.000000e+00 0.000000e+00
## Central.AirY 1.000000e+00 0.000000e+00
## BF 4.621375e-28 2.206854e-71
## PostProbs 1.800000e-03 1.400000e-03
## R2 8.270000e-01 7.858000e-01
## dim 7.000000e+00 5.000000e+00
## logmarg -1.729176e+03 -1.828926e+03
lm.bic = lm(log(price) ~ Overall.Cond+ log(area)+ Year.Built+
Total.Bsmt.SF + Garage.Cars +
Bedroom.AbvGr +
log(Lot.Area) + Central.Air,data=ames_train
)
model.aic.full=bas.lm(log(price) ~ Overall.Cond+ log(area)+ Year.Built+
Total.Bsmt.SF + Garage.Cars +
Full.Bath + Half.Bath +
Bedroom.AbvGr +
X2nd.Flr.SF +
log(Lot.Area) + Central.Air,
prior = "AIC",
method = "BAS",
modelprior = uniform(),
data=ames_train
)
## Warning in bas.lm(log(price) ~ Overall.Cond + log(area) + Year.Built +
## Total.Bsmt.SF + : dropping 2 rows due to missing data
summary(model.aic.full)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.000000 1.0000000
## Overall.Cond 1.0000000 1.0000 1.000000 1.0000000
## log(area) 1.0000000 1.0000 1.000000 1.0000000
## Year.Built 1.0000000 1.0000 1.000000 1.0000000
## Total.Bsmt.SF 1.0000000 1.0000 1.000000 1.0000000
## Garage.Cars 0.9999997 1.0000 1.000000 1.0000000
## Full.Bath 0.4718752 1.0000 0.000000 0.0000000
## Half.Bath 0.4744016 1.0000 0.000000 0.0000000
## Bedroom.AbvGr 1.0000000 1.0000 1.000000 1.0000000
## X2nd.Flr.SF 0.5541969 1.0000 0.000000 1.0000000
## log(Lot.Area) 1.0000000 1.0000 1.000000 1.0000000
## Central.AirY 0.9962521 1.0000 1.000000 1.0000000
## BF NA 1.0000 0.922009 0.7631655
## PostProbs NA 0.1902 0.175400 0.1452000
## R2 NA 0.8505 0.849600 0.8499000
## dim NA 12.0000 9.000000 10.0000000
## logmarg NA -1644.0770 -1644.158217 -1644.3472971
## model 4 model 5
## Intercept 1.0000000 1.0000000
## Overall.Cond 1.0000000 1.0000000
## log(area) 1.0000000 1.0000000
## Year.Built 1.0000000 1.0000000
## Total.Bsmt.SF 1.0000000 1.0000000
## Garage.Cars 1.0000000 1.0000000
## Full.Bath 0.0000000 1.0000000
## Half.Bath 1.0000000 0.0000000
## Bedroom.AbvGr 1.0000000 1.0000000
## X2nd.Flr.SF 1.0000000 0.0000000
## log(Lot.Area) 1.0000000 1.0000000
## Central.AirY 1.0000000 1.0000000
## BF 0.6501771 0.5754705
## PostProbs 0.1237000 0.1095000
## R2 0.8501000 0.8498000
## dim 11.0000000 10.0000000
## logmarg -1644.5075272 -1644.6295841
lm.aic = lm(log(price) ~ Overall.Cond+ log(area)+ Year.Built+
Total.Bsmt.SF + Garage.Cars +
Full.Bath + Half.Bath +
Bedroom.AbvGr +
X2nd.Flr.SF +
log(Lot.Area) + Central.Air,data=ames_train
)
After dropping the variables flagged by BIC, the top model has a high posterior probability of 0.9660. For the AIC version, after dropping only Garage.Area and X1st.Flr.SF, the top model's posterior probability is 0.1902, which is still acceptable because it is higher than the uniform prior probability. While we lean toward the first model as the more parsimonious one, we will get the definitive answer only after comparing RMSE. * * *
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
The model's residuals are approximately normally distributed and centered around 0; the distribution is symmetric (neither fan-shaped nor skewed) and lies close to the Q-Q line. However, we can observe some strong outliers, observations 428, 310, and 181, which can be tracked across all three plots.
plot(model.bic.full)
abline(h=0)
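As a quick supplementary check, the flagged observations can be pulled out of the training data to see whether anything unusual, such as the sale condition, explains them:
# Inspect the rows flagged as outliers in the residual plots above.
ames_train[c(428, 310, 181), c("PID", "price", "area", "Overall.Cond", "Sale.Condition")]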
For the initial model, compute the RMSE. You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
The calculated RMSE of $35,826.36 represents the standard deviation, on the training data (ames_train), of the differences between the values predicted by the lm.bic model and the observed prices; this is quite high.
# Extract Predictions for bic
predict.full.train.bic <- exp(predict(lm.bic, ames_train))
# Extract Residuals
resid.full.train.bic <- ames_train$price - predict.full.train.bic
# Calculate RMSE
rmse.full.train.bic <- sqrt(mean(resid.full.train.bic^2, na.rm = TRUE))
rmse.full.train.bic
## [1] 35826.36
# Extract Predictions for aic
predict.full.train.aic <- exp(predict(lm.aic, ames_train))
# Extract Residuals
resid.full.train.aic <- ames_train$price - predict.full.train.aic
# Calculate RMSE
rmse.full.train.aic <- sqrt(mean(resid.full.train.aic^2, na.rm = TRUE))
rmse.full.train.aic
## [1] 36112.87
Comparing the RMSE values, we find that the RMSE is lower for the BIC model, which is a sign to pick the BIC model: in general, the lower the RMSE, the better. * * *
The process of building a model generally involves starting with an initial model, identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called overfitting. To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
load("ames_test")
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
In general, the better the model fits, the lower the RMSE. Here the predictions are actually more accurate on the test data than on the training data, even though the model was built on the training data. We reach this conclusion by comparing the RMSE values, which are lower for the test data.
# Extract Predictions
predict.full.test <- exp(predict(lm.bic, ames_test))
# Extract Residuals
resid.full.test <- ames_test$price - predict.full.test
# Calculate RMSE
rmse.full.test.bic <- sqrt(mean(resid.full.test^2, na.rm = TRUE))
rmse.full.test.bic
## [1] 26864.08
rmse.full.train.bic > rmse.full.test.bic
## [1] TRUE
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Provide the summary table for your model.
In this model, with its large number of variables (14), we can see that many of them add little to the model and should be either transformed (next step) or excluded. The residual standard error of 36,600 on 932 degrees of freedom is in dollars, because price is untransformed here; it is quite high and shows that this model is not yet suitable for predicting house prices. Note that it is not directly comparable to the 0.1635 of the previous model, which was fit on the log(price) scale.
allvar.model.step1 = lm(price ~ area + Overall.Cond + Year.Built + Year.Remod.Add + Heating + Central.Air + Bedroom.AbvGr+ Kitchen.AbvGr + Kitchen.Qual + TotRms.AbvGrd+ Fireplaces + Garage.Yr.Blt + Garage.Area + Pool.Area , data = ames_train)
summary(allvar.model.step1)
##
## Call:
## lm(formula = price ~ area + Overall.Cond + Year.Built + Year.Remod.Add +
## Heating + Central.Air + Bedroom.AbvGr + Kitchen.AbvGr + Kitchen.Qual +
## TotRms.AbvGrd + Fireplaces + Garage.Yr.Blt + Garage.Area +
## Pool.Area, data = ames_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -353341 -19602 -2026 16953 269110
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.480e+06 1.798e+05 -8.230 6.28e-16 ***
## area 6.368e+01 5.071e+00 12.557 < 2e-16 ***
## Overall.Cond 7.556e+03 1.389e+03 5.442 6.74e-08 ***
## Year.Built 8.073e+02 9.458e+01 8.535 < 2e-16 ***
## Year.Remod.Add 1.656e+02 9.442e+01 1.754 0.079708 .
## HeatingGasW 2.863e+04 1.469e+04 1.949 0.051568 .
## HeatingGrav 4.354e+04 3.726e+04 1.169 0.242814
## HeatingWall 2.890e+04 3.788e+04 0.763 0.445612
## Central.AirY 9.627e+03 7.541e+03 1.276 0.202100
## Bedroom.AbvGr -1.481e+04 2.159e+03 -6.859 1.26e-11 ***
## Kitchen.AbvGr -2.706e+04 7.760e+03 -3.487 0.000512 ***
## Kitchen.QualFa -7.411e+04 1.222e+04 -6.065 1.92e-09 ***
## Kitchen.QualGd -7.082e+04 5.300e+03 -13.364 < 2e-16 ***
## Kitchen.QualPo -2.335e+04 3.793e+04 -0.615 0.538376
## Kitchen.QualTA -8.269e+04 6.251e+03 -13.228 < 2e-16 ***
## TotRms.AbvGrd 4.745e+03 1.569e+03 3.025 0.002551 **
## Fireplaces 1.558e+04 2.199e+03 7.082 2.80e-12 ***
## Garage.Yr.Blt -1.731e+02 1.090e+02 -1.587 0.112764
## Garage.Area 8.051e+01 8.463e+00 9.513 < 2e-16 ***
## Pool.Area -3.673e+01 3.896e+01 -0.943 0.346003
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36600 on 932 degrees of freedom
## (48 observations deleted due to missingness)
## Multiple R-squared: 0.8038, Adjusted R-squared: 0.7998
## F-statistic: 201 on 19 and 932 DF, p-value: < 2.2e-16
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
After inspecting the residuals of the previous model, we found that three houses, observations 610 (PID 531375070), 428 (PID 902207130), and 310 (PID 908154205), were sold under abnormal conditions. It is better to drop them and keep only sales made under normal conditions, so that our predictions better reflect typical houses. We also log-transform price, area, and Garage.Area. After these changes, the residual standard error drops sharply to 0.1417 on 778 degrees of freedom (on the log scale, versus 36,600 dollars before), and the adjusted R-squared grows from 0.7998 to 0.8523. That means we are on the right track!
ames_train_normal <- ames_train%>%
filter(Sale.Condition == "Normal" & !(PID %in% c("531375070", "902207130", "908154205")))%>%
dplyr::select(price, area, Overall.Cond, Year.Built, Year.Remod.Add, Heating, Central.Air, Bedroom.AbvGr, Kitchen.AbvGr, Kitchen.Qual, TotRms.AbvGrd,Fireplaces, Garage.Yr.Blt,Garage.Area, Pool.Area)
allvar.model.step2 = lm(log(price) ~ log(area) + Overall.Cond + Year.Built + Year.Remod.Add + Heating + Central.Air + Bedroom.AbvGr+ Kitchen.AbvGr + Kitchen.Qual + TotRms.AbvGrd+ Fireplaces + Garage.Yr.Blt + log(Garage.Area) + Pool.Area , data = ames_train_normal)
summary(allvar.model.step2)
##
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Cond + Year.Built +
## Year.Remod.Add + Heating + Central.Air + Bedroom.AbvGr +
## Kitchen.AbvGr + Kitchen.Qual + TotRms.AbvGrd + Fireplaces +
## Garage.Yr.Blt + log(Garage.Area) + Pool.Area, data = ames_train_normal)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.74726 -0.08643 0.00512 0.08990 0.50311
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5675243 0.7640547 -2.052 0.040544 *
## log(area) 0.5488495 0.0330454 16.609 < 2e-16 ***
## Overall.Cond 0.0524961 0.0059707 8.792 < 2e-16 ***
## Year.Built 0.0042442 0.0003880 10.938 < 2e-16 ***
## Year.Remod.Add 0.0006107 0.0003910 1.562 0.118773
## HeatingGasW 0.2119427 0.0608877 3.481 0.000528 ***
## HeatingGrav 0.3390236 0.1446595 2.344 0.019350 *
## HeatingWall 0.1171465 0.1471950 0.796 0.426357
## Central.AirY 0.1703173 0.0312530 5.450 6.78e-08 ***
## Bedroom.AbvGr -0.0439392 0.0093640 -4.692 3.19e-06 ***
## Kitchen.AbvGr -0.0988982 0.0324339 -3.049 0.002372 **
## Kitchen.QualFa -0.2413396 0.0512914 -4.705 3.00e-06 ***
## Kitchen.QualGd -0.2314970 0.0247071 -9.370 < 2e-16 ***
## Kitchen.QualPo 0.0923343 0.1480338 0.624 0.532982
## Kitchen.QualTA -0.2999732 0.0279207 -10.744 < 2e-16 ***
## TotRms.AbvGrd 0.0115864 0.0065618 1.766 0.077833 .
## Fireplaces 0.0774477 0.0094210 8.221 8.44e-16 ***
## Garage.Yr.Blt -0.0004732 0.0004587 -1.032 0.302506
## log(Garage.Area) 0.1429367 0.0185184 7.719 3.61e-14 ***
## Pool.Area -0.0001527 0.0001517 -1.006 0.314718
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1417 on 778 degrees of freedom
## (36 observations deleted due to missingness)
## Multiple R-squared: 0.8558, Adjusted R-squared: 0.8523
## F-statistic: 243 on 19 and 778 DF, p-value: < 2.2e-16
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
Before assessing variable interactions, we drop the variables with the highest p-values based on the previous summary: TotRms.AbvGrd, Pool.Area, and Garage.Yr.Blt.
ames_train_normal_2 <- ames_train%>%
filter(Sale.Condition == "Normal" & !(PID %in% c("531375070", "902207130", "908154205")))%>%
dplyr::select(price, area, Overall.Cond, Year.Built, Year.Remod.Add, Heating, Central.Air, Bedroom.AbvGr, Kitchen.AbvGr, Kitchen.Qual, Fireplaces, Garage.Area)
allvar.model.step3 = lm(log(price) ~ ., data = ames_train_normal_2)
summary(allvar.model.step3)
##
## Call:
## lm(formula = log(price) ~ ., data = ames_train_normal_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.71928 -0.08462 0.00294 0.09094 0.52087
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.695e+00 7.079e-01 2.394 0.016874 *
## area 3.780e-04 1.846e-05 20.470 < 2e-16 ***
## Overall.Cond 4.831e-02 5.714e-03 8.453 < 2e-16 ***
## Year.Built 4.148e-03 2.847e-04 14.571 < 2e-16 ***
## Year.Remod.Add 6.765e-04 3.774e-04 1.792 0.073426 .
## HeatingGasW 2.202e-01 5.789e-02 3.805 0.000153 ***
## HeatingGrav 2.713e-01 1.464e-01 1.853 0.064252 .
## HeatingOthW 2.331e-01 1.490e-01 1.564 0.118198
## HeatingWall 8.680e-02 1.479e-01 0.587 0.557349
## Central.AirY 1.740e-01 2.878e-02 6.048 2.23e-09 ***
## Bedroom.AbvGr -3.119e-02 8.449e-03 -3.691 0.000238 ***
## Kitchen.AbvGr -5.861e-02 2.951e-02 -1.986 0.047375 *
## Kitchen.QualFa -2.882e-01 4.605e-02 -6.259 6.25e-10 ***
## Kitchen.QualGd -2.047e-01 2.502e-02 -8.179 1.09e-15 ***
## Kitchen.QualPo 1.105e-01 1.497e-01 0.738 0.460507
## Kitchen.QualTA -2.753e-01 2.829e-02 -9.732 < 2e-16 ***
## Fireplaces 7.807e-02 9.239e-03 8.451 < 2e-16 ***
## Garage.Area 2.949e-04 3.145e-05 9.378 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1441 on 816 degrees of freedom
## Multiple R-squared: 0.8605, Adjusted R-squared: 0.8576
## F-statistic: 296.2 on 17 and 816 DF, p-value: < 2.2e-16
In this part we check whether our predictors are linearly correlated with one another. If they are, we will need to exclude one variable from each correlated pair. We check them in groups: Year.Built and Year.Remod.Add; Heating and Central.Air; and Bedroom.AbvGr, Kitchen.AbvGr, area, and Garage.Area.
ggpairs(ames_train_normal_2, columns = c(4, 5), main = "Years of Remodeling and Building")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(ames_train_normal_2, columns = c(6, 7), main = "Heating and Air")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggpairs(ames_train_normal_2, columns = c(2, 8,9, 12), main = "Areas")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
ggpairs(ames_train_normal_2, columns = c(10, 2), main = "Kitchen vs area")
## Warning in warn_if_args_exist(list(...)): Extra arguments: 'main' are
## being ignored. If these are meant to be aesthetics, submit them using the
## 'mapping' variable within ggpairs with ggplot2::aes or ggplot2::aes_string.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
After examining the plots, we can see that several of the pairs are correlated. The correlation between Year.Built and Year.Remod.Add is 0.5775, which is quite high, so we remove one of them, Year.Remod.Add. Heating and Central.Air show no strong correlation, but we remove Heating anyway because its levels have high p-values and it is conceptually tied to Central.Air. Among Bedroom.AbvGr, Kitchen.AbvGr, area, and Garage.Area: the correlation between area and Bedroom.AbvGr is 0.584, so we drop Bedroom.AbvGr, and the correlation between area and Garage.Area is 0.474, which is close to 0.5, so we drop Garage.Area as well.
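The same pairwise correlations can be read off numerically; a short supplementary sketch over the relevant numeric columns:
# Numeric cross-check of the pairwise correlations discussed above.
round(cor(ames_train_normal_2[, c("area", "Year.Built", "Year.Remod.Add",
                                  "Bedroom.AbvGr", "Kitchen.AbvGr", "Garage.Area")],
          use = "pairwise.complete.obs"), 3)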
As a result of this step, our model shrinks to the following:
ames_train_normal_3 <- ames_train%>%
filter(Sale.Condition == "Normal" & !(PID %in% c("531375070", "902207130", "908154205")))%>%
dplyr::select(price, area, Overall.Cond, Year.Built, Central.Air, Kitchen.AbvGr, Kitchen.Qual, Fireplaces)
allvar.model.step4 = lm(log(price) ~ log(area)+ Overall.Cond+ Year.Built+Central.Air+Kitchen.AbvGr+ Kitchen.Qual+ Fireplaces, data = ames_train_normal_3)
summary(allvar.model.step4)
##
## Call:
## lm(formula = log(price) ~ log(area) + Overall.Cond + Year.Built +
## Central.Air + Kitchen.AbvGr + Kitchen.Qual + Fireplaces,
## data = ames_train_normal_3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82054 -0.08550 0.00456 0.09613 0.59119
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.7218825 0.5608928 -3.070 0.00221 **
## log(area) 0.5801915 0.0211289 27.460 < 2e-16 ***
## Overall.Cond 0.0520614 0.0057668 9.028 < 2e-16 ***
## Year.Built 0.0047945 0.0002638 18.173 < 2e-16 ***
## Central.AirY 0.1350342 0.0292066 4.623 4.38e-06 ***
## Kitchen.AbvGr -0.0831529 0.0308234 -2.698 0.00712 **
## Kitchen.QualFa -0.3556534 0.0475486 -7.480 1.91e-13 ***
## Kitchen.QualGd -0.2674320 0.0261879 -10.212 < 2e-16 ***
## Kitchen.QualPo -0.0820961 0.1596924 -0.514 0.60733
## Kitchen.QualTA -0.3590976 0.0284771 -12.610 < 2e-16 ***
## Fireplaces 0.0867521 0.0095348 9.098 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1545 on 823 degrees of freedom
## Multiple R-squared: 0.8382, Adjusted R-squared: 0.8363
## F-statistic: 426.5 on 10 and 823 DF, p-value: < 2.2e-16
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
During the selection procedure, I wanted to pick variables that capture the year of construction, the condition and size of the house, and amenities such as central air and kitchen quality. Based on the EDA, these variables have few missing values, show low p-values, and are not strongly correlated with one another, which keeps the model light for prediction. Interestingly, after all of these modifications, Bayesian model averaging with the BIC prior and MCMC sampling suggests dropping Kitchen.AbvGr as well, so we do.
model.step4.bas = bas.lm(log(price) ~ log(area)+ Overall.Cond+ Year.Built+Central.Air+ Kitchen.Qual+ Fireplaces, prior = "BIC",
method = "MCMC",
modelprior = uniform(), data = ames_train_normal_3)
summary(model.step4.bas)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.000000e+00 1.000000e+00
## log(area) 0.99960938 1.0000 1.000000e+00 1.000000e+00
## Overall.Cond 0.99707031 1.0000 1.000000e+00 0.000000e+00
## Year.Built 0.99863281 1.0000 1.000000e+00 1.000000e+00
## Central.AirY 0.99746094 1.0000 1.000000e+00 0.000000e+00
## Kitchen.QualFa 0.99853516 1.0000 1.000000e+00 0.000000e+00
## Kitchen.QualGd 0.99736328 1.0000 1.000000e+00 0.000000e+00
## Kitchen.QualPo 0.03671875 0.0000 1.000000e+00 0.000000e+00
## Kitchen.QualTA 0.99892578 1.0000 1.000000e+00 1.000000e+00
## Fireplaces 0.99960938 1.0000 1.000000e+00 1.000000e+00
## BF NA 1.0000 3.781668e-02 1.957066e-50
## PostProbs NA 0.9602 3.670000e-02 7.000000e-04
## R2 NA 0.8368 8.368000e-01 7.782000e-01
## dim NA 9.0000 1.000000e+01 5.000000e+00
## logmarg NA -1275.9448 -1.279220e+03 -1.390403e+03
## model 4 model 5
## Intercept 1.000000e+00 1.000000e+00
## log(area) 1.000000e+00 0.000000e+00
## Overall.Cond 0.000000e+00 0.000000e+00
## Year.Built 0.000000e+00 0.000000e+00
## Central.AirY 0.000000e+00 0.000000e+00
## Kitchen.QualFa 1.000000e+00 0.000000e+00
## Kitchen.QualGd 0.000000e+00 0.000000e+00
## Kitchen.QualPo 0.000000e+00 0.000000e+00
## Kitchen.QualTA 0.000000e+00 0.000000e+00
## Fireplaces 1.000000e+00 1.000000e+00
## BF 7.987519e-149 1.511276e-258
## PostProbs 6.000000e-04 4.000000e-04
## R2 6.150000e-01 2.829000e-01
## dim 4.000000e+00 2.000000e+00
## logmarg -1.616952e+03 -1.869599e+03
image(model.step4.bas, top.models = 5)
final.model=lm(log(price) ~ log(area)+ Overall.Cond+ Year.Built+Central.Air+ Kitchen.Qual+ Fireplaces, data = ames_train_normal_3)
As we can see, the model also suggests dropping some terms that are individual levels of a single factor (for example, Kitchen.QualPo); to keep each factor intact and its marginal probabilities meaningful, we do not drop any of them.
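For completeness, the model-averaged coefficient estimates and posterior inclusion probabilities can also be printed directly from the BAS fit:
# Posterior means, standard deviations, and inclusion probabilities under BMA.
coef(model.step4.bas)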
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
Applying the model to the test dataset, we found the results are close to those on the training data: about 95% of test observations fall within the prediction intervals, compared to about 92% on the training set.
# Mean predicted price on the training data
model.train <- mean(exp(predict(final.model, ames_train)))
# Mean predicted price on the test data
model.test <- mean(exp(predict(final.model, ames_test)))
# Check whether the average predicted price is higher on the training set
model.train > model.test
## [1] TRUE
# Predict prices train
predict.train <- exp(predict(final.model, ames_train, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage.prob.train <- mean(ames_train$price > predict.train[,"lwr"] &
ames_train$price < predict.train[,"upr"])
coverage.prob.train
## [1] 0.921
# Predict prices test
predict.full <- exp(predict(final.model, ames_test, interval = "prediction"))
# Calculate proportion of observations that fall within prediction intervals
coverage.prob.full <- mean(ames_test$price > predict.full[,"lwr"] &
ames_test$price < predict.full[,"upr"])
coverage.prob.full
## [1] 0.9498164
For your final model, create and briefly interpret an informative plot of the residuals.
Examining the residuals, we can see that they are approximately normally distributed, with the previously identified outliers (observations 428 and 310) visible in all three plots.
plot(final.model, which = 1)
hist(final.model$residuals)
qqnorm(final.model$residuals)
qqline(final.model$residuals)
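To back up the visual identification of observations 428 and 310, the studentized residuals can also be checked numerically (a minimal added sketch; the cutoff of 3 is a common rule of thumb and not part of the original analysis):
# Studentized residuals of the final model
rstud <- rstudent(final.model)
# Row indices with unusually large studentized residuals
which(abs(rstud) > 3)
# Five largest residuals in absolute value, for reference
head(sort(abs(rstud), decreasing = TRUE), 5)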
The residuals-vs-fitted plot shows only a very slight curve, which is a good sign for the spread of the residuals. The residuals are approximately normally distributed, centered around 0, with slight deviations at the tails of the QQ plot.
* * *
Running the model on the test data, we found that it yields a lower RMSE there than on the training data, so we can conclude the model fits well.
#extract prediction
predict.model.train <- exp(predict(final.model, ames_train))
# Extract Residuals
resid.train <- ames_train$price - predict.model.train
# Calculate RMSE
rmse.train <- sqrt(mean(resid.train^2))
#extract prediction
predict.model.test <- exp(predict(final.model, ames_test))
# Extract Residuals
resid.test <- ames_test$price - predict.model.test
# Calculate RMSE
rmse.test <- sqrt(mean(resid.test^2))
#final model
rmse.train
## [1] 37910.31
rmse.test
## [1] 32348
#initial model
rmse.full.train.bic
## [1] 35826.36
rmse.full.test.bic
## [1] 26864.08
Summarising:
initial train model's RMSE is $35826.36
initial test model's RMSE is $26864.08
final train model's RMSE is $37910.31
final test model's RMSE is $32348
The initial test model shows the lowest RMSE, which suggests a better fit compared with the final test model. Nevertheless, the final model's test RMSE is lower than its training RMSE, which is a good sign.
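To make this comparison easier to scan, the four RMSE values computed above can be gathered into a single data frame (a small added sketch that only reuses the objects created earlier):
# Side-by-side RMSE comparison of the initial (BIC) and final models
rmse.summary <- data.frame(model = c("initial (BIC)", "final"),
                           train = c(rmse.full.train.bic, rmse.train),
                           test = c(rmse.full.test.bic, rmse.test))
rmse.summary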
As an option, we could return to the model, analyse its summary again, and drop a few more of the variables that were questioned during the selection process. Because we are interested in those variables, we chose to keep them.
What are some strengths and weaknesses of your model?
Strength. The final model has a high posterior probability (0.60), which points to a well-performing model. The model also lets us look at how house prices are formed from several angles: not only year of construction and overall condition, but also air conditioning and kitchen quality.
Weakness. A posterior probability of 0.60 points to a well-performing model but does not guarantee the best results. The model was built on houses sold under normal conditions, so atypical sales are excluded.
Testing your final model on a separate validation data set is a great way to determine how your model will perform in real-life practice.
You will use the ames_validation dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:
* What is the RMSE of your final model when applied to the validation data?
* How does this value compare to that of the training data and/or testing data?
* What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
* From this result, does your final model properly reflect uncertainty?
load("ames_validation.Rdata")
str(ames_validation)
## Classes 'tbl_df', 'tbl' and 'data.frame': 763 obs. of 81 variables:
## $ PID : int 909276070 534278190 534204120 535425040 907202160 903227070 531375050 535375140 534201230 906380100 ...
## $ area : int 1717 894 1073 1773 1026 1348 1175 869 1051 1800 ...
## $ price : int 194000 117600 156000 183000 147000 117000 146000 89900 142000 239000 ...
## $ MS.SubClass : int 50 20 20 20 80 50 20 20 20 20 ...
## $ MS.Zoning : Factor w/ 7 levels "A (agr)","C (all)",..: 6 6 6 6 6 7 6 6 6 6 ...
## $ Lot.Frontage : int 80 70 79 80 NA 50 63 60 80 122 ...
## $ Lot.Area : int 12400 8680 10289 10800 10970 6000 13072 10122 9600 11923 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
## $ Lot.Shape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 4 4 1 4 4 4 4 1 ...
## $ Land.Contour : Factor w/ 4 levels "Bnk","HLS","Low",..: 2 4 4 4 3 4 4 4 4 4 ...
## $ Utilities : Factor w/ 3 levels "AllPub","NoSeWa",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Lot.Config : Factor w/ 5 levels "Corner","CulDSac",..: 5 5 5 5 5 1 5 5 1 1 ...
## $ Land.Slope : Factor w/ 3 levels "Gtl","Mod","Sev": 2 1 1 1 2 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 28 levels "Blmngtn","Blueste",..: 7 16 16 16 6 4 23 21 16 6 ...
## $ Condition.1 : Factor w/ 9 levels "Artery","Feedr",..: 3 3 3 4 3 3 6 3 2 3 ...
## $ Condition.2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ Bldg.Type : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ House.Style : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 1 3 3 3 8 1 3 3 3 3 ...
## $ Overall.Qual : int 5 5 5 6 6 5 6 4 5 9 ...
## $ Overall.Cond : int 6 7 7 6 6 8 5 6 5 5 ...
## $ Year.Built : int 1940 1960 1965 1961 1978 1925 2005 1948 1968 2007 ...
## $ Year.Remod.Add : int 1950 1960 1965 1992 1978 1997 2006 1950 1968 2007 ...
## $ Roof.Style : Factor w/ 6 levels "Flat","Gable",..: 2 4 4 2 2 2 2 2 2 2 ...
## $ Roof.Matl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior.1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 15 6 9 7 10 15 14 9 9 14 ...
## $ Exterior.2nd : Factor w/ 17 levels "AsbShng","AsphShn",..: 16 6 9 7 7 16 15 9 9 15 ...
## $ Mas.Vnr.Type : Factor w/ 6 levels "","BrkCmn","BrkFace",..: 5 5 3 3 5 5 3 5 5 5 ...
## $ Mas.Vnr.Area : int 0 0 168 104 0 0 126 0 0 0 ...
## $ Exter.Qual : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 4 4 4 4 4 4 3 ...
## $ Exter.Cond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 2 2 2 2 2 1 3 4 2 3 ...
## $ Bsmt.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 4 6 6 6 4 6 4 NA 6 2 ...
## $ Bsmt.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 4 6 4 NA 6 6 ...
## $ Bsmt.Exposure : Factor w/ 5 levels "","Av","Gd","Mn",..: 4 5 5 5 3 5 5 NA 5 5 ...
## $ BsmtFin.Type.1 : Factor w/ 7 levels "","ALQ","BLQ",..: 3 7 2 6 4 7 4 NA 2 7 ...
## $ BsmtFin.SF.1 : int 602 0 836 913 505 0 80 0 758 0 ...
## $ BsmtFin.Type.2 : Factor w/ 7 levels "","ALQ","BLQ",..: 7 7 7 7 5 7 7 NA 7 7 ...
## $ BsmtFin.SF.2 : int 0 0 0 0 435 0 0 0 0 0 ...
## $ Bsmt.Unf.SF : int 299 894 237 400 0 884 1095 0 293 1800 ...
## $ Total.Bsmt.SF : int 901 894 1073 1313 940 884 1175 0 1051 1800 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Heating.QC : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 1 1 5 3 1 ...
## $ Central.Air : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 1 2 2 ...
## $ Electrical : Factor w/ 6 levels "","FuseA","FuseF",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ X1st.Flr.SF : int 1125 894 1073 1773 1026 884 1175 869 1051 1800 ...
## $ X2nd.Flr.SF : int 592 0 0 0 0 464 0 0 0 0 ...
## $ Low.Qual.Fin.SF: int 0 0 0 0 0 0 0 0 0 0 ...
## $ Bsmt.Full.Bath : int 0 0 1 1 1 1 1 0 1 0 ...
## $ Bsmt.Half.Bath : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Full.Bath : int 1 1 1 2 1 1 1 1 1 2 ...
## $ Half.Bath : int 1 0 1 0 0 0 0 0 0 0 ...
## $ Bedroom.AbvGr : int 2 3 3 3 3 3 3 1 3 2 ...
## $ Kitchen.AbvGr : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Kitchen.Qual : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 3 5 5 1 ...
## $ TotRms.AbvGrd : int 7 5 6 6 5 5 6 3 6 7 ...
## $ Functional : Factor w/ 8 levels "Maj1","Maj2",..: 8 8 8 4 8 8 8 8 8 8 ...
## $ Fireplaces : int 1 0 0 2 0 1 0 0 0 0 ...
## $ Fireplace.Qu : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 NA NA 5 NA 2 NA NA NA NA ...
## $ Garage.Type : Factor w/ 6 levels "2Types","Attchd",..: 2 6 2 2 6 6 NA 6 2 2 ...
## $ Garage.Yr.Blt : int 1940 1965 1965 1961 1981 1960 NA 1948 1968 2007 ...
## $ Garage.Finish : Factor w/ 4 levels "","Fin","RFn",..: 4 4 3 3 4 4 NA 4 3 2 ...
## $ Garage.Cars : int 1 1 2 2 2 1 0 1 2 2 ...
## $ Garage.Area : int 410 312 515 418 576 216 0 390 504 702 ...
## $ Garage.Qual : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 6 6 NA 3 6 6 ...
## $ Garage.Cond : Factor w/ 6 levels "","Ex","Fa","Gd",..: 6 6 6 6 3 6 NA 6 6 6 ...
## $ Paved.Drive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 1 3 1 3 3 ...
## $ Wood.Deck.SF : int 0 0 0 355 0 0 0 0 0 288 ...
## $ Open.Porch.SF : int 0 0 0 98 0 0 90 0 0 136 ...
## $ Enclosed.Porch : int 0 0 0 0 34 208 0 66 0 0 ...
## $ X3Ssn.Porch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Screen.Porch : int 113 0 0 144 0 0 0 0 0 0 ...
## $ Pool.Area : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Pool.QC : Factor w/ 4 levels "Ex","Fa","Gd",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Fence : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA 3 NA NA 1 NA NA ...
## $ Misc.Feature : Factor w/ 5 levels "Elev","Gar2",..: NA NA NA NA NA NA NA NA NA NA ...
## $ Misc.Val : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Mo.Sold : int 2 1 6 8 10 5 6 8 7 5 ...
## $ Yr.Sold : int 2006 2007 2007 2006 2008 2009 2008 2008 2006 2009 ...
## $ Sale.Type : Factor w/ 10 levels "COD","Con","ConLD",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ Sale.Condition : Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 5 5 5 5 5 5 5 ...
RMSE: final train = $37910.31
final test = $32348
final validation = $29329.01
The RMSE values go down from training to validation, as expected.
The validation check shows that about 95.5% of the 95% prediction intervals contain the true price of the house in the validation data set. As a result, we can say that the model properly reflects uncertainty. Using the BAS model to generate out-of-sample predictions and 95% prediction intervals, only about 4.5% of the observations (rows) in ames_validation
have sale prices that fall outside their prediction intervals. I consider this a very good result for the model.
#prediction
predict.model.validation <- exp(predict(final.model, ames_validation))
# Extract Residuals
resid.validation <- ames_validation$price - predict.model.validation
# Calculate RMSE
rmse.validation <- sqrt(mean(resid.validation^2))
rmse.validation
## [1] 29329.01
rmse.test
## [1] 32348
rmse.train
## [1] 37910.31
ci_ames = exp(confint(predict(model.step4.bas, ames_validation, se.fit = TRUE), parm="pred"))
length(which(ifelse(ames_validation$price > ci_ames[,"2.5%"] & ames_validation$price < ci_ames[,"97.5%"], "yes", "no") == "yes"))/nrow(ci_ames)*100
## [1] 95.54391
About 95.5% of the house prices lie within the 95% prediction interval.
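For comparison with the train/test coverage checks earlier, the same calculation could also be run with the lm-based final.model on the validation set (a sketch mirroring that code; its output is not reported in the original analysis):
# Coverage of the lm-based 95% prediction intervals on the validation data
predict.validation <- exp(predict(final.model, ames_validation, interval = "prediction"))
coverage.prob.validation <- mean(ames_validation$price > predict.validation[,"lwr"] &
                                 ames_validation$price < predict.validation[,"upr"])
coverage.prob.validation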
We started the research with EDA, where we found that a few variables should be transformed; we also identified variables with substantial missing data and did not use them to build the regression model.
By applying different model-selection approaches, we found that the BIC MCMC model performed better than the AIC-based model in this case; we therefore ran the tests for both models and obtained informative results.
After that we used the BIC MCMC algorithm to extend our model and add new variables that we considered significant for price estimation and house evaluation. Running BMA, we dropped a few variables in order to reach a parsimonious model.
The initial model included 3 variables: area, Overall.Cond and Year.Built; we then added Central.Air, Kitchen.Qual and Fireplaces.
We found that the model can be applied to out-of-sample data: about 95.5% of the true prices fall within the 95% prediction intervals.