MSDA 621 Data Preparation Project - Housing Prices Prediction in Illinois

The purpose of this project is to take a raw sample of data on housing prices in Illinois and build a model that predicts property prices from comparable characteristics.

Business Understanding, Project Idea, and Data Selection

Given the current economic climate, to say the housing market is in flux is a significant understatement. Additionally, now more than ever, so many data points are collected about property sales that it is difficult to understand what is truly important and can affect a property’s price. On Zillow or Realtor.com, every listing has so many features, details, and extras that it can be difficult to discern what matters to the average home buyer and what is an outlier.

To begin to address this challenge, we downloaded two sets of data concerning the Illinois housing market. (Disclaimer: this model will only be effective in this particular market, as geographic regions have different features that can affect the housing market; for example, oceanfront property in California has a high-value feature that a home in Illinois will not.) Our training data contained 1,460 rows and 81 variables. The test data contained 1,459 rows with the same variables, except that sale price was NOT included. The included variables ranged from general information like the type of dwelling, lot shape, and access to a main road to the very detailed, like basement finish, electrical system, and fireplace quality. The source for our data can be found here.

For this project, we will use visualizations and plots to look for correlations at a surface level, followed by linear regression with log transformation, bagging, random forest, and XGBoost to minimize the RMSE and build the most accurate model.
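As a minimal sketch of the error metric named above, assuming RMSE is measured on log-transformed prices (the helper name `rmse_log` and the predicted values are illustrative assumptions, not taken from the project):

```r
# Hypothetical helper: RMSE between actual and predicted prices on the log scale
rmse_log <- function(actual, predicted) {
  sqrt(mean((log(actual) - log(predicted))^2))
}

# Three sale prices and made-up predictions, for illustration only
actual    <- c(208500, 181500, 223500)
predicted <- c(200000, 190000, 223500)

rmse_log(actual, predicted)
```

Working on the log scale means an error of a given percentage counts about the same for a cheap house as for an expensive one.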

Data Understanding

As previously noted, the data totals approximately 2,900 rows (1,460 training and 1,459 test) with 80 columns of variables, or roughly 234,000 individual data points. As we inspected the data, the first challenge we noticed was that not all columns contained values that could be used directly in a regression. Additionally, there were missing values, specifically around alleys, that had to be cleaned up as well, which is documented further on. The good news with this particular data set is that the data was collected in a uniform fashion, so it did not contain formatting or data entry errors that needed to be discovered and cleaned up.
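A quick way to see where missing values sit is to count them per column. The sketch below uses a small made-up data frame (`toy`) rather than the project files so that it runs on its own; the same call could be applied to the housing data once it is loaded.

```r
# Toy data frame standing in for the housing data (values are made up)
toy <- data.frame(
  Alley       = c(NA, "Grvl", NA, "Pave"),
  LotFrontage = c(65, NA, 68, 60),
  LotArea     = c(8450, 9600, 11250, 9550)
)

# Number of missing values in each column, largest first
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)

# Keep only the columns that actually have missing values
na_counts[na_counts > 0]
```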

To begin the Data Preparation phase, we must first load the data.

library(readr)
housing_train <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Project/Project Presentation/train.csv")
housing_test <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Project/Project Presentation/test.csv")
colnames(housing_train)
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "Alley"         "LotShape"     
##  [9] "LandContour"   "Utilities"     "LotConfig"     "LandSlope"    
## [13] "Neighborhood"  "Condition1"    "Condition2"    "BldgType"     
## [17] "HouseStyle"    "OverallQual"   "OverallCond"   "YearBuilt"    
## [21] "YearRemodAdd"  "RoofStyle"     "RoofMatl"      "Exterior1st"  
## [25] "Exterior2nd"   "MasVnrType"    "MasVnrArea"    "ExterQual"    
## [29] "ExterCond"     "Foundation"    "BsmtQual"      "BsmtCond"     
## [33] "BsmtExposure"  "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2" 
## [37] "BsmtFinSF2"    "BsmtUnfSF"     "TotalBsmtSF"   "Heating"      
## [41] "HeatingQC"     "CentralAir"    "Electrical"    "X1stFlrSF"    
## [45] "X2ndFlrSF"     "LowQualFinSF"  "GrLivArea"     "BsmtFullBath" 
## [49] "BsmtHalfBath"  "FullBath"      "HalfBath"      "BedroomAbvGr" 
## [53] "KitchenAbvGr"  "KitchenQual"   "TotRmsAbvGrd"  "Functional"   
## [57] "Fireplaces"    "FireplaceQu"   "GarageType"    "GarageYrBlt"  
## [61] "GarageFinish"  "GarageCars"    "GarageArea"    "GarageQual"   
## [65] "GarageCond"    "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"  
## [69] "EnclosedPorch" "X3SsnPorch"    "ScreenPorch"   "PoolArea"     
## [73] "PoolQC"        "Fence"         "MiscFeature"   "MiscVal"      
## [77] "MoSold"        "YrSold"        "SaleType"      "SaleCondition"
## [81] "SalePrice"
head(housing_train)
##   Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1  1         60       RL          65    8450   Pave  <NA>      Reg         Lvl
## 2  2         20       RL          80    9600   Pave  <NA>      Reg         Lvl
## 3  3         60       RL          68   11250   Pave  <NA>      IR1         Lvl
## 4  4         70       RL          60    9550   Pave  <NA>      IR1         Lvl
## 5  5         60       RL          84   14260   Pave  <NA>      IR1         Lvl
## 6  6         50       RL          85   14115   Pave  <NA>      IR1         Lvl
##   Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 2    AllPub       FR2       Gtl      Veenker      Feedr       Norm     1Fam
## 3    AllPub    Inside       Gtl      CollgCr       Norm       Norm     1Fam
## 4    AllPub    Corner       Gtl      Crawfor       Norm       Norm     1Fam
## 5    AllPub       FR2       Gtl      NoRidge       Norm       Norm     1Fam
## 6    AllPub    Inside       Gtl      Mitchel       Norm       Norm     1Fam
##   HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1     2Story           7           5      2003         2003     Gable  CompShg
## 2     1Story           6           8      1976         1976     Gable  CompShg
## 3     2Story           7           5      2001         2002     Gable  CompShg
## 4     2Story           7           5      1915         1970     Gable  CompShg
## 5     2Story           8           5      2000         2000     Gable  CompShg
## 6     1.5Fin           5           5      1993         1995     Gable  CompShg
##   Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1     VinylSd     VinylSd    BrkFace        196        Gd        TA      PConc
## 2     MetalSd     MetalSd       None          0        TA        TA     CBlock
## 3     VinylSd     VinylSd    BrkFace        162        Gd        TA      PConc
## 4     Wd Sdng     Wd Shng       None          0        TA        TA     BrkTil
## 5     VinylSd     VinylSd    BrkFace        350        Gd        TA      PConc
## 6     VinylSd     VinylSd       None          0        TA        TA       Wood
##   BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1       Gd       TA           No          GLQ        706          Unf
## 2       Gd       TA           Gd          ALQ        978          Unf
## 3       Gd       TA           Mn          GLQ        486          Unf
## 4       TA       Gd           No          ALQ        216          Unf
## 5       Gd       TA           Av          GLQ        655          Unf
## 6       Gd       TA           No          GLQ        732          Unf
##   BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1          0       150         856    GasA        Ex          Y      SBrkr
## 2          0       284        1262    GasA        Ex          Y      SBrkr
## 3          0       434         920    GasA        Ex          Y      SBrkr
## 4          0       540         756    GasA        Gd          Y      SBrkr
## 5          0       490        1145    GasA        Ex          Y      SBrkr
## 6          0        64         796    GasA        Ex          Y      SBrkr
##   X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1       856       854            0      1710            1            0        2
## 2      1262         0            0      1262            0            1        2
## 3       920       866            0      1786            1            0        2
## 4       961       756            0      1717            1            0        1
## 5      1145      1053            0      2198            1            0        2
## 6       796       566            0      1362            1            0        1
##   HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1        1            3            1          Gd            8        Typ
## 2        0            3            1          TA            6        Typ
## 3        1            3            1          Gd            6        Typ
## 4        0            3            1          Gd            7        Typ
## 5        1            4            1          Gd            9        Typ
## 6        1            1            1          TA            5        Typ
##   Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1          0        <NA>     Attchd        2003          RFn          2
## 2          1          TA     Attchd        1976          RFn          2
## 3          1          TA     Attchd        2001          RFn          2
## 4          1          Gd     Detchd        1998          Unf          3
## 5          1          TA     Attchd        2000          RFn          3
## 6          0        <NA>     Attchd        1993          Unf          2
##   GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1        548         TA         TA          Y          0          61
## 2        460         TA         TA          Y        298           0
## 3        608         TA         TA          Y          0          42
## 4        642         TA         TA          Y          0          35
## 5        836         TA         TA          Y        192          84
## 6        480         TA         TA          Y         40          30
##   EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1             0          0           0        0   <NA>  <NA>        <NA>
## 2             0          0           0        0   <NA>  <NA>        <NA>
## 3             0          0           0        0   <NA>  <NA>        <NA>
## 4           272          0           0        0   <NA>  <NA>        <NA>
## 5             0          0           0        0   <NA>  <NA>        <NA>
## 6             0        320           0        0   <NA> MnPrv        Shed
##   MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1       0      2   2008       WD        Normal    208500
## 2       0      5   2007       WD        Normal    181500
## 3       0      9   2008       WD        Normal    223500
## 4       0      2   2006       WD       Abnorml    140000
## 5       0     12   2008       WD        Normal    250000
## 6     700     10   2009       WD        Normal    143000
cbind(c("train", "test"),
      rbind(dim(housing_train), dim(housing_test)))
##      [,1]    [,2]   [,3]
## [1,] "train" "1460" "81"
## [2,] "test"  "1459" "80"

Data Preparation

The two primary requirements for the Data Preparation phase were:

  1. Adjust all variables using as.numeric or as.factor as appropriate
  2. Change all of the missing (null) values to 0s
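
The two requirements above can be sketched as a single helper. This is a minimal illustration under stated assumptions, not the project's code: `encode_levels` and its arguments are hypothetical names, and the `alley` vector is made-up data.

```r
# Hypothetical helper: map labels to numeric codes through an explicit lookup,
# then set anything missing or unmapped to na_value
encode_levels <- function(x, mapping, na_value = 0) {
  codes <- unname(mapping[as.character(x)])
  codes[is.na(codes)] <- na_value
  codes
}

alley <- c(NA, "Grvl", "Pave", NA)   # toy values
alley_codes <- encode_levels(alley, c(Pave = 1, Grvl = 2))
alley_codes
```

Because the mapping is written out explicitly, the same call gives the same codes on the training and test data, even when one of them is missing a level.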

Below is the list of all of the variables and a brief explanation as to what they mean.

MSSubClass: Identifies the type of dwelling involved in the sale.

    20  1-STORY 1946 & NEWER ALL STYLES
    30  1-STORY 1945 & OLDER
    40  1-STORY W/FINISHED ATTIC ALL AGES
    45  1-1/2 STORY - UNFINISHED ALL AGES
    50  1-1/2 STORY FINISHED ALL AGES
    60  2-STORY 1946 & NEWER
    70  2-STORY 1945 & OLDER
    75  2-1/2 STORY ALL AGES
    80  SPLIT OR MULTI-LEVEL
    85  SPLIT FOYER
    90  DUPLEX - ALL STYLES AND AGES
   120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
   150  1-1/2 STORY PUD - ALL AGES
   160  2-STORY PUD - 1946 & NEWER
   180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
   190  2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.

     A  Agriculture
  1  C  Commercial
  2  FV Floating Village Residential
     I  Industrial
  3  RH Residential High Density
  4  RL Residential Low Density
     RP Residential Low Density Park 
  5  RM Residential Medium Density

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access to property

   2 Grvl   Gravel  
   1 Pave   Paved
    

Alley: Type of alley access to property

  2  Grvl   Gravel
  1  Pave   Paved
  0  NA     No alley access
    

LotShape: General shape of property

   4 Reg    Regular 
   1 IR1    Slightly irregular
   2 IR2    Moderately Irregular
   3 IR3    Irregular
   

LandContour: Flatness of the property

   4 Lvl    Near Flat/Level 
   1 Bnk    Banked - Quick and significant rise from street grade to building
   2 HLS    Hillside - Significant slope from side to side
   3 Low    Depression
    

Utilities: Type of utilities available

   1 AllPub All public Utilities (E,G,W,& S)    
     NoSewr Electricity, Gas, and Water (Septic Tank)
   2 NoSeWa Electricity and Gas Only
     ELO    Electricity only

LotConfig: Lot configuration

   5 Inside Inside lot
   1 Corner Corner lot
   2 CulDSac Cul-de-sac
   3 FR2    Frontage on 2 sides of property
   4 FR3    Frontage on 3 sides of property

LandSlope: Slope of property

   1 Gtl    Gentle slope
   2 Mod    Moderate Slope  
   3 Sev    Severe Slope

Neighborhood: Physical locations within Ames city limits

   1    Blmngtn Bloomington Heights
   2    Blueste Bluestem
   3    BrDale  Briardale
   4    BrkSide Brookside
   5    ClearCr Clear Creek
   6    CollgCr College Creek
   7    Crawfor Crawford
   8    Edwards Edwards
   9    Gilbert Gilbert
   10   IDOTRR  Iowa DOT and Rail Road
   11   MeadowV Meadow Village
   12   Mitchel Mitchell
   13   NAmes   North Ames
   14   NoRidge Northridge
   15   NPkVill Northpark Villa
   16   NridgHt Northridge Heights
   17   NWAmes  Northwest Ames
   18   OldTown Old Town
   19   SWISU   South & West of Iowa State University
   20   Sawyer  Sawyer
   21   SawyerW Sawyer West
   22   Somerst Somerset
   23   StoneBr Stone Brook
   24   Timber  Timberland
   25   Veenker Veenker
        

Condition1: Proximity to various conditions

   1    Artery  Adjacent to arterial street
   2    Feedr   Adjacent to feeder street   
   3    Norm    Normal  
   4    RRNn    Within 200' of North-South Railroad
   5    RRAn    Adjacent to North-South Railroad
   6    PosN    Near positive off-site feature--park, greenbelt, etc.
   7    PosA    Adjacent to positive off-site feature
   8    RRNe    Within 200' of East-West Railroad
   9    RRAe    Adjacent to East-West Railroad

Condition2: Proximity to various conditions (if more than one is present) – Same as above

   1    Artery  Adjacent to arterial street
   2    Feedr   Adjacent to feeder street   
   3    Norm    Normal  
   4    RRNn    Within 200' of North-South Railroad
   5    RRAn    Adjacent to North-South Railroad
   6    PosN    Near positive off-site feature--park, greenbelt, etc.
   7    PosA    Adjacent to positive off-site feature
   8    RRNe    Within 200' of East-West Railroad
   9    RRAe    Adjacent to East-West Railroad

BldgType: Type of dwelling

   1    1Fam    Single-family Detached  
   2    2FmCon  Two-family Conversion; originally built as one-family dwelling
   3    Duplx   Duplex
   4    TwnhsE  Townhouse End Unit
   5    TwnhsI  Townhouse Inside Unit

HouseStyle: Style of dwelling

   1    1Story  One story
   2    1.5Fin  One and one-half story: 2nd level finished
   3    1.5Unf  One and one-half story: 2nd level unfinished
   4    2Story  Two story
   5    2.5Fin  Two and one-half story: 2nd level finished
   6    2.5Unf  Two and one-half story: 2nd level unfinished
   7    SFoyer  Split Foyer
   8    SLvl    Split Level

OverallQual: Rates the overall material and finish of the house

   10   Very Excellent
   9    Excellent
   8    Very Good
   7    Good
   6    Above Average
   5    Average
   4    Below Average
   3    Fair
   2    Poor
   1    Very Poor

OverallCond: Rates the overall condition of the house

   10   Very Excellent
   9    Excellent
   8    Very Good
   7    Good
   6    Above Average   
   5    Average
   4    Below Average   
   3    Fair
   2    Poor
   1    Very Poor
    

YearBuilt: Original construction date

YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

RoofStyle: Type of roof

   1    Flat    Flat
   2    Gable   Gable
   3    Gambrel Gambrel (Barn)
   4    Hip     Hip
   5    Mansard Mansard
   6    Shed    Shed
    

RoofMatl: Roof material

   1    ClyTile Clay or Tile
   2    CompShg Standard (Composite) Shingle
   3    Membran Membrane
   4    Metal   Metal
   5    Roll    Roll
   6    Tar&Grv Gravel & Tar
   7    WdShake Wood Shakes
   8    WdShngl Wood Shingles
    

Exterior1st: Exterior covering on house

   1    AsbShng Asbestos Shingles
   2    AsphShn Asphalt Shingles
   3    BrkComm Brick Common
   4    BrkFace Brick Face
   5    CBlock  Cinder Block
   6    CemntBd Cement Board
   7    HdBoard Hard Board
   8    ImStucc Imitation Stucco
   9    MetalSd Metal Siding
   10   Other   Other
   11   Plywood Plywood
   12   PreCast PreCast 
   13   Stone   Stone
   14   Stucco  Stucco
   15   VinylSd Vinyl Siding
   16   Wd Sdng Wood Siding
   17   WdShing Wood Shingles

Exterior2nd: Exterior covering on house (if more than one material)

   1    AsbShng Asbestos Shingles
   2    AsphShn Asphalt Shingles
   3    BrkComm Brick Common
   4    BrkFace Brick Face
   5    CBlock  Cinder Block
   6    CemntBd Cement Board
   7    HdBoard Hard Board
   8    ImStucc Imitation Stucco
   9    MetalSd Metal Siding
   10   Other   Other
   11   Plywood Plywood
   12   PreCast PreCast 
   13   Stone   Stone
   14   Stucco  Stucco
   15   VinylSd Vinyl Siding
   16   Wd Sdng Wood Siding
   17   WdShing Wood Shingles

MasVnrType: Masonry veneer type

   1    BrkCmn  Brick Common
   2    BrkFace Brick Face
   3    CBlock  Cinder Block
   0    None    None
   4    Stone   Stone

MasVnrArea: Masonry veneer area in square feet

ExterQual: Evaluates the quality of the material on the exterior

  1 Ex  Excellent
  2 Gd  Good
  3 TA  Average/Typical
  4 Fa  Fair
  5 Po  Poor
    

ExterCond: Evaluates the present condition of the material on the exterior

  1 Ex  Excellent
  2 Gd  Good
  3 TA  Average/Typical
  4 Fa  Fair
  5 Po  Poor
    

Foundation: Type of foundation

   1 BrkTil Brick & Tile
   2 CBlock Cinder Block
   3 PConc  Poured Concrete
   4 Slab   Slab
   5 Stone  Stone
   6 Wood   Wood
    

BsmtQual: Evaluates the height of the basement

   1 Ex Excellent (100+ inches) 
   2 Gd Good (90-99 inches)
   3 TA Typical (80-89 inches)
   4 Fa Fair (70-79 inches)
   5 Po Poor (<70 inches)
   0 NA No Basement
    

BsmtCond: Evaluates the general condition of the basement

   1 Ex Excellent
   2 Gd Good
   3 TA Typical - slight dampness allowed
   4 Fa Fair - dampness or some cracking or settling
   5 Po Poor - Severe cracking, settling, or wetness
   0 NA No Basement

BsmtExposure: Refers to walkout or garden level walls

   1 Gd Good Exposure
   2 Av Average Exposure (split levels or foyers typically score average or above)  
   3 Mn Minimum Exposure
   4 No No Exposure
   0 NA No Basement

BsmtFinType1: Rating of basement finished area

  3 GLQ Good Living Quarters
  1 ALQ Average Living Quarters
  2 BLQ Below Average Living Quarters   
  5 Rec Average Rec Room
  4 LwQ Low Quality
  6 Unf Unfinished
  0 NA  No Basement
    

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Rating of basement finished area (if multiple types)

   3 GLQ    Good Living Quarters
   1 ALQ    Average Living Quarters
   2 BLQ    Below Average Living Quarters   
   5 Rec    Average Rec Room
   4 LwQ    Low Quality
   6 Unf    Unfinished
   0 NA     No Basement

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

   1 Floor  Floor Furnace
   2 GasA   Gas forced warm air furnace
   3 GasW   Gas hot water or steam heat
   4 Grav   Gravity furnace 
   5 OthW   Hot water or steam heat other than gas
   6 Wall   Wall furnace
    

HeatingQC: Heating quality and condition

   1 Ex Excellent
   2 Gd Good
   3 TA Average/Typical
   4 Fa Fair
   5 Po Poor
    

CentralAir: Central air conditioning

   0 N  No
   1 Y  Yes
    

Electrical: Electrical system

   1 SBrkr  Standard Circuit Breakers & Romex
   2 FuseA  Fuse Box over 60 AMP and all Romex wiring (Average) 
   3 FuseF  60 AMP Fuse Box and mostly Romex wiring (Fair)
   4 FuseP  60 AMP Fuse Box and mostly knob & tube wiring (poor)
   5 Mix    Mixed
    

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

Bedroom: Bedrooms above grade (does NOT include basement bedrooms)

Kitchen: Kitchens above grade

KitchenQual: Kitchen quality

   1 Ex Excellent
   2 Gd Good
   3 TA Typical/Average
   4 Fa Fair
   5 Po Poor
    

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality (Assume typical unless deductions are warranted)

   8 Typ    Typical Functionality
   7 Min1   Minor Deductions 1
   6 Min2   Minor Deductions 2
   5 Mod    Moderate Deductions
   4 Maj1   Major Deductions 1
   3 Maj2   Major Deductions 2
   2 Sev    Severely Damaged
   1 Sal    Salvage only
    

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

   5 Ex Excellent - Exceptional Masonry Fireplace
   4 Gd Good - Masonry Fireplace in main level
   3 TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
   2 Fa Fair - Prefabricated Fireplace in basement
   1 Po Poor - Ben Franklin Stove
   0 NA No Fireplace
    

GarageType: Garage location

   6 2Types More than one type of garage
   5 Attchd Attached to home
   4 Basment    Basement Garage
   3 BuiltIn    Built-In (Garage part of house - typically has room above garage)
   2 CarPort    Car Port
   1 Detchd Detached from home
   0 NA     No Garage
    

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

   3 Fin    Finished
   2 RFn    Rough Finished  
   1 Unf    Unfinished
   0 NA No Garage
    

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

   5 Ex Excellent
   4 Gd Good
   3 TA Typical/Average
   2 Fa Fair
   1 Po Poor
   0 NA No Garage
    

GarageCond: Garage condition

   5 Ex Excellent
   4 Gd Good
   3 TA Typical/Average
   2 Fa Fair
   1 Po Poor
   0 NA No Garage
    
    

PavedDrive: Paved driveway

   3 Y  Paved 
   2 P  Partial Pavement
   1 N  Dirt/Gravel
    

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

   5 Ex Excellent
   4 Gd Good
   3 TA Typical/Average
   2 Fa Fair
   1 Po Poor
   0 NA No Pool
    
    

Fence: Fence quality

  4  GdPrv  Good Privacy
  3  MnPrv  Minimum Privacy
  2 GdWo    Good Wood
  1  MnWw   Minimum Wood/Wire
  0  NA No Fence

MiscFeature: Miscellaneous feature not covered in other categories

   Elev Elevator
   4 Gar2   2nd Garage (if not described in garage section)
   3 Othr   Other
   2 Shed   Shed (over 100 SF)
   1 TenC   Tennis Court
   0 NA None
    

MiscVal: $Value of miscellaneous feature

MoSold: Month Sold (MM)

YrSold: Year Sold (YYYY)

SaleType: Type of sale

   9 WD     Warranty Deed - Conventional
   8 CWD    Warranty Deed - Cash
   VWD  Warranty Deed - VA Loan
   7 New    Home just constructed and sold
   6 COD    Court Officer Deed/Estate
   5 Con    Contract 15% Down payment regular terms
   4 ConLw  Contract Low Down payment and low interest
   3 ConLI  Contract Low Interest
   2 ConLD  Contract Low Down
   1 Oth    Other
    

SaleCondition: Condition of sale

   6 Normal Normal Sale
   5 Abnorml    Abnormal Sale -  trade, foreclosure, short sale
   4 AdjLand    Adjoining Land Purchase
   3 Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit  
   2 Family Sale between family members
   1 Partial    Home was not completed when last assessed (associated with New Homes)

Data Preparation Coding - as.factor & as.numeric

housing_train$MSZoning = as.factor(housing_train$MSZoning)
levels(housing_train$MSZoning)
## [1] "C (all)" "FV"      "RH"      "RL"      "RM"
# MSZoning column of train dataset has following levels: "C (all)", "FV", "RH", "RL", "RM"
housing_test$MSZoning = as.factor(housing_test$MSZoning)
levels(housing_test$MSZoning)
## [1] "C (all)" "FV"      "RH"      "RL"      "RM"
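# Change of factors to numeric in train dataset
# as.numeric() ignores name = value arguments, so the codes come from an explicit lookup
MSZoning=unname(c("C (all)"=1, "FV"=2, "RH"=3, "RL"=4, "RM"=5)[as.character(housing_train$MSZoning)])
housing_train$MSZoning <-MSZoning
# Change of factors to numeric in test dataset (same lookup, so the codes match train)
MSZoning=unname(c("C (all)"=1, "FV"=2, "RH"=3, "RL"=4, "RM"=5)[as.character(housing_test$MSZoning)])
housing_test$MSZoning <-MSZoning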
Street = as.factor(housing_train$Street)
# as.numeric() ignores name = value arguments and would return the alphabetical
# level position (Grvl = 1, Pave = 2), so look the codes up explicitly
Street = unname(c("Pave"= 1,"Grvl"= 2)[as.character(Street)])
housing_train$Street <-Street
# Pave is replaced with 1 and the Grvl type of road with 2
Street = as.factor(housing_test$Street)
Street = unname(c("Pave"= 1,"Grvl"= 2)[as.character(Street)])
housing_test$Street <-Street
# Pave is replaced with 1 and the Grvl type of road with 2
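One detail worth checking when converting factors: `as.numeric()` on a factor returns the position of each value's level, and levels are sorted alphabetically by default, so `name = value` arguments passed alongside do not influence the codes. A small self-contained check with toy values (the variable names are illustrative):

```r
street <- factor(c("Pave", "Grvl", "Pave"))   # toy values
levels(street)                                 # alphabetical: "Grvl" then "Pave"

codes_plain <- as.numeric(street)                          # level positions
codes_named <- as.numeric(street, "Pave" = 1, "Grvl" = 2)  # extra arguments are ignored

codes_plain
identical(codes_plain, codes_named)
```

Here "Pave" receives code 2 in both calls, which is why an explicit lookup is the safer way to get a specific label-to-code mapping.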
# Transforming Alley column to numeric in train dataset
Alley<-as.factor(housing_train$Alley)
levels(Alley)
## [1] "Grvl" "Pave"
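# Explicit lookup (as.numeric() would ignore the name = value arguments); NA becomes 0
Alley = unname(c("Pave"= 1,"Grvl"= 2)[as.character(Alley)])
Alley[is.na(Alley)] = 0
housing_train$Alley <- Alley
# Transforming Alley column to numeric in test dataset
Alley<-as.factor(housing_test$Alley)
levels(Alley)
## [1] "Grvl" "Pave"
Alley = unname(c("Pave"= 1,"Grvl"= 2)[as.character(Alley)])
Alley[is.na(Alley)] = 0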
housing_test$Alley <- Alley
# Transforming LotShape column to numeric in train dataset
LotShape <-as.factor(housing_train$LotShape)
levels(LotShape) # 4 levels: "IR1", "IR2", "IR3", "Reg"
## [1] "IR1" "IR2" "IR3" "Reg"
# Explicit label-to-code lookup
LotShape=unname(c("IR1"=1, "IR2"=2, "IR3"=3, "Reg"=4)[as.character(LotShape)])
housing_train$LotShape <- LotShape
# Transforming LotShape column to numeric in test dataset
LotShape <-as.factor(housing_test$LotShape)
levels(LotShape) # 4 levels: "IR1", "IR2", "IR3", "Reg"
## [1] "IR1" "IR2" "IR3" "Reg"
LotShape=unname(c("IR1"=1, "IR2"=2, "IR3"=3, "Reg"=4)[as.character(LotShape)])
housing_test$LotShape <- LotShape
# Transforming LandContour column to numeric in train dataset
LandContour <-as.factor(housing_train$LandContour)
levels(LandContour) # 4 levels: "Bnk", "HLS", "Low", "Lvl"
## [1] "Bnk" "HLS" "Low" "Lvl"
# Explicit label-to-code lookup
LandContour=unname(c("Bnk"=1, "HLS"=2, "Low"=3, "Lvl"=4)[as.character(LandContour)])
housing_train$LandContour <- LandContour
# Transforming LandContour column to numeric in test dataset
LandContour <-as.factor(housing_test$LandContour)
levels(LandContour) # 4 levels: "Bnk", "HLS", "Low", "Lvl"
## [1] "Bnk" "HLS" "Low" "Lvl"
LandContour=unname(c("Bnk"=1, "HLS"=2, "Low"=3, "Lvl"=4)[as.character(LandContour)])
housing_test$LandContour <- LandContour
# Transforming Utilities column to numeric in train dataset
Utilities <-as.factor(housing_train$Utilities)
levels(Utilities) # 2 levels: "AllPub", "NoSeWa"
## [1] "AllPub" "NoSeWa"
# Explicit label-to-code lookup; missing values become 0
Utilities=unname(c("AllPub"=1, "NoSeWa"=2)[as.character(Utilities)])
Utilities[is.na(Utilities)] <- 0
housing_train$Utilities <- Utilities

# Transforming Utilities column to numeric in test dataset
Utilities <-as.factor(housing_test$Utilities)
levels(Utilities) # 1 level in the test data: "AllPub"
## [1] "AllPub"
Utilities=unname(c("AllPub"=1, "NoSeWa"=2)[as.character(Utilities)])
Utilities[is.na(Utilities)] <- 0
housing_test$Utilities <- Utilities
# Transforming LotConfig column to numeric in train dataset
LotConfig <-as.factor(housing_train$LotConfig)
levels(LotConfig)
## [1] "Corner"  "CulDSac" "FR2"     "FR3"     "Inside"
# Explicit label-to-code lookup; missing values become 0
LotConfig=unname(c("Corner"=1, "CulDSac"=2, "FR2"=3, "FR3"=4, "Inside"=5)[as.character(LotConfig)])
LotConfig[is.na(LotConfig)] <- 0
housing_train$LotConfig <- LotConfig

# Transforming LotConfig column to numeric in test dataset
LotConfig <-as.factor(housing_test$LotConfig)
levels(LotConfig)
## [1] "Corner"  "CulDSac" "FR2"     "FR3"     "Inside"
LotConfig=unname(c("Corner"=1, "CulDSac"=2, "FR2"=3, "FR3"=4, "Inside"=5)[as.character(LotConfig)])
LotConfig[is.na(LotConfig)] <- 0
housing_test$LotConfig <- LotConfig

# Transforming LandSlope column to numeric in train dataset
LandSlope <-as.factor(housing_train$LandSlope)
levels(LandSlope)
## [1] "Gtl" "Mod" "Sev"
# Explicit label-to-code lookup
LandSlope=unname(c("Gtl"=1, "Mod"=2, "Sev"=3)[as.character(LandSlope)])
housing_train$LandSlope <- LandSlope

# Transforming LandSlope column to numeric in test dataset
LandSlope <-as.factor(housing_test$LandSlope)
levels(LandSlope)
## [1] "Gtl" "Mod" "Sev"
LandSlope=unname(c("Gtl"=1, "Mod"=2, "Sev"=3)[as.character(LandSlope)])
housing_test$LandSlope <- LandSlope

# Transforming Neighborhood column to numeric in train dataset
Neighborhood <-as.factor(housing_train$Neighborhood)
levels(Neighborhood)
##  [1] "Blmngtn" "Blueste" "BrDale"  "BrkSide" "ClearCr" "CollgCr" "Crawfor"
##  [8] "Edwards" "Gilbert" "IDOTRR"  "MeadowV" "Mitchel" "NAmes"   "NoRidge"
## [15] "NPkVill" "NridgHt" "NWAmes"  "OldTown" "Sawyer"  "SawyerW" "Somerst"
## [22] "StoneBr" "SWISU"   "Timber"  "Veenker"
# Explicit label-to-code lookup, so SWISU = 19 as listed rather than its position in the level order
Neighborhood=unname(c("Blmngtn"=1, "Blueste"=2, "BrDale"=3, "BrkSide"=4, "ClearCr"=5, "CollgCr"=6, "Crawfor"=7, "Edwards"=8, "Gilbert"=9, "IDOTRR"=10, "MeadowV"=11, "Mitchel"=12, "NAmes"=13, "NoRidge"=14, "NPkVill"=15, "NridgHt"=16, "NWAmes"=17, "OldTown"=18, "SWISU"=19, "Sawyer"=20, "SawyerW"=21, "Somerst"=22, "StoneBr"=23, "Timber"=24, "Veenker"=25)[as.character(Neighborhood)])
housing_train$Neighborhood <- Neighborhood

# Transforming Neighborhood column to numeric in test dataset
Neighborhood <-as.factor(housing_test$Neighborhood)
levels(Neighborhood)
##  [1] "Blmngtn" "Blueste" "BrDale"  "BrkSide" "ClearCr" "CollgCr" "Crawfor"
##  [8] "Edwards" "Gilbert" "IDOTRR"  "MeadowV" "Mitchel" "NAmes"   "NoRidge"
## [15] "NPkVill" "NridgHt" "NWAmes"  "OldTown" "Sawyer"  "SawyerW" "Somerst"
## [22] "StoneBr" "SWISU"   "Timber"  "Veenker"
Neighborhood=unname(c("Blmngtn"=1, "Blueste"=2, "BrDale"=3, "BrkSide"=4, "ClearCr"=5, "CollgCr"=6, "Crawfor"=7, "Edwards"=8, "Gilbert"=9, "IDOTRR"=10, "MeadowV"=11, "Mitchel"=12, "NAmes"=13, "NoRidge"=14, "NPkVill"=15, "NridgHt"=16, "NWAmes"=17, "OldTown"=18, "SWISU"=19, "Sawyer"=20, "SawyerW"=21, "Somerst"=22, "StoneBr"=23, "Timber"=24, "Veenker"=25)[as.character(Neighborhood)])
housing_test$Neighborhood <- Neighborhood
# Transforming Condition1 column to numeric in train dataset
Condition1 <-as.factor(housing_train$Condition1)
levels(Condition1)
## [1] "Artery" "Feedr"  "Norm"   "PosA"   "PosN"   "RRAe"   "RRAn"   "RRNe"  
## [9] "RRNn"
# Explicit label-to-code lookup, since the level order (PosA, PosN, RRAe, ...) differs from the listed codes
Condition1=unname(c("Artery"=1, "Feedr"=2, "Norm"=3, "RRNn"=4, "RRAn"=5, "PosN"=6, "PosA"=7, "RRNe"=8, "RRAe"=9)[as.character(Condition1)])
housing_train$Condition1 <- Condition1
# Transforming Condition1 column to numeric in test dataset
Condition1 <-as.factor(housing_test$Condition1)
levels(Condition1) 
## [1] "Artery" "Feedr"  "Norm"   "PosA"   "PosN"   "RRAe"   "RRAn"   "RRNe"  
## [9] "RRNn"
Condition1=unname(c("Artery"=1, "Feedr"=2, "Norm"=3, "RRNn"=4, "RRAn"=5, "PosN"=6, "PosA"=7, "RRNe"=8, "RRAe"=9)[as.character(Condition1)])
housing_test$Condition1 <- Condition1
# Transforming Condition2 column to numeric in train dataset
Condition2 <-as.factor(housing_train$Condition2)
levels(Condition2)
## [1] "Artery" "Feedr"  "Norm"   "PosA"   "PosN"   "RRAe"   "RRAn"   "RRNn"
# Explicit label-to-code lookup; the same mapping is used for train and test even though their levels differ
Condition2=unname(c("Artery"=1, "Feedr"=2, "Norm"=3, "RRNn"=4, "RRAn"=5, "PosN"=6, "PosA"=7, "RRNe"=8, "RRAe"=9)[as.character(Condition2)])
housing_train$Condition2 <- Condition2
# Transforming Condition2 column to numeric in test dataset
Condition2 <-as.factor(housing_test$Condition2)
levels(Condition2) #values
## [1] "Artery" "Feedr"  "Norm"   "PosA"   "PosN"
Condition2=unname(c("Artery"=1, "Feedr"=2, "Norm"=3, "RRNn"=4, "RRAn"=5, "PosN"=6, "PosA"=7, "RRNe"=8, "RRAe"=9)[as.character(Condition2)])
housing_test$Condition2 <- Condition2
# Transforming BldgType column to numeric in train dataset
BldgType <-as.factor(housing_train$BldgType)
levels(BldgType)
## [1] "1Fam"   "2fmCon" "Duplex" "Twnhs"  "TwnhsE"
# as.numeric() ignores name=value pairs; codes follow the level order above: 1Fam=1, 2fmCon=2, Duplex=3, Twnhs=4, TwnhsE=5
BldgType <- as.numeric(BldgType)
housing_train$BldgType <- BldgType
# Transforming BldgType column to numeric in test dataset
BldgType <- as.factor(housing_test$BldgType)
levels(BldgType)
## [1] "1Fam"   "2fmCon" "Duplex" "Twnhs"  "TwnhsE"
BldgType <- as.numeric(BldgType) # same five levels as train, so the codes match
housing_test$BldgType <- BldgType
# Transforming HouseStyle column to numeric in train dataset
HouseStyle <-as.factor(housing_train$HouseStyle)
levels(HouseStyle)
## [1] "1.5Fin" "1.5Unf" "1Story" "2.5Fin" "2.5Unf" "2Story" "SFoyer" "SLvl"
# Explicit levels give the intended codes: 1Story=1, 1.5Fin=2, 1.5Unf=3, 2Story=4, 2.5Fin=5, 2.5Unf=6, SFoyer=7, SLvl=8
HouseStyle <- as.numeric(factor(HouseStyle, levels = c("1Story", "1.5Fin", "1.5Unf", "2Story", "2.5Fin", "2.5Unf", "SFoyer", "SLvl")))
housing_train$HouseStyle <- HouseStyle
# Transforming HouseStyle column to numeric in test dataset
HouseStyle <- as.factor(housing_test$HouseStyle)
levels(HouseStyle)
## [1] "1.5Fin" "1.5Unf" "1Story" "2.5Unf" "2Story" "SFoyer" "SLvl"
# "2.5Fin" is absent from test; the same explicit levels keep the codes aligned with train
HouseStyle <- as.numeric(factor(HouseStyle, levels = c("1Story", "1.5Fin", "1.5Unf", "2Story", "2.5Fin", "2.5Unf", "SFoyer", "SLvl")))
housing_test$HouseStyle <- HouseStyle
# Transforming RoofStyle column to numeric in train dataset
RoofStyle <- as.factor(housing_train$RoofStyle)
levels(RoofStyle)
## [1] "Flat"    "Gable"   "Gambrel" "Hip"     "Mansard" "Shed"
RoofStyle=as.numeric(RoofStyle,"Flat"=1, "Gable"=2, "Gambrel"=3, "Hip"=4, "Mansard"=5, "Shed"=6)
housing_train$RoofStyle <- RoofStyle
# Transforming RoofStyle column to numeric in test dataset
RoofStyle <-as.factor(housing_test$RoofStyle)
levels(RoofStyle) 
## [1] "Flat"    "Gable"   "Gambrel" "Hip"     "Mansard" "Shed"
RoofStyle=as.numeric(RoofStyle,"Flat"=1, "Gable"=2, "Gambrel"=3, "Hip"=4, "Mansard"=5, "Shed"=6)
housing_test$RoofStyle <- RoofStyle
# Transforming RoofMatl column to numeric in train dataset
RoofMatl <-as.factor(housing_train$RoofMatl)
levels(RoofMatl)
## [1] "ClyTile" "CompShg" "Membran" "Metal"   "Roll"    "Tar&Grv" "WdShake"
## [8] "WdShngl"
RoofMatl=as.numeric(RoofMatl,"ClyTile"=1, "CompShg"=2, "Membran"=3, "Metal"=4, "Roll"=5, "Tar&Grv"=6, "WdShake"=7, "WdShngl"=8)
housing_train$RoofMatl <- RoofMatl
# Transforming RoofMatl column to numeric in test dataset
RoofMatl <-as.factor(housing_test$RoofMatl)
levels(RoofMatl)
## [1] "CompShg" "Tar&Grv" "WdShake" "WdShngl"
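# Test contains only four of the eight train levels; passing the train level order keeps e.g. "CompShg" at 2 in both sets
RoofMatl <- as.numeric(factor(RoofMatl, levels = c("ClyTile", "CompShg", "Membran", "Metal", "Roll", "Tar&Grv", "WdShake", "WdShngl")))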
housing_test$RoofMatl <- RoofMatl
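The train and test sets do not always contain the same categories (RoofMatl has eight levels in train and four in test), so converting each set on its own with the default level order can give the same label two different codes. One general way to guard against this, sketched here with toy vectors and a hypothetical helper `encode_shared` that is not part of the original analysis, is to build the level set from both sets before converting.

```r
# Hypothetical helper: encode two character vectors against one shared level set
encode_shared <- function(train_col, test_col) {
  shared <- sort(unique(c(train_col, test_col)))  # sort() drops NA by default
  list(
    train = as.numeric(factor(train_col, levels = shared)),
    test = as.numeric(factor(test_col, levels = shared)),
    levels = shared
  )
}

# Toy data: test is missing the "ClyTile" category
train_roof <- c("CompShg", "ClyTile", "Tar&Grv", "CompShg")
test_roof <- c("Tar&Grv", "CompShg")

enc <- encode_shared(train_roof, test_roof)
naive_test <- as.numeric(as.factor(test_roof))  # converted on its own

print(enc$train)   # 2 1 3 2
print(enc$test)    # 3 2  (same codes as train)
print(naive_test)  # 2 1  (codes shifted because "ClyTile" is missing)
```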
# Transforming Exterior1st column to numeric in train dataset
Exterior1st <-as.factor(housing_train$Exterior1st)
levels(Exterior1st)
##  [1] "AsbShng" "AsphShn" "BrkComm" "BrkFace" "CBlock"  "CemntBd" "HdBoard"
##  [8] "ImStucc" "MetalSd" "Plywood" "Stone"   "Stucco"  "VinylSd" "Wd Sdng"
## [15] "WdShing"
Exterior1st=as.numeric(Exterior1st,"AsbShng"=1, "AsphShn"=2, "BrkComm"=3, "BrkFace"=4, "CBlock"=5, "CemntBd"=6, "HdBoard"=7, "ImStucc"=8, "MetalSd"=9, "Plywood"=10, "Stone"=11, "Stucco"=12,"VinylSd"=13, "Wd Sdng"=14, "WdShing"=15)
housing_train$Exterior1st <- Exterior1st
# Transforming Exterior1st column to numeric in test dataset
Exterior1st <-as.factor(housing_test$Exterior1st)
levels(Exterior1st)
##  [1] "AsbShng" "AsphShn" "BrkComm" "BrkFace" "CBlock"  "CemntBd" "HdBoard"
##  [8] "MetalSd" "Plywood" "Stucco"  "VinylSd" "Wd Sdng" "WdShing"
# "ImStucc" and "Stone" are absent from test; passing the 15 train levels keeps the codes aligned with train
Exterior1st <- as.numeric(factor(Exterior1st, levels = c("AsbShng", "AsphShn", "BrkComm", "BrkFace", "CBlock", "CemntBd", "HdBoard", "ImStucc", "MetalSd", "Plywood", "Stone", "Stucco", "VinylSd", "Wd Sdng", "WdShing")))
housing_test$Exterior1st <- Exterior1st
# Transforming Exterior2nd column to numeric in train dataset
Exterior2nd <-as.factor(housing_train$Exterior2nd)
levels(Exterior2nd)
##  [1] "AsbShng" "AsphShn" "Brk Cmn" "BrkFace" "CBlock"  "CmentBd" "HdBoard"
##  [8] "ImStucc" "MetalSd" "Other"   "Plywood" "Stone"   "Stucco"  "VinylSd"
## [15] "Wd Sdng" "Wd Shng"
# as.numeric() ignores name=value pairs; codes follow the 16 levels printed above ("AsbShng"=1 ... "Wd Shng"=16)
Exterior2nd <- as.numeric(Exterior2nd)
housing_train$Exterior2nd <- Exterior2nd
# Transforming Exterior2nd column to numeric in test dataset
Exterior2nd <- as.factor(housing_test$Exterior2nd)
levels(Exterior2nd)
##  [1] "AsbShng" "AsphShn" "Brk Cmn" "BrkFace" "CBlock"  "CmentBd" "HdBoard"
##  [8] "ImStucc" "MetalSd" "Plywood" "Stone"   "Stucco"  "VinylSd" "Wd Sdng"
## [15] "Wd Shng"
# "Other" is absent from test; passing the train level order keeps the codes aligned
Exterior2nd <- as.numeric(factor(Exterior2nd, levels = c("AsbShng", "AsphShn", "Brk Cmn", "BrkFace", "CBlock", "CmentBd", "HdBoard", "ImStucc", "MetalSd", "Other", "Plywood", "Stone", "Stucco", "VinylSd", "Wd Sdng", "Wd Shng")))
housing_test$Exterior2nd <- Exterior2nd
# Transforming MasVnrType column to numeric in train dataset
MasVnrType <-as.factor(housing_train$MasVnrType)
levels(MasVnrType)
## [1] "BrkCmn"  "BrkFace" "None"    "Stone"
# as.numeric() ignores name=value pairs; codes follow the level order above: BrkCmn=1, BrkFace=2, None=3, Stone=4 (missing values are set to 0 further on)
MasVnrType <- as.numeric(MasVnrType)
housing_train$MasVnrType <- MasVnrType
# Transforming MasVnrType column to numeric in test dataset
MasVnrType <- as.factor(housing_test$MasVnrType)
levels(MasVnrType)
## [1] "BrkCmn"  "BrkFace" "None"    "Stone"
MasVnrType <- as.numeric(MasVnrType) # same four levels as train, so the codes match
housing_test$MasVnrType <- MasVnrType
# Transforming ExterQual column to numeric in train dataset
ExterQual <-as.factor(housing_train$ExterQual)
levels(ExterQual)
## [1] "Ex" "Fa" "Gd" "TA"
# Explicit levels give the intended ordinal codes: Ex=1, Gd=2, TA=3, Fa=4, Po=5
ExterQual <- as.numeric(factor(ExterQual, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_train$ExterQual <- ExterQual
# Transforming ExterQual column to numeric in test dataset
ExterQual <- as.factor(housing_test$ExterQual)
levels(ExterQual)
## [1] "Ex" "Fa" "Gd" "TA"
ExterQual <- as.numeric(factor(ExterQual, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_test$ExterQual <- ExterQual
# Transforming ExterCond column to numeric in train dataset
ExterCond <-as.factor(housing_train$ExterCond)
levels(ExterCond)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
# Explicit levels give the intended ordinal codes: Ex=1, Gd=2, TA=3, Fa=4, Po=5
ExterCond <- as.numeric(factor(ExterCond, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_train$ExterCond <- ExterCond
# Transforming ExterCond column to numeric in test dataset
ExterCond <- as.factor(housing_test$ExterCond)
levels(ExterCond)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
ExterCond <- as.numeric(factor(ExterCond, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_test$ExterCond <- ExterCond
# Transforming Foundation column to numeric in train dataset
Foundation <-as.factor(housing_train$Foundation)
levels(Foundation)
## [1] "BrkTil" "CBlock" "PConc"  "Slab"   "Stone"  "Wood"
Foundation=as.numeric(Foundation,"BrkTil"=1, "CBlock"=2, "PConc"=3, "Slab"=4, "Stone"=5, "Wood" = 6 ) 
housing_train$Foundation <- Foundation
# Transforming Foundation column to numeric in test dataset
Foundation <-as.factor(housing_test$Foundation)
levels(Foundation)
## [1] "BrkTil" "CBlock" "PConc"  "Slab"   "Stone"  "Wood"
Foundation=as.numeric(Foundation,"BrkTil"=1, "CBlock"=2, "PConc"=3, "Slab"=4, "Stone"=5, "Wood" = 6 ) 
housing_test$Foundation<- Foundation
# Transforming BsmtQual column to numeric in train dataset
BsmtQual <-as.factor(housing_train$BsmtQual)
levels(BsmtQual)
## [1] "Ex" "Fa" "Gd" "TA"
# Explicit levels give the ordinal codes Ex=1, Gd=2, TA=3, Fa=4, Po=5; missing values (no basement) stay NA here and are set to 0 further on
BsmtQual <- as.numeric(factor(BsmtQual, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_train$BsmtQual <- BsmtQual
# Transforming BsmtQual column to numeric in test dataset
BsmtQual <- as.factor(housing_test$BsmtQual)
levels(BsmtQual)
## [1] "Ex" "Fa" "Gd" "TA"
BsmtQual <- as.numeric(factor(BsmtQual, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_test$BsmtQual<- BsmtQual
# Transforming BsmtCond column to numeric in train dataset
BsmtCond <-as.factor(housing_train$BsmtCond)
levels(BsmtCond)
## [1] "Fa" "Gd" "Po" "TA"
# Explicit levels give the ordinal codes Ex=1, Gd=2, TA=3, Fa=4, Po=5; missing values stay NA here and are set to 0 further on
BsmtCond <- as.numeric(factor(BsmtCond, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_train$BsmtCond <- BsmtCond
# Transforming BsmtCond column to numeric in test dataset
BsmtCond <- as.factor(housing_test$BsmtCond)
levels(BsmtCond)
## [1] "Fa" "Gd" "Po" "TA"
BsmtCond <- as.numeric(factor(BsmtCond, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_test$BsmtCond<- BsmtCond
# Transforming BsmtExposure column to numeric in train dataset
BsmtExposure <-as.factor(housing_train$BsmtExposure)
levels(BsmtExposure)
## [1] "Av" "Gd" "Mn" "No"
BsmtExposure=as.numeric(BsmtExposure,"Av"=1, "Gd"=2, "Mn"=3, "No"=4, "NA"=0) 
housing_train$BsmtExposure <- BsmtExposure
# Transforming BsmtExposure column to numeric in test dataset
BsmtExposure <-as.factor(housing_test$BsmtExposure)
levels(BsmtExposure)
## [1] "Av" "Gd" "Mn" "No"
BsmtExposure=as.numeric(BsmtExposure,"Av"=1, "Gd"=2, "Mn"=3, "No"=4, "NA"=0) 
housing_test$BsmtExposure<- BsmtExposure
# Transforming BsmtFinType1 column to numeric in train dataset
BsmtFinType1 <-as.factor(housing_train$BsmtFinType1)
levels(BsmtFinType1)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType1=as.numeric(BsmtFinType1,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6, "NA"=0) 
housing_train$BsmtFinType1 <- BsmtFinType1
# Transforming BsmtFinType1 column to numeric in test dataset
BsmtFinType1 <-as.factor(housing_test$BsmtFinType1)
levels(BsmtFinType1)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType1=as.numeric(BsmtFinType1,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6) 
housing_test$BsmtFinType1<- BsmtFinType1
housing_train$BsmtFinType1[is.na(housing_train$BsmtFinType1)] <- 0
sum(is.na(housing_train$BsmtFinType1))
## [1] 0
housing_test$BsmtFinType1[is.na(housing_test$BsmtFinType1)] <- 0
sum(is.na(housing_test$BsmtFinType1))
## [1] 0
# Transforming BsmtFinType2 column to numeric in train dataset
BsmtFinType2 <-as.factor(housing_train$BsmtFinType2)
levels(BsmtFinType2)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType2=as.numeric(BsmtFinType2,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6, "NA"=0) 
housing_train$BsmtFinType2 <- BsmtFinType2
# Transforming BsmtFinType2 column to numeric in test dataset
BsmtFinType2 <-as.factor(housing_test$BsmtFinType2)
levels(BsmtFinType2)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType2=as.numeric(BsmtFinType2,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6, "NA"=0) 
housing_test$BsmtFinType2<- BsmtFinType2
housing_train$BsmtFinType2[is.na(housing_train$BsmtFinType2)] <- 0
sum(is.na(housing_train$BsmtFinType2))
## [1] 0
housing_test$BsmtFinType2[is.na(housing_test$BsmtFinType2)] <- 0
sum(is.na(housing_test$BsmtFinType2))
## [1] 0
# Transforming Heating column to numeric in train dataset
Heating <-as.factor(housing_train$Heating)
levels(Heating)
## [1] "Floor" "GasA"  "GasW"  "Grav"  "OthW"  "Wall"
Heating=as.numeric(Heating,"Floor"=1, "GasA"=2, "GasW"=3, "Grav"=4, "OthW"=5, "Wall"=6, "NA"=0) 
housing_train$Heating <- Heating
# Transforming Heating column to numeric in test dataset
Heating <-as.factor(housing_test$Heating)
levels(Heating)
## [1] "GasA" "GasW" "Grav" "Wall"
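# "Floor" and "OthW" are absent from test; passing the train level order keeps e.g. "GasA" at 2 in both sets
Heating <- as.numeric(factor(Heating, levels = c("Floor", "GasA", "GasW", "Grav", "OthW", "Wall")))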
housing_test$Heating<- Heating
# Transforming HeatingQC column to numeric in train dataset
HeatingQC <-as.factor(housing_train$HeatingQC)
levels(HeatingQC)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
# Explicit levels give the intended ordinal codes: Ex=1, Gd=2, TA=3, Fa=4, Po=5
HeatingQC <- as.numeric(factor(HeatingQC, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_train$HeatingQC <- HeatingQC
# Transforming HeatingQC column to numeric in test dataset
HeatingQC <- as.factor(housing_test$HeatingQC)
levels(HeatingQC)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
HeatingQC <- as.numeric(factor(HeatingQC, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_test$HeatingQC<- HeatingQC
# Transforming CentralAir column to numeric in train dataset
CentralAir <-as.factor(housing_train$CentralAir)
levels(CentralAir)
## [1] "N" "Y"
# Compare against "Y" to get the intended codes N=0, Y=1 (a plain as.numeric() on the factor would give 1 and 2)
CentralAir <- as.numeric(CentralAir == "Y")
housing_train$CentralAir <- CentralAir
# Transforming CentralAir column to numeric in test dataset
CentralAir <- as.factor(housing_test$CentralAir)
levels(CentralAir)
## [1] "N" "Y"
CentralAir <- as.numeric(CentralAir == "Y")
housing_test$CentralAir<- CentralAir
# Transforming Electrical column to numeric in train dataset
Electrical <-as.factor(housing_train$Electrical)
levels(Electrical)
## [1] "FuseA" "FuseF" "FuseP" "Mix"   "SBrkr"
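# Explicit levels give the intended codes: SBrkr=1, FuseA=2, FuseF=3, FuseP=4, Mix=5; missing values are set to 0 below
Electrical <- as.numeric(factor(Electrical, levels = c("SBrkr", "FuseA", "FuseF", "FuseP", "Mix")))
housing_train$Electrical <- Electrical
# Transforming Electrical column to numeric in test dataset
Electrical <- as.factor(housing_test$Electrical)
levels(Electrical)
## [1] "FuseA" "FuseF" "FuseP" "SBrkr"
# "Mix" is absent from test; the same explicit levels keep the codes aligned with train
Electrical <- as.numeric(factor(Electrical, levels = c("SBrkr", "FuseA", "FuseF", "FuseP", "Mix")))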
housing_test$Electrical<- Electrical
housing_train$Electrical[is.na(housing_train$Electrical)] <- 0
sum(is.na(housing_train$Electrical))
## [1] 0
housing_test$Electrical[is.na(housing_test$Electrical)] <- 0
sum(is.na(housing_test$Electrical))
## [1] 0
# Transforming KitchenQual column to numeric in train dataset
KitchenQual <-as.factor(housing_train$KitchenQual)
levels(KitchenQual)
## [1] "Ex" "Fa" "Gd" "TA"
# Explicit levels give the intended ordinal codes: Ex=1, Gd=2, TA=3, Fa=4, Po=5
KitchenQual <- as.numeric(factor(KitchenQual, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_train$KitchenQual <- KitchenQual
# Transforming KitchenQual column to numeric in test dataset
KitchenQual <- as.factor(housing_test$KitchenQual)
levels(KitchenQual)
## [1] "Ex" "Fa" "Gd" "TA"
KitchenQual <- as.numeric(factor(KitchenQual, levels = c("Ex", "Gd", "TA", "Fa", "Po")))
housing_test$KitchenQual<- KitchenQual
#Transforming Functional column to numeric in train dataset
Functional=as.factor(housing_train$Functional)
levels(Functional) #"Maj1" "Maj2" "Min1" "Min2" "Mod"  "Sev"  "Typ"
## [1] "Maj1" "Maj2" "Min1" "Min2" "Mod"  "Sev"  "Typ"
# Explicit levels from worst to best; "Sal" does not occur in the data but keeps its slot: "Sal"=1, "Sev"=2, "Maj2"=3, "Maj1"=4, "Mod"=5, "Min2"=6, "Min1"=7, "Typ"=8
Functional <- as.numeric(factor(Functional, levels = c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ")))
housing_train$Functional <- Functional
#Transforming Functional column to numeric in test dataset
Functional=as.factor(housing_test$Functional)
levels(Functional) #"Maj1" "Maj2" "Min1" "Min2" "Mod"  "Sev"  "Typ"
## [1] "Maj1" "Maj2" "Min1" "Min2" "Mod"  "Sev"  "Typ"
# Same mapping as train; missing values stay NA here and are set to 0 further on
Functional <- as.numeric(factor(Functional, levels = c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ")))
housing_test$Functional <- Functional
#Transforming FireplaceQu column to numeric in train dataset
FireplaceQu=as.factor(housing_train$FireplaceQu)
levels(FireplaceQu) #"Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(FireplaceQu)) # 690 Missing Entries
## [1] 690
# Explicit levels give the intended ordinal codes: Po=1, Fa=2, TA=3, Gd=4, Ex=5; no fireplace (NA) becomes 0
FireplaceQu <- as.numeric(factor(FireplaceQu, levels = c("Po", "Fa", "TA", "Gd", "Ex")))
FireplaceQu[is.na(FireplaceQu)]<-0
housing_train$FireplaceQu <- FireplaceQu
#Transforming FireplaceQu column to numeric in test dataset
FireplaceQu=as.factor(housing_test$FireplaceQu)
levels(FireplaceQu) #"Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(FireplaceQu)) # 730 Missing Entries
## [1] 730
FireplaceQu <- as.numeric(factor(FireplaceQu, levels = c("Po", "Fa", "TA", "Gd", "Ex")))
FireplaceQu[is.na(FireplaceQu)]<-0
housing_test$FireplaceQu <- FireplaceQu
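The convert-then-fill pattern used for FireplaceQu, and again for the garage, pool, fence and miscellaneous-feature columns that follow, could be wrapped in one small function so that train and test always go through the same mapping. This is only a sketch: `encode_ordinal` is a hypothetical helper, the ratings are toy values, and coding a missing rating as 0 simply follows the choice made in this section.

```r
# Hypothetical helper: map labels to 1..k in the given order, and NA to na_code.
# Labels that are not listed in ordered_levels also end up as na_code.
encode_ordinal <- function(x, ordered_levels, na_code = 0) {
  codes <- as.numeric(factor(x, levels = ordered_levels))
  codes[is.na(codes)] <- na_code
  codes
}

# Toy fireplace ratings, with NA standing for "no fireplace"
fireplace <- c("Gd", NA, "TA", "Po", NA, "Ex")
fireplace_num <- encode_ordinal(fireplace, c("Po", "Fa", "TA", "Gd", "Ex"))
print(fireplace_num)  # 4 0 3 1 0 5
```

Calling the helper once for the train column and once for the test column with the same `ordered_levels` keeps the two encodings aligned even when one set lacks a category.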
#Transforming GarageType column to numeric in train dataset
GarageType=as.factor(housing_train$GarageType)
levels(GarageType) #"2Types"  "Attchd"  "Basment" "BuiltIn" "CarPort" "Detchd"
## [1] "2Types"  "Attchd"  "Basment" "BuiltIn" "CarPort" "Detchd"
sum(is.na(GarageType)) # 81 Missing Entries
## [1] 81
# Explicit levels give the intended codes: Detchd=1, CarPort=2, BuiltIn=3, Basment=4, Attchd=5, 2Types=6; no garage (NA) becomes 0
GarageType <- as.numeric(factor(GarageType, levels = c("Detchd", "CarPort", "BuiltIn", "Basment", "Attchd", "2Types")))
GarageType[is.na(GarageType)]<-0
housing_train$GarageType <- GarageType
#Transforming GarageType column to numeric in test dataset
GarageType=as.factor(housing_test$GarageType)
levels(GarageType) #"2Types"  "Attchd"  "Basment" "BuiltIn" "CarPort" "Detchd"
## [1] "2Types"  "Attchd"  "Basment" "BuiltIn" "CarPort" "Detchd"
sum(is.na(GarageType)) # 76 Missing Entries
## [1] 76
GarageType <- as.numeric(factor(GarageType, levels = c("Detchd", "CarPort", "BuiltIn", "Basment", "Attchd", "2Types")))
GarageType[is.na(GarageType)]<-0
housing_test$GarageType <- GarageType
# Changing missing values of GarageYrBlt 
sum(is.na(housing_train$GarageYrBlt)) # 81 missing values
## [1] 81
sum(is.na(housing_test$GarageYrBlt)) #78 missing values
## [1] 78
housing_train$GarageYrBlt[is.na(housing_train$GarageYrBlt)] <- 0
sum(is.na(housing_train$GarageYrBlt))
## [1] 0
housing_test$GarageYrBlt[is.na(housing_test$GarageYrBlt)] <- 0
sum(is.na(housing_test$GarageYrBlt))
## [1] 0
#Transforming GarageFinish column to numeric in train dataset
GarageFinish=as.factor(housing_train$GarageFinish)
levels(GarageFinish) #"Fin" "RFn" "Unf"
## [1] "Fin" "RFn" "Unf"
sum(is.na(GarageFinish)) # 81 Missing Entries
## [1] 81
# Explicit levels give the intended ordinal codes: Unf=1, RFn=2, Fin=3; no garage (NA) becomes 0
GarageFinish <- as.numeric(factor(GarageFinish, levels = c("Unf", "RFn", "Fin")))
GarageFinish[is.na(GarageFinish)]<-0
housing_train$GarageFinish <- GarageFinish
#Transforming GarageFinish column to numeric in test dataset
GarageFinish=as.factor(housing_test$GarageFinish)
levels(GarageFinish) #"Fin" "RFn" "Unf"
## [1] "Fin" "RFn" "Unf"
sum(is.na(GarageFinish)) # 78 Missing Entries
## [1] 78
GarageFinish <- as.numeric(factor(GarageFinish, levels = c("Unf", "RFn", "Fin")))
GarageFinish[is.na(GarageFinish)]<-0
housing_test$GarageFinish <- GarageFinish
# Changing missing values of GarageCars 
sum(is.na(housing_train$GarageCars)) # no missing values
## [1] 0
sum(is.na(housing_test$GarageCars)) #1 missing value
## [1] 1
housing_test$GarageCars[is.na(housing_test$GarageCars)]<-0
# Changing missing values of GarageArea
sum(is.na(housing_train$GarageArea)) # no missing values
## [1] 0
sum(is.na(housing_test$GarageArea)) #1 missing value
## [1] 1
housing_test$GarageArea[is.na(housing_test$GarageArea)]<-0
#Transforming GarageQual column to numeric in train dataset
GarageQual=as.factor(housing_train$GarageQual)
levels(GarageQual) # "Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(GarageQual)) # 81 Missing Entries
## [1] 81
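# Explicit levels give the intended ordinal codes: Po=1, Fa=2, TA=3, Gd=4, Ex=5; no garage (NA) becomes 0
GarageQual <- as.numeric(factor(GarageQual, levels = c("Po", "Fa", "TA", "Gd", "Ex")))
GarageQual[is.na(GarageQual)]<-0
housing_train$GarageQual <- GarageQual
#Transforming GarageQual column to numeric in test dataset
GarageQual=as.factor(housing_test$GarageQual)
levels(GarageQual) # "Fa" "Gd" "Po" "TA"
## [1] "Fa" "Gd" "Po" "TA"
sum(is.na(GarageQual)) # 78 Missing Entries
## [1] 78
# "Ex" is absent from test; the same explicit levels keep the codes aligned with train
GarageQual <- as.numeric(factor(GarageQual, levels = c("Po", "Fa", "TA", "Gd", "Ex")))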
GarageQual[is.na(GarageQual)]<-0
housing_test$GarageQual <- GarageQual
#Transforming GarageCond column to numeric in train dataset
GarageCond=as.factor(housing_train$GarageCond)
levels(GarageCond) # "Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(GarageCond)) # 81 Missing Entries
## [1] 81
# Explicit levels give the intended ordinal codes: Po=1, Fa=2, TA=3, Gd=4, Ex=5; no garage (NA) becomes 0
GarageCond <- as.numeric(factor(GarageCond, levels = c("Po", "Fa", "TA", "Gd", "Ex")))
GarageCond[is.na(GarageCond)]<-0
housing_train$GarageCond <- GarageCond
#Transforming GarageCond column to numeric in test dataset
GarageCond=as.factor(housing_test$GarageCond)
levels(GarageCond) # "Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(GarageCond)) # 78 Missing Entries
## [1] 78
GarageCond <- as.numeric(factor(GarageCond, levels = c("Po", "Fa", "TA", "Gd", "Ex")))
GarageCond[is.na(GarageCond)]<-0
housing_test$GarageCond <- GarageCond
#Transforming PavedDrive column to numeric in train dataset
PavedDrive=as.factor(housing_train$PavedDrive)
levels(PavedDrive) # "N" "P" "Y"
## [1] "N" "P" "Y"
sum(is.na(PavedDrive)) # 0 Missing Entries
## [1] 0
PavedDrive=as.numeric(PavedDrive,  "N"=1, "P"=2,"Y"=3)
housing_train$PavedDrive <- PavedDrive
#Transforming PavedDrive column to numeric in test dataset
PavedDrive=as.factor(housing_test$PavedDrive)
levels(PavedDrive) # "N" "P" "Y"
## [1] "N" "P" "Y"
sum(is.na(PavedDrive)) # 0 Missing Entries
## [1] 0
PavedDrive=as.numeric(PavedDrive,  "N"=1, "P"=2,"Y"=3)
housing_test$PavedDrive <- PavedDrive
#Transforming PoolQC column to numeric in train dataset
PoolQC=as.factor(housing_train$PoolQC)
levels(PoolQC) # "Ex" "Fa" "Gd"
## [1] "Ex" "Fa" "Gd"
sum(is.na(PoolQC)) # 1453 Missing Entries
## [1] 1453
# Explicit levels give the intended codes directly: No pool=0, Fa=1, TA=2, Gd=3, Ex=4
PoolQC <- as.numeric(factor(PoolQC, levels = c("Fa", "TA", "Gd", "Ex")))
PoolQC[is.na(PoolQC)]<-0
housing_train$PoolQC <- PoolQC
#Transforming PoolQC column to numeric in test dataset
PoolQC=as.factor(housing_test$PoolQC)
levels(PoolQC) # "Ex" "Gd"
## [1] "Ex" "Gd"
sum(is.na(PoolQC)) # 1456 Missing Entries
## [1] 1456
# Same explicit levels as train, so Gd=3 and Ex=4 in both sets
PoolQC <- as.numeric(factor(PoolQC, levels = c("Fa", "TA", "Gd", "Ex")))
PoolQC[is.na(PoolQC)]<-0
housing_test$PoolQC <- PoolQC
#Transforming Fence column to numeric in train dataset
Fence=as.factor(housing_train$Fence)
levels(Fence) # "GdPrv" "GdWo"  "MnPrv" "MnWw" 
## [1] "GdPrv" "GdWo"  "MnPrv" "MnWw"
sum(is.na(Fence)) # 1179 Missing Entries
## [1] 1179
# Explicit levels give the intended codes: MnWw=1, GdWo=2, MnPrv=3, GdPrv=4; no fence (NA) becomes 0
Fence <- as.numeric(factor(Fence, levels = c("MnWw", "GdWo", "MnPrv", "GdPrv")))
Fence[is.na(Fence)]<-0
housing_train$Fence <- Fence
#Transforming Fence column to numeric in test dataset
Fence=as.factor(housing_test$Fence)
levels(Fence) # "GdPrv" "GdWo"  "MnPrv" "MnWw"
## [1] "GdPrv" "GdWo"  "MnPrv" "MnWw"
sum(is.na(Fence)) # 1169 Missing Entries
## [1] 1169
Fence <- as.numeric(factor(Fence, levels = c("MnWw", "GdWo", "MnPrv", "GdPrv")))
Fence[is.na(Fence)]<-0
housing_test$Fence <- Fence
#Transforming MiscFeature column to numeric in train dataset
MiscFeature=as.factor(housing_train$MiscFeature)
levels(MiscFeature) # "Gar2" "Othr" "Shed" "TenC"
## [1] "Gar2" "Othr" "Shed" "TenC"
sum(is.na(MiscFeature)) # 1406 Missing Entries
## [1] 1406
# Explicit levels give the intended codes: TenC=1, Shed=2, Othr=3, Gar2=4; no feature (NA) becomes 0
MiscFeature <- as.numeric(factor(MiscFeature, levels = c("TenC", "Shed", "Othr", "Gar2")))
MiscFeature[is.na(MiscFeature)]<-0
housing_train$MiscFeature <- MiscFeature
#Transforming MiscFeature column to numeric in test dataset
MiscFeature=as.factor(housing_test$MiscFeature)
levels(MiscFeature) # "Gar2" "Othr" "Shed"
## [1] "Gar2" "Othr" "Shed"
sum(is.na(MiscFeature)) # 1408 Missing Entries
## [1] 1408
# "TenC" is absent from test; the same explicit levels keep Shed=2, Othr=3, Gar2=4 as in train
MiscFeature <- as.numeric(factor(MiscFeature, levels = c("TenC", "Shed", "Othr", "Gar2")))
MiscFeature[is.na(MiscFeature)]<-0
housing_test$MiscFeature <- MiscFeature
#Transforming SaleType column to numeric in train dataset
SaleType=as.factor(housing_train$SaleType)
levels(SaleType) # "COD"   "Con"   "ConLD" "ConLI" "ConLw" "CWD"   "New"   "Oth"   "WD" 
## [1] "COD"   "Con"   "ConLD" "ConLI" "ConLw" "CWD"   "New"   "Oth"   "WD"
sum(is.na(SaleType)) # 0 Missing Entries
## [1] 0
# Explicit levels give the intended codes: Oth=1, ConLD=2, ConLI=3, ConLw=4, Con=5, COD=6, New=7, CWD=8, WD=9
SaleType <- as.numeric(factor(SaleType, levels = c("Oth", "ConLD", "ConLI", "ConLw", "Con", "COD", "New", "CWD", "WD")))
housing_train$SaleType <- SaleType
#Transforming SaleType column to numeric in test dataset
SaleType=as.factor(housing_test$SaleType)
levels(SaleType) # "COD"   "Con"   "ConLD" "ConLI" "ConLw" "CWD"   "New"   "Oth"   "WD"
## [1] "COD"   "Con"   "ConLD" "ConLI" "ConLw" "CWD"   "New"   "Oth"   "WD"
sum(is.na(SaleType)) # 1 Missing Entries
## [1] 1
# Same mapping as train; the single missing entry is assigned to "Oth" (code 1) on the next line
SaleType <- as.numeric(factor(SaleType, levels = c("Oth", "ConLD", "ConLI", "ConLw", "Con", "COD", "New", "CWD", "WD")))
SaleType[is.na(SaleType)]<-1
housing_test$SaleType <- SaleType
#Transforming SaleCondition column to numeric in train dataset
SaleCondition=as.factor(housing_train$SaleCondition)
levels(SaleCondition) # "Abnorml" "AdjLand" "Alloca"  "Family"  "Normal"  "Partial" 
## [1] "Abnorml" "AdjLand" "Alloca"  "Family"  "Normal"  "Partial"
sum(is.na(SaleCondition)) # 0 Missing Entries
## [1] 0
# Explicit levels give the intended codes: Partial=1, Family=2, Alloca=3, AdjLand=4, Abnorml=5, Normal=6
SaleCondition <- as.numeric(factor(SaleCondition, levels = c("Partial", "Family", "Alloca", "AdjLand", "Abnorml", "Normal")))
housing_train$SaleCondition <- SaleCondition
#Transforming SaleCondition column to numeric in test dataset
SaleCondition=as.factor(housing_test$SaleCondition)
levels(SaleCondition) # "Abnorml" "AdjLand" "Alloca"  "Family"  "Normal"  "Partial"
## [1] "Abnorml" "AdjLand" "Alloca"  "Family"  "Normal"  "Partial"
sum(is.na(SaleCondition)) # 0 Missing Entries
## [1] 0
SaleCondition <- as.numeric(factor(SaleCondition, levels = c("Partial", "Family", "Alloca", "AdjLand", "Abnorml", "Normal")))
housing_test$SaleCondition <- SaleCondition

Data Preparation - Replacing Null values with Zeros in data set

After the transformations above, the only columns still showing a large number of nulls were LotFrontage (259 in train, 227 in test), which we took to mean that the lot does not front a street, and Alley (1,369 in train, 1,352 in test), where a null means the property has no alley access. A handful of nulls remained in the masonry veneer and basement columns, but nothing significant.

data.frame(num_missing=colSums(is.na(housing_train)))
##               num_missing
## Id                      0
## MSSubClass              0
## MSZoning                0
## LotFrontage           259
## LotArea                 0
## Street                  0
## Alley                1369
## LotShape                0
## LandContour             0
## Utilities               0
## LotConfig               0
## LandSlope               0
## Neighborhood            0
## Condition1              0
## Condition2              0
## BldgType                0
## HouseStyle              0
## OverallQual             0
## OverallCond             0
## YearBuilt               0
## YearRemodAdd            0
## RoofStyle               0
## RoofMatl                0
## Exterior1st             0
## Exterior2nd             0
## MasVnrType              8
## MasVnrArea              8
## ExterQual               0
## ExterCond               0
## Foundation              0
## BsmtQual               37
## BsmtCond               37
## BsmtExposure           38
## BsmtFinType1            0
## BsmtFinSF1              0
## BsmtFinType2            0
## BsmtFinSF2              0
## BsmtUnfSF               0
## TotalBsmtSF             0
## Heating                 0
## HeatingQC               0
## CentralAir              0
## Electrical              0
## X1stFlrSF               0
## X2ndFlrSF               0
## LowQualFinSF            0
## GrLivArea               0
## BsmtFullBath            0
## BsmtHalfBath            0
## FullBath                0
## HalfBath                0
## BedroomAbvGr            0
## KitchenAbvGr            0
## KitchenQual             0
## TotRmsAbvGrd            0
## Functional              0
## Fireplaces              0
## FireplaceQu             0
## GarageType              0
## GarageYrBlt             0
## GarageFinish            0
## GarageCars              0
## GarageArea              0
## GarageQual              0
## GarageCond              0
## PavedDrive              0
## WoodDeckSF              0
## OpenPorchSF             0
## EnclosedPorch           0
## X3SsnPorch              0
## ScreenPorch             0
## PoolArea                0
## PoolQC                  0
## Fence                   0
## MiscFeature             0
## MiscVal                 0
## MoSold                  0
## YrSold                  0
## SaleType                0
## SaleCondition           0
## SalePrice               0
data.frame(num_missing=colSums(is.na(housing_test)))
##               num_missing
## Id                      0
## MSSubClass              0
## MSZoning                4
## LotFrontage           227
## LotArea                 0
## Street                  0
## Alley                1352
## LotShape                0
## LandContour             0
## Utilities               2
## LotConfig               0
## LandSlope               0
## Neighborhood            0
## Condition1              0
## Condition2              0
## BldgType                0
## HouseStyle              0
## OverallQual             0
## OverallCond             0
## YearBuilt               0
## YearRemodAdd            0
## RoofStyle               0
## RoofMatl                0
## Exterior1st             1
## Exterior2nd             1
## MasVnrType             16
## MasVnrArea             15
## ExterQual               0
## ExterCond               0
## Foundation              0
## BsmtQual               44
## BsmtCond               45
## BsmtExposure           44
## BsmtFinType1            0
## BsmtFinSF1              1
## BsmtFinType2            0
## BsmtFinSF2              1
## BsmtUnfSF               1
## TotalBsmtSF             1
## Heating                 0
## HeatingQC               0
## CentralAir              0
## Electrical              0
## X1stFlrSF               0
## X2ndFlrSF               0
## LowQualFinSF            0
## GrLivArea               0
## BsmtFullBath            2
## BsmtHalfBath            2
## FullBath                0
## HalfBath                0
## BedroomAbvGr            0
## KitchenAbvGr            0
## KitchenQual             1
## TotRmsAbvGrd            0
## Functional              2
## Fireplaces              0
## FireplaceQu             0
## GarageType              0
## GarageYrBlt             0
## GarageFinish            0
## GarageCars              0
## GarageArea              0
## GarageQual              0
## GarageCond              0
## PavedDrive              0
## WoodDeckSF              0
## OpenPorchSF             0
## EnclosedPorch           0
## X3SsnPorch              0
## ScreenPorch             0
## PoolArea                0
## PoolQC                  0
## Fence                   0
## MiscFeature             0
## MiscVal                 0
## MoSold                  0
## YrSold                  0
## SaleType                0
## SaleCondition           0
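Because most columns have no missing values, the full tables above are long. A shorter view, sketched below on a small made-up data frame (`toy` and its values are illustrative only), keeps just the columns that still contain NAs and sorts them by count.

```r
# Toy data frame with a few missing values; not the housing data
toy <- data.frame(
  LotFrontage = c(65, NA, NA, 80),
  LotArea = c(8000, 9000, 10000, 11000),
  BsmtQual = c(1, NA, 3, 2)
)

# Count NAs per column, keep only columns that have any, largest first
num_missing <- colSums(is.na(toy))
missing_only <- sort(num_missing[num_missing > 0], decreasing = TRUE)
print(data.frame(num_missing = missing_only))
```

Applied to housing_train or housing_test in place of `toy`, the same three lines would list only the columns that the fill step needs to cover.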

The following code replaces the remaining missing values with 0.

housing_train$LotFrontage[is.na(housing_train$LotFrontage)] <- 0
sum(is.na(housing_train$LotFrontage))
## [1] 0
housing_test$LotFrontage[is.na(housing_test$LotFrontage)] <- 0
sum(is.na(housing_test$LotFrontage))
## [1] 0
housing_train$BsmtQual[is.na(housing_train$BsmtQual)] <- 0
sum(is.na(housing_train$BsmtQual))
## [1] 0
housing_test$BsmtQual[is.na(housing_test$BsmtQual)] <- 0
sum(is.na(housing_test$BsmtQual))
## [1] 0
housing_train$MasVnrType[is.na(housing_train$MasVnrType)] <- 0
sum(is.na(housing_train$MasVnrType))
## [1] 0
housing_test$MasVnrType[is.na(housing_test$MasVnrType)] <- 0
sum(is.na(housing_test$MasVnrType))
## [1] 0
housing_train$MasVnrArea[is.na(housing_train$MasVnrArea)] <- 0
sum(is.na(housing_train$MasVnrArea))
## [1] 0
housing_test$MasVnrArea[is.na(housing_test$MasVnrArea)] <- 0
sum(is.na(housing_test$MasVnrArea))
## [1] 0
housing_train$BsmtCond[is.na(housing_train$BsmtCond)] <- 0
sum(is.na(housing_train$BsmtCond))
## [1] 0
housing_test$BsmtCond[is.na(housing_test$BsmtCond)] <- 0
sum(is.na(housing_test$BsmtCond))
## [1] 0
housing_train$BsmtExposure[is.na(housing_train$BsmtExposure)] <- 0
sum(is.na(housing_train$BsmtExposure))
## [1] 0
housing_test$BsmtExposure[is.na(housing_test$BsmtExposure)] <- 0
sum(is.na(housing_test$BsmtExposure))
## [1] 0
housing_train$MiscFeature[is.na(housing_train$MiscFeature)] <- 0
sum(is.na(housing_train$MiscFeature))
## [1] 0
housing_test$GarageQual[is.na(housing_test$GarageQual)] <- 0
sum(is.na(housing_test$GarageQual))
## [1] 0
housing_test$MSZoning[is.na(housing_test$MSZoning)] <- 0
sum(is.na(housing_test$MSZoning))
## [1] 0
housing_test$Exterior1st[is.na(housing_test$Exterior1st)] <- 0
sum(is.na(housing_test$Exterior1st))
## [1] 0
housing_test$Exterior2nd[is.na(housing_test$Exterior2nd)] <- 0
sum(is.na(housing_test$Exterior2nd))
## [1] 0
housing_test$BsmtFinSF1[is.na(housing_test$BsmtFinSF1)] <- 0
sum(is.na(housing_test$BsmtFinSF1))
## [1] 0
housing_test$BsmtFinSF2[is.na(housing_test$BsmtFinSF2)] <- 0
sum(is.na(housing_test$BsmtFinSF2   ))
## [1] 0
housing_test$BsmtUnfSF[is.na(housing_test$BsmtUnfSF)] <- 0
sum(is.na(housing_test$BsmtUnfSF))
## [1] 0
housing_test$BsmtFullBath[is.na(housing_test$BsmtFullBath)] <- 0
sum(is.na(housing_test$BsmtFullBath))
## [1] 0
housing_test$BsmtHalfBath[is.na(housing_test$BsmtHalfBath)] <- 0
sum(is.na(housing_test$BsmtHalfBath))
## [1] 0
housing_test$KitchenQual[is.na(housing_test$KitchenQual)] <- 0
sum(is.na(housing_test$KitchenQual))
## [1] 0
housing_test$Functional[is.na(housing_test$Functional)] <- 0
sum(is.na(housing_test$Functional))
## [1] 0
#housing_train$Alley[is.na(housing_train$Alley)] <- 0
#sum(is.na(housing_train$Alley))
#housing_test$Alley[is.na(housing_test$Alley)] <- 0
#sum(is.na(housing_test$Alley))

As the majority of the properties in the data did NOT have an Alley, that variable will be excluded from our model.

trainAlley <- housing_train$Alley
testAlley <- housing_test$Alley
housing_train$Alley <- NULL
housing_test$Alley <- NULL
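
A claim like "the majority of properties have no Alley" can be checked by computing the share of missing values per column. The sketch below uses a small made-up data frame (toy_houses is hypothetical, not the project data); on the real data the same expression would be applied to housing_train$Alley before the column is dropped.

```r
# Toy stand-in for the housing data; all values are invented for illustration
toy_houses <- data.frame(
  Alley   = c(NA, NA, "Grvl", NA, NA, "Pave", NA, NA),
  LotArea = c(8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382)
)

# Share of rows with no Alley recorded
alley_missing_share <- mean(is.na(toy_houses$Alley))
alley_missing_share

# Share of missing values for every column at once
missing_share <- colMeans(is.na(toy_houses))
missing_share
```

A column whose missing share is well above one half is a reasonable candidate for exclusion, which is the reasoning applied to Alley here.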

As previously stated, the data was fairly clean to begin with, and the remaining missing values were replaced with zeros.
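
The column-by-column replacements above all follow the same pattern, so they could be collapsed into one helper. This is only a sketch: fill_na_zero is a hypothetical function name, the toy data frame is invented, and it assumes the columns are already numeric (as they are at this point in the report); assigning 0 into a factor column would not work the same way.

```r
# Hypothetical helper: replace NA with 0 in the named columns of a data frame
fill_na_zero <- function(df, cols) {
  for (col in intersect(cols, names(df))) {
    df[[col]][is.na(df[[col]])] <- 0
  }
  df
}

# Invented example data with a few missing values
toy <- data.frame(
  MasVnrArea   = c(120, NA, 0),
  BsmtFullBath = c(NA, 1, 0)
)

toy_clean <- fill_na_zero(toy, c("MasVnrArea", "BsmtFullBath"))

# Remaining missing values per column
colSums(is.na(toy_clean))
```

Applying one helper to both housing_train and housing_test with the same column list also guards against cleaning a column in one data set and forgetting it in the other.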

Data Visualizations

Once we cleaned the data, we needed to visualize it to get a baseline understanding, at a very basic level, of which variables might be useful and which might not. To do this, we loaded the following packages:

  1. ggplot2 - a package used for creating graphics
  2. ggcorrplot - used to visualize a correlation matrix using ggplot2; provides a solution for reordering the correlation matrix and displays the significance level on the correlogram

Data Visualizations - corrplot

#Libraries for the next visualizations
library(ggcorrplot) 
## Warning: package 'ggcorrplot' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.1
library(ggplot2)
correlations <- cor(housing_train[,c(2:15, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower",  sig.level = 0.01, insig = "blank")

Among the positive correlations, the strongest is between MSSubClass and BldgType, followed by LandSlope and LotArea. Among the negative correlations, the strongest are between LandContour and LandSlope, and between BldgType and both LotFrontage and LotArea. With respect to SalePrice, the variables MSSubClass, MSZoning, LotShape, LotConfig, and BldgType have a negative correlation, while LotFrontage, LotArea, Neighborhood, and Condition have a positive correlation.

correlations <- cor(housing_train[,c(16:26, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower",  sig.level = 0.01, insig = "blank")

Looking at the variables with a greater correlation to SalePrice, OverallQual, YearBuilt, YearRemodAdd, MasVnrArea, and RoofStyle have a positive correlation. The only variable that has a negative correlation to SalePrice is OverallCond.

correlations <- cor(housing_train[,c(27:40, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower",  sig.level = 0.01, insig = "blank")

In this set of variables, ExterQual, BsmtQual, and HeatingQC have a negative correlation to SalePrice. TotalBsmtSF has a positive correlation, followed by BsmtFinSF1 and Foundation.

correlations <- cor(housing_train[,c(41:60, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower",  sig.level = 0.01, insig = "blank")

All the variables related to living space and quality have a positive correlation; the only living space with a negative correlation is the kitchen.

correlations <- cor(housing_train[,c(61:79, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower",  sig.level = 0.01, insig = "blank")

Finally, garage area is the last variable with a stronger correlation to SalePrice. With the exception of fireplaces, the remaining amenities do not appear to have a strong relationship to SalePrice.
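
Reading correlations off five separate plots is error-prone, so a numeric ranking can serve as a cross-check. The sketch below ranks predictors by the absolute value of their correlation with the outcome. The data are simulated (toy and its coefficients are invented for illustration); only the column names are borrowed from the housing data.

```r
set.seed(621)
n <- 100

# Simulated data: only OverallQual actually drives the price here
toy <- data.frame(
  OverallQual  = sample(1:10, n, replace = TRUE),
  LotArea      = runif(n, 5000, 15000),
  KitchenAbvGr = sample(1:2, n, replace = TRUE)
)
toy$SalePrice <- 50000 + 20000 * toy$OverallQual + rnorm(n, sd = 10000)

# Correlation of every predictor with SalePrice, as a named vector
predictors <- setdiff(names(toy), "SalePrice")
cor_with_price <- cor(toy[, predictors], toy$SalePrice)[, 1]

# Strongest relationships first, regardless of sign
ranked <- cor_with_price[order(abs(cor_with_price), decreasing = TRUE)]
ranked
```

On the project data the same ranking would be computed from housing_train, and the top entries could then be compared with the variables singled out in the plots above.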

Data Visualization - Scatterplot Matrix

In order to run more advanced scatterplots, we loaded the car package.

pairs(SalePrice~YearBuilt+OverallQual+TotalBsmtSF+GrLivArea,data=housing_train,
   main="Simple Scatterplot Matrix")

Looking at the above scatterplots, the data appear well distributed, and the plots also show how the variables correlate with one another.

#install.packages('carData')
library(car)
## Warning: package 'car' was built under R version 4.2.1
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.1
scatterplot(SalePrice ~ YearBuilt, data=housing_train,  xlab="Year Built", ylab="Sale Price", grid=FALSE)

The above chart compares the sale price to the year the house was built. We can see a correlation indicating that the newer the home, the higher the SalePrice.

scatterplot(SalePrice ~ YrSold, data=housing_train,  xlab="Year Sold", ylab="Sale Price", grid=FALSE)

Interestingly, the Year Sold vs Sale Price chart shows how the dip in the 2008 housing market influenced the sale price of houses; the data shows a small decline from 2007 to 2008, followed by a slight increase in 2009. SalePrice seems to stabilize afterwards.

scatterplot(SalePrice ~ LotArea, data=housing_train,  xlab="Lot Area", ylab="Sale Price", grid=FALSE)

The chart shows a non-linear relationship between lot size and SalePrice, indicating that other factors carry more weight in the price of a house than lot size alone.

scatterplot(SalePrice ~ X1stFlrSF, data=housing_train,  xlab="1st Floor Square Foot", ylab="Sale Price", grid=FALSE)

For a final look, the data indicate that 1st floor square footage has a strong relationship to SalePrice, but outliers still exist, indicating that there are other important variables.

Data Modeling and Evaluation

Packages Used

  1. caret - short for Classification And REgression Training; a set of functions that attempt to streamline the process of creating predictive models. For our purposes, this package was used to create partitions for modeling and testing our data.
  2. Metrics - evaluation metrics for R. For our purposes, this package was used to evaluate RMSE.
  3. xgboost - provides parallel tree boosting (also known as GBDT, GBM) that solves many data science problems in a fast and accurate way. For our purposes, this package was utilized to predict the SalePrice for our test data.
  4. MASS - Modern Applied Statistics with S. For our purposes, this package was used for visual analysis with regard to data transformation.
  5. ipred - improved predictive models. For our purposes, this package was used for bagging.
  6. randomForest - random forests can deal with a large number of features and help to identify the important attributes. For our purposes, this package was used for the random forest analysis.
#Data partition using caret partition function.
#install.packages('lattice')
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
#Packages for RMSE
#install.packages('Metrics')
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.2.2
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
#Naming the Sale Price as Outcome
outcome <- housing_train$SalePrice
#Partition the data to be 60% train and 40% test
partition <- createDataPartition(y=outcome, p=.6, list=FALSE)
train <- housing_train[partition,]
test <- housing_train[-partition,]

NOTE: After testing the models several times with different train and test sets, we found that reducing the training set improved XGBoost but increased the error for linear regression. XGBoost gave its best prediction when 60% of the data was used for training; using 50% increased our prediction error by 2%. The first part of the project therefore uses a different percentage of training data than the XGBoost section.
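
Because the partition is drawn at random and no seed is set, every run produces a different split, and the reported figures shift from run to run. A minimal sketch of a reproducible split is shown below. It uses base R's sample() as a stand-in for createDataPartition (which additionally balances the split across the range of the outcome); make_split and the seed value 621 are hypothetical choices for illustration.

```r
# Hypothetical helper: draw a reproducible set of training row indices
make_split <- function(n_rows, train_share, seed) {
  set.seed(seed)  # fixing the seed makes the random draw repeatable
  sort(sample(n_rows, size = round(train_share * n_rows)))
}

# Two calls with the same seed return the same rows
split_a <- make_split(1460, 0.6, seed = 621)
split_b <- make_split(1460, 0.6, seed = 621)
identical(split_a, split_b)

# A different seed gives a different split of the same size
split_c <- make_split(1460, 0.6, seed = 622)
length(split_a)
length(split_c)
```

With caret, the equivalent step would be to call set.seed() immediately before createDataPartition so that the train and test sets, and therefore the R-squared and RMSE values, stay the same between runs.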

Data Analysis - Linear Regression

Step one: create a linear model to identify variables that share a strong relationship with the Sale Price.

LM_model1 <- lm(SalePrice ~., data=train)
summary(LM_model1)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -135253  -12988   -1143   12562  160803 
## 
## Coefficients: (3 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.705e+06  1.385e+06   1.232 0.218404    
## Id            -2.715e+00  2.118e+00  -1.282 0.200240    
## MSSubClass    -6.641e+01  4.762e+01  -1.395 0.163549    
## MSZoning      -1.304e+03  1.525e+03  -0.855 0.392666    
## LotFrontage    3.017e+01  2.986e+01   1.010 0.312566    
## LotArea        5.554e-01  1.042e-01   5.331 1.27e-07 ***
## Street         4.789e+04  1.424e+04   3.364 0.000805 ***
## LotShape      -1.190e+03  7.147e+02  -1.665 0.096245 .  
## LandContour   -2.322e+03  1.407e+03  -1.650 0.099257 .  
## Utilities             NA         NA      NA       NA    
## LotConfig     -7.665e+01  5.702e+02  -0.134 0.893090    
## LandSlope     -4.710e+02  4.150e+03  -0.113 0.909674    
## Neighborhood   5.082e+00  1.654e+02   0.031 0.975500    
## Condition1     1.056e+02  1.017e+03   0.104 0.917315    
## Condition2    -3.593e+03  4.204e+03  -0.855 0.393039    
## BldgType      -1.371e+03  1.562e+03  -0.877 0.380518    
## HouseStyle     4.851e+02  6.885e+02   0.705 0.481307    
## OverallQual    8.057e+03  1.185e+03   6.797 2.08e-11 ***
## OverallCond    3.776e+03  1.125e+03   3.357 0.000825 ***
## YearBuilt      2.003e+02  7.536e+01   2.658 0.008027 ** 
## YearRemodAdd   7.123e+01  6.965e+01   1.023 0.306786    
## RoofStyle      1.457e+03  1.163e+03   1.252 0.210794    
## RoofMatl      -2.531e+03  1.444e+03  -1.752 0.080159 .  
## Exterior1st   -1.222e+03  5.740e+02  -2.129 0.033523 *  
## Exterior2nd    7.309e+02  5.140e+02   1.422 0.155417    
## MasVnrType     6.813e+03  1.564e+03   4.357 1.49e-05 ***
## MasVnrArea     2.670e+01  6.028e+00   4.430 1.08e-05 ***
## ExterQual     -1.392e+04  2.059e+03  -6.760 2.65e-11 ***
## ExterCond     -7.138e+01  1.336e+03  -0.053 0.957408    
## Foundation    -7.836e+01  1.701e+03  -0.046 0.963261    
## BsmtQual      -5.945e+03  1.440e+03  -4.128 4.04e-05 ***
## BsmtCond       3.271e+03  1.415e+03   2.312 0.021019 *  
## BsmtExposure  -2.684e+03  8.953e+02  -2.997 0.002806 ** 
## BsmtFinType1   3.329e+02  6.475e+02   0.514 0.607255    
## BsmtFinSF1     4.702e+01  5.757e+00   8.167 1.23e-15 ***
## BsmtFinType2  -8.020e+02  1.189e+03  -0.674 0.500243    
## BsmtFinSF2     3.304e+01  8.848e+00   3.734 0.000202 ***
## BsmtUnfSF      2.528e+01  5.436e+00   4.651 3.86e-06 ***
## TotalBsmtSF           NA         NA      NA       NA    
## Heating       -1.093e+03  3.007e+03  -0.363 0.716429    
## HeatingQC     -2.456e+02  6.362e+02  -0.386 0.699545    
## CentralAir     5.512e+02  4.495e+03   0.123 0.902432    
## Electrical    -2.493e+02  9.599e+02  -0.260 0.795131    
## X1stFlrSF      5.211e+01  6.910e+00   7.541 1.26e-13 ***
## X2ndFlrSF      5.031e+01  5.316e+00   9.463  < 2e-16 ***
## LowQualFinSF  -2.744e+01  2.329e+01  -1.178 0.238983    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   1.834e+03  2.573e+03   0.713 0.476306    
## BsmtHalfBath   1.399e+03  3.973e+03   0.352 0.724788    
## FullBath       6.860e+02  2.846e+03   0.241 0.809573    
## HalfBath       3.822e+03  2.724e+03   1.403 0.160882    
## BedroomAbvGr  -7.348e+03  1.730e+03  -4.248 2.41e-05 ***
## KitchenAbvGr  -2.902e+04  5.661e+03  -5.126 3.72e-07 ***
## KitchenQual   -5.072e+03  1.537e+03  -3.300 0.001009 ** 
## TotRmsAbvGrd   4.582e+03  1.255e+03   3.651 0.000278 ***
## Functional     4.990e+03  9.536e+02   5.233 2.13e-07 ***
## Fireplaces     7.823e+03  2.920e+03   2.679 0.007531 ** 
## FireplaceQu   -1.699e+03  8.524e+02  -1.993 0.046607 *  
## GarageType     1.883e+03  6.788e+02   2.775 0.005653 ** 
## GarageYrBlt   -1.447e+01  6.365e+00  -2.274 0.023228 *  
## GarageFinish  -3.868e+02  1.525e+03  -0.254 0.799851    
## GarageCars     3.996e+03  2.903e+03   1.377 0.168989    
## GarageArea     1.394e+01  9.539e+00   1.461 0.144284    
## GarageQual    -4.068e+02  1.970e+03  -0.207 0.836445    
## GarageCond     2.434e+03  2.372e+03   1.026 0.305259    
## PavedDrive     2.473e+03  2.090e+03   1.183 0.236985    
## WoodDeckSF     1.546e+01  7.727e+00   2.001 0.045705 *  
## OpenPorchSF    2.649e+00  1.557e+01   0.170 0.864900    
## EnclosedPorch  7.252e+00  1.693e+01   0.428 0.668419    
## X3SsnPorch     3.606e+00  3.281e+01   0.110 0.912509    
## ScreenPorch    4.706e+01  1.761e+01   2.673 0.007670 ** 
## PoolArea       1.907e+03  1.292e+02  14.762  < 2e-16 ***
## PoolQC        -4.095e+05  2.263e+04 -18.097  < 2e-16 ***
## Fence          2.516e+02  9.384e+02   0.268 0.788655    
## MiscFeature   -3.476e+02  1.821e+03  -0.191 0.848668    
## MiscVal        3.859e-03  1.535e+00   0.003 0.997994    
## MoSold        -1.098e+02  3.325e+02  -0.330 0.741330    
## YrSold        -1.135e+03  6.837e+02  -1.661 0.097146 .  
## SaleType      -1.277e+03  6.002e+02  -2.127 0.033714 *  
## SaleCondition  3.742e+03  8.731e+02   4.286 2.04e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25110 on 801 degrees of freedom
## Multiple R-squared:  0.9058, Adjusted R-squared:  0.8969 
## F-statistic: 101.4 on 76 and 801 DF,  p-value: < 2.2e-16
#The competition asks to use RMSE for predicting error.
prediction_lm1 <- predict(LM_model1, test, type="response")
## Warning in predict.lm(LM_model1, test, type = "response"): prediction from a
## rank-deficient fit may be misleading
model_output <- cbind(test, prediction_lm1)

model_output$log_prediction <- log(model_output$prediction_lm1)
## Warning in log(model_output$prediction_lm1): NaNs produced
model_output$log_SalePrice <- log(model_output$SalePrice)

#Test with RMSE

rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] NaN

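In the run shown above, the full model has a multiple R-squared of 0.9058 (adjusted 0.8969), and roughly a quarter of the included variables are significant at the 0.001 level. The log-scale RMSE comes out as NaN because at least one predicted price is not positive, so its logarithm is undefined (see the warning from log()).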
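
The NaN reported above is what the log-scale RMSE returns whenever a single prediction is zero or negative. The sketch below reproduces the effect with three invented values and writes the formula out by hand; log_rmse is a hypothetical helper, not a function from the Metrics package.

```r
# Log-scale RMSE written out by hand
log_rmse <- function(actual, predicted) {
  sqrt(mean((log(actual) - log(predicted))^2))
}

# Invented values; the third prediction is negative, which an
# untransformed linear model can produce for an unusual house
actual    <- c(200000, 150000, 90000)
predicted <- c(195000, 160000, -5000)

rmse_all <- suppressWarnings(log_rmse(actual, predicted))
rmse_all  # NaN, because log(-5000) is undefined

# Diagnostic only: count the offending rows and score the rest
n_bad <- sum(predicted <= 0)
keep  <- predicted > 0
rmse_positive_only <- log_rmse(actual[keep], predicted[keep])
n_bad
rmse_positive_only
```

Dropping rows is only a way to see how many predictions are affected. Modeling log(SalePrice) directly, as the report does later, avoids the problem because the back-transformed predictions are always positive.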

LM_model2 <- lm(SalePrice ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data=train)
summary(LM_model2)
## 
## Call:
## lm(formula = SalePrice ~ LotArea + Street + Neighborhood + Condition1 + 
##     BldgType + OverallCond + OverallQual + YearBuilt + RoofMatl + 
##     MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF + X1stFlrSF + 
##     X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd + 
##     Fireplaces + GarageArea + GarageQual, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -471136  -15782   -1509   12844  206921 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8.556e+05  1.155e+05  -7.409 3.07e-13 ***
## LotArea       6.151e-01  1.171e-01   5.251 1.91e-07 ***
## Street        3.629e+04  1.805e+04   2.011 0.044676 *  
## Neighborhood  5.222e+02  2.000e+02   2.611 0.009189 ** 
## Condition1    1.087e+03  1.269e+03   0.857 0.391611    
## BldgType     -4.552e+03  1.119e+03  -4.068 5.18e-05 ***
## OverallCond   4.629e+03  1.179e+03   3.927 9.29e-05 ***
## OverallQual   1.337e+04  1.449e+03   9.228  < 2e-16 ***
## YearBuilt     4.318e+02  5.607e+01   7.701 3.74e-14 ***
## RoofMatl      1.722e+03  1.798e+03   0.958 0.338502    
## MasVnrArea    2.383e+01  7.054e+00   3.377 0.000765 ***
## ExterQual    -1.657e+04  2.551e+03  -6.495 1.41e-10 ***
## BsmtFinSF1    8.057e+00  4.574e+00   1.761 0.078512 .  
## BsmtUnfSF    -2.191e+00  4.455e+00  -0.492 0.623068    
## X1stFlrSF     3.657e+01  6.886e+00   5.312 1.39e-07 ***
## X2ndFlrSF     2.220e+01  5.172e+00   4.293 1.96e-05 ***
## BedroomAbvGr -7.573e+03  2.093e+03  -3.618 0.000315 ***
## KitchenAbvGr -1.980e+04  6.675e+03  -2.966 0.003097 ** 
## KitchenQual  -8.949e+03  1.924e+03  -4.650 3.84e-06 ***
## TotRmsAbvGrd  7.986e+03  1.550e+03   5.153 3.19e-07 ***
## Fireplaces    7.303e+03  2.201e+03   3.318 0.000945 ***
## GarageArea    3.339e+01  8.377e+00   3.986 7.29e-05 ***
## GarageQual   -1.140e+03  1.102e+03  -1.035 0.301007    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33440 on 855 degrees of freedom
## Multiple R-squared:  0.8217, Adjusted R-squared:  0.8171 
## F-statistic: 179.1 on 22 and 855 DF,  p-value: < 2.2e-16
#The competition asks to use RMSE for predicting error.
prediction_lm <- predict(LM_model2, test, type="response")
model_output <- cbind(test, prediction_lm)

model_output$log_prediction <- log(model_output$prediction_lm)
model_output$log_SalePrice <- log(model_output$SalePrice)

#Test with RMSE

rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] 0.16122

The new model, with a reduced number of variables, has a lower R-squared of 0.8217 (adjusted 0.8171) but, unlike the full model, produces a valid log-scale RMSE of 0.1612.

Data Analysis - Data Transformation

We run a new set of plots to determine if data transformation is necessary.

plot(LM_model2$fitted.values, LM_model2$residuals, pch = 20, col = "blue")
abline(h = 0)

The residuals are centered around 0, with more scatter as the fitted value increases.

#Package for BoxCox 
library(MASS)

To further validate, we utilize boxcox.

boxcox(LM_model2)

Based on our boxcox output, we determine that using a log transformation will improve the linear regression model.
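
The Box-Cox plot is read visually, but the same information can be extracted numerically: the lambda with the highest profile log-likelihood suggests the transformation, and a value near 0 points to the log. The sketch below assumes simulated data in which the log of the price is linear in living area (toy and its coefficients are invented for illustration).

```r
library(MASS)

set.seed(621)
# Simulated data: log(SalePrice) is linear in GrLivArea by construction
toy <- data.frame(GrLivArea = runif(300, 800, 3000))
toy$SalePrice <- exp(10 + 0.001 * toy$GrLivArea + rnorm(300, sd = 0.1))

fit <- lm(SalePrice ~ GrLivArea, data = toy)

# plotit = FALSE returns the lambda grid (x) and log-likelihoods (y)
bc <- boxcox(fit, plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

Common readings are lambda near 1 for no transformation, near 0.5 for a square root, and near 0 for a log transformation.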

model3 <- lm(I(log(SalePrice)) ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data=train)
summary(model3)
## 
## Call:
## lm(formula = I(log(SalePrice)) ~ LotArea + Street + Neighborhood + 
##     Condition1 + BldgType + OverallCond + OverallQual + YearBuilt + 
##     RoofMatl + MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF + 
##     X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + 
##     TotRmsAbvGrd + Fireplaces + GarageArea + GarageQual, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.17438 -0.06670  0.00648  0.08324  0.51722 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.161e+00  5.496e-01   5.752 1.23e-08 ***
## LotArea       2.531e-06  5.574e-07   4.541 6.41e-06 ***
## Street        1.230e-01  8.589e-02   1.432 0.152433    
## Neighborhood  2.007e-03  9.518e-04   2.108 0.035299 *  
## Condition1    7.016e-03  6.038e-03   1.162 0.245587    
## BldgType     -1.855e-02  5.325e-03  -3.483 0.000520 ***
## OverallCond   4.902e-02  5.610e-03   8.739  < 2e-16 ***
## OverallQual   8.439e-02  6.894e-03  12.240  < 2e-16 ***
## YearBuilt     3.754e-03  2.669e-04  14.067  < 2e-16 ***
## RoofMatl      1.393e-02  8.558e-03   1.628 0.103828    
## MasVnrArea    5.455e-06  3.357e-05   0.162 0.870952    
## ExterQual    -2.429e-02  1.214e-02  -2.000 0.045797 *  
## BsmtFinSF1    4.315e-05  2.177e-05   1.982 0.047763 *  
## BsmtUnfSF     1.531e-06  2.120e-05   0.072 0.942456    
## X1stFlrSF     2.054e-04  3.277e-05   6.269 5.75e-10 ***
## X2ndFlrSF     1.538e-04  2.461e-05   6.249 6.50e-10 ***
## BedroomAbvGr  3.905e-03  9.962e-03   0.392 0.695188    
## KitchenAbvGr -6.512e-02  3.177e-02  -2.050 0.040671 *  
## KitchenQual  -3.287e-02  9.158e-03  -3.589 0.000351 ***
## TotRmsAbvGrd  1.904e-02  7.376e-03   2.582 0.010001 *  
## Fireplaces    5.295e-02  1.048e-02   5.055 5.26e-07 ***
## GarageArea    1.705e-04  3.987e-05   4.278 2.10e-05 ***
## GarageQual    1.180e-02  5.242e-03   2.251 0.024629 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1591 on 855 degrees of freedom
## Multiple R-squared:  0.846,  Adjusted R-squared:  0.842 
## F-statistic: 213.5 on 22 and 855 DF,  p-value: < 2.2e-16
prediction3 <- predict(model3, test, type="response")
model_output <- cbind(test, prediction3)

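#NOTE: prediction3 is already on the log scale, so this second log() distorts the RMSE reported below
model_output$log_prediction3 <- log(model_output$prediction3)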
model_output$log_SalePrice3 <- log(model_output$SalePrice)
#Test with RMSE

rmse(model_output$log_SalePrice3, model_output$log_prediction3)
## [1] 9.547395

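The R-squared has improved to 0.846 (still below the 0.9058 of our original full model), but the RMSE of 9.547 is SIGNIFICANTLY off from our previous models. This is a scale problem and not a modeling one: model3 already predicts log(SalePrice), so taking the log of its predictions a second time compares values near 2.5 with log prices near 12. To further assess the model, we will look at influence diagnostics such as Cook's distance to determine whether any outliers influence the model. In turn, this will help us decide whether any observations or variables should be excluded.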
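
An RMSE above 9 on the log scale is what one would expect when predictions from a log-response model are logged a second time before scoring. The sketch below shows the two computations side by side on simulated data (toy and its coefficients are invented for illustration; only the column names mirror the housing data).

```r
set.seed(621)
# Simulated prices whose logarithm is linear in living area
toy <- data.frame(GrLivArea = runif(200, 800, 3000))
toy$SalePrice <- exp(11 + 0.0004 * toy$GrLivArea + rnorm(200, sd = 0.1))

# Model fitted on the log scale, like model3
fit <- lm(log(SalePrice) ~ GrLivArea, data = toy)
pred_log <- predict(fit, toy)  # these are already log prices

# Correct: compare log prices with the log-scale predictions directly
rmse_correct <- sqrt(mean((log(toy$SalePrice) - pred_log)^2))

# Incorrect: logging the predictions again shrinks them to roughly 2.5
rmse_double_log <- sqrt(mean((log(toy$SalePrice) - log(pred_log))^2))

# Predictions in dollars are recovered with exp()
pred_price <- exp(pred_log)

c(correct = rmse_correct, double_log = rmse_double_log)
```

In the report's code, the corresponding change would be to compare log(SalePrice) with prediction3 itself; the resulting value is not shown here because it depends on the random partition.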

mean(hatvalues(model3))
## [1] 0.0261959
qqnorm(LM_model2$residuals, main = "LM_model2") 
qqline(LM_model2$residuals)
abline(h = 0, col = "grey")

The mean leverage and the Q-Q plot of the residuals were used to look for unusual data points; overall, the model does not require removing any data points.
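
Cook's distance itself can be computed with the base function cooks.distance(). The sketch below flags observations above 4/n, which is a common rule of thumb and not a threshold taken from this report; the data are simulated, with one off-trend, high-leverage point planted on purpose.

```r
set.seed(621)
# Simulated linear data
x <- runif(50, 0, 10)
y <- 3 + 2 * x + rnorm(50)

# Plant one influential point: far out in x and far off the trend
x[50] <- 30
y[50] <- 0

fit <- lm(y ~ x)

# Cook's distance for every observation
cooks_d <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, other cutoffs are also used)
cutoff  <- 4 / length(cooks_d)
flagged <- which(cooks_d > cutoff)
flagged

# Leverage values, as used with mean(hatvalues(...)) above
max(hatvalues(fit))
```

Flagged observations are candidates for inspection, not automatic removal; a legitimately unusual house can be influential and still be valid data.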

Data Analysis - Bagging

For the bagging model, no preparation was required since the data had already been converted from categorical to numeric. The model will start with 500 bootstrap samples, and this number will be reduced as we see fit.

#Package for bagging
#install.packages('ipred')
library(ipred)
## Warning: package 'ipred' was built under R version 4.2.1
house_bag <- bagging(formula = SalePrice ~., data = train, nbagg = 500) 
house_bag
## 
## Bagging regression trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = SalePrice ~ ., data = train, nbagg = 500)

Out-of-bag Prediction

house_bag_oob <- bagging(formula = SalePrice~., data = train, coob = T, nbagg = 500)
house_bag_oob
## 
## Bagging regression trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = SalePrice ~ ., data = train, coob = T, 
##     nbagg = 500)
## 
## Out-of-bag estimate of root mean squared error:  35960.38

The OOB error is high: the out-of-bag RMSE is 35960.38, which is above the residual standard errors of the untransformed linear regression models (25110 and 33440), although those are in-sample figures and not directly comparable.

Looking at the RMSE on the test set:

# Predict using the test set
house_bag_pred_1 <- predict(house_bag_oob, test)
model_output <- cbind(test, house_bag_pred_1)


model_output$log_prediction_bag <- log(model_output$house_bag_pred_1)
model_output$log_SalePrice_bag <- log(model_output$SalePrice)

#Test with RMSE

rmse(model_output$log_SalePrice_bag,model_output$log_prediction_bag)
## [1] 0.1924199

On the test set, the bagged model has a log-scale RMSE of 0.1924, which is higher than the 0.1612 of the reduced linear regression model.

house_bag2 <- bagging(formula = SalePrice ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data = train, nbagg = 500) 
house_bag2
## 
## Bagging regression trees with 500 bootstrap replications 
## 
## Call: bagging.data.frame(formula = SalePrice ~ LotArea + Street + Neighborhood + 
##     Condition1 + BldgType + OverallCond + OverallQual + YearBuilt + 
##     RoofMatl + MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF + 
##     X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + 
##     TotRmsAbvGrd + Fireplaces + GarageArea + GarageQual, data = train, 
##     nbagg = 500)
# Predict using the test set
house_bag_pred_2 <- predict(house_bag2, test)
model_output2 <- cbind(test, house_bag_pred_2)


model_output2$log_prediction_bag2 <- log(model_output2$house_bag_pred_2)
model_output2$log_SalePrice_bag2 <- log(model_output2$SalePrice)

#Test with RMSE

rmse(model_output2$log_SalePrice_bag2,model_output2$log_prediction_bag2)
## [1] 0.2042387

Next, we look at how the test error changes with the number of bagged trees.

ntree <- c(1, 3, 5, seq(20, 500, 20))
MSE_test <- rep(0, length(ntree))
for(i in seq_along(ntree)){
  bag1 <- bagging(SalePrice~., data = train, nbagg = ntree[i])
  #Named bag_pred so the predict() function is not shadowed
  bag_pred <- predict(bag1, newdata = test)
  MSE_test[i] <- mean((test$SalePrice - bag_pred)^2)
}
plot(ntree, MSE_test, type = 'l', col = 2, lwd = 2, xaxt = "n")
axis(1, at = ntree, las = 1)

The chart shows the first drop in error at around 20 trees, with the most significant drop at around 200 trees.
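
To make the mechanics behind the chart concrete, bagging can be sketched by hand: fit one regression tree per bootstrap sample and average the predictions. This is an illustration using rpart on simulated data, not the ipred implementation used in the report; bagged_predict and the toy data are hypothetical.

```r
library(rpart)  # recommended package that ships with standard R installations

set.seed(621)
n <- 200
# Simulated data with a smooth effect (x1) and a step effect (x2)
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 10 * toy$x1 + 5 * (toy$x2 > 0.5) + rnorm(n)
toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, ]

# Fit one tree per bootstrap sample and average the test-set predictions
bagged_predict <- function(n_trees) {
  tree_preds <- sapply(seq_len(n_trees), function(b) {
    boot_rows <- sample(nrow(toy_train), replace = TRUE)
    tree <- rpart(y ~ x1 + x2, data = toy_train[boot_rows, ])
    predict(tree, newdata = toy_test)
  })
  rowMeans(as.matrix(tree_preds))
}

# Test MSE for an increasing number of trees
tree_counts <- c(1, 5, 25, 100)
test_mse <- sapply(tree_counts, function(k) {
  mean((toy_test$y - bagged_predict(k))^2)
})
data.frame(tree_counts, test_mse)
```

Averaging more trees mainly reduces variance, so the error curve typically flattens after a certain number of trees, which is why a smaller nbagg can be chosen without much loss.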

house_bag3 <- bagging(formula = SalePrice ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data = train, nbagg = 250) 
house_bag3
## 
## Bagging regression trees with 250 bootstrap replications 
## 
## Call: bagging.data.frame(formula = SalePrice ~ LotArea + Street + Neighborhood + 
##     Condition1 + BldgType + OverallCond + OverallQual + YearBuilt + 
##     RoofMatl + MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF + 
##     X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + 
##     TotRmsAbvGrd + Fireplaces + GarageArea + GarageQual, data = train, 
##     nbagg = 250)
# Predict using the test set
house_bag_pred_3 <- predict(house_bag3, test)
model_output3 <- cbind(test, house_bag_pred_3)


model_output3$log_prediction_bag3 <- log(model_output3$house_bag_pred_3)
model_output3$log_SalePrice_bag3 <- log(model_output3$SalePrice)

#Test with RMSE

rmse(model_output3$log_SalePrice_bag3,model_output3$log_prediction_bag3)
## [1] 0.2036751

Bagging did not show an improvement to the RMSE over previous models.

Data Analysis - Random Forest

#Package for Random Forest
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
house_rf <- randomForest(SalePrice~., data = train, importance = TRUE) 
house_rf
## 
## Call:
##  randomForest(formula = SalePrice ~ ., data = train, importance = TRUE) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 26
## 
##           Mean of squared residuals: 862144873
##                     % Var explained: 85.88

Since random forest performs its own variable selection when choosing splits, there is no need to pre-select the variables that showed correlations previously.

# Predict using the test set
prediction_rf <- predict(house_rf, test)
model_output_rf <- cbind(test, prediction_rf)


model_output_rf$log_prediction_rf <- log(model_output_rf$prediction_rf)
model_output_rf$log_SalePrice_rf <- log(model_output_rf$SalePrice)

#Test with RMSE

rmse(model_output_rf$log_SalePrice_rf,model_output_rf$log_prediction_rf)
## [1] 0.1440042

The random forest has a smaller error than bagging, with a log-scale RMSE of 0.1440.

Data Analysis - XGBoost

#Package for XGBoost
#install.packages('xgboost')
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.2.2

Splitting the data again:

#Partition the data to be 90% train and 10% test
partition2 <- createDataPartition(y=outcome, p=.9, list=FALSE)
train <- housing_train[partition2,]
test <- housing_train[-partition2,]

The first step is to transform the data set into a sparse matrix.

#Assemble and format the data - Using Log for Variable Sale Price
train$log_SalePrice <- log(train$SalePrice)
test$log_SalePrice <- log(test$SalePrice)

#Create matrices from the data frames
trainData<- as.matrix(train, rownames.force=NA)
testData<- as.matrix(test, rownames.force=NA)

#Turn the matrices into sparse matrices
train2 <- as(trainData, "sparseMatrix")
test2 <- as(testData, "sparseMatrix")
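
The conversion from data frame to sparse matrix only behaves as intended when every column is numeric, because as.matrix() turns the whole matrix into character if a single column is not. The sketch below shows the check and the conversion on invented data, using Matrix() from the Matrix package as one possible way to build the sparse object.

```r
library(Matrix)

# Invented numeric data with many zeros, as in the housing columns
toy <- data.frame(
  LotArea    = c(8450, 9600, 0, 9550),
  PoolArea   = c(0, 0, 0, 512),
  Fireplaces = c(0, 1, 1, 0)
)

# Guard: every column must be numeric before converting
all_numeric <- all(sapply(toy, is.numeric))
all_numeric

dense  <- as.matrix(toy)
sparse <- Matrix(dense, sparse = TRUE)
class(sparse)

# A single text column silently changes the whole matrix to character
mixed <- as.matrix(data.frame(LotArea = 8450, Street = "Pave"))
is.character(mixed)
```

A sparse matrix stores only the non-zero entries, which is why it suits columns such as PoolArea that are zero for most houses.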

#colnames(train2)
#colnames(pred_data)
#Select the predictor columns and build the training DMatrix
vars <- c(1:78) #Choose the variables 
trainD <- xgb.DMatrix(data = train2[,vars], label = train2[,"SalePrice"]) #Convert to xgb.DMatrix format for space and efficiency 

Creating a cross validation model:

#Cross validate the model
cv.sparse <- xgb.cv(data = trainD,
                    nrounds = 500,
                    min_child_weight = 0,
                    max_depth = 10,
                    eta = 0.04,
                    subsample = .7,
                    colsample_bytree = .7,
                    booster = "gbtree",
                    eval_metric = "rmse",
                    print_every_n = 100,
                    nfold = 4,
                    nthread = 2,
                    objective="reg:linear")
## [18:36:13] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [18:36:13] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [18:36:13] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [18:36:13] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:189050.919230+1173.054544    test-rmse:189103.397150+3690.871760 
## [101]    train-rmse:9093.985430+481.279412   test-rmse:29463.944696+2769.224325 
## [201]    train-rmse:2628.380058+372.538611   test-rmse:28734.947692+3085.620299 
## [301]    train-rmse:1076.411601+197.517498   test-rmse:28773.133768+3113.038632 
## [401]    train-rmse:444.516212+103.406670    test-rmse:28751.591512+3095.618314 
## [500]    train-rmse:179.000714+63.906102 test-rmse:28760.266901+3094.409819
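
The printed cross-validation log above can be read as a small table: the training RMSE keeps falling while the held-out RMSE stops improving after roughly 200 rounds, which is the usual sign of overfitting. The sketch below copies the six printed rows into a data frame by hand (cv_log is a hypothetical name); with the fitted object, the full per-round table should be available as cv.sparse$evaluation_log.

```r
# Mean RMSE values copied from the printed xgb.cv output
cv_log <- data.frame(
  iter       = c(1, 101, 201, 301, 401, 500),
  train_rmse = c(189050.919230, 9093.985430, 2628.380058,
                 1076.411601, 444.516212, 179.000714),
  test_rmse  = c(189103.397150, 29463.944696, 28734.947692,
                 28773.133768, 28751.591512, 28760.266901)
)

# Printed round with the lowest held-out error
best_row <- cv_log[which.min(cv_log$test_rmse), ]
best_row$iter

# Gap between held-out and training error grows as rounds are added
cv_log$gap <- cv_log$test_rmse - cv_log$train_rmse
cv_log
```

Since only every 100th round is printed, the true minimum may lie between the rows shown. xgb.cv also accepts an early_stopping_rounds argument, which stops training once the held-out metric has not improved for the given number of rounds.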
#Choose the parameters for the model - tuning the model
param <- list(colsample_bytree = .7, #amount of features for each tree
             subsample = .7, #fractions of observation for random samples bt .5 and 1 lower than .5 is very conservative model
             booster = "gbtree", #tree Based model for a linear model use 'gblinear' 
             max_depth = 10, #maximun dept of a tree
             eta = 0.04, #makes the model more robust by shrinking the weight of each step 
             eval_metric = "rmse",
             objective="reg:linear")
#Train the model using those parameters
bstSparse <-
  xgb.train(params = param,
            data = trainD,
            nrounds = 500,
            watchlist = list(train = trainD),
            verbose = TRUE,
            print_every_n = 100,
            nthread = 2)
## [18:36:32] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:189014.967259 
## [101]    train-rmse:9072.405113 
## [201]    train-rmse:2657.627783 
## [301]    train-rmse:1178.258176 
## [401]    train-rmse:479.246070 
## [500]    train-rmse:171.761976

Prediction of the bstSparse Model:

testD <- xgb.DMatrix(data = test2[,vars])
#Column names must match the inputs EXACTLY
prediction <- predict(bstSparse, testD) #Make the prediction on the 10% of the training data that was set aside

#Put testing prediction and test dataset all together
test3 <- as.data.frame(as.matrix(test2))
prediction <- as.data.frame(as.matrix(prediction))
colnames(prediction) <- "prediction"
model_output <- cbind(test3, prediction)

model_output$log_prediction <- log(model_output$prediction)
model_output$log_SalePrice <- log(model_output$SalePrice)

#Test with RMSE

rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] 0.1544653

The RMSE is 0.1545 for the first model in the run shown above. After running many different models with different parameter values, the best RMSE we obtained was 1.11% error (though it varies between 1% and 2%).

#Changing the parameters
param2 <- list(colsample_bytree = .6, 
             subsample = .8, 
             booster = "gbtree", 
             max_depth = 12, 
             eta = 0.05, 
             eval_metric = "rmse",
             objective="reg:linear")

Make a second model

#Train the model using those parameters
bstSparse2 <-
  xgb.train(params = param2,
            data = trainD,
            nrounds = 500,
            watchlist = list(train = trainD),
            verbose = TRUE,
            print_every_n = 100,
            nthread = 2)
## [18:36:38] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:187225.998476 
## [101]    train-rmse:4641.332695 
## [201]    train-rmse:802.036158 
## [301]    train-rmse:189.122680 
## [401]    train-rmse:42.507610 
## [500]    train-rmse:9.741979
#Column names must match the inputs EXACTLY
prediction_2 <- predict(bstSparse2, testD) #Make the prediction based on the half of the training data set aside

#Put testing prediction and test dataset all together
test3 <- as.data.frame(as.matrix(test2))
prediction2 <- as.data.frame(as.matrix(prediction_2))
colnames(prediction2) <- "prediction"
output <- cbind(test3, prediction2)

output$log_prediction_2 <- log(output$prediction)
output$log_SalePrice2 <- log(output$SalePrice)

#Test with RMSE

rmse(output$log_SalePrice2,output$log_prediction_2)
## [1] 0.1564827

The RMSE on the log sale price is 0.1565 in this run, which is slightly higher than the first model's 0.1545.
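One way to keep such comparisons organized is to collect the figures in a small table and rank the models by the held-out metric. The sketch below uses the values printed in the output above; the data frame and its column names are hypothetical.

```r
# Figures copied from the printed output above.
comparison <- data.frame(
  model            = c("bstSparse", "bstSparse2"),
  final_train_rmse = c(171.761976, 9.741979),   # dollars, at round 500
  holdout_log_rmse = c(0.1544653, 0.1564827)    # rmse() on logged prices
)

# Rank by the held-out metric, not by training error.
comparison <- comparison[order(comparison$holdout_log_rmse), ]
print(comparison)

best <- comparison$model[1]
best
```

The second model fits the training rows far more tightly, yet scores marginally worse on the held-out rows. This may indicate that the extra fitting is not improving generalization.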

Preparing the test data set

# Get the supplied test data ready #
predict <- as.data.frame(housing_test) #Get the dataset formatted as a frame for later combining

#Create matrices from the data frames
predData<- as.matrix(predict, rownames.force=NA)

#Turn the matrices into sparse matrices
predicting <- as(predData, "sparseMatrix")
#colnames(train[,c(2:79)])
vars <- c("Id", "MSSubClass", "MSZoning", "LotFrontage", "LotArea", "Street",
          "LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood",
          "Condition1", "Condition2", "BldgType", "HouseStyle", "OverallQual", "OverallCond",
          "YearBuilt", "YearRemodAdd", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd",
          "MasVnrType", "MasVnrArea", "ExterQual", "ExterCond", "Foundation", "BsmtQual",
          "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2",
          "BsmtUnfSF", "TotalBsmtSF", "Heating", "HeatingQC", "CentralAir", "Electrical",
          "X1stFlrSF", "X2ndFlrSF", "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath",
          "FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "KitchenQual", "TotRmsAbvGrd",
          "Functional", "Fireplaces", "FireplaceQu", "GarageType", "GarageYrBlt", "GarageFinish",
          "GarageCars", "GarageArea", "GarageQual", "GarageCond", "PavedDrive", "WoodDeckSF",
          "OpenPorchSF", "EnclosedPorch", "X3SsnPorch", "ScreenPorch", "PoolArea", "PoolQC",
          "Fence", "MiscFeature", "MiscVal", "MoSold", "YrSold", "SaleType",
          "SaleCondition")
colnames(predicting[,vars])
##  [1] "Id"            "MSSubClass"    "MSZoning"      "LotFrontage"  
##  [5] "LotArea"       "Street"        "LotShape"      "LandContour"  
##  [9] "Utilities"     "LotConfig"     "LandSlope"     "Neighborhood" 
## [13] "Condition1"    "Condition2"    "BldgType"      "HouseStyle"   
## [17] "OverallQual"   "OverallCond"   "YearBuilt"     "YearRemodAdd" 
## [21] "RoofStyle"     "RoofMatl"      "Exterior1st"   "Exterior2nd"  
## [25] "MasVnrType"    "MasVnrArea"    "ExterQual"     "ExterCond"    
## [29] "Foundation"    "BsmtQual"      "BsmtCond"      "BsmtExposure" 
## [33] "BsmtFinType1"  "BsmtFinSF1"    "BsmtFinType2"  "BsmtFinSF2"   
## [37] "BsmtUnfSF"     "TotalBsmtSF"   "Heating"       "HeatingQC"    
## [41] "CentralAir"    "Electrical"    "X1stFlrSF"     "X2ndFlrSF"    
## [45] "LowQualFinSF"  "GrLivArea"     "BsmtFullBath"  "BsmtHalfBath" 
## [49] "FullBath"      "HalfBath"      "BedroomAbvGr"  "KitchenAbvGr" 
## [53] "KitchenQual"   "TotRmsAbvGrd"  "Functional"    "Fireplaces"   
## [57] "FireplaceQu"   "GarageType"    "GarageYrBlt"   "GarageFinish" 
## [61] "GarageCars"    "GarageArea"    "GarageQual"    "GarageCond"   
## [65] "PavedDrive"    "WoodDeckSF"    "OpenPorchSF"   "EnclosedPorch"
## [69] "X3SsnPorch"    "ScreenPorch"   "PoolArea"      "PoolQC"       
## [73] "Fence"         "MiscFeature"   "MiscVal"       "MoSold"       
## [77] "YrSold"        "SaleType"      "SaleCondition"
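Since the column names must match the training inputs exactly, it may help to check this programmatically before predicting. The following is a minimal base R sketch; the helper `check_columns` and the toy matrices are hypothetical stand-ins for the real training and prediction inputs.

```r
# Hypothetical helper: stop early if any required column is absent.
check_columns <- function(data, required) {
  missing_cols <- setdiff(required, colnames(data))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  # Return the columns in the order the model expects.
  data[, required, drop = FALSE]
}

# Toy matrices standing in for the training and prediction inputs.
train_toy <- matrix(1:6, nrow = 2,
                    dimnames = list(NULL, c("Id", "LotArea", "YrSold")))
pred_toy  <- matrix(1:6, nrow = 2,
                    dimnames = list(NULL, c("YrSold", "Id", "LotArea")))

required <- colnames(train_toy)
aligned  <- check_columns(pred_toy, required)
identical(colnames(aligned), required)
```

Selecting by name, as `predicting[,vars]` does, also puts the columns in the same order as the training matrix, which is what the helper returns.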
rm(bstSparse)
#Create matrices from the data frames
retrainData<- as.matrix(train, rownames.force=NA)

#Turn the matrices into sparse matrices
retrain <- as(retrainData, "sparseMatrix")

param3 <- list(colsample_bytree = .7,
             subsample = .7,
             booster = "gbtree",
             max_depth = 10,
             eta = 0.04,
             eval_metric = "rmse",
             objective="reg:linear")

retrainD <- xgb.DMatrix(data = retrain[,vars], label = retrain[,"SalePrice"])

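#Retrain the model on the full training data using those parameters
#Note: the watchlist still points at trainD (built earlier), so the train-rmse
#printed during training is measured on that matrix rather than on retrainD
bstSparse3 <-
 xgb.train(params = param3,
           data = retrainD,
           nrounds = 500,
           watchlist = list(train = trainD),
           verbose = TRUE,
           print_every_n = 100,
           nthread = 2)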
## [18:36:45] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1]  train-rmse:189013.622178 
## [101]    train-rmse:11690.407035 
## [201]    train-rmse:6175.194693 
## [301]    train-rmse:5460.061183 
## [401]    train-rmse:5236.044375 
## [500]    train-rmse:5176.225863
#Column names must match the inputs EXACTLY
prediction <- predict(bstSparse3, predicting[,vars])

prediction <- as.data.frame(as.matrix(prediction))  #Get the dataset formatted as a frame for later combining
colnames(prediction) <- "prediction"
model_output <- cbind(predict, prediction) #Combine the prediction output with the rest of the set

results <- data.frame(Id = model_output$Id, SalePrice = model_output$prediction)
length(model_output$prediction)
## [1] 1459

Data Analysis - Results

write.csv(results, file = "Prediction.csv", row.names = F)
head(results$SalePrice)
## [1] 125088.7 158861.6 175157.6 189112.8 190358.8 174472.2

The file contains one sale price prediction for each house in the housing_test set.
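Before submitting, a few basic checks on the written file can catch formatting problems. The sketch below is an illustration under assumptions: `results_toy` stands in for the real `results` data frame (its values are the first three predictions shown above), and the file is written to a temporary path rather than `Prediction.csv`.

```r
# Toy results standing in for the real `results` data frame.
results_toy <- data.frame(Id = 1461:1463,
                          SalePrice = c(125088.7, 158861.6, 175157.6))

# Write to a temporary file and read it back.
path <- tempfile(fileext = ".csv")
write.csv(results_toy, file = path, row.names = FALSE)
check <- read.csv(path)

# Basic checks: expected columns, one row per Id, usable prices.
stopifnot(
  identical(names(check), c("Id", "SalePrice")),
  nrow(check) == nrow(results_toy),
  !anyDuplicated(check$Id),
  !anyNA(check$SalePrice),
  all(check$SalePrice > 0)
)
```

For the real file, the row count would be compared against the 1,459 rows reported by `length(model_output$prediction)`.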

summary <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Project/Project Presentation/Prediction.csv")
head(summary)
##     Id SalePrice
## 1 1461  125088.7
## 2 1462  158861.6
## 3 1463  175157.6
## 4 1464  189112.8
## 5 1465  190358.8
## 6 1466  174472.2

Report Summary and Deployment

As stated at the beginning of this analysis, the housing market is in considerable flux, and homes are categorized by numerous data points. Understanding which of those points truly matter is complex, but it can be very helpful when predicting the sale price of a home.

In this analysis, we cleaned and categorized the data so it could be used to build a linear model. We then refined the modeling approach to minimize the RMSE, found XGBoost to be the best approach, and ultimately built a predictive model for this specific housing market.

Moving forward, we can apply this process to other markets independently, compare the statistically significant variables across them, and ultimately expand this model beyond Illinois into a national means of predicting home prices.