Description of dataset

The dataset for my prediction project consists of various house attributes; the Sale Price is included only in the train set.

For this project I’m going to estimate the Sale Price of houses using different models.

To do that, I will use a train set of house characteristics to predict the variable of interest (SalePrice) for another dataset of houses that do not yet have a price.

Both datasets are provided by Kaggle.

The train set has 1460 observations (rows) and the test set has 1459. Both share the same 80 columns (an Id plus 79 house attributes); the train set additionally contains SalePrice.

I will merge them so that the same data cleaning and manipulation is applied to both. The attributes are:

  • MSSubClass: Identifies the type of dwelling involved in the sale.

      20  1-STORY 1946 & NEWER ALL STYLES
      30  1-STORY 1945 & OLDER
      40  1-STORY W/FINISHED ATTIC ALL AGES
      45  1-1/2 STORY - UNFINISHED ALL AGES
      50  1-1/2 STORY FINISHED ALL AGES
      60  2-STORY 1946 & NEWER
      70  2-STORY 1945 & OLDER
      75  2-1/2 STORY ALL AGES
      80  SPLIT OR MULTI-LEVEL
      85  SPLIT FOYER
      90  DUPLEX - ALL STYLES AND AGES
     120  1-STORY PUD (Planned Unit Development) - 1946 & NEWER
     150  1-1/2 STORY PUD - ALL AGES
     160  2-STORY PUD - 1946 & NEWER
     180  PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
     190  2 FAMILY CONVERSION - ALL STYLES AND AGES
  • MSZoning: Identifies the general zoning classification of the sale.

     A    Agriculture
     C    Commercial
     FV   Floating Village Residential
     I    Industrial
     RH   Residential High Density
     RL   Residential Low Density
     RP   Residential Low Density Park 
     RM   Residential Medium Density
  • LotFrontage: Linear feet of street connected to property

  • LotArea: Lot size in square feet

  • Street: Type of road access to property

     Grvl Gravel  
     Pave Paved
  • Alley: Type of alley access to property

     Grvl Gravel
     Pave Paved
     NA   No alley access
  • LotShape: General shape of property

     Reg  Regular 
     IR1  Slightly irregular
     IR2  Moderately Irregular
     IR3  Irregular
  • LandContour: Flatness of the property

     Lvl  Near Flat/Level 
     Bnk  Banked - Quick and significant rise from street grade to building
     HLS  Hillside - Significant slope from side to side
     Low  Depression
  • Utilities: Type of utilities available

     AllPub   All public Utilities (E,G,W,& S)    
     NoSewr   Electricity, Gas, and Water (Septic Tank)
     NoSeWa   Electricity and Gas Only
     ELO  Electricity only    
  • LotConfig: Lot configuration

     Inside   Inside lot
     Corner   Corner lot
     CulDSac  Cul-de-sac
     FR2  Frontage on 2 sides of property
     FR3  Frontage on 3 sides of property
  • LandSlope: Slope of property

     Gtl  Gentle slope
     Mod  Moderate Slope  
     Sev  Severe Slope
  • Neighborhood: Physical locations within Ames city limits

     Blmngtn  Bloomington Heights
     Blueste  Bluestem
     BrDale   Briardale
     BrkSide  Brookside
     ClearCr  Clear Creek
     CollgCr  College Creek
     Crawfor  Crawford
     Edwards  Edwards
     Gilbert  Gilbert
     IDOTRR   Iowa DOT and Rail Road
     MeadowV  Meadow Village
     Mitchel  Mitchell
     NAmes    North Ames
     NoRidge  Northridge
     NPkVill  Northpark Villa
     NridgHt  Northridge Heights
     NWAmes   Northwest Ames
     OldTown  Old Town
     SWISU    South & West of Iowa State University
     Sawyer   Sawyer
     SawyerW  Sawyer West
     Somerst  Somerset
     StoneBr  Stone Brook
     Timber   Timberland
     Veenker  Veenker
  • Condition1: Proximity to various conditions

     Artery   Adjacent to arterial street
     Feedr    Adjacent to feeder street   
     Norm Normal  
     RRNn Within 200' of North-South Railroad
     RRAn Adjacent to North-South Railroad
     PosN Near positive off-site feature--park, greenbelt, etc.
     PosA Adjacent to positive off-site feature
     RRNe Within 200' of East-West Railroad
     RRAe Adjacent to East-West Railroad
  • Condition2: Proximity to various conditions (if more than one is present)

     Artery   Adjacent to arterial street
     Feedr    Adjacent to feeder street   
     Norm Normal  
     RRNn Within 200' of North-South Railroad
     RRAn Adjacent to North-South Railroad
     PosN Near positive off-site feature--park, greenbelt, etc.
     PosA Adjacent to positive off-site feature
     RRNe Within 200' of East-West Railroad
     RRAe Adjacent to East-West Railroad
  • BldgType: Type of dwelling

     1Fam Single-family Detached  
     2FmCon   Two-family Conversion; originally built as one-family dwelling
     Duplx    Duplex
     TwnhsE   Townhouse End Unit
     TwnhsI   Townhouse Inside Unit
  • HouseStyle: Style of dwelling

     1Story   One story
     1.5Fin   One and one-half story: 2nd level finished
     1.5Unf   One and one-half story: 2nd level unfinished
     2Story   Two story
     2.5Fin   Two and one-half story: 2nd level finished
     2.5Unf   Two and one-half story: 2nd level unfinished
     SFoyer   Split Foyer
     SLvl Split Level
  • OverallQual: Rates the overall material and finish of the house

     10   Very Excellent
     9    Excellent
     8    Very Good
     7    Good
     6    Above Average
     5    Average
     4    Below Average
     3    Fair
     2    Poor
     1    Very Poor
  • OverallCond: Rates the overall condition of the house

     10   Very Excellent
     9    Excellent
     8    Very Good
     7    Good
     6    Above Average   
     5    Average
     4    Below Average   
     3    Fair
     2    Poor
     1    Very Poor
  • YearBuilt: Original construction date

  • YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)

  • RoofStyle: Type of roof

     Flat Flat
     Gable    Gable
     Gambrel  Gambrel (Barn)
     Hip  Hip
     Mansard  Mansard
     Shed Shed
  • RoofMatl: Roof material

     ClyTile  Clay or Tile
     CompShg  Standard (Composite) Shingle
     Membran  Membrane
     Metal    Metal
     Roll Roll
     Tar&Grv  Gravel & Tar
     WdShake  Wood Shakes
     WdShngl  Wood Shingles
  • Exterior1st: Exterior covering on house

     AsbShng  Asbestos Shingles
     AsphShn  Asphalt Shingles
     BrkComm  Brick Common
     BrkFace  Brick Face
     CBlock   Cinder Block
     CemntBd  Cement Board
     HdBoard  Hard Board
     ImStucc  Imitation Stucco
     MetalSd  Metal Siding
     Other    Other
     Plywood  Plywood
     PreCast  PreCast 
     Stone    Stone
     Stucco   Stucco
     VinylSd  Vinyl Siding
     Wd Sdng  Wood Siding
     WdShing  Wood Shingles
  • Exterior2nd: Exterior covering on house (if more than one material)

     AsbShng  Asbestos Shingles
     AsphShn  Asphalt Shingles
     BrkComm  Brick Common
     BrkFace  Brick Face
     CBlock   Cinder Block
     CemntBd  Cement Board
     HdBoard  Hard Board
     ImStucc  Imitation Stucco
     MetalSd  Metal Siding
     Other    Other
     Plywood  Plywood
     PreCast  PreCast
     Stone    Stone
     Stucco   Stucco
     VinylSd  Vinyl Siding
     Wd Sdng  Wood Siding
     WdShing  Wood Shingles
  • MasVnrType: Masonry veneer type

     BrkCmn   Brick Common
     BrkFace  Brick Face
     CBlock   Cinder Block
     None None
     Stone    Stone
  • MasVnrArea: Masonry veneer area in square feet

  • ExterQual: Evaluates the quality of the material on the exterior

     Ex   Excellent
     Gd   Good
     TA   Average/Typical
     Fa   Fair
     Po   Poor
  • ExterCond: Evaluates the present condition of the material on the exterior

     Ex   Excellent
     Gd   Good
     TA   Average/Typical
     Fa   Fair
     Po   Poor
  • Foundation: Type of foundation

     BrkTil   Brick & Tile
     CBlock   Cinder Block
     PConc    Poured Concrete
     Slab Slab
     Stone    Stone
     Wood Wood
  • BsmtQual: Evaluates the height of the basement

     Ex   Excellent (100+ inches) 
     Gd   Good (90-99 inches)
     TA   Typical (80-89 inches)
     Fa   Fair (70-79 inches)
     Po   Poor (<70 inches)
     NA   No Basement
  • BsmtCond: Evaluates the general condition of the basement

     Ex   Excellent
     Gd   Good
     TA   Typical - slight dampness allowed
     Fa   Fair - dampness or some cracking or settling
     Po   Poor - Severe cracking, settling, or wetness
     NA   No Basement
  • BsmtExposure: Refers to walkout or garden level walls

     Gd   Good Exposure
     Av   Average Exposure (split levels or foyers typically score average or above)  
     Mn   Minimum Exposure
     No   No Exposure
     NA   No Basement
  • BsmtFinType1: Rating of basement finished area

     GLQ  Good Living Quarters
     ALQ  Average Living Quarters
     BLQ  Below Average Living Quarters   
     Rec  Average Rec Room
     LwQ  Low Quality
     Unf  Unfinished
     NA   No Basement
  • BsmtFinSF1: Type 1 finished square feet

  • BsmtFinType2: Rating of basement finished area (if multiple types)

     GLQ  Good Living Quarters
     ALQ  Average Living Quarters
     BLQ  Below Average Living Quarters   
     Rec  Average Rec Room
     LwQ  Low Quality
     Unf  Unfinished
     NA   No Basement
  • BsmtFinSF2: Type 2 finished square feet

  • BsmtUnfSF: Unfinished square feet of basement area

  • TotalBsmtSF: Total square feet of basement area

  • Heating: Type of heating

     Floor    Floor Furnace
     GasA Gas forced warm air furnace
     GasW Gas hot water or steam heat
     Grav Gravity furnace 
     OthW Hot water or steam heat other than gas
     Wall Wall furnace
  • HeatingQC: Heating quality and condition

     Ex   Excellent
     Gd   Good
     TA   Average/Typical
     Fa   Fair
     Po   Poor
  • CentralAir: Central air conditioning

     N    No
     Y    Yes
  • Electrical: Electrical system

     SBrkr    Standard Circuit Breakers & Romex
     FuseA    Fuse Box over 60 AMP and all Romex wiring (Average) 
     FuseF    60 AMP Fuse Box and mostly Romex wiring (Fair)
     FuseP    60 AMP Fuse Box and mostly knob & tube wiring (poor)
     Mix  Mixed
  • 1stFlrSF: First Floor square feet

  • 2ndFlrSF: Second floor square feet

  • LowQualFinSF: Low quality finished square feet (all floors)

  • GrLivArea: Above grade (ground) living area square feet

  • BsmtFullBath: Basement full bathrooms

  • BsmtHalfBath: Basement half bathrooms

  • FullBath: Full bathrooms above grade

  • HalfBath: Half baths above grade

  • BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)

  • KitchenAbvGr: Kitchens above grade

  • KitchenQual: Kitchen quality

     Ex   Excellent
     Gd   Good
     TA   Typical/Average
     Fa   Fair
     Po   Poor
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

  • Functional: Home functionality (Assume typical unless deductions are warranted)

     Typ  Typical Functionality
     Min1 Minor Deductions 1
     Min2 Minor Deductions 2
     Mod  Moderate Deductions
     Maj1 Major Deductions 1
     Maj2 Major Deductions 2
     Sev  Severely Damaged
     Sal  Salvage only
  • Fireplaces: Number of fireplaces

  • FireplaceQu: Fireplace quality

     Ex   Excellent - Exceptional Masonry Fireplace
     Gd   Good - Masonry Fireplace in main level
     TA   Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
     Fa   Fair - Prefabricated Fireplace in basement
     Po   Poor - Ben Franklin Stove
     NA   No Fireplace
  • GarageType: Garage location

     2Types   More than one type of garage
     Attchd   Attached to home
     Basment  Basement Garage
     BuiltIn  Built-In (Garage part of house - typically has room above garage)
     CarPort  Car Port
     Detchd   Detached from home
     NA   No Garage
  • GarageYrBlt: Year garage was built

  • GarageFinish: Interior finish of the garage

     Fin  Finished
     RFn  Rough Finished  
     Unf  Unfinished
     NA   No Garage
  • GarageCars: Size of garage in car capacity

  • GarageArea: Size of garage in square feet

  • GarageQual: Garage quality

     Ex   Excellent
     Gd   Good
     TA   Typical/Average
     Fa   Fair
     Po   Poor
     NA   No Garage
  • GarageCond: Garage condition

     Ex   Excellent
     Gd   Good
     TA   Typical/Average
     Fa   Fair
     Po   Poor
     NA   No Garage
  • PavedDrive: Paved driveway

     Y    Paved 
     P    Partial Pavement
     N    Dirt/Gravel
  • WoodDeckSF: Wood deck area in square feet

  • OpenPorchSF: Open porch area in square feet

  • EnclosedPorch: Enclosed porch area in square feet

  • 3SsnPorch: Three season porch area in square feet

  • ScreenPorch: Screen porch area in square feet

  • PoolArea: Pool area in square feet

  • PoolQC: Pool quality

     Ex   Excellent
     Gd   Good
     TA   Average/Typical
     Fa   Fair
     NA   No Pool
  • Fence: Fence quality

     GdPrv    Good Privacy
     MnPrv    Minimum Privacy
     GdWo Good Wood
     MnWw Minimum Wood/Wire
     NA   No Fence
  • MiscFeature: Miscellaneous feature not covered in other categories

     Elev Elevator
     Gar2 2nd Garage (if not described in garage section)
     Othr Other
     Shed Shed (over 100 SF)
     TenC Tennis Court
     NA   None
  • MiscVal: $Value of miscellaneous feature

  • MoSold: Month Sold (MM)

  • YrSold: Year Sold (YYYY)

  • SaleType: Type of sale

     WD   Warranty Deed - Conventional
     CWD  Warranty Deed - Cash
     VWD  Warranty Deed - VA Loan
     New  Home just constructed and sold
     COD  Court Officer Deed/Estate
     Con  Contract 15% Down payment regular terms
     ConLw    Contract Low Down payment and low interest
     ConLI    Contract Low Interest
     ConLD    Contract Low Down
     Oth  Other
  • SaleCondition: Condition of sale

     Normal   Normal Sale
     Abnorml  Abnormal Sale -  trade, foreclosure, short sale
     AdjLand  Adjoining Land Purchase
     Alloca   Allocation - two linked properties with separate deeds, typically condo with a garage unit  
     Family   Sale between family members
     Partial  Home was not completed when last assessed (associated with New Homes)
test$SalePrice <- 0
houses <- bind_rows(train, test)
houses %>% glimpse()
## Rows: 2,919
## Columns: 81
## $ Id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ MSSubClass    <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,~
## $ MSZoning      <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R~
## $ LotFrontage   <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, ~
## $ LotArea       <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612~
## $ Street        <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ~
## $ Alley         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ LotShape      <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", ~
## $ LandContour   <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", ~
## $ Utilities     <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu~
## $ LotConfig     <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I~
## $ LandSlope     <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", ~
## $ Neighborhood  <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "~
## $ Condition1    <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",~
## $ Condition2    <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ~
## $ BldgType      <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ~
## $ HouseStyle    <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi~
## $ OverallQual   <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,~
## $ OverallCond   <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,~
## $ YearBuilt     <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19~
## $ YearRemodAdd  <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19~
## $ RoofStyle     <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G~
## $ RoofMatl      <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "~
## $ Exterior1st   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "~
## $ Exterior2nd   <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "~
## $ MasVnrType    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",~
## $ MasVnrArea    <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, ~
## $ ExterQual     <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ ExterCond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ Foundation    <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "~
## $ BsmtQual      <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T~
## $ BsmtCond      <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T~
## $ BsmtExposure  <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N~
## $ BsmtFinType1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", ~
## $ BsmtFinSF1    <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99~
## $ BsmtFinType2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", ~
## $ BsmtFinSF2    <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ BsmtUnfSF     <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17~
## $ TotalBsmtSF   <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10~
## $ Heating       <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ~
## $ HeatingQC     <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E~
## $ CentralAir    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ Electrical    <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S~
## $ X1stFlrSF     <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, ~
## $ X2ndFlrSF     <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,~
## $ LowQualFinSF  <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ GrLivArea     <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10~
## $ BsmtFullBath  <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,~
## $ BsmtHalfBath  <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ FullBath      <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,~
## $ HalfBath      <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,~
## $ BedroomAbvGr  <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,~
## $ KitchenAbvGr  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,~
## $ KitchenQual   <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ TotRmsAbvGrd  <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6~
## $ Functional    <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", ~
## $ Fireplaces    <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,~
## $ FireplaceQu   <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", ~
## $ GarageType    <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch~
## $ GarageYrBlt   <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19~
## $ GarageFinish  <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", ~
## $ GarageCars    <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,~
## $ GarageArea    <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7~
## $ GarageQual    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G~
## $ GarageCond    <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ PavedDrive    <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ WoodDeckSF    <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160~
## $ OpenPorchSF   <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,~
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, ~
## $ X3SsnPorch    <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ ScreenPorch   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, ~
## $ PoolArea      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ PoolQC        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Fence         <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,~
## $ MiscFeature   <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, ~
## $ MiscVal       <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,~
## $ MoSold        <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10~
## $ YrSold        <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20~
## $ SaleType      <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W~
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm~
## $ SalePrice     <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, ~
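
Because the test rows receive a placeholder SalePrice of 0, it may help to record which file each row came from, so that the two sets can be separated again after cleaning. The following is a minimal base-R sketch on toy data; the `source` column and the toy data frames are assumptions made for illustration and are not part of the original code.

```r
# Toy stand-ins for the Kaggle train and test files (hypothetical values)
train_toy <- data.frame(Id = 1:3, LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
test_toy  <- data.frame(Id = 4:5, LotArea = c(11622, 14267))

# Tag each row with its origin; "source" is an illustrative column name
train_toy$source <- "train"
test_toy$source  <- "test"

# Placeholder target for the test rows, as in the code above
test_toy$SalePrice <- 0

# rbind() matches data frame columns by name, so the column order may differ
houses_toy <- rbind(train_toy, test_toy)

# After cleaning, the two sets can be recovered from the marker
train_clean <- houses_toy[houses_toy$source == "train", ]
test_clean  <- houses_toy[houses_toy$source == "test", ]
```

With a marker column the placeholder price never has to be used to tell the two sets apart.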

Analyses of the data

An accurate estimate helps identify the price at which to sell a house so as to maximize profit and sell as quickly as possible.

Thus, the purpose of this project is to build several models using different Machine Learning techniques in order to estimate the value of houses.

I examine whether the different models significantly improve the prediction accuracy in terms of Mean Squared Error and R squared.
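
For reference, the two metrics can be computed as in the sketch below. The function names `mse` and `r_squared` and the toy price vectors are assumptions for illustration; the models in this project may report these quantities through their own tooling.

```r
# Mean Squared Error: average of the squared prediction errors
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# R squared: 1 - (residual sum of squares / total sum of squares)
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Toy prices (hypothetical values); every prediction is off by 10000
actual    <- c(200000, 150000, 300000, 250000)
predicted <- c(210000, 140000, 290000, 260000)

mse(actual, predicted)        # 1e+08
r_squared(actual, predicted)  # 0.968
```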

Furthermore, I will assess the importance of the house characteristics and how it changes across the different models.

Data Preparation

Checking for duplicated values

First, ensure that there are no duplicates (the Id column is removed so that otherwise identical rows are recognised as duplicates).

houses[,-1][duplicated(houses)]
## data frame with 0 columns and 2919 rows
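
Note that the expression above appears to pass the logical vector from `duplicated()` in the column position, so a result with 0 columns and all 2919 rows reflects column selection rather than a row-wise comparison. A row-wise check could look like the sketch below, shown on a toy data frame (the toy values are assumptions, not taken from the dataset).

```r
# Toy data: rows 1 and 3 are identical apart from Id
toy <- data.frame(Id = 1:4,
                  LotArea = c(8450, 9600, 8450, 11250),
                  Street  = c("Pave", "Pave", "Pave", "Grvl"))

# Drop Id (column 1), then flag rows that repeat an earlier row
dup_rows <- duplicated(toy[, -1])

sum(dup_rows)      # number of duplicated rows
toy[dup_rows, ]    # the duplicated rows themselves, with their Id
```

Applied to the merged data, the analogous calls would be `sum(duplicated(houses[, -1]))` and `houses[duplicated(houses[, -1]), ]`.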

Finding “NA” values

From the dataset description I saw that houses without a basement have “NA”, so let’s replace them with “NoB” in all the related variables.

Likewise, houses without a garage have “NA”, so let’s substitute them with “NoG” in all the related variables. GarageYrBlt is numeric, so in order not to drop the houses without a garage from the sample I replace its “NA” values with 0; the commented-out alternative uses the mean of the observed years instead.

Furthermore, the dataset description shows that houses without alley access, fireplace, pool, fence or miscellaneous feature have “NA”, so I replace them with “NoAll”, “NoFp”, “NoPo”, “NoFen” and “None” respectively.

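# BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2: NA means no basement
houses[,c(31:34,36)][is.na(houses[,c(31:34,36)])] <- "NoB"
# GarageType, GarageFinish, GarageQual, GarageCond: NA means no garage
houses[,c(59,61,64,65)][is.na(houses[,c(59,61,64,65)])] <- "NoG"
# GarageYrBlt (numeric): observed years, used only by the alternative below
gar_year <- na.omit(houses[,60])
houses[,60][is.na(houses[,60])] <- 0
#or
#houses[,60][is.na(houses[,60])] <- as.integer(mean(gar_year))
houses[,7][is.na(houses[,7])] <- "NoAll"   # Alley
houses[,58][is.na(houses[,58])] <- "NoFp"  # FireplaceQu
houses[,73][is.na(houses[,73])] <- "NoPo"  # PoolQC
houses[,74][is.na(houses[,74])] <- "NoFen" # Fence
houses[,75][is.na(houses[,75])] <- "None"  # MiscFeature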
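
Positional indices such as `31:34` break silently if a column is added or reordered. As a hedged alternative, the same replacements can be written against column names; the helper `fill_na_label` below is hypothetical and the toy data frame is an assumption for illustration.

```r
# Toy data with structural NAs (no basement / no garage)
toy <- data.frame(BsmtQual   = c("Gd", NA, "TA"),
                  BsmtCond   = c("TA", NA, "Gd"),
                  GarageType = c("Attchd", "Detchd", NA),
                  stringsAsFactors = FALSE)

# Hypothetical helper: fill NAs in the named columns with a label
fill_na_label <- function(df, cols, label) {
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- label
  }
  df
}

toy <- fill_na_label(toy, c("BsmtQual", "BsmtCond"), "NoB")
toy <- fill_na_label(toy, "GarageType", "NoG")
toy
```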

Before continuing, I check whether and where “NA” values are still present.

First, I check whether any variable has more than 1% of “NA” values.

NA_values <- matrix(NA,81,2)

for(i in 1:81){
  NA_values[i,1] <- sum(is.na(houses[,i]))/2919*100
}

var_with_NA <- NA

for(i in 1:81){
  if(NA_values[i,1] > 1){
    var_with_NA[i] <- i
  }
}

var_with_NA
## [1] NA NA NA  4
var_with_NA <- 4
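
The loop above computes the percentage of missing values for each column. The same quantity can be obtained in one vectorised step, which also returns column names instead of positions. This is a sketch on toy data; the toy values and the 30% threshold are arbitrary assumptions chosen so that the example has a visible result.

```r
# Toy data with missing values in two columns
toy <- data.frame(Id          = 1:5,
                  LotFrontage = c(65, NA, 68, NA, 84),
                  LotArea     = c(8450, 9600, 11250, 9550, 14260),
                  MasVnrArea  = c(196, 0, NA, 0, 350))

# is.na() gives a logical matrix; its column means are the NA shares
na_pct <- colMeans(is.na(toy)) * 100

# Names of the columns above the chosen threshold
high_na <- names(na_pct[na_pct > 30])
high_na
```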

LotFrontage has many “NA” values, so let’s replace them with 0.

houses[,4][is.na(houses[,4])] <- 0

Now I check how many “NA” values are still present in the dataset and inspect them one by one.

sum(is.na(houses))
## [1] 70
which(is.na(houses), arr.ind=TRUE)
##        row col
##  [1,] 1916   3
##  [2,] 2217   3
##  [3,] 2251   3
##  [4,] 2905   3
##  [5,] 1916  10
##  [6,] 1946  10
##  [7,] 2152  24
##  [8,] 2152  25
##  [9,]  235  26
## [10,]  530  26
## [11,]  651  26
## [12,]  937  26
## [13,]  974  26
## [14,]  978  26
## [15,] 1244  26
## [16,] 1279  26
## [17,] 1692  26
## [18,] 1707  26
## [19,] 1883  26
## [20,] 1993  26
## [21,] 2005  26
## [22,] 2042  26
## [23,] 2312  26
## [24,] 2326  26
## [25,] 2341  26
## [26,] 2350  26
## [27,] 2369  26
## [28,] 2593  26
## [29,] 2611  26
## [30,] 2658  26
## [31,] 2687  26
## [32,] 2863  26
## [33,]  235  27
## [34,]  530  27
## [35,]  651  27
## [36,]  937  27
## [37,]  974  27
## [38,]  978  27
## [39,] 1244  27
## [40,] 1279  27
## [41,] 1692  27
## [42,] 1707  27
## [43,] 1883  27
## [44,] 1993  27
## [45,] 2005  27
## [46,] 2042  27
## [47,] 2312  27
## [48,] 2326  27
## [49,] 2341  27
## [50,] 2350  27
## [51,] 2369  27
## [52,] 2593  27
## [53,] 2658  27
## [54,] 2687  27
## [55,] 2863  27
## [56,] 2121  35
## [57,] 2121  37
## [58,] 2121  38
## [59,] 2121  39
## [60,] 1380  43
## [61,] 2121  48
## [62,] 2189  48
## [63,] 2121  49
## [64,] 2189  49
## [65,] 1556  54
## [66,] 2217  56
## [67,] 2474  56
## [68,] 2577  62
## [69,] 2577  63
## [70,] 2490  79
  • 4 observations have NA in MSZoning; let’s assign them to the most common category (“RL”).
prop.table(table(houses$MSZoning))
## 
##     C (all)          FV          RH          RL          RM 
## 0.008576329 0.047684391 0.008919383 0.777015437 0.157804460
houses[,3][is.na(houses[,3])] <- "RL"
  • 2 observations have NA in Utilities; 99.9% of houses have the level “AllPub”, so I replace NA with that value (later I’ll delete this variable).
prop.table(table(houses$Utilities))
## 
##      AllPub      NoSeWa 
## 0.999657182 0.000342818
houses[,10][is.na(houses[,10])] <- "AllPub"
  • 1 observation has NA in Exterior1st and Exterior2nd; let’s replace both with the level “Other”, already present as an Exterior2nd level (later I’ll group the less frequent Exterior1st levels).
prop.table(table(houses$Exterior1st))
## 
##      AsbShng      AsphShn      BrkComm      BrkFace       CBlock      CemntBd 
## 0.0150788211 0.0006854010 0.0020562029 0.0298149417 0.0006854010 0.0431802605 
##      HdBoard      ImStucc      MetalSd      Plywood        Stone       Stucco 
## 0.1514736121 0.0003427005 0.1542152159 0.0757368060 0.0006854010 0.0147361206 
##      VinylSd      Wd Sdng      WdShing 
## 0.3512679918 0.1408498972 0.0191912269
houses$Exterior1st <- as.character(houses$Exterior1st)
houses[,24][is.na(houses[,24])] <- "Other"
houses$Exterior1st <- as.factor(houses$Exterior1st)
houses[,25][is.na(houses[,25])] <- "Other"
  • “Masonry veneer type” and “Masonry veneer area in square feet” have the most remaining NA values, probably because there is no masonry veneer; let’s replace the first with the level “None”, already present as a variable level, and the second with 0.
houses[,26][is.na(houses[,26])] <- "None"
houses[,27][is.na(houses[,27])] <- 0
  • Observations n2121 and n2189 have NA in some basement variables; inspecting n2121 shows that the house has no basement, so let’s set those variables to 0.
houses[2121,]
##        Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 2121 2121         20       RM          99    5940   Pave NoAll      IR1
##      LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 2121         Lvl    AllPub       FR3       Gtl      BrkSide      Feedr
##      Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 2121       Norm     1Fam     1Story           4           7      1946
##      YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 2121         1950     Gable  CompShg     MetalSd      CBlock       None
##      MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 2121          0        TA        TA      PConc      NoB      NoB          NoB
##      BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 2121          NoB         NA          NoB         NA        NA          NA
##      Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 2121    GasA        TA          Y      FuseA       896         0            0
##      GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 2121       896           NA           NA        1        0            2
##      KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 2121            1          TA            4        Typ          0        NoFp
##      GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 2121     Detchd        1946          Unf          1        280         TA
##      GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 2121         TA          Y          0           0             0          0
##      ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 2121           0        0   NoPo MnPrv        None       0      4   2008
##      SaleType SaleCondition SalePrice
## 2121    ConLD       Abnorml         0
houses[,c(35,37:39,48,49)][is.na(houses[,c(35,37:39,48,49)])] <- 0
  • 1 observation has NA in Electrical; let’s replace it with “Other”, a level that I will create later in order to merge the less frequent Electrical levels.
prop.table(table(houses$Electrical))
## 
##        FuseA        FuseF        FuseP          Mix        SBrkr 
## 0.0644276902 0.0171350240 0.0027416038 0.0003427005 0.9153529815
houses[,43][is.na(houses[,43])] <- "Other"
  • 1 observation has NA in KitchenQual; let’s replace it with “TA” (Typical), the most frequent level of this variable.
prop.table(table(houses$KitchenQual))
## 
##         Ex         Fa         Gd         TA 
## 0.07025360 0.02398903 0.39444825 0.51130912
houses[,54][is.na(houses[,54])] <- "TA"
  • 2 observations have NA in Functional; let’s replace them with “Typ” (Typical Functionality), the most frequent level of this variable.
prop.table(table(houses$Functional))
## 
##         Maj1         Maj2         Min1         Min2          Mod          Sev 
## 0.0065135413 0.0030853617 0.0222831676 0.0239972575 0.0119986287 0.0006856359 
##          Typ 
## 0.9314364073
houses[,56][is.na(houses[,56])] <- "Typ"
  • Observation n2577 has NA in GarageCars and GarageArea; let’s replace both with 0, as for the other houses without a garage.
houses[,c(62,63)][is.na(houses[,c(62,63)])] <- 0
  • 1 observation has NA in SaleType; let’s replace it with “Oth”, a level that is already present.
prop.table(table(houses$SaleType))
## 
##         COD         Con       ConLD       ConLI       ConLw         CWD 
## 0.029814942 0.001713502 0.008910212 0.003084304 0.002741604 0.004112406 
##         New         Oth          WD 
## 0.081905415 0.002398903 0.865318711
houses[,79][is.na(houses[,79])] <- "Oth"
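
Several of the replacements above (MSZoning, Utilities, KitchenQual, Functional) assign the most frequent level to the missing entries. That step can be wrapped in a small function, sketched below; `impute_mode` is a hypothetical helper and the toy vector is an assumption. On ties it takes the first level in `table()` order.

```r
# Hypothetical helper: replace NAs with the most frequent observed value
impute_mode <- function(x) {
  counts <- table(x)                              # table() drops NA by default
  mode_value <- names(counts)[which.max(counts)]  # first level on ties
  x[is.na(x)] <- mode_value
  x
}

# Toy zoning codes with two missing entries
zoning <- c("RL", "RM", "RL", NA, "FV", "RL", NA)
impute_mode(zoning)
```

The variables where a different level was chosen on purpose (Electrical, SaleType, Exterior1st, Exterior2nd) would still be handled explicitly, as in the list above.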

Finally, I check that all “NA” values have been replaced.

sum(is.na(houses))
## [1] 0

Transforming all categorical variables into factors

houses$MSSubClass <- as.factor(houses$MSSubClass)
houses$MSZoning <- as.factor(houses$MSZoning)
houses$Street <- as.factor(houses$Street)
houses$Alley <- as.factor(houses$Alley)
houses$LotShape <-  as.factor(houses$LotShape)
houses$LandContour <- as.factor(houses$LandContour)
houses$Utilities <- as.factor(houses$Utilities)
houses$LotConfig <-  as.factor(houses$LotConfig)
houses$LandSlope <-  as.factor(houses$LandSlope)
houses$Neighborhood <-  as.factor(houses$Neighborhood)
houses$Condition1 <- as.factor(houses$Condition1)
houses$Condition2 <- as.factor(houses$Condition2)
houses$BldgType <- as.factor(houses$BldgType)
houses$HouseStyle <- as.factor(houses$HouseStyle)
houses$RoofStyle <- as.factor(houses$RoofStyle)
houses$RoofMatl <- as.factor(houses$RoofMatl)
houses$Exterior1st <- as.factor(houses$Exterior1st)
houses$Exterior2nd <- as.factor(houses$Exterior2nd)
houses$MasVnrType <- as.factor(houses$MasVnrType)
houses$ExterQual <- as.factor(houses$ExterQual)
houses$ExterCond <- as.factor(houses$ExterCond)
houses$Foundation <- as.factor(houses$Foundation)
houses$BsmtQual <- as.factor(houses$BsmtQual)
houses$BsmtCond <- as.factor(houses$BsmtCond)
houses$BsmtExposure <- as.factor(houses$BsmtExposure)
houses$BsmtFinType1 <- as.factor(houses$BsmtFinType1)
houses$BsmtFinType2 <- as.factor(houses$BsmtFinType2)
houses$Heating <- as.factor(houses$Heating)
houses$HeatingQC <- as.factor(houses$HeatingQC)
houses$CentralAir <- as.factor(houses$CentralAir)
houses$Electrical <- as.factor(houses$Electrical)
houses$KitchenQual <- as.factor(houses$KitchenQual)
houses$Functional <- as.factor(houses$Functional)
houses$FireplaceQu <- as.factor(houses$FireplaceQu)
houses$GarageType <- as.factor(houses$GarageType)
houses$GarageFinish  <- as.factor(houses$GarageFinish)
houses$GarageQual  <- as.factor(houses$GarageQual)
houses$GarageCond  <- as.factor(houses$GarageCond)
houses$PavedDrive <- as.factor(houses$PavedDrive)
houses$PoolQC <- as.factor(houses$PoolQC)
houses$Fence <- as.factor(houses$Fence)
houses$MiscFeature <- as.factor(houses$MiscFeature)
houses$SaleType <- as.factor(houses$SaleType)
houses$SaleCondition <- as.factor(houses$SaleCondition)
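
The column-by-column conversion above is explicit but long. As a sketch of a more compact alternative, every character column can be converted in one pass. Numeric codes that are really categories (such as MSSubClass) still have to be named by hand. The data frame `toy` is a hypothetical stand-in for `houses`:

```r
# Toy stand-in for `houses` (hypothetical values)
toy <- data.frame(
  Street     = c("Pave", "Grvl", "Pave"),
  LotArea    = c(8450, 9600, 11250),
  MSSubClass = c(60, 20, 60),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one pass
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

# Numeric codes that are really categories must be converted explicitly
toy$MSSubClass <- as.factor(toy$MSSubClass)

str(toy)
```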

Inspecting the factor (categorical) variables

factorial_variables <- c(2,3,6:17,22:26,28:34,36,40:43,54,56,58,59,61,64:66,73:75,79,80)
str(houses[,factorial_variables])
## 'data.frame':    2919 obs. of  44 variables:
##  $ MSSubClass   : Factor w/ 16 levels "20","30","40",..: 6 1 6 7 6 5 1 6 5 16 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 3 levels "Grvl","NoAll",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 15 14 14 14 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 3 3 5 3 3 1 3 5 5 ...
##  $ BsmtCond     : Factor w/ 5 levels "Fa","Gd","NoB",..: 5 5 5 2 5 5 5 5 5 5 ...
##  $ BsmtExposure : Factor w/ 5 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 7 3 ...
##  $ BsmtFinType2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 7 7 7 7 7 7 2 7 7 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 6 levels "FuseA","FuseF",..: 6 6 6 6 6 6 6 6 2 6 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ FireplaceQu  : Factor w/ 6 levels "Ex","Fa","Gd",..: 4 6 6 3 6 4 3 6 6 6 ...
##  $ GarageType   : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageFinish : Factor w/ 4 levels "Fin","NoG","RFn",..: 3 3 3 4 3 4 3 3 4 3 ...
##  $ GarageQual   : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 2 3 ...
##  $ GarageCond   : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ PoolQC       : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Fence        : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
##  $ MiscFeature  : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
houses %>%
  gather(Attributes, value, factorial_variables[1:9]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_bar(stat="count", show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Categorical Variables - Histograms") +
  scale_fill_discrete() +
  my_theme

houses %>%
  gather(Attributes, value, factorial_variables[10:18]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_bar(stat="count", show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Categorical Variables - Histograms") +
  scale_fill_discrete() +
  my_theme

houses %>%
  gather(Attributes, value, factorial_variables[19:27]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_bar(stat="count", show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Categorical Variables - Histograms") +
  scale_fill_discrete() +
  my_theme

houses %>%
  gather(Attributes, value, factorial_variables[28:36]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_bar(stat="count", show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Categorical Variables - Histograms") +
  scale_fill_discrete() +
  my_theme

houses %>%
  gather(Attributes, value, factorial_variables[37:44]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_bar(stat="count", show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values", y="Frequency",
       title="Categorical Variables - Histograms") +
  scale_fill_discrete() +
  my_theme

Interpretation of the factor variable plots

1st plot:

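  • I can see that Street and Utilities are almost constant (nearly every observation takes the same value) and that MSSubClass has many levels that are hard to interpret and could introduce bias, so I delete them. Neighborhood has too many levels as well; it is dropped together with the 2nd plot variables.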
houses <- houses[,-which(names(houses) == "Street")]
houses <- houses[,-which(names(houses) == "Utilities")]
houses <- houses[,-which(names(houses) == "MSSubClass")]
  • I have to inspect LandSlope –> I’ll merge “Mod” and “Sev” as “ModSev” = Moderate/Severe Slope.
prop.table(table(houses$LandSlope))
## 
##         Gtl         Mod         Sev 
## 0.951695786 0.042822885 0.005481329
houses <- transform(houses, LandSlope=revalue(LandSlope,c("Mod" = "ModSev")))
houses <- transform(houses, LandSlope=revalue(LandSlope,c("Sev" = "ModSev")))
prop.table(table(houses$LandSlope))
## 
##        Gtl     ModSev 
## 0.95169579 0.04830421
  • I have to inspect LotConfig –> I’ll merge “FR2” and “FR3” as “FR2/3” = Frontage on 2/3 sides of property.
prop.table(table(houses$LotConfig))
## 
##      Corner     CulDSac         FR2         FR3      Inside 
## 0.175059952 0.060294621 0.029119561 0.004796163 0.730729702
houses <- transform(houses, LotConfig=revalue(LotConfig,c("FR2" = "FR2/3")))
houses <- transform(houses, LotConfig=revalue(LotConfig,c("FR3" = "FR2/3")))
prop.table(table(houses$LotConfig))
## 
##     Corner    CulDSac      FR2/3     Inside 
## 0.17505995 0.06029462 0.03391572 0.73072970
  • I have to inspect LotShape –> I’ll merge “IR2” and “IR3” as “IR2” = Moderately or more Irregular.
prop.table(table(houses$LotShape))
## 
##         IR1         IR2         IR3         Reg 
## 0.331620418 0.026036314 0.005481329 0.636861939
houses <- transform(houses, LotShape=revalue(LotShape,c("IR3" = "IR2")))
prop.table(table(houses$LotShape))
## 
##        IR1        IR2        Reg 
## 0.33162042 0.03151764 0.63686194
  • I have to inspect MSZoning –> I’ll merge “RM” and “RH” as “RMH” = Residential Medium/High Density.
prop.table(table(houses$MSZoning))
## 
##     C (all)          FV          RH          RL          RM 
## 0.008564577 0.047619048 0.008907160 0.777321000 0.157588215
houses <- transform(houses, MSZoning=revalue(MSZoning,c("RM" = "RMH")))
houses <- transform(houses, MSZoning=revalue(MSZoning,c("RH" = "RMH")))
prop.table(table(houses$MSZoning))
## 
##     C (all)          FV         RMH          RL 
## 0.008564577 0.047619048 0.166495375 0.777321000
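
The merges above call revalue once per level. As a sketch of an alternative, base R can collapse several levels in a single assignment by giving `levels<-` a named list (new level name = old levels it absorbs). The factor `slope` is a toy example; only the level names follow LandSlope:

```r
# Toy factor with the LandSlope level names (hypothetical values)
slope <- factor(c("Gtl", "Mod", "Gtl", "Sev", "Gtl", "Mod"))

# Each list name is a new level, each value the old levels it absorbs.
# Levels that are not mentioned would become NA, so every level is listed.
levels(slope) <- list(Gtl = "Gtl", ModSev = c("Mod", "Sev"))

prop.table(table(slope))
```

Listing every level explicitly also documents which categories survive the merge.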

2nd plot:

  • I have chosen to drop Condition1 and Condition2 because both have many levels that are difficult to interpret and some of them have (almost) no observations. I have also removed RoofMatl because many categories are nearly empty and almost all the observations share the same value. Then I have deleted Exterior2nd because it is very similar to Exterior1st, and Neighborhood because it has too many levels.
houses <- houses[,-which(names(houses) == "Condition1")]
houses <- houses[,-which(names(houses) == "Condition2")]
houses <- houses[,-which(names(houses) == "RoofMatl")]
houses <- houses[,-which(names(houses) == "Exterior2nd")]
houses <- houses[,-which(names(houses) == "Neighborhood")]
  • I have to inspect BldgType –> I’ll merge “TwnhsE” with “Twnhs” = Townhouse and also merge “2fmCon” and “Duplex” as “2Fam” = Two-family.
prop.table(table(houses$BldgType))
## 
##       1Fam     2fmCon     Duplex      Twnhs     TwnhsE 
## 0.83076396 0.02124015 0.03734156 0.03288798 0.07776636
houses <- transform(houses, BldgType=revalue(BldgType,c("TwnhsE" = "Twnhs")))
houses <- transform(houses, BldgType=revalue(BldgType,c("2fmCon" = "2Fam")))
houses <- transform(houses, BldgType=revalue(BldgType,c("Duplex" = "2Fam")))
prop.table(table(houses$BldgType))
## 
##       1Fam       2Fam      Twnhs 
## 0.83076396 0.05858171 0.11065433
  • I have to inspect Exterior1st –> I’ll merge into “Other” all the levels that account for less than 5% of the observations.
prop.table(table(houses$Exterior1st))
## 
##      AsbShng      AsphShn      BrkComm      BrkFace       CBlock      CemntBd 
## 0.0150736554 0.0006851662 0.0020554985 0.0298047276 0.0006851662 0.0431654676 
##      HdBoard      ImStucc      MetalSd        Other      Plywood        Stone 
## 0.1514217198 0.0003425831 0.1541623844 0.0003425831 0.0757108599 0.0006851662 
##       Stucco      VinylSd      Wd Sdng      WdShing 
## 0.0147310723 0.3511476533 0.1408016444 0.0191846523
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("AsbShng" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("AsphShn" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("BrkComm" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("BrkFace" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("CBlock" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("CemntBd" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("ImStucc" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("Stone" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("Stucco" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("WdShing" = "Other")))
prop.table(table(houses$Exterior1st))
## 
##      Other    HdBoard    MetalSd    Plywood    VinylSd    Wd Sdng 
## 0.12675574 0.15142172 0.15416238 0.07571086 0.35114765 0.14080164

  • I have to inspect HouseStyle –> I’ll merge “1Story”, “1.5Fin” and “1.5Unf” as “1aH” = One or one and one-half story, merge “2Story”, “2.5Fin” and “2.5Unf” as “2aH” = Two or two and one-half story, and also merge “SFoyer” and “SLvl” as “SFL” = Split Foyer or Split Level.

prop.table(table(houses$HouseStyle))
## 
##      1.5Fin      1.5Unf      1Story      2.5Fin      2.5Unf      2Story 
## 0.107571086 0.006509078 0.503939705 0.002740665 0.008221994 0.298732443 
##      SFoyer        SLvl 
## 0.028434395 0.043850634
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1Story" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1.5Fin" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1.5Unf" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2Story" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2.5Fin" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2.5Unf" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("SFoyer" = "SFL")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("SLvl" = "SFL")))
prop.table(table(houses$HouseStyle))
## 
##        1aH        2aH        SFL 
## 0.61801987 0.30969510 0.07228503
  • I have to inspect RoofStyle –> I’ll merge into “Other” all the levels that account for less than 5% of the observations.
prop.table(table(houses$RoofStyle))
## 
##        Flat       Gable     Gambrel         Hip     Mansard        Shed 
## 0.006851662 0.791366906 0.007536828 0.188763275 0.003768414 0.001712915
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Flat" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Gambrel" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Mansard" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Shed" = "Other")))
prop.table(table(houses$RoofStyle))
## 
##      Other      Gable        Hip 
## 0.01986982 0.79136691 0.18876328
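
Several of the bullets above apply the same rule: every level below 5% goes into “Other”. A minimal sketch of a helper that applies that rule automatically, in base R only. The name `lump_rare` and its arguments are hypothetical, not part of any package used here:

```r
# Merge every level whose share is below `threshold` into a single level.
# `lump_rare`, `threshold` and `other` are illustrative names.
lump_rare <- function(f, threshold = 0.05, other = "Other") {
  f <- as.factor(f)
  shares <- prop.table(table(f))
  rare <- names(shares)[shares < threshold]
  # Assigning the same name to several levels merges them
  levels(f)[levels(f) %in% rare] <- other
  f
}

# Toy factor: "a" 90%, "b" 6%, "c" 3%, "d" 1% (hypothetical values)
roof <- factor(c(rep("a", 90), rep("b", 6), rep("c", 3), "d"))
roof_lumped <- lump_rare(roof)

prop.table(table(roof_lumped))
```

Here the shares are computed on whatever vector is passed in, so the result depends on whether the rule is applied to the merged data or to the train set only.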

3rd plot:

  • I have to inspect BsmtCond –> I’ll merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$BsmtCond))
## 
##          Fa          Gd         NoB          Po          TA 
## 0.035628640 0.041795135 0.028091812 0.001712915 0.892771497
houses <- transform(houses, BsmtCond=revalue(BsmtCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, BsmtCond=revalue(BsmtCond,c("Po" = "Fa_Po")))
prop.table(table(houses$BsmtCond))
## 
##      Fa_Po         Gd        NoB         TA 
## 0.03734156 0.04179514 0.02809181 0.89277150
  • I have to inspect BsmtFinType1 –> I’ll merge “ALQ”, “BLQ” and “GLQ” as “LQ” = Living Quarters and also merge “LwQ” and “Unf” as “LwQ_Unf” = Low Quality or Unfinished.
prop.table(table(houses$BsmtFinType1))
## 
##        ALQ        BLQ        GLQ        LwQ        NoB        Rec        Unf 
## 0.14696814 0.09215485 0.29085303 0.05275779 0.02706406 0.09866393 0.29153820
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("ALQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("BLQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("GLQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("LwQ" = "LwQ_Unf")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("Unf" = "LwQ_Unf")))
prop.table(table(houses$BsmtFinType1))
## 
##         LQ    LwQ_Unf        NoB        Rec 
## 0.52997602 0.34429599 0.02706406 0.09866393
  • I have to inspect BsmtFinType2 –> I’ll merge “ALQ”, “BLQ” and “GLQ” as “LQ” = Living Quarters and also merge “LwQ” and “Unf” as “LwQ_Unf” = Low Quality or Unfinished.
prop.table(table(houses$BsmtFinType2))
## 
##        ALQ        BLQ        GLQ        LwQ        NoB        Rec        Unf 
## 0.01781432 0.02329565 0.01164782 0.02980473 0.02740665 0.03597122 0.85405961
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("ALQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("BLQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("GLQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("LwQ" = "LwQ_Unf")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("Unf" = "LwQ_Unf")))
prop.table(table(houses$BsmtFinType2))
## 
##         LQ    LwQ_Unf        NoB        Rec 
## 0.05275779 0.88386434 0.02740665 0.03597122
  • I have to inspect ExterCond –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$ExterCond))
## 
##          Ex          Fa          Gd          Po          TA 
## 0.004110997 0.022953066 0.102432340 0.001027749 0.869475848
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Ex" = "Ex_Gd")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Gd" = "Ex_Gd")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Po" = "Fa_Po")))
prop.table(table(houses$ExterCond))
## 
##      Ex_Gd      Fa_Po         TA 
## 0.10654334 0.02398082 0.86947585
  • I have to inspect ExterQual –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good.
prop.table(table(houses$ExterQual))
## 
##         Ex         Fa         Gd         TA 
## 0.03665639 0.01199041 0.33538883 0.61596437
houses <- transform(houses, ExterQual=revalue(ExterQual,c("Ex" = "Ex_Gd")))
houses <- transform(houses, ExterQual=revalue(ExterQual,c("Gd" = "Ex_Gd")))
prop.table(table(houses$ExterQual))
## 
##      Ex_Gd         Fa         TA 
## 0.37204522 0.01199041 0.61596437
  • I have to inspect Foundation –> I’ll merge into “Other” all the levels that account for less than 5% of the observations.
prop.table(table(houses$Foundation))
## 
##      BrkTil      CBlock       PConc        Slab       Stone        Wood 
## 0.106543337 0.423090099 0.448098664 0.016786571 0.003768414 0.001712915
houses <- transform(houses, Foundation=revalue(Foundation,c("Slab" = "Other")))
houses <- transform(houses, Foundation=revalue(Foundation,c("Stone" = "Other")))
houses <- transform(houses, Foundation=revalue(Foundation,c("Wood" = "Other")))
prop.table(table(houses$Foundation))
## 
##    BrkTil    CBlock     PConc     Other 
## 0.1065433 0.4230901 0.4480987 0.0222679
  • I have to inspect MasVnrType –> I’ll merge “BrkCmn” and “BrkFace” as “Brk” = Brick.
prop.table(table(houses$MasVnrType))
## 
##      BrkCmn     BrkFace        None       Stone 
## 0.008564577 0.301130524 0.605001713 0.085303186
houses <- transform(houses, MasVnrType=revalue(MasVnrType,c("BrkCmn" = "Brk")))
houses <- transform(houses, MasVnrType=revalue(MasVnrType,c("BrkFace" = "Brk")))
prop.table(table(houses$MasVnrType))
## 
##        Brk       None      Stone 
## 0.30969510 0.60500171 0.08530319

4th plot:

  • I have dropped Heating because almost all houses use gas heating and the remaining levels are nearly empty.
houses <- houses[,-which(names(houses) == "Heating")]
  • I have to inspect Electrical –> I’ll merge into “Other” all the levels that account for less than 5% of the observations.
prop.table(table(houses$Electrical))
## 
##        FuseA        FuseF        FuseP          Mix        Other        SBrkr 
## 0.0644056184 0.0171291538 0.0027406646 0.0003425831 0.0003425831 0.9150393971
houses <- transform(houses, Electrical=revalue(Electrical,c("FuseF" = "Other")))
houses <- transform(houses, Electrical=revalue(Electrical,c("FuseP" = "Other")))
houses <- transform(houses, Electrical=revalue(Electrical,c("Mix" = "Other")))
prop.table(table(houses$Electrical))
## 
##      FuseA      Other      SBrkr 
## 0.06440562 0.02055498 0.91503940
  • I have to inspect FireplaceQu –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$FireplaceQu))
## 
##         Ex         Fa         Gd       NoFp         Po         TA 
## 0.01473107 0.02535115 0.25488181 0.48646797 0.01575882 0.20280918
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Ex" = "Ex_Gd")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Gd" = "Ex_Gd")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Fa" = "Fa_Po")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Po" = "Fa_Po")))
prop.table(table(houses$FireplaceQu))
## 
##      Ex_Gd      Fa_Po       NoFp         TA 
## 0.26961288 0.04110997 0.48646797 0.20280918
  • I have to inspect Functional –> I’ll merge “Maj1”, “Maj2” and “Mod” as “Maj” = Major Deductions and also merge “Min1”, “Min2” and “Sev” as “Min” = Minor Deductions.
prop.table(table(houses$Functional))
## 
##         Maj1         Maj2         Min1         Min2          Mod          Sev 
## 0.0065090785 0.0030832477 0.0222679000 0.0239808153 0.0119904077 0.0006851662 
##          Typ 
## 0.9314833847
houses <- transform(houses, Functional=revalue(Functional,c("Maj1" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Maj2" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Mod" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Min1" = "Min")))
houses <- transform(houses, Functional=revalue(Functional,c("Min2" = "Min")))
houses <- transform(houses, Functional=revalue(Functional,c("Sev" = "Min")))
prop.table(table(houses$Functional))
## 
##        Maj        Min        Typ 
## 0.02158273 0.04693388 0.93148338
  • I have to inspect GarageType –> I’ll merge into “Other” all the levels that account for less than 5% of the observations.
prop.table(table(houses$GarageType))
## 
##      2Types      Attchd     Basment     BuiltIn     CarPort      Detchd 
## 0.007879411 0.590270641 0.012332991 0.063720452 0.005138746 0.266872217 
##         NoG 
## 0.053785543
houses <- transform(houses, GarageType=revalue(GarageType,c("2Types" = "Other")))
houses <- transform(houses, GarageType=revalue(GarageType,c("Basment" = "Other")))
houses <- transform(houses, GarageType=revalue(GarageType,c("CarPort" = "Other")))
prop.table(table(houses$GarageType))
## 
##      Other     Attchd    BuiltIn     Detchd        NoG 
## 0.02535115 0.59027064 0.06372045 0.26687222 0.05378554
  • I have to inspect HeatingQC –> I’ll merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$HeatingQC))
## 
##          Ex          Fa          Gd          Po          TA 
## 0.511476533 0.031517643 0.162384378 0.001027749 0.293593696
houses <- transform(houses, HeatingQC=revalue(HeatingQC,c("Fa" = "Fa_Po")))
houses <- transform(houses, HeatingQC=revalue(HeatingQC,c("Po" = "Fa_Po")))
prop.table(table(houses$HeatingQC))
## 
##         Ex      Fa_Po         Gd         TA 
## 0.51147653 0.03254539 0.16238438 0.29359370

5th plot:

  • I have to inspect Fence –> I’ll merge “GdPrv” and “GdWo” as “GdPrvWo” = Good privacy or Good wood and also merge “MnPrv” and “MnWw” as “MnPrvWw” = Minimum privacy or Minimum Wood/Wire.
prop.table(table(houses$Fence))
## 
##       GdPrv        GdWo       MnPrv        MnWw       NoFen 
## 0.040424803 0.038369305 0.112709832 0.004110997 0.804385063
houses <- transform(houses, Fence=revalue(Fence,c("GdPrv" = "GdPrvWo")))
houses <- transform(houses, Fence=revalue(Fence,c("GdWo" = "GdPrvWo")))
houses <- transform(houses, Fence=revalue(Fence,c("MnPrv" = "MnPrvWw")))
houses <- transform(houses, Fence=revalue(Fence,c("MnWw" = "MnPrvWw")))
prop.table(table(houses$Fence))
## 
##    GdPrvWo    MnPrvWw      NoFen 
## 0.07879411 0.11682083 0.80438506
  • I have to inspect GarageCond and GarageQual because they seem similar. I have decided to keep only GarageCond and delete GarageQual, since most of its values are equal to those of GarageCond –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$GarageCond))
## 
##          Ex          Fa          Gd         NoG          Po          TA 
## 0.001027749 0.025351148 0.005138746 0.054470709 0.004796163 0.909215485
prop.table(table(houses$GarageQual))
## 
##          Ex          Fa          Gd         NoG          Po          TA 
## 0.001027749 0.042480301 0.008221994 0.054470709 0.001712915 0.892086331
houses <- houses[,-which(names(houses) == "GarageQual")]
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Ex" = "Ex_Gd")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Gd" = "Ex_Gd")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Po" = "Fa_Po")))
prop.table(table(houses$GarageCond))
## 
##       Ex_Gd       Fa_Po         NoG          TA 
## 0.006166495 0.030147311 0.054470709 0.909215485
  • I have to inspect MiscFeature –> I’ll merge “Gar2”, “Othr”, “Shed” and “TenC” as “Yes” = a miscellaneous feature is present, keeping “None” = no miscellaneous feature.
prop.table(table(houses$MiscFeature))
## 
##         Gar2         None         Othr         Shed         TenC 
## 0.0017129154 0.9640287770 0.0013703323 0.0325453923 0.0003425831
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Gar2" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Othr" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Shed" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("TenC" = "Yes")))
prop.table(table(houses$MiscFeature))
## 
##        Yes       None 
## 0.03597122 0.96402878
  • I have to inspect PavedDrive –> I’ll merge “Y” and “P” as “Y_P” = Paved or partially Paved.
prop.table(table(houses$PavedDrive))
## 
##          N          P          Y 
## 0.07399794 0.02124015 0.90476190
houses <- transform(houses, PavedDrive=revalue(PavedDrive,c("Y" = "Y_P")))
houses <- transform(houses, PavedDrive=revalue(PavedDrive,c("P" = "Y_P")))
prop.table(table(houses$PavedDrive))
## 
##          N        Y_P 
## 0.07399794 0.92600206
  • I have to inspect PoolQC: more than 99% of the houses have no pool. Since having a pool would be a plus, I don’t want to delete the variable –> I have decided to turn it into the variable Pool with two levels, “Yes” and “No”.
prop.table(table(houses$PoolQC))
## 
##           Ex           Fa           Gd         NoPo 
## 0.0013703323 0.0006851662 0.0013703323 0.9965741692
names(houses)[names(houses) == 'PoolQC'] <- 'Pool'
houses <- transform(houses, Pool=revalue(Pool,c("Ex" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("Fa" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("Gd" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("NoPo" = "No")))
prop.table(table(houses$Pool))
## 
##         Yes          No 
## 0.003425831 0.996574169
  • I have to inspect SaleCondition –> I’ll merge into “Other” all the levels that account for less than 5% of the observations.
prop.table(table(houses$SaleCondition))
## 
##     Abnorml     AdjLand      Alloca      Family      Normal     Partial 
## 0.065090785 0.004110997 0.008221994 0.015758822 0.822884550 0.083932854
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("AdjLand" = "Other")))
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("Alloca" = "Other")))
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("Family" = "Other")))
prop.table(table(houses$SaleCondition))
## 
##    Abnorml      Other     Normal    Partial 
## 0.06509078 0.02809181 0.82288455 0.08393285
  • I have to inspect SaleType –> I’ll merge “Con”, “ConLD”, “ConLI”, “ConLw” as “Oth” = Other and also merge “WD” and “CWD” as “WD” = Warranty Deed.
prop.table(table(houses$SaleType))
## 
##         COD         Con       ConLD       ConLI       ConLw         CWD 
## 0.029804728 0.001712915 0.008907160 0.003083248 0.002740665 0.004110997 
##         New         Oth          WD 
## 0.081877355 0.002740665 0.865022268
houses <- transform(houses, SaleType=revalue(SaleType,c("Con" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLD" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLI" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLw" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("CWD" = "WD")))
prop.table(table(houses$SaleType))
## 
##        COD        Oth         WD        New 
## 0.02980473 0.01918465 0.86913326 0.08187736
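
The choice to keep GarageCond and drop GarageQual rests on the two variables mostly agreeing. One way to check this, sketched on two toy factors (`garage_cond` and `garage_qual` hold hypothetical values, not the dataset's), is a cross-tabulation together with the share of identical entries:

```r
# Toy factors with the GarageCond / GarageQual level names (hypothetical values)
garage_cond <- factor(c("TA", "TA", "Fa", "TA", "NoG", "Gd", "TA", "TA"))
garage_qual <- factor(c("TA", "TA", "TA", "TA", "NoG", "Gd", "Fa", "TA"))

# Cross-tabulation: cells where row and column labels match are agreements
table(GarageCond = garage_cond, GarageQual = garage_qual)

# Share of observations where the two variables take the same value
agreement <- mean(as.character(garage_cond) == as.character(garage_qual))
agreement
```

A share close to 1 would support dropping one of the two. A lower share would suggest that they carry different information.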

Inspecting the continuous variables

continuous_variables <- c(3,4,12:15,19,27,29:31,35:44,46,48,51,53,54,57:62,66:68,71)
str(houses[,continuous_variables])
## 'data.frame':    2919 obs. of  36 variables:
##  $ LotFrontage  : num  65 80 68 60 84 85 75 0 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ MasVnrArea   : num  196 0 162 0 350 0 186 240 0 0 ...
##  $ BsmtFinSF1   : num  706 978 486 216 655 ...
##  $ BsmtFinSF2   : num  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : num  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : num  856 1262 920 756 1145 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : num  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : num  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ GarageYrBlt  : num  2003 1976 2001 1998 2000 ...
##  $ GarageCars   : num  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : num  548 460 608 642 836 480 636 484 468 205 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SalePrice    : num  208500 181500 223500 140000 250000 ...
houses %>%
  gather(Attributes, value, continuous_variables[1:9]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_boxplot(show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values",
       title="Continuous Variables - Boxplot") +
  scale_fill_discrete() +
  my_theme

houses %>%
  gather(Attributes, value, continuous_variables[10:18]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_boxplot(show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values",
       title="Continuous Variables - Boxplot") +
  scale_fill_discrete() +
  my_theme

houses %>%
  gather(Attributes, value, continuous_variables[19:27]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_boxplot(show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values",
       title="Continuous Variables - Boxplot") +
  scale_fill_discrete() +
  my_theme

houses %>%
  gather(Attributes, value, continuous_variables[28:36]) %>%
  ggplot(aes(x=value, fill=Attributes)) +
  geom_boxplot(show.legend = F) +
  facet_wrap(~Attributes, scales="free_x") +
  labs(x="Values",
       title="Continuous Variables - Boxplot") +
  scale_fill_discrete() +
  my_theme

Interpretation of the continuous variable plots

1st plot:

  • I have noticed that BsmtFinSF2 has too many outliers, so I’ll remove it and evaluate only the presence or absence of a second finished basement area.
houses <- houses[,-which(names(houses) == "BsmtFinSF2")]

2nd plot:

  • I have noticed that BsmtHalfBath is almost always 0, which makes the variable difficult to interpret; I’ll remove it, also because I already have BsmtFullBath to consider.
houses <- houses[,-which(names(houses) == "BsmtHalfBath")]
  • I have noticed that LowQualFinSF has too many outliers, so I’ll remove it.
houses <- houses[,-which(names(houses) == "LowQualFinSF")]
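
“Too many outliers” can be made concrete. A boxplot flags points lying more than 1.5 times the interquartile range beyond the hinges, so when most values are 0 the interquartile range is 0 and every non-zero value is flagged. A minimal sketch using `boxplot.stats` from base R; the helper `outlier_share` and the toy vectors are hypothetical:

```r
# Share of non-missing values that a boxplot would draw as outliers
outlier_share <- function(x) {
  x <- x[!is.na(x)]
  length(boxplot.stats(x)$out) / length(x)
}

# Zero-inflated toy variable: 90 zeros and 10 positive areas
mostly_zero <- c(rep(0, 90), seq(100, 1000, by = 100))
# Evenly spread toy variable
spread_out <- 1:100

outlier_share(mostly_zero)  # every non-zero value is flagged
outlier_share(spread_out)   # nothing is flagged
```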

4th plot:

  • I have noticed that MiscVal has many outliers and is difficult to interpret; I’ll remove it, also because I already have a binary variable (MiscFeature) that indicates whether a miscellaneous feature is present.
houses <- houses[,-which(names(houses) == "MiscVal")]
  • The same reasoning applied to MiscVal holds for PoolArea.
houses <- houses[,-which(names(houses) == "PoolArea")]
  • I decided to group all the “closed type” porches into a new factor variable, ClosedPorch, with two levels (“Yes” or “No”), so that I can delete the original continuous variables, which are hard to interpret.
houses$ClosedPorch = as.factor(ifelse(houses$EnclosedPorch > 0 | houses$ScreenPorch > 0 | houses$X3SsnPorch > 0, "Yes", "No"))
houses <- houses[,-which(names(houses) == "EnclosedPorch")]
houses <- houses[,-which(names(houses) == "ScreenPorch")]
houses <- houses[,-which(names(houses) == "X3SsnPorch")]
prop.table(table(houses$ClosedPorch))
## 
##        No       Yes 
## 0.7519699 0.2480301
  • I decided to replace OpenPorchSF with a factor variable with two levels, “Yes” or “No”.
houses$OpenPorch = as.factor(ifelse(houses$OpenPorchSF > 0, "Yes", "No"))
houses <- houses[,-which(names(houses) == "OpenPorchSF")]
prop.table(table(houses$OpenPorch))
## 
##        No       Yes 
## 0.4446728 0.5553272
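The "area to Yes/No" conversion used above for the porch variables follows one repeated pattern, which can be sketched with a small helper. This is only an illustration on toy data: the helper name `to_presence` and the toy values are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: "Yes" if any of the given area columns is > 0, else "No".
to_presence <- function(...) {
  areas <- cbind(...)                          # one column per area variable
  factor(ifelse(rowSums(areas > 0) > 0, "Yes", "No"),
         levels = c("No", "Yes"))
}

# Toy data (made-up values), mimicking three of the porch columns
toy <- data.frame(EnclosedPorch = c(0, 120, 0, 0),
                  ScreenPorch   = c(0, 0, 0, 90),
                  OpenPorchSF   = c(35, 0, 0, 60))

toy$ClosedPorch <- to_presence(toy$EnclosedPorch, toy$ScreenPorch)
toy$OpenPorch   <- to_presence(toy$OpenPorchSF)

# Drop the original area columns by name in a single step
toy <- toy[, !(names(toy) %in% c("EnclosedPorch", "ScreenPorch", "OpenPorchSF"))]
print(toy)
```

Dropping columns by name with `%in%` avoids repeating one `which(names(...) == ...)` line per variable.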

Data diagnostic

After this analysis I’ll move SalePrice to the last column, just to keep the data tidy.

houses <- houses %>% 
  relocate(SalePrice, .after = last_col())

Split data

I split the data back into train and test sets. The SalePrice column of the test set (previously inserted as a column of zeros) is removed later, once the dummy variables have been created.

train <- houses[1:1460,]
test <- houses[1461:nrow(houses),]  # all rows after the 1460 training rows

Split variables by type

Next, I redefine the column indices of the factor and continuous variables, excluding Id.

factorial_variables <- c(2,3,5:11,14,16:18,20:26,28,31:33,42,44,46,47,49,52,53,55:57,59:63)
continuous_variables <- c(4,12,13,15,19,27,29,30,34:41,43,45,48,50,51,54,58,64)

Then I create the train and test sets by type of variable.

train_num <- train[,continuous_variables]
test_num <- test[,continuous_variables]
train_fact <- train[,c(1,factorial_variables)]
test_fact <- test[,c(1,factorial_variables)]
test_id <- test[,1]

Correlation of numerical variable

Correlation measures the strength of the linear relationship between two quantitative variables (e.g., height and weight).

Positive correlation is a relationship in which both variables move in the same direction: when one increases the other increases, and vice versa. For example, the more you exercise, the more calories you burn. Negative correlation is a relationship in which one variable increases as the other decreases, and vice versa.
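As a small worked example of the definition above, the Pearson correlation coefficient can be computed by hand and compared with R's built-in `cor()`. The numbers are made up for illustration, and the function name `pearson` is hypothetical.

```r
# Pearson correlation: covariance divided by the product of the spreads
pearson <- function(a, b) {
  sum((a - mean(a)) * (b - mean(b))) /
    sqrt(sum((a - mean(a))^2) * sum((b - mean(b))^2))
}

hours_exercise  <- c(1, 2, 3, 4, 5)        # toy values
calories_burned <- c(2, 4, 5, 4, 6)        # tends to rise with exercise
hours_resting   <- c(10, 8, 7, 5, 1)       # falls as exercise rises

r_pos <- pearson(hours_exercise, calories_burned)  # positive correlation
r_neg <- pearson(hours_exercise, hours_resting)    # negative correlation

print(c(manual = r_pos, builtin = cor(hours_exercise, calories_burned)))
print(r_neg)
```

The manual value should match `cor()`, which uses Pearson's coefficient by default.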

tot_corr <- cor(train_num[,-24])
col<- colorRampPalette(c('darkred', 'white', 'black'))(10)
corrplot(tot_corr, method="pie", type= "upper", diag = F, tl.srt = 40, tl.cex = 0.8, tl.col = "black", col = col)

Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed. I want to remove attributes with an absolute correlation of 0.75 or higher.

So, I have removed GrLivArea, GarageCars and X1stFlrSF

highlyCorrelated <- findCorrelation(tot_corr, cutoff=0.75)
str(train_num[,highlyCorrelated])
## 'data.frame':    1460 obs. of  3 variables:
##  $ GrLivArea : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ GarageCars: num  2 2 2 3 3 2 2 2 2 1 ...
##  $ X1stFlrSF : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
train_num <- train_num[,-highlyCorrelated]
test_num <- test_num[,-highlyCorrelated]
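To give an idea of what a correlation cutoff filter does, here is a simplified base R sketch on simulated data. It is a stand-in written for illustration, not caret's actual `findCorrelation` implementation: it repeatedly takes the most correlated pair and drops the member with the larger mean absolute correlation, until no pair reaches the cutoff.

```r
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(a = a,
                  b = a + rnorm(n, sd = 0.1),  # nearly a copy of a
                  c = rnorm(n))                # unrelated variable

cutoff <- 0.75
cm <- abs(cor(toy))
diag(cm) <- 0                                  # ignore self-correlation
dropped <- character(0)

while (max(cm) >= cutoff) {
  # position (row, col) of the most correlated pair
  pair <- which(cm == max(cm), arr.ind = TRUE)[1, ]
  cand <- colnames(cm)[pair]
  # drop the member that is, on average, more correlated with everything
  worst <- cand[which.max(colMeans(cm[, cand, drop = FALSE]))]
  dropped <- c(dropped, worst)
  keep <- setdiff(colnames(cm), worst)
  cm <- cm[keep, keep, drop = FALSE]
}

print(dropped)
toy_filtered <- toy[, !(names(toy) %in% dropped), drop = FALSE]
```

In this toy case only one of `a` and `b` is removed, since they carry almost the same information, while `c` is kept.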

Analyse the dependent variable (SalePrice)

SalePrice Density plot

ggplot(train, aes(x= SalePrice)) +
  geom_density(fill = "black", alpha=0.6, show.legend=FALSE) +
  labs(x="Values", y="Density", title="SalePrice - Density plot") +
  my_theme

SalePrice Box plot

ggplot(train, aes(x= SalePrice)) +
  geom_boxplot(fill = "black", alpha = 0.6, show.legend = F) +
  coord_flip() +
  labs(x="Values",title="SalePrice - Box plot") +
  my_theme

SalePrice Correlation plot

price_cor <- train_num %>% 
  correlate() %>% 
  focus(SalePrice)

price_cor %>% 
  mutate(term = factor(term, levels = term[order(SalePrice)])) %>%  
  ggplot(aes(x = term, y = SalePrice, fill = SalePrice)) +
  geom_bar(stat = "identity", show.legend = F) +
  ylab("Correlation with Sale_Price") +
  xlab("Variable") +
  scale_fill_gradient(low = 'red', high = 'black') +
  theme(plot.title = element_text(color = 'darkred', face = "bold.italic", size = 15),
                   plot.subtitle = element_text(color = 'darkred', size = 8),
                   plot.background =element_rect(fill = "snow",colour = "darkred",size = 1.5),
                   panel.grid.major = element_line(colour = "snow", size = 1),
                   panel.grid.minor = element_line(colour = "snow2"),
                   legend.title = element_text(colour="black", size=10),
                   panel.background =element_rect(fill = "snow2"),
                   legend.background = element_rect(fill="red3",size=0.5, linetype="solid",colour ="red3"),
                   axis.title = element_text(face = "bold.italic", color = "darkred"),
                   axis.text.x = element_text(face = "italic",color = 'red3', angle = 75, hjust = 1),
                   axis.text.y = element_text(face = "italic",color = 'red3'))

Predictive analytics

Pre-processing transformation variables

Before creating the predictive models, I transform the categorical variables into dummy variables. A dummy variable is a numeric variable that represents categorical data, such as gender, race or political affiliation. Dummy variables are dichotomous: they take only two values, typically 1 for the presence of a qualitative attribute and 0 for its absence, which also makes regression results easier to interpret. I then merge the dummy variables with the continuous ones into a single tibble for each of the train and test sets.
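To see what this encoding produces, here is a minimal sketch on a toy data frame. The values are made up, and the columns only mimic the style of the real dataset.

```r
toy <- data.frame(Id         = 1:4,
                  Street     = factor(c("Pave", "Grvl", "Pave", "Pave")),
                  CentralAir = factor(c("Y", "N", "Y", "Y")))

# Id is used as a placeholder response so that every other column is expanded
mm <- model.matrix(Id ~ ., data = toy)

# The first column is the intercept, which is dropped
dummies <- as.data.frame(mm[, -1, drop = FALSE])
print(dummies)
```

Each two-level factor becomes a single 0/1 column named after the factor and its non-reference level (here `StreetPave` and `CentralAirY`). The first level in alphabetical order acts as the reference and gets no column. Note that `model.matrix` drops rows containing missing values by default, so the number of rows is worth checking after encoding.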

train_fact_mtx <- model.matrix(object = Id ~ . , data = train_fact)
train_fact <- data.frame(train_fact_mtx[,-1])

test_fact_mtx <- model.matrix(object = Id ~ . , data = test_fact)
test_fact <- data.frame(test_fact_mtx[,-1])

train <- as_tibble(cbind(train_fact, train_num))
test <- as_tibble(cbind(test_fact, test_num))
test <- test[,-112]

Then I split the train dataset into train data (the predictors) and train target (SalePrice).

formula <- SalePrice ~ .
train_data <- train[, -112] %>% as.data.frame()
train_target <- train[, 112] %>% pull()

Define training control

Before fitting the predictive models I have to define a training control in order to evaluate them.

For this purpose I have chosen repeated cross-validation.

Repeated k-fold cross-validation is a resampling procedure that improves the estimate of a machine learning model’s performance. It simply repeats the cross-validation procedure multiple times and reports the mean result across all folds from all runs. This mean is expected to be a more accurate estimate of the model’s true, unknown performance on the dataset than the result of a single run.

I decided to resample with k = 10 and to repeat the cross-validation 3 times.

train_control_RCV <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
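To make the procedure concrete, here is a hand-written sketch of 10-fold cross-validation repeated 3 times on simulated data with a simple linear model. It assumes plain random folds, so the way caret builds its folds internally may differ. All names and data are illustrative.

```r
set.seed(123)
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.5)

k <- 10
repeats <- 3
rmse_per_fold <- c()

for (r in seq_len(repeats)) {
  # assign each row to one of k folds, reshuffled at every repeat
  fold <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    fit  <- lm(y ~ x, data = toy[fold != f, ])          # train on k-1 folds
    pred <- predict(fit, newdata = toy[fold == f, ])    # predict held-out fold
    rmse_per_fold <- c(rmse_per_fold,
                       sqrt(mean((toy$y[fold == f] - pred)^2)))
  }
}

length(rmse_per_fold)   # k * repeats held-out estimates
mean(rmse_per_fold)     # the reported cross-validated RMSE
```

The value reported for a model is the average over all 30 held-out folds, which is why it is more stable than a single train/validation split.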

After the repeated cross-validation I will have, for each fitted model, three values that help me evaluate it:

  • RMSE = It is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. The RMSE represents the square root of the second sample moment of the differences between predicted values and observed values or the quadratic mean of these differences.

Root mean squared error is calculated as: \[RMSE = \sqrt {\frac {1} {n} \sum (y_i - \hat{y}_i)^2}\]

  • MAE = It is a measure of errors between paired observations expressing the same phenomenon.

Mean absolute error is calculated as: \[MAE = \frac {1} {n} \sum |y_i - \hat{y}_i|\]

  • Rsquared = It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.

It is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information.

The coefficient of determination is calculated as: \[R^2 = 1- \frac {SS_{res}} {SS_{tot}}\]

Where the variability of the data set can be measured with two sums of squares formulas:

  • The total sum of squares (proportional to the variance of the data): \[SS_{tot} = \sum (y_i - \bar{y})^2\] Where: \[\bar{y} = \frac {1} {n} \sum y_i\]

  • The sum of squares of residuals, also called the residual sum of squares: \[SS_{res} = \sum (y_i - \hat{y}_i)^2\]
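The three formulas above translate directly into R. The sketch below uses made-up observed and predicted values, and the function names are hypothetical.

```r
rmse_manual <- function(y, y_hat) sqrt(mean((y - y_hat)^2))
mae_manual  <- function(y, y_hat) mean(abs(y - y_hat))
r2_manual   <- function(y, y_hat) {
  ss_res <- sum((y - y_hat)^2)        # residual sum of squares
  ss_tot <- sum((y - mean(y))^2)      # total sum of squares
  1 - ss_res / ss_tot
}

y     <- c(200, 150, 300, 250)        # toy observed values
y_hat <- c(210, 140, 280, 260)        # toy predictions

rmse_manual(y, y_hat)   # sqrt(700 / 4), about 13.23
mae_manual(y, y_hat)    # 50 / 4 = 12.5
r2_manual(y, y_hat)     # 1 - 700 / 12500 = 0.944
```

RMSE is larger than MAE here because squaring gives more weight to the single error of 20. As far as I know, the Rsquared that caret reports by default is the squared correlation between predictions and observations, which can differ slightly from the formula above.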

Multiple Linear Regression

Multiple Linear Regression (MLR) is a statistical technique for modelling the relationship between a dependent variable and several independent variables.

The functional form is given by:

\[Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + ... + \beta_{p} X_{p} + e\]

Where:

  • Y is the dependent variable,
  • X1, X2, ..., Xp are the independent variables,
  • β0 is a constant,
  • β1, β2, ..., βp are the partial regression coefficients,
  • e is the error term (residual).
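As a sketch of how the coefficients of this functional form are estimated, ordinary least squares solves the normal equations \((X^T X)\hat{\beta} = X^T Y\). The example below uses simulated data with two predictors and compares the manual solution with `lm()`. The data and names are illustrative only.

```r
set.seed(1)
n  <- 50
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- 2 + 1.5 * X1 - 0.5 * X2 + rnorm(n, sd = 0.3)   # known true coefficients

# Design matrix with a column of ones for the constant beta_0
X <- cbind(1, X1, X2)

# Solve the normal equations (X'X) beta = X'Y
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)

fit <- lm(Y ~ X1 + X2)
print(cbind(manual = as.numeric(beta_hat), lm = as.numeric(coef(fit))))
```

Both columns should agree up to numerical precision and lie close to the true values 2, 1.5 and -0.5.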

Let’s start to fit the model with Y = SalePrice.

\[SalePrice = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + ... + \beta_{p} X_{p} + e\]

set.seed(123)
lm_RCV <- train(formula, data = train, method = "lm", trControl = train_control_RCV)
print(lm_RCV)
## Linear Regression 
## 
## 1460 samples
##  111 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE  
##   34963.23  0.8125509  20345
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

After creating the model, let’s see which variables are the most important:

As shown in the graph, the variable that most influences SalePrice is OverallQual, with an importance score of almost 10, followed by KitchenQualGd with a score just above 7.5. (With scale = F, caret’s varImp for a linear model reports the absolute value of each coefficient’s t-statistic, so these scores are not percentages.)

lm_importance_value <- varImp(lm_RCV, scale = F)
lm_importance_value <- as.data.frame(lm_importance_value[["importance"]])
lm_importance_value <- cbind(Variable = rownames(lm_importance_value),lm_importance_value)
lm_importance_value <- lm_importance_value %>%
  arrange(desc(Overall))

ggplot(lm_importance_value[1:15,], aes(reorder(Variable, Overall), Overall, fill = Overall)) +
  geom_bar(stat = "identity", show.legend = F) + 
  coord_flip()+ 
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "Linear Regression - Variables Importance", y = "Importance", x = "") +
  my_theme

Then let’s see how the linear model predicts the train data.

lm_pred <- predict(lm_RCV, train_data)
tibble(
  pred = lm_pred,
  actual = train_target
) %>% 
  ggplot(aes(pred, actual)) +
  geom_point( color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "Linear Regression - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme

Neural Network

Neural networks are a set of algorithms, loosely modeled after the human brain, designed to recognize patterns.

The patterns they recognize are numerical, contained in vectors, to which all data in the real world must be translated.

A stacked neural network is a network composed of several layers.

The layers are made of nodes.

A node combines data input with a set of coefficients or weights that either amplifies or dampens that input, thereby assigning significance to inputs for the task that the algorithm is trying to learn.

The output of each layer is the input of the next layer, starting from the initial input layer that receives the data.

A neural network consists of:

  • Input layers: Layers that take inputs based on existing data
  • Hidden layers: Layers between the input and the output whose weights are optimised (via backpropagation) in order to improve the predictive power of the model
  • Output layers: Output of predictions based on the data from the input and hidden layers

Scale the data

To scale all the variables, I use the maximum and minimum values of each variable (min-max scaling).

train_maxs <- apply(train, 2, max)
train_mins <- apply(train, 2, min)
train_scaled <- as.data.frame(scale(train, 
                      center = train_mins, 
                      scale  = train_maxs - train_mins))
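The following toy sketch shows what this min-max scaling does and how a scaled value is mapped back to the original units. The two columns and their values are made up for illustration.

```r
toy_train <- data.frame(LotArea   = c(5000, 8000, 11000),
                        SalePrice = c(100000, 150000, 200000))

mins <- apply(toy_train, 2, min)
maxs <- apply(toy_train, 2, max)

# (x - min) / (max - min): every column ends up in the interval [0, 1]
scaled <- as.data.frame(scale(toy_train, center = mins, scale = maxs - mins))
print(scaled)

# Inverse transformation: back to the original price scale
back <- as.numeric(scaled$SalePrice * (maxs["SalePrice"] - mins["SalePrice"]) +
                   mins["SalePrice"])
print(back)
```

One thing to watch for: a column that is constant in the training set has max equal to min, so the division yields NaN. With many dummy variables this is worth checking before fitting.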

I am looking for the optimal parameters for the model.

As the output shows, the final values used for the model were:

  • size = 3

Size is the number of units in the hidden layer (nnet fits a network with a single hidden layer).

  • decay = 0.1

Weight decay is a regularization technique that adds a small penalty, usually the (squared) L2 norm of all the weights of the model, to the loss function.

\[loss_{penalized} = loss + \lambda \, \lVert w \rVert_2^2\]

where \(\lambda\) is the weight decay parameter and \(w\) is the vector of weights.

Weight decay is used to prevent overfitting by keeping the weights small, which also helps to avoid exploding gradients.

Because the penalty is added to the loss, each iteration of the optimisation tries to keep the model weights small in addition to minimising the prediction error.
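A small numeric sketch may help to see the effect of the decay parameter. It assumes a sum-of-squares loss plus decay times the sum of squared weights, which is my understanding of the criterion nnet minimises for regression. The function name and all numbers are hypothetical.

```r
penalised_loss <- function(y, y_hat, weights, decay) {
  fit_error <- sum((y - y_hat)^2)     # how badly the network predicts
  penalty   <- decay * sum(weights^2) # how large the weights are
  fit_error + penalty
}

y       <- c(0.2, 0.8)                # toy scaled targets
y_hat   <- c(0.3, 0.6)                # toy network outputs
weights <- c(1, -2, 0.5)              # toy network weights

penalised_loss(y, y_hat, weights, decay = 0)     # 0.05, fit error only
penalised_loss(y, y_hat, weights, decay = 0.1)   # 0.05 + 0.1 * 5.25 = 0.575
```

With decay = 0.1 most of the loss comes from the size of the weights, so the optimiser is pushed towards smaller weights even at the cost of a slightly worse fit.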

set.seed(123)
nn_RCV <- train(formula, data = train_scaled, method = "nnet", trControl = train_control_RCV)
print(nn_RCV)
## Neural Network 
## 
## 1460 samples
##  111 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ... 
## Resampling results across tuning parameters:
## 
##   size  decay  RMSE        Rsquared   MAE       
##   1     0e+00  0.15703415  0.8322300  0.13199676
##   1     1e-04  0.14545102  0.4543285  0.12132953
##   1     1e-01  0.04541866  0.8357140  0.02824910
##   3     0e+00  0.14953582  0.8500791  0.12606823
##   3     1e-04  0.12687862  0.4917349  0.10374026
##   3     1e-01  0.04522695  0.8351553  0.02787022
##   5     0e+00  0.15895984  0.7941364  0.13222516
##   5     1e-04  0.09061938  0.6516267  0.06804890
##   5     1e-01  0.04536545  0.8341218  0.02777200
## 
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.1.

After creating the model, let’s see which variables are the most important:

As shown in the graph, the variable that most influences SalePrice is still OverallQual, but this time with less than 5% of the total importance, followed by X2ndFlrSF (second floor square feet) with an importance of almost 3%.

nn_importance_value <- varImp(nn_RCV, scale = F)
nn_importance_value <- as.data.frame(nn_importance_value[["importance"]])
nn_importance_value <- cbind(Variable = rownames(nn_importance_value),nn_importance_value)
nn_importance_value <- nn_importance_value %>%
  arrange(desc(Overall))

ggplot(nn_importance_value[1:15,], aes(reorder(Variable, Overall), Overall, fill = Overall)) +
  geom_bar(stat = "identity", show.legend = F) + 
  coord_flip()+ 
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "Neural Net - Variables Importance", y = "Importance", x = "") +
  my_theme

Then let’s see how the neural network model predicts the train data.

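# Drop the target by name (train_scaled has only 112 columns, so index -115 removed nothing)
nn_pred <- predict(nn_RCV, train_scaled[, names(train_scaled) != "SalePrice"])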
nn_pred_unscaled <- nn_pred * (train_maxs["SalePrice"] - train_mins["SalePrice"]) + train_mins["SalePrice"]

tibble(
  pred = nn_pred_unscaled,
  actual = train_target
) %>% 
  ggplot(aes(pred, actual)) +
  geom_point( color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "Neural Net - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme

In order to compare the RMSE and MAE with the other models, I have to calculate them on the unscaled data.

nn_RMSE <- rmse(train_target, nn_pred_unscaled)
print(paste0("RMSE is: ", round(nn_RMSE)))
## [1] "RMSE is: 29401"
nn_MAE <- mae(train_target, nn_pred_unscaled)
print(paste0("MAE is: ", round(nn_MAE)))
## [1] "MAE is: 18276"

Gradient boosting machine

Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.

I am looking for the optimal parameters for the model, so I create a grid of candidate values to be tried for this GBM model.

As the output shows, the final values used for the model were:

  • n.trees = 150

The total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion

  • interaction.depth = 6

The maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc.

  • shrinkage = 0.1

A shrinkage parameter applied to each tree in the expansion. Also known as the learning rate or step-size reduction

  • n.minobsinnode = 15

The minimum number of observations in the terminal nodes of the trees.
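To illustrate how these parameters interact, here is a deliberately simplified gradient boosting sketch in base R. It uses one-split trees (the equivalent of interaction.depth = 1) on a single simulated predictor with squared-error loss. It is not how the gbm package is implemented, and the argument names `n_trees`, `shrinkage` and `min_obs` are hypothetical names that only mirror the parameters described above.

```r
# Fit a one-split regression tree (a "stump") to the residuals r
fit_stump <- function(x, r, min_obs) {
  best <- list(sse = Inf)
  for (s in sort(unique(x))) {
    left <- x <= s
    # respect the minimum number of observations in each terminal node
    if (sum(left) < min_obs || sum(!left) < min_obs) next
    sse <- sum((r[left] - mean(r[left]))^2) + sum((r[!left] - mean(r[!left]))^2)
    if (sse < best$sse) {
      best <- list(sse = sse, split = s,
                   left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

boost <- function(x, y, n_trees, shrinkage, min_obs) {
  pred <- rep(mean(y), length(y))              # start from the mean
  for (m in seq_len(n_trees)) {
    stump <- fit_stump(x, y - pred, min_obs)   # fit the current residuals
    pred  <- pred + shrinkage * predict_stump(stump, x)  # take a small step
  }
  pred
}

set.seed(123)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

fitted_values <- boost(x, y, n_trees = 150, shrinkage = 0.1, min_obs = 15)
rmse_start <- sqrt(mean((y - mean(y))^2))
rmse_end   <- sqrt(mean((y - fitted_values)^2))
print(c(start = rmse_start, end = rmse_end))
```

Each tree only corrects a fraction (the shrinkage) of the current residuals, so a smaller shrinkage usually needs more trees to reach the same fit.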

grid <- expand.grid(interaction.depth = c(3, 6),
                    n.trees = c(150, 300),
                    shrinkage = c(0.1), 
                    n.minobsinnode = c(10, 15))
set.seed(123)
gbm_RCV <- train(formula, data = train, distribution = "gaussian", method = "gbm",
                trControl = train_control_RCV, tuneGrid = grid)
print(gbm_RCV)
## Stochastic Gradient Boosting 
## 
## 1460 samples
##  111 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.minobsinnode  n.trees  RMSE      Rsquared   MAE     
##   3                  10              150      30703.16  0.8520090  18083.20
##   3                  10              300      30242.15  0.8559285  17790.39
##   3                  15              150      30101.27  0.8571943  18052.16
##   3                  15              300      29814.07  0.8596072  17780.00
##   6                  10              150      30767.00  0.8510928  17545.10
##   6                  10              300      30686.39  0.8512630  17520.70
##   6                  15              150      29571.74  0.8621849  17444.02
##   6                  15              300      29721.45  0.8609000  17576.79
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  6, shrinkage = 0.1 and n.minobsinnode = 15.

After creating the model, let’s see which variables are the most important:

For this model too, as shown in the graph, the variable that most influences SalePrice is OverallQual. This time its importance is much larger than that of the others, with more than 45% of the total importance, followed by TotalBsmtSF (total square feet of basement area) with about 12%.

gbm_importance_value <- as.data.frame(summary(gbm_RCV))

gbm_importance_value <- gbm_importance_value %>%
  arrange(desc(rel.inf))
ggplot(gbm_importance_value[1:15,], aes(reorder(var, rel.inf), rel.inf, fill = rel.inf)) +
  geom_bar(stat = "identity", show.legend = F) + 
  coord_flip()+ 
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "GBM - Variables Importance", y = "Importance", x = "") +
  my_theme

Then let’s see how the gradient boosting machine model predicts the train data.

gbm_pred <- predict(gbm_RCV, train_data)

tibble(
  pred = gbm_pred,
  actual = train_target
) %>% 
  ggplot(aes(pred, actual)) +
  geom_point( color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "GBM - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme

Conclusion

To conclude, I will compare the different prediction models that I have fitted and use the best one to predict SalePrice for the test data.

Compare the models’ performance

In order to compare the models’ performance I will evaluate them by R squared, root mean squared error and mean absolute error.

As the table shows, gradient boosting performs best in terms of R squared and MAE, and its RMSE is very close to that of the neural network. Keep in mind that the neural network’s RMSE and MAE were computed on the training data, whereas the other values come from cross-validation, so the comparison slightly favours the neural network.

Model                        R squared   RMSE    MAE
Multiple linear regression   0.8125      34963   20345
Neural network               0.8351      29401   18276
Gradient boosting machine    0.8622      29572   17444

Predict the House prices

As shown above, the best model is the gradient boosting machine, so I will use it to predict the house prices.

price_prediction <- predict(gbm_RCV, test)
analysis_result <- cbind(test_id,price_prediction)
head(analysis_result, 15)
##       test_id price_prediction
##  [1,]    1461         124793.5
##  [2,]    1462         163828.6
##  [3,]    1463         168015.1
##  [4,]    1464         186131.6
##  [5,]    1465         186922.5
##  [6,]    1466         190698.5
##  [7,]    1467         178203.0
##  [8,]    1468         161966.9
##  [9,]    1469         177181.7
## [10,]    1470         119881.7
## [11,]    1471         200224.3
## [12,]    1472         105117.0
## [13,]    1473         102322.0
## [14,]    1474         152281.4
## [15,]    1475         135077.6
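If the predictions are to be submitted to Kaggle, they have to be written to a CSV file. The sketch below assumes the competition expects two columns named Id and SalePrice. It uses only the first three predictions shown above as toy input, and the file location is a temporary directory chosen for illustration.

```r
# Toy input: the first three rows of the output above
toy_ids  <- c(1461, 1462, 1463)
toy_pred <- c(124793.5, 163828.6, 168015.1)

submission <- data.frame(Id = toy_ids, SalePrice = toy_pred)

# Write without row names so the file contains only the two columns
out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)

# Read it back as a quick sanity check
read_back <- read.csv(out_file)
print(read_back)
```

With the real objects, the same pattern would use `test_id` and `price_prediction`. It is also worth checking that the number of rows matches the number of houses in the test set before submitting.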