The dataset for my prediction project consists of various house attributes; the sale price is included only in the train set.
For this project I’m going to estimate the sale price of houses using different models.
To do that, I will use a train set of house characteristics to predict the variable of interest (SalePrice) for another dataset of houses that do not yet have a price.
Both datasets are provided by Kaggle.
I will merge them so that the same data cleaning and manipulation is applied to both.
The train set has 1460 observations (rows) and the test set has 1459; both share the same 80 columns (the train set additionally contains SalePrice), and the features are described below:
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property
Grvl Gravel
Pave Paved
Alley: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
LotShape: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
Names North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition1: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
BldgType: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior1st: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
MasVnrType: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
ExterCond: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
BsmtQual: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
BsmtCond: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
BsmtExposure: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFinType1: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
CentralAir: Central air conditioning
N No
Y Yes
Electrical: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
Bedroom: Bedrooms above grade (does NOT include basement bedrooms)
Kitchen: Kitchens above grade
KitchenQual: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
GarageType: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
GarageCond: Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
PavedDrive: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold (MM)
YrSold: Year Sold (YYYY)
SaleType: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth OtherSaleCondition: Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
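Partial Home was not completed when last assessed (associated with New Homes)
# Add a placeholder SalePrice to the test set so both sets have the same columns
test$SalePrice <- 0
# Stack train and test (bind_rows() and glimpse() come from dplyr)
houses <- bind_rows(train, test)
houses %>% glimpse()
## Rows: 2,919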
## Columns: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,~
## $ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R~
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, ~
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612~
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ~
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", ~
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", ~
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu~
## $ LotConfig <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I~
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", ~
## $ Neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "~
## $ Condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",~
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ~
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ~
## $ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi~
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,~
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,~
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19~
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19~
## $ RoofStyle <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G~
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "~
## $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "~
## $ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "~
## $ MasVnrType <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",~
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, ~
## $ ExterQual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ Foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "~
## $ BsmtQual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T~
## $ BsmtCond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T~
## $ BsmtExposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N~
## $ BsmtFinType1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", ~
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99~
## $ BsmtFinType2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", ~
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17~
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10~
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ~
## $ HeatingQC <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E~
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S~
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, ~
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,~
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10~
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,~
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,~
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,~
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,~
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,~
## $ KitchenQual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6~
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", ~
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,~
## $ FireplaceQu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", ~
## $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch~
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19~
## $ GarageFinish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", ~
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,~
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7~
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G~
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160~
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,~
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, ~
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, ~
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,~
## $ MiscFeature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, ~
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,~
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10~
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20~
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W~
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm~
## $ SalePrice <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, ~
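Since the two sets are stacked only for cleaning, they have to be separated again before modelling. The snippet below is a minimal sketch of one way to do that, using small made-up data frames; the `is_train` flag and the `houses_toy` name are my own illustrative assumptions, not part of the Kaggle data or of the code above.

```r
# Toy stand-ins for the Kaggle train and test sets (values are made up).
train <- data.frame(Id = 1:3,
                    LotArea = c(8450, 9600, 11250),
                    SalePrice = c(208500, 181500, 223500))
test <- data.frame(Id = 4:5,
                   LotArea = c(11622, 14267))

# Tag each row with its origin before stacking (hypothetical column name).
train$is_train <- TRUE
test$is_train <- FALSE
test$SalePrice <- 0  # same placeholder as in the text

# Stack the two sets with the columns in the same order.
houses_toy <- rbind(train, test[, names(train)])

# ... shared cleaning would happen here ...

# Split back using the flag rather than relying on row positions.
train_clean <- houses_toy[houses_toy$is_train, ]
test_clean <- houses_toy[!houses_toy$is_train, ]
```

With the real data, the same split could also be done by position (the first 1460 rows are the train set), but a flag keeps working if rows are reordered or removed during cleaning.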
Accurate estimation helps identify the price at which to sell a house in order to maximize profit and sell as quickly as possible.
Thus, the purpose of this project is to build several models using different machine learning techniques in order to estimate the value of houses.
I examine how much the different models improve prediction accuracy in terms of Mean Squared Error and R squared.
Furthermore, I will examine the importance of the house characteristics and how it changes across the different models.
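For reference, the two accuracy measures mentioned above can be written in a few lines. This is only a sketch with made-up prices; the function names `mse` and `r_squared` are my own and are not taken from a package.

```r
# Hypothetical observed and predicted sale prices (made-up numbers).
observed <- c(200000, 150000, 300000, 250000)
predicted <- c(210000, 140000, 290000, 260000)

# Mean Squared Error: average of the squared prediction errors.
mse <- function(y, y_hat) mean((y - y_hat)^2)

# R squared: 1 minus residual sum of squares over total sum of squares.
r_squared <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

mse(observed, predicted)
r_squared(observed, predicted)
```

A lower MSE and an R squared closer to 1 both indicate predictions that are closer to the observed prices.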
First, I check that there are no duplicates (removing the Id so that identical houses can be recognized as duplicates).
houses[,-1][duplicated(houses)]
## data frame with 0 columns and 2919 rows
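One caveat about the check above: `duplicated()` returns one logical value per row, and with a single index inside `[ ]` a data frame selects columns, which is why the output reports 0 columns. The Id is also still part of the comparison. A row-wise variant could look like the sketch below, shown on a made-up data frame (the `toy` data and the `dup_rows` name are assumptions for illustration).

```r
# Toy data: rows 1 and 3 describe the same house apart from the Id.
toy <- data.frame(Id = 1:4,
                  LotArea = c(8450, 9600, 8450, 11250),
                  Street = c("Pave", "Pave", "Pave", "Grvl"))

# Drop Id before comparing, then use the logical vector as a ROW index.
dup_rows <- duplicated(toy[, names(toy) != "Id"])
toy[dup_rows, ]

# Number of duplicated rows found.
n_dup <- sum(dup_rows)
```

On the merged data this would be `houses[duplicated(houses[, -1]), ]`; I have not verified what it returns there.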
From the dataset description I saw that houses without a basement have “NA”, so let’s replace it with “NoB” in all the related variables.
I also see that houses without a garage have “NA”, so let’s substitute it with “NoG” in all the related variables. To avoid dropping the houses without a garage from the sample, I also have to replace the “NA” values of the garage year, either with 0 (used here) or with the mean of the other houses.
Furthermore, from the dataset description I also see that houses without alley access, fireplace, pool, fence or miscellaneous feature have “NA”, so I replace them with “NoAll”, “NoFp”, “NoPo”, “NoFen” and “None”, respectively.
houses[,c(31:34,36)][is.na(houses[,c(31:34,36)])] <- "NoB"
houses[,c(59,61,64,65)][is.na(houses[,c(59,61,64,65)])] <- "NoG"
gar_year <- na.omit(houses[,60])
houses[,60][is.na(houses[,60])] <- 0
#or
#houses[,60][is.na(houses[,60])] <- as.integer(mean(gar_year))
houses[,7][is.na(houses[,7])] <- "NoAll"
houses[,58][is.na(houses[,58])] <- "NoFp"
houses[,73][is.na(houses[,73])] <- "NoPo"
houses[,74][is.na(houses[,74])] <- "NoFen"
houses[,75][is.na(houses[,75])] <- "None"
Before continuing, I check whether and where “NA” values are still present.
First, I check whether any variable has more than 1% “NA” values (the threshold used in the loop below).
NA_values <- matrix(NA,81,2)
for(i in 1:81){
NA_values[i,1] <- sum(is.na(houses[,i]))/2919*100
}
var_with_NA <- NA
for(i in 1:81){
if(NA_values[i,1] > 1){
var_with_NA[i] <- i
}
}
var_with_NA
## [1] NA NA NA 4
var_with_NA <- 4
I found many “NA” values in the LotFrontage variable, so let’s replace them with 0.
houses[,4][is.na(houses[,4])] <- 0
Now I check how many “NA” values are still present in the dataset and inspect them one by one.
sum(is.na(houses))
## [1] 70
which(is.na(houses), arr.ind=TRUE)
## row col
## [1,] 1916 3
## [2,] 2217 3
## [3,] 2251 3
## [4,] 2905 3
## [5,] 1916 10
## [6,] 1946 10
## [7,] 2152 24
## [8,] 2152 25
## [9,] 235 26
## [10,] 530 26
## [11,] 651 26
## [12,] 937 26
## [13,] 974 26
## [14,] 978 26
## [15,] 1244 26
## [16,] 1279 26
## [17,] 1692 26
## [18,] 1707 26
## [19,] 1883 26
## [20,] 1993 26
## [21,] 2005 26
## [22,] 2042 26
## [23,] 2312 26
## [24,] 2326 26
## [25,] 2341 26
## [26,] 2350 26
## [27,] 2369 26
## [28,] 2593 26
## [29,] 2611 26
## [30,] 2658 26
## [31,] 2687 26
## [32,] 2863 26
## [33,] 235 27
## [34,] 530 27
## [35,] 651 27
## [36,] 937 27
## [37,] 974 27
## [38,] 978 27
## [39,] 1244 27
## [40,] 1279 27
## [41,] 1692 27
## [42,] 1707 27
## [43,] 1883 27
## [44,] 1993 27
## [45,] 2005 27
## [46,] 2042 27
## [47,] 2312 27
## [48,] 2326 27
## [49,] 2341 27
## [50,] 2350 27
## [51,] 2369 27
## [52,] 2593 27
## [53,] 2658 27
## [54,] 2687 27
## [55,] 2863 27
## [56,] 2121 35
## [57,] 2121 37
## [58,] 2121 38
## [59,] 2121 39
## [60,] 1380 43
## [61,] 2121 48
## [62,] 2189 48
## [63,] 2121 49
## [64,] 2189 49
## [65,] 1556 54
## [66,] 2217 56
## [67,] 2474 56
## [68,] 2577 62
## [69,] 2577 63
## [70,] 2490 79
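The positions above are easier to read when summarised per column name. As a sketch on a made-up data frame (the `toy` data and the 30% cut-off are assumptions for illustration), `colSums()` and `colMeans()` give the count and the percentage of “NA” per variable without an explicit loop.

```r
# Toy data with a few missing values (made-up numbers).
toy <- data.frame(MSZoning = c("RL", NA, "RM", "RL"),
                  LotFrontage = c(65, NA, NA, 60),
                  LotArea = c(8450, 9600, 11250, 9550))

# Number of NA per column and percentage of NA per column.
na_count <- colSums(is.na(toy))
na_share <- 100 * colMeans(is.na(toy))

# Columns that still contain NA, with their counts.
na_count[na_count > 0]

# Names of the columns above an illustrative 30% threshold.
names(na_share)[na_share > 30]
```

Applied to `houses`, the same two lines would reproduce the percentage loop used earlier and label each result with the variable name instead of a column number.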
MSZoning: let’s assign the “NA” values to the most common category.
prop.table(table(houses$MSZoning))
##
## C (all) FV RH RL RM
## 0.008576329 0.047684391 0.008919383 0.777015437 0.157804460
houses[,3][is.na(houses[,3])] <- "RL"
Utilities: 99.9% of the houses have this variable at “AllPub”, so I replace NA with that value (later I’ll delete this variable).
prop.table(table(houses$Utilities))
##
## AllPub NoSeWa
## 0.999657182 0.000342818
houses[,10][is.na(houses[,10])] <- "AllPub"
Exterior1st and Exterior2nd: let’s replace both with the level “Other”, already present as an Exterior2nd level (later I’ll group the less frequent Exterior1st levels).
prop.table(table(houses$Exterior1st))
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd
## 0.0150788211 0.0006854010 0.0020562029 0.0298149417 0.0006854010 0.0431802605
## HdBoard ImStucc MetalSd Plywood Stone Stucco
## 0.1514736121 0.0003427005 0.1542152159 0.0757368060 0.0006854010 0.0147361206
## VinylSd Wd Sdng WdShing
## 0.3512679918 0.1408498972 0.0191912269
houses$Exterior1st <- as.character(houses$Exterior1st)
houses[,24][is.na(houses[,24])] <- "Other"
houses$Exterior1st <- as.factor(houses$Exterior1st)
houses[,25][is.na(houses[,25])] <- "Other"
houses[,26][is.na(houses[,26])] <- "None"
houses[,27][is.na(houses[,27])] <- 0
houses[2121,]
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 2121 2121 20 RM 99 5940 Pave NoAll IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 2121 Lvl AllPub FR3 Gtl BrkSide Feedr
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 2121 Norm 1Fam 1Story 4 7 1946
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 2121 1950 Gable CompShg MetalSd CBlock None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 2121 0 TA TA PConc NoB NoB NoB
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 2121 NoB NA NoB NA NA NA
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 2121 GasA TA Y FuseA 896 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 2121 896 NA NA 1 0 2
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 2121 1 TA 4 Typ 0 NoFp
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 2121 Detchd 1946 Unf 1 280 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 2121 TA Y 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 2121 0 0 NoPo MnPrv None 0 4 2008
## SaleType SaleCondition SalePrice
## 2121 ConLD Abnorml 0
houses[,c(35,37:39,48,49)][is.na(houses[,c(35,37:39,48,49)])] <- 0
Electrical: let’s replace it with “Other”, a level that I will create later in order to merge the less frequent Electrical levels.
prop.table(table(houses$Electrical))
##
## FuseA FuseF FuseP Mix SBrkr
## 0.0644276902 0.0171350240 0.0027416038 0.0003427005 0.9153529815
houses[,43][is.na(houses[,43])] <- "Other"
KitchenQual: let’s replace it with “TA”, which means Typical, the most frequent level of this variable.
prop.table(table(houses$KitchenQual))
##
## Ex Fa Gd TA
## 0.07025360 0.02398903 0.39444825 0.51130912
houses[,54][is.na(houses[,54])] <- "TA"
Functional: let’s replace it with “Typ”, which means Typical Functionality, the most frequent level of this variable.
prop.table(table(houses$Functional))
##
## Maj1 Maj2 Min1 Min2 Mod Sev
## 0.0065135413 0.0030853617 0.0222831676 0.0239972575 0.0119986287 0.0006856359
## Typ
## 0.9314364073
houses[,56][is.na(houses[,56])] <- "Typ"
GarageCars and GarageArea: let’s replace both with 0, as for the other houses without a garage.
houses[,c(62,63)][is.na(houses[,c(62,63)])] <- 0
SaleType: let’s replace it with “Oth”, a level that is already present.
prop.table(table(houses$SaleType))
##
## COD Con ConLD ConLI ConLw CWD
## 0.029814942 0.001713502 0.008910212 0.003084304 0.002741604 0.004112406
## New Oth WD
## 0.081905415 0.002398903 0.865318711
houses[,79][is.na(houses[,79])] <- "Oth"
Finally, I check that all “NA” values have been replaced.
sum(is.na(houses))
## [1] 0
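Several of the replacements above (MSZoning, KitchenQual, Functional) use the most frequent level of the variable. As a sketch, that rule could be wrapped in a small helper; `impute_mode` is a hypothetical name of my own, and the example vector is made up. Ties are resolved by taking the first level returned by `table()`.

```r
# Replace NA in a character vector with its most frequent value.
impute_mode <- function(x) {
  counts <- table(x)  # table() leaves NA out by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Made-up zoning codes with two missing entries.
zoning <- c("RL", "RM", NA, "RL", "FV", NA)
impute_mode(zoning)
```

On the real data this would read, for example, `houses$MSZoning <- impute_mode(houses$MSZoning)`, which avoids hard-coding both the column number and the replacement level.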
houses$MSSubClass <- as.factor(houses$MSSubClass)
houses$MSZoning <- as.factor(houses$MSZoning)
houses$Street <- as.factor(houses$Street)
houses$Alley <- as.factor(houses$Alley)
houses$LotShape <- as.factor(houses$LotShape)
houses$LandContour <- as.factor(houses$LandContour)
houses$Utilities <- as.factor(houses$Utilities)
houses$LotConfig <- as.factor(houses$LotConfig)
houses$LandSlope <- as.factor(houses$LandSlope)
houses$Neighborhood <- as.factor(houses$Neighborhood)
houses$Condition1 <- as.factor(houses$Condition1)
houses$Condition2 <- as.factor(houses$Condition2)
houses$BldgType <- as.factor(houses$BldgType)
houses$HouseStyle <- as.factor(houses$HouseStyle)
houses$RoofStyle <- as.factor(houses$RoofStyle)
houses$RoofMatl <- as.factor(houses$RoofMatl)
houses$Exterior1st <- as.factor(houses$Exterior1st)
houses$Exterior2nd <- as.factor(houses$Exterior2nd)
houses$MasVnrType <- as.factor(houses$MasVnrType)
houses$ExterQual <- as.factor(houses$ExterQual)
houses$ExterCond <- as.factor(houses$ExterCond)
houses$Foundation <- as.factor(houses$Foundation)
houses$BsmtQual <- as.factor(houses$BsmtQual)
houses$BsmtCond <- as.factor(houses$BsmtCond)
houses$BsmtExposure <- as.factor(houses$BsmtExposure)
houses$BsmtFinType1 <- as.factor(houses$BsmtFinType1)
houses$BsmtFinType2 <- as.factor(houses$BsmtFinType2)
houses$Heating <- as.factor(houses$Heating)
houses$HeatingQC <- as.factor(houses$HeatingQC)
houses$CentralAir <- as.factor(houses$CentralAir)
houses$Electrical <- as.factor(houses$Electrical)
houses$KitchenQual <- as.factor(houses$KitchenQual)
houses$Functional <- as.factor(houses$Functional)
houses$FireplaceQu <- as.factor(houses$FireplaceQu)
houses$GarageType <- as.factor(houses$GarageType)
houses$GarageFinish <- as.factor(houses$GarageFinish)
houses$GarageQual <- as.factor(houses$GarageQual)
houses$GarageCond <- as.factor(houses$GarageCond)
houses$PavedDrive <- as.factor(houses$PavedDrive)
houses$PoolQC <- as.factor(houses$PoolQC)
houses$Fence <- as.factor(houses$Fence)
houses$MiscFeature <- as.factor(houses$MiscFeature)
houses$SaleType <- as.factor(houses$SaleType)
houses$SaleCondition <- as.factor(houses$SaleCondition)
factorial_variables <- c(2,3,6:17,22:26,28:34,36,40:43,54,56,58,59,61,64:66,73:75,79,80)
str(houses[,factorial_variables])
## 'data.frame': 2919 obs. of 44 variables:
## $ MSSubClass : Factor w/ 16 levels "20","30","40",..: 6 1 6 7 6 5 1 6 5 16 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 3 levels "Grvl","NoAll",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 15 14 14 14 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 3 3 5 3 3 1 3 5 5 ...
## $ BsmtCond : Factor w/ 5 levels "Fa","Gd","NoB",..: 5 5 5 2 5 5 5 5 5 5 ...
## $ BsmtExposure : Factor w/ 5 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 7 3 ...
## $ BsmtFinType2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 7 7 7 7 7 7 2 7 7 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "FuseA","FuseF",..: 6 6 6 6 6 6 6 6 2 6 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ FireplaceQu : Factor w/ 6 levels "Ex","Fa","Gd",..: 4 6 6 3 6 4 3 6 6 6 ...
## $ GarageType : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageFinish : Factor w/ 4 levels "Fin","NoG","RFn",..: 3 3 3 4 3 4 3 3 4 3 ...
## $ GarageQual : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 2 3 ...
## $ GarageCond : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ PoolQC : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Fence : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
## $ MiscFeature : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
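The long list of `as.factor()` calls above could be shortened by converting every character column in one step. The sketch below shows the idea on a made-up data frame (`toy` and `is_chr` are illustrative names). Note that MSSubClass is stored as an integer in the real data, so it would still need its own explicit conversion.

```r
# Toy data with two character columns and one numeric column.
toy <- data.frame(MSZoning = c("RL", "RM", "RL"),
                  LotArea = c(8450, 9600, 11250),
                  Street = c("Pave", "Pave", "Grvl"),
                  stringsAsFactors = FALSE)

# Find the character columns and convert only those to factors.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], as.factor)

str(toy)
```

The positions of the converted columns can then be recovered with `which(vapply(toy, is.factor, logical(1)))` instead of typing the indices by hand.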
houses %>%
gather(Attributes, value, factorial_variables[1:9]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[10:18]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[19:27]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[28:36]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[37:44]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
Street and Utilities take almost always the same value, Neighborhood has too many levels, and MSSubClass has many levels that are hard to interpret and can create bias, so I delete them.
houses <- houses[,-which(names(houses) == "Street")]
houses <- houses[,-which(names(houses) == "Utilities")]
houses <- houses[,-which(names(houses) == "MSSubClass")]
LandSlope –> I’ll merge “Mod” and “Sev” as “ModSev” = Moderate/Severe Slope.
prop.table(table(houses$LandSlope))
##
## Gtl Mod Sev
## 0.951695786 0.042822885 0.005481329
houses <- transform(houses, LandSlope=revalue(LandSlope,c("Mod" = "ModSev")))
houses <- transform(houses, LandSlope=revalue(LandSlope,c("Sev" = "ModSev")))
prop.table(table(houses$LandSlope))
##
## Gtl ModSev
## 0.95169579 0.04830421
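The pairs of revalue() calls used for merges like this one can also be written as a single assignment. A minimal sketch in base R, assuming the list form of `levels<-` and using a made-up factor (the values below are illustrative and are not taken from the houses data):

```r
# Toy factor with the same levels as LandSlope (values are made up)
land_slope <- factor(c("Gtl", "Mod", "Gtl", "Sev", "Gtl", "Mod"))

# Each list element is a new level; its value lists the old levels it absorbs
levels(land_slope) <- list(Gtl = "Gtl", ModSev = c("Mod", "Sev"))

print(table(land_slope))
```

With this form every old level has to appear somewhere in the list, because levels that are left out become NA.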
LotConfig –> I’ll merge “FR2” and “FR3” as “FR2/3” = Frontage on 2/3 sides of property.
prop.table(table(houses$LotConfig))
##
## Corner CulDSac FR2 FR3 Inside
## 0.175059952 0.060294621 0.029119561 0.004796163 0.730729702
houses <- transform(houses, LotConfig=revalue(LotConfig,c("FR2" = "FR2/3")))
houses <- transform(houses, LotConfig=revalue(LotConfig,c("FR3" = "FR2/3")))
prop.table(table(houses$LotConfig))
##
## Corner CulDSac FR2/3 Inside
## 0.17505995 0.06029462 0.03391572 0.73072970
LotShape –> I’ll merge “IR2” and “IR3” as “IR2” = Moderately or more Irregular.
prop.table(table(houses$LotShape))
##
## IR1 IR2 IR3 Reg
## 0.331620418 0.026036314 0.005481329 0.636861939
houses <- transform(houses, LotShape=revalue(LotShape,c("IR3" = "IR2")))
prop.table(table(houses$LotShape))
##
## IR1 IR2 Reg
## 0.33162042 0.03151764 0.63686194
MSZoning –> I’ll merge “RM” and “RH” as “RMH” = Residential Medium/High Density.
prop.table(table(houses$MSZoning))
##
## C (all) FV RH RL RM
## 0.008564577 0.047619048 0.008907160 0.777321000 0.157588215
houses <- transform(houses, MSZoning=revalue(MSZoning,c("RM" = "RMH")))
houses <- transform(houses, MSZoning=revalue(MSZoning,c("RH" = "RMH")))
prop.table(table(houses$MSZoning))
##
## C (all) FV RMH RL
## 0.008564577 0.047619048 0.166495375 0.777321000
I remove Condition1 and Condition2 because both have many levels that are difficult to interpret, and some of those levels have no observations. I have also removed RoofMatl because many of its categories are empty and almost all the observations have the same value. Then I have deleted Exterior2nd because it is very similar to Exterior1st, and Neighborhood because it has too many levels.
houses <- houses[,-which(names(houses) == "Condition1")]
houses <- houses[,-which(names(houses) == "Condition2")]
houses <- houses[,-which(names(houses) == "RoofMatl")]
houses <- houses[,-which(names(houses) == "Exterior2nd")]
houses <- houses[,-which(names(houses) == "Neighborhood")]
BldgType –> I’ll merge “TwnhsE” with “Twnhs” = Townhouse and also merge “2fmCon” and “Duplex” as “2Fam” = Two-family.
prop.table(table(houses$BldgType))
##
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 0.83076396 0.02124015 0.03734156 0.03288798 0.07776636
houses <- transform(houses, BldgType=revalue(BldgType,c("TwnhsE" = "Twnhs")))
houses <- transform(houses, BldgType=revalue(BldgType,c("2fmCon" = "2Fam")))
houses <- transform(houses, BldgType=revalue(BldgType,c("Duplex" = "2Fam")))
prop.table(table(houses$BldgType))
##
## 1Fam 2Fam Twnhs
## 0.83076396 0.05858171 0.11065433
Exterior1st –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$Exterior1st))
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd
## 0.0150736554 0.0006851662 0.0020554985 0.0298047276 0.0006851662 0.0431654676
## HdBoard ImStucc MetalSd Other Plywood Stone
## 0.1514217198 0.0003425831 0.1541623844 0.0003425831 0.0757108599 0.0006851662
## Stucco VinylSd Wd Sdng WdShing
## 0.0147310723 0.3511476533 0.1408016444 0.0191846523
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("AsbShng" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("AsphShn" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("BrkComm" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("BrkFace" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("CBlock" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("CemntBd" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("ImStucc" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("Stone" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("Stucco" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("WdShing" = "Other")))
prop.table(table(houses$Exterior1st))
##
## Other HdBoard MetalSd Plywood VinylSd Wd Sdng
## 0.12675574 0.15142172 0.15416238 0.07571086 0.35114765 0.14080164
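The "merge everything below 5% into Other" rule is applied to several variables, so it can be wrapped in a small helper. The sketch below is only an illustration: `lump_rare` is a hypothetical name (it is not part of plyr or of the original analysis) and the toy factor is made up.

```r
# Hypothetical helper: merge every level whose share is below min_prop
lump_rare <- function(x, min_prop = 0.05, other = "Other") {
  x <- as.factor(x)
  props <- prop.table(table(x))              # share of each level
  rare <- names(props)[props < min_prop]     # levels under the threshold
  # Assigning the same name to several levels merges them into one
  levels(x)[levels(x) %in% rare] <- other
  x
}

# Toy factor: "c", "d" and "e" each cover 1% of the observations
x <- factor(c(rep("a", 60), rep("b", 37), "c", "d", "e"))
lumped <- lump_rare(x)
print(prop.table(table(lumped)))
```

If the factor already has an “Other” level, as Exterior1st does, the rare levels are simply added to it.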
HouseStyle –> I’ll merge “1Story”, “1.5Fin” and “1.5Unf” as “1aH” = One or one and one-half story, merge “2Story”, “2.5Fin” and “2.5Unf” as “2aH” = Two or two and one-half story, and also merge “SFoyer” and “SLvl” as “SFL” = Split Foyer or Split Level.
prop.table(table(houses$HouseStyle))
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story
## 0.107571086 0.006509078 0.503939705 0.002740665 0.008221994 0.298732443
## SFoyer SLvl
## 0.028434395 0.043850634
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1Story" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1.5Fin" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1.5Unf" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2Story" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2.5Fin" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2.5Unf" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("SFoyer" = "SFL")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("SLvl" = "SFL")))
prop.table(table(houses$HouseStyle))
##
## 1aH 2aH SFL
## 0.61801987 0.30969510 0.07228503
RoofStyle –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$RoofStyle))
##
## Flat Gable Gambrel Hip Mansard Shed
## 0.006851662 0.791366906 0.007536828 0.188763275 0.003768414 0.001712915
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Flat" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Gambrel" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Mansard" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Shed" = "Other")))
prop.table(table(houses$RoofStyle))
##
## Other Gable Hip
## 0.01986982 0.79136691 0.18876328
BsmtCond –> I’ll merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$BsmtCond))
##
## Fa Gd NoB Po TA
## 0.035628640 0.041795135 0.028091812 0.001712915 0.892771497
houses <- transform(houses, BsmtCond=revalue(BsmtCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, BsmtCond=revalue(BsmtCond,c("Po" = "Fa_Po")))
prop.table(table(houses$BsmtCond))
##
## Fa_Po Gd NoB TA
## 0.03734156 0.04179514 0.02809181 0.89277150
BsmtFinType1 –> I’ll merge “ALQ”, “BLQ” and “GLQ” as “LQ” = Living Quarters and also merge “LwQ” and “Unf” as “LwQ_Unf” = Low Quality or Unfinished.
prop.table(table(houses$BsmtFinType1))
##
## ALQ BLQ GLQ LwQ NoB Rec Unf
## 0.14696814 0.09215485 0.29085303 0.05275779 0.02706406 0.09866393 0.29153820
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("ALQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("BLQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("GLQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("LwQ" = "LwQ_Unf")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("Unf" = "LwQ_Unf")))
prop.table(table(houses$BsmtFinType1))
##
## LQ LwQ_Unf NoB Rec
## 0.52997602 0.34429599 0.02706406 0.09866393
BsmtFinType2 –> I’ll merge “ALQ”, “BLQ” and “GLQ” as “LQ” = Living Quarters and also merge “LwQ” and “Unf” as “LwQ_Unf” = Low Quality or Unfinished.
prop.table(table(houses$BsmtFinType2))
##
## ALQ BLQ GLQ LwQ NoB Rec Unf
## 0.01781432 0.02329565 0.01164782 0.02980473 0.02740665 0.03597122 0.85405961
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("ALQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("BLQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("GLQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("LwQ" = "LwQ_Unf")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("Unf" = "LwQ_Unf")))
prop.table(table(houses$BsmtFinType2))
##
## LQ LwQ_Unf NoB Rec
## 0.05275779 0.88386434 0.02740665 0.03597122
ExterCond –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$ExterCond))
##
## Ex Fa Gd Po TA
## 0.004110997 0.022953066 0.102432340 0.001027749 0.869475848
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Ex" = "Ex_Gd")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Gd" = "Ex_Gd")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Po" = "Fa_Po")))
prop.table(table(houses$ExterCond))
##
## Ex_Gd Fa_Po TA
## 0.10654334 0.02398082 0.86947585
ExterQual –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good.
prop.table(table(houses$ExterQual))
##
## Ex Fa Gd TA
## 0.03665639 0.01199041 0.33538883 0.61596437
houses <- transform(houses, ExterQual=revalue(ExterQual,c("Ex" = "Ex_Gd")))
houses <- transform(houses, ExterQual=revalue(ExterQual,c("Gd" = "Ex_Gd")))
prop.table(table(houses$ExterQual))
##
## Ex_Gd Fa TA
## 0.37204522 0.01199041 0.61596437
Foundation –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$Foundation))
##
## BrkTil CBlock PConc Slab Stone Wood
## 0.106543337 0.423090099 0.448098664 0.016786571 0.003768414 0.001712915
houses <- transform(houses, Foundation=revalue(Foundation,c("Slab" = "Other")))
houses <- transform(houses, Foundation=revalue(Foundation,c("Stone" = "Other")))
houses <- transform(houses, Foundation=revalue(Foundation,c("Wood" = "Other")))
prop.table(table(houses$Foundation))
##
## BrkTil CBlock PConc Other
## 0.1065433 0.4230901 0.4480987 0.0222679
MasVnrType –> I’ll merge “BrkCmn” and “BrkFace” as “Brk” = Brick.
prop.table(table(houses$MasVnrType))
##
## BrkCmn BrkFace None Stone
## 0.008564577 0.301130524 0.605001713 0.085303186
houses <- transform(houses, MasVnrType=revalue(MasVnrType,c("BrkCmn" = "Brk")))
houses <- transform(houses, MasVnrType=revalue(MasVnrType,c("BrkFace" = "Brk")))
prop.table(table(houses$MasVnrType))
##
## Brk None Stone
## 0.30969510 0.60500171 0.08530319
I remove Heating because almost all houses use gas heating and some levels are empty.
houses <- houses[,-which(names(houses) == "Heating")]
Electrical –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$Electrical))
##
## FuseA FuseF FuseP Mix Other SBrkr
## 0.0644056184 0.0171291538 0.0027406646 0.0003425831 0.0003425831 0.9150393971
houses <- transform(houses, Electrical=revalue(Electrical,c("FuseF" = "Other")))
houses <- transform(houses, Electrical=revalue(Electrical,c("FuseP" = "Other")))
houses <- transform(houses, Electrical=revalue(Electrical,c("Mix" = "Other")))
prop.table(table(houses$Electrical))
##
## FuseA Other SBrkr
## 0.06440562 0.02055498 0.91503940
FireplaceQu –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$FireplaceQu))
##
## Ex Fa Gd NoFp Po TA
## 0.01473107 0.02535115 0.25488181 0.48646797 0.01575882 0.20280918
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Ex" = "Ex_Gd")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Gd" = "Ex_Gd")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Fa" = "Fa_Po")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Po" = "Fa_Po")))
prop.table(table(houses$FireplaceQu))
##
## Ex_Gd Fa_Po NoFp TA
## 0.26961288 0.04110997 0.48646797 0.20280918
Functional –> I’ll merge “Maj1”, “Maj2” and “Mod” as “Maj” = Major Deductions and also merge “Min1”, “Min2” and “Sev” as “Min” = Minor Deductions.
prop.table(table(houses$Functional))
##
## Maj1 Maj2 Min1 Min2 Mod Sev
## 0.0065090785 0.0030832477 0.0222679000 0.0239808153 0.0119904077 0.0006851662
## Typ
## 0.9314833847
houses <- transform(houses, Functional=revalue(Functional,c("Maj1" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Maj2" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Mod" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Min1" = "Min")))
houses <- transform(houses, Functional=revalue(Functional,c("Min2" = "Min")))
houses <- transform(houses, Functional=revalue(Functional,c("Sev" = "Min")))
prop.table(table(houses$Functional))
##
## Maj Min Typ
## 0.02158273 0.04693388 0.93148338
GarageType –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$GarageType))
##
## 2Types Attchd Basment BuiltIn CarPort Detchd
## 0.007879411 0.590270641 0.012332991 0.063720452 0.005138746 0.266872217
## NoG
## 0.053785543
houses <- transform(houses, GarageType=revalue(GarageType,c("2Types" = "Other")))
houses <- transform(houses, GarageType=revalue(GarageType,c("Basment" = "Other")))
houses <- transform(houses, GarageType=revalue(GarageType,c("CarPort" = "Other")))
prop.table(table(houses$GarageType))
##
## Other Attchd BuiltIn Detchd NoG
## 0.02535115 0.59027064 0.06372045 0.26687222 0.05378554
HeatingQC –> I’ll merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$HeatingQC))
##
## Ex Fa Gd Po TA
## 0.511476533 0.031517643 0.162384378 0.001027749 0.293593696
houses <- transform(houses, HeatingQC=revalue(HeatingQC,c("Fa" = "Fa_Po")))
houses <- transform(houses, HeatingQC=revalue(HeatingQC,c("Po" = "Fa_Po")))
prop.table(table(houses$HeatingQC))
##
## Ex Fa_Po Gd TA
## 0.51147653 0.03254539 0.16238438 0.29359370
Fence –> I’ll merge “GdPrv” and “GdWo” as “GdPrvWo” = Good privacy or Good wood and also merge “MnPrv” and “MnWw” as “MnPrvWw” = Minimum privacy or Minimum Wood/Wire.
prop.table(table(houses$Fence))
##
## GdPrv GdWo MnPrv MnWw NoFen
## 0.040424803 0.038369305 0.112709832 0.004110997 0.804385063
houses <- transform(houses, Fence=revalue(Fence,c("GdPrv" = "GdPrvWo")))
houses <- transform(houses, Fence=revalue(Fence,c("GdWo" = "GdPrvWo")))
houses <- transform(houses, Fence=revalue(Fence,c("MnPrv" = "MnPrvWw")))
houses <- transform(houses, Fence=revalue(Fence,c("MnWw" = "MnPrvWw")))
prop.table(table(houses$Fence))
##
## GdPrvWo MnPrvWw NoFen
## 0.07879411 0.11682083 0.80438506
GarageCond and GarageQual look very similar, so I have decided to keep only GarageCond and delete GarageQual, since most of its values are equal to those of GarageCond. GarageCond –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$GarageCond))
##
## Ex Fa Gd NoG Po TA
## 0.001027749 0.025351148 0.005138746 0.054470709 0.004796163 0.909215485
prop.table(table(houses$GarageQual))
##
## Ex Fa Gd NoG Po TA
## 0.001027749 0.042480301 0.008221994 0.054470709 0.001712915 0.892086331
houses <- houses[,-which(names(houses) == "GarageQual")]
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Ex" = "Ex_Gd")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Gd" = "Ex_Gd")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Po" = "Fa_Po")))
prop.table(table(houses$GarageCond))
##
## Ex_Gd Fa_Po NoG TA
## 0.006166495 0.030147311 0.054470709 0.909215485
MiscFeature –> I’ll merge “Gar2”, “Othr”, “Shed” and “TenC” as “Yes” = a miscellaneous feature is present.
prop.table(table(houses$MiscFeature))
##
## Gar2 None Othr Shed TenC
## 0.0017129154 0.9640287770 0.0013703323 0.0325453923 0.0003425831
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Gar2" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Othr" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Shed" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("TenC" = "Yes")))
prop.table(table(houses$MiscFeature))
##
## Yes None
## 0.03597122 0.96402878
PavedDrive –> I’ll merge “Y” and “P” as “Y_P” = Paved or Partially Paved.
prop.table(table(houses$PavedDrive))
##
## N P Y
## 0.07399794 0.02124015 0.90476190
houses <- transform(houses, PavedDrive=revalue(PavedDrive,c("Y" = "Y_P")))
houses <- transform(houses, PavedDrive=revalue(PavedDrive,c("P" = "Y_P")))
prop.table(table(houses$PavedDrive))
##
## N Y_P
## 0.07399794 0.92600206
PoolQC –> More than 99% of the houses have no pool. Having a pool is still a plus, so instead of deleting the variable I have decided to turn it into Pool, with two levels: “Yes” and “No”.
prop.table(table(houses$PoolQC))
##
## Ex Fa Gd NoPo
## 0.0013703323 0.0006851662 0.0013703323 0.9965741692
names(houses)[names(houses) == 'PoolQC'] <- 'Pool'
houses <- transform(houses, Pool=revalue(Pool,c("Ex" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("Fa" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("Gd" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("NoPo" = "No")))
prop.table(table(houses$Pool))
##
## Yes No
## 0.003425831 0.996574169
SaleCondition –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$SaleCondition))
##
## Abnorml AdjLand Alloca Family Normal Partial
## 0.065090785 0.004110997 0.008221994 0.015758822 0.822884550 0.083932854
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("AdjLand" = "Other")))
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("Alloca" = "Other")))
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("Family" = "Other")))
prop.table(table(houses$SaleCondition))
##
## Abnorml Other Normal Partial
## 0.06509078 0.02809181 0.82288455 0.08393285
SaleType –> I’ll merge “Con”, “ConLD”, “ConLI”, “ConLw” as “Oth” = Other and also merge “WD” and “CWD” as “WD” = Warranty Deed.
prop.table(table(houses$SaleType))
##
## COD Con ConLD ConLI ConLw CWD
## 0.029804728 0.001712915 0.008907160 0.003083248 0.002740665 0.004110997
## New Oth WD
## 0.081877355 0.002740665 0.865022268
houses <- transform(houses, SaleType=revalue(SaleType,c("Con" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLD" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLI" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLw" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("CWD" = "WD")))
prop.table(table(houses$SaleType))
##
## COD Oth WD New
## 0.02980473 0.01918465 0.86913326 0.08187736
continuous_variables <- c(3,4,12:15,19,27,29:31,35:44,46,48,51,53,54,57:62,66:68,71)
str(houses[,continuous_variables])
## 'data.frame': 2919 obs. of 36 variables:
## $ LotFrontage : num 65 80 68 60 84 85 75 0 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ BsmtFinSF1 : num 706 978 486 216 655 ...
## $ BsmtFinSF2 : num 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : num 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : num 856 1262 920 756 1145 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : num 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : num 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ GarageYrBlt : num 2003 1976 2001 1998 2000 ...
## $ GarageCars : num 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : num 548 460 608 642 836 480 636 484 468 205 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SalePrice : num 208500 181500 223500 140000 250000 ...
houses %>%
gather(Attributes, value, continuous_variables[1:9]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, continuous_variables[10:18]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, continuous_variables[19:27]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, continuous_variables[28:36]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
BsmtFinSF2 has too many outliers, so I’ll remove it and consider only the presence or absence of the second basement.
houses <- houses[,-which(names(houses) == "BsmtFinSF2")]
BsmtHalfBath is 0 for almost all houses and is difficult to interpret, so I’ll remove it, also because I already have BsmtFullBath to consider.
houses <- houses[,-which(names(houses) == "BsmtHalfBath")]
LowQualFinSF has too many outliers, so I’ll remove it.
houses <- houses[,-which(names(houses) == "LowQualFinSF")]
MiscVal has many outliers and is difficult to interpret, so I’ll remove it, also because I already have a binary variable that tells whether a miscellaneous feature is present or not.
houses <- houses[,-which(names(houses) == "MiscVal")]
The same reasoning as for MiscVal applies to PoolArea.
houses <- houses[,-which(names(houses) == "PoolArea")]
I create ClosedPorch with two levels (“Yes” or “No”) so that I can delete the other continuous porch variables, which are hard to interpret.
houses$ClosedPorch = as.factor(ifelse(houses$EnclosedPorch > 0 | houses$ScreenPorch > 0 | houses$X3SsnPorch > 0, "Yes", "No"))
houses <- houses[,-which(names(houses) == "EnclosedPorch")]
houses <- houses[,-which(names(houses) == "ScreenPorch")]
houses <- houses[,-which(names(houses) == "X3SsnPorch")]
prop.table(table(houses$ClosedPorch))
##
## No Yes
## 0.7519699 0.2480301
I recode OpenPorchSF as a factor variable, OpenPorch, with two levels, “Yes” or “No”.
houses$OpenPorch = as.factor(ifelse(houses$OpenPorchSF > 0, "Yes", "No"))
houses <- houses[,-which(names(houses) == "OpenPorchSF")]
prop.table(table(houses$OpenPorch))
##
## No Yes
## 0.4446728 0.5553272
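The two porch indicators follow the same pattern: a house gets “Yes” when at least one of the related area columns is positive. A small sketch of that pattern on made-up values (the three column names come from the data, the numbers do not), using rowSums() instead of chaining the conditions with `|`:

```r
# Toy porch areas in square feet (values are made up)
porch <- data.frame(
  EnclosedPorch = c(0, 272, 0, 0),
  ScreenPorch   = c(0, 0, 0, 120),
  X3SsnPorch    = c(0, 0, 0, 0)
)

# TRUE when at least one of the porch areas is positive
has_closed_porch <- rowSums(porch > 0) > 0

# Fixing the level order keeps "No" as the reference level
porch$ClosedPorch <- factor(ifelse(has_closed_porch, "Yes", "No"),
                            levels = c("No", "Yes"))
print(porch)
```

This sketch assumes the area columns have no missing values; with NA in the data the flag would be NA as well.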
After these steps I’ll relocate SalePrice to the last column, just to keep the data in order.
houses <- houses %>%
relocate(SalePrice, .after = last_col())
I split the data back into the train and test sets. The SalePrice column of the test set (previously inserted as a column of zeros) is dropped before modelling.
train <- houses[1:1460,]
test <- houses[1461:nrow(houses),] # every row after the 1460 train rows belongs to the test set
After that I redefine the factor and continuous variable indices, obviously excluding Id.
factorial_variables <- c(2,3,5:11,14,16:18,20:26,28,31:33,42,44,46,47,49,52,53,55:57,59:63)
continuous_variables <- c(4,12,13,15,19,27,29,30,34:41,43,45,48,50,51,54,58,64)
Then I create train and test sets by type of variable.
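Hard-coded column positions have to be rewritten every time a column is added or dropped. A possible alternative, sketched here on a toy data frame (column names are borrowed from the data, values are made up), is to pick the columns by their type. It assumes that every categorical variable is stored as a factor and everything else, apart from Id, is numeric:

```r
# Toy data frame mixing an Id, numeric columns and factors
toy <- data.frame(
  Id = 1:3,
  LotArea = c(8450, 9600, 11250),
  Pool = factor(c("No", "No", "Yes")),
  OverallQual = c(7L, 6L, 7L),
  Fence = factor(c("NoFen", "MnPrvWw", "NoFen"))
)

is_fact <- sapply(toy, is.factor)
is_num  <- sapply(toy, is.numeric) & names(toy) != "Id"   # Id is not a predictor

factor_cols     <- names(toy)[is_fact]
continuous_cols <- names(toy)[is_num]

print(factor_cols)
print(continuous_cols)
```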
train_num <- train[,continuous_variables]
test_num <- test[,continuous_variables]
train_fact <- train[,c(1,factorial_variables)]
test_fact <- test[,c(1,factorial_variables)]
test_id <- test[,1]
Correlation is a measure of the strength of a linear relationship between two quantitative variables (e.g., height and weight). Below I define positive and negative correlation and then measure the correlation between the continuous variables.
Positive correlation is a relationship between two variables in which both variables move in the same direction: when one variable increases the other increases too, and vice versa. For example, the more you exercise, the more calories you will burn. Negative correlation, instead, is a relationship where one variable increases as the other decreases, and vice versa.
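A quick way to see both signs is to call cor() on a few made-up vectors. The numbers below are purely illustrative and are not part of the houses data:

```r
# Toy data (hypothetical values)
exercise_minutes <- c(10, 20, 30, 40, 50, 60)
calories_burned  <- c(80, 150, 240, 310, 420, 480)  # rises with exercise
free_time_left   <- 120 - exercise_minutes          # falls as exercise rises

r_pos <- cor(exercise_minutes, calories_burned)     # close to +1
r_neg <- cor(exercise_minutes, free_time_left)      # exactly linear, so -1

print(c(positive = r_pos, negative = r_neg))
```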
tot_corr <- cor(train_num[,-24])
col<- colorRampPalette(c('darkred', 'white', 'black'))(10)
corrplot(tot_corr, method="pie", type= "upper", diag = F, tl.srt = 40, tl.cex = 0.8, tl.col = "black", col = col)
Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed. I want to remove attributes with an absolute correlation of 0.75 or higher.
So, I have removed GrLivArea, GarageCars and X1stFlrSF.
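To see which pairs sit behind a cutoff like this, the correlation matrix can be scanned directly. The sketch below uses base R on simulated data (the variable names `area`, `rooms` and `lot_depth` are made up) and only lists the pairs. As far as I understand, findCorrelation() goes one step further and also picks which variable of each pair to drop.

```r
set.seed(1)
n <- 200
area      <- rnorm(n, mean = 1500, sd = 300)
rooms     <- area / 250 + rnorm(n, sd = 0.3)   # strongly tied to area
lot_depth <- rnorm(n, mean = 100, sd = 15)     # unrelated to the others
toy <- data.frame(area, rooms, lot_depth)

corr <- cor(toy)
corr[upper.tri(corr, diag = TRUE)] <- NA       # keep each pair only once

# Positions of the pairs with absolute correlation of 0.75 or higher
high <- which(abs(corr) >= 0.75, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = corr[high]
)
print(high_pairs)
```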
highlyCorrelated <- findCorrelation(tot_corr, cutoff=0.75)
str(train_num[,highlyCorrelated])## 'data.frame': 1460 obs. of 3 variables:
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ GarageCars: num 2 2 2 3 3 2 2 2 2 1 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
train_num <- train_num[,-highlyCorrelated]
test_num <- test_num[,-highlyCorrelated]
SalePrice Density plot
ggplot(train, aes(x= SalePrice)) +
geom_density(fill = "black", alpha=0.6, show.legend=FALSE) +
labs(x="Values", y="Density", title="SalePrice - Density plot") +
my_theme
SalePrice Box plot
ggplot(train, aes(x= SalePrice)) +
geom_boxplot(fill = "black", alpha = 0.6, show.legend = F) +
coord_flip() +
labs(x="Values",title="SalePrice - Box plot") +
my_theme
SalePrice Correlation plot
price_cor <- train_num %>%
correlate() %>%
focus(SalePrice)
price_cor %>%
mutate(term = factor(term, levels = term[order(SalePrice)])) %>%
ggplot(aes(x = term, y = SalePrice, fill = SalePrice)) +
geom_bar(stat = "identity", show.legend = F) +
ylab("Correlation with Sale_Price") +
xlab("Variable") +
scale_fill_gradient(low = 'red', high = 'black') +
theme(plot.title = element_text(color = 'darkred', face = "bold.italic", size = 15),
plot.subtitle = element_text(color = 'darkred', size = 8),
plot.background =element_rect(fill = "snow",colour = "darkred",size = 1.5),
panel.grid.major = element_line(colour = "snow", size = 1),
panel.grid.minor = element_line(colour = "snow2"),
legend.title = element_text(colour="black", size=10),
panel.background =element_rect(fill = "snow2"),
legend.background = element_rect(fill="red3",size=0.5, linetype="solid",colour ="red3"),
axis.title = element_text(face = "bold.italic", color = "darkred"),
axis.text.x = element_text(face = "italic",color = 'red3', angle = 75, hjust = 1),
axis.text.y = element_text(face = "italic",color = 'red3'))
Before starting to create the predictive models I’m going to prepare the datasets by transforming the categorical variables into dummy variables. A dummy variable is a numeric variable that represents categorical data, such as gender, race, political affiliation, etc. Technically, dummy variables are dichotomous, quantitative variables: they can take on only two quantitative values. As a practical matter, regression results are easiest to interpret when dummy variables are limited to two specific values, 1 or 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents the absence. Then I will merge the dummy variables with the continuous ones into a single tibble for each of the train and test sets.
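As a small illustration of the encoding, here is model.matrix() applied to a toy data frame (the column names and levels are borrowed from the data, the rows are made up). Each factor with k levels becomes k - 1 columns of 0/1, with the first level acting as the reference:

```r
# Toy data: an Id plus two categorical variables
toy <- data.frame(
  Id = 1:4,
  Pool = factor(c("No", "Yes", "No", "No")),
  RoofStyle = factor(c("Gable", "Hip", "Other", "Gable"))
)

# Id on the left-hand side only keeps it out of the predictors;
# [, -1] drops the intercept column
dummies <- model.matrix(Id ~ ., data = toy)[, -1]
print(dummies)
```

The second row, for instance, has PoolYes = 1 and RoofStyleHip = 1, while a house with no pool and a gable roof is all zeros.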
train_fact_mtx <- model.matrix(object = Id ~ . , data = train_fact)
train_fact <- data.frame(train_fact_mtx[,-1])
test_fact_mtx <- model.matrix(object = Id ~ . , data = test_fact)
test_fact <- data.frame(test_fact_mtx[,-1])
train <- as_tibble(cbind(train_fact, train_num))
test <- as_tibble(cbind(test_fact, test_num))
test <- test[,-112]
Then I have split the train dataset into train data and train target.
formula <- SalePrice ~ .
train_data <- train[, -112] %>% as.data.frame()
train_target <- train[, 112] %>% pull()
Before proceeding to fit predictive models I have to define a train control in order to evaluate them.
For this purpose I have chosen repeated cross-validation.
Repeated k-fold cross-validation is a resampling procedure that improves the estimate of a machine learning model’s performance. It simply repeats the cross-validation procedure multiple times and reports the mean result across all folds from all runs. This mean is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, and its precision can be gauged with the standard error.
I decided to resample with k = 10 and to repeat the cross-validation 3 times.
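To make the resampling scheme concrete, the sketch below builds the held-out row indices for 3 repeats of 10 folds in base R. `make_repeated_folds` is a hypothetical helper written only for illustration. caret creates its own folds internally, and as far as I know it also balances them on the outcome, so the actual splits will differ.

```r
# Hypothetical helper: held-out row indices for repeated k-fold CV
make_repeated_folds <- function(n, k = 10, repeats = 3, seed = 123) {
  set.seed(seed)
  lapply(seq_len(repeats), function(r) {
    fold_id <- sample(rep(seq_len(k), length.out = n))  # shuffled fold labels
    split(seq_len(n), fold_id)                          # rows held out per fold
  })
}

folds <- make_repeated_folds(n = 1460)

# 3 repeats x 10 folds = 30 fits; each fold holds out 146 rows here
print(length(folds))
print(lengths(folds[[1]]))
```

In every repeat each row is held out exactly once, so the model is always scored on data it was not fitted on.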
train_control_RCV <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
After the repeated cross-validation I will have, for each fitted model, three values that help me evaluate it: RMSE, MAE and R².
Root mean squared error is calculated as: \[RMSE = \sqrt {\frac {1} {n} \sum (y_i - \hat{y}_i)^2}\]
Mean absolute error is calculated as: \[MAE = \frac {1} {n} \sum |y_i - \hat{y}_i|\]
The coefficient of determination (R²) is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.
The coefficient of determination is calculated as: \[R^2 = 1- \frac {SS_{res}} {SS_{tot}}\]
Where the variability of the data set can be measured with two sums of squares formulas:
The total sum of squares (proportional to the variance of the data): \[SS_{tot} = \sum (y_i - \bar{y})^2\] Where: \[\bar{y} = \frac {1} {n} \sum y_i\]
The sum of squares of residuals, also called the residual sum of squares: \[SS_{res} = \sum (y_i - \hat{y}_i)^2\]
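The three formulas translate almost literally into R. The functions below are a plain sketch written for this explanation (the names `rmse`, `mae` and `r_squared` and the toy vectors are my own). As far as I know, the Rsquared reported by caret is computed from the correlation between predictions and observations, so it can differ slightly from this definition.

```r
rmse <- function(actual, pred) sqrt(mean((actual - pred)^2))
mae  <- function(actual, pred) mean(abs(actual - pred))
r_squared <- function(actual, pred) {
  ss_res <- sum((actual - pred)^2)           # residual sum of squares
  ss_tot <- sum((actual - mean(actual))^2)   # total sum of squares
  1 - ss_res / ss_tot
}

# Toy example (values are made up)
actual <- c(3, 5, 7, 9)
pred   <- c(2, 5, 8, 10)

print(c(RMSE = rmse(actual, pred),
        MAE  = mae(actual, pred),
        R2   = r_squared(actual, pred)))
```

Here the errors are 1, 0, -1 and -1, so the MAE is 0.75, the RMSE is the square root of 0.75 (about 0.866), and R² is 1 - 3/20 = 0.85.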
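Multiple Linear Regression (MLR) is a statistical technique for modelling the relationship between a dependent variable and several independent variables.
The functional form is given by:
\[Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + ... + \beta_{p} X_{p} + e\]
Where \(Y\) is the dependent variable, \(X_{1}, ..., X_{p}\) are the independent variables, \(\beta_{0}\) is the intercept, \(\beta_{1}, ..., \beta_{p}\) are the regression coefficients and \(e\) is the error term.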
Let’s start by fitting the model with Y = SalePrice.
\[SalePrice = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + ... + \beta_{p} X_{p} + e\]
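As a small illustration of how the coefficients of such a model are estimated, the sketch below fits a multiple linear regression with lm() and reproduces the same coefficients from the least squares normal equations. The built-in mtcars data and the predictors wt and hp are assumptions of this sketch, chosen only because they ship with R.

```r
# Sketch: multiple linear regression on the built-in mtcars data
# (an assumption of this example, not the house-price data).
fit <- lm(mpg ~ wt + hp, data = mtcars)

# The same coefficients from the normal equations: beta = (X'X)^{-1} X'y
X <- model.matrix(mpg ~ wt + hp, data = mtcars)  # includes the intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Side-by-side comparison of the two sets of estimates
cbind(lm = coef(fit), normal_eq = drop(beta_hat))
```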
set.seed(123)
lm_RCV <- train(formula, data = train, method = "lm", trControl = train_control_RCV)
print(lm_RCV)
## Linear Regression
##
## 1460 samples
## 111 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 34963.23 0.8125509 20345
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Having created the model, let's see which variables are the most important.
As the graph shows, the variable that most influences SalePrice is the OverallQual rating, with almost 10% of the total importance, followed by KitchenQualGd, with an importance value just above 7.5%.
# Extract the unscaled variable importance and sort it in decreasing order
lm_importance_value <- varImp(lm_RCV, scale = FALSE)
lm_importance_value <- as.data.frame(lm_importance_value[["importance"]])
lm_importance_value <- cbind(Variable = rownames(lm_importance_value), lm_importance_value)
lm_importance_value <- lm_importance_value %>%
  arrange(desc(Overall))
# Plot the 15 most important variables
ggplot(lm_importance_value[1:15, ], aes(reorder(Variable, Overall), Overall, fill = Overall)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "Linear Regression - Variables Importance", y = "Importance", x = "") +
  my_theme
Next, let's see how well the linear model predicts the training data.
lm_pred <- predict(lm_RCV, train_data)
# Scatter plot of the actual sale prices against the predictions
tibble(
  pred = lm_pred,
  actual = train_target
) %>%
  ggplot(aes(pred, actual)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "Linear Regression - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme
Neural networks are a set of algorithms, loosely modeled after the human brain, designed to recognize patterns.
The patterns they recognize are numerical and contained in vectors, into which all real-world data must be translated.
A stacked neural network is a network composed of several layers.
The layers are made of nodes.
A node combines its input data with a set of coefficients, or weights, that either amplify or dampen that input, thereby assigning significance to inputs for the task that the algorithm is trying to learn.
The output of each layer is in turn the input of the next layer, starting from the initial input layer that receives the data.
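The way nodes combine inputs and weights, and pass their output on to the next layer, can be sketched as a single forward pass. All inputs, weights and biases below are arbitrary illustrative values, and the sigmoid activation is an assumption of this sketch.

```r
# Sketch of a forward pass through one hidden layer with 3 nodes.
# All numbers are arbitrary illustrative values.
sigmoid <- function(z) 1 / (1 + exp(-z))

x  <- c(0.2, 0.5, 0.1, 0.9)                               # one observation, 4 scaled inputs
W1 <- matrix(seq(-0.5, 0.6, length.out = 12), nrow = 3)   # 3 hidden nodes x 4 inputs
b1 <- c(0.1, 0.0, -0.1)                                   # hidden-layer biases
W2 <- matrix(c(0.3, -0.2, 0.5), nrow = 1)                 # 1 output node x 3 hidden nodes
b2 <- 0.05                                                # output bias

# Each hidden node: weighted sum of the inputs plus bias, then the activation
hidden <- sigmoid(W1 %*% x + b1)
# The hidden-layer output is the input of the output layer
output <- sigmoid(W2 %*% hidden + b2)
drop(output)
```

Training a network amounts to adjusting W1, b1, W2 and b2 so that the output gets close to the target values.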
A neural network consists of an input layer, one or more hidden layers and an output layer.
In order to scale all the variables, we use the maximum and minimum values of each variable, so that every scaled variable in the training set lies between 0 and 1.
train_maxs <- apply(train, 2, max)
train_mins <- apply(train, 2, min)
train_scaled <- as.data.frame(scale(train,
                                    center = train_mins,
                                    scale = train_maxs - train_mins))
We are looking for the optimal parameters for the model.
As the output shows, the final values used for the model were:
size = 3
Size is the number of units in the hidden layer, whose weights are optimised during training in order to improve the predictive power of the model.
decay = 0.1
Weight decay is a regularization technique that adds a small penalty, usually the squared L2 norm of all the weights of the model, to the loss function.
\[loss_{penalized} = loss + \lambda \sum_j w_j^2\]
Where \(\lambda\) is the weight decay parameter and \(w_j\) are the weights of the model.
Weight decay is used to prevent overfitting by keeping the weights small, which also helps to avoid exploding gradients.
Because the penalty on the weights is added to the loss, each training iteration tries to keep the model weights small in addition to minimizing the original loss.
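The effect of the penalty can be seen on a toy example. The sketch below assumes a sum of squared errors as the unpenalised loss; the function name penalised_loss and all values are hypothetical.

```r
# Sketch of an L2-penalised loss; weights, targets and predictions are toy values.
penalised_loss <- function(actual, predicted, weights, decay) {
  sse <- sum((actual - predicted)^2)   # unpenalised fitting loss
  sse + decay * sum(weights^2)         # add decay times the squared L2 norm
}

actual    <- c(0.2, 0.4, 0.6)
predicted <- c(0.25, 0.35, 0.65)
weights   <- c(1.5, -2.0, 0.5)

penalised_loss(actual, predicted, weights, decay = 0)    # 0.0075
penalised_loss(actual, predicted, weights, decay = 0.1)  # 0.6575
```

With decay = 0.1 the penalty term dominates the loss for these weights, so the optimiser is pushed towards smaller weights even when the fit is already good.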
set.seed(123)
nn_RCV <- train(formula, data = train_scaled, method = "nnet", trControl = train_control_RCV)
print(nn_RCV)
## Neural Network
##
## 1460 samples
## 111 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 0.15703415 0.8322300 0.13199676
## 1 1e-04 0.14545102 0.4543285 0.12132953
## 1 1e-01 0.04541866 0.8357140 0.02824910
## 3 0e+00 0.14953582 0.8500791 0.12606823
## 3 1e-04 0.12687862 0.4917349 0.10374026
## 3 1e-01 0.04522695 0.8351553 0.02787022
## 5 0e+00 0.15895984 0.7941364 0.13222516
## 5 1e-04 0.09061938 0.6516267 0.06804890
## 5 1e-01 0.04536545 0.8341218 0.02777200
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.1.
Having created the model, let's see which variables are the most important.
As the graph shows, the variable that most influences SalePrice is still the OverallQual rating, but this time with less than 5% of the total importance, followed by 2ndFlrSF (second floor square feet), with an importance value of almost 3%.
# Extract the unscaled variable importance and sort it in decreasing order
nn_importance_value <- varImp(nn_RCV, scale = FALSE)
nn_importance_value <- as.data.frame(nn_importance_value[["importance"]])
nn_importance_value <- cbind(Variable = rownames(nn_importance_value), nn_importance_value)
nn_importance_value <- nn_importance_value %>%
  arrange(desc(Overall))
# Plot the 15 most important variables
ggplot(nn_importance_value[1:15, ], aes(reorder(Variable, Overall), Overall, fill = Overall)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "Neural Net - Variables Importance", y = "Importance", x = "") +
  my_theme
Next, let's see how well the neural network model predicts the training data.
nn_pred <- predict(nn_RCV, train_scaled[, -115])
# Map the scaled predictions back to the original price scale
nn_pred_unscaled <- nn_pred * (train_maxs["SalePrice"] - train_mins["SalePrice"]) + train_mins["SalePrice"]
tibble(
  pred = nn_pred_unscaled,
  actual = train_target
) %>%
  ggplot(aes(pred, actual)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "Neural Net - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme
In order to compare the RMSE and MAE with those of the other models, I have to calculate them on the unscaled data.
nn_RMSE <- rmse(train_target, nn_pred_unscaled)
print(paste0("RMSE is: ", round(nn_RMSE)))
## [1] "RMSE is: 29401"
nn_MAE <- mae(train_target, nn_pred_unscaled)
print(paste0("MAE is: ", round(nn_MAE)))
## [1] "MAE is: 18276"
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
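The idea of building an ensemble of weak trees can be sketched in a few lines for the squared-error case: each new shallow tree is fitted to the residuals of the current ensemble, and only a fraction of its prediction (the shrinkage) is added. This is a simplified illustration, not the gbm implementation; the rpart package, the built-in mtcars data and the helper name boost_fit are assumptions of this sketch.

```r
# Sketch of gradient boosting for squared-error loss with shallow trees.
# Assumptions: rpart (shipped with standard R installations) and mtcars;
# boost_fit is a hypothetical helper, not part of gbm or caret.
library(rpart)

boost_fit <- function(data, response, n_trees = 50, shrinkage = 0.1) {
  y <- data[[response]]
  predictors <- data[, setdiff(names(data), response), drop = FALSE]
  pred <- rep(mean(y), length(y))        # start from the constant prediction
  rmse_path <- numeric(n_trees)
  for (m in seq_len(n_trees)) {
    work <- predictors
    work$residual <- y - pred            # negative gradient of the squared error
    tree <- rpart(residual ~ ., data = work,
                  control = rpart.control(maxdepth = 2, minsplit = 5, cp = 0))
    pred <- pred + shrinkage * predict(tree, work)   # take a small step
    rmse_path[m] <- sqrt(mean((y - pred)^2))         # training RMSE so far
  }
  list(pred = pred, rmse_path = rmse_path)
}

boosted <- boost_fit(mtcars, "mpg", n_trees = 50, shrinkage = 0.1)
round(boosted$rmse_path[c(1, 10, 50)], 2)
```

The training RMSE falls as trees are added; a smaller shrinkage slows this down and usually requires more trees, which is why the two parameters are tuned together.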
We are looking for the optimal parameters for the model, so I create a grid of candidate values to be tried for this GBM model.
As the output shows, the final values used for the model were:
n.trees = 150
The total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion.
interaction.depth = 6
The maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc.
shrinkage = 0.1
A shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. It was held constant at 0.1 in the grid.
n.minobsinnode = 15
The minimum number of observations in the terminal nodes of the trees.
grid <- expand.grid(interaction.depth = c(3, 6),
                    n.trees = c(150, 300),
                    shrinkage = c(0.1),
                    n.minobsinnode = c(10, 15))
set.seed(123)
gbm_RCV <- train(formula, data = train, distribution = "gaussian", method = "gbm",
                 trControl = train_control_RCV, tuneGrid = grid)
print(gbm_RCV)
## Stochastic Gradient Boosting
##
## 1460 samples
## 111 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.minobsinnode n.trees RMSE Rsquared MAE
## 3 10 150 30703.16 0.8520090 18083.20
## 3 10 300 30242.15 0.8559285 17790.39
## 3 15 150 30101.27 0.8571943 18052.16
## 3 15 300 29814.07 0.8596072 17780.00
## 6 10 150 30767.00 0.8510928 17545.10
## 6 10 300 30686.39 0.8512630 17520.70
## 6 15 150 29571.74 0.8621849 17444.02
## 6 15 300 29721.45 0.8609000 17576.79
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 6, shrinkage = 0.1 and n.minobsinnode = 15.
Having created the model, let's see which variables are the most important.
For this model too, as the graph shows, the variable that most influences SalePrice is the OverallQual rating. This time its importance is much larger than that of the other variables, with more than 45% of the total importance, followed by TotalBsmtSF (total square feet of basement area), with an importance value of about 12%.
# Relative influence of each variable, sorted in decreasing order
gbm_importance_value <- as.data.frame(summary(gbm_RCV))
gbm_importance_value <- gbm_importance_value %>%
  arrange(desc(rel.inf))
# Plot the 15 most important variables
ggplot(gbm_importance_value[1:15, ], aes(reorder(var, rel.inf), rel.inf, fill = rel.inf)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "GBM - Variables Importance", y = "Importance", x = "") +
  my_theme
Next, let's see how well the gradient boosting machine model predicts the training data.
gbm_pred <- predict(gbm_RCV, train_data)
# Scatter plot of the actual sale prices against the predictions
tibble(
  pred = gbm_pred,
  actual = train_target
) %>%
  ggplot(aes(pred, actual)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "GBM - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme
In conclusion, I will compare the different prediction models that I have fitted and use the best one to predict SalePrice for the test data.
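To compare the models' performance, I will evaluate them by the R squared, the root mean squared error and the mean absolute error.
As the table shows, gradient boosting performs best in terms of R squared and MAE, and its RMSE is very close to that of the neural network. Note that the neural network's RMSE and MAE were computed on the training data after unscaling, whereas the other values come from the repeated cross-validation, so the neural network figures are likely to be optimistic.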
| Model | R squared | RMSE | MAE |
|---|---|---|---|
| Multiple linear regression | 0.8125 | 34963 | 20345 |
| Neural network | 0.8351 | 29401 | 18276 |
| Gradient boosting machine | 0.8622 | 29572 | 17444 |
As shown above, the best model is the gradient boosting machine, so I will use it to predict the house prices.
price_prediction <- predict(gbm_RCV, test)
analysis_result <- cbind(test_id, price_prediction)
head(analysis_result, 15)
## test_id price_prediction
## [1,] 1461 124793.5
## [2,] 1462 163828.6
## [3,] 1463 168015.1
## [4,] 1464 186131.6
## [5,] 1465 186922.5
## [6,] 1466 190698.5
## [7,] 1467 178203.0
## [8,] 1468 161966.9
## [9,] 1469 177181.7
## [10,] 1470 119881.7
## [11,] 1471 200224.3
## [12,] 1472 105117.0
## [13,] 1473 102322.0
## [14,] 1474 152281.4
## [15,] 1475 135077.6
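If the predictions need to be saved, one possible way is to store them in a data frame and write it to a CSV file. The sketch below uses placeholder ids and prices rather than the real predictions, writes to a temporary directory, and assumes that the expected column names are Id and SalePrice.

```r
# Sketch: store predictions in a data frame and write them to a CSV file.
# The ids and prices are placeholders; the column names Id and SalePrice
# are an assumption about the expected submission format.
test_id_demo <- 1461:1463
price_prediction_demo <- c(100000, 150000, 200000)

submission <- data.frame(Id = test_id_demo, SalePrice = price_prediction_demo)

out_file <- file.path(tempdir(), "submission.csv")
write.csv(submission, out_file, row.names = FALSE)

# Read the file back to check its structure
read.csv(out_file)
```

A data frame keeps the id as an integer column and the price as a numeric column, whereas cbind() on two vectors produces a plain numeric matrix.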