The dataset for my prediction project consists of various house attributes; the sale price is included only for the train set.
For this project I am going to estimate the sale price of houses using different models.
To do that, I will use a train set of house characteristics to predict the variable of interest (SalePrice) for another dataset of houses that do not yet have a price.
Both datasets are provided by Kaggle.
I will merge them in order to apply the same data cleaning and manipulation to both.
The train set has 1460 observations (rows) and the test set has 1459, and both of them have 80 features (columns):
MSSubClass
: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning
: Identifies the general zoning classification of the sale.
A Agriculture
C Commercial
FV Floating Village Residential
I Industrial
RH Residential High Density
RL Residential Low Density
RP Residential Low Density Park
RM Residential Medium Density
LotFrontage
: Linear feet of street connected to property
LotArea
: Lot size in square feet
Street
: Type of road access to property
Grvl Gravel
Pave Paved
Alley
: Type of alley access to property
Grvl Gravel
Pave Paved
NA No alley access
LotShape
: General shape of property
Reg Regular
IR1 Slightly irregular
IR2 Moderately Irregular
IR3 Irregular
LandContour
: Flatness of the property
Lvl Near Flat/Level
Bnk Banked - Quick and significant rise from street grade to building
HLS Hillside - Significant slope from side to side
Low Depression
Utilities
: Type of utilities available
AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig
: Lot configuration
Inside Inside lot
Corner Corner lot
CulDSac Cul-de-sac
FR2 Frontage on 2 sides of property
FR3 Frontage on 3 sides of property
LandSlope
: Slope of property
Gtl Gentle slope
Mod Moderate Slope
Sev Severe Slope
Neighborhood
: Physical locations within Ames city limits
Blmngtn Bloomington Heights
Blueste Bluestem
BrDale Briardale
BrkSide Brookside
ClearCr Clear Creek
CollgCr College Creek
Crawfor Crawford
Edwards Edwards
Gilbert Gilbert
IDOTRR Iowa DOT and Rail Road
MeadowV Meadow Village
Mitchel Mitchell
NAmes North Ames
NoRidge Northridge
NPkVill Northpark Villa
NridgHt Northridge Heights
NWAmes Northwest Ames
OldTown Old Town
SWISU South & West of Iowa State University
Sawyer Sawyer
SawyerW Sawyer West
Somerst Somerset
StoneBr Stone Brook
Timber Timberland
Veenker Veenker
Condition1
: Proximity to various conditions
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
Condition2
: Proximity to various conditions (if more than one is present)
Artery Adjacent to arterial street
Feedr Adjacent to feeder street
Norm Normal
RRNn Within 200' of North-South Railroad
RRAn Adjacent to North-South Railroad
PosN Near positive off-site feature--park, greenbelt, etc.
PosA Adjacent to positive off-site feature
RRNe Within 200' of East-West Railroad
RRAe Adjacent to East-West Railroad
BldgType
: Type of dwelling
1Fam Single-family Detached
2FmCon Two-family Conversion; originally built as one-family dwelling
Duplx Duplex
TwnhsE Townhouse End Unit
TwnhsI Townhouse Inside Unit
HouseStyle
: Style of dwelling
1Story One story
1.5Fin One and one-half story: 2nd level finished
1.5Unf One and one-half story: 2nd level unfinished
2Story Two story
2.5Fin Two and one-half story: 2nd level finished
2.5Unf Two and one-half story: 2nd level unfinished
SFoyer Split Foyer
SLvl Split Level
OverallQual
: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond
: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt
: Original construction date
YearRemodAdd
: Remodel date (same as construction date if no remodeling or additions)
RoofStyle
: Type of roof
Flat Flat
Gable Gable
Gambrel Gambrel (Barn)
Hip Hip
Mansard Mansard
Shed Shed
RoofMatl
: Roof material
ClyTile Clay or Tile
CompShg Standard (Composite) Shingle
Membran Membrane
Metal Metal
Roll Roll
Tar&Grv Gravel & Tar
WdShake Wood Shakes
WdShngl Wood Shingles
Exterior1st
: Exterior covering on house
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
Exterior2nd
: Exterior covering on house (if more than one material)
AsbShng Asbestos Shingles
AsphShn Asphalt Shingles
BrkComm Brick Common
BrkFace Brick Face
CBlock Cinder Block
CemntBd Cement Board
HdBoard Hard Board
ImStucc Imitation Stucco
MetalSd Metal Siding
Other Other
Plywood Plywood
PreCast PreCast
Stone Stone
Stucco Stucco
VinylSd Vinyl Siding
Wd Sdng Wood Siding
WdShing Wood Shingles
MasVnrType
: Masonry veneer type
BrkCmn Brick Common
BrkFace Brick Face
CBlock Cinder Block
None None
Stone Stone
MasVnrArea
: Masonry veneer area in square feet
ExterQual
: Evaluates the quality of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
ExterCond
: Evaluates the present condition of the material on the exterior
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
Foundation
: Type of foundation
BrkTil Brick & Tile
CBlock Cinder Block
PConc Poured Concrete
Slab Slab
Stone Stone
Wood Wood
BsmtQual
: Evaluates the height of the basement
Ex Excellent (100+ inches)
Gd Good (90-99 inches)
TA Typical (80-89 inches)
Fa Fair (70-79 inches)
Po Poor (<70 inches)
NA No Basement
BsmtCond
: Evaluates the general condition of the basement
Ex Excellent
Gd Good
TA Typical - slight dampness allowed
Fa Fair - dampness or some cracking or settling
Po Poor - Severe cracking, settling, or wetness
NA No Basement
BsmtExposure
: Refers to walkout or garden level walls
Gd Good Exposure
Av Average Exposure (split levels or foyers typically score average or above)
Mn Minimum Exposure
No No Exposure
NA No Basement
BsmtFinType1
: Rating of basement finished area
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF1
: Type 1 finished square feet
BsmtFinType2
: Rating of basement finished area (if multiple types)
GLQ Good Living Quarters
ALQ Average Living Quarters
BLQ Below Average Living Quarters
Rec Average Rec Room
LwQ Low Quality
Unf Unfinished
NA No Basement
BsmtFinSF2
: Type 2 finished square feet
BsmtUnfSF
: Unfinished square feet of basement area
TotalBsmtSF
: Total square feet of basement area
Heating
: Type of heating
Floor Floor Furnace
GasA Gas forced warm air furnace
GasW Gas hot water or steam heat
Grav Gravity furnace
OthW Hot water or steam heat other than gas
Wall Wall furnace
HeatingQC
: Heating quality and condition
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
Po Poor
CentralAir
: Central air conditioning
N No
Y Yes
Electrical
: Electrical system
SBrkr Standard Circuit Breakers & Romex
FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
Mix Mixed
1stFlrSF
: First Floor square feet
2ndFlrSF
: Second floor square feet
LowQualFinSF
: Low quality finished square feet (all floors)
GrLivArea
: Above grade (ground) living area square feet
BsmtFullBath
: Basement full bathrooms
BsmtHalfBath
: Basement half bathrooms
FullBath
: Full bathrooms above grade
HalfBath
: Half baths above grade
BedroomAbvGr
: Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr
: Kitchens above grade
KitchenQual
: Kitchen quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
TotRmsAbvGrd
: Total rooms above grade (does not include bathrooms)
Functional
: Home functionality (Assume typical unless deductions are warranted)
Typ Typical Functionality
Min1 Minor Deductions 1
Min2 Minor Deductions 2
Mod Moderate Deductions
Maj1 Major Deductions 1
Maj2 Major Deductions 2
Sev Severely Damaged
Sal Salvage only
Fireplaces
: Number of fireplaces
FireplaceQu
: Fireplace quality
Ex Excellent - Exceptional Masonry Fireplace
Gd Good - Masonry Fireplace in main level
TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
Fa Fair - Prefabricated Fireplace in basement
Po Poor - Ben Franklin Stove
NA No Fireplace
GarageType
: Garage location
2Types More than one type of garage
Attchd Attached to home
Basment Basement Garage
BuiltIn Built-In (Garage part of house - typically has room above garage)
CarPort Car Port
Detchd Detached from home
NA No Garage
GarageYrBlt
: Year garage was built
GarageFinish
: Interior finish of the garage
Fin Finished
RFn Rough Finished
Unf Unfinished
NA No Garage
GarageCars
: Size of garage in car capacity
GarageArea
: Size of garage in square feet
GarageQual
: Garage quality
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
GarageCond
: Garage condition
Ex Excellent
Gd Good
TA Typical/Average
Fa Fair
Po Poor
NA No Garage
PavedDrive
: Paved driveway
Y Paved
P Partial Pavement
N Dirt/Gravel
WoodDeckSF
: Wood deck area in square feet
OpenPorchSF
: Open porch area in square feet
EnclosedPorch
: Enclosed porch area in square feet
3SsnPorch
: Three season porch area in square feet
ScreenPorch
: Screen porch area in square feet
PoolArea
: Pool area in square feet
PoolQC
: Pool quality
Ex Excellent
Gd Good
TA Average/Typical
Fa Fair
NA No Pool
Fence
: Fence quality
GdPrv Good Privacy
MnPrv Minimum Privacy
GdWo Good Wood
MnWw Minimum Wood/Wire
NA No Fence
MiscFeature
: Miscellaneous feature not covered in other categories
Elev Elevator
Gar2 2nd Garage (if not described in garage section)
Othr Other
Shed Shed (over 100 SF)
TenC Tennis Court
NA None
MiscVal
: $Value of miscellaneous feature
MoSold
: Month Sold (MM)
YrSold
: Year Sold (YYYY)
SaleType
: Type of sale
WD Warranty Deed - Conventional
CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
New Home just constructed and sold
COD Court Officer Deed/Estate
Con Contract 15% Down payment regular terms
ConLw Contract Low Down payment and low interest
ConLI Contract Low Interest
ConLD Contract Low Down
Oth Other
SaleCondition
: Condition of sale
Normal Normal Sale
Abnorml Abnormal Sale - trade, foreclosure, short sale
AdjLand Adjoining Land Purchase
Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
Family Sale between family members
Partial Home was not completed when last assessed (associated with New Homes)
test$SalePrice <- 0
houses <- bind_rows(train, test)
houses %>% glimpse()
## Rows: 2,919
## Columns: 81
## $ Id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1~
## $ MSSubClass <int> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60, 20, 20,~
## $ MSZoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RM", "R~
## $ LotFrontage <int> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, NA, 91, ~
## $ LotArea <int> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10382, 612~
## $ Street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", ~
## $ Alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ LotShape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg", "IR1", ~
## $ LandContour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", ~
## $ Utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub", "AllPu~
## $ LotConfig <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Inside", "I~
## $ LandSlope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", ~
## $ Neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoRidge", "~
## $ Condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm", "Norm",~
## $ Condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", ~
## $ BldgType <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", ~
## $ HouseStyle <chr> "2Story", "1Story", "2Story", "2Story", "2Story", "1.5Fi~
## $ OverallQual <int> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, 6, 4, 5,~
## $ OverallCond <int> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, 7, 5, 5,~
## $ YearBuilt <int> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, 1931, 19~
## $ YearRemodAdd <int> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, 1950, 19~
## $ RoofStyle <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Gable", "G~
## $ RoofMatl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "CompShg", "~
## $ Exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "VinylSd", "~
## $ Exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "VinylSd", "~
## $ MasVnrType <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace", "None",~
## $ MasVnrArea <int> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, 0, 306, ~
## $ ExterQual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ ExterCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ Foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "Wood", "~
## $ BsmtQual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", "TA", "T~
## $ BsmtCond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", "TA", "T~
## $ BsmtExposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", "No", "N~
## $ BsmtFinType1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ", "ALQ", ~
## $ BsmtFinSF1 <int> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851, 906, 99~
## $ BsmtFinType2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "BLQ", ~
## $ BsmtFinSF2 <int> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0~
## $ BsmtUnfSF <int> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140, 134, 17~
## $ TotalBsmtSF <int> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952, 991, 10~
## $ Heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", ~
## $ HeatingQC <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", "Gd", "E~
## $ CentralAir <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ Electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "S~
## $ X1stFlrSF <int> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022, 1077, ~
## $ X2ndFlrSF <int> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, 1142, 0,~
## $ LowQualFinSF <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ GrLivArea <int> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 10~
## $ BsmtFullBath <int> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1,~
## $ BsmtHalfBath <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ FullBath <int> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, 1, 2, 1,~
## $ HalfBath <int> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,~
## $ BedroomAbvGr <int> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, 2, 2, 3,~
## $ KitchenAbvGr <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1,~
## $ KitchenQual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", "TA", "T~
## $ TotRmsAbvGrd <int> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5, 5, 6, 6~
## $ Functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", ~
## $ Fireplaces <int> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, 1, 0, 0,~
## $ FireplaceQu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA", "TA", ~
## $ GarageType <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd", "Attch~
## $ GarageYrBlt <int> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, 1931, 19~
## $ GarageFinish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn", "RFn", ~
## $ GarageCars <int> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, 2, 2, 2,~
## $ GarageArea <int> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205, 384, 7~
## $ GarageQual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "Fa", "G~
## $ GarageCond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", "T~
## $ PavedDrive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "~
## $ WoodDeckSF <int> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, 140, 160~
## $ OpenPorchSF <int> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, 33, 213,~
## $ EnclosedPorch <int> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, 176, 0, ~
## $ X3SsnPorch <int> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ ScreenPorch <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0, 0, 0, ~
## $ PoolArea <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ PoolQC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, ~
## $ Fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA, NA, NA,~
## $ MiscFeature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, NA, NA, ~
## $ MiscVal <int> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0, 0, 700,~
## $ MoSold <int> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, 7, 3, 10~
## $ YrSold <int> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, 2008, 20~
## $ SaleType <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", "W~
## $ SaleCondition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal", "Norm~
## $ SalePrice <dbl> 208500, 181500, 223500, 140000, 250000, 143000, 307000, ~
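Because the two sets are stacked with the train rows first, they can be separated again by row position once the cleaning is done. The following is only a minimal sketch on toy data, using base R; `train_toy`, `test_toy`, `houses_toy` and `n_train` are hypothetical names and values, not objects from this analysis.

```r
# Toy stand-ins for the Kaggle train and test sets (hypothetical values).
train_toy <- data.frame(Id = 1:3,
                        LotArea = c(8450, 9600, 11250),
                        SalePrice = c(208500, 181500, 223500))
test_toy  <- data.frame(Id = 4:5,
                        LotArea = c(9550, 14260))

# Give the test set a placeholder price so both sets share the same columns.
test_toy$SalePrice <- 0

# Stack train on top of test; with identical columns, rbind() is the
# base R counterpart of bind_rows().
houses_toy <- rbind(train_toy, test_toy)

# After cleaning, recover the two sets by row position.
n_train    <- nrow(train_toy)
train_back <- houses_toy[seq_len(n_train), ]
test_back  <- houses_toy[-seq_len(n_train), ]
```

In the real data the split point would be the 1460 train rows, assuming the row order is not changed during cleaning.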
Accurate estimation helps to identify the price at which to sell a house in order to make as much profit as possible and to sell as soon as possible.
Thus, the purpose of this project is to build several models using different Machine Learning techniques in order to estimate the value of houses.
I examine whether the different models developed significantly improve the prediction accuracy in terms of Mean Squared Error and R squared.
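For reference, the two accuracy measures can be written as short functions. This is a sketch with hypothetical helper names (`mse`, `r_squared`) and made-up prices, not the evaluation code of this project.

```r
# Mean Squared Error: average squared gap between observed and predicted values.
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# R squared: 1 minus the ratio of residual to total sum of squares.
r_squared <- function(actual, predicted) {
  ss_res <- sum((actual - predicted)^2)
  ss_tot <- sum((actual - mean(actual))^2)
  1 - ss_res / ss_tot
}

# Hypothetical prices, for illustration only.
actual    <- c(200000, 150000, 300000, 250000)
predicted <- c(210000, 140000, 290000, 260000)

mse(actual, predicted)
r_squared(actual, predicted)
```

A lower Mean Squared Error and an R squared closer to 1 both indicate predictions that sit closer to the observed prices.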
Furthermore, I will examine the importance of the house characteristics and how it changes across the different models.
First, ensure that there are no duplicates (the Id column is removed so that houses are compared only on their attributes).
houses[,-1][duplicated(houses)]
## data frame with 0 columns and 2919 rows
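Note that `duplicated()` applied to a data frame returns one logical value per row, so it is usually placed in the row position of the index. A minimal sketch on toy data (hypothetical values), comparing rows on every column except Id:

```r
toy <- data.frame(Id      = 1:4,
                  LotArea = c(8450, 9600, 8450, 11250),
                  Street  = c("Pave", "Pave", "Pave", "Grvl"))

# Flag rows whose attributes (all columns but Id) repeat an earlier row.
is_dup <- duplicated(toy[, -1])

# Number of duplicated rows, and the rows themselves.
sum(is_dup)
toy[is_dup, ]
```

Here the third row repeats the attributes of the first one, so it is the only row flagged.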
From the dataset description I saw that houses without a basement have “NA”, so let’s replace it with “NoB” in all the variables regarding the basement.
I also see that houses without a garage have “NA”, so let’s substitute it with “NoG” in all the variables regarding the garage. Then, in order not to delete the houses without a garage from the sample, I have to replace the “NA” values of the garage construction year with a number: 0, or alternatively the mean of the other houses so as not to bias the sample.
Furthermore, from the dataset description I also see that houses without alley access, fireplace, pool, fence and miscellaneous feature have “NA”, so I replace them with “NoAll”, “NoFp”, “NoPo”, “NoFen” and “None”.
houses[,c(31:34,36)][is.na(houses[,c(31:34,36)])] <- "NoB"
houses[,c(59,61,64,65)][is.na(houses[,c(59,61,64,65)])] <- "NoG"
gar_year <- na.omit(houses[,60])
houses[,60][is.na(houses[,60])] <- 0
#or
#houses[,60][is.na(houses[,60])] <- as.integer(mean(gar_year))
houses[,7][is.na(houses[,7])] <- "NoAll"
houses[,58][is.na(houses[,58])] <- "NoFp"
houses[,73][is.na(houses[,73])] <- "NoPo"
houses[,74][is.na(houses[,74])] <- "NoFen"
houses[,75][is.na(houses[,75])] <- "None"
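The replacements above address columns by position, which depends on the column order of the merged data. As an alternative, the same idea can be written with column names. This is a sketch on toy data, assuming character columns; `absent_label` is a hypothetical lookup vector introduced only for illustration.

```r
toy <- data.frame(BsmtQual   = c("Gd", NA, "TA"),
                  GarageType = c(NA, "Attchd", "Detchd"),
                  Alley      = c(NA, NA, "Grvl"),
                  stringsAsFactors = FALSE)

# Map each column name to the label that stands for "feature absent".
absent_label <- c(BsmtQual = "NoB", GarageType = "NoG", Alley = "NoAll")

# Replace NA with the label of the corresponding column.
for (col in names(absent_label)) {
  toy[[col]][is.na(toy[[col]])] <- absent_label[[col]]
}

toy
```

Addressing columns by name keeps the code valid if a column is later added or dropped.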
Before continuing, I check whether and where “NA” values are still present.
First, I check whether there are variables with more than 1% of “NA” (the threshold used in the code below).
NA_values <- matrix(NA,81,2)
for(i in 1:81){
  # percentage of NA in column i
  NA_values[i,1] <- sum(is.na(houses[,i]))/2919*100
}
var_with_NA <- NA
for(i in 1:81){
  if(NA_values[i,1] > 1){
    var_with_NA[i] <- i
  }
}
var_with_NA
## [1] NA NA NA 4
var_with_NA <- 4
I found many “NA” in the LotFrontage variable, so let’s replace them with “0”.
houses[,4][is.na(houses[,4])] <- 0
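The percentage of “NA” per column can also be obtained without an explicit loop. A minimal sketch on toy data (hypothetical values; the 30% threshold is arbitrary and chosen only for the example):

```r
toy <- data.frame(LotFrontage = c(65, NA, 68, NA),
                  LotArea     = c(8450, 9600, 11250, 9550),
                  MSZoning    = c("RL", "RL", NA, "RM"))

# is.na() returns a logical matrix; its column means are the NA shares.
na_pct <- colMeans(is.na(toy)) * 100

# Names of the columns above the chosen threshold.
names(na_pct)[na_pct > 30]
```

The result is a named vector, so the variables are identified by name rather than by column number.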
Now I check how many “NA” are still present in the dataset and inspect them one by one.
sum(is.na(houses))
## [1] 70
which(is.na(houses), arr.ind=TRUE)
## row col
## [1,] 1916 3
## [2,] 2217 3
## [3,] 2251 3
## [4,] 2905 3
## [5,] 1916 10
## [6,] 1946 10
## [7,] 2152 24
## [8,] 2152 25
## [9,] 235 26
## [10,] 530 26
## [11,] 651 26
## [12,] 937 26
## [13,] 974 26
## [14,] 978 26
## [15,] 1244 26
## [16,] 1279 26
## [17,] 1692 26
## [18,] 1707 26
## [19,] 1883 26
## [20,] 1993 26
## [21,] 2005 26
## [22,] 2042 26
## [23,] 2312 26
## [24,] 2326 26
## [25,] 2341 26
## [26,] 2350 26
## [27,] 2369 26
## [28,] 2593 26
## [29,] 2611 26
## [30,] 2658 26
## [31,] 2687 26
## [32,] 2863 26
## [33,] 235 27
## [34,] 530 27
## [35,] 651 27
## [36,] 937 27
## [37,] 974 27
## [38,] 978 27
## [39,] 1244 27
## [40,] 1279 27
## [41,] 1692 27
## [42,] 1707 27
## [43,] 1883 27
## [44,] 1993 27
## [45,] 2005 27
## [46,] 2042 27
## [47,] 2312 27
## [48,] 2326 27
## [49,] 2341 27
## [50,] 2350 27
## [51,] 2369 27
## [52,] 2593 27
## [53,] 2658 27
## [54,] 2687 27
## [55,] 2863 27
## [56,] 2121 35
## [57,] 2121 37
## [58,] 2121 38
## [59,] 2121 39
## [60,] 1380 43
## [61,] 2121 48
## [62,] 2189 48
## [63,] 2121 49
## [64,] 2189 49
## [65,] 1556 54
## [66,] 2217 56
## [67,] 2474 56
## [68,] 2577 62
## [69,] 2577 63
## [70,] 2490 79
MSZoning: let’s assign the “NA” values to the most common category.
prop.table(table(houses$MSZoning))
##
## C (all) FV RH RL RM
## 0.008576329 0.047684391 0.008919383 0.777015437 0.157804460
houses[,3][is.na(houses[,3])] <- "RL"
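Filling “NA” with the most common category is a step that recurs for several variables in this section. A small helper can make it explicit; this is only a sketch, and `most_common` is a hypothetical function name, not part of the original code.

```r
# Return the most frequent non-missing value of a vector.
most_common <- function(x) {
  counts <- table(x)                 # table() ignores NA by default
  names(counts)[which.max(counts)]
}

# Hypothetical zoning codes with one missing entry.
zoning <- c("RL", "RM", "RL", NA, "RL", "FV")
zoning[is.na(zoning)] <- most_common(zoning)

zoning
```

In case of a tie, `which.max()` returns the first maximum, so the level that comes first in the table is used.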
Utilities: more than 99.9% of the houses have this variable at the level “AllPub”, so I replace the “NA” with that value (later I’ll delete this variable).
prop.table(table(houses$Utilities))
##
## AllPub NoSeWa
## 0.999657182 0.000342818
houses[,10][is.na(houses[,10])] <- "AllPub"
Exterior1st and Exterior2nd: let’s replace the “NA” in both with the level “Other”, already present as an Exterior2nd level (later I’ll group the less frequent Exterior1st levels).
prop.table(table(houses$Exterior1st))
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd
## 0.0150788211 0.0006854010 0.0020562029 0.0298149417 0.0006854010 0.0431802605
## HdBoard ImStucc MetalSd Plywood Stone Stucco
## 0.1514736121 0.0003427005 0.1542152159 0.0757368060 0.0006854010 0.0147361206
## VinylSd Wd Sdng WdShing
## 0.3512679918 0.1408498972 0.0191912269
houses$Exterior1st <- as.character(houses$Exterior1st)
houses[,24][is.na(houses[,24])] <- "Other"
houses$Exterior1st <- as.factor(houses$Exterior1st)
houses[,25][is.na(houses[,25])] <- "Other"
# MasVnrType (column 26) and MasVnrArea (column 27)
houses[,26][is.na(houses[,26])] <- "None"
houses[,27][is.na(houses[,27])] <- 0
houses[2121,]
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape
## 2121 2121 20 RM 99 5940 Pave NoAll IR1
## LandContour Utilities LotConfig LandSlope Neighborhood Condition1
## 2121 Lvl AllPub FR3 Gtl BrkSide Feedr
## Condition2 BldgType HouseStyle OverallQual OverallCond YearBuilt
## 2121 Norm 1Fam 1Story 4 7 1946
## YearRemodAdd RoofStyle RoofMatl Exterior1st Exterior2nd MasVnrType
## 2121 1950 Gable CompShg MetalSd CBlock None
## MasVnrArea ExterQual ExterCond Foundation BsmtQual BsmtCond BsmtExposure
## 2121 0 TA TA PConc NoB NoB NoB
## BsmtFinType1 BsmtFinSF1 BsmtFinType2 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 2121 NoB NA NoB NA NA NA
## Heating HeatingQC CentralAir Electrical X1stFlrSF X2ndFlrSF LowQualFinSF
## 2121 GasA TA Y FuseA 896 0 0
## GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr
## 2121 896 NA NA 1 0 2
## KitchenAbvGr KitchenQual TotRmsAbvGrd Functional Fireplaces FireplaceQu
## 2121 1 TA 4 Typ 0 NoFp
## GarageType GarageYrBlt GarageFinish GarageCars GarageArea GarageQual
## 2121 Detchd 1946 Unf 1 280 TA
## GarageCond PavedDrive WoodDeckSF OpenPorchSF EnclosedPorch X3SsnPorch
## 2121 TA Y 0 0 0 0
## ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold
## 2121 0 0 NoPo MnPrv None 0 4 2008
## SaleType SaleCondition SalePrice
## 2121 ConLD Abnorml 0
houses[,c(35,37:39,48,49)][is.na(houses[,c(35,37:39,48,49)])] <- 0
Electrical: let’s replace the “NA” with “Other”, a level that I will create later in order to merge the less frequent Electrical levels.
prop.table(table(houses$Electrical))
##
## FuseA FuseF FuseP Mix SBrkr
## 0.0644276902 0.0171350240 0.0027416038 0.0003427005 0.9153529815
houses[,43][is.na(houses[,43])] <- "Other"
KitchenQual: let’s replace the “NA” with “TA”, which means Typical, the most frequent level of this variable.
prop.table(table(houses$KitchenQual))
##
## Ex Fa Gd TA
## 0.07025360 0.02398903 0.39444825 0.51130912
houses[,54][is.na(houses[,54])] <- "TA"
Functional: let’s replace the “NA” with “Typ”, which means Typical Functionality, the most frequent level of this variable.
prop.table(table(houses$Functional))
##
## Maj1 Maj2 Min1 Min2 Mod Sev
## 0.0065135413 0.0030853617 0.0222831676 0.0239972575 0.0119986287 0.0006856359
## Typ
## 0.9314364073
houses[,56][is.na(houses[,56])] <- "Typ"
GarageCars and GarageArea: let’s replace the “NA” in both with 0, as for the other houses without a garage.
houses[,c(62,63)][is.na(houses[,c(62,63)])] <- 0
SaleType: let’s replace the “NA” with “Oth”, a level that is already present.
prop.table(table(houses$SaleType))
##
## COD Con ConLD ConLI ConLw CWD
## 0.029814942 0.001713502 0.008910212 0.003084304 0.002741604 0.004112406
## New Oth WD
## 0.081905415 0.002398903 0.865318711
houses[,79][is.na(houses[,79])] <- "Oth"
Finally, check that all “NA” have been replaced.
sum(is.na(houses))
## [1] 0
houses$MSSubClass <- as.factor(houses$MSSubClass)
houses$MSZoning <- as.factor(houses$MSZoning)
houses$Street <- as.factor(houses$Street)
houses$Alley <- as.factor(houses$Alley)
houses$LotShape <- as.factor(houses$LotShape)
houses$LandContour <- as.factor(houses$LandContour)
houses$Utilities <- as.factor(houses$Utilities)
houses$LotConfig <- as.factor(houses$LotConfig)
houses$LandSlope <- as.factor(houses$LandSlope)
houses$Neighborhood <- as.factor(houses$Neighborhood)
houses$Condition1 <- as.factor(houses$Condition1)
houses$Condition2 <- as.factor(houses$Condition2)
houses$BldgType <- as.factor(houses$BldgType)
houses$HouseStyle <- as.factor(houses$HouseStyle)
houses$RoofStyle <- as.factor(houses$RoofStyle)
houses$RoofMatl <- as.factor(houses$RoofMatl)
houses$Exterior1st <- as.factor(houses$Exterior1st)
houses$Exterior2nd <- as.factor(houses$Exterior2nd)
houses$MasVnrType <- as.factor(houses$MasVnrType)
houses$ExterQual <- as.factor(houses$ExterQual)
houses$ExterCond <- as.factor(houses$ExterCond)
houses$Foundation <- as.factor(houses$Foundation)
houses$BsmtQual <- as.factor(houses$BsmtQual)
houses$BsmtCond <- as.factor(houses$BsmtCond)
houses$BsmtExposure <- as.factor(houses$BsmtExposure)
houses$BsmtFinType1 <- as.factor(houses$BsmtFinType1)
houses$BsmtFinType2 <- as.factor(houses$BsmtFinType2)
houses$Heating <- as.factor(houses$Heating)
houses$HeatingQC <- as.factor(houses$HeatingQC)
houses$CentralAir <- as.factor(houses$CentralAir)
houses$Electrical <- as.factor(houses$Electrical)
houses$KitchenQual <- as.factor(houses$KitchenQual)
houses$Functional <- as.factor(houses$Functional)
houses$FireplaceQu <- as.factor(houses$FireplaceQu)
houses$GarageType <- as.factor(houses$GarageType)
houses$GarageFinish <- as.factor(houses$GarageFinish)
houses$GarageQual <- as.factor(houses$GarageQual)
houses$GarageCond <- as.factor(houses$GarageCond)
houses$PavedDrive <- as.factor(houses$PavedDrive)
houses$PoolQC <- as.factor(houses$PoolQC)
houses$Fence <- as.factor(houses$Fence)
houses$MiscFeature <- as.factor(houses$MiscFeature)
houses$SaleType <- as.factor(houses$SaleType)
houses$SaleCondition <- as.factor(houses$SaleCondition)
factorial_variables <- c(2,3,6:17,22:26,28:34,36,40:43,54,56,58,59,61,64:66,73:75,79,80)
str(houses[,factorial_variables])
## 'data.frame': 2919 obs. of 44 variables:
## $ MSSubClass : Factor w/ 16 levels "20","30","40",..: 6 1 6 7 6 5 1 6 5 16 ...
## $ MSZoning : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
## $ Street : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
## $ Alley : Factor w/ 3 levels "Grvl","NoAll",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ LotShape : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
## $ LandContour : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Utilities : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
## $ LotConfig : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
## $ LandSlope : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
## $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
## $ Condition1 : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
## $ Condition2 : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
## $ BldgType : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ HouseStyle : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
## $ RoofStyle : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ RoofMatl : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ Exterior1st : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 15 14 14 14 7 4 9 ...
## $ Exterior2nd : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
## $ MasVnrType : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
## $ ExterQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
## $ ExterCond : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ Foundation : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
## $ BsmtQual : Factor w/ 5 levels "Ex","Fa","Gd",..: 3 3 3 5 3 3 1 3 5 5 ...
## $ BsmtCond : Factor w/ 5 levels "Fa","Gd","NoB",..: 5 5 5 2 5 5 5 5 5 5 ...
## $ BsmtExposure : Factor w/ 5 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
## $ BsmtFinType1 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 7 3 ...
## $ BsmtFinType2 : Factor w/ 7 levels "ALQ","BLQ","GLQ",..: 7 7 7 7 7 7 7 2 7 7 ...
## $ Heating : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ HeatingQC : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
## $ CentralAir : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
## $ Electrical : Factor w/ 6 levels "FuseA","FuseF",..: 6 6 6 6 6 6 6 6 2 6 ...
## $ KitchenQual : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
## $ Functional : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
## $ FireplaceQu : Factor w/ 6 levels "Ex","Fa","Gd",..: 4 6 6 3 6 4 3 6 6 6 ...
## $ GarageType : Factor w/ 7 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
## $ GarageFinish : Factor w/ 4 levels "Fin","NoG","RFn",..: 3 3 3 4 3 4 3 3 4 3 ...
## $ GarageQual : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 2 3 ...
## $ GarageCond : Factor w/ 6 levels "Ex","Fa","Gd",..: 6 6 6 6 6 6 6 6 6 6 ...
## $ PavedDrive : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
## $ PoolQC : Factor w/ 4 levels "Ex","Fa","Gd",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Fence : Factor w/ 5 levels "GdPrv","GdWo",..: 5 5 5 5 5 3 5 5 5 5 ...
## $ MiscFeature : Factor w/ 5 levels "Gar2","None",..: 2 2 2 2 2 4 2 4 2 2 ...
## $ SaleType : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
## $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
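The one-line-per-variable conversion to factor can also be written compactly for all character columns at once. A minimal sketch on toy data (hypothetical values); note that a variable stored as integer, such as MSSubClass, would not be caught by this test and would still need to be converted by name.

```r
toy <- data.frame(MSZoning = c("RL", "RM", "RL"),
                  Street   = c("Pave", "Pave", "Grvl"),
                  LotArea  = c(8450, 9600, 11250),
                  stringsAsFactors = FALSE)

# Find the character columns and convert each of them to a factor.
chr_cols <- names(toy)[sapply(toy, is.character)]
toy[chr_cols] <- lapply(toy[chr_cols], as.factor)

str(toy)
```

Numeric columns such as LotArea are left untouched.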
houses %>%
gather(Attributes, value, factorial_variables[1:9]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[10:18]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[19:27]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[28:36]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, factorial_variables[37:44]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_bar(stat="count", show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values", y="Frequency",
title="Categorical Variables - Histograms") +
scale_fill_discrete() +
my_theme
Street and Utilities almost always take the same value, Neighborhood has too many levels, and MSSubClass has many levels that are hard to interpret and can create bias, so I delete them (Neighborhood is dropped a few steps further down).
houses <- houses[,-which(names(houses) == "Street")]
houses <- houses[,-which(names(houses) == "Utilities")]
houses <- houses[,-which(names(houses) == "MSSubClass")]
LandSlope –> I’ll merge “Mod” and “Sev” as “ModSev” = Moderate/Severe Slope.
prop.table(table(houses$LandSlope))
##
## Gtl Mod Sev
## 0.951695786 0.042822885 0.005481329
houses <- transform(houses, LandSlope=revalue(LandSlope,c("Mod" = "ModSev")))
houses <- transform(houses, LandSlope=revalue(LandSlope,c("Sev" = "ModSev")))
prop.table(table(houses$LandSlope))
##
## Gtl ModSev
## 0.95169579 0.04830421
LotConfig –> I’ll merge “FR2” and “FR3” as “FR2/3” = Frontage on 2 or 3 sides of property.
prop.table(table(houses$LotConfig))
##
## Corner CulDSac FR2 FR3 Inside
## 0.175059952 0.060294621 0.029119561 0.004796163 0.730729702
houses <- transform(houses, LotConfig=revalue(LotConfig,c("FR2" = "FR2/3")))
houses <- transform(houses, LotConfig=revalue(LotConfig,c("FR3" = "FR2/3")))
prop.table(table(houses$LotConfig))
##
## Corner CulDSac FR2/3 Inside
## 0.17505995 0.06029462 0.03391572 0.73072970
LotShape –> I’ll merge “IR2” and “IR3” as “IR2” = Moderately or more Irregular.
prop.table(table(houses$LotShape))
##
## IR1 IR2 IR3 Reg
## 0.331620418 0.026036314 0.005481329 0.636861939
houses <- transform(houses, LotShape=revalue(LotShape,c("IR3" = "IR2")))
prop.table(table(houses$LotShape))
##
## IR1 IR2 Reg
## 0.33162042 0.03151764 0.63686194
MSZoning –> I’ll merge “RM” and “RH” as “RMH” = Residential Medium/High Density.
prop.table(table(houses$MSZoning))
##
## C (all) FV RH RL RM
## 0.008564577 0.047619048 0.008907160 0.777321000 0.157588215
houses <- transform(houses, MSZoning=revalue(MSZoning,c("RM" = "RMH")))
houses <- transform(houses, MSZoning=revalue(MSZoning,c("RH" = "RMH")))
prop.table(table(houses$MSZoning))
##
## C (all) FV RMH RL
## 0.008564577 0.047619048 0.166495375 0.777321000
I have removed Condition1 and Condition2 because both have many levels that are difficult to interpret and some of those levels have no observations. I have also removed RoofMatl because many of its categories are empty and almost all the observations have the same value. Then I have deleted Exterior2nd because it is very similar to Exterior1st, and Neighborhood because it has too many levels.
houses <- houses[,-which(names(houses) == "Condition1")]
houses <- houses[,-which(names(houses) == "Condition2")]
houses <- houses[,-which(names(houses) == "RoofMatl")]
houses <- houses[,-which(names(houses) == "Exterior2nd")]
houses <- houses[,-which(names(houses) == "Neighborhood")]
BldgType –> I’ll merge “TwnhsE” with “Twnhs” = Townhouse and also merge “2fmCon” and “Duplex” as “2Fam” = Two-family.
prop.table(table(houses$BldgType))
##
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 0.83076396 0.02124015 0.03734156 0.03288798 0.07776636
houses <- transform(houses, BldgType=revalue(BldgType,c("TwnhsE" = "Twnhs")))
houses <- transform(houses, BldgType=revalue(BldgType,c("2fmCon" = "2Fam")))
houses <- transform(houses, BldgType=revalue(BldgType,c("Duplex" = "2Fam")))
prop.table(table(houses$BldgType))
##
## 1Fam 2Fam Twnhs
## 0.83076396 0.05858171 0.11065433
Exterior1st –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$Exterior1st))
##
## AsbShng AsphShn BrkComm BrkFace CBlock CemntBd
## 0.0150736554 0.0006851662 0.0020554985 0.0298047276 0.0006851662 0.0431654676
## HdBoard ImStucc MetalSd Other Plywood Stone
## 0.1514217198 0.0003425831 0.1541623844 0.0003425831 0.0757108599 0.0006851662
## Stucco VinylSd Wd Sdng WdShing
## 0.0147310723 0.3511476533 0.1408016444 0.0191846523
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("AsbShng" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("AsphShn" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("BrkComm" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("BrkFace" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("CBlock" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("CemntBd" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("ImStucc" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("Stone" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("Stucco" = "Other")))
houses <- transform(houses, Exterior1st=revalue(Exterior1st,c("WdShing" = "Other")))
prop.table(table(houses$Exterior1st))
##
## Other HdBoard MetalSd Plywood VinylSd Wd Sdng
## 0.12675574 0.15142172 0.15416238 0.07571086 0.35114765 0.14080164
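The “merge everything below 5% into Other” rule is applied by hand to several variables in this section. As a minimal sketch, the same rule could be wrapped in a small base R helper. The function lump_rare() and the toy factor below are hypothetical and are not part of the original analysis, which uses revalue() level by level.

```r
# Hypothetical helper (not in the original analysis): collapse every level
# whose relative frequency is below `threshold` into a single `other` level.
lump_rare <- function(x, threshold = 0.05, other = "Other") {
  x <- as.factor(x)
  freq <- prop.table(table(x))
  rare <- names(freq)[freq < threshold]
  # Giving several levels the same label merges them into one level
  levels(x)[levels(x) %in% rare] <- other
  x
}

# Invented toy factor, only to show the behaviour
roof <- factor(c(rep("Gable", 80), rep("Hip", 17), "Flat", "Shed", "Mansard"))
roof_lumped <- lump_rare(roof)
prop.table(table(roof_lumped))
```

With such a helper each variable would need one call instead of one revalue() per level, although the resulting level order may differ from the one shown in the outputs of this report.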
I have to inspect HouseStyle –> I’ll merge “1Story”, “1.5Fin” and “1.5Unf” as “1aH” = One or one and one-half story, merge “2Story”, “2.5Fin” and “2.5Unf” as “2aH” = Two or two and one-half story, and also merge “SFoyer” and “SLvl” as “SFL” = Split Foyer or Split Level.
prop.table(table(houses$HouseStyle))
##
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story
## 0.107571086 0.006509078 0.503939705 0.002740665 0.008221994 0.298732443
## SFoyer SLvl
## 0.028434395 0.043850634
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1Story" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1.5Fin" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("1.5Unf" = "1aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2Story" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2.5Fin" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("2.5Unf" = "2aH")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("SFoyer" = "SFL")))
houses <- transform(houses, HouseStyle=revalue(HouseStyle,c("SLvl" = "SFL")))
prop.table(table(houses$HouseStyle))
##
## 1aH 2aH SFL
## 0.61801987 0.30969510 0.07228503
RoofStyle –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$RoofStyle))
##
## Flat Gable Gambrel Hip Mansard Shed
## 0.006851662 0.791366906 0.007536828 0.188763275 0.003768414 0.001712915
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Flat" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Gambrel" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Mansard" = "Other")))
houses <- transform(houses, RoofStyle=revalue(RoofStyle,c("Shed" = "Other")))
prop.table(table(houses$RoofStyle))
##
## Other Gable Hip
## 0.01986982 0.79136691 0.18876328
BsmtCond –> I’ll merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$BsmtCond))
##
## Fa Gd NoB Po TA
## 0.035628640 0.041795135 0.028091812 0.001712915 0.892771497
houses <- transform(houses, BsmtCond=revalue(BsmtCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, BsmtCond=revalue(BsmtCond,c("Po" = "Fa_Po")))
prop.table(table(houses$BsmtCond))
##
## Fa_Po Gd NoB TA
## 0.03734156 0.04179514 0.02809181 0.89277150
BsmtFinType1 –> I’ll merge “ALQ”, “BLQ” and “GLQ” as “LQ” = Living Quarters and also merge “LwQ” and “Unf” as “LwQ_Unf” = Low Quality or Unfinished.
prop.table(table(houses$BsmtFinType1))
##
## ALQ BLQ GLQ LwQ NoB Rec Unf
## 0.14696814 0.09215485 0.29085303 0.05275779 0.02706406 0.09866393 0.29153820
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("ALQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("BLQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("GLQ" = "LQ")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("LwQ" = "LwQ_Unf")))
houses <- transform(houses, BsmtFinType1=revalue(BsmtFinType1,c("Unf" = "LwQ_Unf")))
prop.table(table(houses$BsmtFinType1))
##
## LQ LwQ_Unf NoB Rec
## 0.52997602 0.34429599 0.02706406 0.09866393
BsmtFinType2 –> I’ll merge “ALQ”, “BLQ” and “GLQ” as “LQ” = Living Quarters and also merge “LwQ” and “Unf” as “LwQ_Unf” = Low Quality or Unfinished.
prop.table(table(houses$BsmtFinType2))
##
## ALQ BLQ GLQ LwQ NoB Rec Unf
## 0.01781432 0.02329565 0.01164782 0.02980473 0.02740665 0.03597122 0.85405961
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("ALQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("BLQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("GLQ" = "LQ")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("LwQ" = "LwQ_Unf")))
houses <- transform(houses, BsmtFinType2=revalue(BsmtFinType2,c("Unf" = "LwQ_Unf")))
prop.table(table(houses$BsmtFinType2))
##
## LQ LwQ_Unf NoB Rec
## 0.05275779 0.88386434 0.02740665 0.03597122
ExterCond –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$ExterCond))
##
## Ex Fa Gd Po TA
## 0.004110997 0.022953066 0.102432340 0.001027749 0.869475848
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Ex" = "Ex_Gd")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Gd" = "Ex_Gd")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, ExterCond=revalue(ExterCond,c("Po" = "Fa_Po")))
prop.table(table(houses$ExterCond))
##
## Ex_Gd Fa_Po TA
## 0.10654334 0.02398082 0.86947585
ExterQual –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good.
prop.table(table(houses$ExterQual))
##
## Ex Fa Gd TA
## 0.03665639 0.01199041 0.33538883 0.61596437
houses <- transform(houses, ExterQual=revalue(ExterQual,c("Ex" = "Ex_Gd")))
houses <- transform(houses, ExterQual=revalue(ExterQual,c("Gd" = "Ex_Gd")))
prop.table(table(houses$ExterQual))
##
## Ex_Gd Fa TA
## 0.37204522 0.01199041 0.61596437
Foundation –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$Foundation))
##
## BrkTil CBlock PConc Slab Stone Wood
## 0.106543337 0.423090099 0.448098664 0.016786571 0.003768414 0.001712915
houses <- transform(houses, Foundation=revalue(Foundation,c("Slab" = "Other")))
houses <- transform(houses, Foundation=revalue(Foundation,c("Stone" = "Other")))
houses <- transform(houses, Foundation=revalue(Foundation,c("Wood" = "Other")))
prop.table(table(houses$Foundation))
##
## BrkTil CBlock PConc Other
## 0.1065433 0.4230901 0.4480987 0.0222679
MasVnrType –> I’ll merge “BrkCmn” and “BrkFace” as “Brk” = Brick.
prop.table(table(houses$MasVnrType))
##
## BrkCmn BrkFace None Stone
## 0.008564577 0.301130524 0.605001713 0.085303186
houses <- transform(houses, MasVnrType=revalue(MasVnrType,c("BrkCmn" = "Brk")))
houses <- transform(houses, MasVnrType=revalue(MasVnrType,c("BrkFace" = "Brk")))
prop.table(table(houses$MasVnrType))
##
## Brk None Stone
## 0.30969510 0.60500171 0.08530319
I remove Heating because almost all the houses use gas heating and some levels are empty.
houses <- houses[,-which(names(houses) == "Heating")]
Electrical –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$Electrical))
##
## FuseA FuseF FuseP Mix Other SBrkr
## 0.0644056184 0.0171291538 0.0027406646 0.0003425831 0.0003425831 0.9150393971
houses <- transform(houses, Electrical=revalue(Electrical,c("FuseF" = "Other")))
houses <- transform(houses, Electrical=revalue(Electrical,c("FuseP" = "Other")))
houses <- transform(houses, Electrical=revalue(Electrical,c("Mix" = "Other")))
prop.table(table(houses$Electrical))
##
## FuseA Other SBrkr
## 0.06440562 0.02055498 0.91503940
FireplaceQu –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$FireplaceQu))
##
## Ex Fa Gd NoFp Po TA
## 0.01473107 0.02535115 0.25488181 0.48646797 0.01575882 0.20280918
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Ex" = "Ex_Gd")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Gd" = "Ex_Gd")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Fa" = "Fa_Po")))
houses <- transform(houses, FireplaceQu=revalue(FireplaceQu,c("Po" = "Fa_Po")))
prop.table(table(houses$FireplaceQu))
##
## Ex_Gd Fa_Po NoFp TA
## 0.26961288 0.04110997 0.48646797 0.20280918
Functional –> I’ll merge “Maj1”, “Maj2” and “Mod” as “Maj” = Major Deductions and also merge “Min1”, “Min2” and “Sev” as “Min” = Minor Deductions.
prop.table(table(houses$Functional))
##
## Maj1 Maj2 Min1 Min2 Mod Sev
## 0.0065090785 0.0030832477 0.0222679000 0.0239808153 0.0119904077 0.0006851662
## Typ
## 0.9314833847
houses <- transform(houses, Functional=revalue(Functional,c("Maj1" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Maj2" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Mod" = "Maj")))
houses <- transform(houses, Functional=revalue(Functional,c("Min1" = "Min")))
houses <- transform(houses, Functional=revalue(Functional,c("Min2" = "Min")))
houses <- transform(houses, Functional=revalue(Functional,c("Sev" = "Min")))
prop.table(table(houses$Functional))
##
## Maj Min Typ
## 0.02158273 0.04693388 0.93148338
GarageType –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$GarageType))
##
## 2Types Attchd Basment BuiltIn CarPort Detchd
## 0.007879411 0.590270641 0.012332991 0.063720452 0.005138746 0.266872217
## NoG
## 0.053785543
houses <- transform(houses, GarageType=revalue(GarageType,c("2Types" = "Other")))
houses <- transform(houses, GarageType=revalue(GarageType,c("Basment" = "Other")))
houses <- transform(houses, GarageType=revalue(GarageType,c("CarPort" = "Other")))
prop.table(table(houses$GarageType))
##
## Other Attchd BuiltIn Detchd NoG
## 0.02535115 0.59027064 0.06372045 0.26687222 0.05378554
HeatingQC –> I’ll merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$HeatingQC))
##
## Ex Fa Gd Po TA
## 0.511476533 0.031517643 0.162384378 0.001027749 0.293593696
houses <- transform(houses, HeatingQC=revalue(HeatingQC,c("Fa" = "Fa_Po")))
houses <- transform(houses, HeatingQC=revalue(HeatingQC,c("Po" = "Fa_Po")))
prop.table(table(houses$HeatingQC))
##
## Ex Fa_Po Gd TA
## 0.51147653 0.03254539 0.16238438 0.29359370
Fence –> I’ll merge “GdPrv” and “GdWo” as “GdPrvWo” = Good Privacy or Good Wood and also merge “MnPrv” and “MnWw” as “MnPrvWw” = Minimum Privacy or Minimum Wood/Wire.
prop.table(table(houses$Fence))
##
## GdPrv GdWo MnPrv MnWw NoFen
## 0.040424803 0.038369305 0.112709832 0.004110997 0.804385063
houses <- transform(houses, Fence=revalue(Fence,c("GdPrv" = "GdPrvWo")))
houses <- transform(houses, Fence=revalue(Fence,c("GdWo" = "GdPrvWo")))
houses <- transform(houses, Fence=revalue(Fence,c("MnPrv" = "MnPrvWw")))
houses <- transform(houses, Fence=revalue(Fence,c("MnWw" = "MnPrvWw")))
prop.table(table(houses$Fence))
##
## GdPrvWo MnPrvWw NoFen
## 0.07879411 0.11682083 0.80438506
I have compared GarageCond and GarageQual: because they seem similar, I have decided to keep only GarageCond and delete GarageQual, since most of its values are equal to those of GarageCond.
GarageCond –> I’ll merge “Ex” and “Gd” as “Ex_Gd” = Excellent or Good and also merge “Fa” and “Po” as “Fa_Po” = Fair or Poor.
prop.table(table(houses$GarageCond))
##
## Ex Fa Gd NoG Po TA
## 0.001027749 0.025351148 0.005138746 0.054470709 0.004796163 0.909215485
prop.table(table(houses$GarageQual))
##
## Ex Fa Gd NoG Po TA
## 0.001027749 0.042480301 0.008221994 0.054470709 0.001712915 0.892086331
houses <- houses[,-which(names(houses) == "GarageQual")]
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Ex" = "Ex_Gd")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Gd" = "Ex_Gd")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Fa" = "Fa_Po")))
houses <- transform(houses, GarageCond=revalue(GarageCond,c("Po" = "Fa_Po")))
prop.table(table(houses$GarageCond))
##
## Ex_Gd Fa_Po NoG TA
## 0.006166495 0.030147311 0.054470709 0.909215485
MiscFeature –> I’ll merge “Gar2”, “Othr”, “Shed” and “TenC” as “Yes” = a miscellaneous feature is present, keeping “None” for the houses without one.
prop.table(table(houses$MiscFeature))
##
## Gar2 None Othr Shed TenC
## 0.0017129154 0.9640287770 0.0013703323 0.0325453923 0.0003425831
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Gar2" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Othr" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("Shed" = "Yes")))
houses <- transform(houses, MiscFeature=revalue(MiscFeature,c("TenC" = "Yes")))
prop.table(table(houses$MiscFeature))
##
## Yes None
## 0.03597122 0.96402878
PavedDrive –> I’ll merge “Y” and “P” as “Y_P” = Paved or Partially Paved.
prop.table(table(houses$PavedDrive))
##
## N P Y
## 0.07399794 0.02124015 0.90476190
houses <- transform(houses, PavedDrive=revalue(PavedDrive,c("Y" = "Y_P")))
houses <- transform(houses, PavedDrive=revalue(PavedDrive,c("P" = "Y_P")))
prop.table(table(houses$PavedDrive))
##
## N Y_P
## 0.07399794 0.92600206
For PoolQC, more than 99% of the houses have no pool. Since having a pool would be a plus, I don’t want to delete the variable –> I have decided to turn it into the variable Pool with two levels, “Yes” and “No”.
prop.table(table(houses$PoolQC))
##
## Ex Fa Gd NoPo
## 0.0013703323 0.0006851662 0.0013703323 0.9965741692
names(houses)[names(houses) == 'PoolQC'] <- 'Pool'
houses <- transform(houses, Pool=revalue(Pool,c("Ex" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("Fa" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("Gd" = "Yes")))
houses <- transform(houses, Pool=revalue(Pool,c("NoPo" = "No")))
prop.table(table(houses$Pool))
##
## Yes No
## 0.003425831 0.996574169
SaleCondition –> I’ll merge into “Other” all the levels with a share below 5%.
prop.table(table(houses$SaleCondition))
##
## Abnorml AdjLand Alloca Family Normal Partial
## 0.065090785 0.004110997 0.008221994 0.015758822 0.822884550 0.083932854
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("AdjLand" = "Other")))
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("Alloca" = "Other")))
houses <- transform(houses, SaleCondition=revalue(SaleCondition,c("Family" = "Other")))
prop.table(table(houses$SaleCondition))
##
## Abnorml Other Normal Partial
## 0.06509078 0.02809181 0.82288455 0.08393285
SaleType –> I’ll merge “Con”, “ConLD”, “ConLI” and “ConLw” as “Oth” = Other and also merge “WD” and “CWD” as “WD” = Warranty Deed.
prop.table(table(houses$SaleType))
##
## COD Con ConLD ConLI ConLw CWD
## 0.029804728 0.001712915 0.008907160 0.003083248 0.002740665 0.004110997
## New Oth WD
## 0.081877355 0.002740665 0.865022268
houses <- transform(houses, SaleType=revalue(SaleType,c("Con" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLD" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLI" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("ConLw" = "Oth")))
houses <- transform(houses, SaleType=revalue(SaleType,c("CWD" = "WD")))
prop.table(table(houses$SaleType))
##
## COD Oth WD New
## 0.02980473 0.01918465 0.86913326 0.08187736
continuous_variables <- c(3,4,12:15,19,27,29:31,35:44,46,48,51,53,54,57:62,66:68,71)
str(houses[,continuous_variables])
## 'data.frame': 2919 obs. of 36 variables:
## $ LotFrontage : num 65 80 68 60 84 85 75 0 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ MasVnrArea : num 196 0 162 0 350 0 186 240 0 0 ...
## $ BsmtFinSF1 : num 706 978 486 216 655 ...
## $ BsmtFinSF2 : num 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : num 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : num 856 1262 920 756 1145 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : num 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : num 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ GarageYrBlt : num 2003 1976 2001 1998 2000 ...
## $ GarageCars : num 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : num 548 460 608 642 836 480 636 484 468 205 ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SalePrice : num 208500 181500 223500 140000 250000 ...
houses %>%
gather(Attributes, value, continuous_variables[1:9]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, continuous_variables[10:18]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, continuous_variables[19:27]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
houses %>%
gather(Attributes, value, continuous_variables[28:36]) %>%
ggplot(aes(x=value, fill=Attributes)) +
geom_boxplot(show.legend = F) +
facet_wrap(~Attributes, scales="free_x") +
labs(x="Values",
title="Continuous Variables - Boxplot") +
scale_fill_discrete() +
my_theme
BsmtFinSF2 has too many outliers, so I’ll remove it and evaluate only the presence or absence of the second finished basement area.
houses <- houses[,-which(names(houses) == "BsmtFinSF2")]
BsmtHalfBath is almost always 0 and is difficult to interpret, so I’ll remove it, also because I already have BsmtFullBath to consider.
houses <- houses[,-which(names(houses) == "BsmtHalfBath")]
LowQualFinSF has too many outliers, so I’ll remove it.
houses <- houses[,-which(names(houses) == "LowQualFinSF")]
MiscVal has many outliers and is difficult to interpret, so I’ll remove it, also because I already have a binary variable (MiscFeature) that tells whether a miscellaneous feature is present or not.
houses <- houses[,-which(names(houses) == "MiscVal")]
The same reasoning used for MiscVal is applied to PoolArea.
houses <- houses[,-which(names(houses) == "PoolArea")]
I create ClosedPorch with two levels (“Yes” or “No”) in order to delete the other continuous variables that are not easy to interpret.
houses$ClosedPorch = as.factor(ifelse(houses$EnclosedPorch > 0 | houses$ScreenPorch > 0 | houses$X3SsnPorch > 0, "Yes", "No"))
houses <- houses[,-which(names(houses) == "EnclosedPorch")]
houses <- houses[,-which(names(houses) == "ScreenPorch")]
houses <- houses[,-which(names(houses) == "X3SsnPorch")]
prop.table(table(houses$ClosedPorch))
##
## No Yes
## 0.7519699 0.2480301
I recode OpenPorchSF as a factor variable, OpenPorch, with two levels, “Yes” or “No”.
houses$OpenPorch = as.factor(ifelse(houses$OpenPorchSF > 0, "Yes", "No"))
houses <- houses[,-which(names(houses) == "OpenPorchSF")]
prop.table(table(houses$OpenPorch))
##
## No Yes
## 0.4446728 0.5553272
After proceeding with the analysis I’ll relocate SalePrice as the last column, just to keep the data in better order.
houses <- houses %>%
relocate(SalePrice, .after = last_col())
I have split the data into train and test sets and I have deleted SalePrice from the test set (previously inserted as a column of zeros).
train <- houses[1:1460,]
test <- houses[1461:2898,]
Going forward, I redefine the indices of the factor and continuous variables, obviously excluding Id.
factorial_variables <- c(2,3,5:11,14,16:18,20:26,28,31:33,42,44,46,47,49,52,53,55:57,59:63)
continuous_variables <- c(4,12,13,15,19,27,29,30,34:41,43,45,48,50,51,54,58,64)
Then I create the train and test sets by type of variable.
train_num <- train[,continuous_variables]
test_num <- test[,continuous_variables]
train_fact <- train[,c(1,factorial_variables)]
test_fact <- test[,c(1,factorial_variables)]
test_id <- test[,1]
Correlation is a measure of the strength of a linear relationship between two quantitative variables (e.g., height, weight). This post will define positive and negative correlations, illustrated with examples and explanations of how to measure correlation. Finally, some pitfalls regarding the use of correlation will be discussed.
Positive correlation is a relationship between two variables in which both variables move in the same direction: one variable increases as the other increases, and vice versa. For example, the more you exercise, the more calories you will burn. Negative correlation, instead, is a relationship where one variable increases as the other decreases, and vice versa.
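As a small illustration of the two cases, here is a sketch on made-up numbers. The three vectors below are invented for the example and are not part of the houses data.

```r
# Invented toy data, only to illustrate the sign of the correlation
exercise_hours  <- c(1, 2, 3, 4, 5, 6)
calories_burned <- c(110, 190, 320, 390, 510, 600)  # grows with exercise
resting_minutes <- c(55, 50, 41, 33, 24, 20)        # shrinks with exercise

cor_pos <- cor(exercise_hours, calories_burned)  # close to +1
cor_neg <- cor(exercise_hours, resting_minutes)  # close to -1
print(c(positive = cor_pos, negative = cor_neg))
```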
tot_corr <- cor(train_num[,-24])
col <- colorRampPalette(c('darkred', 'white', 'black'))(10)
corrplot(tot_corr, method="pie", type= "upper", diag = F, tl.srt = 40, tl.cex = 0.8, tl.col = "black", col = col)
Data can contain attributes that are highly correlated with each other. Many methods perform better if highly correlated attributes are removed. I want to remove attributes with an absolute correlation of 0.75 or higher.
So, I have removed GrLivArea, GarageCars and X1stFlrSF.
highlyCorrelated <- findCorrelation(tot_corr, cutoff=0.75)
str(train_num[,highlyCorrelated])
## 'data.frame': 1460 obs. of 3 variables:
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ GarageCars: num 2 2 2 3 3 2 2 2 2 1 ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
train_num <- train_num[,-highlyCorrelated]
test_num <- test_num[,-highlyCorrelated]
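To make the filtering step less of a black box, the following is a simplified sketch of the idea behind findCorrelation(): for each pair of variables above the cutoff, flag the one that is, on average, more correlated with all the others. The function flag_correlated() and the toy data are hypothetical, and this is an approximation of the idea rather than caret’s exact algorithm.

```r
# Simplified sketch (not caret's exact code): among each pair with
# |correlation| >= cutoff, flag the variable with the larger mean
# absolute correlation against all the other variables.
flag_correlated <- function(corr, cutoff = 0.75) {
  a <- abs(corr)
  diag(a) <- NA
  mean_abs <- rowMeans(a, na.rm = TRUE)
  flagged <- character(0)
  pairs <- which(a >= cutoff & upper.tri(a), arr.ind = TRUE)
  for (k in seq_len(nrow(pairs))) {
    i <- rownames(a)[pairs[k, 1]]
    j <- colnames(a)[pairs[k, 2]]
    if (i %in% flagged || j %in% flagged) next  # pair already handled
    flagged <- c(flagged, if (mean_abs[i] >= mean_abs[j]) i else j)
  }
  flagged
}

# Invented toy data: x2 is nearly a copy of x1, x3 is unrelated
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(200, sd = 0.1),
                  x3 = rnorm(200))
flagged <- flag_correlated(cor(toy))
flagged
```

Only one variable of the x1/x2 pair is flagged, so the information they share stays in the data once.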
SalePrice Density plot
ggplot(train, aes(x= SalePrice)) +
geom_density(fill = "black", alpha=0.6, show.legend=FALSE) +
labs(x="Values", y="Density", title="SalePrice - Density plot") +
my_theme
SalePrice Box plot
ggplot(train, aes(x= SalePrice)) +
geom_boxplot(fill = "black", alpha = 0.6, show.legend = F) +
coord_flip() +
labs(x="Values",title="SalePrice - Box plot") +
my_theme
SalePrice Correlation plot
price_cor <- train_num %>%
correlate() %>%
focus(SalePrice)
price_cor %>%
mutate(term = factor(term, levels = term[order(SalePrice)])) %>%
ggplot(aes(x = term, y = SalePrice, fill = SalePrice)) +
geom_bar(stat = "identity", show.legend = F) +
ylab("Correlation with Sale_Price") +
xlab("Variable") +
scale_fill_gradient(low = 'red', high = 'black') +
theme(plot.title = element_text(color = 'darkred', face = "bold.italic", size = 15),
plot.subtitle = element_text(color = 'darkred', size = 8),
plot.background =element_rect(fill = "snow",colour = "darkred",size = 1.5),
panel.grid.major = element_line(colour = "snow", size = 1),
panel.grid.minor = element_line(colour = "snow2"),
legend.title = element_text(colour="black", size=10),
panel.background =element_rect(fill = "snow2"),
legend.background = element_rect(fill="red3",size=0.5, linetype="solid",colour ="red3"),
axis.title = element_text(face = "bold.italic", color = "darkred"),
axis.text.x = element_text(face = "italic",color = 'red3', angle = 75, hjust = 1),
axis.text.y = element_text(face = "italic",color = 'red3'))
Before creating the predictive models I’m going to recode the datasets, transforming the categorical variables into dummy variables. A dummy variable is a numeric variable that represents categorical data, such as gender, race, political affiliation, etc. Technically, dummy variables are dichotomous, quantitative variables: they can take on only two quantitative values. As a practical matter, regression results are easiest to interpret when dummy variables are limited to the two values 1 and 0. Typically, 1 represents the presence of a qualitative attribute, and 0 represents its absence. Then I will merge the dummy variables with the continuous ones into two tibbles, one for train and one for test.
train_fact_mtx <- model.matrix(object = Id ~ . , data = train_fact)
train_fact <- data.frame(train_fact_mtx[,-1])
test_fact_mtx <- model.matrix(object = Id ~ . , data = test_fact)
test_fact <- data.frame(test_fact_mtx[,-1])
train <- as_tibble(cbind(train_fact, train_num))
test <- as_tibble(cbind(test_fact, test_num))
test <- test[,-112]
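To see what model.matrix() produces, here is a minimal sketch on an invented data frame. The toy columns reuse names from the dataset, but the values are made up.

```r
# Toy data frame (invented) with one id column and two factors
toy <- data.frame(
  Id         = 1:4,
  CentralAir = factor(c("Y", "N", "Y", "Y")),
  LotShape   = factor(c("Reg", "IR1", "IR2", "Reg"))
)

# Id is on the left-hand side only so that it is not expanded into dummies
toy_mtx <- model.matrix(Id ~ ., data = toy)
toy_dummies <- data.frame(toy_mtx[, -1])  # drop the intercept column
toy_dummies
```

A factor with k levels becomes k - 1 dummy columns, because the first level acts as the reference. Since model.matrix() builds the columns from the factor levels, train and test get the same columns as long as their factors share the same levels, which should be the case here because both come from the merged houses data.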
Then I have split the train dataset into train data and train target.
formula <- SalePrice ~ .
train_data <- train[, -112] %>% as.data.frame()
train_target <- train[, 112] %>% pull()
Before proceeding to fit the predictive models I have to define a train control in order to evaluate them.
For this purpose I have chosen repeated cross-validation.
Repeated k-fold cross-validation is a resampling procedure that provides a way to improve the estimated performance of a machine learning model. It simply repeats the cross-validation procedure multiple times and reports the mean result across all folds from all runs. This mean result is expected to be a more accurate estimate of the true unknown underlying mean performance of the model on the dataset, and its uncertainty can be summarized with the standard error.
I decided to resample with k = 10 and to repeat the cross-validation 3 times.
train_control_RCV <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
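What repeated cross-validation does can be sketched by hand in base R. The example below uses mtcars and the formula mpg ~ wt + hp as stand-ins for the houses data. It is only an illustration of the resampling scheme, not of how caret implements it internally.

```r
# Sketch of 10-fold cross-validation repeated 3 times, written by hand.
# mtcars and mpg ~ wt + hp are stand-ins, not the houses data.
set.seed(123)
k <- 10
repeats <- 3
n <- nrow(mtcars)
rmse_values <- c()

for (r in seq_len(repeats)) {
  folds <- sample(rep(seq_len(k), length.out = n))  # new random split per repeat
  for (f in seq_len(k)) {
    fit  <- lm(mpg ~ wt + hp, data = mtcars[folds != f, ])   # fit on k - 1 folds
    pred <- predict(fit, newdata = mtcars[folds == f, ])     # predict held-out fold
    obs  <- mtcars$mpg[folds == f]
    rmse_values <- c(rmse_values, sqrt(mean((obs - pred)^2)))
  }
}

length(rmse_values)  # k * repeats fold-level estimates
mean(rmse_values)    # averaged into a single resampled RMSE
```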
After the repeated cross-validation I will have, for each fitted model, three values that help me to evaluate it:
Root mean squared error (RMSE) is calculated as: \[RMSE = \sqrt {\frac {1} {n} \sum (y_i - \hat{y}_i)^2}\]
Mean absolute error (MAE) is calculated as: \[MAE = \frac {1} {n} \sum |y_i - \hat{y}_i|\]
The coefficient of determination (R²) is a statistic used in the context of statistical models whose main purpose is either the prediction of future outcomes or the testing of hypotheses, on the basis of other related information. It provides a measure of how well observed outcomes are replicated by the model, based on the proportion of total variation of outcomes explained by the model.
The coefficient of determination is calculated as: \[R^2 = 1- \frac {SS_{res}} {SS_{tot}}\]
Where the variability of the data set can be measured with two sums of squares formulas:
The total sum of squares (proportional to the variance of the data): \[SS_{tot} = \sum (y_i - \bar{y})^2\] Where: \[\bar{y} = \frac {1} {n} \sum y_i\]
The sum of squares of residuals, also called the residual sum of squares: \[SS_{res} = \sum (y_i - \hat{y}_i)^2\]
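The three formulas translate directly into R. The helper names rmse(), mae() and r2() and the four observations below are invented for this sketch. Note also that, as far as I know, the Rsquared value reported by caret is by default the squared correlation between observed and predicted values, so it can differ slightly from the formula above.

```r
# Hypothetical helpers implementing the three formulas above
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
mae  <- function(obs, pred) mean(abs(obs - pred))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Invented observations and predictions (every error is +/- 10)
obs  <- c(200, 150, 300, 250)
pred <- c(210, 140, 290, 260)
c(RMSE = rmse(obs, pred), MAE = mae(obs, pred), R2 = r2(obs, pred))
```

Here RMSE and MAE are both 10 because all the errors have the same size. With errors of different sizes RMSE would be larger than MAE, since squaring gives more weight to the big ones.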
Multiple Linear Regression (MLR) is a statistical technique for modelling the association between a dependent variable and several independent variables.
The functional form is given by:
\[Y = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + ... + \beta_{p} X_{p} + e\]
Where \(Y\) is the dependent variable, \(X_{1}, ..., X_{p}\) are the independent variables, \(\beta_{0}\) is the intercept, \(\beta_{1}, ..., \beta_{p}\) are the regression coefficients and \(e\) is the error term.
Let's start by fitting the model with Y = SalePrice.
\[SalePrice = \beta_{0} + \beta_{1} X_{1} + \beta_{2} X_{2} + ... + \beta_{p} X_{p} + e\]
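As a small illustration of this functional form, the sketch below simulates data with known coefficients and recovers them with `lm()`. The data and the variable names `x1`, `x2` and `y` are invented for the example and have nothing to do with the house data.

```r
# Simulate y = 3 + 2*x1 - 1.5*x2 + e with made-up predictors
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 3 + 2 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# Ordinary least squares fit
fit <- lm(y ~ x1 + x2)
coef(fit)

# The same estimate from the normal equations (X'X) beta = X'y
X <- cbind(1, x1, x2)
beta_hat <- solve(t(X) %*% X, t(X) %*% y)
```

The estimated coefficients should land close to the true values 3, 2 and -1.5, and `beta_hat` should agree with `coef(fit)` up to numerical precision.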
set.seed(123)
lm_RCV <- train(formula, data = train, method = "lm", trControl = train_control_RCV)
print(lm_RCV)
## Linear Regression
##
## 1460 samples
## 111 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 34963.23 0.8125509 20345
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
After creating the model, let's see which variables are the most important:
As shown in the graph, the variable that influences SalePrice the most is the OverallQual rating, with almost 10% of the total importance, followed by KitchenQualGd with an importance value just above 7.5%.
lm_importance_value <- varImp(lm_RCV, scale = FALSE)
lm_importance_value <- as.data.frame(lm_importance_value[["importance"]])
lm_importance_value <- cbind(Variable = rownames(lm_importance_value), lm_importance_value)
lm_importance_value <- lm_importance_value %>%
  arrange(desc(Overall))
ggplot(lm_importance_value[1:15,], aes(reorder(Variable, Overall), Overall, fill = Overall)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "Linear Regression - Variables Importance", y = "Importance", x = "") +
  my_theme
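As far as I know, caret's importance score for linear models is based on the absolute value of each coefficient's t-statistic. Under that assumption, the ranking can be sketched in base R on invented data (`d`, `x1`, `x2` are hypothetical names):

```r
# Made-up data where x1 matters much more than x2
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- 1 + 3 * d$x1 + 0.2 * d$x2 + rnorm(100)

fit_imp <- lm(y ~ x1 + x2, data = d)

# Absolute t-statistics of the predictors (intercept row dropped)
t_abs <- abs(summary(fit_imp)$coefficients[-1, "t value"])
sort(t_abs, decreasing = TRUE)
```

A predictor with a large coefficient relative to its standard error ends up at the top of the ranking.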
Then let's see how the linear model predicts the train data.
lm_pred <- predict(lm_RCV, train_data)
tibble(
  pred = lm_pred,
  actual = train_target
) %>%
  ggplot(aes(pred, actual)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "Linear Regression - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme
Neural networks are a set of algorithms, loosely modeled after the human brain, designed to recognize patterns.
The patterns they recognize are numerical, contained in vectors, into which all real-world data must be translated.
A stacked neural network is a network composed of several layers.
The layers are made of nodes.
A node combines input data with a set of coefficients, or weights, that either amplify or dampen that input, thereby assigning significance to inputs for the task that the algorithm is trying to learn.
The output of each layer is in turn the input of the next layer, starting from the initial input layer that receives the data.
A neural network consists of an input layer, one or more hidden layers and an output layer.
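The way a node weights and combines its inputs can be sketched as a forward pass through one hidden layer. Everything here is an assumption made for illustration: the sigmoid activation, the linear output node and the hand-picked weights `W1`, `b1`, `W2`, `b2` are not taken from the fitted model.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))

# Forward pass: inputs -> hidden layer (3 nodes) -> single output node
forward <- function(x, W1, b1, W2, b2) {
  hidden <- sigmoid(W1 %*% x + b1)  # each hidden node: weighted sum of inputs, then activation
  as.numeric(W2 %*% hidden + b2)    # output node: weighted sum of the hidden outputs
}

x  <- c(0.2, 0.7)                                   # two scaled input values
W1 <- matrix(c(0.5, -0.3,
               0.8,  0.1,
              -0.6,  0.4), nrow = 3, byrow = TRUE)  # one row of weights per hidden node
b1 <- c(0, 0.1, -0.1)
W2 <- matrix(c(1, -1, 0.5), nrow = 1)
b2 <- 0.05

forward(x, W1, b1, W2, b2)
```

Training consists of adjusting the weights so that the output of this forward pass gets closer to the observed target.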
In order to scale all the variables, we use the maximum and minimum values of each variable.
train_maxs <- apply(train, 2, max)
train_mins <- apply(train, 2, min)
train_scaled <- as.data.frame(scale(train,
                                    center = train_mins,
                                    scale = train_maxs - train_mins))
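The effect of this min-max scaling, and how a scaled value is mapped back to the original units, can be checked on a toy data frame. The `toy` data below is invented for the example.

```r
# Invented three-row data set
toy <- data.frame(SalePrice = c(100000, 250000, 400000),
                  LotArea   = c(5000, 8000, 11000))

toy_maxs <- apply(toy, 2, max)
toy_mins <- apply(toy, 2, min)

# (x - min) / (max - min) maps every column onto the interval [0, 1]
toy_scaled <- as.data.frame(scale(toy,
                                  center = toy_mins,
                                  scale = toy_maxs - toy_mins))
toy_scaled$SalePrice  # 0.0 0.5 1.0

# Inverse transformation: x_scaled * (max - min) + min
restored <- toy_scaled$SalePrice *
  (toy_maxs["SalePrice"] - toy_mins["SalePrice"]) + toy_mins["SalePrice"]
```

The same inverse transformation is what brings scaled predictions back to prices.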
We are looking for the optimal parameters for the model.
As I can see from the output, the final values used for the model were:
size = 3. Size is the number of units in the hidden layer, whose weights are optimised during training in order to improve the predictive power of the model.
decay = 0.1. Weight decay is a regularization technique that adds a small penalty, usually the squared L2 norm of the weights (all the weights of the model), to the loss function.
\[loss_{penalized} = loss + decay \times \lVert w \rVert_2^2\]
Weight decay is used to prevent overfitting, to keep the weights small and to avoid exploding gradients.
Because the penalty on the weights is added to the loss, each training iteration will try to keep the model weights small in addition to minimizing the loss.
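A minimal sketch of the penalized loss, assuming a squared-error loss and a penalty equal to the decay parameter times the sum of squared weights. The function `penalized_loss` and the numbers are hypothetical and only show the arithmetic.

```r
# Squared-error loss plus weight decay penalty (assumed form: decay * sum(w^2))
penalized_loss <- function(residuals, weights, decay) {
  sum(residuals^2) + decay * sum(weights^2)
}

res <- c(0.1, -0.2, 0.05)  # made-up residuals, data loss = 0.0525

penalized_loss(res, weights = c(0.5, -0.5), decay = 0.1)  # 0.1025
penalized_loss(res, weights = c(5, -5),     decay = 0.1)  # 5.0525
```

With identical residuals, the model with larger weights is penalized much more heavily, which is what pushes the optimizer towards small weights.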
set.seed(123)
nn_RCV <- train(formula, data = train_scaled, method = "nnet", trControl = train_control_RCV)
print(nn_RCV)
## Neural Network
##
## 1460 samples
## 111 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ...
## Resampling results across tuning parameters:
##
## size decay RMSE Rsquared MAE
## 1 0e+00 0.15703415 0.8322300 0.13199676
## 1 1e-04 0.14545102 0.4543285 0.12132953
## 1 1e-01 0.04541866 0.8357140 0.02824910
## 3 0e+00 0.14953582 0.8500791 0.12606823
## 3 1e-04 0.12687862 0.4917349 0.10374026
## 3 1e-01 0.04522695 0.8351553 0.02787022
## 5 0e+00 0.15895984 0.7941364 0.13222516
## 5 1e-04 0.09061938 0.6516267 0.06804890
## 5 1e-01 0.04536545 0.8341218 0.02777200
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were size = 3 and decay = 0.1.
After creating the model, let's see which variables are the most important:
As shown in the graph, the variable that influences SalePrice the most is still the OverallQual rating, but this time with less than 5% of the total importance, followed by 2ndflrSF (second floor square feet) with an importance value of almost 3%.
nn_importance_value <- varImp(nn_RCV, scale = FALSE)
nn_importance_value <- as.data.frame(nn_importance_value[["importance"]])
nn_importance_value <- cbind(Variable = rownames(nn_importance_value), nn_importance_value)
nn_importance_value <- nn_importance_value %>%
  arrange(desc(Overall))
ggplot(nn_importance_value[1:15,], aes(reorder(Variable, Overall), Overall, fill = Overall)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "Neural Net - Variables Importance", y = "Importance", x = "") +
  my_theme
Then let's see how the neural network model predicts the train data. The predictions are on the scaled range, so they are converted back to prices first.
nn_pred <- predict(nn_RCV, train_scaled[,-115])
nn_pred_unscaled <- nn_pred * (train_maxs["SalePrice"] - train_mins["SalePrice"]) + train_mins["SalePrice"]
tibble(
  pred = nn_pred_unscaled,
  actual = train_target
) %>%
  ggplot(aes(pred, actual)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "Neural Net - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme
In order to compare the RMSE and MAE with those of the other models, I have to calculate them on the unscaled data.
nn_RMSE <- rmse(train_target, nn_pred_unscaled)
print(paste0("RMSE is: ", round(nn_RMSE)))
## [1] "RMSE is: 29401"
nn_MAE <- mae(train_target, nn_pred_unscaled)
print(paste0("MAE is: ", round(nn_MAE)))
## [1] "MAE is: 18276"
Gradient boosting is a machine learning technique for regression and classification problems, which produces a prediction model in the form of an ensemble of weak prediction models, typically decision trees.
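The idea of an ensemble of weak learners can be sketched with one-split trees (stumps) fitted one after another to the current residuals. This is a simplified illustration with squared-error loss on invented data, not the algorithm as implemented in the gbm package; `fit_stump`, `predict_stump` and `boost` are hypothetical helpers.

```r
# Fit a one-split regression stump on a single numeric predictor
fit_stump <- function(x, r) {
  xs <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # candidate split points (midpoints)
  best <- NULL
  for (cut in cuts) {
    left <- x <= cut
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse <- sum((r - pred)^2)
    if (is.null(best) || sse < best$sse) {
      best <- list(cut = cut, left = mean(r[left]), right = mean(r[!left]), sse = sse)
    }
  }
  best
}

predict_stump <- function(stump, x) ifelse(x <= stump$cut, stump$left, stump$right)

# Boosting: start from the mean, then repeatedly fit a stump to the residuals
boost <- function(x, y, n_trees = 150, shrinkage = 0.1) {
  pred <- rep(mean(y), length(y))
  for (m in seq_len(n_trees)) {
    residual <- y - pred                    # negative gradient of the squared-error loss
    stump <- fit_stump(x, residual)
    pred <- pred + shrinkage * predict_stump(stump, x)  # small step towards the residuals
  }
  pred
}

set.seed(123)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

fitted_boost  <- boost(x, y)
rmse_boost    <- sqrt(mean((y - fitted_boost)^2))  # training RMSE after boosting
rmse_baseline <- sqrt(mean((y - mean(y))^2))       # RMSE of predicting the mean
```

Each stump on its own is a poor model, but the shrunken sum of many of them follows the curve far better than the constant baseline.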
We are looking for the optimal parameters for the model, so I create a grid that will be applied to this GBM model.
As we can see from the output, the final values used for the model were:
n.trees = 150. The total number of trees to fit. This is equivalent to the number of iterations and the number of basis functions in the additive expansion.
interaction.depth = 6. The maximum depth of each tree (i.e., the highest level of variable interactions allowed). A value of 1 implies an additive model, a value of 2 implies a model with up to 2-way interactions, etc.
shrinkage = 0.1. A shrinkage parameter applied to each tree in the expansion, also known as the learning rate or step-size reduction. It was held constant in the grid.
n.minobsinnode = 15. The minimum number of observations in the terminal nodes of the trees.
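To see how large the search is, the grid can be built on its own and counted. The name `grid_demo` is used here only to keep the illustration separate from the grid passed to the model.

```r
# Same candidate values as the tuning grid, built only to count combinations
grid_demo <- expand.grid(interaction.depth = c(3, 6),
                         n.trees = c(150, 300),
                         shrinkage = 0.1,
                         n.minobsinnode = c(10, 15))

nrow(grid_demo)           # 8 parameter combinations (2 x 2 x 1 x 2)
nrow(grid_demo) * 10 * 3  # 240 resampled evaluations with 10 folds and 3 repeats
```

The number of models actually fitted may be lower than 240 if the tuning routine reuses a larger ensemble to score a smaller number of trees.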
grid <- expand.grid(interaction.depth = c(3, 6),
                    n.trees = c(150, 300),
                    shrinkage = c(0.1),
                    n.minobsinnode = c(10, 15))
set.seed(123)
gbm_RCV <- train(formula, data = train, distribution = "gaussian", method = "gbm",
                 trControl = train_control_RCV, tuneGrid = grid)
print(gbm_RCV)
## Stochastic Gradient Boosting
##
## 1460 samples
## 111 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 1312, 1313, 1315, 1316, 1314, 1315, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.minobsinnode n.trees RMSE Rsquared MAE
## 3 10 150 30703.16 0.8520090 18083.20
## 3 10 300 30242.15 0.8559285 17790.39
## 3 15 150 30101.27 0.8571943 18052.16
## 3 15 300 29814.07 0.8596072 17780.00
## 6 10 150 30767.00 0.8510928 17545.10
## 6 10 300 30686.39 0.8512630 17520.70
## 6 15 150 29571.74 0.8621849 17444.02
## 6 15 300 29721.45 0.8609000 17576.79
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 6, shrinkage = 0.1 and n.minobsinnode = 15.
After creating the model, let's see which variables are the most important:
For this model too, as shown in the graph, the variable that influences SalePrice the most is the OverallQual rating. This time its importance is much larger than that of the other variables, with more than 45% of the total importance, followed by TotalBsmtSF (total square feet of basement area) with an importance of about 12%.
gbm_importance_value <- as.data.frame(summary(gbm_RCV))
gbm_importance_value <- gbm_importance_value %>%
  arrange(desc(rel.inf))
ggplot(gbm_importance_value[1:15,], aes(reorder(var, rel.inf), rel.inf, fill = rel.inf)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  coord_flip() +
  scale_fill_gradient(low = "grey75", high = "black") +
  labs(title = "GBM - Variables Importance", y = "Importance", x = "") +
  my_theme
Then let's see how the gradient boosting machine model predicts the train data.
gbm_pred <- predict(gbm_RCV, train_data)
tibble(
  pred = gbm_pred,
  actual = train_target
) %>%
  ggplot(aes(pred, actual)) +
  geom_point(color = "black") +
  geom_smooth(method = "lm", colour = "darkred", alpha = 0.1, size = 1.2) +
  labs(title = "GBM - Actual vs Predicted", x = "Predicted", y = "Actual") +
  my_theme
In conclusion, I will compare the different prediction models that I have fitted and use the best one to predict the SalePrice of the test data.
To compare the models' performance, I will evaluate them by the R squared, the root mean squared error and the mean absolute error.
As the table shows, gradient boosting performs best in terms of R squared and MAE, and its RMSE is very close to that of the neural network. The neural network's RMSE and MAE were computed on its predictions for the train data, while the other values come from repeated cross-validation, so they are likely to be optimistic.
Model | R squared | RMSE | MAE
---|---|---|---
Multiple linear regression | 0.8125 | 34963 | 20345
Neural network | 0.8351 | 29401 | 18276
Gradient boosting machine | 0.8622 | 29572 | 17444
As shown above, the best model is the gradient boosting machine, so I will use it to predict the house prices.
price_prediction <- predict(gbm_RCV, test)
analysis_result <- cbind(test_id, price_prediction)
head(analysis_result, 15)
## test_id price_prediction
## [1,] 1461 124793.5
## [2,] 1462 163828.6
## [3,] 1463 168015.1
## [4,] 1464 186131.6
## [5,] 1465 186922.5
## [6,] 1466 190698.5
## [7,] 1467 178203.0
## [8,] 1468 161966.9
## [9,] 1469 177181.7
## [10,] 1470 119881.7
## [11,] 1471 200224.3
## [12,] 1472 105117.0
## [13,] 1473 102322.0
## [14,] 1474 152281.4
## [15,] 1475 135077.6