The purpose of this project is to take a raw sample of data on housing prices in Illinois and build a model that predicts property prices from comparable characteristics.
Given the current economic climate, to say the housing market is in flux is a significant understatement. Additionally, now more than ever, so many data points are collected on property sales that it is difficult to understand what is truly important and can affect a property’s price. On Zillow or Realtor.com, every listing has so many features, details, and extras that it can be difficult to discern what is important to the average home buyer and what is an outlier.
To begin to address this challenge, we downloaded two sets of data concerning the Illinois housing market. (Disclaimer: this model will only be effective in this particular market, as geographic regions have different features that can affect the housing market; for example, oceanfront property in California has the high-value feature of being oceanfront, which a home in Illinois will not.) Our training data contained 1,460 rows and 81 variables. The test data contained 1,459 rows with the same variables, with the exception that sale price was NOT included. The included variables ranged from general information like the type of dwelling, lot shape, and access to a main road to the very detailed, like basement finish, electrical system, and fireplace quality. The source for our data can be found here.
For this project, we will use visualizations and plots to look for correlations at a surface level, followed by linear regression with log transformation, bagging, random forest, and XGBoost to minimize the RMSE and build the most accurate model.
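As a minimal sketch of the evaluation idea mentioned above, the example below fits a linear regression to the log of the sale price and computes the RMSE on the log scale and on the dollar scale. The toy values and the `rmse` helper are assumptions made for illustration; they are not taken from the project data or code.

```r
# Toy data; the column names mirror the housing data but the values are made up.
toy <- data.frame(
  GrLivArea = c(900, 1200, 1500, 1800, 2100, 2400),
  SalePrice = c(110000, 140000, 175000, 205000, 250000, 290000)
)

# Hypothetical helper: root mean squared error between two numeric vectors.
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Linear regression on the log of the sale price.
fit <- lm(log(SalePrice) ~ GrLivArea, data = toy)

# RMSE on the log scale, and on the dollar scale after back-transforming.
rmse_log <- rmse(log(toy$SalePrice), fitted(fit))
rmse_dollars <- rmse(toy$SalePrice, exp(fitted(fit)))

print(rmse_log)
print(rmse_dollars)
```

Working on the log scale penalizes relative errors, so a miss of a given percentage counts about the same for an inexpensive home as for an expensive one.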
As previously touched on, this data totals approximately 2,900 rows with 80 columns' worth of variables, or roughly 235,000 individual data points. As we inspected the data, the first challenge we noticed was that not all columns contained values that could be used directly in a regression. Additionally, specifically around alleys, there were missing values that had to be cleaned up as well, which is documented further on. The good news with this particular data set is that the data was collected without error and in a uniform fashion, meaning it did not contain formatting or data entry errors that needed to be discovered and cleaned up.
To begin the Data Preparation phase, we must first load the data.
library(readr)
housing_train <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Project/Project Presentation/train.csv")
housing_test <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Project/Project Presentation/test.csv")
colnames(housing_train)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "X1stFlrSF"
## [45] "X2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "X3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
head(housing_train)
## Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour
## 1 1 60 RL 65 8450 Pave <NA> Reg Lvl
## 2 2 20 RL 80 9600 Pave <NA> Reg Lvl
## 3 3 60 RL 68 11250 Pave <NA> IR1 Lvl
## 4 4 70 RL 60 9550 Pave <NA> IR1 Lvl
## 5 5 60 RL 84 14260 Pave <NA> IR1 Lvl
## 6 6 50 RL 85 14115 Pave <NA> IR1 Lvl
## Utilities LotConfig LandSlope Neighborhood Condition1 Condition2 BldgType
## 1 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 2 AllPub FR2 Gtl Veenker Feedr Norm 1Fam
## 3 AllPub Inside Gtl CollgCr Norm Norm 1Fam
## 4 AllPub Corner Gtl Crawfor Norm Norm 1Fam
## 5 AllPub FR2 Gtl NoRidge Norm Norm 1Fam
## 6 AllPub Inside Gtl Mitchel Norm Norm 1Fam
## HouseStyle OverallQual OverallCond YearBuilt YearRemodAdd RoofStyle RoofMatl
## 1 2Story 7 5 2003 2003 Gable CompShg
## 2 1Story 6 8 1976 1976 Gable CompShg
## 3 2Story 7 5 2001 2002 Gable CompShg
## 4 2Story 7 5 1915 1970 Gable CompShg
## 5 2Story 8 5 2000 2000 Gable CompShg
## 6 1.5Fin 5 5 1993 1995 Gable CompShg
## Exterior1st Exterior2nd MasVnrType MasVnrArea ExterQual ExterCond Foundation
## 1 VinylSd VinylSd BrkFace 196 Gd TA PConc
## 2 MetalSd MetalSd None 0 TA TA CBlock
## 3 VinylSd VinylSd BrkFace 162 Gd TA PConc
## 4 Wd Sdng Wd Shng None 0 TA TA BrkTil
## 5 VinylSd VinylSd BrkFace 350 Gd TA PConc
## 6 VinylSd VinylSd None 0 TA TA Wood
## BsmtQual BsmtCond BsmtExposure BsmtFinType1 BsmtFinSF1 BsmtFinType2
## 1 Gd TA No GLQ 706 Unf
## 2 Gd TA Gd ALQ 978 Unf
## 3 Gd TA Mn GLQ 486 Unf
## 4 TA Gd No ALQ 216 Unf
## 5 Gd TA Av GLQ 655 Unf
## 6 Gd TA No GLQ 732 Unf
## BsmtFinSF2 BsmtUnfSF TotalBsmtSF Heating HeatingQC CentralAir Electrical
## 1 0 150 856 GasA Ex Y SBrkr
## 2 0 284 1262 GasA Ex Y SBrkr
## 3 0 434 920 GasA Ex Y SBrkr
## 4 0 540 756 GasA Gd Y SBrkr
## 5 0 490 1145 GasA Ex Y SBrkr
## 6 0 64 796 GasA Ex Y SBrkr
## X1stFlrSF X2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath
## 1 856 854 0 1710 1 0 2
## 2 1262 0 0 1262 0 1 2
## 3 920 866 0 1786 1 0 2
## 4 961 756 0 1717 1 0 1
## 5 1145 1053 0 2198 1 0 2
## 6 796 566 0 1362 1 0 1
## HalfBath BedroomAbvGr KitchenAbvGr KitchenQual TotRmsAbvGrd Functional
## 1 1 3 1 Gd 8 Typ
## 2 0 3 1 TA 6 Typ
## 3 1 3 1 Gd 6 Typ
## 4 0 3 1 Gd 7 Typ
## 5 1 4 1 Gd 9 Typ
## 6 1 1 1 TA 5 Typ
## Fireplaces FireplaceQu GarageType GarageYrBlt GarageFinish GarageCars
## 1 0 <NA> Attchd 2003 RFn 2
## 2 1 TA Attchd 1976 RFn 2
## 3 1 TA Attchd 2001 RFn 2
## 4 1 Gd Detchd 1998 Unf 3
## 5 1 TA Attchd 2000 RFn 3
## 6 0 <NA> Attchd 1993 Unf 2
## GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF
## 1 548 TA TA Y 0 61
## 2 460 TA TA Y 298 0
## 3 608 TA TA Y 0 42
## 4 642 TA TA Y 0 35
## 5 836 TA TA Y 192 84
## 6 480 TA TA Y 40 30
## EnclosedPorch X3SsnPorch ScreenPorch PoolArea PoolQC Fence MiscFeature
## 1 0 0 0 0 <NA> <NA> <NA>
## 2 0 0 0 0 <NA> <NA> <NA>
## 3 0 0 0 0 <NA> <NA> <NA>
## 4 272 0 0 0 <NA> <NA> <NA>
## 5 0 0 0 0 <NA> <NA> <NA>
## 6 0 320 0 0 <NA> MnPrv Shed
## MiscVal MoSold YrSold SaleType SaleCondition SalePrice
## 1 0 2 2008 WD Normal 208500
## 2 0 5 2007 WD Normal 181500
## 3 0 9 2008 WD Normal 223500
## 4 0 2 2006 WD Abnorml 140000
## 5 0 12 2008 WD Normal 250000
## 6 700 10 2009 WD Normal 143000
cbind(c("train", "test"),
rbind(dim(housing_train), dim(housing_test)))
## [,1] [,2] [,3]
## [1,] "train" "1460" "81"
## [2,] "test" "1459" "80"
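The inspection described earlier (columns that cannot go straight into a regression, and columns with missing values) can be sketched as follows. The `toy` data frame is a made-up stand-in for the housing data, so the counts it prints say nothing about the real files.

```r
# Toy stand-in for the housing data; values are made up for illustration.
toy <- data.frame(
  LotArea = c(8450, 9600, NA, 9550),
  Street  = c("Pave", "Pave", "Grvl", "Pave"),
  Alley   = c(NA, NA, "Grvl", NA),
  stringsAsFactors = FALSE
)

# Columns that are not numeric and would need encoding before a regression.
non_numeric <- names(toy)[!sapply(toy, is.numeric)]

# Number of missing values per column, largest first.
na_counts <- sort(colSums(is.na(toy)), decreasing = TRUE)

print(non_numeric)
print(na_counts)
```

The same two calls could be pointed at `housing_train` and `housing_test` to see which columns need attention.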
The two primary requirements for the Data Preparation phase were:
Below is the list of all of the variables and a brief explanation as to what they mean.
MSSubClass: Identifies the type of dwelling involved in the sale.
20 1-STORY 1946 & NEWER ALL STYLES
30 1-STORY 1945 & OLDER
40 1-STORY W/FINISHED ATTIC ALL AGES
45 1-1/2 STORY - UNFINISHED ALL AGES
50 1-1/2 STORY FINISHED ALL AGES
60 2-STORY 1946 & NEWER
70 2-STORY 1945 & OLDER
75 2-1/2 STORY ALL AGES
80 SPLIT OR MULTI-LEVEL
85 SPLIT FOYER
90 DUPLEX - ALL STYLES AND AGES
120 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
150 1-1/2 STORY PUD - ALL AGES
160 2-STORY PUD - 1946 & NEWER
180 PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
190 2 FAMILY CONVERSION - ALL STYLES AND AGES
MSZoning: Identifies the general zoning classification of the sale.
A Agriculture
1 C Commercial
2 FV Floating Village Residential
I Industrial
3 RH Residential High Density
4 RL Residential Low Density
RP Residential Low Density Park
5 RM Residential Medium Density
LotFrontage: Linear feet of street connected to property
LotArea: Lot size in square feet
Street: Type of road access to property
2 Grvl Gravel
1 Pave Paved
Alley: Type of alley access to property
2 Grvl Gravel
1 Pave Paved
0 NA No alley access
LotShape: General shape of property
4 Reg Regular
1 IR1 Slightly irregular
2 IR2 Moderately Irregular
3 IR3 Irregular
LandContour: Flatness of the property
4 Lvl Near Flat/Level
1 Bnk Banked - Quick and significant rise from street grade to building
2 HLS Hillside - Significant slope from side to side
3 Low Depression
Utilities: Type of utilities available
1 AllPub All public Utilities (E,G,W,& S)
NoSewr Electricity, Gas, and Water (Septic Tank)
2 NoSeWa Electricity and Gas Only
ELO Electricity only
LotConfig: Lot configuration
5 Inside Inside lot
1 Corner Corner lot
2 CulDSac Cul-de-sac
3 FR2 Frontage on 2 sides of property
4 FR3 Frontage on 3 sides of property
LandSlope: Slope of property
1 Gtl Gentle slope
2 Mod Moderate Slope
3 Sev Severe Slope
Neighborhood: Physical locations within Ames city limits
1 Blmngtn Bloomington Heights
2 Blueste Bluestem
3 BrDale Briardale
4 BrkSide Brookside
5 ClearCr Clear Creek
6 CollgCr College Creek
7 Crawfor Crawford
8 Edwards Edwards
9 Gilbert Gilbert
10 IDOTRR Iowa DOT and Rail Road
11 MeadowV Meadow Village
12 Mitchel Mitchell
13 NAmes North Ames
14 NoRidge Northridge
15 NPkVill Northpark Villa
16 NridgHt Northridge Heights
17 NWAmes Northwest Ames
18 OldTown Old Town
19 SWISU South & West of Iowa State University
20 Sawyer Sawyer
21 SawyerW Sawyer West
22 Somerst Somerset
23 StoneBr Stone Brook
24 Timber Timberland
25 Veenker Veenker
Condition1: Proximity to various conditions
1 Artery Adjacent to arterial street
2 Feedr Adjacent to feeder street
3 Norm Normal
4 RRNn Within 200' of North-South Railroad
5 RRAn Adjacent to North-South Railroad
6 PosN Near positive off-site feature--park, greenbelt, etc.
7 PosA Adjacent to positive off-site feature
8 RRNe Within 200' of East-West Railroad
9 RRAe Adjacent to East-West Railroad
Condition2: Proximity to various conditions (if more than one is present) – Same as above
1 Artery Adjacent to arterial street
2 Feedr Adjacent to feeder street
3 Norm Normal
4 RRNn Within 200' of North-South Railroad
5 RRAn Adjacent to North-South Railroad
6 PosN Near positive off-site feature--park, greenbelt, etc.
7 PosA Adjacent to positive off-site feature
8 RRNe Within 200' of East-West Railroad
9 RRAe Adjacent to East-West Railroad
BldgType: Type of dwelling
1 1Fam Single-family Detached
2 2FmCon Two-family Conversion; originally built as one-family dwelling
3 Duplx Duplex
4 TwnhsE Townhouse End Unit
5 TwnhsI Townhouse Inside Unit
HouseStyle: Style of dwelling
1 1Story One story
2 1.5Fin One and one-half story: 2nd level finished
3 1.5Unf One and one-half story: 2nd level unfinished
4 2Story Two story
5 2.5Fin Two and one-half story: 2nd level finished
6 2.5Unf Two and one-half story: 2nd level unfinished
7 SFoyer Split Foyer
8 SLvl Split Level
OverallQual: Rates the overall material and finish of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
OverallCond: Rates the overall condition of the house
10 Very Excellent
9 Excellent
8 Very Good
7 Good
6 Above Average
5 Average
4 Below Average
3 Fair
2 Poor
1 Very Poor
YearBuilt: Original construction date
YearRemodAdd: Remodel date (same as construction date if no remodeling or additions)
RoofStyle: Type of roof
1 Flat Flat
2 Gable Gable
3 Gambrel Gambrel (Barn)
4 Hip Hip
5 Mansard Mansard
6 Shed Shed
RoofMatl: Roof material
1 ClyTile Clay or Tile
2 CompShg Standard (Composite) Shingle
3 Membran Membrane
4 Metal Metal
5 Roll Roll
6 Tar&Grv Gravel & Tar
7 WdShake Wood Shakes
8 WdShngl Wood Shingles
Exterior1st: Exterior covering on house
1 AsbShng Asbestos Shingles
2 AsphShn Asphalt Shingles
3 BrkComm Brick Common
4 BrkFace Brick Face
5 CBlock Cinder Block
6 CemntBd Cement Board
7 HdBoard Hard Board
8 ImStucc Imitation Stucco
9 MetalSd Metal Siding
10 Other Other
11 Plywood Plywood
12 PreCast PreCast
13 Stone Stone
14 Stucco Stucco
15 VinylSd Vinyl Siding
16 Wd Sdng Wood Siding
17 WdShing Wood Shingles
Exterior2nd: Exterior covering on house (if more than one material)
1 AsbShng Asbestos Shingles
2 AsphShn Asphalt Shingles
3 BrkComm Brick Common
4 BrkFace Brick Face
5 CBlock Cinder Block
6 CemntBd Cement Board
7 HdBoard Hard Board
8 ImStucc Imitation Stucco
9 MetalSd Metal Siding
10 Other Other
11 Plywood Plywood
12 PreCast PreCast
13 Stone Stone
14 Stucco Stucco
15 VinylSd Vinyl Siding
16 Wd Sdng Wood Siding
17 WdShing Wood Shingles
MasVnrType: Masonry veneer type
1 BrkCmn Brick Common
2 BrkFace Brick Face
3 CBlock Cinder Block
0 None None
4 Stone Stone
MasVnrArea: Masonry veneer area in square feet
ExterQual: Evaluates the quality of the material on the exterior
1 Ex Excellent
2 Gd Good
3 TA Average/Typical
4 Fa Fair
5 Po Poor
ExterCond: Evaluates the present condition of the material on the exterior
1 Ex Excellent
2 Gd Good
3 TA Average/Typical
4 Fa Fair
5 Po Poor
Foundation: Type of foundation
1 BrkTil Brick & Tile
2 CBlock Cinder Block
3 PConc Poured Concrete
4 Slab Slab
5 Stone Stone
6 Wood Wood
BsmtQual: Evaluates the height of the basement
1 Ex Excellent (100+ inches)
2 Gd Good (90-99 inches)
3 TA Typical (80-89 inches)
4 Fa Fair (70-79 inches)
5 Po Poor (<70 inches)
0 NA No Basement
BsmtCond: Evaluates the general condition of the basement
1 Ex Excellent
2 Gd Good
3 TA Typical - slight dampness allowed
4 Fa Fair - dampness or some cracking or settling
5 Po Poor - Severe cracking, settling, or wetness
0 NA No Basement
BsmtExposure: Refers to walkout or garden level walls
1 Gd Good Exposure
2 Av Average Exposure (split levels or foyers typically score average or above)
3 Mn Minimum Exposure
4 No No Exposure
0 NA No Basement
BsmtFinType1: Rating of basement finished area
3 GLQ Good Living Quarters
1 ALQ Average Living Quarters
2 BLQ Below Average Living Quarters
5 Rec Average Rec Room
4 LwQ Low Quality
6 Unf Unfinished
0 NA No Basement
BsmtFinSF1: Type 1 finished square feet
BsmtFinType2: Rating of basement finished area (if multiple types)
3 GLQ Good Living Quarters
1 ALQ Average Living Quarters
2 BLQ Below Average Living Quarters
5 Rec Average Rec Room
4 LwQ Low Quality
6 Unf Unfinished
0 NA No Basement
BsmtFinSF2: Type 2 finished square feet
BsmtUnfSF: Unfinished square feet of basement area
TotalBsmtSF: Total square feet of basement area
Heating: Type of heating
1 Floor Floor Furnace
2 GasA Gas forced warm air furnace
3 GasW Gas hot water or steam heat
4 Grav Gravity furnace
5 OthW Hot water or steam heat other than gas
6 Wall Wall furnace
HeatingQC: Heating quality and condition
1 Ex Excellent
2 Gd Good
3 TA Average/Typical
4 Fa Fair
5 Po Poor
CentralAir: Central air conditioning
0 N No
1 Y Yes
Electrical: Electrical system
1 SBrkr Standard Circuit Breakers & Romex
2 FuseA Fuse Box over 60 AMP and all Romex wiring (Average)
3 FuseF 60 AMP Fuse Box and mostly Romex wiring (Fair)
4 FuseP 60 AMP Fuse Box and mostly knob & tube wiring (poor)
5 Mix Mixed
1stFlrSF: First Floor square feet
2ndFlrSF: Second floor square feet
LowQualFinSF: Low quality finished square feet (all floors)
GrLivArea: Above grade (ground) living area square feet
BsmtFullBath: Basement full bathrooms
BsmtHalfBath: Basement half bathrooms
FullBath: Full bathrooms above grade
HalfBath: Half baths above grade
BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms)
KitchenAbvGr: Kitchens above grade
KitchenQual: Kitchen quality
1 Ex Excellent
2 Gd Good
3 TA Typical/Average
4 Fa Fair
5 Po Poor
TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
Functional: Home functionality (Assume typical unless deductions are warranted)
8 Typ Typical Functionality
7 Min1 Minor Deductions 1
6 Min2 Minor Deductions 2
5 Mod Moderate Deductions
4 Maj1 Major Deductions 1
3 Maj2 Major Deductions 2
2 Sev Severely Damaged
1 Sal Salvage only
Fireplaces: Number of fireplaces
FireplaceQu: Fireplace quality
5 Ex Excellent - Exceptional Masonry Fireplace
4 Gd Good - Masonry Fireplace in main level
3 TA Average - Prefabricated Fireplace in main living area or Masonry Fireplace in basement
2 Fa Fair - Prefabricated Fireplace in basement
1 Po Poor - Ben Franklin Stove
0 NA No Fireplace
GarageType: Garage location
6 2Types More than one type of garage
5 Attchd Attached to home
4 Basment Basement Garage
3 BuiltIn Built-In (Garage part of house - typically has room above garage)
2 CarPort Car Port
1 Detchd Detached from home
NA No Garage
GarageYrBlt: Year garage was built
GarageFinish: Interior finish of the garage
3 Fin Finished
2 RFn Rough Finished
1 Unf Unfinished
0 NA No Garage
GarageCars: Size of garage in car capacity
GarageArea: Size of garage in square feet
GarageQual: Garage quality
5 Ex Excellent
4 Gd Good
3 TA Typical/Average
2 Fa Fair
1 Po Poor
0 NA No Garage
GarageCond: Garage condition
5 Ex Excellent
4 Gd Good
3 TA Typical/Average
2 Fa Fair
1 Po Poor
0 NA No Garage
PavedDrive: Paved driveway
3 Y Paved
2 P Partial Pavement
1 N Dirt/Gravel
WoodDeckSF: Wood deck area in square feet
OpenPorchSF: Open porch area in square feet
EnclosedPorch: Enclosed porch area in square feet
3SsnPorch: Three season porch area in square feet
ScreenPorch: Screen porch area in square feet
PoolArea: Pool area in square feet
PoolQC: Pool quality
5 Ex Excellent
4 Gd Good
3 TA Typical/Average
2 Fa Fair
1 Po Poor
0 NA No Pool
Fence: Fence quality
4 GdPrv Good Privacy
3 MnPrv Minimum Privacy
2 GdWo Good Wood
1 MnWw Minimum Wood/Wire
0 NA No Fence
MiscFeature: Miscellaneous feature not covered in other categories
Elev Elevator
4 Gar2 2nd Garage (if not described in garage section)
3 Othr Other
2 Shed Shed (over 100 SF)
1 TenC Tennis Court
0 NA None
MiscVal: $Value of miscellaneous feature
MoSold: Month Sold (MM)
YrSold: Year Sold (YYYY)
SaleType: Type of sale
9 WD Warranty Deed - Conventional
8 CWD Warranty Deed - Cash
VWD Warranty Deed - VA Loan
7 New Home just constructed and sold
6 COD Court Officer Deed/Estate
5 Con Contract 15% Down payment regular terms
4 ConLw Contract Low Down payment and low interest
3 ConLI Contract Low Interest
2 ConLD Contract Low Down
1 Oth Other
SaleCondition: Condition of sale
6 Normal Normal Sale
5 Abnorml Abnormal Sale - trade, foreclosure, short sale
4 AdjLand Adjoining Land Purchase
3 Alloca Allocation - two linked properties with separate deeds, typically condo with a garage unit
2 Family Sale between family members
1 Partial Home was not completed when last assessed (associated with New Homes)
housing_train$MSZoning = as.factor(housing_train$MSZoning)
levels(housing_train$MSZoning)
## [1] "C (all)" "FV" "RH" "RL" "RM"
# MSZoning column of train dataset has following levels: "C (all)", "FV", "RH", "RL", "RM"
housing_test$MSZoning = as.factor(housing_test$MSZoning)
levels(housing_test$MSZoning)
## [1] "C (all)" "FV" "RH" "RL" "RM"
# Change of factors to numeric in train dataset: "C (all)"=1, "FV"=2, "RH"=3, "RL"=4, "RM"=5
MSZoning <- as.numeric(factor(housing_train$MSZoning, levels = c("C (all)", "FV", "RH", "RL", "RM")))
housing_train$MSZoning <- MSZoning
# Change of factors to numeric in test dataset (same codes as train)
MSZoning <- as.numeric(factor(housing_test$MSZoning, levels = c("C (all)", "FV", "RH", "RL", "RM")))
housing_test$MSZoning <- MSZoning
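Converting a factor to a number returns its internal level codes, and those codes follow the order of the levels, which is sorted by default. The sketch below, on a made-up vector of zoning labels, contrasts the default codes with codes taken from an explicitly ordered level vector. The variable names are assumptions for illustration only.

```r
# Toy zoning labels; the values are made up for illustration.
zoning <- c("RL", "RM", "C (all)", "RL", "FV")

# Default behaviour: levels are sorted, and each code is a position in that order.
default_codes <- as.integer(factor(zoning))

# Explicit mapping: list the levels in the order the codes should follow.
zoning_levels <- c("C (all)", "FV", "RH", "RL", "RM")
explicit_codes <- as.integer(factor(zoning, levels = zoning_levels))

print(default_codes)
print(explicit_codes)
```

Because "RH" does not occur in the toy vector, the default codes for "RL" and "RM" shift down by one. The explicit level vector keeps them at 4 and 5 whatever labels happen to be present.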
# Transforming Street column to numeric in train dataset: Pave = 1, Grvl = 2
Street <- factor(housing_train$Street, levels = c("Pave", "Grvl"))
housing_train$Street <- as.numeric(Street)
# Transforming Street column to numeric in test dataset: Pave = 1, Grvl = 2
Street <- factor(housing_test$Street, levels = c("Pave", "Grvl"))
housing_test$Street <- as.numeric(Street)
# Transforming Alley column to numeric in train dataset
Alley<-as.factor(housing_train$Alley)
levels(Alley)
## [1] "Grvl" "Pave"
# Pave = 1, Grvl = 2; missing values (no alley access) stay NA at this step
Alley <- as.numeric(factor(Alley, levels = c("Pave", "Grvl")))
housing_train$Alley <- Alley
# Transforming Alley column to numeric in test dataset
Alley<-as.factor(housing_test$Alley)
levels(Alley)
## [1] "Grvl" "Pave"
# Pave = 1, Grvl = 2; missing values (no alley access) stay NA at this step
Alley <- as.numeric(factor(Alley, levels = c("Pave", "Grvl")))
housing_test$Alley <- Alley
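The variable list assigns 0 to NA (no alley access). One way that recoding could be done is sketched below on a made-up vector. This is an illustration under that assumption, not necessarily the cleanup applied later in this report.

```r
# Toy alley labels; NA stands for "no alley access".
alley <- c(NA, "Grvl", "Pave", NA, "Pave")

# Encode the observed labels first: Pave = 1, Grvl = 2.
alley_code <- as.integer(factor(alley, levels = c("Pave", "Grvl")))

# Then give the missing entries their own code, 0.
alley_code[is.na(alley_code)] <- 0L

print(alley_code)
```

Treating NA as its own category only makes sense when the missing value has a meaning, as it does here. For a measurement that is simply unknown, a 0 would be misleading.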
# Transforming LotShape column to numeric in train dataset
LotShape <-as.factor(housing_train$LotShape)
levels(LotShape) # 4 levels: "IR1", "IR2", "IR3", "Reg"
## [1] "IR1" "IR2" "IR3" "Reg"
LotShape <- as.numeric(factor(LotShape, levels = c("IR1", "IR2", "IR3", "Reg")))  # IR1=1, IR2=2, IR3=3, Reg=4
housing_train$LotShape <- LotShape
# Transforming LotShape column to numeric in test dataset
LotShape <-as.factor(housing_test$LotShape)
levels(LotShape) # 4 levels: "IR1", "IR2", "IR3", "Reg"
## [1] "IR1" "IR2" "IR3" "Reg"
LotShape <- as.numeric(factor(LotShape, levels = c("IR1", "IR2", "IR3", "Reg")))  # IR1=1, IR2=2, IR3=3, Reg=4
housing_test$LotShape <- LotShape
# Transforming LandContour column to numeric in train dataset
LandContour <-as.factor(housing_train$LandContour)
levels(LandContour) # 4 levels: "Bnk", "HLS", "Low", "Lvl"
## [1] "Bnk" "HLS" "Low" "Lvl"
LandContour <- as.numeric(factor(LandContour, levels = c("Bnk", "HLS", "Low", "Lvl")))  # Bnk=1, HLS=2, Low=3, Lvl=4
housing_train$LandContour <- LandContour
# Transforming LandContour column to numeric in test dataset
LandContour <-as.factor(housing_test$LandContour)
levels(LandContour) # 4 levels: "Bnk", "HLS", "Low", "Lvl"
## [1] "Bnk" "HLS" "Low" "Lvl"
LandContour <- as.numeric(factor(LandContour, levels = c("Bnk", "HLS", "Low", "Lvl")))  # Bnk=1, HLS=2, Low=3, Lvl=4
housing_test$LandContour <- LandContour
# Transforming Utilities column to numeric in train dataset
Utilities <-as.factor(housing_train$Utilities)
levels(Utilities) # 2 levels: "AllPub", "NoSeWa"
## [1] "AllPub" "NoSeWa"
Utilities <- as.numeric(factor(Utilities, levels = c("AllPub", "NoSeWa")))  # AllPub=1, NoSeWa=2; NA stays NA
housing_train$Utilities <- Utilities
# Transforming Utilities column to numeric in test dataset
Utilities <-as.factor(housing_test$Utilities)
levels(Utilities) # 1 level in the test set: "AllPub"
## [1] "AllPub"
Utilities <- as.numeric(factor(Utilities, levels = c("AllPub", "NoSeWa")))  # same codes as train; NA stays NA
housing_test$Utilities <- Utilities
# Transforming LotConfig column to numeric in train dataset
LotConfig <-as.factor(housing_train$LotConfig)
levels(LotConfig)
## [1] "Corner" "CulDSac" "FR2" "FR3" "Inside"
LotConfig <- as.numeric(factor(LotConfig, levels = c("Corner", "CulDSac", "FR2", "FR3", "Inside")))  # Corner=1 ... Inside=5
housing_train$LotConfig <- LotConfig
# Transforming LotConfig column to numeric in test dataset
LotConfig <-as.factor(housing_test$LotConfig)
levels(LotConfig)
## [1] "Corner" "CulDSac" "FR2" "FR3" "Inside"
LotConfig <- as.numeric(factor(LotConfig, levels = c("Corner", "CulDSac", "FR2", "FR3", "Inside")))  # Corner=1 ... Inside=5
housing_test$LotConfig <- LotConfig
# Transforming LandSlope column to numeric in train dataset
LandSlope <-as.factor(housing_train$LandSlope)
levels(LandSlope)
## [1] "Gtl" "Mod" "Sev"
LandSlope <- as.numeric(factor(LandSlope, levels = c("Gtl", "Mod", "Sev")))  # Gtl=1, Mod=2, Sev=3
housing_train$LandSlope <- LandSlope
# Transforming LandSlope column to numeric in test dataset
LandSlope <-as.factor(housing_test$LandSlope)
levels(LandSlope)
## [1] "Gtl" "Mod" "Sev"
LandSlope <- as.numeric(factor(LandSlope, levels = c("Gtl", "Mod", "Sev")))  # Gtl=1, Mod=2, Sev=3
housing_test$LandSlope <- LandSlope
# Transforming Neighborhood column to numeric in train dataset
Neighborhood <-as.factor(housing_train$Neighborhood)
levels(Neighborhood)
## [1] "Blmngtn" "Blueste" "BrDale" "BrkSide" "ClearCr" "CollgCr" "Crawfor"
## [8] "Edwards" "Gilbert" "IDOTRR" "MeadowV" "Mitchel" "NAmes" "NoRidge"
## [15] "NPkVill" "NridgHt" "NWAmes" "OldTown" "Sawyer" "SawyerW" "Somerst"
## [22] "StoneBr" "SWISU" "Timber" "Veenker"
# Codes follow the variable list: Blmngtn = 1 ... SWISU = 19, Sawyer = 20 ... Veenker = 25
Neighborhood <- as.numeric(factor(Neighborhood, levels = c("Blmngtn", "Blueste", "BrDale", "BrkSide", "ClearCr", "CollgCr", "Crawfor", "Edwards", "Gilbert", "IDOTRR", "MeadowV", "Mitchel", "NAmes", "NoRidge", "NPkVill", "NridgHt", "NWAmes", "OldTown", "SWISU", "Sawyer", "SawyerW", "Somerst", "StoneBr", "Timber", "Veenker")))
housing_train$Neighborhood <- Neighborhood
# Transforming Neighborhood column to numeric in test dataset
Neighborhood <-as.factor(housing_test$Neighborhood)
levels(Neighborhood)
## [1] "Blmngtn" "Blueste" "BrDale" "BrkSide" "ClearCr" "CollgCr" "Crawfor"
## [8] "Edwards" "Gilbert" "IDOTRR" "MeadowV" "Mitchel" "NAmes" "NoRidge"
## [15] "NPkVill" "NridgHt" "NWAmes" "OldTown" "Sawyer" "SawyerW" "Somerst"
## [22] "StoneBr" "SWISU" "Timber" "Veenker"
# Codes follow the variable list: Blmngtn = 1 ... SWISU = 19, Sawyer = 20 ... Veenker = 25
Neighborhood <- as.numeric(factor(Neighborhood, levels = c("Blmngtn", "Blueste", "BrDale", "BrkSide", "ClearCr", "CollgCr", "Crawfor", "Edwards", "Gilbert", "IDOTRR", "MeadowV", "Mitchel", "NAmes", "NoRidge", "NPkVill", "NridgHt", "NWAmes", "OldTown", "SWISU", "Sawyer", "SawyerW", "Somerst", "StoneBr", "Timber", "Veenker")))
housing_test$Neighborhood <- Neighborhood
# Transforming Condition1 column to numeric in train dataset
Condition1 <-as.factor(housing_train$Condition1)
levels(Condition1)
## [1] "Artery" "Feedr" "Norm" "PosA" "PosN" "RRAe" "RRAn" "RRNe"
## [9] "RRNn"
# Artery=1, Feedr=2, Norm=3, RRNn=4, RRAn=5, PosN=6, PosA=7, RRNe=8, RRAe=9
Condition1 <- as.numeric(factor(Condition1, levels = c("Artery", "Feedr", "Norm", "RRNn", "RRAn", "PosN", "PosA", "RRNe", "RRAe")))
housing_train$Condition1 <- Condition1
# Transforming Condition1 column to numeric in test dataset
Condition1 <-as.factor(housing_test$Condition1)
levels(Condition1)
## [1] "Artery" "Feedr" "Norm" "PosA" "PosN" "RRAe" "RRAn" "RRNe"
## [9] "RRNn"
# Artery=1, Feedr=2, Norm=3, RRNn=4, RRAn=5, PosN=6, PosA=7, RRNe=8, RRAe=9
Condition1 <- as.numeric(factor(Condition1, levels = c("Artery", "Feedr", "Norm", "RRNn", "RRAn", "PosN", "PosA", "RRNe", "RRAe")))
housing_test$Condition1 <- Condition1
# Transforming Condition2 column to numeric in train dataset
Condition2 <-as.factor(housing_train$Condition2)
levels(Condition2)
## [1] "Artery" "Feedr" "Norm" "PosA" "PosN" "RRAe" "RRAn" "RRNn"
# Same codes as Condition1, including levels that do not occur in this column
Condition2 <- as.numeric(factor(Condition2, levels = c("Artery", "Feedr", "Norm", "RRNn", "RRAn", "PosN", "PosA", "RRNe", "RRAe")))
housing_train$Condition2 <- Condition2
# Transforming Condition2 column to numeric in test dataset
Condition2 <-as.factor(housing_test$Condition2)
levels(Condition2) #values
## [1] "Artery" "Feedr" "Norm" "PosA" "PosN"
# Same codes as Condition1, including levels that do not occur in this column
Condition2 <- as.numeric(factor(Condition2, levels = c("Artery", "Feedr", "Norm", "RRNn", "RRAn", "PosN", "PosA", "RRNe", "RRAe")))
housing_test$Condition2 <- Condition2
# Transforming BldgType column to numeric in train dataset
BldgType <-as.factor(housing_train$BldgType)
levels(BldgType)
## [1] "1Fam" "2fmCon" "Duplex" "Twnhs" "TwnhsE"
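# Labels as they appear in the data; "Twnhs" is assumed to be the inside unit (TwnhsI in the variable list)
BldgType <- as.numeric(factor(BldgType, levels = c("1Fam", "2fmCon", "Duplex", "TwnhsE", "Twnhs")))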
housing_train$BldgType <- BldgType
# Transforming BldgType column to numeric in test dataset
BldgType <-as.factor(housing_test$BldgType)
levels(BldgType)
## [1] "1Fam" "2fmCon" "Duplex" "Twnhs" "TwnhsE"
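# Labels as they appear in the data; "Twnhs" is assumed to be the inside unit (TwnhsI in the variable list)
BldgType <- as.numeric(factor(BldgType, levels = c("1Fam", "2fmCon", "Duplex", "TwnhsE", "Twnhs")))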
housing_test$BldgType <- BldgType
# Transforming HouseStyle column to numeric in train dataset
HouseStyle <-as.factor(housing_train$HouseStyle)
levels(HouseStyle)
## [1] "1.5Fin" "1.5Unf" "1Story" "2.5Fin" "2.5Unf" "2Story" "SFoyer" "SLvl"
# 1Story=1, 1.5Fin=2, 1.5Unf=3, 2Story=4, 2.5Fin=5, 2.5Unf=6, SFoyer=7, SLvl=8
HouseStyle <- as.numeric(factor(HouseStyle, levels = c("1Story", "1.5Fin", "1.5Unf", "2Story", "2.5Fin", "2.5Unf", "SFoyer", "SLvl")))
housing_train$HouseStyle <- HouseStyle
# Transforming HouseStyle column to numeric in test dataset
HouseStyle <-as.factor(housing_test$HouseStyle)
levels(HouseStyle)
## [1] "1.5Fin" "1.5Unf" "1Story" "2.5Unf" "2Story" "SFoyer" "SLvl"
# Same codes as train, although "2.5Fin" does not occur in the test set
HouseStyle <- as.numeric(factor(HouseStyle, levels = c("1Story", "1.5Fin", "1.5Unf", "2Story", "2.5Fin", "2.5Unf", "SFoyer", "SLvl")))
housing_test$HouseStyle <- HouseStyle
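The test set contains fewer house styles than the training set, so codes derived separately from each set would not line up. The sketch below, on made-up vectors, shows the difference between deriving codes per data set and deriving them from one shared level vector. All variable names here are illustrative assumptions.

```r
# Toy label vectors; the test vector lacks "2.5Fin", as in the level listings above.
train_style <- c("1Story", "2.5Fin", "2.5Unf", "2Story")
test_style  <- c("1Story", "2.5Unf", "2Story")

# Codes derived separately from each set depend on which labels are present.
separate_train <- as.integer(factor(train_style))
separate_test  <- as.integer(factor(test_style))

# Codes derived from one shared level vector are the same in both sets.
style_levels <- c("1Story", "1.5Fin", "1.5Unf", "2Story",
                  "2.5Fin", "2.5Unf", "SFoyer", "SLvl")
shared_train <- as.integer(factor(train_style, levels = style_levels))
shared_test  <- as.integer(factor(test_style, levels = style_levels))

# Compare the code assigned to "2.5Unf" under each approach.
print(c(separate_train[train_style == "2.5Unf"], separate_test[test_style == "2.5Unf"]))
print(c(shared_train[train_style == "2.5Unf"], shared_test[test_style == "2.5Unf"]))
```

A model fitted on the training codes can only be applied to the test set if a given label carries the same number in both.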
# Transforming RoofStyle column to numeric in train dataset
RoofStyle <-as.factor(housing_train$RoofStyle)
levels(RoofStyle)
## [1] "Flat" "Gable" "Gambrel" "Hip" "Mansard" "Shed"
RoofStyle <- as.numeric(factor(RoofStyle, levels = c("Flat", "Gable", "Gambrel", "Hip", "Mansard", "Shed")))  # Flat=1 ... Shed=6
housing_train$RoofStyle <- RoofStyle
# Transforming RoofStyle column to numeric in test dataset
RoofStyle <-as.factor(housing_test$RoofStyle)
levels(RoofStyle)
## [1] "Flat" "Gable" "Gambrel" "Hip" "Mansard" "Shed"
RoofStyle <- as.numeric(factor(RoofStyle, levels = c("Flat", "Gable", "Gambrel", "Hip", "Mansard", "Shed")))  # Flat=1 ... Shed=6
housing_test$RoofStyle <- RoofStyle
# Transforming RoofMatl column to numeric in train dataset
RoofMatl <-as.factor(housing_train$RoofMatl)
levels(RoofMatl)
## [1] "ClyTile" "CompShg" "Membran" "Metal" "Roll" "Tar&Grv" "WdShake"
## [8] "WdShngl"
# ClyTile=1, CompShg=2, Membran=3, Metal=4, Roll=5, Tar&Grv=6, WdShake=7, WdShngl=8
RoofMatl <- as.numeric(factor(RoofMatl, levels = c("ClyTile", "CompShg", "Membran", "Metal", "Roll", "Tar&Grv", "WdShake", "WdShngl")))
housing_train$RoofMatl <- RoofMatl
# Transforming RoofMatl column to numeric in test dataset
RoofMatl <-as.factor(housing_test$RoofMatl)
levels(RoofMatl)
## [1] "CompShg" "Tar&Grv" "WdShake" "WdShngl"
# Same codes as train, although only four materials occur in the test set
RoofMatl <- as.numeric(factor(RoofMatl, levels = c("ClyTile", "CompShg", "Membran", "Metal", "Roll", "Tar&Grv", "WdShake", "WdShngl")))
housing_test$RoofMatl <- RoofMatl
# Transforming Exterior1st column to numeric in train dataset
Exterior1st <-as.factor(housing_train$Exterior1st)
levels(Exterior1st)
## [1] "AsbShng" "AsphShn" "BrkComm" "BrkFace" "CBlock" "CemntBd" "HdBoard"
## [8] "ImStucc" "MetalSd" "Plywood" "Stone" "Stucco" "VinylSd" "Wd Sdng"
## [15] "WdShing"
# AsbShng=1 ... WdShing=15, in the order of the training levels
Exterior1st <- as.numeric(factor(Exterior1st, levels = c("AsbShng", "AsphShn", "BrkComm", "BrkFace", "CBlock", "CemntBd", "HdBoard", "ImStucc", "MetalSd", "Plywood", "Stone", "Stucco", "VinylSd", "Wd Sdng", "WdShing")))
housing_train$Exterior1st <- Exterior1st
# Transforming Exterior1st column to numeric in test dataset
Exterior1st <-as.factor(housing_test$Exterior1st)
levels(Exterior1st)
## [1] "AsbShng" "AsphShn" "BrkComm" "BrkFace" "CBlock" "CemntBd" "HdBoard"
## [8] "MetalSd" "Plywood" "Stucco" "VinylSd" "Wd Sdng" "WdShing"
# Same codes as train, although "ImStucc" and "Stone" do not occur in the test set
Exterior1st <- as.numeric(factor(Exterior1st, levels = c("AsbShng", "AsphShn", "BrkComm", "BrkFace", "CBlock", "CemntBd", "HdBoard", "ImStucc", "MetalSd", "Plywood", "Stone", "Stucco", "VinylSd", "Wd Sdng", "WdShing")))
housing_test$Exterior1st <- Exterior1st
# Transforming Exterior2nd column to numeric in train dataset
Exterior2nd <-as.factor(housing_train$Exterior2nd)
levels(Exterior2nd)
## [1] "AsbShng" "AsphShn" "Brk Cmn" "BrkFace" "CBlock" "CmentBd" "HdBoard"
## [8] "ImStucc" "MetalSd" "Other" "Plywood" "Stone" "Stucco" "VinylSd"
## [15] "Wd Sdng" "Wd Shng"
# Codes follow the 16 labels observed in the training data, spelled as they appear there
Exterior2nd <- as.numeric(factor(Exterior2nd, levels = c("AsbShng", "AsphShn", "Brk Cmn", "BrkFace", "CBlock", "CmentBd", "HdBoard", "ImStucc", "MetalSd", "Other", "Plywood", "Stone", "Stucco", "VinylSd", "Wd Sdng", "Wd Shng")))
housing_train$Exterior2nd <- Exterior2nd
# Transforming Exterior2nd column to numeric in test dataset
Exterior2nd <-as.factor(housing_test$Exterior2nd)
levels(Exterior2nd)
## [1] "AsbShng" "AsphShn" "Brk Cmn" "BrkFace" "CBlock" "CmentBd" "HdBoard"
## [8] "ImStucc" "MetalSd" "Plywood" "Stone" "Stucco" "VinylSd" "Wd Sdng"
## [15] "Wd Shng"
# Same codes as train, although "Other" does not occur in the test set
Exterior2nd <- as.numeric(factor(Exterior2nd, levels = c("AsbShng", "AsphShn", "Brk Cmn", "BrkFace", "CBlock", "CmentBd", "HdBoard", "ImStucc", "MetalSd", "Other", "Plywood", "Stone", "Stucco", "VinylSd", "Wd Sdng", "Wd Shng")))
housing_test$Exterior2nd <- Exterior2nd
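The level listings above show that some labels in the data are spelled differently from the variable list (for example "Brk Cmn", "CmentBd", and "Wd Shng"). A quick check for such mismatches, sketched on a made-up vector, is to compare the observed labels with the level vector before encoding. The names `exterior` and `documented_levels` are assumptions for illustration.

```r
# Labels as they might appear in the data (made-up sample).
exterior <- c("VinylSd", "Brk Cmn", "CmentBd", "Wd Shng", "VinylSd")

# Level vector typed from a written variable list.
documented_levels <- c("BrkComm", "CemntBd", "VinylSd", "WdShing")

# Labels present in the data but absent from the level vector.
unmatched <- setdiff(unique(exterior), documented_levels)
print(unmatched)

# Unmatched labels silently become NA when the levels are given explicitly.
codes <- as.integer(factor(exterior, levels = documented_levels))
print(codes)
```

Running the `setdiff()` check first makes a spelling mismatch visible before it turns into unexplained missing values.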
# Transforming MasVnrType column to numeric in train dataset
MasVnrType <-as.factor(housing_train$MasVnrType)
levels(MasVnrType)
## [1] "BrkCmn" "BrkFace" "None" "Stone"
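# None=0, BrkCmn=1, BrkFace=2, CBlock=3, Stone=4 (as in the variable list); NA stays NA
MasVnrType <- as.numeric(factor(MasVnrType, levels = c("None", "BrkCmn", "BrkFace", "CBlock", "Stone"))) - 1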
housing_train$MasVnrType <- MasVnrType
# Transforming MasVnrType column to numeric in test dataset
MasVnrType <-as.factor(housing_test$MasVnrType)
levels(MasVnrType)
## [1] "BrkCmn" "BrkFace" "None" "Stone"
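# None=0, BrkCmn=1, BrkFace=2, CBlock=3, Stone=4 (same codes as train); NA stays NA
MasVnrType <- as.numeric(factor(MasVnrType, levels = c("None", "BrkCmn", "BrkFace", "CBlock", "Stone"))) - 1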
housing_test$MasVnrType <- MasVnrType
# Transforming ExterQual column to numeric in train dataset
ExterQual <-as.factor(housing_train$ExterQual)
levels(ExterQual)
## [1] "Ex" "Fa" "Gd" "TA"
ExterQual <- as.numeric(factor(ExterQual, levels = c("Ex", "Gd", "TA", "Fa", "Po")))  # Ex=1, Gd=2, TA=3, Fa=4, Po=5
housing_train$ExterQual <- ExterQual
# Transforming ExterQual column to numeric in test dataset
ExterQual <-as.factor(housing_test$ExterQual)
levels(ExterQual)
## [1] "Ex" "Fa" "Gd" "TA"
ExterQual=as.numeric(ExterQual,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5 )
housing_test$ExterQual <- ExterQual
# Transforming ExterCond column to numeric in train dataset
ExterCond <-as.factor(housing_train$ExterCond)
levels(ExterCond)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
ExterCond=as.numeric(ExterCond,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5 )
housing_train$ExterCond <- ExterCond
# Transforming ExterCond column to numeric in test dataset
ExterCond <-as.factor(housing_test$ExterCond)
levels(ExterCond)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
ExterCond=as.numeric(ExterCond,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5 )
housing_test$ExterCond <- ExterCond
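The ExterQual and ExterCond calls above write the scale Ex=1, Gd=2, TA=3, Fa=4, Po=5, while as.numeric() on a factor returns the alphabetical level codes (for example Ex=1, Fa=2, Gd=3, TA=4 for ExterQual). If the written scale is the one wanted, a small shared helper is one way to apply it. This is a minimal sketch; the helper name `quality_to_num` and the sample values are assumptions for illustration.

```r
# Hypothetical helper: map the shared quality labels to the scale written
# in the calls above (Ex=1, Gd=2, TA=3, Fa=4, Po=5); NA stays NA
quality_to_num <- function(x) {
  qual_scale <- c(Ex = 1, Gd = 2, TA = 3, Fa = 4, Po = 5)
  unname(qual_scale[as.character(x)])
}

# Made-up values; "Po" is absent, as it is in the ExterQual levels above
demo_qual <- c("TA", "Gd", "Ex", "Fa", "TA")
demo_codes <- quality_to_num(demo_qual)
demo_codes   # 3 2 1 4 3
```

Because the scale lives in one place, the same function could be reused for every column that shares these labels.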
# Transforming Foundation column to numeric in train dataset
Foundation <-as.factor(housing_train$Foundation)
levels(Foundation)
## [1] "BrkTil" "CBlock" "PConc" "Slab" "Stone" "Wood"
Foundation=as.numeric(Foundation,"BrkTil"=1, "CBlock"=2, "PConc"=3, "Slab"=4, "Stone"=5, "Wood" = 6 )
housing_train$Foundation <- Foundation
# Transforming Foundation column to numeric in test dataset
Foundation <-as.factor(housing_test$Foundation)
levels(Foundation)
## [1] "BrkTil" "CBlock" "PConc" "Slab" "Stone" "Wood"
Foundation=as.numeric(Foundation,"BrkTil"=1, "CBlock"=2, "PConc"=3, "Slab"=4, "Stone"=5, "Wood" = 6 )
housing_test$Foundation<- Foundation
# Transforming BsmtQual column to numeric in train dataset
BsmtQual <-as.factor(housing_train$BsmtQual)
levels(BsmtQual)
## [1] "Ex" "Fa" "Gd" "TA"
BsmtQual=as.numeric(BsmtQual,"Ex"=1, "Fa"=2, "Gd"=3, "TA"=4, "Fa"=5, "Po" = 6, "NA"=0)
housing_train$BsmtQual <- BsmtQual
# Transforming BsmtQual column to numeric in test dataset
BsmtQual <-as.factor(housing_test$BsmtQual)
levels(BsmtQual)
## [1] "Ex" "Fa" "Gd" "TA"
BsmtQual=as.numeric(BsmtQual,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5, "NA"=0 )
housing_test$BsmtQual<- BsmtQual
# Transforming BsmtCond column to numeric in train dataset
BsmtCond <-as.factor(housing_train$BsmtCond)
levels(BsmtCond)
## [1] "Fa" "Gd" "Po" "TA"
BsmtCond=as.numeric(BsmtCond,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5, "NA"=0)
housing_train$BsmtCond <- BsmtCond
# Transforming BsmtCond column to numeric in test dataset
BsmtCond <-as.factor(housing_test$BsmtCond)
levels(BsmtCond)
## [1] "Fa" "Gd" "Po" "TA"
BsmtCond=as.numeric(BsmtCond,"Ex"=1, "Fa"=4, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5, "NA"=0 )
housing_test$BsmtCond<- BsmtCond
# Transforming BsmtExposure column to numeric in train dataset
BsmtExposure <-as.factor(housing_train$BsmtExposure)
levels(BsmtExposure)
## [1] "Av" "Gd" "Mn" "No"
BsmtExposure=as.numeric(BsmtExposure,"Av"=1, "Gd"=2, "Mn"=3, "No"=4, "NA"=0)
housing_train$BsmtExposure <- BsmtExposure
# Transforming BsmtExposure column to numeric in test dataset
BsmtExposure <-as.factor(housing_test$BsmtExposure)
levels(BsmtExposure)
## [1] "Av" "Gd" "Mn" "No"
BsmtExposure=as.numeric(BsmtExposure,"Av"=1, "Gd"=2, "Mn"=3, "No"=4, "NA"=0)
housing_test$BsmtExposure<- BsmtExposure
# Transforming BsmtFinType1 column to numeric in train dataset
BsmtFinType1 <-as.factor(housing_train$BsmtFinType1)
levels(BsmtFinType1)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType1=as.numeric(BsmtFinType1,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6, "NA"=0)
housing_train$BsmtFinType1 <- BsmtFinType1
# Transforming BsmtFinType1 column to numeric in test dataset
BsmtFinType1 <-as.factor(housing_test$BsmtFinType1)
levels(BsmtFinType1)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType1=as.numeric(BsmtFinType1,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6)
housing_test$BsmtFinType1<- BsmtFinType1
housing_train$BsmtFinType1[is.na(housing_train$BsmtFinType1)] <- 0
sum(is.na(housing_train$BsmtFinType1))
## [1] 0
housing_test$BsmtFinType1[is.na(housing_test$BsmtFinType1)] <- 0
sum(is.na(housing_test$BsmtFinType1))
## [1] 0
# Transforming BsmtFinType2 column to numeric in train dataset
BsmtFinType2 <-as.factor(housing_train$BsmtFinType2)
levels(BsmtFinType2)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType2=as.numeric(BsmtFinType2,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6, "NA"=0)
housing_train$BsmtFinType2 <- BsmtFinType2
# Transforming BsmtFinType2 column to numeric in test dataset
BsmtFinType2 <-as.factor(housing_test$BsmtFinType2)
levels(BsmtFinType2)
## [1] "ALQ" "BLQ" "GLQ" "LwQ" "Rec" "Unf"
BsmtFinType2=as.numeric(BsmtFinType2,"ALQ"=1, "BLQ"=2, "GLQ"=3, "LwQ"=4, "Rec"=5, "Unf"=6, "NA"=0)
housing_test$BsmtFinType2<- BsmtFinType2
housing_train$BsmtFinType2[is.na(housing_train$BsmtFinType2)] <- 0
sum(is.na(housing_train$BsmtFinType2))
## [1] 0
housing_test$BsmtFinType2[is.na(housing_test$BsmtFinType2)] <- 0
sum(is.na(housing_test$BsmtFinType2))
## [1] 0
# Transforming Heating column to numeric in train dataset
Heating <-as.factor(housing_train$Heating)
levels(Heating)
## [1] "Floor" "GasA" "GasW" "Grav" "OthW" "Wall"
Heating=as.numeric(Heating,"Floor"=1, "GasA"=2, "GasW"=3, "Grav"=4, "OthW"=5, "Wall"=6, "NA"=0)
housing_train$Heating <- Heating
# Transforming Heating column to numeric in test dataset
Heating <-as.factor(housing_test$Heating)
levels(Heating)
## [1] "GasA" "GasW" "Grav" "Wall"
Heating=as.numeric(Heating,"Floor"=1, "GasA"=2, "GasW"=3, "Grav"=4, "OthW"=5, "Wall"=6, "NA"=0)
housing_test$Heating<- Heating
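The Heating output above shows six levels in the training data but only four in the test data. Since as.numeric() numbers the levels that are present, the same label can end up with different codes in the two data sets. One way to keep them aligned is to pass the training level set to factor() explicitly. A minimal sketch with made-up test values (the `_demo` names are hypothetical):

```r
# Level set copied from the training output above
heating_levels <- c("Floor", "GasA", "GasW", "Grav", "OthW", "Wall")

# Made-up test values that only cover four of the six levels
heating_test_demo <- c("GasA", "Wall", "Grav", "GasW")

# Codes when the levels are inferred from the values present
codes_inferred <- as.numeric(as.factor(heating_test_demo))

# Codes when the training level set is supplied explicitly
codes_shared <- as.numeric(factor(heating_test_demo, levels = heating_levels))

codes_inferred  # 1 4 3 2
codes_shared    # 2 6 4 3
```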
# Transforming HeatingQC column to numeric in train dataset
HeatingQC <-as.factor(housing_train$HeatingQC)
levels(HeatingQC)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
HeatingQC=as.numeric(HeatingQC,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5)
housing_train$HeatingQC <- HeatingQC
# Transforming HeatingQC column to numeric in test dataset
HeatingQC <-as.factor(housing_test$HeatingQC)
levels(HeatingQC)
## [1] "Ex" "Fa" "Gd" "Po" "TA"
HeatingQC=as.numeric(HeatingQC,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5)
housing_test$HeatingQC<- HeatingQC
# Transforming CentralAir column to numeric in train dataset
CentralAir <-as.factor(housing_train$CentralAir)
levels(CentralAir)
## [1] "N" "Y"
CentralAir=as.numeric(CentralAir,"N"=0, "Y"=1)
housing_train$CentralAir <- CentralAir
# Transforming CentralAir column to numeric in test dataset
CentralAir <-as.factor(housing_test$CentralAir)
levels(CentralAir)
## [1] "N" "Y"
CentralAir=as.numeric(CentralAir,"N"=0, "Y"=1)
housing_test$CentralAir<- CentralAir
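The CentralAir calls write N=0 and Y=1, whereas the factor codes returned by as.numeric() are 1 and 2. For a linear model the shift only moves the intercept, but a true 0/1 indicator is easy to build with a comparison. A minimal sketch on made-up values (`central_air_demo` is a hypothetical name):

```r
# Made-up values standing in for the CentralAir column
central_air_demo <- c("Y", "N", "Y", "Y")

# Factor codes: N=1, Y=2
air_codes <- as.numeric(as.factor(central_air_demo))

# 0/1 indicator: 1 when central air is present
air_binary <- as.integer(central_air_demo == "Y")

air_codes    # 2 1 2 2
air_binary   # 1 0 1 1
```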
# Transforming Electrical column to numeric in train dataset
Electrical <-as.factor(housing_train$Electrical)
levels(Electrical)
## [1] "FuseA" "FuseF" "FuseP" "Mix" "SBrkr"
Electrical=as.numeric(Electrical,"SBrkr"=1, "FuseA"=2, "FuseF"=3, "FuseP"=4, "Mix"=5, "NA"=0)
housing_train$Electrical <- Electrical
# Transforming Electrical column to numeric in test dataset
Electrical <-as.factor(housing_test$Electrical)
levels(Electrical)
## [1] "FuseA" "FuseF" "FuseP" "SBrkr"
Electrical=as.numeric(Electrical,"SBrkr"=1, "FuseA"=2, "FuseF"=3, "FuseP"=4, "Mix"=5 )
housing_test$Electrical<- Electrical
housing_train$Electrical[is.na(housing_train$Electrical)] <- 0
sum(is.na(housing_train$Electrical))
## [1] 0
housing_test$Electrical[is.na(housing_test$Electrical)] <- 0
sum(is.na(housing_test$Electrical))
## [1] 0
# Transforming KitchenQual column to numeric in train dataset
KitchenQual <-as.factor(housing_train$KitchenQual)
levels(KitchenQual)
## [1] "Ex" "Fa" "Gd" "TA"
KitchenQual=as.numeric(KitchenQual,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5)
housing_train$KitchenQual <- KitchenQual
# Transforming KitchenQual column to numeric in test dataset
KitchenQual <-as.factor(housing_test$KitchenQual)
levels(KitchenQual)
## [1] "Ex" "Fa" "Gd" "TA"
KitchenQual=as.numeric(KitchenQual,"Ex"=1, "Gd"=2, "TA"=3, "Fa"=4, "Po"=5 )
housing_test$KitchenQual<- KitchenQual
#Transforming Functional column to numeric in train dataset
Functional=as.factor(housing_train$Functional)
levels(Functional) #"Maj1" "Maj2" "Min1" "Min2" "Mod" "Sev" "Typ"
## [1] "Maj1" "Maj2" "Min1" "Min2" "Mod" "Sev" "Typ"
Functional=as.numeric(Functional, "Sev"=1, "Maj2"=2,"Maj1"=3, "Mod"=4, "Min2"=5, "Min1"=6, "Typ"=7)
Functional=Functional+1 # shifts the level codes up by one: Maj1=2, Maj2=3, Min1=4, Min2=5, Mod=6, Sev=7, Typ=8 (as.numeric() ignores the named arguments above)
housing_train$Functional <- Functional
#Transforming Functional column to numeric in test dataset
Functional=as.factor(housing_test$Functional)
levels(Functional) #"Maj1" "Maj2" "Min1" "Min2" "Mod" "Sev" "Typ"
## [1] "Maj1" "Maj2" "Min1" "Min2" "Mod" "Sev" "Typ"
Functional=as.numeric(Functional, "Sev"=1, "Maj2"=2,"Maj1"=3, "Mod"=4, "Min2"=5, "Min1"=6, "Typ"=7, "NA"=0)
Functional=Functional +1 # shifts the level codes up by one: Maj1=2, Maj2=3, Min1=4, Min2=5, Mod=6, Sev=7, Typ=8; missing entries stay NA at this point
housing_test$Functional <- Functional
#Transforming FireplaceQu column to numeric in train dataset
FireplaceQu=as.factor(housing_train$FireplaceQu)
levels(FireplaceQu) #"Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(FireplaceQu)) # 690 Missing Entries
## [1] 690
FireplaceQu=as.numeric(FireplaceQu, "Po"=1, "Fa"=2,"TA"=3, "Gd"=4, "Ex"=5, "NA"=0)
FireplaceQu[is.na(FireplaceQu)]<-0
housing_train$FireplaceQu <- FireplaceQu
#Transforming FireplaceQu column to numeric in test dataset
FireplaceQu=as.factor(housing_test$FireplaceQu)
levels(FireplaceQu) #"Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(FireplaceQu)) # 730 Missing Entries
## [1] 730
FireplaceQu=as.numeric(FireplaceQu, "Po"=1, "Fa"=2,"TA"=3, "Gd"=4, "Ex"=5, "NA"=0)
FireplaceQu[is.na(FireplaceQu)]<-0
housing_test$FireplaceQu <- FireplaceQu
#Transforming GarageType column to numeric in train dataset
GarageType=as.factor(housing_train$GarageType)
levels(GarageType) #"2Types" "Attchd" "Basment" "BuiltIn" "CarPort" "Detchd"
## [1] "2Types" "Attchd" "Basment" "BuiltIn" "CarPort" "Detchd"
sum(is.na(GarageType)) # 81 Missing Entries
## [1] 81
GarageType=as.numeric(GarageType, "Detchd"=1, "CarPort"=2,"BuiltIn"=3, "Basment"=4, "Attchd"=5, "2Types"=6)
GarageType[is.na(GarageType)]<-0
housing_train$GarageType <- GarageType
#Transforming GarageType column to numeric in test dataset
GarageType=as.factor(housing_test$GarageType)
levels(GarageType) #"2Types" "Attchd" "Basment" "BuiltIn" "CarPort" "Detchd"
## [1] "2Types" "Attchd" "Basment" "BuiltIn" "CarPort" "Detchd"
sum(is.na(GarageType)) # 76 Missing Entries
## [1] 76
GarageType=as.numeric(GarageType, "Detchd"=1, "CarPort"=2,"BuiltIn"=3, "Basment"=4, "Attchd"=5, "2Types"=6)
GarageType[is.na(GarageType)]<-0
housing_test$GarageType <- GarageType
# Changing missing values of GarageYrBlt
sum(is.na(housing_train$GarageYrBlt)) # 81 missing values
## [1] 81
sum(is.na(housing_test$GarageYrBlt)) #78 missing values
## [1] 78
housing_train$GarageYrBlt[is.na(housing_train$GarageYrBlt)] <- 0
sum(is.na(housing_train$GarageYrBlt))
## [1] 0
housing_test$GarageYrBlt[is.na(housing_test$GarageYrBlt)] <- 0
sum(is.na(housing_test$GarageYrBlt))
## [1] 0
#Transforming GarageFinish column to numeric in train dataset
GarageFinish=as.factor(housing_train$GarageFinish)
levels(GarageFinish) #"Fin" "RFn" "Unf"
## [1] "Fin" "RFn" "Unf"
sum(is.na(GarageFinish)) # 81 Missing Entries
## [1] 81
GarageFinish=as.numeric(GarageFinish, "Unf"=1, "RFn"=2,"Fin"=3)
GarageFinish[is.na(GarageFinish)]<-0
housing_train$GarageFinish <- GarageFinish
#Transforming GarageFinish column to numeric in test dataset
GarageFinish=as.factor(housing_test$GarageFinish)
levels(GarageFinish) #"Fin" "RFn" "Unf"
## [1] "Fin" "RFn" "Unf"
sum(is.na(GarageFinish)) # 78 Missing Entries
## [1] 78
GarageFinish=as.numeric(GarageFinish, "Unf"=1, "RFn"=2,"Fin"=3)
GarageFinish[is.na(GarageFinish)]<-0
housing_test$GarageFinish <- GarageFinish
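For GarageFinish the calls write an ordered scale (Unf=1, RFn=2, Fin=3) with 0 for homes without a garage, while the alphabetical factor codes run the other way (Fin=1, RFn=2, Unf=3). If the written order is the intended one, the lookup and the zero fill can be combined. A minimal sketch on made-up values; `finish_scale` and the `_demo` name are assumptions:

```r
# Made-up values; NA stands for a home without a garage
garage_finish_demo <- c("Fin", NA, "Unf", "RFn")

# Ordered scale as written in the calls above
finish_scale <- c(Unf = 1, RFn = 2, Fin = 3)

finish_num <- unname(finish_scale[garage_finish_demo])
finish_num[is.na(finish_num)] <- 0   # no garage is coded as 0

finish_num   # 3 0 1 2
```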
# Changing missing values of GarageCars
sum(is.na(housing_train$GarageCars)) # no missing values
## [1] 0
sum(is.na(housing_test$GarageCars)) #1 missing value
## [1] 1
housing_test$GarageCars[is.na(housing_test$GarageCars)]<-0
# Changing missing values of GarageArea
sum(is.na(housing_train$GarageArea)) # no missing values
## [1] 0
sum(is.na(housing_test$GarageArea)) #1 missing value
## [1] 1
housing_test$GarageArea[is.na(housing_test$GarageArea)]<-0
#Transforming GarageQual column to numeric in train dataset
GarageQual=as.factor(housing_train$GarageQual)
levels(GarageQual) # "Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(GarageQual)) # 81 Missing Entries
## [1] 81
GarageQual=as.numeric(GarageQual, "Po"=1, "Fa"=2,"TA"=3, "Gd"=4, "Ex"=5)
GarageQual[is.na(GarageQual)]<-0
housing_train$GarageQual <- GarageQual
#Transforming GarageQual column to numeric in test dataset
GarageQual=as.factor(housing_test$GarageQual)
levels(GarageQual) # "Fa" "Gd" "Po" "TA"
## [1] "Fa" "Gd" "Po" "TA"
sum(is.na(GarageQual)) # 78 Missing Entries
## [1] 78
GarageQual=as.numeric(GarageQual, "Po"=1, "Fa"=2,"TA"=3, "Gd"=4)
GarageQual[is.na(GarageQual)]<-0
housing_test$GarageQual <- GarageQual
#Transforming GarageCond column to numeric in train dataset
GarageCond=as.factor(housing_train$GarageCond)
levels(GarageCond) # "Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(GarageCond)) # 81 Missing Entries
## [1] 81
GarageCond=as.numeric(GarageCond, "Po"=1, "Fa"=2,"TA"=3, "Gd"=4, "Ex"=5)
GarageCond[is.na(GarageCond)]<-0
housing_train$GarageCond <- GarageCond
#Transforming GarageCond column to numeric in test dataset
GarageCond=as.factor(housing_test$GarageCond)
levels(GarageCond) # "Ex" "Fa" "Gd" "Po" "TA"
## [1] "Ex" "Fa" "Gd" "Po" "TA"
sum(is.na(GarageCond)) # 78 Missing Entries
## [1] 78
GarageCond=as.numeric(GarageCond, "Po"=1, "Fa"=2,"TA"=3, "Gd"=4, "Ex"=5)
GarageCond[is.na(GarageCond)]<-0
housing_test$GarageCond <- GarageCond
#Transforming PavedDrive column to numeric in train dataset
PavedDrive=as.factor(housing_train$PavedDrive)
levels(PavedDrive) # "N" "P" "Y"
## [1] "N" "P" "Y"
sum(is.na(PavedDrive)) # 0 Missing Entries
## [1] 0
PavedDrive=as.numeric(PavedDrive, "N"=1, "P"=2,"Y"=3)
housing_train$PavedDrive <- PavedDrive
#Transforming PavedDrive column to numeric in test dataset
PavedDrive=as.factor(housing_test$PavedDrive)
levels(PavedDrive) # "N" "P" "Y"
## [1] "N" "P" "Y"
sum(is.na(PavedDrive)) # 0 Missing Entries
## [1] 0
PavedDrive=as.numeric(PavedDrive, "N"=1, "P"=2,"Y"=3)
housing_test$PavedDrive <- PavedDrive
#Transforming PoolQC column to numeric in train dataset
PoolQC=as.factor(housing_train$PoolQC)
levels(PoolQC) # "Ex" "Fa" "Gd"
## [1] "Ex" "Fa" "Gd"
sum(is.na(PoolQC)) # 1453 Missing Entries
## [1] 1453
PoolQC=as.numeric(PoolQC, "Fa"=1, "Gd"=2,"Ex"=3)
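PoolQC <-ifelse(PoolQC==2|PoolQC==3,PoolQC+1,PoolQC) # intended scale: No pool=0, Fa=1, TA=2, Gd=3, Ex=4; the codes actually produced are Ex=1, Fa=3, Gd=4, because as.numeric() uses the alphabetical level order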
PoolQC[is.na(PoolQC)]<-0
housing_train$PoolQC <- PoolQC
#Transforming PoolQC column to numeric in test dataset
PoolQC=as.factor(housing_test$PoolQC)
levels(PoolQC) # "Ex" "Gd"
## [1] "Ex" "Gd"
sum(is.na(PoolQC)) # 1456 Missing Entries
## [1] 1456
PoolQC=as.numeric(PoolQC, "Gd"=1, "Ex"=2)
PoolQC=PoolQC+2
PoolQC[is.na(PoolQC)]<-0
housing_test$PoolQC <- PoolQC
#Transforming Fence column to numeric in train dataset
Fence=as.factor(housing_train$Fence)
levels(Fence) # "GdPrv" "GdWo" "MnPrv" "MnWw"
## [1] "GdPrv" "GdWo" "MnPrv" "MnWw"
sum(is.na(Fence)) # 1179 Missing Entries
## [1] 1179
Fence=as.numeric(Fence, "MnWw"=1, "GdWo"=2,"MnPrv"=3, "GdPrv"=4)
Fence[is.na(Fence)]<-0
housing_train$Fence <- Fence
#Transforming Fence column to numeric in test dataset
Fence=as.factor(housing_test$Fence)
levels(Fence) # "GdPrv" "GdWo" "MnPrv" "MnWw"
## [1] "GdPrv" "GdWo" "MnPrv" "MnWw"
sum(is.na(Fence)) # 1169 Missing Entries
## [1] 1169
Fence=as.numeric(Fence, "MnWw"=1, "GdWo"=2,"MnPrv"=3, "GdPrv"=4)
Fence[is.na(Fence)]<-0
housing_test$Fence <- Fence
#Transforming MiscFeature column to numeric in train dataset
MiscFeature=as.factor(housing_train$MiscFeature)
levels(MiscFeature) # "Gar2" "Othr" "Shed" "TenC"
## [1] "Gar2" "Othr" "Shed" "TenC"
sum(is.na(MiscFeature)) # 1406 Missing Entries
## [1] 1406
MiscFeature=as.numeric(MiscFeature, "TenC"=1, "Shed"=2,"Othr"=3, "Gar2"=4)
MiscFeature[is.na(MiscFeature)]<-0
housing_train$MiscFeature <- MiscFeature
#Transforming MiscFeature column to numeric in test dataset
MiscFeature=as.factor(housing_test$MiscFeature)
levels(MiscFeature) # "Gar2" "Othr" "Shed"
## [1] "Gar2" "Othr" "Shed"
sum(is.na(MiscFeature)) # 1408 Missing Entries
## [1] 1408
MiscFeature=as.numeric(MiscFeature, "Shed"=1,"Othr"=2, "Gar2"=3)
MiscFeature=MiscFeature+1
MiscFeature[is.na(MiscFeature)]<-0 # intended scale: "TenC"=1, "Shed"=2, "Othr"=3, "Gar2"=4; the test codes actually produced are none=0, Gar2=2, Othr=3, Shed=4
housing_test$MiscFeature <- MiscFeature
#Transforming SaleType column to numeric in train dataset
SaleType=as.factor(housing_train$SaleType)
levels(SaleType) # "COD" "Con" "ConLD" "ConLI" "ConLw" "CWD" "New" "Oth" "WD"
## [1] "COD" "Con" "ConLD" "ConLI" "ConLw" "CWD" "New" "Oth" "WD"
sum(is.na(SaleType)) # 0 Missing Entries
## [1] 0
SaleType=as.numeric(SaleType, "Oth"=1, "ConLD"=2,"ConLI"=3, "ConLw"=4, "Con"=5, "COD"=6, "New"=7, "CWD"=8, "WD"=9)
housing_train$SaleType <- SaleType
#Transforming SaleType column to numeric in test dataset
SaleType=as.factor(housing_test$SaleType)
levels(SaleType) # "COD" "Con" "ConLD" "ConLI" "ConLw" "CWD" "New" "Oth" "WD"
## [1] "COD" "Con" "ConLD" "ConLI" "ConLw" "CWD" "New" "Oth" "WD"
sum(is.na(SaleType)) # 1 Missing Entries
## [1] 1
SaleType=as.numeric(SaleType, "Oth"=1, "ConLD"=2,"ConLI"=3, "ConLw"=4, "Con"=5, "COD"=6, "New"=7, "CWD"=8, "WD"=9)
SaleType[is.na(SaleType)]<-1
housing_test$SaleType <- SaleType
#Transforming SaleCondition column to numeric in train dataset
SaleCondition=as.factor(housing_train$SaleCondition)
levels(SaleCondition) # "Abnorml" "AdjLand" "Alloca" "Family" "Normal" "Partial"
## [1] "Abnorml" "AdjLand" "Alloca" "Family" "Normal" "Partial"
sum(is.na(SaleCondition)) # 0 Missing Entries
## [1] 0
SaleCondition=as.numeric(SaleCondition, "Partial"=1, "Family"=2,"Alloca"=3, "AdjLand"=4, "Abnorml"=5, "Normal"=6)
housing_train$SaleCondition <- SaleCondition
#Transforming SaleCondition column to numeric in test dataset
SaleCondition=as.factor(housing_test$SaleCondition)
levels(SaleCondition) # "Abnorml" "AdjLand" "Alloca" "Family" "Normal" "Partial"
## [1] "Abnorml" "AdjLand" "Alloca" "Family" "Normal" "Partial"
sum(is.na(SaleCondition)) # 0 Missing Entries
## [1] 0
SaleCondition=as.numeric(SaleCondition, "Partial"=1, "Family"=2,"Alloca"=3, "AdjLand"=4, "Abnorml"=5, "Normal"=6)
housing_test$SaleCondition <- SaleCondition
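The train and test blocks above repeat the same steps column by column. One way to shorten this and keep both data sets on the same codes is to write each mapping once and apply it in a loop. The sketch below uses two tiny made-up data frames; `encode_columns`, `lookups`, and the `demo_` names are hypothetical, and only two columns are shown.

```r
# Tiny made-up stand-ins for housing_train and housing_test
demo_train <- data.frame(PavedDrive = c("Y", "N", "P"),
                         Fence = c("GdPrv", NA, "MnWw"))
demo_test  <- data.frame(PavedDrive = c("Y", "Y", "N"),
                         Fence = c(NA, "GdWo", "MnPrv"))

# One lookup vector per column, written once and reused for both data sets
lookups <- list(
  PavedDrive = c(N = 1, P = 2, Y = 3),
  Fence      = c(MnWw = 1, GdWo = 2, MnPrv = 3, GdPrv = 4)
)

# Replace each listed column with its numeric codes; missing values become 0
encode_columns <- function(df, lookups) {
  for (col in names(lookups)) {
    codes <- unname(lookups[[col]][as.character(df[[col]])])
    codes[is.na(codes)] <- 0
    df[[col]] <- codes
  }
  df
}

demo_train <- encode_columns(demo_train, lookups)
demo_test  <- encode_columns(demo_test, lookups)
```

Note that this sketch also turns any label missing from a lookup into 0, so in practice the lookups would need to cover every label in the data.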
Interestingly enough, after the conversions above, the only columns with a significant number of nulls were LotFrontage (259 in the training data), indicating the lot was NOT on a street, and Alley (1,369 in the training data), indicating whether the property had access to an alley. A handful of nulls resulted from basement details, but nothing significant.
data.frame(num_missing=colSums(is.na(housing_train)))
## num_missing
## Id 0
## MSSubClass 0
## MSZoning 0
## LotFrontage 259
## LotArea 0
## Street 0
## Alley 1369
## LotShape 0
## LandContour 0
## Utilities 0
## LotConfig 0
## LandSlope 0
## Neighborhood 0
## Condition1 0
## Condition2 0
## BldgType 0
## HouseStyle 0
## OverallQual 0
## OverallCond 0
## YearBuilt 0
## YearRemodAdd 0
## RoofStyle 0
## RoofMatl 0
## Exterior1st 0
## Exterior2nd 0
## MasVnrType 8
## MasVnrArea 8
## ExterQual 0
## ExterCond 0
## Foundation 0
## BsmtQual 37
## BsmtCond 37
## BsmtExposure 38
## BsmtFinType1 0
## BsmtFinSF1 0
## BsmtFinType2 0
## BsmtFinSF2 0
## BsmtUnfSF 0
## TotalBsmtSF 0
## Heating 0
## HeatingQC 0
## CentralAir 0
## Electrical 0
## X1stFlrSF 0
## X2ndFlrSF 0
## LowQualFinSF 0
## GrLivArea 0
## BsmtFullBath 0
## BsmtHalfBath 0
## FullBath 0
## HalfBath 0
## BedroomAbvGr 0
## KitchenAbvGr 0
## KitchenQual 0
## TotRmsAbvGrd 0
## Functional 0
## Fireplaces 0
## FireplaceQu 0
## GarageType 0
## GarageYrBlt 0
## GarageFinish 0
## GarageCars 0
## GarageArea 0
## GarageQual 0
## GarageCond 0
## PavedDrive 0
## WoodDeckSF 0
## OpenPorchSF 0
## EnclosedPorch 0
## X3SsnPorch 0
## ScreenPorch 0
## PoolArea 0
## PoolQC 0
## Fence 0
## MiscFeature 0
## MiscVal 0
## MoSold 0
## YrSold 0
## SaleType 0
## SaleCondition 0
## SalePrice 0
data.frame(num_missing=colSums(is.na(housing_test)))
## num_missing
## Id 0
## MSSubClass 0
## MSZoning 4
## LotFrontage 227
## LotArea 0
## Street 0
## Alley 1352
## LotShape 0
## LandContour 0
## Utilities 2
## LotConfig 0
## LandSlope 0
## Neighborhood 0
## Condition1 0
## Condition2 0
## BldgType 0
## HouseStyle 0
## OverallQual 0
## OverallCond 0
## YearBuilt 0
## YearRemodAdd 0
## RoofStyle 0
## RoofMatl 0
## Exterior1st 1
## Exterior2nd 1
## MasVnrType 16
## MasVnrArea 15
## ExterQual 0
## ExterCond 0
## Foundation 0
## BsmtQual 44
## BsmtCond 45
## BsmtExposure 44
## BsmtFinType1 0
## BsmtFinSF1 1
## BsmtFinType2 0
## BsmtFinSF2 1
## BsmtUnfSF 1
## TotalBsmtSF 1
## Heating 0
## HeatingQC 0
## CentralAir 0
## Electrical 0
## X1stFlrSF 0
## X2ndFlrSF 0
## LowQualFinSF 0
## GrLivArea 0
## BsmtFullBath 2
## BsmtHalfBath 2
## FullBath 0
## HalfBath 0
## BedroomAbvGr 0
## KitchenAbvGr 0
## KitchenQual 1
## TotRmsAbvGrd 0
## Functional 2
## Fireplaces 0
## FireplaceQu 0
## GarageType 0
## GarageYrBlt 0
## GarageFinish 0
## GarageCars 0
## GarageArea 0
## GarageQual 0
## GarageCond 0
## PavedDrive 0
## WoodDeckSF 0
## OpenPorchSF 0
## EnclosedPorch 0
## X3SsnPorch 0
## ScreenPorch 0
## PoolArea 0
## PoolQC 0
## Fence 0
## MiscFeature 0
## MiscVal 0
## MoSold 0
## YrSold 0
## SaleType 0
## SaleCondition 0
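Since most columns in the two tables above have zero missing values, it can be easier to read a filtered version that keeps only the columns with at least one NA. A minimal sketch on a made-up data frame (`demo` is a hypothetical stand-in for housing_train or housing_test):

```r
# Made-up data frame with missing values in two of three columns
demo <- data.frame(a = c(1, NA, 3),
                   b = c("x", "y", "z"),
                   c = c(NA, NA, 1))

# Count missing values per column, keep the non-zero counts, largest first
missing_counts <- colSums(is.na(demo))
missing_only <- sort(missing_counts[missing_counts > 0], decreasing = TRUE)
missing_only
```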
The following code replaces the remaining missing values with 0.
housing_train$LotFrontage[is.na(housing_train$LotFrontage)] <- 0
sum(is.na(housing_train$LotFrontage))
## [1] 0
housing_test$LotFrontage[is.na(housing_test$LotFrontage)] <- 0
sum(is.na(housing_test$LotFrontage))
## [1] 0
housing_train$BsmtQual[is.na(housing_train$BsmtQual)] <- 0
sum(is.na(housing_train$BsmtQual))
## [1] 0
housing_test$BsmtQual[is.na(housing_test$BsmtQual)] <- 0
sum(is.na(housing_test$BsmtQual))
## [1] 0
housing_train$MasVnrType[is.na(housing_train$MasVnrType)] <- 0
sum(is.na(housing_train$MasVnrType))
## [1] 0
housing_test$MasVnrType[is.na(housing_test$MasVnrType)] <- 0
sum(is.na(housing_test$MasVnrType))
## [1] 0
housing_train$MasVnrArea[is.na(housing_train$MasVnrArea)] <- 0
sum(is.na(housing_train$MasVnrArea))
## [1] 0
housing_test$MasVnrArea[is.na(housing_test$MasVnrArea)] <- 0
sum(is.na(housing_test$MasVnrArea))
## [1] 0
housing_train$BsmtCond[is.na(housing_train$BsmtCond)] <- 0
sum(is.na(housing_train$BsmtCond))
## [1] 0
housing_test$BsmtCond[is.na(housing_test$BsmtCond)] <- 0
sum(is.na(housing_test$BsmtCond))
## [1] 0
housing_train$BsmtExposure[is.na(housing_train$BsmtExposure)] <- 0
sum(is.na(housing_train$BsmtExposure))
## [1] 0
housing_test$BsmtExposure[is.na(housing_test$BsmtExposure)] <- 0
sum(is.na(housing_test$BsmtExposure))
## [1] 0
housing_train$MiscFeature[is.na(housing_train$MiscFeature)] <- 0
sum(is.na(housing_train$MiscFeature))
## [1] 0
housing_test$GarageQual[is.na(housing_test$GarageQual)] <- 0
sum(is.na(housing_test$GarageQual))
## [1] 0
housing_test$MSZoning[is.na(housing_test$MSZoning)] <- 0
sum(is.na(housing_test$MSZoning))
## [1] 0
housing_test$Exterior1st[is.na(housing_test$Exterior1st)] <- 0
sum(is.na(housing_test$Exterior1st))
## [1] 0
housing_test$Exterior2nd[is.na(housing_test$Exterior2nd)] <- 0
sum(is.na(housing_test$Exterior2nd))
## [1] 0
housing_test$BsmtFinSF1[is.na(housing_test$BsmtFinSF1)] <- 0
sum(is.na(housing_test$BsmtFinSF1))
## [1] 0
housing_test$BsmtFinSF2[is.na(housing_test$BsmtFinSF2)] <- 0
sum(is.na(housing_test$BsmtFinSF2 ))
## [1] 0
housing_test$BsmtUnfSF[is.na(housing_test$BsmtUnfSF)] <- 0
sum(is.na(housing_test$BsmtUnfSF))
## [1] 0
housing_test$BsmtFullBath[is.na(housing_test$BsmtFullBath)] <- 0
sum(is.na(housing_test$BsmtFullBath))
## [1] 0
housing_test$BsmtHalfBath[is.na(housing_test$BsmtHalfBath)] <- 0
sum(is.na(housing_test$BsmtHalfBath))
## [1] 0
housing_test$KitchenQual[is.na(housing_test$KitchenQual)] <- 0
sum(is.na(housing_test$KitchenQual))
## [1] 0
housing_test$Functional[is.na(housing_test$Functional)] <- 0
sum(is.na(housing_test$Functional))
## [1] 0
#housing_train$Alley[is.na(housing_train$Alley)] <- 0
#sum(is.na(housing_train$Alley))
#housing_test$Alley[is.na(housing_test$Alley)] <- 0
#sum(is.na(housing_test$Alley))
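The zero fill above follows the same two-line pattern for every column, so it could also be written as a loop over column names. A minimal sketch on a made-up data frame; `demo` and `cols_to_zero` are hypothetical names, and the column list would need to match the columns that actually contain NAs.

```r
# Made-up data frame with gaps in two numeric columns
demo <- data.frame(Id = 1:3,
                   LotFrontage = c(60, NA, 80),
                   MasVnrArea = c(NA, 0, 120))

# Columns whose missing values should become 0
cols_to_zero <- c("LotFrontage", "MasVnrArea")

for (col in cols_to_zero) {
  demo[[col]][is.na(demo[[col]])] <- 0
}

# Check that no missing values remain
colSums(is.na(demo))
```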
As the majority of the properties in the data did NOT have an Alley (the value is missing for 1,369 training rows and 1,352 test rows), that variable will be excluded from our model.
trainAlley <- housing_train$Alley
testAlley <- housing_test$Alley
housing_train$Alley <- NULL
housing_test$Alley <- NULL
As previously stated, the data was fairly clean to begin with, and the remainder of the missing data was replaced with zeros.
Once we cleaned the data, we needed to visualize it to get a baseline understanding, at a very basic level, of which variables might work and which might not. To do this, we loaded the following packages:
#Libraries for the next visualizations
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.2.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.1
library(ggplot2)
correlations <- cor(housing_train[,c(2:15, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
Looking at positive correlations, the variables that show the strongest correlation are MSSubClass and BldgType, followed by LandSlope and LotArea. Looking at negative correlations, the stronger ones are between LandContour and LandSlope, and between BldgType and both LotFrontage and LotArea. Turning to SalePrice, the variables MSSubClass, MSZoning, LotShape, LotConfig, and BldgType have a negative correlation with sale price, while LotFrontage, LotArea, Neighborhood, and Condition have a positive correlation.
correlations <- cor(housing_train[,c(16:26, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
Looking at the variables that have a greater correlation with SalePrice: OverallQual, YearBuilt, YearRemodAdd, MasVnrArea, and RoofStyle have a positive correlation. The only variable that has a negative correlation with SalePrice is OverallCond.
correlations <- cor(housing_train[,c(27:40, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
This set of variables shows that ExterQual, BsmtQual, and HeatingQC have a negative correlation with SalePrice. These quality columns are coded with Ex as 1, so a negative correlation here is consistent with better-rated homes selling for more. TotalBsmtSF has a positive correlation, followed by BsmtFinSF1 and Foundation.
correlations <- cor(housing_train[,c(41:60, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
All the variables related to living space and quality have a positive correlation; the only living space with a negative correlation is the kitchen.
correlations <- cor(housing_train[,c(61:79, 80)], use="everything")
corrplot::corrplot(correlations, method="circle", type="lower", sig.level = 0.01, insig = "blank")
Finally, garage area is the last variable with a stronger correlation to SalePrice. With the exception of the fireplace, the remaining amenities do not appear to have a strong relationship to SalePrice.
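Reading the strongest relationships off several correlation plots can be cross-checked numerically by ranking the predictors by the absolute value of their correlation with SalePrice. The sketch below runs on synthetic data, not the housing data; `demo`, `x1`, and `x2` are hypothetical names, and the seed is arbitrary.

```r
# Synthetic data: SalePrice depends strongly on x1 and weakly on x2
set.seed(1)
demo <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
demo$SalePrice <- 3 * demo$x1 - 0.5 * demo$x2 + rnorm(50)

# Correlation of each predictor with SalePrice, as a named vector
cors <- cor(demo[, c("x1", "x2")], demo$SalePrice)[, 1]

# Strongest relationships first, regardless of sign
ranked <- cors[order(abs(cors), decreasing = TRUE)]
ranked
```

On the real data, the same call would take the numeric predictor columns of housing_train in place of `x1` and `x2`.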
Before loading the car package for more advanced scatterplots, we drew a simple scatterplot matrix with base R's pairs() function.
pairs(SalePrice~YearBuilt+OverallQual+TotalBsmtSF+GrLivArea,data=housing_train,
main="Simple Scatterplot Matrix")
Looking at the above scatterplots, the data seem to be well distributed, and the plots also show how the variables correlate.
#install.packages('carData')
library(car)
## Warning: package 'car' was built under R version 4.2.1
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.1
scatterplot(SalePrice ~ YearBuilt, data=housing_train, xlab="Year Built", ylab="Sale Price", grid=FALSE)
The above chart compares the sale price with the year the house was built. We can see a correlation indicating that the newer the home, the higher the SalePrice.
scatterplot(SalePrice ~ YrSold, data=housing_train, xlab="Year Sold", ylab="Sale Price", grid=FALSE)
Interestingly, the Year Sold vs. Sale Price plot shows how the 2008 dip in the housing market affected sale prices: the data show a small decline from 2007 to 2008, followed by a slight increase in 2009, after which SalePrice seems to stabilize.
scatterplot(SalePrice ~ LotArea, data=housing_train, xlab="Lot Area", ylab="Sale Price", grid=FALSE)
The chart shows a non-linear relationship between the size of the lot and the sale price, indicating that other house factors have a greater weight on the price of the house than lot size alone.
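One way to probe a non-linear relationship like this is to compare the correlation on the raw scale with the correlation after a log transformation of the predictor. The sketch below uses synthetic data built so that price grows with the logarithm of lot area; it says nothing about the actual housing data, and the variable names and seed are assumptions.

```r
# Synthetic data: price rises with log(lot area) plus noise
set.seed(42)   # arbitrary seed for the synthetic data
lot_area <- exp(runif(200, log(1000), log(100000)))
sale_price <- 20000 * log(lot_area) + rnorm(200, sd = 5000)

# Linear correlation before and after the log transformation
cor_raw <- cor(lot_area, sale_price)
cor_log <- cor(log(lot_area), sale_price)
c(raw = cor_raw, log = cor_log)
```

If the log-scale correlation were clearly higher on the real data as well, that would point to curvature in the relationship and not only to the influence of other variables.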
scatterplot(SalePrice ~ X1stFlrSF, data=housing_train, xlab="1st Floor Square Foot", ylab="Sale Price", grid=FALSE)
As a final look, the data indicate that first-floor square footage has a strong relationship to SalePrice, but outliers still exist, indicating that other variables are also important.
#Data partition using caret partition function.
#install.packages('lattice')
library(caret)
## Warning: package 'caret' was built under R version 4.2.2
## Loading required package: lattice
#Packages for RMSE
#install.packages('Metrics')
library(Metrics)
## Warning: package 'Metrics' was built under R version 4.2.2
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
##
## precision, recall
#Naming the Sale Price as Outcome
outcome <- housing_train$SalePrice
#Partition the data to be 60% train and 40% test
partition <- createDataPartition(y=outcome, p=.6, list=FALSE)
train <- housing_train[partition,]
test <- housing_train[-partition,]
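Because the partition is drawn at random, the rows in each split change from run to run unless a seed is set first. The sketch below shows the idea with base R's sample() on a made-up data frame. It is a plain random 60/40 split, not caret's createDataPartition(), which samples within groups of the outcome; the seed value and the `demo_` names are arbitrary choices for illustration.

```r
# Made-up data frame standing in for housing_train
demo <- data.frame(Id = 1:10,
                   SalePrice = seq(100000, 190000, by = 10000))

set.seed(123)                               # fixes the random draw
n_train <- floor(0.6 * nrow(demo))          # 60% of the rows
train_rows <- sample(seq_len(nrow(demo)), size = n_train)

demo_train   <- demo[train_rows, ]          # 60% used for fitting
demo_holdout <- demo[-train_rows, ]         # 40% held out for evaluation
```

Calling set.seed() with the same value before createDataPartition() would make that split repeatable in the same way.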
NOTE: After testing the models several times with different train and test sets, we found that reducing the train set improved XGBoost but increased the error for linear regression. XGBoost predicted best with 60% of the data used for training; using 50% as train data increased our prediction error by 2%. For that reason, the first part of the project uses a different percentage of train data than the XGBoost section.
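The prediction error compared across these splits is the RMSE, the square root of the mean squared difference between actual and predicted prices. The sketch below writes it out in base R on made-up numbers; `rmse_manual` is a hypothetical helper, and the rmse() function in the Metrics package loaded above computes the same quantity.

```r
# Hypothetical helper: root mean squared error
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Made-up actual and predicted sale prices
actual    <- c(200000, 150000, 320000)
predicted <- c(210000, 140000, 300000)

rmse_dollars <- rmse_manual(actual, predicted)            # error in dollars
rmse_log     <- rmse_manual(log(actual), log(predicted))  # error on the log scale
```

The log-scale version weights relative errors, so a 10% miss counts about the same on a cheap home as on an expensive one.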
Step one: create a linear model to identify variables that share a strong relationship with the Sale Price.
LM_model1 <- lm(SalePrice ~., data=train)
summary(LM_model1)
##
## Call:
## lm(formula = SalePrice ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -135253 -12988 -1143 12562 160803
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.705e+06 1.385e+06 1.232 0.218404
## Id -2.715e+00 2.118e+00 -1.282 0.200240
## MSSubClass -6.641e+01 4.762e+01 -1.395 0.163549
## MSZoning -1.304e+03 1.525e+03 -0.855 0.392666
## LotFrontage 3.017e+01 2.986e+01 1.010 0.312566
## LotArea 5.554e-01 1.042e-01 5.331 1.27e-07 ***
## Street 4.789e+04 1.424e+04 3.364 0.000805 ***
## LotShape -1.190e+03 7.147e+02 -1.665 0.096245 .
## LandContour -2.322e+03 1.407e+03 -1.650 0.099257 .
## Utilities NA NA NA NA
## LotConfig -7.665e+01 5.702e+02 -0.134 0.893090
## LandSlope -4.710e+02 4.150e+03 -0.113 0.909674
## Neighborhood 5.082e+00 1.654e+02 0.031 0.975500
## Condition1 1.056e+02 1.017e+03 0.104 0.917315
## Condition2 -3.593e+03 4.204e+03 -0.855 0.393039
## BldgType -1.371e+03 1.562e+03 -0.877 0.380518
## HouseStyle 4.851e+02 6.885e+02 0.705 0.481307
## OverallQual 8.057e+03 1.185e+03 6.797 2.08e-11 ***
## OverallCond 3.776e+03 1.125e+03 3.357 0.000825 ***
## YearBuilt 2.003e+02 7.536e+01 2.658 0.008027 **
## YearRemodAdd 7.123e+01 6.965e+01 1.023 0.306786
## RoofStyle 1.457e+03 1.163e+03 1.252 0.210794
## RoofMatl -2.531e+03 1.444e+03 -1.752 0.080159 .
## Exterior1st -1.222e+03 5.740e+02 -2.129 0.033523 *
## Exterior2nd 7.309e+02 5.140e+02 1.422 0.155417
## MasVnrType 6.813e+03 1.564e+03 4.357 1.49e-05 ***
## MasVnrArea 2.670e+01 6.028e+00 4.430 1.08e-05 ***
## ExterQual -1.392e+04 2.059e+03 -6.760 2.65e-11 ***
## ExterCond -7.138e+01 1.336e+03 -0.053 0.957408
## Foundation -7.836e+01 1.701e+03 -0.046 0.963261
## BsmtQual -5.945e+03 1.440e+03 -4.128 4.04e-05 ***
## BsmtCond 3.271e+03 1.415e+03 2.312 0.021019 *
## BsmtExposure -2.684e+03 8.953e+02 -2.997 0.002806 **
## BsmtFinType1 3.329e+02 6.475e+02 0.514 0.607255
## BsmtFinSF1 4.702e+01 5.757e+00 8.167 1.23e-15 ***
## BsmtFinType2 -8.020e+02 1.189e+03 -0.674 0.500243
## BsmtFinSF2 3.304e+01 8.848e+00 3.734 0.000202 ***
## BsmtUnfSF 2.528e+01 5.436e+00 4.651 3.86e-06 ***
## TotalBsmtSF NA NA NA NA
## Heating -1.093e+03 3.007e+03 -0.363 0.716429
## HeatingQC -2.456e+02 6.362e+02 -0.386 0.699545
## CentralAir 5.512e+02 4.495e+03 0.123 0.902432
## Electrical -2.493e+02 9.599e+02 -0.260 0.795131
## X1stFlrSF 5.211e+01 6.910e+00 7.541 1.26e-13 ***
## X2ndFlrSF 5.031e+01 5.316e+00 9.463 < 2e-16 ***
## LowQualFinSF -2.744e+01 2.329e+01 -1.178 0.238983
## GrLivArea NA NA NA NA
## BsmtFullBath 1.834e+03 2.573e+03 0.713 0.476306
## BsmtHalfBath 1.399e+03 3.973e+03 0.352 0.724788
## FullBath 6.860e+02 2.846e+03 0.241 0.809573
## HalfBath 3.822e+03 2.724e+03 1.403 0.160882
## BedroomAbvGr -7.348e+03 1.730e+03 -4.248 2.41e-05 ***
## KitchenAbvGr -2.902e+04 5.661e+03 -5.126 3.72e-07 ***
## KitchenQual -5.072e+03 1.537e+03 -3.300 0.001009 **
## TotRmsAbvGrd 4.582e+03 1.255e+03 3.651 0.000278 ***
## Functional 4.990e+03 9.536e+02 5.233 2.13e-07 ***
## Fireplaces 7.823e+03 2.920e+03 2.679 0.007531 **
## FireplaceQu -1.699e+03 8.524e+02 -1.993 0.046607 *
## GarageType 1.883e+03 6.788e+02 2.775 0.005653 **
## GarageYrBlt -1.447e+01 6.365e+00 -2.274 0.023228 *
## GarageFinish -3.868e+02 1.525e+03 -0.254 0.799851
## GarageCars 3.996e+03 2.903e+03 1.377 0.168989
## GarageArea 1.394e+01 9.539e+00 1.461 0.144284
## GarageQual -4.068e+02 1.970e+03 -0.207 0.836445
## GarageCond 2.434e+03 2.372e+03 1.026 0.305259
## PavedDrive 2.473e+03 2.090e+03 1.183 0.236985
## WoodDeckSF 1.546e+01 7.727e+00 2.001 0.045705 *
## OpenPorchSF 2.649e+00 1.557e+01 0.170 0.864900
## EnclosedPorch 7.252e+00 1.693e+01 0.428 0.668419
## X3SsnPorch 3.606e+00 3.281e+01 0.110 0.912509
## ScreenPorch 4.706e+01 1.761e+01 2.673 0.007670 **
## PoolArea 1.907e+03 1.292e+02 14.762 < 2e-16 ***
## PoolQC -4.095e+05 2.263e+04 -18.097 < 2e-16 ***
## Fence 2.516e+02 9.384e+02 0.268 0.788655
## MiscFeature -3.476e+02 1.821e+03 -0.191 0.848668
## MiscVal 3.859e-03 1.535e+00 0.003 0.997994
## MoSold -1.098e+02 3.325e+02 -0.330 0.741330
## YrSold -1.135e+03 6.837e+02 -1.661 0.097146 .
## SaleType -1.277e+03 6.002e+02 -2.127 0.033714 *
## SaleCondition 3.742e+03 8.731e+02 4.286 2.04e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25110 on 801 degrees of freedom
## Multiple R-squared: 0.9058, Adjusted R-squared: 0.8969
## F-statistic: 101.4 on 76 and 801 DF, p-value: < 2.2e-16
#The competition asks to use RMSE for predicting error.
prediction_lm1 <- predict(LM_model1, test, type="response")
## Warning in predict.lm(LM_model1, test, type = "response"): prediction from a
## rank-deficient fit may be misleading
model_output <- cbind(test, prediction_lm1)
model_output$log_prediction <- log(model_output$prediction_lm1)
## Warning in log(model_output$prediction_lm1): NaNs produced
model_output$log_SalePrice <- log(model_output$SalePrice)
#Test with RMSE
rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] NaN
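In the run shown above, the model has a multiple R-squared of 0.9058 (adjusted 0.8969), and 32 of the 76 estimated coefficients are significant at the 5% level. The log-scale RMSE comes out as NaN because at least one predicted price is negative, so its logarithm is undefined. The rank-deficiency warning reflects the three predictors (Utilities, TotalBsmtSF, and GrLivArea) that are exact linear combinations of other columns, so no coefficient is estimated for them.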
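The NaN result can be reproduced and guarded against with a small helper. The following is a minimal sketch, not part of the original analysis: `log_rmse` and its `floor_value` argument are hypothetical names, the toy vectors are made up, and raising non-positive predictions to 1 is only one possible convention (dropping those rows or modeling the log price directly are others).

```r
# Hypothetical helper (an assumption, not from the original analysis):
# RMSE between log(actual) and log(predicted), guarding against
# non-positive predictions whose logarithm is undefined.
log_rmse <- function(actual, predicted, floor_value = 1) {
  n_bad <- sum(predicted <= 0)
  if (n_bad > 0) {
    warning(n_bad, " non-positive prediction(s) raised to ", floor_value)
    predicted <- pmax(predicted, floor_value)
  }
  sqrt(mean((log(actual) - log(predicted))^2))
}

# Toy data, made up for illustration only
actual    <- c(100000, 150000, 200000)
predicted <- c(110000, 140000, -5000)

# Unguarded version: log(-5000) is NaN, which propagates through mean()
naive <- suppressWarnings(sqrt(mean((log(actual) - log(predicted))^2)))

# Guarded version: finite, but heavily penalized by the floored prediction
guarded <- suppressWarnings(log_rmse(actual, predicted))
```

A single negative prediction is enough to turn the whole metric into NaN, which is why counting the offending rows is a useful first check.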
LM_model2 <- lm(SalePrice ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data=train)
summary(LM_model2)
##
## Call:
## lm(formula = SalePrice ~ LotArea + Street + Neighborhood + Condition1 +
## BldgType + OverallCond + OverallQual + YearBuilt + RoofMatl +
## MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF + X1stFlrSF +
## X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual + TotRmsAbvGrd +
## Fireplaces + GarageArea + GarageQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -471136 -15782 -1509 12844 206921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.556e+05 1.155e+05 -7.409 3.07e-13 ***
## LotArea 6.151e-01 1.171e-01 5.251 1.91e-07 ***
## Street 3.629e+04 1.805e+04 2.011 0.044676 *
## Neighborhood 5.222e+02 2.000e+02 2.611 0.009189 **
## Condition1 1.087e+03 1.269e+03 0.857 0.391611
## BldgType -4.552e+03 1.119e+03 -4.068 5.18e-05 ***
## OverallCond 4.629e+03 1.179e+03 3.927 9.29e-05 ***
## OverallQual 1.337e+04 1.449e+03 9.228 < 2e-16 ***
## YearBuilt 4.318e+02 5.607e+01 7.701 3.74e-14 ***
## RoofMatl 1.722e+03 1.798e+03 0.958 0.338502
## MasVnrArea 2.383e+01 7.054e+00 3.377 0.000765 ***
## ExterQual -1.657e+04 2.551e+03 -6.495 1.41e-10 ***
## BsmtFinSF1 8.057e+00 4.574e+00 1.761 0.078512 .
## BsmtUnfSF -2.191e+00 4.455e+00 -0.492 0.623068
## X1stFlrSF 3.657e+01 6.886e+00 5.312 1.39e-07 ***
## X2ndFlrSF 2.220e+01 5.172e+00 4.293 1.96e-05 ***
## BedroomAbvGr -7.573e+03 2.093e+03 -3.618 0.000315 ***
## KitchenAbvGr -1.980e+04 6.675e+03 -2.966 0.003097 **
## KitchenQual -8.949e+03 1.924e+03 -4.650 3.84e-06 ***
## TotRmsAbvGrd 7.986e+03 1.550e+03 5.153 3.19e-07 ***
## Fireplaces 7.303e+03 2.201e+03 3.318 0.000945 ***
## GarageArea 3.339e+01 8.377e+00 3.986 7.29e-05 ***
## GarageQual -1.140e+03 1.102e+03 -1.035 0.301007
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33440 on 855 degrees of freedom
## Multiple R-squared: 0.8217, Adjusted R-squared: 0.8171
## F-statistic: 179.1 on 22 and 855 DF, p-value: < 2.2e-16
#The competition asks to use RMSE for predicting error.
prediction_lm <- predict(LM_model2, test, type="response")
model_output <- cbind(test, prediction_lm)
model_output$log_prediction <- log(model_output$prediction_lm)
model_output$log_SalePrice <- log(model_output$SalePrice)
#Test with RMSE
rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] 0.16122
The new model, with a reduced number of variables, has a lower R-squared of 0.8217 but produces a valid RMSE of 0.1612, since none of its predictions are negative.
We run a new set of plots to determine if data transformation is necessary.
plot(LM_model2$fitted.values, LM_model2$residuals, pch = 20, col = "blue")
abline(h = 0)
The residuals are centered around 0, with more scatter as the fitted value increases.
#Package for BoxCox
library(MASS)
To further validate, we utilize boxcox.
boxcox(LM_model2)
Based on our boxcox output, we determine that using a log transformation will improve the linear regression model.
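As a rough illustration of how the Box-Cox output leads to that decision: the procedure profiles the log-likelihood over a grid of power parameters lambda, and a maximum near 0 points to a log transformation. The sketch below uses simulated data (not the housing data), and the names `toy`, `fit`, `bc`, and `best_lambda` are introduced only for this example.

```r
library(MASS)  # provides boxcox()

set.seed(1)
# Simulated stand-in for the housing data (made up for illustration):
# the response is log-normal, so the "true" Box-Cox lambda is 0.
toy <- data.frame(area = runif(200, 500, 3000))
toy$price <- exp(11 + 0.0005 * toy$area + rnorm(200, sd = 0.15))

fit <- lm(price ~ area, data = toy)

# plotit = FALSE returns the profile log-likelihood instead of drawing it
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)

# Lambda with the highest log-likelihood; a value near 0 suggests log(y)
best_lambda <- bc$x[which.max(bc$y)]
best_lambda
```

Reading the maximum off the plot, as done in the analysis, is equivalent; extracting it numerically just makes the choice explicit.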
model3 <- lm(I(log(SalePrice)) ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data=train)
summary(model3)
##
## Call:
## lm(formula = I(log(SalePrice)) ~ LotArea + Street + Neighborhood +
## Condition1 + BldgType + OverallCond + OverallQual + YearBuilt +
## RoofMatl + MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF +
## X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual +
## TotRmsAbvGrd + Fireplaces + GarageArea + GarageQual, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.17438 -0.06670 0.00648 0.08324 0.51722
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.161e+00 5.496e-01 5.752 1.23e-08 ***
## LotArea 2.531e-06 5.574e-07 4.541 6.41e-06 ***
## Street 1.230e-01 8.589e-02 1.432 0.152433
## Neighborhood 2.007e-03 9.518e-04 2.108 0.035299 *
## Condition1 7.016e-03 6.038e-03 1.162 0.245587
## BldgType -1.855e-02 5.325e-03 -3.483 0.000520 ***
## OverallCond 4.902e-02 5.610e-03 8.739 < 2e-16 ***
## OverallQual 8.439e-02 6.894e-03 12.240 < 2e-16 ***
## YearBuilt 3.754e-03 2.669e-04 14.067 < 2e-16 ***
## RoofMatl 1.393e-02 8.558e-03 1.628 0.103828
## MasVnrArea 5.455e-06 3.357e-05 0.162 0.870952
## ExterQual -2.429e-02 1.214e-02 -2.000 0.045797 *
## BsmtFinSF1 4.315e-05 2.177e-05 1.982 0.047763 *
## BsmtUnfSF 1.531e-06 2.120e-05 0.072 0.942456
## X1stFlrSF 2.054e-04 3.277e-05 6.269 5.75e-10 ***
## X2ndFlrSF 1.538e-04 2.461e-05 6.249 6.50e-10 ***
## BedroomAbvGr 3.905e-03 9.962e-03 0.392 0.695188
## KitchenAbvGr -6.512e-02 3.177e-02 -2.050 0.040671 *
## KitchenQual -3.287e-02 9.158e-03 -3.589 0.000351 ***
## TotRmsAbvGrd 1.904e-02 7.376e-03 2.582 0.010001 *
## Fireplaces 5.295e-02 1.048e-02 5.055 5.26e-07 ***
## GarageArea 1.705e-04 3.987e-05 4.278 2.10e-05 ***
## GarageQual 1.180e-02 5.242e-03 2.251 0.024629 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1591 on 855 degrees of freedom
## Multiple R-squared: 0.846, Adjusted R-squared: 0.842
## F-statistic: 213.5 on 22 and 855 DF, p-value: < 2.2e-16
prediction3 <- predict(model3, test, type="response")
model_output <- cbind(test, prediction3)
model_output$log_prediction3 <- log(model_output$prediction3)
model_output$log_SalePrice3 <- log(model_output$SalePrice)
#Test with RMSE
rmse(model_output$log_SalePrice3, model_output$log_prediction3)
## [1] 9.547395
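The R-squared has improved to 0.846 (still below the 0.9058 of our original model), but the RMSE of 9.547 is far out of line with our previous models. This is an artifact of the evaluation code rather than of the model: model3 already predicts log(SalePrice), so taking log() of its predictions a second time compares log(log(price)) against log(price). To further adjust our model, we use influence diagnostics (leverage, in the spirit of Cook's distance) to determine whether any outliers influence the model. In turn, this helps us decide whether any observations or variables should be excluded.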
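The size of that RMSE is what one would expect from applying log() to predictions that are already on the log scale. The sketch below reproduces the effect on simulated data (not the housing data); `toy`, `fit`, `pred_log`, and the two RMSE variables are names made up for this example.

```r
set.seed(1)
# Simulated stand-in for the housing data (made up for illustration)
toy <- data.frame(area = runif(200, 500, 3000))
toy$price <- exp(11 + 0.0004 * toy$area + rnorm(200, sd = 0.15))

# Model fitted on the log of the response, as with model3
fit <- lm(log(price) ~ area, data = toy)
pred_log <- predict(fit, toy)  # predictions are already log(price)

# Correct comparison: log(actual) against the predictions as they are
rmse_correct <- sqrt(mean((log(toy$price) - pred_log)^2))

# Double-log comparison: log(actual) against log(log-scale predictions)
rmse_double_log <- sqrt(mean((log(toy$price) - log(pred_log))^2))

c(correct = rmse_correct, double_log = rmse_double_log)
```

With log prices around 12, the second log brings the predictions down to roughly 2.5, so the double-log RMSE lands near 9 regardless of how good the model is.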
mean(hatvalues(model3))
## [1] 0.0261959
qqnorm(LM_model2$residuals, main = "LM_model2")
qqline(LM_model2$residuals)
abline(h = 0, col = "grey")
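The Q-Q plot checks whether the residuals of LM_model2 are approximately normal, while the mean hat value of 0.0262 (23 coefficients divided by 878 observations) gives the baseline against which the leverage of individual data points is judged. Overall, we do not need to remove any data points from the model.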
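For a linear model the mean hat value always equals p / n, the number of estimated coefficients over the number of observations, so what matters is which individual points sit well above it. The sketch below shows one way to flag such points on simulated data; the cutoffs 2p/n for leverage and 4/n for Cook's distance are common rules of thumb, not values taken from the original analysis, and all object names are made up for this example.

```r
set.seed(2)
# Simulated data (made up for illustration)
toy <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
toy$y <- 1 + 2 * toy$x1 - toy$x2 + rnorm(100)
toy$x1[1] <- 8  # plant one observation far from the rest of x1

fit <- lm(y ~ x1 + x2, data = toy)

h <- hatvalues(fit)
p <- length(coef(fit))  # estimated coefficients, including the intercept
n <- nrow(toy)

mean_leverage <- mean(h)                 # equals p / n by construction
high_leverage <- which(h > 2 * p / n)    # rule-of-thumb leverage cutoff
influential   <- which(cooks.distance(fit) > 4 / n)  # rule-of-thumb cutoff

mean_leverage
high_leverage
influential
```

Leverage only measures how unusual a row's predictors are; Cook's distance combines that with the size of the residual, which is why the two lists can differ.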
For the bagging model, no preparation was required since the data had already been converted from categorical to numeric. The model starts with 500 bootstrap samples, and that number is reduced as we see fit.
#Package for bagging
#install.packages('ipred')
library(ipred)
## Warning: package 'ipred' was built under R version 4.2.1
house_bag <- bagging(formula = SalePrice ~., data = train, nbagg = 500)
house_bag
##
## Bagging regression trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = SalePrice ~ ., data = train, nbagg = 500)
Out-of-bag prediction:
house_bag_oob <- bagging(formula = SalePrice~., data = train, coob = T, nbagg = 500)
house_bag_oob
##
## Bagging regression trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = SalePrice ~ ., data = train, coob = T,
## nbagg = 500)
##
## Out-of-bag estimate of root mean squared error: 35960.38
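The OOB error is high at 35,960.38, on the same order as the residual standard errors of the untransformed linear regressions (25,110 for LM_model1 and 33,440 for LM_model2), although those are in-sample figures and not directly comparable.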
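To make the out-of-bag idea concrete: each bootstrap sample leaves out roughly a third of the rows, and each row is scored only by the trees that never saw it. The sketch below computes an OOB RMSE by hand with `rpart` trees on simulated data. It illustrates the principle and is not a reimplementation of `ipred::bagging`; the data and all object names are made up for this example.

```r
library(rpart)  # regression trees used as the base learner

set.seed(42)
# Simulated data (made up for illustration)
n <- 300
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 10 * toy$x1 + 5 * (toy$x2 > 0.5) + rnorm(n)

n_bags    <- 50
oob_sum   <- numeric(n)   # running sum of OOB predictions per row
oob_count <- integer(n)   # number of trees for which each row was OOB

for (b in seq_len(n_bags)) {
  in_bag     <- sample(n, replace = TRUE)      # bootstrap sample
  out_of_bag <- setdiff(seq_len(n), in_bag)    # rows this tree never saw
  tree <- rpart(y ~ x1 + x2, data = toy[in_bag, ])
  oob_sum[out_of_bag]   <- oob_sum[out_of_bag] + predict(tree, toy[out_of_bag, ])
  oob_count[out_of_bag] <- oob_count[out_of_bag] + 1L
}

# Average the OOB predictions for rows that were left out at least once
seen     <- oob_count > 0
oob_pred <- oob_sum[seen] / oob_count[seen]
oob_rmse <- sqrt(mean((toy$y[seen] - oob_pred)^2))
oob_rmse
```

Because every prediction comes from trees that did not train on that row, the OOB RMSE behaves like a built-in validation error and needs no separate hold-out set.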
The out-of-bag estimate shows a large error. Looking at the RMSE on the test set:
# Predict using the test set
house_bag_pred_1 <- predict(house_bag_oob, test)
model_output <- cbind(test, house_bag_pred_1)
model_output$log_prediction_bag <- log(model_output$house_bag_pred_1)
model_output$log_SalePrice_bag <- log(model_output$SalePrice)
#Test with RMSE
rmse(model_output$log_SalePrice_bag,model_output$log_prediction_bag)
## [1] 0.1924199
The bagged model has a log-scale RMSE of 0.1924 on the held-out data, which is higher than the 0.1612 of LM_model2.
house_bag2 <- bagging(formula = SalePrice ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data = train, nbagg = 500)
house_bag2
##
## Bagging regression trees with 500 bootstrap replications
##
## Call: bagging.data.frame(formula = SalePrice ~ LotArea + Street + Neighborhood +
## Condition1 + BldgType + OverallCond + OverallQual + YearBuilt +
## RoofMatl + MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF +
## X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual +
## TotRmsAbvGrd + Fireplaces + GarageArea + GarageQual, data = train,
## nbagg = 500)
# Predict using the test set
house_bag_pred_2 <- predict(house_bag2, test)
model_output2 <- cbind(test, house_bag_pred_2)
model_output2$log_prediction_bag2 <- log(model_output2$house_bag_pred_2)
model_output2$log_SalePrice_bag2 <- log(model_output2$SalePrice)
#Test with RMSE
rmse(model_output2$log_SalePrice_bag2,model_output2$log_prediction_bag2)
## [1] 0.2042387
Next, we look at how the test error changes with the number of trees.
ntree <- c(1, 3, 5, seq(20, 500, 20))
MSE_test <- rep(0, length(ntree))
for(i in seq_along(ntree)){
bag1 <- bagging(SalePrice~., data = train, nbagg = ntree[i])
bag_pred <- predict(bag1, newdata = test) #named bag_pred so the predict() function is not shadowed
MSE_test[i] <- mean((test$SalePrice - bag_pred)^2)
}
plot(ntree, MSE_test, type = 'l', col = 2, lwd = 2, xaxt = "n")
axis(1, at = ntree, las = 1)
The chart shows a first decline in test error at around 20 trees and the most significant decline at around 200 trees.
house_bag3 <- bagging(formula = SalePrice ~LotArea+Street+Neighborhood+Condition1+BldgType+OverallCond+OverallQual+YearBuilt+RoofMatl+MasVnrArea+ExterQual+BsmtFinSF1+BsmtUnfSF+X1stFlrSF+ X2ndFlrSF+BedroomAbvGr+KitchenAbvGr+KitchenQual+TotRmsAbvGrd+Fireplaces+GarageArea+GarageQual, data = train, nbagg = 250)
house_bag3
##
## Bagging regression trees with 250 bootstrap replications
##
## Call: bagging.data.frame(formula = SalePrice ~ LotArea + Street + Neighborhood +
## Condition1 + BldgType + OverallCond + OverallQual + YearBuilt +
## RoofMatl + MasVnrArea + ExterQual + BsmtFinSF1 + BsmtUnfSF +
## X1stFlrSF + X2ndFlrSF + BedroomAbvGr + KitchenAbvGr + KitchenQual +
## TotRmsAbvGrd + Fireplaces + GarageArea + GarageQual, data = train,
## nbagg = 250)
# Predict using the test set
house_bag_pred_3 <- predict(house_bag3, test)
model_output3 <- cbind(test, house_bag_pred_3)
model_output3$log_prediction_bag3 <- log(model_output3$house_bag_pred_3)
model_output3$log_SalePrice_bag3 <- log(model_output3$SalePrice)
#Test with RMSE
rmse(model_output3$log_SalePrice_bag3,model_output3$log_prediction_bag3)
## [1] 0.2036751
Bagging did not improve the RMSE over previous models: the three bagged models score between 0.1924 and 0.2042, compared with 0.1612 for LM_model2.
#Package for Random Forest
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.2.2
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
house_rf <- randomForest(SalePrice~., data = train, importance = TRUE)
house_rf
##
## Call:
## randomForest(formula = SalePrice ~ ., data = train, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 26
##
## Mean of squared residuals: 862144873
## % Var explained: 85.88
Because random forest samples a random subset of the predictors at each split (26 here) and reports variable importance, there is no need to pre-select the variables that showed correlations previously.
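The value of 26 variables tried at each split follows from the default that `randomForest` uses for regression, one third of the predictors rounded down. A small sketch of that arithmetic follows; the helper name `default_mtry_regression` is made up for this example, and the count of 79 predictors is taken from the LM_model1 output (76 estimated plus 3 undefined coefficients).

```r
# Default number of candidate variables per split for regression forests:
# one third of the predictors, rounded down, and at least 1.
default_mtry_regression <- function(n_predictors) {
  max(floor(n_predictors / 3), 1)
}

# 79 predictors: every column of train except SalePrice (Id included)
n_predictors <- 79
mtry <- default_mtry_regression(n_predictors)
mtry
```

Passing a different `mtry` to `randomForest()` changes how strongly the trees are decorrelated; setting it to the full number of predictors reduces the forest to bagging.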
# Predict using the test set
prediction_rf <- predict(house_rf, test)
model_output_rf <- cbind(test, prediction_rf)
model_output_rf$log_prediction_rf <- log(model_output_rf$prediction_rf)
model_output_rf$log_SalePrice_rf <- log(model_output_rf$SalePrice)
#Test with RMSE
rmse(model_output_rf$log_SalePrice_rf,model_output_rf$log_prediction_rf)
## [1] 0.1440042
The random forest has a smaller prediction error than bagging, with an RMSE of 0.1440.
#Package for XGBoost
#install.packages('xgboost')
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.2.2
Splitting the data again:
#Partition the data to be 90% train and 10% test
partition2 <- createDataPartition(y=outcome, p=.9, list=FALSE)
train <- housing_train[partition2,]
test <- housing_train[-partition2,]
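Because the partition is drawn at random, metrics change from run to run unless a seed is fixed first. The sketch below shows a repeatable 90/10 split in base R. It is a simplification: `createDataPartition` additionally stratifies on the outcome, which plain `sample()` does not, and the seed value and index names are arbitrary choices made for this example.

```r
set.seed(621)  # any fixed seed makes the partition repeatable
n <- 1460      # rows in the training file, per the introduction

# Draw 90% of the row indices for training; the rest form the test set
train_idx <- sample(seq_len(n), size = round(0.9 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

c(train = length(train_idx), test = length(test_idx))
```

Calling `set.seed()` immediately before `createDataPartition()` has the same effect on the stratified split used in the analysis.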
The first step is to transform the data set into a sparse matrix.
#Assemble and format the data - Using Log for Variable Sale Price
train$log_SalePrice <- log(train$SalePrice)
test$log_SalePrice <- log(test$SalePrice)
#Create matrices from the data frames
trainData<- as.matrix(train, rownames.force=NA)
testData<- as.matrix(test, rownames.force=NA)
#Turn the matrices into sparse matrices
train2 <- as(trainData, "sparseMatrix")
test2 <- as(testData, "sparseMatrix")
#colnames(train2)
#colnames(pred_data)
#Cross Validate the model
vars <- c(1:78) #Choose the variables
trainD <- xgb.DMatrix(data = train2[,vars], label = train2[,"SalePrice"]) #Convert to xgb.DMatrix format for space and efficiency
Creating a cross validation model:
#Cross validate the model
cv.sparse <- xgb.cv(data = trainD,
nrounds = 500,
min_child_weight = 0,
max_depth = 10,
eta = 0.04,
subsample = .7,
colsample_bytree = .7,
booster = "gbtree",
eval_metric = "rmse",
print_every_n = 100,
nfold = 4,
nthread = 2,
objective="reg:linear")
## [18:36:13] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:189050.919230+1173.054544 test-rmse:189103.397150+3690.871760
## [101] train-rmse:9093.985430+481.279412 test-rmse:29463.944696+2769.224325
## [201] train-rmse:2628.380058+372.538611 test-rmse:28734.947692+3085.620299
## [301] train-rmse:1076.411601+197.517498 test-rmse:28773.133768+3113.038632
## [401] train-rmse:444.516212+103.406670 test-rmse:28751.591512+3095.618314
## [500] train-rmse:179.000714+63.906102 test-rmse:28760.266901+3094.409819
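The cross-validation log above shows the training RMSE falling all the way to about 179 while the held-out RMSE levels off near 28,700 after roughly 200 rounds, a sign that later rounds mostly fit noise. The sketch below retypes the printed values into a data frame and picks the round with the lowest held-out error. Only every 100th round was printed, so the true minimum may lie between these rows, and the column names are an assumption modeled on the `evaluation_log` that `xgb.cv` returns.

```r
# Values typed in from the xgb.cv output printed above
cv_log <- data.frame(
  iter            = c(1, 101, 201, 301, 401, 500),
  train_rmse_mean = c(189050.919230, 9093.985430, 2628.380058,
                      1076.411601, 444.516212, 179.000714),
  test_rmse_mean  = c(189103.397150, 29463.944696, 28734.947692,
                      28773.133768, 28751.591512, 28760.266901)
)

# Round with the lowest cross-validated RMSE among the printed rows
best_row  <- which.min(cv_log$test_rmse_mean)
best_iter <- cv_log$iter[best_row]

# Gap between held-out and training error: grows as the model overfits
cv_log$gap <- cv_log$test_rmse_mean - cv_log$train_rmse_mean

best_iter
cv_log
```

If the same check is run on the full log, the resulting round count could replace the fixed `nrounds = 500` in the training call.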
#Choose the parameters for the model - tuning the model
param <- list(colsample_bytree = .7, #fraction of features sampled for each tree
subsample = .7, #fraction of observations sampled for each tree; values below .5 give a very conservative model
booster = "gbtree", #tree-based model; for a linear model use 'gblinear'
max_depth = 10, #maximum depth of a tree
eta = 0.04, #learning rate; makes the model more robust by shrinking the weight of each step
eval_metric = "rmse",
objective="reg:linear")
#Train the model using those parameters
bstSparse <-
xgb.train(params = param,
data = trainD,
nrounds = 500,
watchlist = list(train = trainD),
verbose = TRUE,
print_every_n = 100,
nthread = 2)
## [18:36:32] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:189014.967259
## [101] train-rmse:9072.405113
## [201] train-rmse:2657.627783
## [301] train-rmse:1178.258176
## [401] train-rmse:479.246070
## [500] train-rmse:171.761976
Prediction of the bstSparse Model:
testD <- xgb.DMatrix(data = test2[,vars])
#Column names must match the inputs EXACTLY
prediction <- predict(bstSparse, testD) #Make the prediction on the 10% of the training data set aside
#Put testing prediction and test dataset all together
test3 <- as.data.frame(as.matrix(test2))
prediction <- as.data.frame(as.matrix(prediction))
colnames(prediction) <- "prediction"
model_output <- cbind(test3, prediction)
model_output$log_prediction <- log(model_output$prediction)
model_output$log_SalePrice <- log(model_output$SalePrice)
#Test with RMSE
rmse(model_output$log_SalePrice,model_output$log_prediction)
## [1] 0.1544653
The log-scale RMSE is 0.1545 for the first model in the run shown above. After running many different models with different parameter values, the best RMSE we observed was 1.11% (it varied between 1% and 2% across runs).
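Because the RMSE is computed on log prices, it is not itself a percentage. A rough conversion, assuming the errors are roughly symmetric on the log scale, is to exponentiate it and read the result as a typical multiplicative error on the price scale. The variable names below are made up for this example.

```r
log_rmse <- 0.1544653  # value printed above for the first XGBoost model

# A log-scale error of r corresponds to a price ratio of exp(r)
typical_factor <- exp(log_rmse)

# Expressed as a percentage deviation from the actual price
pct_error <- 100 * (typical_factor - 1)
round(pct_error, 1)
```

Read this way, a log-scale RMSE of about 0.15 corresponds to predictions that are typically off by roughly 17% of the sale price.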
#Changing the parameters
param2 <- list(colsample_bytree = .6,
subsample = .8,
booster = "gbtree",
max_depth = 12,
eta = 0.05,
eval_metric = "rmse",
objective="reg:linear")
Make a second model
#Train the model using those parameters
bstSparse2 <-
xgb.train(params = param2,
data = trainD,
nrounds = 500,
watchlist = list(train = trainD),
verbose = TRUE,
print_every_n = 100,
nthread = 2)
## [18:36:38] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:187225.998476
## [101] train-rmse:4641.332695
## [201] train-rmse:802.036158
## [301] train-rmse:189.122680
## [401] train-rmse:42.507610
## [500] train-rmse:9.741979
#Column names must match the inputs EXACTLY
prediction_2 <- predict(bstSparse2, testD) #Make the prediction on the 10% of the training data set aside
#Put testing prediction and test dataset all together
test3 <- as.data.frame(as.matrix(test2))
prediction2 <- as.data.frame(as.matrix(prediction_2))
colnames(prediction2) <- "prediction"
output <- cbind(test3, prediction2)
output$log_prediction_2 <- log(output$prediction)
output$log_SalePrice2 <- log(output$SalePrice)
#Test with RMSE
rmse(output$log_SalePrice2,output$log_prediction_2)
## [1] 0.1564827
The RMSE is 0.1565, which is slightly higher than the previous model's 0.1545.
Preparing the test data set
# Get the supplied test data ready #
predict <- as.data.frame(housing_test) #Get the dataset formatted as a frame for later combining
#Create matrices from the data frames
predData<- as.matrix(predict, rownames.force=NA)
#Turn the matrices into sparse matrices
predicting <- as(predData, "sparseMatrix")
#colnames(train[,c(2:79)])
vars <- c("Id", "MSSubClass", "MSZoning", "LotFrontage", "LotArea", "Street",
"LotShape", "LandContour", "Utilities", "LotConfig", "LandSlope", "Neighborhood",
"Condition1", "Condition2", "BldgType", "HouseStyle", "OverallQual", "OverallCond",
"YearBuilt", "YearRemodAdd", "RoofStyle", "RoofMatl", "Exterior1st", "Exterior2nd",
"MasVnrType", "MasVnrArea", "ExterQual", "ExterCond", "Foundation", "BsmtQual",
"BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinSF1", "BsmtFinType2", "BsmtFinSF2",
"BsmtUnfSF", "TotalBsmtSF", "Heating", "HeatingQC", "CentralAir", "Electrical",
"X1stFlrSF", "X2ndFlrSF", "LowQualFinSF", "GrLivArea", "BsmtFullBath", "BsmtHalfBath",
"FullBath", "HalfBath", "BedroomAbvGr", "KitchenAbvGr", "KitchenQual", "TotRmsAbvGrd",
"Functional", "Fireplaces", "FireplaceQu", "GarageType", "GarageYrBlt", "GarageFinish",
"GarageCars", "GarageArea", "GarageQual", "GarageCond", "PavedDrive", "WoodDeckSF","OpenPorchSF", "EnclosedPorch", "X3SsnPorch", "ScreenPorch", "PoolArea", "PoolQC",
"Fence", "MiscFeature", "MiscVal", "MoSold", "YrSold", "SaleType",
"SaleCondition")
colnames(predicting[,vars])
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "LotShape" "LandContour"
## [9] "Utilities" "LotConfig" "LandSlope" "Neighborhood"
## [13] "Condition1" "Condition2" "BldgType" "HouseStyle"
## [17] "OverallQual" "OverallCond" "YearBuilt" "YearRemodAdd"
## [21] "RoofStyle" "RoofMatl" "Exterior1st" "Exterior2nd"
## [25] "MasVnrType" "MasVnrArea" "ExterQual" "ExterCond"
## [29] "Foundation" "BsmtQual" "BsmtCond" "BsmtExposure"
## [33] "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2" "BsmtFinSF2"
## [37] "BsmtUnfSF" "TotalBsmtSF" "Heating" "HeatingQC"
## [41] "CentralAir" "Electrical" "X1stFlrSF" "X2ndFlrSF"
## [45] "LowQualFinSF" "GrLivArea" "BsmtFullBath" "BsmtHalfBath"
## [49] "FullBath" "HalfBath" "BedroomAbvGr" "KitchenAbvGr"
## [53] "KitchenQual" "TotRmsAbvGrd" "Functional" "Fireplaces"
## [57] "FireplaceQu" "GarageType" "GarageYrBlt" "GarageFinish"
## [61] "GarageCars" "GarageArea" "GarageQual" "GarageCond"
## [65] "PavedDrive" "WoodDeckSF" "OpenPorchSF" "EnclosedPorch"
## [69] "X3SsnPorch" "ScreenPorch" "PoolArea" "PoolQC"
## [73] "Fence" "MiscFeature" "MiscVal" "MoSold"
## [77] "YrSold" "SaleType" "SaleCondition"
rm(bstSparse)
#Create matrices from the data frames
retrainData<- as.matrix(train, rownames.force=NA)
#Turn the matrices into sparse matrices
retrain <- as(retrainData, "sparseMatrix")
param3 <- list(colsample_bytree = .7,
subsample = .7,
booster = "gbtree",
max_depth = 10,
eta = 0.04,
eval_metric = "rmse",
objective="reg:linear")
retrainD <- xgb.DMatrix(data = retrain[,vars], label = retrain[,"SalePrice"])
#retrain the model using those parameters
bstSparse3 <-
xgb.train(params = param3,
data = retrainD,
nrounds = 500,
watchlist = list(train = trainD),
verbose = TRUE,
print_every_n = 100,
nthread = 2)
## [18:36:45] WARNING: amalgamation/../src/objective/regression_obj.cu:203: reg:linear is now deprecated in favor of reg:squarederror.
## [1] train-rmse:189013.622178
## [101] train-rmse:11690.407035
## [201] train-rmse:6175.194693
## [301] train-rmse:5460.061183
## [401] train-rmse:5236.044375
## [500] train-rmse:5176.225863
#Column names must match the inputs EXACTLY
prediction <- predict(bstSparse3, predicting[,vars])
prediction <- as.data.frame(as.matrix(prediction)) #Get the dataset formatted as a frame for later combining
colnames(prediction) <- "prediction"
model_output <- cbind(predict, prediction) #Combine the prediction output with the rest of the set
results <- data.frame(Id = model_output$Id, SalePrice = model_output$prediction)
length(model_output$prediction)
## [1] 1459
###Data Analysis - Results
write.csv(results, file = "Prediction.csv", row.names = F)
head(results$SalePrice)
## [1] 125088.7 158861.6 175157.6 189112.8 190358.8 174472.2
The file contains a sale price prediction for each row of the housing_test set.
summary <- read.csv("C:/Users/raze1/OneDrive/Desktop/UIndy/MSDA 621/Project/Project Presentation/Prediction.csv")
head(summary)
## Id SalePrice
## 1 1461 125088.7
## 2 1462 158861.6
## 3 1463 175157.6
## 4 1464 189112.8
## 5 1465 190358.8
## 6 1466 174472.2
As stated at the beginning of this analysis, the housing market is in flux, and homes are categorized by numerous data points. Understanding which of those points truly matter is complex, but it is very helpful when predicting the sale price of a home.
In this analysis, we cleaned and categorized the data so it could be used to build a linear model. We then refined the model to minimize the RMSE, found XGBoost to be the best approach, and ultimately built a predictive model for that specific housing market.
Moving forward, we can apply this process to different markets independently, compare the statistically significant variables, and ultimately expand this model beyond Illinois into a national means of predicting home prices.