Part 1 and Part 2 were used to clean and prepare data. As a result, we now have data that is all numeric, and that has no missing values.
We can now build some models to predict the sale price of each house described in the Kaggle test dataset.
library(tidyverse) # data manipulation
library(vtreat) # variable preparation
library(h2o) # modeling framework
library(kableExtra) # customize table output

Load and first inspection of the data.
## Observations: 1,102
## Variables: 148
## $ ms_sub_class_catN <dbl> 0.30299208, 0.30299208, -0.24959678…
## $ ms_zoning_catN <dbl> 0.06550683, 0.06550683, 0.06550683,…
## $ lot_frontage <dbl> 68.00000, 84.00000, 85.00000, 75.00…
## $ lot_area <dbl> 11250, 14260, 14115, 10084, 10382, …
## $ alley_catN <dbl> 0.01187456, 0.01187456, 0.01187456,…
## $ lot_shape_catN <dbl> 0.14188506, 0.14188506, 0.14188506,…
## $ land_contour_catN <dbl> -0.007902741, -0.007902741, -0.0079…
## $ lot_config_catN <dbl> -0.02452583, 0.00000000, -0.0245258…
## $ neighborhood_catN <dbl> 0.13697891, 0.65789741, -0.09717128…
## $ condition1_catN <dbl> 0.02096058, 0.02096058, 0.02096058,…
## $ bldg_type_catN <dbl> 0.02699036, 0.02699036, 0.02699036,…
## $ house_style_catN <dbl> 0.14962377, 0.14962377, -0.25994606…
## $ overall_qual <dbl> 7, 8, 5, 8, 7, 7, 5, 5, 9, 7, 6, 4,…
## $ year_built <dbl> 2001, 2000, 1993, 2004, 1973, 1931,…
## $ year_remod_add <dbl> 2002, 2000, 1995, 2005, 1973, 1950,…
## $ roof_style_catN <dbl> -0.04491734, -0.04491734, -0.044917…
## $ exterior1st_catN <dbl> 0.15687324, 0.15687324, 0.15687324,…
## $ exterior2nd_catN <dbl> 0.1587478, 0.1587478, 0.1587478, 0.…
## $ mas_vnr_type_catN <dbl> 0.1464440, 0.1464440, -0.1204298, 0…
## $ mas_vnr_area <dbl> 162, 350, 0, 186, 240, 0, 0, 0, 286…
## $ exter_qual <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ foundation_catN <dbl> 0.2352084, 0.2352084, 0.2393833, 0.…
## $ bsmt_qual <dbl> 5, 5, 5, 6, 5, 4, 4, 4, 6, 5, 4, 1,…
## $ bsmt_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1,…
## $ bsmt_exposure_catN <dbl> 0.10091461, 0.10104746, -0.07571532…
## $ bsmt_fin_type1 <dbl> 7, 7, 7, 7, 6, 2, 7, 4, 7, 2, 5, 1,…
## $ bsmt_fin_sf1 <dbl> 486, 655, 732, 1369, 859, 0, 851, 9…
## $ bsmt_unf_sf <dbl> 434, 490, 64, 317, 216, 952, 140, 1…
## $ total_bsmt_sf <dbl> 920, 1145, 796, 1686, 1107, 952, 99…
## $ heating_catN <dbl> 0.009119800, 0.009119800, 0.0091198…
## $ heating_qc <dbl> 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 3,…
## $ electrical <dbl> 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf <dbl> 920, 1145, 796, 1694, 1107, 1022, 1…
## $ x2nd_flr_sf <dbl> 866, 1053, 566, 0, 983, 752, 0, 0, …
## $ gr_liv_area <dbl> 1786, 2198, 1362, 1694, 2090, 1774,…
## $ bsmt_full_bath <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,…
## $ full_bath <dbl> 2, 2, 1, 2, 2, 2, 1, 1, 3, 2, 1, 2,…
## $ half_bath <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ bedroom_abv_gr <dbl> 3, 4, 1, 3, 3, 2, 2, 3, 4, 3, 2, 2,…
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2,…
## $ kitchen_qual <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ tot_rms_abv_grd <dbl> 6, 9, 5, 7, 7, 8, 5, 5, 11, 7, 5, 6…
## $ functional <dbl> 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8,…
## $ fireplaces <dbl> 1, 1, 0, 1, 2, 2, 2, 0, 2, 1, 1, 0,…
## $ fireplace_qu <dbl> 4, 4, 1, 5, 4, 4, 4, 1, 5, 5, 3, 1,…
## $ garage_type_catN <dbl> 0.1404059, 0.1404059, 0.1404059, 0.…
## $ garage_finish_catN <dbl> 0.1666591, 0.1666591, -0.1968145, 0…
## $ garage_cars <dbl> 2, 3, 2, 2, 2, 2, 1, 1, 3, 3, 1, 2,…
## $ garage_area <dbl> 608, 836, 480, 636, 484, 468, 205, …
## $ garage_qual <dbl> 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4,…
## $ garage_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN <dbl> 0.03148809, 0.03148809, 0.03148809,…
## $ wood_deck_sf <dbl> 0, 192, 40, 255, 235, 90, 0, 0, 147…
## $ open_porch_sf <dbl> 42, 84, 30, 57, 204, 0, 4, 0, 21, 3…
## $ enclosed_porch <dbl> 0, 0, 0, 0, 228, 205, 0, 0, 0, 0, 1…
## $ screen_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_qc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence <dbl> 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 3, 1,…
## $ misc_feature_catN <dbl> 0.005977216, 0.005977216, -0.181971…
## $ sale_type_catN <dbl> -0.02942813, -0.02942813, -0.029428…
## $ sale_condition_catN <dbl> -0.01765492, -0.01765492, -0.017654…
## $ has_garage <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_remod <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_rare <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_160 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_50 <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60 <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ ms_sub_class_lev_x_90 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ms_zoning_lev_x_FV <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ ms_zoning_lev_x_RM <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_Grvl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1 <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ lot_shape_lev_x_IR2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg <dbl> 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ land_contour_lev_x_Bnk <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_Inside <dbl> 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,…
## $ land_slope_lev_x_Gtl <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_CollgCr <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ neighborhood_lev_x_Crawfor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ neighborhood_lev_x_NoRidge <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ neighborhood_lev_x_OldTown <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ neighborhood_lev_x_Somerst <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Norm <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ bldg_type_lev_x_1Fam <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ bldg_type_lev_x_Duplex <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ bldg_type_lev_x_Twnhs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1_5Fin <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ house_style_lev_x_2Story <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ house_style_lev_x_SFoyer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1,…
## $ roof_style_lev_x_Hip <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,…
## $ roof_matl_lev_x_CompShg <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior1st_lev_x_VinylSd <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior2nd_lev_x_VinylSd <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ mas_vnr_type_lev_x_None <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ mas_vnr_type_lev_x_Stone <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,…
## $ foundation_lev_x_BrkTil <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_CBlock <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,…
## $ foundation_lev_x_PConc <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Av <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Gd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,…
## $ bsmt_exposure_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ heating_lev_x_GasA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ central_air_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_type_lev_x_Attchd <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_type_lev_x_BuiltIn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_type_lev_x_Detchd <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ garage_type_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_finish_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_finish_lev_x_Unf <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,…
## $ paved_drive_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,…
## $ misc_feature_lev_x_Shed <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
## $ sale_type_lev_x_COD <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_New <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ sale_type_lev_x_WD <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ log_sale_price <dbl> 12.31717, 12.42922, 11.87060, 12.63…
We will use the h2o framework to build multiple models. More specifically, its automated machine learning function, h2o.automl(), is a handy way to train several models and automatically tune the hyperparameters of each algorithm.
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 13 minutes 36 seconds
## H2O cluster timezone: Europe/Paris
## H2O data parsing timezone: UTC
## H2O cluster version: 3.24.0.5
## H2O cluster version age: 18 days
## H2O cluster name: H2O_started_from_R_alex_jif955
## H2O cluster total nodes: 1
## H2O cluster total memory: 1.64 GB
## H2O cluster total cores: 2
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4
## R Version: R version 3.6.0 (2019-04-26)
The data has to be ‘h2o-formatted’ in order to be used.
h2o.describe(train_h2o) %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>%
scroll_box(width = "800px")

| Label | Type | Missing | Zeros | PosInf | NegInf | Min | Max | Mean | Sigma | Cardinality |
|---|---|---|---|---|---|---|---|---|---|---|
| ms_sub_class_catN | real | 0 | 349 | 0 | 0 | -0.6086855 | 3.433566e-01 | -0.0010771 | 2.259486e-01 | NA |
| ms_zoning_catN | real | 0 | 0 | 0 | 0 | -0.6535452 | 2.269903e-01 | 0.0007190 | 1.608428e-01 | NA |
| lot_frontage | real | 0 | 0 | 0 | 0 | 21.0000000 | 3.130000e+02 | 69.9879470 | 2.150109e+01 | NA |
| lot_area | int | 0 | 0 | 0 | 0 | 1300.0000000 | 2.152450e+05 | 10504.1079855 | 1.010558e+04 | NA |
| alley_catN | real | 0 | 35 | 0 | 0 | -0.4320284 | 1.696460e-02 | 0.0003160 | 7.342400e-02 | NA |
| lot_shape_catN | real | 0 | 1 | 0 | 0 | -0.0983687 | 8.033851e-01 | 0.0037089 | 1.344989e-01 | NA |
| land_contour_catN | real | 0 | 669 | 0 | 0 | -0.2459688 | 3.048004e-01 | 0.0029096 | 7.323370e-02 | NA |
| lot_config_catN | real | 0 | 441 | 0 | 0 | -0.0374232 | 6.290317e-01 | 0.0038008 | 6.307520e-02 | NA |
| neighborhood_catN | real | 0 | 51 | 0 | 0 | -0.6095687 | 6.809439e-01 | 0.0008942 | 2.872994e-01 | NA |
| condition1_catN | real | 0 | 28 | 0 | 0 | -0.3162691 | 1.296756e-01 | -0.0000929 | 7.735200e-02 | NA |
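
The AutoML call that follows refers to `x` (predictor names) and `y` (response name), which are not defined in this part. A minimal sketch of how they might be set up, assuming the response is the `log_sale_price` column and every other column is a predictor; `train_toy` is a hypothetical stand-in built from the first few values shown in the data overview, not the real training frame:

```r
# Toy stand-in for the treated training data (first rows of three columns)
train_toy <- data.frame(
  lot_area       = c(11250, 14260, 14115),
  overall_qual   = c(7, 8, 5),
  log_sale_price = c(12.31717, 12.42922, 11.87060)
)

# Name of the response column
y <- "log_sale_price"

# Every remaining column is treated as a predictor (an assumption)
x <- setdiff(names(train_toy), y)

print(x)
```

With the real data, the same `setdiff()` pattern would presumably be applied to the column names of the treated training frame.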
# Run AutoML for up to 40 base models
# (the default 1-hour runtime limit is overridden below with max_runtime_secs = 60)
# Exclude Deep Learning
# The metric used on Kaggle is Root Mean Squared Logarithmic Error
# (we already log-transformed the response, so the metric is RMSE)
house_automl <- h2o.automl(x = x, y = y,
training_frame = train_h2o,
validation_frame = valid_h2o,
max_models = 40, max_runtime_secs = 60,
exclude_algos = c("DeepLearning"),
sort_metric = "RMSE",
seed = 42)

# Extract the AutoML Leaderboard
house_lb <- house_automl@leaderboard
# View the 10 best models
house_lb %>%
head(10) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>%
scroll_box(width = "800px")

| model_id | mean_residual_deviance | rmse | mse | mae | rmsle |
|---|---|---|---|---|---|
| XGBoost_1_AutoML_20190707_193526 | 0.0174348 | 0.1320411 | 0.0174348 | 0.0911977 | 0.0102442 |
| XGBoost_2_AutoML_20190707_193526 | 0.0174418 | 0.1320675 | 0.0174418 | 0.0906730 | 0.0102372 |
| StackedEnsemble_BestOfFamily_AutoML_20190707_193526 | 0.0178124 | 0.1334630 | 0.0178124 | 0.0851239 | 0.0102857 |
| StackedEnsemble_AllModels_AutoML_20190707_193526 | 0.0178208 | 0.1334944 | 0.0178208 | 0.0855003 | 0.0102978 |
| XGBoost_3_AutoML_20190707_193526 | 0.0182921 | 0.1352484 | 0.0182921 | 0.0933149 | 0.0104718 |
| GBM_2_AutoML_20190707_193526 | 0.0195830 | 0.1399394 | 0.0195830 | 0.0936482 | 0.0108621 |
| GBM_1_AutoML_20190707_193526 | 0.0196602 | 0.1402148 | 0.0196602 | 0.0938517 | 0.0108908 |
| GLM_grid_1_AutoML_20190707_193526_model_1 | 0.0220872 | 0.1486175 | 0.0220872 | 0.0905069 | 0.0112702 |
| DRF_1_AutoML_20190707_193526 | 0.0239059 | 0.1546152 | 0.0239059 | 0.1049230 | 0.0120129 |
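
The leaderboard is sorted by RMSE because the response is already log-transformed. A small sketch of why that matches a log-error metric on the original prices: the RMSE computed on log prices equals the RMSE between the logs of the predicted and observed prices. The prices below are made up for illustration, and `rmse()` is a hypothetical helper, not an h2o function:

```r
# Root mean squared error between two numeric vectors
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Hypothetical sale prices and predictions (illustrative values only)
sale_price      <- c(200000, 150000, 310000)
predicted_price <- c(210000, 140000, 300000)

# Log-error score computed from the raw prices
score_from_prices <- rmse(log(sale_price), log(predicted_price))

# Same score computed on an already log-transformed response,
# which is the scale the models here are trained on
log_sale_price      <- log(sale_price)
predicted_log_price <- log(predicted_price)
score_from_logs <- rmse(log_sale_price, predicted_log_price)

print(all.equal(score_from_prices, score_from_logs))
```

Assuming the natural log was used in the earlier parts, predictions made on the log scale would need `exp()` applied before being reported as sale prices.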
Variable importance.
We can also inspect the parameters of the best model.
## $model_id
## [1] "XGBoost_1_AutoML_20190707_193526"
##
## $training_frame
## [1] "automl_training_train_treated_sid_9702_1"
##
## $validation_frame
## [1] "valid_treated_sid_9702_3"
##
## $nfolds
## [1] 5
##
## $keep_cross_validation_models
## [1] FALSE
##
## $keep_cross_validation_predictions
## [1] TRUE
##
## $fold_assignment
## [1] "Modulo"
##
## $stopping_metric
## [1] "RMSE"
##
## $stopping_tolerance
## [1] 0.03012376
##
## $seed
## [1] 42
##
## $ntrees
## [1] 139
##
## $max_depth
## [1] 5
##
## $min_rows
## [1] 3
##
## $learn_rate
## [1] 0.05
##
## $sample_rate
## [1] 0.8
##
## $col_sample_rate
## [1] 0.8
##
## $col_sample_rate_per_tree
## [1] 0.8
##
## $score_tree_interval
## [1] 5
##
## $x
## [1] "ms_sub_class_catN" "ms_zoning_catN"
## [3] "lot_frontage" "lot_area"
## [5] "alley_catN" "lot_shape_catN"
## [7] "land_contour_catN" "lot_config_catN"
## [9] "neighborhood_catN" "condition1_catN"
## [11] "bldg_type_catN" "house_style_catN"
## [13] "overall_qual" "year_built"
## [15] "year_remod_add" "roof_style_catN"
## [17] "exterior1st_catN" "exterior2nd_catN"
## [19] "mas_vnr_type_catN" "mas_vnr_area"
## [21] "exter_qual" "foundation_catN"
## [23] "bsmt_qual" "bsmt_cond"
## [25] "bsmt_exposure_catN" "bsmt_fin_type1"
## [27] "bsmt_fin_sf1" "bsmt_unf_sf"
## [29] "total_bsmt_sf" "heating_catN"
## [31] "heating_qc" "electrical"
## [33] "x1st_flr_sf" "x2nd_flr_sf"
## [35] "gr_liv_area" "bsmt_full_bath"
## [37] "full_bath" "half_bath"
## [39] "bedroom_abv_gr" "kitchen_abv_gr"
## [41] "kitchen_qual" "tot_rms_abv_grd"
## [43] "functional" "fireplaces"
## [45] "fireplace_qu" "garage_type_catN"
## [47] "garage_finish_catN" "garage_cars"
## [49] "garage_area" "garage_qual"
## [51] "garage_cond" "paved_drive_catN"
## [53] "wood_deck_sf" "open_porch_sf"
## [55] "enclosed_porch" "screen_porch"
## [57] "pool_qc" "fence"
## [59] "misc_feature_catN" "sale_type_catN"
## [61] "sale_condition_catN" "has_garage"
## [63] "garage_yr_same_built" "garage_yr_same_remod"
## [65] "ms_sub_class_lev_rare" "ms_sub_class_lev_x_120"
## [67] "ms_sub_class_lev_x_160" "ms_sub_class_lev_x_30"
## [69] "ms_sub_class_lev_x_50" "ms_sub_class_lev_x_60"
## [71] "ms_sub_class_lev_x_90" "ms_zoning_lev_x_FV"
## [73] "ms_zoning_lev_x_RL" "ms_zoning_lev_x_RM"
## [75] "alley_lev_x_Grvl" "alley_lev_x_None"
## [77] "lot_shape_lev_x_IR1" "lot_shape_lev_x_IR2"
## [79] "lot_shape_lev_x_Reg" "land_contour_lev_x_Bnk"
## [81] "land_contour_lev_x_HLS" "land_contour_lev_x_Low"
## [83] "lot_config_lev_x_CulDSac" "lot_config_lev_x_Inside"
## [85] "land_slope_lev_x_Gtl" "neighborhood_lev_x_BrkSide"
## [87] "neighborhood_lev_x_CollgCr" "neighborhood_lev_x_Crawfor"
## [89] "neighborhood_lev_x_Edwards" "neighborhood_lev_x_NAmes"
## [91] "neighborhood_lev_x_NoRidge" "neighborhood_lev_x_NridgHt"
## [93] "neighborhood_lev_x_OldTown" "neighborhood_lev_x_Sawyer"
## [95] "neighborhood_lev_x_Somerst" "neighborhood_lev_x_Timber"
## [97] "condition1_lev_x_Artery" "condition1_lev_x_Feedr"
## [99] "condition1_lev_x_Norm" "bldg_type_lev_x_1Fam"
## [101] "bldg_type_lev_x_Duplex" "bldg_type_lev_x_Twnhs"
## [103] "house_style_lev_x_1_5Fin" "house_style_lev_x_1Story"
## [105] "house_style_lev_x_2Story" "house_style_lev_x_SFoyer"
## [107] "roof_style_lev_x_Gable" "roof_style_lev_x_Hip"
## [109] "roof_matl_lev_x_CompShg" "exterior1st_lev_x_CemntBd"
## [111] "exterior1st_lev_x_MetalSd" "exterior1st_lev_x_VinylSd"
## [113] "exterior1st_lev_x_Wd_Sdng" "exterior2nd_lev_x_MetalSd"
## [115] "exterior2nd_lev_x_VinylSd" "exterior2nd_lev_x_Wd_Sdng"
## [117] "mas_vnr_type_lev_x_BrkFace" "mas_vnr_type_lev_x_None"
## [119] "mas_vnr_type_lev_x_Stone" "foundation_lev_x_BrkTil"
## [121] "foundation_lev_x_CBlock" "foundation_lev_x_PConc"
## [123] "bsmt_exposure_lev_x_Av" "bsmt_exposure_lev_x_Gd"
## [125] "bsmt_exposure_lev_x_No" "bsmt_exposure_lev_x_None"
## [127] "heating_lev_x_GasA" "central_air_lev_x_N"
## [129] "central_air_lev_x_Y" "garage_type_lev_x_Attchd"
## [131] "garage_type_lev_x_BuiltIn" "garage_type_lev_x_Detchd"
## [133] "garage_type_lev_x_None" "garage_finish_lev_x_Fin"
## [135] "garage_finish_lev_x_None" "garage_finish_lev_x_RFn"
## [137] "garage_finish_lev_x_Unf" "paved_drive_lev_x_N"
## [139] "paved_drive_lev_x_Y" "misc_feature_lev_x_None"
## [141] "misc_feature_lev_x_Shed" "sale_type_lev_x_COD"
## [143] "sale_type_lev_x_New" "sale_type_lev_x_WD"
## [145] "sale_condition_lev_x_Abnorml" "sale_condition_lev_x_Normal"
## [147] "sale_condition_lev_x_Partial"
##
## $y
## [1] "log_sale_price"
Test data

The test dataset will be prepared using the same steps and treatment plan created in Part 2.
# the 'test' data comes from the pre-prepared data from Part 1
test <- readRDS("01-full_train_test.rds") %>%
filter(df_id == "test") %>%
select(-df_id, -sale_price)

## Observations: 1,459
## Variables: 80
## $ id <dbl> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, …
## $ ms_sub_class <fct> 20, 20, 60, 60, 120, 60, 20, 60, 20, 20, 120, 16…
## $ ms_zoning <fct> RH, RL, RL, RL, RL, RL, RL, RL, RL, RL, RH, RM, …
## $ lot_frontage <dbl> 80, 81, 74, 78, 43, 75, NA, 63, 85, 70, 26, 21, …
## $ lot_area <dbl> 11622, 14267, 13830, 9978, 5005, 10000, 7980, 84…
## $ street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, …
## $ alley <fct> None, None, None, None, None, None, None, None, …
## $ lot_shape <fct> Reg, IR1, IR1, IR1, IR1, IR1, IR1, IR1, Reg, Reg…
## $ land_contour <fct> Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, Lvl, Lvl, Lvl…
## $ utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
## $ lot_config <fct> Inside, Corner, Inside, Inside, Inside, Corner, …
## $ land_slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl…
## $ neighborhood <fct> NAmes, NAmes, Gilbert, Gilbert, StoneBr, Gilbert…
## $ condition1 <fct> Feedr, Norm, Norm, Norm, Norm, Norm, Norm, Norm,…
## $ condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, …
## $ bldg_type <fct> 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, 1Fam, 1Fam, 1Fam…
## $ house_style <fct> 1Story, 1Story, 2Story, 2Story, 1Story, 2Story, …
## $ overall_qual <ord> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6, 5, 6, 7, 9, …
## $ overall_cond <ord> 6, 6, 5, 6, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 6, 5, …
## $ year_built <dbl> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, …
## $ year_remod_add <dbl> 1961, 1958, 1998, 1998, 1992, 1994, 2007, 1998, …
## $ roof_style <fct> Gable, Hip, Gable, Gable, Gable, Gable, Gable, G…
## $ roof_matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com…
## $ exterior1st <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB…
## $ exterior2nd <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB…
## $ mas_vnr_type <fct> None, BrkFace, None, BrkFace, None, None, None, …
## $ mas_vnr_area <dbl> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0, 504, 492, 0,…
## $ exter_qual <ord> TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, Gd, TA, …
## $ exter_cond <ord> TA, TA, TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, …
## $ foundation <fct> CBlock, CBlock, PConc, PConc, PConc, PConc, PCon…
## $ bsmt_qual <ord> TA, TA, Gd, TA, Gd, Gd, Gd, Gd, Gd, TA, Gd, TA, …
## $ bsmt_cond <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ bsmt_exposure <fct> No, No, No, No, No, No, No, No, Gd, No, No, No, …
## $ bsmt_fin_type1 <ord> Rec, ALQ, GLQ, GLQ, ALQ, Unf, ALQ, Unf, GLQ, ALQ…
## $ bsmt_fin_sf1 <dbl> 468, 923, 791, 602, 263, 0, 935, 0, 637, 804, 10…
## $ bsmt_fin_type2 <ord> LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Rec…
## $ bsmt_fin_sf2 <dbl> 144, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, …
## $ bsmt_unf_sf <dbl> 270, 406, 137, 324, 1017, 763, 233, 789, 663, 0,…
## $ total_bsmt_sf <dbl> 882, 1329, 928, 926, 1280, 763, 1168, 789, 1300,…
## $ heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, …
## $ heating_qc <ord> TA, TA, Gd, Ex, Ex, Gd, Ex, Gd, Gd, TA, Ex, TA, …
## $ central_air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical <ord> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,…
## $ x1st_flr_sf <dbl> 896, 1329, 928, 926, 1280, 763, 1187, 789, 1341,…
## $ x2nd_flr_sf <dbl> 0, 0, 701, 678, 0, 892, 0, 676, 0, 0, 0, 504, 56…
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area <dbl> 896, 1329, 1629, 1604, 1280, 1655, 1187, 1465, 1…
## $ bsmt_full_bath <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, …
## $ bsmt_half_bath <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, …
## $ half_bath <dbl> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, …
## $ bedroom_abv_gr <dbl> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, …
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual <ord> TA, Gd, TA, Gd, Gd, TA, TA, TA, Gd, TA, Gd, TA, …
## $ tot_rms_abv_grd <dbl> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5, 6, 6, 4, 10,…
## $ functional <ord> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ…
## $ fireplaces <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, …
## $ fireplace_qu <ord> None, None, TA, Gd, None, TA, None, Gd, Po, None…
## $ garage_type <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, …
## $ garage_yr_blt <fct> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, …
## $ garage_finish <fct> Unf, Unf, Fin, Fin, RFn, Fin, Fin, Fin, Unf, Fin…
## $ garage_cars <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 3, …
## $ garage_area <dbl> 730, 312, 482, 470, 506, 440, 420, 393, 506, 525…
## $ garage_qual <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ garage_cond <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ paved_drive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ wood_deck_sf <dbl> 140, 393, 212, 360, 0, 157, 483, 0, 192, 240, 20…
## $ open_porch_sf <dbl> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0, 68, 0, 0, 0…
## $ enclosed_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ x3ssn_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ screen_porch <dbl> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_area <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc <ord> None, None, None, None, None, None, None, None, …
## $ fence <ord> MnPrv, None, MnPrv, None, None, None, GdPrv, Non…
## $ misc_feature <fct> None, Gar2, None, None, None, None, Shed, None, …
## $ misc_val <dbl> 0, 12500, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, …
## $ mo_sold <dbl> 6, 6, 3, 6, 1, 4, 3, 5, 2, 4, 6, 2, 3, 6, 6, 1, …
## $ yr_sold <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
## $ sale_type <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, COD,…
## $ sale_condition <fct> Normal, Normal, Normal, Normal, Normal, Normal, …
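
The `mo_sold` column above is a month number, so December (12) and January (1) look far apart numerically even though they are adjacent in the calendar. The `cyclical_transform()` helper from Part 2 is not shown in this part; as a rough illustration only, one common approach is a sine/cosine encoding. The function name `cyclical_encode_month` and the output column names are hypothetical, and the author's helper may work differently:

```r
# Hypothetical helper: add sine/cosine encodings of a 1-12 month column
cyclical_encode_month <- function(df, column_name) {
  angle <- 2 * pi * (df[[column_name]] - 1) / 12
  df[[paste0(column_name, "_sin")]] <- sin(angle)
  df[[paste0(column_name, "_cos")]] <- cos(angle)
  df
}

# January, June and December
months  <- data.frame(mo_sold = c(1, 6, 12))
encoded <- cyclical_encode_month(months, "mo_sold")

# Euclidean distance between two rows in the (sin, cos) plane
dist_between <- function(i, j) {
  sqrt((encoded$mo_sold_sin[i] - encoded$mo_sold_sin[j])^2 +
       (encoded$mo_sold_cos[i] - encoded$mo_sold_cos[j])^2)
}

# After encoding, December is closer to January than June is
print(dist_between(1, 3) < dist_between(1, 2))
```

The point of the encoding is that every month lands on the unit circle, so distances between months reflect how far apart they are in the year rather than the difference between their numbers.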
As a reminder, we applied the following steps to prepare the train_treated data, and will apply the same steps to the test data:

- cyclical transform of the mo_sold variable
- transform of the garage_yr_blt variable

# Prepare 'test' data
test_treated <- test %>%
cyclical_transform(column_name = "mo_sold") %>%
garage_year_transform() %>%
mutate_if(is.ordered, as.numeric) %>%
prepare(treatmentplan = vtreat_plan$treatments,
pruneSig = vtreat_prune_sig)

## Observations: 1,459
## Variables: 147
## $ ms_sub_class_catN <dbl> 0.02487946, 0.02487946, 0.31707790,…
## $ ms_zoning_catN <dbl> -0.53309450, 0.06141452, 0.06141452…
## $ lot_frontage <dbl> 80.0000, 81.0000, 74.0000, 78.0000,…
## $ lot_area <dbl> 11622, 14267, 13830, 9978, 5005, 10…
## $ alley_catN <dbl> 0.01468384, 0.01468384, 0.01468384,…
## $ lot_shape_catN <dbl> -0.09031732, 0.15279191, 0.15279191…
## $ land_contour_catN <dbl> -0.005418798, -0.005418798, -0.0054…
## $ lot_config_catN <dbl> -0.02389266, 0.00000000, -0.0238926…
## $ neighborhood_catN <dbl> -0.15777407, -0.15777407, 0.1267109…
## $ condition1_catN <dbl> -0.2327314, 0.0209811, 0.0209811, 0…
## $ bldg_type_catN <dbl> 0.02452499, 0.02452499, 0.02452499,…
## $ house_style_catN <dbl> -0.03114034, -0.03114034, 0.1641937…
## $ overall_qual <dbl> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6,…
## $ year_built <dbl> 1961, 1958, 1997, 1998, 1992, 1993,…
## $ year_remod_add <dbl> 1961, 1958, 1998, 1998, 1992, 1994,…
## $ roof_style_catN <dbl> -0.04345073, 0.18122306, -0.0434507…
## $ exterior1st_catN <dbl> 0.17180733, -0.17412791, 0.17180733…
## $ exterior2nd_catN <dbl> 0.17412141, -0.17021159, 0.17412141…
## $ mas_vnr_type_catN <dbl> -0.1297748, 0.1514415, -0.1297748, …
## $ mas_vnr_area <dbl> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0,…
## $ exter_qual <dbl> 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3,…
## $ foundation_catN <dbl> -0.1589739, -0.1589739, 0.2294760, …
## $ bsmt_qual <dbl> 4, 4, 5, 4, 5, 5, 5, 5, 5, 4, 5, 4,…
## $ bsmt_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ bsmt_exposure_catN <dbl> -0.08084167, -0.08084167, -0.080841…
## $ bsmt_fin_type1 <dbl> 4, 6, 7, 7, 6, 2, 6, 2, 7, 6, 7, 4,…
## $ bsmt_fin_sf1 <dbl> 468, 923, 791, 602, 263, 0, 935, 0,…
## $ bsmt_unf_sf <dbl> 270, 406, 137, 324, 1017, 763, 233,…
## $ total_bsmt_sf <dbl> 882, 1329, 928, 926, 1280, 763, 116…
## $ heating_catN <dbl> 0.008521347, 0.008521347, 0.0085213…
## $ heating_qc <dbl> 3, 3, 4, 5, 5, 4, 5, 4, 4, 3, 5, 3,…
## $ electrical <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf <dbl> 896, 1329, 928, 926, 1280, 763, 118…
## $ x2nd_flr_sf <dbl> 0, 0, 701, 678, 0, 892, 0, 676, 0, …
## $ gr_liv_area <dbl> 896, 1329, 1629, 1604, 1280, 1655, …
## $ bsmt_full_bath <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
## $ full_bath <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1,…
## $ half_bath <dbl> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1,…
## $ bedroom_abv_gr <dbl> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2,…
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ kitchen_qual <dbl> 3, 4, 3, 4, 4, 3, 3, 3, 4, 3, 4, 3,…
## $ tot_rms_abv_grd <dbl> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5,…
## $ functional <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,…
## $ fireplaces <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,…
## $ fireplace_qu <dbl> 1, 1, 4, 5, 1, 4, 1, 5, 2, 1, 3, 1,…
## $ garage_type_catN <dbl> 0.1336612, 0.1336612, 0.1336612, 0.…
## $ garage_finish_catN <dbl> -0.2014492, -0.2014492, 0.2872421, …
## $ garage_cars <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ garage_area <dbl> 730, 312, 482, 470, 506, 440, 420, …
## $ garage_qual <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ garage_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN <dbl> 0.03529355, 0.03529355, 0.03529355,…
## $ wood_deck_sf <dbl> 140, 393, 212, 360, 0, 157, 483, 0,…
## $ open_porch_sf <dbl> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0…
## $ enclosed_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ screen_porch <dbl> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0…
## $ pool_qc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence <dbl> 4, 1, 4, 1, 1, 1, 5, 1, 1, 4, 1, 1,…
## $ misc_feature_catN <dbl> 0.006052505, 0.000000000, 0.0060525…
## $ sale_type_catN <dbl> -0.02981943, -0.02981943, -0.029819…
## $ sale_condition_catN <dbl> -0.02056441, -0.02056441, -0.020564…
## $ has_garage <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_remod <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_rare <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120 <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ ms_sub_class_lev_x_160 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ms_sub_class_lev_x_30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_50 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60 <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_90 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_FV <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,…
## $ ms_zoning_lev_x_RM <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ alley_lev_x_Grvl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1 <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,…
## $ lot_shape_lev_x_IR2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ land_contour_lev_x_Bnk <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_Inside <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,…
## $ land_slope_lev_x_Gtl <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_CollgCr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Crawfor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,…
## $ neighborhood_lev_x_NoRidge <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_OldTown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Somerst <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Norm <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ bldg_type_lev_x_1Fam <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ bldg_type_lev_x_Duplex <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bldg_type_lev_x_Twnhs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ house_style_lev_x_1_5Fin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story <dbl> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0,…
## $ house_style_lev_x_2Story <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,…
## $ house_style_lev_x_SFoyer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ roof_style_lev_x_Hip <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_matl_lev_x_CompShg <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior1st_lev_x_VinylSd <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior2nd_lev_x_VinylSd <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ mas_vnr_type_lev_x_None <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ mas_vnr_type_lev_x_Stone <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_BrkTil <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_CBlock <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ foundation_lev_x_PConc <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,…
## $ bsmt_exposure_lev_x_Av <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_Gd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ bsmt_exposure_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heating_lev_x_GasA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ central_air_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_type_lev_x_Attchd <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ garage_type_lev_x_BuiltIn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_type_lev_x_Detchd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ garage_type_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin <dbl> 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,…
## $ garage_finish_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Unf <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,…
## $ paved_drive_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_Shed <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_COD <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ sale_type_lev_x_New <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_WD <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
First, import the test_treated data into the h2o framework.
Then predict on the test_treated set.
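The code for these two steps is not shown here. A minimal sketch, assuming an H2O cluster is already running (started with `h2o.init()`) and that the selected model is available as an R object, might look like the following. The helper name `predict_log_price` and the object name `best_model` are hypothetical and not taken from the original post.

```r
library(h2o)

# Hypothetical helper: upload a treated data frame and score it with a fitted H2O model.
predict_log_price <- function(model, newdata) {
  newdata_h2o <- as.h2o(newdata)             # copy the R data frame into the H2O cluster
  h2o.predict(model, newdata = newdata_h2o)  # H2OFrame of predictions, one row per house
}

# Usage (requires a running cluster and a trained model):
# pred <- predict_log_price(best_model, test_treated)
# head(pred)
```

For a regression model, the returned frame has a single `predict` column, which here holds predictions on the log scale.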
## predict
## 1 11.60951
## 2 11.97460
## 3 12.16442
## 4 12.15858
## 5 12.21806
## 6 12.03056
Recall that we log-transformed sale_price. The Kaggle submission needs sale_price itself, not log_sale_price, so the predictions must be exponentiated.
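As a small illustration of the back-transformation, the sketch below applies `exp()` to the first six log-scale predictions, copied by hand from the output above into a plain numeric vector. The names `log_pred` and `sale_price_pred` are illustrative. In the post itself the same function is applied directly to the H2O prediction frame.

```r
# First six predictions on the log scale, copied from the printed output
log_pred <- c(11.60951, 11.97460, 12.16442, 12.15858, 12.21806, 12.03056)

# exp() inverts the natural log, giving predicted prices in dollars
sale_price_pred <- exp(log_pred)

# Round to whole dollars for display
round(sale_price_pred)
```

Because the inputs are rounded to five decimals, the results can differ from the values printed below by about a dollar.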
## exp(predict)
## 1 110140.1
## 2 158673.0
## 3 191840.8
## 4 190724.4
## 5 202412.0
## 6 167804.8
We can now create the file that will be submitted to Kaggle.
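A minimal sketch of this step, assuming the house identifiers and the back-transformed predictions are available as two vectors of equal length. The stand-in values below are the first six rows of the output, and the file name `submission.csv` is hypothetical.

```r
library(tibble)
library(readr)

# Stand-in values: in the post, `id` would come from the Kaggle test set
# and `sale_price_pred` from the exponentiated predictions
id <- 1461:1466
sale_price_pred <- c(110140.1, 158673.0, 191840.8, 190724.4, 202412.0, 167804.8)

# The submission table has two columns, Id and SalePrice
submission <- tibble(Id = id, SalePrice = sale_price_pred)

# Write a CSV file with a header row
write_csv(submission, "submission.csv")
```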
## # A tibble: 1,459 x 2
## Id SalePrice
## <dbl> <dbl>
## 1 1461 110140.
## 2 1462 158673.
## 3 1463 191841.
## 4 1464 190724.
## 5 1465 202412.
## 6 1466 167805.
## 7 1467 172645.
## 8 1468 160331.
## 9 1469 186073.
## 10 1470 127539.
## # … with 1,449 more rows
On Kaggle, the RMSLE score is 0.13304, which is similar to the error we found when building models with auto_ml. This suggests that our final model did not overfit and generalizes to unseen data.
To get a better score, more feature engineering would likely help. Feature engineering, however, usually depends on domain knowledge. In a real-world scenario, we would consult professionals in the field to get a sense of which new variables would be useful.
Another approach would be to treat the best model obtained with automl as a baseline and try a few tweaks to its hyperparameters.
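For reference, Kaggle scores this competition with the root mean squared error between the logarithm of the predicted price and the logarithm of the observed price. A small sketch of that metric in base R follows. The function name `rmsle` and the example prices are illustrative and not taken from the post.

```r
# Root mean squared logarithmic error between observed and predicted prices
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Illustrative prices only
actual    <- c(100000, 200000)
predicted <- c(110000, 190000)
rmsle(actual, predicted)
```

Since the models were trained on the log of the sale price, an RMSE measured on that log scale during training is directly comparable to this score.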