In Part 1 and Part 2 we cleaned and prepared the data. As a result, every variable is now numeric and there are no missing values.
We can now build models to predict the sale price of each house in the Kaggle test dataset.

1 Load data and libraries

library(tidyverse)   # data manipulation
library(vtreat)      # variable preparation
library(h2o)         # modeling framework
library(kableExtra)  # customize table output

Load the data and take a first look at it.

# Import the R objects created in Part 2
load("02-house_objects.RData")
# Structure of the train dataset
glimpse(train_treated)
## Observations: 1,102
## Variables: 148
## $ ms_sub_class_catN            <dbl> 0.30299208, 0.30299208, -0.24959678…
## $ ms_zoning_catN               <dbl> 0.06550683, 0.06550683, 0.06550683,…
## $ lot_frontage                 <dbl> 68.00000, 84.00000, 85.00000, 75.00…
## $ lot_area                     <dbl> 11250, 14260, 14115, 10084, 10382, …
## $ alley_catN                   <dbl> 0.01187456, 0.01187456, 0.01187456,…
## $ lot_shape_catN               <dbl> 0.14188506, 0.14188506, 0.14188506,…
## $ land_contour_catN            <dbl> -0.007902741, -0.007902741, -0.0079…
## $ lot_config_catN              <dbl> -0.02452583, 0.00000000, -0.0245258…
## $ neighborhood_catN            <dbl> 0.13697891, 0.65789741, -0.09717128…
## $ condition1_catN              <dbl> 0.02096058, 0.02096058, 0.02096058,…
## $ bldg_type_catN               <dbl> 0.02699036, 0.02699036, 0.02699036,…
## $ house_style_catN             <dbl> 0.14962377, 0.14962377, -0.25994606…
## $ overall_qual                 <dbl> 7, 8, 5, 8, 7, 7, 5, 5, 9, 7, 6, 4,…
## $ year_built                   <dbl> 2001, 2000, 1993, 2004, 1973, 1931,…
## $ year_remod_add               <dbl> 2002, 2000, 1995, 2005, 1973, 1950,…
## $ roof_style_catN              <dbl> -0.04491734, -0.04491734, -0.044917…
## $ exterior1st_catN             <dbl> 0.15687324, 0.15687324, 0.15687324,…
## $ exterior2nd_catN             <dbl> 0.1587478, 0.1587478, 0.1587478, 0.…
## $ mas_vnr_type_catN            <dbl> 0.1464440, 0.1464440, -0.1204298, 0…
## $ mas_vnr_area                 <dbl> 162, 350, 0, 186, 240, 0, 0, 0, 286…
## $ exter_qual                   <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ foundation_catN              <dbl> 0.2352084, 0.2352084, 0.2393833, 0.…
## $ bsmt_qual                    <dbl> 5, 5, 5, 6, 5, 4, 4, 4, 6, 5, 4, 1,…
## $ bsmt_cond                    <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1,…
## $ bsmt_exposure_catN           <dbl> 0.10091461, 0.10104746, -0.07571532…
## $ bsmt_fin_type1               <dbl> 7, 7, 7, 7, 6, 2, 7, 4, 7, 2, 5, 1,…
## $ bsmt_fin_sf1                 <dbl> 486, 655, 732, 1369, 859, 0, 851, 9…
## $ bsmt_unf_sf                  <dbl> 434, 490, 64, 317, 216, 952, 140, 1…
## $ total_bsmt_sf                <dbl> 920, 1145, 796, 1686, 1107, 952, 99…
## $ heating_catN                 <dbl> 0.009119800, 0.009119800, 0.0091198…
## $ heating_qc                   <dbl> 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 3,…
## $ electrical                   <dbl> 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf                  <dbl> 920, 1145, 796, 1694, 1107, 1022, 1…
## $ x2nd_flr_sf                  <dbl> 866, 1053, 566, 0, 983, 752, 0, 0, …
## $ gr_liv_area                  <dbl> 1786, 2198, 1362, 1694, 2090, 1774,…
## $ bsmt_full_bath               <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,…
## $ full_bath                    <dbl> 2, 2, 1, 2, 2, 2, 1, 1, 3, 2, 1, 2,…
## $ half_bath                    <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ bedroom_abv_gr               <dbl> 3, 4, 1, 3, 3, 2, 2, 3, 4, 3, 2, 2,…
## $ kitchen_abv_gr               <dbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2,…
## $ kitchen_qual                 <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ tot_rms_abv_grd              <dbl> 6, 9, 5, 7, 7, 8, 5, 5, 11, 7, 5, 6…
## $ functional                   <dbl> 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8,…
## $ fireplaces                   <dbl> 1, 1, 0, 1, 2, 2, 2, 0, 2, 1, 1, 0,…
## $ fireplace_qu                 <dbl> 4, 4, 1, 5, 4, 4, 4, 1, 5, 5, 3, 1,…
## $ garage_type_catN             <dbl> 0.1404059, 0.1404059, 0.1404059, 0.…
## $ garage_finish_catN           <dbl> 0.1666591, 0.1666591, -0.1968145, 0…
## $ garage_cars                  <dbl> 2, 3, 2, 2, 2, 2, 1, 1, 3, 3, 1, 2,…
## $ garage_area                  <dbl> 608, 836, 480, 636, 484, 468, 205, …
## $ garage_qual                  <dbl> 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4,…
## $ garage_cond                  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN             <dbl> 0.03148809, 0.03148809, 0.03148809,…
## $ wood_deck_sf                 <dbl> 0, 192, 40, 255, 235, 90, 0, 0, 147…
## $ open_porch_sf                <dbl> 42, 84, 30, 57, 204, 0, 4, 0, 21, 3…
## $ enclosed_porch               <dbl> 0, 0, 0, 0, 228, 205, 0, 0, 0, 0, 1…
## $ screen_porch                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_qc                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence                        <dbl> 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 3, 1,…
## $ misc_feature_catN            <dbl> 0.005977216, 0.005977216, -0.181971…
## $ sale_type_catN               <dbl> -0.02942813, -0.02942813, -0.029428…
## $ sale_condition_catN          <dbl> -0.01765492, -0.01765492, -0.017654…
## $ has_garage                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_remod         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_rare        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_160       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_30        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_50        <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60        <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ ms_sub_class_lev_x_90        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ms_zoning_lev_x_FV           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL           <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ ms_zoning_lev_x_RM           <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_Grvl             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1          <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ lot_shape_lev_x_IR2          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg          <dbl> 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ land_contour_lev_x_Bnk       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_Inside      <dbl> 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,…
## $ land_slope_lev_x_Gtl         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide   <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_CollgCr   <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ neighborhood_lev_x_Crawfor   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ neighborhood_lev_x_NoRidge   <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ neighborhood_lev_x_OldTown   <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ neighborhood_lev_x_Somerst   <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery      <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Norm        <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ bldg_type_lev_x_1Fam         <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ bldg_type_lev_x_Duplex       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ bldg_type_lev_x_Twnhs        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1_5Fin     <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story     <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ house_style_lev_x_2Story     <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ house_style_lev_x_SFoyer     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable       <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1,…
## $ roof_style_lev_x_Hip         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,…
## $ roof_matl_lev_x_CompShg      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior1st_lev_x_VinylSd    <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior2nd_lev_x_VinylSd    <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace   <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ mas_vnr_type_lev_x_None      <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ mas_vnr_type_lev_x_Stone     <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,…
## $ foundation_lev_x_BrkTil      <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_CBlock      <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,…
## $ foundation_lev_x_PConc       <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Av       <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Gd       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No       <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,…
## $ bsmt_exposure_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ heating_lev_x_GasA           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ central_air_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_type_lev_x_Attchd     <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_type_lev_x_BuiltIn    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_type_lev_x_Detchd     <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ garage_type_lev_x_None       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_finish_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn      <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_finish_lev_x_Unf      <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,…
## $ paved_drive_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None      <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,…
## $ misc_feature_lev_x_Shed      <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
## $ sale_type_lev_x_COD          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_New          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ sale_type_lev_x_WD           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal  <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ log_sale_price               <dbl> 12.31717, 12.42922, 11.87060, 12.63…

2 Build models

We will use the h2o framework to build multiple models. More specifically, its AutoML process, h2o.automl(), is a very handy way to train several models and automatically tune the hyperparameters of each algorithm.

h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         13 minutes 36 seconds 
##     H2O cluster timezone:       Europe/Paris 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.24.0.5 
##     H2O cluster version age:    18 days  
##     H2O cluster name:           H2O_started_from_R_alex_jif955 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   1.64 GB 
##     H2O cluster total cores:    2 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, XGBoost, Algos, AutoML, Core V3, Core V4 
##     R Version:                  R version 3.6.0 (2019-04-26)
h2o.no_progress()

The data has to be converted to H2O frames with as.h2o() before h2o can use it.

train_h2o <- as.h2o(x = train_treated)
valid_h2o <- as.h2o(x = valid_treated)
h2o.describe(train_h2o) %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px")
Label              Type  Missing  Zeros  PosInf  NegInf  Min           Max           Mean           Sigma         Cardinality
ms_sub_class_catN  real  0        349    0       0       -0.6086855    3.433566e-01  -0.0010771     2.259486e-01  NA
ms_zoning_catN     real  0        0      0       0       -0.6535452    2.269903e-01  0.0007190      1.608428e-01  NA
lot_frontage       real  0        0      0       0       21.0000000    3.130000e+02  69.9879470     2.150109e+01  NA
lot_area           int   0        0      0       0       1300.0000000  2.152450e+05  10504.1079855  1.010558e+04  NA
alley_catN         real  0        35     0       0       -0.4320284    1.696460e-02  0.0003160      7.342400e-02  NA
lot_shape_catN     real  0        1      0       0       -0.0983687    8.033851e-01  0.0037089      1.344989e-01  NA
land_contour_catN  real  0        669    0       0       -0.2459688    3.048004e-01  0.0029096      7.323370e-02  NA
lot_config_catN    real  0        441    0       0       -0.0374232    6.290317e-01  0.0038008      6.307520e-02  NA
neighborhood_catN  real  0        51     0       0       -0.6095687    6.809439e-01  0.0008942      2.872994e-01  NA
condition1_catN    real  0        28     0       0       -0.3162691    1.296756e-01  -0.0000929     7.735200e-02  NA

# Identify predictors and response
y <- "log_sale_price"
x <- setdiff(names(train_h2o), y)
# Run AutoML for up to 40 base models, stopping after 60 seconds
# (max_runtime_secs = 60 overrides the default 1 hour limit)
# Exclude Deep Learning
# The metric used on Kaggle is Root Mean Squared Logarithmic Error
# (we already log-transformed the response, so the metric is RMSE)
house_automl <- h2o.automl(x = x, y = y,
                           training_frame = train_h2o,
                           validation_frame = valid_h2o,
                           max_models = 40, max_runtime_secs = 60,
                           exclude_algos = c("DeepLearning"), 
                           sort_metric = "RMSE",
                           seed = 42)
# Extract the AutoML Leaderboard
house_lb <- house_automl@leaderboard

# View the best models (up to 10)
house_lb %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px")
model_id                                             mean_residual_deviance  rmse       mse        mae        rmsle
XGBoost_1_AutoML_20190707_193526                     0.0174348               0.1320411  0.0174348  0.0911977  0.0102442
XGBoost_2_AutoML_20190707_193526                     0.0174418               0.1320675  0.0174418  0.0906730  0.0102372
StackedEnsemble_BestOfFamily_AutoML_20190707_193526  0.0178124               0.1334630  0.0178124  0.0851239  0.0102857
StackedEnsemble_AllModels_AutoML_20190707_193526     0.0178208               0.1334944  0.0178208  0.0855003  0.0102978
XGBoost_3_AutoML_20190707_193526                     0.0182921               0.1352484  0.0182921  0.0933149  0.0104718
GBM_2_AutoML_20190707_193526                         0.0195830               0.1399394  0.0195830  0.0936482  0.0108621
GBM_1_AutoML_20190707_193526                         0.0196602               0.1402148  0.0196602  0.0938517  0.0108908
GLM_grid_1_AutoML_20190707_193526_model_1            0.0220872               0.1486175  0.0220872  0.0905069  0.0112702
DRF_1_AutoML_20190707_193526                         0.0239059               0.1546152  0.0239059  0.1049230  0.0120129
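
To see why RMSE is the right column to read here, the short base R sketch below uses made-up prices (they are not taken from the dataset) and checks that the RMSE computed on a log-transformed response equals the RMSE between the logs of the raw prices, which is how the Kaggle score is defined for this competition. It also suggests why the rmsle column is so much smaller: h2o appears to take log(1 + value) of whatever response it is given, so on an already-logged response that column applies a second log and should not be compared with the Kaggle score. That reading of the rmsle column is our assumption, not something verified in this post.

```r
# Hypothetical sale prices and predictions in dollars (made-up numbers)
actual    <- c(208500, 181500, 223500, 140000)
predicted <- c(200000, 190000, 230000, 150000)

# Competition-style metric: RMSE between the logs of predicted and actual prices
rmse_of_logs <- sqrt(mean((log(predicted) - log(actual))^2))

# What the model sees: the response is already log(price),
# so plain RMSE on that scale is the same quantity
log_actual    <- log(actual)
log_predicted <- log(predicted)
rmse_on_log_scale <- sqrt(mean((log_predicted - log_actual)^2))

print(all.equal(rmse_of_logs, rmse_on_log_scale))

# Applying log(1 + x) again to the logged values (assumed to be what the
# rmsle column does) gives a much smaller number than the RMSE
rmsle_on_log_scale <- sqrt(mean((log1p(log_predicted) - log1p(log_actual))^2))
print(c(rmse = rmse_on_log_scale, rmsle = rmsle_on_log_scale))
```

Predictions on the log scale can be turned back into dollar amounts with exp().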

Next, we look at the variable importance of the best model.

house_best <- h2o.getModel(model_id = house_automl@leader@model_id)

h2o.varimp_plot(house_best)

We can also inspect the parameters of the best model.

house_best@parameters
## $model_id
## [1] "XGBoost_1_AutoML_20190707_193526"
## 
## $training_frame
## [1] "automl_training_train_treated_sid_9702_1"
## 
## $validation_frame
## [1] "valid_treated_sid_9702_3"
## 
## $nfolds
## [1] 5
## 
## $keep_cross_validation_models
## [1] FALSE
## 
## $keep_cross_validation_predictions
## [1] TRUE
## 
## $fold_assignment
## [1] "Modulo"
## 
## $stopping_metric
## [1] "RMSE"
## 
## $stopping_tolerance
## [1] 0.03012376
## 
## $seed
## [1] 42
## 
## $ntrees
## [1] 139
## 
## $max_depth
## [1] 5
## 
## $min_rows
## [1] 3
## 
## $learn_rate
## [1] 0.05
## 
## $sample_rate
## [1] 0.8
## 
## $col_sample_rate
## [1] 0.8
## 
## $col_sample_rate_per_tree
## [1] 0.8
## 
## $score_tree_interval
## [1] 5
## 
## $x
##   [1] "ms_sub_class_catN"            "ms_zoning_catN"              
##   [3] "lot_frontage"                 "lot_area"                    
##   [5] "alley_catN"                   "lot_shape_catN"              
##   [7] "land_contour_catN"            "lot_config_catN"             
##   [9] "neighborhood_catN"            "condition1_catN"             
##  [11] "bldg_type_catN"               "house_style_catN"            
##  [13] "overall_qual"                 "year_built"                  
##  [15] "year_remod_add"               "roof_style_catN"             
##  [17] "exterior1st_catN"             "exterior2nd_catN"            
##  [19] "mas_vnr_type_catN"            "mas_vnr_area"                
##  [21] "exter_qual"                   "foundation_catN"             
##  [23] "bsmt_qual"                    "bsmt_cond"                   
##  [25] "bsmt_exposure_catN"           "bsmt_fin_type1"              
##  [27] "bsmt_fin_sf1"                 "bsmt_unf_sf"                 
##  [29] "total_bsmt_sf"                "heating_catN"                
##  [31] "heating_qc"                   "electrical"                  
##  [33] "x1st_flr_sf"                  "x2nd_flr_sf"                 
##  [35] "gr_liv_area"                  "bsmt_full_bath"              
##  [37] "full_bath"                    "half_bath"                   
##  [39] "bedroom_abv_gr"               "kitchen_abv_gr"              
##  [41] "kitchen_qual"                 "tot_rms_abv_grd"             
##  [43] "functional"                   "fireplaces"                  
##  [45] "fireplace_qu"                 "garage_type_catN"            
##  [47] "garage_finish_catN"           "garage_cars"                 
##  [49] "garage_area"                  "garage_qual"                 
##  [51] "garage_cond"                  "paved_drive_catN"            
##  [53] "wood_deck_sf"                 "open_porch_sf"               
##  [55] "enclosed_porch"               "screen_porch"                
##  [57] "pool_qc"                      "fence"                       
##  [59] "misc_feature_catN"            "sale_type_catN"              
##  [61] "sale_condition_catN"          "has_garage"                  
##  [63] "garage_yr_same_built"         "garage_yr_same_remod"        
##  [65] "ms_sub_class_lev_rare"        "ms_sub_class_lev_x_120"      
##  [67] "ms_sub_class_lev_x_160"       "ms_sub_class_lev_x_30"       
##  [69] "ms_sub_class_lev_x_50"        "ms_sub_class_lev_x_60"       
##  [71] "ms_sub_class_lev_x_90"        "ms_zoning_lev_x_FV"          
##  [73] "ms_zoning_lev_x_RL"           "ms_zoning_lev_x_RM"          
##  [75] "alley_lev_x_Grvl"             "alley_lev_x_None"            
##  [77] "lot_shape_lev_x_IR1"          "lot_shape_lev_x_IR2"         
##  [79] "lot_shape_lev_x_Reg"          "land_contour_lev_x_Bnk"      
##  [81] "land_contour_lev_x_HLS"       "land_contour_lev_x_Low"      
##  [83] "lot_config_lev_x_CulDSac"     "lot_config_lev_x_Inside"     
##  [85] "land_slope_lev_x_Gtl"         "neighborhood_lev_x_BrkSide"  
##  [87] "neighborhood_lev_x_CollgCr"   "neighborhood_lev_x_Crawfor"  
##  [89] "neighborhood_lev_x_Edwards"   "neighborhood_lev_x_NAmes"    
##  [91] "neighborhood_lev_x_NoRidge"   "neighborhood_lev_x_NridgHt"  
##  [93] "neighborhood_lev_x_OldTown"   "neighborhood_lev_x_Sawyer"   
##  [95] "neighborhood_lev_x_Somerst"   "neighborhood_lev_x_Timber"   
##  [97] "condition1_lev_x_Artery"      "condition1_lev_x_Feedr"      
##  [99] "condition1_lev_x_Norm"        "bldg_type_lev_x_1Fam"        
## [101] "bldg_type_lev_x_Duplex"       "bldg_type_lev_x_Twnhs"       
## [103] "house_style_lev_x_1_5Fin"     "house_style_lev_x_1Story"    
## [105] "house_style_lev_x_2Story"     "house_style_lev_x_SFoyer"    
## [107] "roof_style_lev_x_Gable"       "roof_style_lev_x_Hip"        
## [109] "roof_matl_lev_x_CompShg"      "exterior1st_lev_x_CemntBd"   
## [111] "exterior1st_lev_x_MetalSd"    "exterior1st_lev_x_VinylSd"   
## [113] "exterior1st_lev_x_Wd_Sdng"    "exterior2nd_lev_x_MetalSd"   
## [115] "exterior2nd_lev_x_VinylSd"    "exterior2nd_lev_x_Wd_Sdng"   
## [117] "mas_vnr_type_lev_x_BrkFace"   "mas_vnr_type_lev_x_None"     
## [119] "mas_vnr_type_lev_x_Stone"     "foundation_lev_x_BrkTil"     
## [121] "foundation_lev_x_CBlock"      "foundation_lev_x_PConc"      
## [123] "bsmt_exposure_lev_x_Av"       "bsmt_exposure_lev_x_Gd"      
## [125] "bsmt_exposure_lev_x_No"       "bsmt_exposure_lev_x_None"    
## [127] "heating_lev_x_GasA"           "central_air_lev_x_N"         
## [129] "central_air_lev_x_Y"          "garage_type_lev_x_Attchd"    
## [131] "garage_type_lev_x_BuiltIn"    "garage_type_lev_x_Detchd"    
## [133] "garage_type_lev_x_None"       "garage_finish_lev_x_Fin"     
## [135] "garage_finish_lev_x_None"     "garage_finish_lev_x_RFn"     
## [137] "garage_finish_lev_x_Unf"      "paved_drive_lev_x_N"         
## [139] "paved_drive_lev_x_Y"          "misc_feature_lev_x_None"     
## [141] "misc_feature_lev_x_Shed"      "sale_type_lev_x_COD"         
## [143] "sale_type_lev_x_New"          "sale_type_lev_x_WD"          
## [145] "sale_condition_lev_x_Abnorml" "sale_condition_lev_x_Normal" 
## [147] "sale_condition_lev_x_Partial"
## 
## $y
## [1] "log_sale_price"
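
Two of the values printed above can be reproduced with a few lines of base R. The sketch below assumes that "Modulo" fold assignment means row i goes to fold i mod nfolds, and uses the 1,102 training rows reported by glimpse(). It also computes 1 / sqrt(1102), which matches the printed stopping_tolerance to the displayed precision. We take that as a hint that the tolerance was derived from the number of rows, not as documented behaviour.

```r
n_rows <- 1102  # rows in train_treated, from the glimpse() output
nfolds <- 5     # from house_best@parameters

# Modulo assignment (assumed): row i, counting from 0, goes to fold i mod nfolds
fold_id <- (seq_len(n_rows) - 1) %% nfolds
fold_sizes <- table(fold_id)
print(fold_sizes)

# Compare 1 / sqrt(n_rows) with the reported stopping_tolerance
tolerance_guess <- 1 / sqrt(n_rows)
print(round(tolerance_guess, 8))
```

With this scheme the folds differ in size by at most one row, and the assignment is deterministic for a given row order.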

3 Prepare test data

The test dataset will be prepared using the same steps and treatment plan created in Part 2.

# the 'test' data comes from the pre-prepared data from Part 1
test <- readRDS("01-full_train_test.rds") %>% 
  filter(df_id == "test") %>% 
  select(-df_id, -sale_price)
glimpse(test)
## Observations: 1,459
## Variables: 80
## $ id              <dbl> 1461, 1462, 1463, 1464, 1465, 1466, 1467, 1468, …
## $ ms_sub_class    <fct> 20, 20, 60, 60, 120, 60, 20, 60, 20, 20, 120, 16…
## $ ms_zoning       <fct> RH, RL, RL, RL, RL, RL, RL, RL, RL, RL, RH, RM, …
## $ lot_frontage    <dbl> 80, 81, 74, 78, 43, 75, NA, 63, 85, 70, 26, 21, …
## $ lot_area        <dbl> 11622, 14267, 13830, 9978, 5005, 10000, 7980, 84…
## $ street          <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, …
## $ alley           <fct> None, None, None, None, None, None, None, None, …
## $ lot_shape       <fct> Reg, IR1, IR1, IR1, IR1, IR1, IR1, IR1, Reg, Reg…
## $ land_contour    <fct> Lvl, Lvl, Lvl, Lvl, HLS, Lvl, Lvl, Lvl, Lvl, Lvl…
## $ utilities       <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
## $ lot_config      <fct> Inside, Corner, Inside, Inside, Inside, Corner, …
## $ land_slope      <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl…
## $ neighborhood    <fct> NAmes, NAmes, Gilbert, Gilbert, StoneBr, Gilbert…
## $ condition1      <fct> Feedr, Norm, Norm, Norm, Norm, Norm, Norm, Norm,…
## $ condition2      <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, …
## $ bldg_type       <fct> 1Fam, 1Fam, 1Fam, 1Fam, TwnhsE, 1Fam, 1Fam, 1Fam…
## $ house_style     <fct> 1Story, 1Story, 2Story, 2Story, 1Story, 2Story, …
## $ overall_qual    <ord> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6, 5, 6, 7, 9, …
## $ overall_cond    <ord> 6, 6, 5, 6, 5, 5, 7, 5, 5, 5, 5, 5, 5, 6, 6, 5, …
## $ year_built      <dbl> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, …
## $ year_remod_add  <dbl> 1961, 1958, 1998, 1998, 1992, 1994, 2007, 1998, …
## $ roof_style      <fct> Gable, Hip, Gable, Gable, Gable, Gable, Gable, G…
## $ roof_matl       <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com…
## $ exterior1st     <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB…
## $ exterior2nd     <fct> VinylSd, Wd Sdng, VinylSd, VinylSd, HdBoard, HdB…
## $ mas_vnr_type    <fct> None, BrkFace, None, BrkFace, None, None, None, …
## $ mas_vnr_area    <dbl> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0, 504, 492, 0,…
## $ exter_qual      <ord> TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, Gd, TA, …
## $ exter_cond      <ord> TA, TA, TA, TA, TA, TA, Gd, TA, TA, TA, TA, TA, …
## $ foundation      <fct> CBlock, CBlock, PConc, PConc, PConc, PConc, PCon…
## $ bsmt_qual       <ord> TA, TA, Gd, TA, Gd, Gd, Gd, Gd, Gd, TA, Gd, TA, …
## $ bsmt_cond       <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ bsmt_exposure   <fct> No, No, No, No, No, No, No, No, Gd, No, No, No, …
## $ bsmt_fin_type1  <ord> Rec, ALQ, GLQ, GLQ, ALQ, Unf, ALQ, Unf, GLQ, ALQ…
## $ bsmt_fin_sf1    <dbl> 468, 923, 791, 602, 263, 0, 935, 0, 637, 804, 10…
## $ bsmt_fin_type2  <ord> LwQ, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Unf, Rec…
## $ bsmt_fin_sf2    <dbl> 144, 0, 0, 0, 0, 0, 0, 0, 0, 78, 0, 0, 0, 0, 0, …
## $ bsmt_unf_sf     <dbl> 270, 406, 137, 324, 1017, 763, 233, 789, 663, 0,…
## $ total_bsmt_sf   <dbl> 882, 1329, 928, 926, 1280, 763, 1168, 789, 1300,…
## $ heating         <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, …
## $ heating_qc      <ord> TA, TA, Gd, Ex, Ex, Gd, Ex, Gd, Gd, TA, Ex, TA, …
## $ central_air     <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical      <ord> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,…
## $ x1st_flr_sf     <dbl> 896, 1329, 928, 926, 1280, 763, 1187, 789, 1341,…
## $ x2nd_flr_sf     <dbl> 0, 0, 701, 678, 0, 892, 0, 676, 0, 0, 0, 504, 56…
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area     <dbl> 896, 1329, 1629, 1604, 1280, 1655, 1187, 1465, 1…
## $ bsmt_full_bath  <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, …
## $ bsmt_half_bath  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath       <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 1, 2, 1, 2, …
## $ half_bath       <dbl> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, …
## $ bedroom_abv_gr  <dbl> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2, 3, 3, 2, 3, …
## $ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual    <ord> TA, Gd, TA, Gd, Gd, TA, TA, TA, Gd, TA, Gd, TA, …
## $ tot_rms_abv_grd <dbl> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5, 6, 6, 4, 10,…
## $ functional      <ord> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ…
## $ fireplaces      <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, …
## $ fireplace_qu    <ord> None, None, TA, Gd, None, TA, None, Gd, Po, None…
## $ garage_type     <fct> Attchd, Attchd, Attchd, Attchd, Attchd, Attchd, …
## $ garage_yr_blt   <fct> 1961, 1958, 1997, 1998, 1992, 1993, 1992, 1998, …
## $ garage_finish   <fct> Unf, Unf, Fin, Fin, RFn, Fin, Fin, Fin, Unf, Fin…
## $ garage_cars     <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1, 3, …
## $ garage_area     <dbl> 730, 312, 482, 470, 506, 440, 420, 393, 506, 525…
## $ garage_qual     <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ garage_cond     <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ paved_drive     <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ wood_deck_sf    <dbl> 140, 393, 212, 360, 0, 157, 483, 0, 192, 240, 20…
## $ open_porch_sf   <dbl> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0, 68, 0, 0, 0…
## $ enclosed_porch  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ x3ssn_porch     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ screen_porch    <dbl> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc         <ord> None, None, None, None, None, None, None, None, …
## $ fence           <ord> MnPrv, None, MnPrv, None, None, None, GdPrv, Non…
## $ misc_feature    <fct> None, Gar2, None, None, None, None, Shed, None, …
## $ misc_val        <dbl> 0, 12500, 0, 0, 0, 0, 500, 0, 0, 0, 0, 0, 0, 0, …
## $ mo_sold         <dbl> 6, 6, 3, 6, 1, 4, 3, 5, 2, 4, 6, 2, 3, 6, 6, 1, …
## $ yr_sold         <dbl> 2010, 2010, 2010, 2010, 2010, 2010, 2010, 2010, …
## $ sale_type       <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, COD,…
## $ sale_condition  <fct> Normal, Normal, Normal, Normal, Normal, Normal, …

As a reminder, we applied the following steps to prepare the train_treated data, and we will apply the same steps to the test data:

  • cyclical-transform the mo_sold variable
  • binary-transform the garage_yr_blt variable
  • numeric-transform the ordinal variables
  • apply the treatment plan to get a numeric dataset with no missing values
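
As a rough illustration of the first and third steps, here is a minimal base R sketch. The function cyclical_sketch() is a hypothetical stand-in for the cyclical_transform() helper from Part 2, whose actual implementation may differ (for example in column names, or in whether it drops the original column). The level order used for fireplace_qu is also an assumption, chosen to be consistent with the values shown for the test data.

```r
# Hypothetical stand-in for the Part 2 helper: encode a periodic variable
# as a point on the unit circle so that month 12 sits next to month 1
cyclical_sketch <- function(df, column_name, period = 12) {
  angle <- 2 * pi * df[[column_name]] / period
  df[[paste0(column_name, "_sin")]] <- sin(angle)
  df[[paste0(column_name, "_cos")]] <- cos(angle)
  df
}

months <- cyclical_sketch(data.frame(mo_sold = c(12, 1, 6)), "mo_sold")

# Euclidean distance between two rows in (sin, cos) space
month_dist <- function(i, j) {
  sqrt((months$mo_sold_sin[i] - months$mo_sold_sin[j])^2 +
       (months$mo_sold_cos[i] - months$mo_sold_cos[j])^2)
}
dec_jan <- month_dist(1, 2)  # December vs January: neighbours on the circle
dec_jun <- month_dist(1, 3)  # December vs June: opposite sides of the circle
print(c(dec_jan = dec_jan, dec_jun = dec_jun))

# Ordinal to numeric: as.numeric() on an ordered factor returns the position
# of each value among the levels (level order assumed here)
fireplace_qu <- factor(c("None", "TA", "Gd", "Po"),
                       levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"),
                       ordered = TRUE)
fireplace_num <- as.numeric(fireplace_qu)
print(fireplace_num)
```

The raw month number would place December (12) and January (1) eleven units apart, whereas the sine and cosine pair keeps them adjacent.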
# Prepare 'test' data
test_treated <- test %>% 
  cyclical_transform(column_name = "mo_sold") %>% 
  garage_year_transform() %>% 
  mutate_if(is.ordered, as.numeric) %>% 
  prepare(treatmentplan = vtreat_plan$treatments,
          pruneSig = vtreat_prune_sig)
glimpse(test_treated)
## Observations: 1,459
## Variables: 147
## $ ms_sub_class_catN            <dbl> 0.02487946, 0.02487946, 0.31707790,…
## $ ms_zoning_catN               <dbl> -0.53309450, 0.06141452, 0.06141452…
## $ lot_frontage                 <dbl> 80.0000, 81.0000, 74.0000, 78.0000,…
## $ lot_area                     <dbl> 11622, 14267, 13830, 9978, 5005, 10…
## $ alley_catN                   <dbl> 0.01468384, 0.01468384, 0.01468384,…
## $ lot_shape_catN               <dbl> -0.09031732, 0.15279191, 0.15279191…
## $ land_contour_catN            <dbl> -0.005418798, -0.005418798, -0.0054…
## $ lot_config_catN              <dbl> -0.02389266, 0.00000000, -0.0238926…
## $ neighborhood_catN            <dbl> -0.15777407, -0.15777407, 0.1267109…
## $ condition1_catN              <dbl> -0.2327314, 0.0209811, 0.0209811, 0…
## $ bldg_type_catN               <dbl> 0.02452499, 0.02452499, 0.02452499,…
## $ house_style_catN             <dbl> -0.03114034, -0.03114034, 0.1641937…
## $ overall_qual                 <dbl> 5, 6, 5, 6, 8, 6, 6, 6, 7, 4, 7, 6,…
## $ year_built                   <dbl> 1961, 1958, 1997, 1998, 1992, 1993,…
## $ year_remod_add               <dbl> 1961, 1958, 1998, 1998, 1992, 1994,…
## $ roof_style_catN              <dbl> -0.04345073, 0.18122306, -0.0434507…
## $ exterior1st_catN             <dbl> 0.17180733, -0.17412791, 0.17180733…
## $ exterior2nd_catN             <dbl> 0.17412141, -0.17021159, 0.17412141…
## $ mas_vnr_type_catN            <dbl> -0.1297748, 0.1514415, -0.1297748, …
## $ mas_vnr_area                 <dbl> 0, 108, 0, 20, 0, 0, 0, 0, 0, 0, 0,…
## $ exter_qual                   <dbl> 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 4, 3,…
## $ foundation_catN              <dbl> -0.1589739, -0.1589739, 0.2294760, …
## $ bsmt_qual                    <dbl> 4, 4, 5, 4, 5, 5, 5, 5, 5, 4, 5, 4,…
## $ bsmt_cond                    <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ bsmt_exposure_catN           <dbl> -0.08084167, -0.08084167, -0.080841…
## $ bsmt_fin_type1               <dbl> 4, 6, 7, 7, 6, 2, 6, 2, 7, 6, 7, 4,…
## $ bsmt_fin_sf1                 <dbl> 468, 923, 791, 602, 263, 0, 935, 0,…
## $ bsmt_unf_sf                  <dbl> 270, 406, 137, 324, 1017, 763, 233,…
## $ total_bsmt_sf                <dbl> 882, 1329, 928, 926, 1280, 763, 116…
## $ heating_catN                 <dbl> 0.008521347, 0.008521347, 0.0085213…
## $ heating_qc                   <dbl> 3, 3, 4, 5, 5, 4, 5, 4, 4, 3, 5, 3,…
## $ electrical                   <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf                  <dbl> 896, 1329, 928, 926, 1280, 763, 118…
## $ x2nd_flr_sf                  <dbl> 0, 0, 701, 678, 0, 892, 0, 676, 0, …
## $ gr_liv_area                  <dbl> 896, 1329, 1629, 1604, 1280, 1655, …
## $ bsmt_full_bath               <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,…
## $ full_bath                    <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 2, 1,…
## $ half_bath                    <dbl> 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1,…
## $ bedroom_abv_gr               <dbl> 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 2,…
## $ kitchen_abv_gr               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ kitchen_qual                 <dbl> 3, 4, 3, 4, 4, 3, 3, 3, 4, 3, 4, 3,…
## $ tot_rms_abv_grd              <dbl> 5, 6, 6, 7, 5, 7, 6, 7, 5, 4, 5, 5,…
## $ functional                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,…
## $ fireplaces                   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0,…
## $ fireplace_qu                 <dbl> 1, 1, 4, 5, 1, 4, 1, 5, 2, 1, 3, 1,…
## $ garage_type_catN             <dbl> 0.1336612, 0.1336612, 0.1336612, 0.…
## $ garage_finish_catN           <dbl> -0.2014492, -0.2014492, 0.2872421, …
## $ garage_cars                  <dbl> 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1,…
## $ garage_area                  <dbl> 730, 312, 482, 470, 506, 440, 420, …
## $ garage_qual                  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ garage_cond                  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN             <dbl> 0.03529355, 0.03529355, 0.03529355,…
## $ wood_deck_sf                 <dbl> 140, 393, 212, 360, 0, 157, 483, 0,…
## $ open_porch_sf                <dbl> 0, 36, 34, 36, 82, 84, 21, 75, 0, 0…
## $ enclosed_porch               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ screen_porch                 <dbl> 120, 0, 0, 0, 144, 0, 0, 0, 0, 0, 0…
## $ pool_qc                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence                        <dbl> 4, 1, 4, 1, 1, 1, 5, 1, 1, 4, 1, 1,…
## $ misc_feature_catN            <dbl> 0.006052505, 0.000000000, 0.0060525…
## $ sale_type_catN               <dbl> -0.02981943, -0.02981943, -0.029819…
## $ sale_condition_catN          <dbl> -0.02056441, -0.02056441, -0.020564…
## $ has_garage                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_remod         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_rare        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120       <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ ms_sub_class_lev_x_160       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ms_sub_class_lev_x_30        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_50        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60        <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_90        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_FV           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL           <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,…
## $ ms_zoning_lev_x_RM           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ alley_lev_x_Grvl             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1          <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0,…
## $ lot_shape_lev_x_IR2          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg          <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,…
## $ land_contour_lev_x_Bnk       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS       <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_Inside      <dbl> 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1,…
## $ land_slope_lev_x_Gtl         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_CollgCr   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Crawfor   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes     <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0,…
## $ neighborhood_lev_x_NoRidge   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_OldTown   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Somerst   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr       <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Norm        <dbl> 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ bldg_type_lev_x_1Fam         <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ bldg_type_lev_x_Duplex       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bldg_type_lev_x_Twnhs        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ house_style_lev_x_1_5Fin     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story     <dbl> 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0,…
## $ house_style_lev_x_2Story     <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1,…
## $ house_style_lev_x_SFoyer     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable       <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ roof_style_lev_x_Hip         <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_matl_lev_x_CompShg      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior1st_lev_x_VinylSd    <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng    <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior2nd_lev_x_VinylSd    <dbl> 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng    <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace   <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ mas_vnr_type_lev_x_None      <dbl> 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ mas_vnr_type_lev_x_Stone     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_BrkTil      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_CBlock      <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ foundation_lev_x_PConc       <dbl> 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0,…
## $ bsmt_exposure_lev_x_Av       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_Gd       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ bsmt_exposure_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heating_lev_x_GasA           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ central_air_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_type_lev_x_Attchd     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ garage_type_lev_x_BuiltIn    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_type_lev_x_Detchd     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ garage_type_lev_x_None       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin      <dbl> 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,…
## $ garage_finish_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn      <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Unf      <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,…
## $ paved_drive_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None      <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_Shed      <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_COD          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ sale_type_lev_x_New          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_WD           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…

4 Create predictions

First, import the test_treated data into the h2o framework.

test_h2o <- as.h2o(x = test_treated)

Predict on the test_treated set.

house_pred <- h2o.predict(house_automl@leader, newdata = test_h2o)
head(house_pred)
##    predict
## 1 11.60951
## 2 11.97460
## 3 12.16442
## 4 12.15858
## 5 12.21806
## 6 12.03056

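Recall that sale_price was log-transformed before modeling, so these predictions are on the log scale. The Kaggle submission requires sale_price itself, not log_sale_price, so we back-transform the predictions with exp().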

head(exp(house_pred))
##   exp(predict)
## 1     110140.1
## 2     158673.0
## 3     191840.8
## 4     190724.4
## 5     202412.0
## 6     167804.8
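
The back-transformation can be checked on a few made-up prices. The following is a minimal sketch in base R, assuming the natural logarithm was used in the earlier parts; the `prices` vector is hypothetical and not taken from the dataset.

```r
# Hypothetical sale prices (illustration only, not from the housing data)
prices <- c(110000, 158500, 192000)

# Forward transform: natural log, giving values on the same scale as `predict`
log_prices <- log(prices)

# Back-transform: exp() undoes log(), recovering the original scale
recovered <- exp(log_prices)

# TRUE when the round trip returns the original prices (up to floating point)
all.equal(recovered, prices)
```

The log values land between 11 and 13, the same range as the predictions shown above, which is a quick way to confirm that the model output is on the log scale.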

We can now create the file that will be submitted to Kaggle.

# Create final csv file
final <- tibble(Id = test$id,
                SalePrice = as.vector(exp(house_pred)))

final
## # A tibble: 1,459 x 2
##       Id SalePrice
##    <dbl>     <dbl>
##  1  1461   110140.
##  2  1462   158673.
##  3  1463   191841.
##  4  1464   190724.
##  5  1465   202412.
##  6  1466   167805.
##  7  1467   172645.
##  8  1468   160331.
##  9  1469   186073.
## 10  1470   127539.
## # … with 1,449 more rows
# Pass the file name positionally: recent readr versions deprecate `path` in favor of `file`
write_csv(final, "h2o_final.csv")
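
Before uploading, it can be worth running a few basic checks on the submission table. The sketch below uses base R only; `check_submission` is a hypothetical helper (not part of any package), and `toy` is a small stand-in for `final` that reuses the first three predictions shown above.

```r
# Hypothetical helper: basic checks on a submission table before writing it
check_submission <- function(df, expected_rows) {
  stopifnot(
    identical(names(df), c("Id", "SalePrice")),  # column names used in `final`
    nrow(df) == expected_rows,                   # one row per test house
    !anyNA(df$SalePrice),                        # no missing predictions
    all(df$SalePrice > 0),                       # prices must be positive
    anyDuplicated(df$Id) == 0                    # each Id appears only once
  )
  invisible(TRUE)
}

# Toy data frame standing in for `final`
toy <- data.frame(Id = 1461:1463,
                  SalePrice = c(110140.1, 158673.0, 191840.8))
check_submission(toy, expected_rows = 3)
```

With the real objects, the call would presumably be `check_submission(final, expected_rows = nrow(test))`; it stops with an error as soon as one of the conditions fails.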

On Kaggle, the RMSLE score is 0.13304, which is close to the error we obtained when building models with AutoML. This suggests that the final model did not overfit and generalizes reasonably well to new, unseen data.
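
For reference, the RMSLE can be sketched in a few lines of base R. This is an illustration of the metric as described on the competition page (the RMSE between the logarithm of the predicted price and the logarithm of the observed price), not Kaggle's own code; the `actual` and `predicted` vectors are hypothetical.

```r
# Root mean squared logarithmic error between observed and predicted prices
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Plain RMSE, for comparison on the log scale
rmse <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}

# Hypothetical observed and predicted prices (illustration only)
actual    <- c(120000, 150000, 200000)
predicted <- c(110000, 160000, 190000)

# RMSLE on the price scale equals RMSE on the log scale
rmsle(actual, predicted)
rmse(log(actual), log(predicted))
```

Because the model was trained on log_sale_price, an RMSE reported during training is on the same scale as the Kaggle RMSLE, which is what makes the two numbers comparable.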

To get a better score, more feature engineering would probably help. However, feature engineering usually depends on domain knowledge. In a real-world scenario, we would get in touch with professionals in the field to get a sense of which new variables could be useful.
Another approach would be to treat the best model obtained with AutoML as a baseline and try a few tweaks to its hyperparameters.

# close h2o framework
h2o.shutdown(prompt = FALSE)