Once the basic cleaning is done (see Part 1), we can move on to the next phase.
In this notebook, we will concentrate on two tasks:

  • feature engineering
  • variable preparation for machine learning (variable treatment with the vtreat package)

1 Load data and libraries

library(tidyverse)   # data manipulation
library(vtreat)      # variable preparation for ML
library(kableExtra)  # customize table output

Load the data and take a first look.

# Import the 'full' dataset created in Part 1
full <- readRDS("01-full_train_test.rds")

# Split the data to retrieve the original train dataset
train <- full %>% 
  filter(df_id == "train") %>% 
  select(-df_id)

rm(full)
# Structure of the train dataset
glimpse(train)
## Observations: 1,459
## Variables: 81
## $ id              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ ms_sub_class    <fct> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,…
## $ ms_zoning       <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, …
## $ lot_frontage    <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, …
## $ lot_area        <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10…
## $ street          <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, …
## $ alley           <fct> None, None, None, None, None, None, None, None, …
## $ lot_shape       <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg…
## $ land_contour    <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl…
## $ utilities       <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
## $ lot_config      <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside…
## $ land_slope      <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl…
## $ neighborhood    <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit…
## $ condition1      <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,…
## $ condition2      <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, …
## $ bldg_type       <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, …
## $ house_style     <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, …
## $ overall_qual    <ord> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, …
## $ overall_cond    <ord> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, …
## $ year_built      <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, …
## $ year_remod_add  <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, …
## $ roof_style      <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,…
## $ roof_matl       <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com…
## $ exterior1st     <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin…
## $ exterior2nd     <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin…
## $ mas_vnr_type    <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto…
## $ mas_vnr_area    <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, …
## $ exter_qual      <ord> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, …
## $ exter_cond      <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ foundation      <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc…
## $ bsmt_qual       <ord> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, …
## $ bsmt_cond       <ord> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ bsmt_exposure   <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, …
## $ bsmt_fin_type1  <ord> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ…
## $ bsmt_fin_sf1    <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,…
## $ bsmt_fin_type2  <ord> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf…
## $ bsmt_fin_sf2    <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_unf_sf     <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,…
## $ total_bsmt_sf   <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,…
## $ heating         <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, …
## $ heating_qc      <ord> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, …
## $ central_air     <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical      <ord> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,…
## $ x1st_flr_sf     <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022…
## $ x2nd_flr_sf     <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, …
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area     <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, …
## $ bsmt_full_bath  <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, …
## $ bsmt_half_bath  <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath       <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, …
## $ half_bath       <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ bedroom_abv_gr  <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, …
## $ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual    <ord> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, …
## $ tot_rms_abv_grd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,…
## $ functional      <ord> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty…
## $ fireplaces      <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, …
## $ fireplace_qu    <ord> None, TA, TA, Gd, TA, None, Gd, TA, TA, TA, None…
## $ garage_type     <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, …
## $ garage_yr_blt   <fct> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, …
## $ garage_finish   <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn…
## $ garage_cars     <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, …
## $ garage_area     <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205…
## $ garage_qual     <ord> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, …
## $ garage_cond     <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ paved_drive     <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ wood_deck_sf    <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, …
## $ open_porch_sf   <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, …
## $ enclosed_porch  <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, …
## $ x3ssn_porch     <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ screen_porch    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0…
## $ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc         <ord> None, None, None, None, None, None, None, None, …
## $ fence           <ord> None, None, None, None, None, MnPrv, None, None,…
## $ misc_feature    <fct> None, None, None, None, None, Shed, None, Shed, …
## $ misc_val        <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,…
## $ mo_sold         <dbl> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, …
## $ yr_sold         <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, …
## $ sale_type       <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,…
## $ sale_condition  <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,…
## $ sale_price      <dbl> 208500, 181500, 223500, 140000, 250000, 143000, …

2 Feature engineering

2.1 sale_price

As mentioned in Part 1, the Kaggle competition scores submissions on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price. We therefore model the log-transformed price directly.

# log-transform the sale_price
train$log_sale_price <- log(train$sale_price)
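
To make the metric concrete, here is a minimal sketch of the RMSE computed on log prices. The helper name rmse_log and the toy prices are illustrative assumptions, not part of the competition code.

```r
# Hypothetical helper: RMSE between log(predicted) and log(observed)
rmse_log <- function(predicted, observed) {
  sqrt(mean((log(predicted) - log(observed))^2))
}

# Toy values (made up for illustration)
observed  <- c(200000, 150000, 300000)
predicted <- c(210000, 140000, 300000)

rmse_log(predicted, observed)

# Over- or under-predicting by the same ratio costs the same on the log scale
all.equal(rmse_log(2 * observed, observed), rmse_log(observed / 2, observed))
```

Because the target is stored as log_sale_price, a plain RMSE on the model output should correspond to this metric.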

Density plots of sale_price before and after the log transformation.

gridExtra::grid.arrange(
  ggplot(train, aes(x = sale_price)) +
    geom_density(),
  ggplot(train, aes(x = log_sale_price)) +
    geom_density()
)

# remove the 'sale_price' column
train$sale_price <- NULL

2.2 mo_sold

The mo_sold variable (‘Month Sold’) can be recoded as a “cyclical” variable, which preserves the circular nature of the calendar.
Indeed, with a 1-12 encoding, we tell the model that months 4 and 5 are very similar, while months 1 and 12 are very dissimilar. In fact, months 1 and 12 are just as close as months 4 and 5.
The variable can be transformed into cyclical values using sin/cos transformations (http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).

train %>% 
  left_join(tibble(mo_sold = 1:12,
                   mo_sold_abbr = month.abb), by = "mo_sold") %>% 
  mutate(mo_sold_sin_trans = sin(2*pi * mo_sold / 12),
         mo_sold_cos_trans = cos(2*pi * mo_sold / 12)) %>% 
  distinct(mo_sold, mo_sold_sin_trans, mo_sold_cos_trans, mo_sold_abbr) %>% 
  ggplot(aes(x = mo_sold_cos_trans, y = mo_sold_sin_trans)) +
    geom_point() +
    geom_label(aes(label = paste(mo_sold_abbr, "\n cos : ", round(mo_sold_cos_trans, 4), "\n sin : ", round(mo_sold_sin_trans, 4))), size = 3) +
  scale_x_continuous(limits = c(-1.3, 1.3)) +
  scale_y_continuous(limits = c(-1.3, 1.3))
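
The claim that months 1 and 12 end up as close as months 4 and 5 can be checked numerically. The sketch below uses base R only; the helper month_dist is a hypothetical name introduced for illustration.

```r
# Encode months 1..12 on the unit circle
months <- 1:12
enc <- cbind(sin = sin(2 * pi * months / 12),
             cos = cos(2 * pi * months / 12))

# Hypothetical helper: Euclidean distance between two encoded months
month_dist <- function(a, b) sqrt(sum((enc[a, ] - enc[b, ])^2))

month_dist(4, 5)    # two neighbouring months in spring
month_dist(12, 1)   # December and January: same distance
month_dist(1, 7)    # opposite months: the largest possible distance

# With the raw 1-12 encoding the two gaps differ a lot
abs(4 - 5)
abs(12 - 1)
```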

The following function will then be applied to both the train and test datasets.

# function to transform a month column (1-12) into two cyclical variables
cyclical_transform <- function(df, column_name) {
  # extract the column as a plain vector (works for data frames and tibbles)
  values <- df[[column_name]]

  # create the new cyclical variables (period of 12 months)
  df[[paste0(column_name, "_sin_trans")]] <- sin(2 * pi * values / 12)
  df[[paste0(column_name, "_cos_trans")]] <- cos(2 * pi * values / 12)

  # remove the original variable
  df[[column_name]] <- NULL

  return(df)
}

# train$mo_sold_sin_trans <- sin(2*pi * train$mo_sold / 12)
# train$mo_sold_cos_trans <- cos(2*pi * train$mo_sold / 12)
# 
# train %>% 
#   select(mo_sold, mo_sold_sin_trans, mo_sold_cos_trans)
# cyclical transformation of the 'mo_sold' variable
train <- cyclical_transform(df = train, column_name = "mo_sold")

2.3 garage_yr_blt

In Part 1, we temporarily converted the garage_yr_blt variable into a categorical variable, so that it could also encode the absence of a garage.

train %>% 
  select(garage_yr_blt)
## # A tibble: 1,459 x 1
##    garage_yr_blt
##    <fct>        
##  1 2003         
##  2 1976         
##  3 2001         
##  4 1998         
##  5 2000         
##  6 1993         
##  7 2004         
##  8 1973         
##  9 1931         
## 10 1939         
## # … with 1,449 more rows

The garage_yr_blt variable holds two kinds of values:

  • if a garage exists, the year it was built
  • if there is no garage, the value None

There are 2 other year-related variables: year_built and year_remod_add.

So, the idea for recoding garage_yr_blt is the following:

  • add a new binary column has_garage
  • add 2 new binary columns, garage_yr_same_built and garage_yr_same_remod, flagging whether the garage was built in the same year as the house or as its remodeling

This aims to retain as much of the information in garage_yr_blt as possible.

# Observe the result with new columns
train %>% 
  mutate(has_garage = as.numeric(garage_type != "None")) %>% 
  select(year_built, year_remod_add, has_garage, garage_yr_blt) %>% 
  mutate(garage_yr_same_built = garage_yr_blt == year_built,
         garage_yr_same_remod = year_built != year_remod_add & garage_yr_blt == year_remod_add) %>% 
  head(10) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px")
year_built  year_remod_add  has_garage  garage_yr_blt  garage_yr_same_built  garage_yr_same_remod
      2003            2003           1  2003           TRUE                  FALSE
      1976            1976           1  1976           TRUE                  FALSE
      2001            2002           1  2001           TRUE                  FALSE
      1915            1970           1  1998           FALSE                 FALSE
      2000            2000           1  2000           TRUE                  FALSE
      1993            1995           1  1993           TRUE                  FALSE
      2004            2005           1  2004           TRUE                  FALSE
      1973            1973           1  1973           TRUE                  FALSE
      1931            1950           1  1931           TRUE                  FALSE
      1939            1950           1  1939           TRUE                  FALSE

Let’s wrap this in a function, so that we can also apply it to the test dataset.

garage_year_transform <- function(df) {
  df <- df %>% 
    mutate(has_garage = as.numeric(garage_type != "None"),
           garage_yr_same_built = as.numeric(garage_yr_blt == year_built),
           garage_yr_same_remod = as.numeric(year_built != year_remod_add & garage_yr_blt == year_remod_add)) %>% 
    select(-garage_yr_blt)
  
  return(df)
}

Apply to the train dataset.

train <- garage_year_transform(train)
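
To see how the three flags behave, including for a house without a garage, here is a small self-contained sketch on made-up rows. It mirrors the logic above but compares the years as character strings, which is an assumption made here to keep the handling of the None level explicit; the toy values and the temporary column garage_yr_chr are hypothetical.

```r
library(dplyr)

# Toy houses (made-up values): garage built with the house,
# garage built at remodeling, and no garage at all
toy <- tibble(
  garage_type    = factor(c("Attchd", "Detchd", "None")),
  garage_yr_blt  = factor(c("2003", "1970", "None")),
  year_built     = c(2003, 1915, 1950),
  year_remod_add = c(2003, 1970, 1950)
)

toy_recoded <- toy %>%
  mutate(
    # compare as character so that the "None" level never matches a year
    garage_yr_chr        = as.character(garage_yr_blt),
    has_garage           = as.numeric(garage_type != "None"),
    garage_yr_same_built = as.numeric(garage_yr_chr == as.character(year_built)),
    garage_yr_same_remod = as.numeric(year_built != year_remod_add &
                                        garage_yr_chr == as.character(year_remod_add))
  ) %>%
  select(-garage_yr_blt, -garage_yr_chr)

toy_recoded
```

The house without a garage gets 0 in all three columns, so the absence of a garage remains distinguishable from a garage built in some other year only through has_garage.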

2.4 Ordinal variables

We created some ordinal variables in Part 1; in those variables, the order of the levels matters (the level “Excellent” is better than the level “Poor” or “Fair”).

# which variables are ordinal ?
train %>% 
  select_if(is.ordered) %>% 
  colnames()
##  [1] "overall_qual"   "overall_cond"   "exter_qual"     "exter_cond"    
##  [5] "bsmt_qual"      "bsmt_cond"      "bsmt_fin_type1" "bsmt_fin_type2"
##  [9] "heating_qc"     "electrical"     "kitchen_qual"   "functional"    
## [13] "fireplace_qu"   "garage_qual"    "garage_cond"    "pool_qc"       
## [17] "fence"

These levels can now be converted to numeric values, the lowest value corresponding to the lowest-ranked level.
For example, let’s look at the variable electrical.

# distribution of levels : note the 'NA' value
table(train$electrical, useNA = "ifany")
## 
## FuseP FuseF FuseA SBrkr  <NA> 
##     3    27    94  1334     1
# distribution of numerical levels : the 'NA' is still present
table(as.numeric(train$electrical), useNA = "ifany")
## 
##    1    2    3    4 <NA> 
##    3   27   94 1334    1
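
The same behaviour can be reproduced on a toy ordered factor. The levels below are illustrative and are not necessarily the ones defined in Part 1.

```r
# Toy ordered factor: the numeric code is the position of the level
# in the declared order, not the alphabetical order
qual <- factor(c("Gd", "TA", NA, "Ex", "Po"),
               levels = c("Po", "Fa", "TA", "Gd", "Ex"),
               ordered = TRUE)

qual_num <- as.numeric(qual)
qual_num

# The missing value stays missing after the conversion
sum(is.na(qual_num))
```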

The vtreat package will take care of the remaining NA values.

train <- train %>% 
  mutate_if(is.ordered, as.numeric)
glimpse(train)
## Observations: 1,459
## Variables: 84
## $ id                   <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ ms_sub_class         <fct> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20…
## $ ms_zoning            <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL,…
## $ lot_frontage         <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70,…
## $ lot_area             <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 1008…
## $ street               <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, P…
## $ alley                <fct> None, None, None, None, None, None, None, N…
## $ lot_shape            <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg…
## $ land_contour         <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl…
## $ utilities            <fct> AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ lot_config           <fct> Inside, FR2, Inside, Corner, FR2, Inside, I…
## $ land_slope           <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl…
## $ neighborhood         <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge…
## $ condition1           <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, …
## $ condition2           <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, N…
## $ bldg_type            <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1…
## $ house_style          <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5…
## $ overall_qual         <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6…
## $ overall_cond         <dbl> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5…
## $ year_built           <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1…
## $ year_remod_add       <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1…
## $ roof_style           <fct> Gable, Gable, Gable, Gable, Gable, Gable, G…
## $ roof_matl            <fct> CompShg, CompShg, CompShg, CompShg, CompShg…
## $ exterior1st          <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd…
## $ exterior2nd          <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd…
## $ mas_vnr_type         <fct> BrkFace, None, BrkFace, None, BrkFace, None…
## $ mas_vnr_area         <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, …
## $ exter_qual           <dbl> 4, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 5, 3, 4, 3…
## $ exter_cond           <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ foundation           <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, …
## $ bsmt_qual            <dbl> 5, 5, 5, 4, 5, 5, 6, 5, 4, 4, 4, 6, 4, 5, 4…
## $ bsmt_cond            <dbl> 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ bsmt_exposure        <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No,…
## $ bsmt_fin_type1       <dbl> 7, 6, 7, 6, 7, 7, 7, 6, 2, 7, 4, 7, 6, 2, 5…
## $ bsmt_fin_sf1         <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0,…
## $ bsmt_fin_type2       <dbl> 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2…
## $ bsmt_fin_sf2         <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, …
## $ bsmt_unf_sf          <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952,…
## $ total_bsmt_sf        <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107,…
## $ heating              <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, G…
## $ heating_qc           <dbl> 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 3, 5, 3…
## $ central_air          <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ electrical           <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4…
## $ x1st_flr_sf          <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107,…
## $ x2nd_flr_sf          <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0…
## $ low_qual_fin_sf      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gr_liv_area          <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2…
## $ bsmt_full_bath       <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1…
## $ bsmt_half_bath       <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ full_bath            <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1…
## $ half_bath            <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1…
## $ bedroom_abv_gr       <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2…
## $ kitchen_abv_gr       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1…
## $ kitchen_qual         <dbl> 4, 3, 4, 4, 4, 3, 4, 3, 3, 3, 3, 5, 3, 4, 3…
## $ tot_rms_abv_grd      <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, …
## $ functional           <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8…
## $ fireplaces           <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1…
## $ fireplace_qu         <dbl> 1, 4, 4, 5, 4, 1, 5, 4, 4, 4, 1, 5, 1, 5, 3…
## $ garage_type          <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Att…
## $ garage_finish        <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf…
## $ garage_cars          <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1…
## $ garage_area          <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468…
## $ garage_qual          <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4…
## $ garage_cond          <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ paved_drive          <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ wood_deck_sf         <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, …
## $ open_porch_sf        <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21…
## $ enclosed_porch       <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0…
## $ x3ssn_porch          <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ screen_porch         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0,…
## $ pool_area            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_qc              <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ fence                <dbl> 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 3…
## $ misc_feature         <fct> None, None, None, None, None, Shed, None, S…
## $ misc_val             <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, …
## $ yr_sold              <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2…
## $ sale_type            <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD,…
## $ sale_condition       <fct> Normal, Normal, Normal, Abnorml, Normal, No…
## $ log_sale_price       <dbl> 12.24769, 12.10901, 12.31717, 11.84940, 12.…
## $ mo_sold_sin_trans    <dbl> 8.660254e-01, 5.000000e-01, -1.000000e+00, …
## $ mo_sold_cos_trans    <dbl> 5.000000e-01, -8.660254e-01, -1.836970e-16,…
## $ has_garage           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ garage_yr_same_built <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ garage_yr_same_remod <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…

3 Variable preparation

The vtreat package is very handy and produces a clean data frame that:

  • only has numeric columns (other than the outcome for classification problems)
  • has no Infinite/NA/NaN in the effective variable columns

The usage pattern is:

  • design a treatment plan (we will use a cross-validation treatment plan)
  • use the returned structure with prepare() to apply the plan to data frames (train and test)
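
The two steps above can be sketched on a small simulated dataset. Everything about the toy data (column names x_num and x_cat, the simulated outcome y, the unseen level "zzz") is made up for illustration; only mkCrossFrameNExperiment() and prepare() come from vtreat, and pruneSig = NULL is passed explicitly on the assumption that no pruning is wanted at this stage.

```r
library(vtreat)

# Simulated toy data with a numeric column containing NAs and a categorical column
set.seed(1)
n <- 200
toy <- data.frame(
  x_num = rnorm(n),
  x_cat = sample(c("a", "b", "c"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
toy$y <- toy$x_num + (toy$x_cat == "a") + rnorm(n, sd = 0.1)
toy$x_num[sample(n, 10)] <- NA

# Step 1: design the treatment plan (cross-validated version)
cfe <- mkCrossFrameNExperiment(dframe = toy,
                               varlist = c("x_num", "x_cat"),
                               outcomename = "y",
                               verbose = FALSE)

# Step 2: apply the plan to new data, here with an NA and an unseen level
new_data <- data.frame(x_num = c(0.5, NA),
                       x_cat = c("b", "zzz"),
                       stringsAsFactors = FALSE)
new_treated <- prepare(cfe$treatments, new_data, pruneSig = NULL)

# Both treated frames should be all numeric and free of missing values
sum(is.na(cfe$crossFrame))
sum(is.na(new_treated))
```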

We will create 2 distinct datasets, train_treated and valid_treated:

  • train_treated will be used to train different models in Part 3
  • valid_treated will be used to evaluate the models’ performance in Part 3

The following picture helps to understand the different datasets used in each step: [Figure: datasets_picture]

# Split train into 2 temporary datasets, ratio 0.75/0.25
set.seed(42)
sample_group <- sample(x = c("v_train", "v_valid"), size = nrow(train), replace = TRUE, prob = c(0.75, 0.25))

vtreat_train <- train[sample_group == "v_train", ]
vtreat_valid <- train[sample_group == "v_valid", ]

3.1 treatment plan n°1

The first treatment plan will be basic, with no options (no significance threshold or pruning defined).

# Define response and input variables
response <- "log_sale_price"
input_vars <- setdiff(colnames(vtreat_train), response)

Create a treatment plan, based on cross-validation.

# Basic treatment plan
vtreat_plan <- mkCrossFrameNExperiment(dframe = vtreat_train,
                                       varlist = input_vars,
                                       outcomename = response)
## [1] "vtreat 1.4.2 start initial treatment design Sun Jul  7 19:33:22 2019"
## [1] " start cross frame work Sun Jul  7 19:33:28 2019"
## [1] " vtreat::mkCrossFrameNExperiment done Sun Jul  7 19:33:32 2019"

The treatments element holds the complete treatment plan, and its scoreFrame lists the newly created variables (with their types and significance levels).
The new values are stored in the crossFrame element.

# complete treatment plan
treatments <- vtreat_plan$treatments

Let’s extract the newly created variables related to alley.

treatments$scoreFrame %>% 
  filter(origName == "alley")
##            varName varMoves          rsq          sig needsSplit
## 1       alley_catP     TRUE 0.0187716570 5.004325e-06       TRUE
## 2       alley_catN     TRUE 0.0277306535 2.688002e-08       TRUE
## 3       alley_catD     TRUE 0.0222482994 6.566783e-07       TRUE
## 4 alley_lev_x_Grvl     TRUE 0.0300314294 7.033336e-09      FALSE
## 5 alley_lev_x_None     TRUE 0.0188307617 4.834244e-06      FALSE
## 6 alley_lev_x_Pave     TRUE 0.0001759761 6.600172e-01      FALSE
##   extraModelDegrees origName code
## 1                 2    alley catP
## 2                 2    alley catN
## 3                 2    alley catD
## 4                 0    alley  lev
## 5                 0    alley  lev
## 6                 0    alley  lev

We can see that 6 new variables have been created.
The ‘lev’ variables are one-hot-encoded variables.
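
To give an intuition for these codes, the sketch below builds naive versions of the ‘lev’, ‘catP’ and ‘catN’ columns on made-up data: indicator columns, the prevalence of each level, and the mean outcome of each level relative to the overall mean. This is a simplified illustration only; vtreat's own computation is cross-validated and more elaborate, so the numbers will not match its output.

```r
# Toy data (made-up values)
toy <- data.frame(
  alley = c("None", "None", "None", "Grvl", "Grvl", "Pave"),
  y     = c(12.0, 12.2, 12.1, 11.5, 11.7, 12.0),
  stringsAsFactors = FALSE
)

# 'lev'-like columns: one 0/1 indicator per level
lev <- sapply(unique(toy$alley), function(l) as.numeric(toy$alley == l))
colnames(lev) <- paste0("alley_lev_x_", colnames(lev))

# 'catP'-like column: how frequent the row's level is in the data
cat_p <- as.numeric(table(toy$alley)[toy$alley]) / nrow(toy)

# 'catN'-like column: mean outcome of the row's level minus the overall mean
cat_n <- ave(toy$y, toy$alley) - mean(toy$y)

cbind(toy, lev, alley_catP = cat_p, alley_catN = cat_n)
```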

We can also extract the new “projected” values.

# Extract new values dataframe related to 'alley'
vtreat_plan$crossFrame %>% 
  select(starts_with("alley")) %>% 
  head(10)
##    alley_catP alley_catN alley_catD alley_lev_x_Grvl alley_lev_x_None
## 1   0.9306122 0.01301487  0.4048582                0                1
## 2   0.9306122 0.01301487  0.4048582                0                1
## 3   0.9319728 0.01619430  0.3984798                0                1
## 4   0.9359673 0.01484579  0.3966729                0                1
## 5   0.9359673 0.01484579  0.3966729                0                1
## 6   0.9319728 0.01619430  0.3984798                0                1
## 7   0.9359673 0.01484579  0.3966729                0                1
## 8   0.9319728 0.01619430  0.3984798                0                1
## 9   0.9319728 0.01619430  0.3984798                0                1
## 10  0.9319728 0.01619430  0.3984798                0                1
##    alley_lev_x_Pave
## 1                 0
## 2                 0
## 3                 0
## 4                 0
## 5                 0
## 6                 0
## 7                 0
## 8                 0
## 9                 0
## 10                0

The original categorical variable alley has been transformed into 6 numeric variables.
Also note that the treated data no longer contain missing values.

sum(is.na(train))
## [1] 262
sum(is.na(vtreat_plan$crossFrame))
## [1] 0

3.2 treatment plan n°2

Another great benefit of vtreat is its built-in “variable selection” process.
When building the treatment plan, it can, among other things, pool rare levels of categorical variables and prune variables based on statistical significance (“y-aware” processing).

# rare counts : a level whose count is at most 1% of the total observations
# is considered a rare level
# (rare levels are pooled into a shared rare level)
vtreat_rare_count <- 0.01 * nrow(vtreat_train)

# rare significance level : choose a significance level to prune levels at.
# see http://www.win-vector.com/blog/2015/10/using-differential-privacy-to-reuse-training-data/
vtreat_rare_sig <- 0.3  # ???

# prune variables using a significance level.
# see https://winvector.github.io/vtreat/articles/vtreatSignificance.html
vtreat_prune_sig <- 1 / ncol(vtreat_train)
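
The idea of pooling rare levels can be illustrated in base R. This is a naive sketch on made-up data, not vtreat's internal logic; the "count at or below the threshold" rule is an assumption based on the description of the rareCount argument.

```r
# Toy categorical variable: two frequent levels and two levels seen once
x <- c(rep("A", 60), rep("B", 38), "C", "D")

# Threshold: 1% of the observations, as for vtreat_rare_count above
min_count <- 0.01 * length(x)

# Levels whose count is at or below the threshold are treated as rare
counts <- table(x)
rare_levels <- names(counts)[counts <= min_count]

# Pool them into a single shared level
x_pooled <- ifelse(x %in% rare_levels, "rare", x)
table(x_pooled)
```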

The sig (significance) value will be used to “prune” the variables (only keep variables with a sig value lower than vtreat_prune_sig).
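
One way to read the 1 / ncol threshold: if k candidate variables were pure noise, their significance values would be roughly uniform on [0, 1], so about k * (1 / k) = 1 of them would pass the filter by chance. The simulation below is a sketch under that uniformity assumption, with k = 84 taken from the glimpse() output above.

```r
set.seed(42)

k <- 84               # number of columns in the training data
prune_sig <- 1 / k    # same rule as vtreat_prune_sig

# Simulate many sets of k pure-noise significance values
# and count how many fall under the threshold each time
n_sim <- 10000
false_keeps <- replicate(n_sim, sum(runif(k) <= prune_sig))

mean(false_keeps)     # average number of noise variables kept
k * prune_sig         # theoretical expectation
```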

Also, the vtreat documentation recommends:

We strongly suggest using the standard variables coded as ‘lev’, ‘clean’, and ‘isBad’; and the “y aware” variables coded as ‘catN’ and ‘catB’. The non sub-model variables (‘catP’ and ‘catD’) can be useful (possibly as interactions or guards on the corresponding ‘catN’ and ‘catB’ variables) but also encode distributional facts about the data that may or may not be appropriate depending on your problem domain.

 

This can be done when creating the treatment plan, using the argument codeRestriction.

# New treatment plan with code restriction
vtreat_plan <- mkCrossFrameNExperiment(dframe = vtreat_train,
                                       varlist = input_vars,
                                       outcomename = response,
                                       rareCount = vtreat_rare_count,
                                       rareSig = vtreat_rare_sig,
                                       codeRestriction = c("lev", "clean", "isBad", "catN"))
## [1] "vtreat 1.4.2 start initial treatment design Sun Jul  7 19:33:32 2019"
## [1] " start cross frame work Sun Jul  7 19:33:39 2019"
## [1] " vtreat::mkCrossFrameNExperiment done Sun Jul  7 19:33:44 2019"
# complete treatment plan
treatments <- vtreat_plan$treatments

Let’s extract the newly created variables related to alley.

treatments$scoreFrame %>% 
  filter(origName == "alley")
##            varName varMoves        rsq          sig needsSplit
## 1       alley_catN     TRUE 0.03013298 6.629171e-09       TRUE
## 2 alley_lev_x_Grvl     TRUE 0.03003143 7.033336e-09      FALSE
## 3 alley_lev_x_None     TRUE 0.01883076 4.834244e-06      FALSE
##   extraModelDegrees origName code
## 1                 2    alley catN
## 2                 0    alley  lev
## 3                 0    alley  lev

As expected, the ‘catP’ and ‘catD’ variables have not been retained.
Also note that one level of the alley variable (Pave) has been dropped: its significance value (about 0.66 in the first plan) was indeed above our threshold of 0.3.

We can now filter the variables based on significance.

# How many variables have been created by the treatment plan ?
nrow(treatments$scoreFrame)
## [1] 178
# Select only variables with a significance value at or below our previously defined value 'vtreat_prune_sig'
newvars <- treatments$scoreFrame$varName[treatments$scoreFrame$sig <= vtreat_prune_sig]

length(newvars)
## [1] 147

We now have 147 variables left after pruning (out of the 178 created by the treatment plan).

The new values are stored in the crossFrame element.

# New treated training dataset
train_treated <- vtreat_plan$crossFrame[, c(newvars, "log_sale_price")]
glimpse(train_treated)
## Observations: 1,102
## Variables: 148
## $ ms_sub_class_catN            <dbl> 0.30299208, 0.30299208, -0.24959678…
## $ ms_zoning_catN               <dbl> 0.06550683, 0.06550683, 0.06550683,…
## $ lot_frontage                 <dbl> 68.00000, 84.00000, 85.00000, 75.00…
## $ lot_area                     <dbl> 11250, 14260, 14115, 10084, 10382, …
## $ alley_catN                   <dbl> 0.01187456, 0.01187456, 0.01187456,…
## $ lot_shape_catN               <dbl> 0.14188506, 0.14188506, 0.14188506,…
## $ land_contour_catN            <dbl> -0.007902741, -0.007902741, -0.0079…
## $ lot_config_catN              <dbl> -0.02452583, 0.00000000, -0.0245258…
## $ neighborhood_catN            <dbl> 0.13697891, 0.65789741, -0.09717128…
## $ condition1_catN              <dbl> 0.02096058, 0.02096058, 0.02096058,…
## $ bldg_type_catN               <dbl> 0.02699036, 0.02699036, 0.02699036,…
## $ house_style_catN             <dbl> 0.14962377, 0.14962377, -0.25994606…
## $ overall_qual                 <dbl> 7, 8, 5, 8, 7, 7, 5, 5, 9, 7, 6, 4,…
## $ year_built                   <dbl> 2001, 2000, 1993, 2004, 1973, 1931,…
## $ year_remod_add               <dbl> 2002, 2000, 1995, 2005, 1973, 1950,…
## $ roof_style_catN              <dbl> -0.04491734, -0.04491734, -0.044917…
## $ exterior1st_catN             <dbl> 0.15687324, 0.15687324, 0.15687324,…
## $ exterior2nd_catN             <dbl> 0.1587478, 0.1587478, 0.1587478, 0.…
## $ mas_vnr_type_catN            <dbl> 0.1464440, 0.1464440, -0.1204298, 0…
## $ mas_vnr_area                 <dbl> 162, 350, 0, 186, 240, 0, 0, 0, 286…
## $ exter_qual                   <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ foundation_catN              <dbl> 0.2352084, 0.2352084, 0.2393833, 0.…
## $ bsmt_qual                    <dbl> 5, 5, 5, 6, 5, 4, 4, 4, 6, 5, 4, 1,…
## $ bsmt_cond                    <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1,…
## $ bsmt_exposure_catN           <dbl> 0.10091461, 0.10104746, -0.07571532…
## $ bsmt_fin_type1               <dbl> 7, 7, 7, 7, 6, 2, 7, 4, 7, 2, 5, 1,…
## $ bsmt_fin_sf1                 <dbl> 486, 655, 732, 1369, 859, 0, 851, 9…
## $ bsmt_unf_sf                  <dbl> 434, 490, 64, 317, 216, 952, 140, 1…
## $ total_bsmt_sf                <dbl> 920, 1145, 796, 1686, 1107, 952, 99…
## $ heating_catN                 <dbl> 0.009119800, 0.009119800, 0.0091198…
## $ heating_qc                   <dbl> 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 3,…
## $ electrical                   <dbl> 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf                  <dbl> 920, 1145, 796, 1694, 1107, 1022, 1…
## $ x2nd_flr_sf                  <dbl> 866, 1053, 566, 0, 983, 752, 0, 0, …
## $ gr_liv_area                  <dbl> 1786, 2198, 1362, 1694, 2090, 1774,…
## $ bsmt_full_bath               <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,…
## $ full_bath                    <dbl> 2, 2, 1, 2, 2, 2, 1, 1, 3, 2, 1, 2,…
## $ half_bath                    <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ bedroom_abv_gr               <dbl> 3, 4, 1, 3, 3, 2, 2, 3, 4, 3, 2, 2,…
## $ kitchen_abv_gr               <dbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2,…
## $ kitchen_qual                 <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ tot_rms_abv_grd              <dbl> 6, 9, 5, 7, 7, 8, 5, 5, 11, 7, 5, 6…
## $ functional                   <dbl> 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8,…
## $ fireplaces                   <dbl> 1, 1, 0, 1, 2, 2, 2, 0, 2, 1, 1, 0,…
## $ fireplace_qu                 <dbl> 4, 4, 1, 5, 4, 4, 4, 1, 5, 5, 3, 1,…
## $ garage_type_catN             <dbl> 0.1404059, 0.1404059, 0.1404059, 0.…
## $ garage_finish_catN           <dbl> 0.1666591, 0.1666591, -0.1968145, 0…
## $ garage_cars                  <dbl> 2, 3, 2, 2, 2, 2, 1, 1, 3, 3, 1, 2,…
## $ garage_area                  <dbl> 608, 836, 480, 636, 484, 468, 205, …
## $ garage_qual                  <dbl> 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4,…
## $ garage_cond                  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN             <dbl> 0.03148809, 0.03148809, 0.03148809,…
## $ wood_deck_sf                 <dbl> 0, 192, 40, 255, 235, 90, 0, 0, 147…
## $ open_porch_sf                <dbl> 42, 84, 30, 57, 204, 0, 4, 0, 21, 3…
## $ enclosed_porch               <dbl> 0, 0, 0, 0, 228, 205, 0, 0, 0, 0, 1…
## $ screen_porch                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_qc                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence                        <dbl> 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 3, 1,…
## $ misc_feature_catN            <dbl> 0.005977216, 0.005977216, -0.181971…
## $ sale_type_catN               <dbl> -0.02942813, -0.02942813, -0.029428…
## $ sale_condition_catN          <dbl> -0.01765492, -0.01765492, -0.017654…
## $ has_garage                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_remod         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_rare        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_160       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_30        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_50        <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60        <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ ms_sub_class_lev_x_90        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ms_zoning_lev_x_FV           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL           <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ ms_zoning_lev_x_RM           <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_Grvl             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1          <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ lot_shape_lev_x_IR2          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg          <dbl> 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ land_contour_lev_x_Bnk       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_Inside      <dbl> 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,…
## $ land_slope_lev_x_Gtl         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide   <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_CollgCr   <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ neighborhood_lev_x_Crawfor   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ neighborhood_lev_x_NoRidge   <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ neighborhood_lev_x_OldTown   <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer    <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ neighborhood_lev_x_Somerst   <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery      <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Norm        <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ bldg_type_lev_x_1Fam         <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ bldg_type_lev_x_Duplex       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ bldg_type_lev_x_Twnhs        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1_5Fin     <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story     <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ house_style_lev_x_2Story     <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ house_style_lev_x_SFoyer     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable       <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1,…
## $ roof_style_lev_x_Hip         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,…
## $ roof_matl_lev_x_CompShg      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior1st_lev_x_VinylSd    <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior2nd_lev_x_VinylSd    <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace   <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ mas_vnr_type_lev_x_None      <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ mas_vnr_type_lev_x_Stone     <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,…
## $ foundation_lev_x_BrkTil      <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_CBlock      <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,…
## $ foundation_lev_x_PConc       <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Av       <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Gd       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No       <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,…
## $ bsmt_exposure_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ heating_lev_x_GasA           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ central_air_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_type_lev_x_Attchd     <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_type_lev_x_BuiltIn    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_type_lev_x_Detchd     <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ garage_type_lev_x_None       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_finish_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn      <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_finish_lev_x_Unf      <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,…
## $ paved_drive_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None      <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,…
## $ misc_feature_lev_x_Shed      <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
## $ sale_type_lev_x_COD          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_New          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ sale_type_lev_x_WD           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal  <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ log_sale_price               <dbl> 12.31717, 12.42922, 11.87060, 12.63…

Note that the id variable, which we strongly suspected to be unrelated to the sale price, has been dropped from the treated dataset.

 

We can now prepare the validation dataset under the same conditions as the training set; the pruneSig argument prunes the non-significant variables directly during preparation.

# apply treatment plan to validation set
valid_treated <- prepare(treatmentplan = treatments,
                         dframe = vtreat_valid,
                         pruneSig = vtreat_prune_sig)
glimpse(valid_treated)
## Observations: 357
## Variables: 148
## $ ms_sub_class_catN            <dbl> 0.31707790, 0.02487946, -0.06736683…
## $ ms_zoning_catN               <dbl> 0.06141452, 0.06141452, 0.06141452,…
## $ lot_frontage                 <dbl> 65.0000, 80.0000, 60.0000, 69.9967,…
## $ lot_area                     <dbl> 8450, 9600, 9550, 12968, 6120, 1124…
## $ alley_catN                   <dbl> 0.01468384, 0.01468384, 0.01468384,…
## $ lot_shape_catN               <dbl> -0.09031732, -0.09031732, 0.1527919…
## $ land_contour_catN            <dbl> -0.005418798, -0.005418798, -0.0054…
## $ lot_config_catN              <dbl> -0.02389266, 0.00000000, 0.00000000…
## $ neighborhood_catN            <dbl> 0.11086832, -0.09532982, 0.23254295…
## $ condition1_catN              <dbl> 0.0209811, -0.2327314, 0.0209811, 0…
## $ bldg_type_catN               <dbl> 0.02452499, 0.02452499, 0.02452499,…
## $ house_style_catN             <dbl> 0.16419373, -0.03114034, 0.16419373…
## $ overall_qual                 <dbl> 7, 6, 7, 5, 7, 6, 8, 8, 5, 8, 4, 5,…
## $ year_built                   <dbl> 2003, 1976, 1915, 1962, 1929, 1970,…
## $ year_remod_add               <dbl> 2003, 1976, 1970, 1962, 2001, 1970,…
## $ roof_style_catN              <dbl> -0.04345073, -0.04345073, -0.043450…
## $ exterior1st_catN             <dbl> 0.17180733, -0.16217432, -0.1741279…
## $ exterior2nd_catN             <dbl> 0.17412141, -0.15994012, -0.1003386…
## $ mas_vnr_type_catN            <dbl> 0.1514415, -0.1297748, -0.1297748, …
## $ mas_vnr_area                 <dbl> 196, 0, 0, 0, 0, 180, 380, 281, 0, …
## $ exter_qual                   <dbl> 4, 3, 3, 3, 3, 3, 4, 4, 3, 4, 3, 3,…
## $ foundation_catN              <dbl> 0.2294760, -0.1589739, -0.3096192, …
## $ bsmt_qual                    <dbl> 5, 5, 4, 4, 4, 4, 6, 5, 5, 6, 4, 4,…
## $ bsmt_cond                    <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ bsmt_exposure_catN           <dbl> -0.08084167, 0.35620572, -0.0808416…
## $ bsmt_fin_type1               <dbl> 7, 6, 6, 6, 2, 6, 2, 2, 7, 7, 2, 2,…
## $ bsmt_fin_sf1                 <dbl> 706, 978, 216, 737, 0, 578, 0, 0, 8…
## $ bsmt_unf_sf                  <dbl> 150, 284, 540, 175, 832, 426, 1158,…
## $ total_bsmt_sf                <dbl> 856, 1262, 756, 912, 832, 1004, 115…
## $ heating_catN                 <dbl> 0.008521347, 0.008521347, 0.0085213…
## $ heating_qc                   <dbl> 5, 5, 4, 3, 5, 5, 5, 5, 3, 5, 2, 4,…
## $ electrical                   <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf                  <dbl> 856, 1262, 961, 912, 854, 1004, 115…
## $ x2nd_flr_sf                  <dbl> 854, 0, 756, 0, 0, 0, 1218, 0, 0, 0…
## $ gr_liv_area                  <dbl> 1710, 1262, 1717, 912, 854, 1004, 2…
## $ bsmt_full_bath               <dbl> 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,…
## $ full_bath                    <dbl> 2, 2, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1,…
## $ half_bath                    <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ bedroom_abv_gr               <dbl> 3, 3, 3, 2, 2, 2, 4, 3, 3, 3, 1, 3,…
## $ kitchen_abv_gr               <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ kitchen_qual                 <dbl> 4, 3, 4, 3, 3, 3, 4, 4, 3, 4, 2, 4,…
## $ tot_rms_abv_grd              <dbl> 8, 6, 7, 4, 5, 5, 9, 7, 6, 7, 4, 6,…
## $ functional                   <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,…
## $ fireplaces                   <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ fireplace_qu                 <dbl> 1, 4, 5, 1, 1, 4, 5, 5, 4, 5, 1, 1,…
## $ garage_type_catN             <dbl> 0.1336612, 0.1336612, -0.2537943, -…
## $ garage_finish_catN           <dbl> 0.1513195, 0.1513195, -0.2014492, -…
## $ garage_cars                  <dbl> 2, 2, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1,…
## $ garage_area                  <dbl> 548, 460, 642, 352, 576, 480, 853, …
## $ garage_qual                  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4,…
## $ garage_cond                  <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN             <dbl> 0.03529355, 0.03529355, 0.03529355,…
## $ wood_deck_sf                 <dbl> 0, 298, 0, 140, 48, 0, 240, 171, 10…
## $ open_porch_sf                <dbl> 61, 0, 35, 0, 112, 0, 154, 159, 110…
## $ enclosed_porch               <dbl> 0, 0, 272, 0, 0, 0, 0, 0, 0, 0, 87,…
## $ screen_porch                 <dbl> 0, 0, 0, 176, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc                      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence                        <dbl> 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 4,…
## $ misc_feature_catN            <dbl> 0.006052505, 0.006052505, 0.0060525…
## $ sale_type_catN               <dbl> -0.02981943, -0.02981943, -0.029819…
## $ sale_condition_catN          <dbl> -0.02056441, -0.02056441, -0.193405…
## $ has_garage                   <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built         <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ garage_yr_same_remod         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ ms_sub_class_lev_rare        <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ ms_sub_class_lev_x_160       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_30        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ ms_sub_class_lev_x_50        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60        <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_90        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_FV           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL           <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ ms_zoning_lev_x_RM           <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,…
## $ alley_lev_x_Grvl             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1          <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,…
## $ lot_shape_lev_x_IR2          <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg          <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,…
## $ land_contour_lev_x_Bnk       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ lot_config_lev_x_Inside      <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ land_slope_lev_x_Gtl         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide   <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ neighborhood_lev_x_CollgCr   <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Crawfor   <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes     <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NoRidge   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt   <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ neighborhood_lev_x_OldTown   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer    <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ neighborhood_lev_x_Somerst   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery      <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr       <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ condition1_lev_x_Norm        <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ bldg_type_lev_x_1Fam         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ bldg_type_lev_x_Duplex       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bldg_type_lev_x_Twnhs        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1_5Fin     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story     <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,…
## $ house_style_lev_x_2Story     <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_SFoyer     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable       <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,…
## $ roof_style_lev_x_Hip         <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ roof_matl_lev_x_CompShg      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd    <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior1st_lev_x_VinylSd    <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng    <dbl> 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd    <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior2nd_lev_x_VinylSd    <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng    <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace   <dbl> 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_None      <dbl> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ mas_vnr_type_lev_x_Stone     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ foundation_lev_x_BrkTil      <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ foundation_lev_x_CBlock      <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ foundation_lev_x_PConc       <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Av       <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_Gd       <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No       <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ bsmt_exposure_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heating_lev_x_GasA           <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ central_air_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ garage_type_lev_x_Attchd     <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,…
## $ garage_type_lev_x_BuiltIn    <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ garage_type_lev_x_Detchd     <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ garage_type_lev_x_None       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin      <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_None     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn      <dbl> 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
## $ garage_finish_lev_x_Unf      <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ paved_drive_lev_x_N          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y          <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None      <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_Shed      <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_COD          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_New          <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_WD           <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal  <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ log_sale_price               <dbl> 12.24769, 12.10901, 11.84940, 11.87…
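
To make the mechanics of this step easier to follow, here is a minimal sketch on toy data, assuming the vtreat package is installed. The `toy_train`, `toy_valid` and `toy_plan` objects are hypothetical and unrelated to the house prices data; `pruneSig = NULL` is used so that no variable is pruned in this tiny example.

```r
library(vtreat)

# Hypothetical toy data: one numeric and one categorical predictor
toy_train <- data.frame(
  x_num = c(1, 2, NA, 4, 5, 6, 7, 8, 9, 10),
  x_cat = c("a", "b", "a", "b", "a", "b", "a", "b", "a", "b"),
  y     = c(1.1, 2.3, 2.9, 4.2, 5.0, 6.1, 6.8, 8.3, 9.1, 9.9),
  stringsAsFactors = FALSE
)

# Hold-out rows with a missing value and a level ("c") unseen in training
toy_valid <- data.frame(
  x_num = c(3, NA),
  x_cat = c("a", "c"),
  y     = c(3.0, 5.5),
  stringsAsFactors = FALSE
)

# Design the plan on the training rows only
toy_plan <- designTreatmentsN(toy_train,
                              varlist = c("x_num", "x_cat"),
                              outcomename = "y",
                              verbose = FALSE)

# Apply the same plan to the hold-out rows (no pruning in this sketch)
toy_valid_treated <- prepare(toy_plan, toy_valid, pruneSig = NULL)

# The treated frame is numeric and free of missing values
sum(is.na(toy_valid_treated))
```

The point of the sketch is that the plan is learned once on the training rows and then reused unchanged, so missing values and unseen levels in new data are encoded in a consistent way.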

We now have two datasets with identical columns, only numeric variables and no missing values.

# similar datasets
identical(colnames(train_treated), colnames(valid_treated))
## [1] TRUE
# all numeric variables
ncol(train_treated) == train_treated %>% select_if(is.numeric) %>% ncol(.)
## [1] TRUE
ncol(valid_treated) == valid_treated %>% select_if(is.numeric) %>% ncol(.)
## [1] TRUE
# no missing values
sum(is.na(train_treated))
## [1] 0
sum(is.na(valid_treated))
## [1] 0
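
As a possible convenience, the checks above could be bundled into a single helper that stops with an error as soon as one condition fails. This is only a sketch in base R; `check_treated`, `toy_a` and `toy_b` are hypothetical names that do not appear elsewhere in the notebook.

```r
# Hypothetical helper: fails loudly if the two treated frames are not
# aligned, fully numeric and free of missing values
check_treated <- function(train_df, valid_df) {
  stopifnot(
    identical(colnames(train_df), colnames(valid_df)),   # same columns, same order
    all(vapply(train_df, is.numeric, logical(1))),       # all numeric (train)
    all(vapply(valid_df, is.numeric, logical(1))),       # all numeric (valid)
    sum(is.na(train_df)) == 0,                           # no missing values (train)
    sum(is.na(valid_df)) == 0                            # no missing values (valid)
  )
  invisible(TRUE)
}

# Toy usage
toy_a <- data.frame(x = c(1, 2), y = c(0.5, 0.7))
toy_b <- data.frame(x = c(3, 4), y = c(0.1, 0.9))
check_ok <- check_treated(toy_a, toy_b)
```

With the real objects, the call would be `check_treated(train_treated, valid_treated)`.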

Let’s recap the steps performed to prepare the data:

  • log-transform the sale_price variable
  • cyclical-transform the mo_sold variable
  • binary-transform the garage_yr_blt variable
  • numeric-transform the ordinal variables
  • apply the treatment plan to get a numeric dataset without missing values

Only the first step will be skipped for the test dataset: sale_price is the variable of interest, so it does not appear in the test data.
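
As a reminder of what the first two transformations in the recap do, here is a small base R sketch. `cyclical_sketch` is a hypothetical stand-in, not the notebook's own `cyclical_transform` function (whose definition may differ), and the sale prices are illustrative values only.

```r
# Hypothetical stand-in for a cyclical encoding of a month (1..12):
# each month is mapped to a point on the unit circle
cyclical_sketch <- function(x, period = 12) {
  angle <- 2 * pi * x / period
  data.frame(sin = sin(angle), cos = cos(angle))
}

# Euclidean distance between two encoded months
month_distance <- function(m1, m2) {
  enc <- cyclical_sketch(c(m1, m2))
  sqrt(diff(enc$sin)^2 + diff(enc$cos)^2)
}

# December and January are neighbours on the circle,
# whereas December and June sit on opposite sides
dist_dec_jan <- month_distance(12, 1)
dist_dec_jun <- month_distance(12, 6)

# Log-transform of the outcome, and the inverse transform that would be
# needed to bring predictions back to the original price scale
sale_price     <- c(208500, 181500, 223500)  # illustrative values
log_sale_price <- log(sale_price)
back_to_price  <- exp(log_sale_price)
```

The cyclical encoding avoids treating month 12 and month 1 as far apart, which a raw 1 to 12 scale would do.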

To close Part 2, let’s save all the necessary objects.

# Export the different objects
save(train_treated, valid_treated,               # datasets
     vtreat_plan, vtreat_prune_sig,              # treatment plan objects
     cyclical_transform, garage_year_transform,  # functions
     file = "02-house_objects.RData")
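
For reference, objects written with `save()` are restored by name with `load()`. The sketch below shows the round trip on toy objects and a temporary file; `toy_df`, `toy_fun` and `restored` are hypothetical names used only for illustration.

```r
# Toy objects standing in for the datasets and functions saved above
toy_df  <- data.frame(a = 1:3)
toy_fun <- function(x) x + 1

# Save to a temporary file so nothing is written to the working directory
tmp_file <- tempfile(fileext = ".RData")
save(toy_df, toy_fun, file = tmp_file)

# Load into a separate environment to avoid overwriting existing objects;
# load() returns the names of the restored objects
restored     <- new.env()
loaded_names <- load(tmp_file, envir = restored)
loaded_names
```

In a later part, calling `load("02-house_objects.RData")` without the `envir` argument would restore the objects directly into the calling environment under their original names.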