Once the basic cleaning has been performed (see Part 1), we can move on to the next phase.
In this notebook, we will concentrate on 2 tasks :
vtreat package)

library(tidyverse)   # data manipulation
library(vtreat)      # variable preparation for ML
library(kableExtra)  # customize table output

Load and first inspection of the data.
# Import the 'full' dataset created in Part 1
full <- readRDS("01-full_train_test.rds")
# Split the data to retrieve the original train dataset
train <- full %>%
  filter(df_id == "train") %>%
  select(-df_id)
rm(full)
## Observations: 1,459
## Variables: 81
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ ms_sub_class <fct> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,…
## $ ms_zoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL, RL, …
## $ lot_frontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, …
## $ lot_area <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10…
## $ street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, Pave, …
## $ alley <fct> None, None, None, None, None, None, None, None, …
## $ lot_shape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg, Reg…
## $ land_contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl…
## $ utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, AllPub, …
## $ lot_config <fct> Inside, FR2, Inside, Corner, FR2, Inside, Inside…
## $ land_slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl…
## $ neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge, Mit…
## $ condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, PosN,…
## $ condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, Norm, …
## $ bldg_type <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, …
## $ house_style <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5Fin, …
## $ overall_qual <ord> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, …
## $ overall_cond <ord> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, …
## $ year_built <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, …
## $ year_remod_add <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, …
## $ roof_style <fct> Gable, Gable, Gable, Gable, Gable, Gable, Gable,…
## $ roof_matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg, Com…
## $ exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd, Vin…
## $ exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd, Vin…
## $ mas_vnr_type <fct> BrkFace, None, BrkFace, None, BrkFace, None, Sto…
## $ mas_vnr_area <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, …
## $ exter_qual <ord> Gd, TA, Gd, TA, Gd, TA, Gd, TA, TA, TA, TA, Ex, …
## $ exter_cond <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, PConc…
## $ bsmt_qual <ord> Gd, Gd, Gd, TA, Gd, Gd, Ex, Gd, TA, TA, TA, Ex, …
## $ bsmt_cond <ord> TA, TA, TA, Gd, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ bsmt_exposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No, No, …
## $ bsmt_fin_type1 <ord> GLQ, ALQ, GLQ, ALQ, GLQ, GLQ, GLQ, ALQ, Unf, GLQ…
## $ bsmt_fin_sf1 <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,…
## $ bsmt_fin_type2 <ord> Unf, Unf, Unf, Unf, Unf, Unf, Unf, BLQ, Unf, Unf…
## $ bsmt_fin_sf2 <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_unf_sf <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,…
## $ total_bsmt_sf <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,…
## $ heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, GasA, …
## $ heating_qc <ord> Ex, Ex, Ex, Gd, Ex, Ex, Ex, Ex, Gd, Ex, Ex, Ex, …
## $ central_air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ electrical <ord> SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr, SBrkr,…
## $ x1st_flr_sf <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022…
## $ x2nd_flr_sf <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, …
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, …
## $ bsmt_full_bath <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, …
## $ bsmt_half_bath <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, …
## $ half_bath <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ bedroom_abv_gr <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, …
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual <ord> Gd, TA, Gd, Gd, Gd, TA, Gd, TA, TA, TA, TA, Ex, …
## $ tot_rms_abv_grd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,…
## $ functional <ord> Typ, Typ, Typ, Typ, Typ, Typ, Typ, Typ, Min1, Ty…
## $ fireplaces <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, …
## $ fireplace_qu <ord> None, TA, TA, Gd, TA, None, Gd, TA, TA, TA, None…
## $ garage_type <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Attchd, …
## $ garage_yr_blt <fct> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, …
## $ garage_finish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf, RFn…
## $ garage_cars <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, …
## $ garage_area <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205…
## $ garage_qual <ord> TA, TA, TA, TA, TA, TA, TA, TA, Fa, Gd, TA, TA, …
## $ garage_cond <ord> TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, TA, …
## $ paved_drive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ wood_deck_sf <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, …
## $ open_porch_sf <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, …
## $ enclosed_porch <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, …
## $ x3ssn_porch <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ screen_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0…
## $ pool_area <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc <ord> None, None, None, None, None, None, None, None, …
## $ fence <ord> None, None, None, None, None, MnPrv, None, None,…
## $ misc_feature <fct> None, None, None, None, None, Shed, None, Shed, …
## $ misc_val <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,…
## $ mo_sold <dbl> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, …
## $ yr_sold <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, …
## $ sale_type <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, New,…
## $ sale_condition <fct> Normal, Normal, Normal, Abnorml, Normal, Normal,…
## $ sale_price <dbl> 208500, 181500, 223500, 140000, 250000, 143000, …
sale_price
As mentioned in Part 1, the Kaggle competition evaluates performance with the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price. We therefore also work with the logarithm of sale_price (log_sale_price).
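As a minimal sketch of this metric (the helper name `rmse_log` and the toy data frame are illustrative assumptions, not part of the original notebook), the error and the matching log transformation could be written as follows:

```r
# Hypothetical helper: RMSE between log(predicted) and log(observed)
rmse_log <- function(predicted, observed) {
  sqrt(mean((log(predicted) - log(observed))^2))
}

# Toy data: the first three sale prices shown in the output above
toy <- data.frame(sale_price = c(208500, 181500, 223500))

# Working on the log scale turns the competition metric into a plain RMSE
toy$log_sale_price <- log(toy$sale_price)

# A prediction that is 10% too high everywhere has the same error
# whatever the price level: log(1.1)
err <- rmse_log(predicted = toy$sale_price * 1.1, observed = toy$sale_price)
err
```

Because the error depends only on the ratio between prediction and observation, cheap and expensive houses weigh equally in the score.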
Plot of the 2 variations of sale_price.
gridExtra::grid.arrange(
  ggplot(train, aes(x = sale_price)) +
    geom_density(),
  ggplot(train, aes(x = log_sale_price)) +
    geom_density()
)

mo_sold
The mo_sold variable (‘Month Sold’) can be recoded as a “cyclical” variable (to keep the time variability).
Indeed, if we use 1-12 encoding, we’re telling the model that months 4 and 5 are very similar, while months 1 and 12 are very dissimilar. In fact, months 1 and 12 are just as similar as months 4 and 5.
The variable can be transformed into cyclical values, using cos/sin transformations (http://blog.davidkaleko.com/feature-engineering-cyclical-features.html).
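To make the similarity argument concrete, here is a small sketch (the helper `month_dist` is a hypothetical name introduced only for this illustration) that measures the Euclidean distance between months once they are mapped onto the unit circle:

```r
# Months 1..12 mapped onto the unit circle
months <- 1:12
sin_trans <- sin(2 * pi * months / 12)
cos_trans <- cos(2 * pi * months / 12)

# Hypothetical helper: Euclidean distance between two months in (cos, sin) space
month_dist <- function(a, b) {
  sqrt((sin_trans[a] - sin_trans[b])^2 + (cos_trans[a] - cos_trans[b])^2)
}

month_dist(4, 5)    # April vs May
month_dist(12, 1)   # December vs January: same distance as April vs May
month_dist(1, 7)    # January vs July: opposite sides of the circle
```

With the raw 1-12 encoding, December and January would be 11 units apart; after the transformation they are exactly as close as any other pair of consecutive months.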
train %>%
  left_join(tibble(mo_sold = 1:12,
                   mo_sold_abbr = month.abb), by = "mo_sold") %>%
  mutate(mo_sold_sin_trans = sin(2*pi * mo_sold / 12),
         mo_sold_cos_trans = cos(2*pi * mo_sold / 12)) %>%
  distinct(mo_sold, mo_sold_sin_trans, mo_sold_cos_trans, mo_sold_abbr) %>%
  ggplot(aes(x = mo_sold_cos_trans, y = mo_sold_sin_trans)) +
  geom_point() +
  geom_label(aes(label = paste(mo_sold_abbr, "\n cos : ", round(mo_sold_cos_trans, 4), "\n sin : ", round(mo_sold_sin_trans, 4))), size = 3) +
  scale_x_continuous(limits = c(-1.3, 1.3)) +
  scale_y_continuous(limits = c(-1.3, 1.3))

The following function will then be applied to the train and test datasets.
# function to transform into cyclical variables
cyclical_transform <- function(df, column_name) {
  # create new cyclical variables ([[ ]] extracts the column as a plain vector,
  # which behaves the same for data frames and tibbles)
  df[[paste0(column_name, "_sin_trans")]] <- sin(2*pi * df[[column_name]] / 12)
  df[[paste0(column_name, "_cos_trans")]] <- cos(2*pi * df[[column_name]] / 12)
  # remove original variable
  df[[column_name]] <- NULL
  return(df)
}
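As a usage sketch, the variant below adds an explicit `period` argument so that the same idea could serve for other cycles (for example 7 for weekdays). The name `cyclical_transform_period`, the `period` argument, and the toy data are assumptions made for illustration; the notebook itself only uses the 12-month version.

```r
# Hypothetical variant of cyclical_transform() with an explicit period
cyclical_transform_period <- function(df, column_name, period = 12) {
  angle <- 2 * pi * df[[column_name]] / period
  df[[paste0(column_name, "_sin_trans")]] <- sin(angle)
  df[[paste0(column_name, "_cos_trans")]] <- cos(angle)
  df[[column_name]] <- NULL  # drop the original column
  df
}

# Toy data: months 2, 5 and 9 (the first three mo_sold values shown earlier)
toy <- data.frame(id = 1:3, mo_sold = c(2, 5, 9))
toy_cyc <- cyclical_transform_period(toy, "mo_sold")
toy_cyc
```

The sine and cosine values obtained for these three months agree with the first entries of mo_sold_sin_trans and mo_sold_cos_trans displayed later in the notebook.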
# train$mo_sold_sin_trans <- sin(2*pi * train$mo_sold / 12)
# train$mo_sold_cos_trans <- cos(2*pi * train$mo_sold / 12)
#
# train %>%
# select(MoSold, mo_sold_sin_trans, mo_sold_cos_trans)

garage_yr_blt
In Part 1, we temporarily transformed the garage_yr_blt variable into a categorical variable, so that it could also record the absence of a garage.
## # A tibble: 1,459 x 1
## garage_yr_blt
## <fct>
## 1 2003
## 2 1976
## 3 2001
## 4 1998
## 5 2000
## 6 1993
## 7 2004
## 8 1973
## 9 1931
## 10 1939
## # … with 1,449 more rows
The garage_yr_blt variable has 2 distinct kinds of values:
- a year (e.g. 2003)
- None
There are 2 other variables related to year: year_built and year_remod_add.
So, the idea for recoding garage_yr_blt is the following:
- has_garage: whether the house has a garage
- garage_yr_same_built and garage_yr_same_remod: whether the garage was built in the same year as the house or as its remodeling
This is intended to reproduce the information in garage_yr_blt as closely as possible.
# Observe the result with new columns
train %>%
  mutate(has_garage = as.numeric(garage_type != "None")) %>%
  select(year_built, year_remod_add, has_garage, garage_yr_blt) %>%
  mutate(garage_yr_same_built = garage_yr_blt == year_built,
         garage_yr_same_remod = year_built != year_remod_add & garage_yr_blt == year_remod_add) %>%
  head(10) %>%
  kable() %>%
  kable_styling(bootstrap_options = "striped", full_width = FALSE, position = "left") %>%
  scroll_box(width = "800px")

| year_built | year_remod_add | has_garage | garage_yr_blt | garage_yr_same_built | garage_yr_same_remod |
|---|---|---|---|---|---|
| 2003 | 2003 | 1 | 2003 | TRUE | FALSE |
| 1976 | 1976 | 1 | 1976 | TRUE | FALSE |
| 2001 | 2002 | 1 | 2001 | TRUE | FALSE |
| 1915 | 1970 | 1 | 1998 | FALSE | FALSE |
| 2000 | 2000 | 1 | 2000 | TRUE | FALSE |
| 1993 | 1995 | 1 | 1993 | TRUE | FALSE |
| 2004 | 2005 | 1 | 2004 | TRUE | FALSE |
| 1973 | 1973 | 1 | 1973 | TRUE | FALSE |
| 1931 | 1950 | 1 | 1931 | TRUE | FALSE |
| 1939 | 1950 | 1 | 1939 | TRUE | FALSE |
Let’s create a function (so that we can apply it on the test dataset).
garage_year_transform <- function(df) {
  df <- df %>%
    mutate(has_garage = as.numeric(garage_type != "None"),
           garage_yr_same_built = as.numeric(garage_yr_blt == year_built),
           garage_yr_same_remod = as.numeric(year_built != year_remod_add & garage_yr_blt == year_remod_add)) %>%
    select(-garage_yr_blt)
  return(df)
}

Apply to the train dataset.
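The recoding can be checked on a few made-up rows. The sketch below is a base R re-statement of the same logic on hypothetical toy data (it does not call the function above), and it also shows why comparing the factor garage_yr_blt with a numeric year works:

```r
# Toy rows (hypothetical): garage_yr_blt is a factor that may contain "None"
toy <- data.frame(
  year_built     = c(2003, 1915, 1950, 1960),
  year_remod_add = c(2003, 1970, 1998, 1960),
  garage_type    = factor(c("Attchd", "Detchd", "Detchd", "None")),
  garage_yr_blt  = factor(c("2003", "1998", "1998", "None"))
)

# Comparing a factor with a number compares the factor labels as text,
# so "2003" == 2003 is TRUE and "None" == 1960 is FALSE (no error, no NA)
toy$has_garage           <- as.numeric(toy$garage_type != "None")
toy$garage_yr_same_built <- as.numeric(toy$garage_yr_blt == toy$year_built)
toy$garage_yr_same_remod <- as.numeric(
  toy$year_built != toy$year_remod_add &
    toy$garage_yr_blt == toy$year_remod_add
)
toy$garage_yr_blt <- NULL
toy
```

Only the third row gets garage_yr_same_remod = 1: its garage dates from the remodeling year, which differs from the construction year. The house without a garage gets 0 for all three indicators.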
We created some ordinal variables in Part 1; in those variables, the order of the levels matters (the level “Excellent” is generally better than “Poor” or “Fair”).
## [1] "overall_qual" "overall_cond" "exter_qual" "exter_cond"
## [5] "bsmt_qual" "bsmt_cond" "bsmt_fin_type1" "bsmt_fin_type2"
## [9] "heating_qc" "electrical" "kitchen_qual" "functional"
## [13] "fireplace_qu" "garage_qual" "garage_cond" "pool_qc"
## [17] "fence"
These levels can now be converted to numeric values, the lowest value corresponding to the least important level.
For example, let’s look at the variable electrical.
##
## FuseP FuseF FuseA SBrkr <NA>
## 3 27 94 1334 1
# distribution of numerical levels : the 'NA' is still present
table(as.numeric(train$electrical), useNA = "ifany")
##
## 1 2 3 4 <NA>
## 3 27 94 1334 1
The vtreat package provides tools to deal with the remaining NA values.
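The conversion of all ordinal variables at once is not shown here. One possible way to do it, sketched in base R on hypothetical toy data (the notebook may well have used a different approach), is to select the ordered factors and replace them with their level positions:

```r
# Toy data (hypothetical) with one ordered factor and one plain factor
toy <- data.frame(
  electrical = factor(c("SBrkr", "FuseA", NA, "FuseP"),
                      levels = c("FuseP", "FuseF", "FuseA", "SBrkr"),
                      ordered = TRUE),
  street = factor(c("Pave", "Pave", "Grvl", "Pave"))
)

# Convert every ordered factor to its level position (1 = lowest level);
# unordered factors are left untouched and NA values stay NA
ord_cols <- names(toy)[sapply(toy, is.ordered)]
toy[ord_cols] <- lapply(toy[ord_cols], as.numeric)
str(toy)
```

Here SBrkr becomes 4 and FuseP becomes 1, in line with the two frequency tables above.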
## Observations: 1,459
## Variables: 84
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ ms_sub_class <fct> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20…
## $ ms_zoning <fct> RL, RL, RL, RL, RL, RL, RL, RL, RM, RL, RL,…
## $ lot_frontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70,…
## $ lot_area <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 1008…
## $ street <fct> Pave, Pave, Pave, Pave, Pave, Pave, Pave, P…
## $ alley <fct> None, None, None, None, None, None, None, N…
## $ lot_shape <fct> Reg, Reg, IR1, IR1, IR1, IR1, Reg, IR1, Reg…
## $ land_contour <fct> Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl, Lvl…
## $ utilities <fct> AllPub, AllPub, AllPub, AllPub, AllPub, All…
## $ lot_config <fct> Inside, FR2, Inside, Corner, FR2, Inside, I…
## $ land_slope <fct> Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl, Gtl…
## $ neighborhood <fct> CollgCr, Veenker, CollgCr, Crawfor, NoRidge…
## $ condition1 <fct> Norm, Feedr, Norm, Norm, Norm, Norm, Norm, …
## $ condition2 <fct> Norm, Norm, Norm, Norm, Norm, Norm, Norm, N…
## $ bldg_type <fct> 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1Fam, 1…
## $ house_style <fct> 2Story, 1Story, 2Story, 2Story, 2Story, 1.5…
## $ overall_qual <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6…
## $ overall_cond <dbl> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5…
## $ year_built <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1…
## $ year_remod_add <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1…
## $ roof_style <fct> Gable, Gable, Gable, Gable, Gable, Gable, G…
## $ roof_matl <fct> CompShg, CompShg, CompShg, CompShg, CompShg…
## $ exterior1st <fct> VinylSd, MetalSd, VinylSd, Wd Sdng, VinylSd…
## $ exterior2nd <fct> VinylSd, MetalSd, VinylSd, Wd Shng, VinylSd…
## $ mas_vnr_type <fct> BrkFace, None, BrkFace, None, BrkFace, None…
## $ mas_vnr_area <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, …
## $ exter_qual <dbl> 4, 3, 4, 3, 4, 3, 4, 3, 3, 3, 3, 5, 3, 4, 3…
## $ exter_cond <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3…
## $ foundation <fct> PConc, CBlock, PConc, BrkTil, PConc, Wood, …
## $ bsmt_qual <dbl> 5, 5, 5, 4, 5, 5, 6, 5, 4, 4, 4, 6, 4, 5, 4…
## $ bsmt_cond <dbl> 4, 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ bsmt_exposure <fct> No, Gd, Mn, No, Av, No, Av, Mn, No, No, No,…
## $ bsmt_fin_type1 <dbl> 7, 6, 7, 6, 7, 7, 7, 6, 2, 7, 4, 7, 6, 2, 5…
## $ bsmt_fin_sf1 <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0,…
## $ bsmt_fin_type2 <dbl> 2, 2, 2, 2, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2, 2…
## $ bsmt_fin_sf2 <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, …
## $ bsmt_unf_sf <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952,…
## $ total_bsmt_sf <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107,…
## $ heating <fct> GasA, GasA, GasA, GasA, GasA, GasA, GasA, G…
## $ heating_qc <dbl> 5, 5, 5, 4, 5, 5, 5, 5, 4, 5, 5, 5, 3, 5, 3…
## $ central_air <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ electrical <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4…
## $ x1st_flr_sf <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107,…
## $ x2nd_flr_sf <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0…
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ gr_liv_area <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2…
## $ bsmt_full_bath <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1…
## $ bsmt_half_bath <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ full_bath <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1…
## $ half_bath <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1…
## $ bedroom_abv_gr <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2…
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1…
## $ kitchen_qual <dbl> 4, 3, 4, 4, 4, 3, 4, 3, 3, 3, 3, 5, 3, 4, 3…
## $ tot_rms_abv_grd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, …
## $ functional <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8…
## $ fireplaces <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1…
## $ fireplace_qu <dbl> 1, 4, 4, 5, 4, 1, 5, 4, 4, 4, 1, 5, 1, 5, 3…
## $ garage_type <fct> Attchd, Attchd, Attchd, Detchd, Attchd, Att…
## $ garage_finish <fct> RFn, RFn, RFn, Unf, RFn, Unf, RFn, RFn, Unf…
## $ garage_cars <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1…
## $ garage_area <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468…
## $ garage_qual <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4…
## $ garage_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4…
## $ paved_drive <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y…
## $ wood_deck_sf <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, …
## $ open_porch_sf <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21…
## $ enclosed_porch <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0…
## $ x3ssn_porch <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ screen_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0,…
## $ pool_area <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pool_qc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ fence <dbl> 1, 1, 1, 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 1, 3…
## $ misc_feature <fct> None, None, None, None, None, Shed, None, S…
## $ misc_val <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, …
## $ yr_sold <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2…
## $ sale_type <fct> WD, WD, WD, WD, WD, WD, WD, WD, WD, WD, WD,…
## $ sale_condition <fct> Normal, Normal, Normal, Abnorml, Normal, No…
## $ log_sale_price <dbl> 12.24769, 12.10901, 12.31717, 11.84940, 12.…
## $ mo_sold_sin_trans <dbl> 8.660254e-01, 5.000000e-01, -1.000000e+00, …
## $ mo_sold_cos_trans <dbl> 5.000000e-01, -8.660254e-01, -1.836970e-16,…
## $ has_garage <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ garage_yr_same_built <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ garage_yr_same_remod <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
The vtreat package is very handy and makes it possible to create a clean data frame that:
- only has numeric columns (other than the outcome)
- has no Infinite/NA/NaN in the effective variable columns
The use pattern is:
- design a treatment plan
- use it to prepare the data frames (train and test)
We will create 2 distinct datasets, train_treated and valid_treated:
- train_treated will be used to train different models in Part 3
- valid_treated will be used to evaluate the models’ performance in Part 3
The following picture helps to understand the different datasets used in each step:
# Split train into 2 temporary datasets, ratio 0.75/0.25
set.seed(42)
sample_group <- sample(x = c("v_train", "v_valid"), size = nrow(train), replace = TRUE, prob = c(0.75, 0.25))
vtreat_train <- train[sample_group == "v_train", ]
vtreat_valid <- train[sample_group == "v_valid", ]

The first treatment plan will be basic, with no options (no significance or pruning defined).
# Define response and input variables
response <- "log_sale_price"
input_vars <- setdiff(colnames(vtreat_train), response)

Create a treatment plan, based on cross-validation.
# Basic treatment plan
vtreat_plan <- mkCrossFrameNExperiment(dframe = vtreat_train,
                                       varlist = input_vars,
                                       outcomename = response)
## [1] "vtreat 1.4.2 start initial treatment design Sun Jul 7 19:33:22 2019"
## [1] " start cross frame work Sun Jul 7 19:33:28 2019"
## [1] " vtreat::mkCrossFrameNExperiment done Sun Jul 7 19:33:32 2019"
The treatments element of the result holds the complete treatment plan, and its scoreFrame lists the newly created variables (with their types and significance levels).
The new values are stored in the crossFrame element.
Let’s extract the newly created variables related to alley.
## varName varMoves rsq sig needsSplit
## 1 alley_catP TRUE 0.0187716570 5.004325e-06 TRUE
## 2 alley_catN TRUE 0.0277306535 2.688002e-08 TRUE
## 3 alley_catD TRUE 0.0222482994 6.566783e-07 TRUE
## 4 alley_lev_x_Grvl TRUE 0.0300314294 7.033336e-09 FALSE
## 5 alley_lev_x_None TRUE 0.0188307617 4.834244e-06 FALSE
## 6 alley_lev_x_Pave TRUE 0.0001759761 6.600172e-01 FALSE
## extraModelDegrees origName code
## 1 2 alley catP
## 2 2 alley catN
## 3 2 alley catD
## 4 0 alley lev
## 5 0 alley lev
## 6 0 alley lev
We can see that 6 new variables have been created.
The ‘lev’ variables are one-hot-encoded variables.
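To give an intuition for the other codes, the sketch below computes simplified versions of them by hand on hypothetical toy data. It follows the definitions given in the vtreat documentation on variable types ('catP' is a prevalence, 'catN' an impact code, 'catD' a deviation), but vtreat itself adds smoothing and cross-validation, so its values will differ from this naive calculation.

```r
# Toy data (hypothetical): one categorical input and a numeric outcome
toy <- data.frame(
  alley = c("None", "None", "None", "Grvl", "Grvl", "Pave"),
  y     = c(12.2, 12.0, 12.4, 11.6, 11.8, 12.0),
  stringsAsFactors = FALSE
)
levels_seen <- unique(toy$alley)

# 'lev'  : one indicator (0/1) column per level (one-hot encoding)
lev <- sapply(levels_seen, function(l) as.numeric(toy$alley == l))
colnames(lev) <- paste0("alley_lev_x_", levels_seen)

# 'catP' : prevalence, i.e. the share of rows carrying the same level
cat_p <- as.numeric(table(toy$alley)[toy$alley]) / nrow(toy)

# 'catN' : impact code, i.e. mean outcome of the level minus the overall mean
cat_n <- ave(toy$y, toy$alley, FUN = mean) - mean(toy$y)

# 'catD' : deviation, i.e. standard deviation of the outcome within the level
# (NA here for a level that is seen only once)
cat_d <- ave(toy$y, toy$alley, FUN = sd)

cbind(toy, lev, catP = cat_p, catN = round(cat_n, 3), catD = round(cat_d, 3))
```

The 'catN' column is the only one of these that uses the outcome mean directly, which is why it is called a “y-aware” variable and why it needs the cross-frame procedure to avoid overfitting.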
We can also extract the new “projected” values.
# Extract new values dataframe related to 'alley'
vtreat_plan$crossFrame %>%
  select(starts_with("alley")) %>%
  head(10)
## alley_catP alley_catN alley_catD alley_lev_x_Grvl alley_lev_x_None
## 1 0.9306122 0.01301487 0.4048582 0 1
## 2 0.9306122 0.01301487 0.4048582 0 1
## 3 0.9319728 0.01619430 0.3984798 0 1
## 4 0.9359673 0.01484579 0.3966729 0 1
## 5 0.9359673 0.01484579 0.3966729 0 1
## 6 0.9319728 0.01619430 0.3984798 0 1
## 7 0.9359673 0.01484579 0.3966729 0 1
## 8 0.9319728 0.01619430 0.3984798 0 1
## 9 0.9319728 0.01619430 0.3984798 0 1
## 10 0.9319728 0.01619430 0.3984798 0 1
## alley_lev_x_Pave
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 0
The original categorical variable alley has been transformed into 6 numeric variables.
Also note that we no longer have missing values.
## [1] 262
## [1] 0
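For numeric columns, the missing values disappear because vtreat replaces them with a stand-in value (the 'clean' variables) and can flag the affected rows in a separate indicator (the 'isBad' variables). A hand-made sketch of that idea, on a hypothetical toy vector and with variable names chosen only for this illustration, could look like this:

```r
# Toy numeric column (hypothetical values) with one missing entry
lot_frontage <- c(65, 80, 68, NA, 84)

# 'isBad' idea: indicator flagging the rows where the value was missing
lot_frontage_isbad <- as.numeric(is.na(lot_frontage))

# 'clean' idea: same column with the missing value replaced by a stand-in
# (here the mean of the observed values)
lot_frontage_clean <- ifelse(is.na(lot_frontage),
                             mean(lot_frontage, na.rm = TRUE),
                             lot_frontage)

sum(is.na(lot_frontage))        # missing values before
sum(is.na(lot_frontage_clean))  # missing values after
```

Keeping the indicator next to the imputed column lets a model learn whether “missing” is itself informative.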
Another major benefit of vtreat is its “variable selection” process.
When creating the treatment plan, it can, among other things, handle rare levels of categorical variables and prune variables based on statistical significance (“y-aware” processing).
# rare counts : if a level occurs in fewer than 1% of the total observations,
# it will be considered a rare level
# (rare levels will be grouped into a shared rare level)
vtreat_rare_count <- 0.01 * nrow(vtreat_train)
# rare significance level : choose a significance level to prune levels at.
# see http://www.win-vector.com/blog/2015/10/using-differential-privacy-to-reuse-training-data/
vtreat_rare_sig <- 0.3 # ???
# prune variables using a significance level.
# see https://winvector.github.io/vtreat/articles/vtreatSignificance.html
vtreat_prune_sig <- 1 / ncol(vtreat_train)

The sig (significance) value will be used to “prune” the variables (only variables with a sig value lower than vtreat_prune_sig are kept).
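The reasoning behind a threshold of one over the number of columns can be sketched with simple arithmetic: if a variable is pure noise, its significance is roughly uniform between 0 and 1, so about one noise variable in total is expected to slip through. The example below assumes 84 columns (the count shown in the earlier output) and reuses three sig values copied from the alley scoreFrame above; the object names are illustrative only.

```r
# Number of columns in the training data (84 in the output shown earlier)
n_vars    <- 84
prune_sig <- 1 / n_vars

# A pure-noise variable passes the filter with probability prune_sig,
# so the expected number of noise variables kept is about 1
expected_noise_kept <- n_vars * prune_sig
expected_noise_kept

# Small score frame with sig values copied from the 'alley' output above
score_frame <- data.frame(
  varName = c("alley_lev_x_Grvl", "alley_lev_x_None", "alley_lev_x_Pave"),
  sig     = c(7.033336e-09, 4.834244e-06, 6.600172e-01),
  stringsAsFactors = FALSE
)

# Keep only the variables whose significance is below the threshold
kept_vars <- score_frame$varName[score_frame$sig <= prune_sig]
kept_vars
```

With this threshold (about 0.012), alley_lev_x_Pave, whose significance is 0.66, would be dropped.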
Also, this vtreat page recommends :
We strongly suggest using the standard variables coded as ‘lev’, ‘clean’, and ‘isBad’; and the “y aware” variables coded as ‘catN’ and ‘catB’. The non sub-model variables (‘catP’ and ‘catD’) can be useful (possibly as interactions or guards on the corresponding ‘catN’ and ‘catB’ variables) but also encode distributional facts about the data that may or may not be appropriate depending on your problem domain.
This can be done when creating the treatment plan, using the argument codeRestriction.
# New treatment plan with code restriction
vtreat_plan <- mkCrossFrameNExperiment(dframe = vtreat_train,
                                       varlist = input_vars,
                                       outcomename = response,
                                       rareCount = vtreat_rare_count,
                                       rareSig = vtreat_rare_sig,
                                       codeRestriction = c("lev", "clean", "isBad", "catN"))
## [1] "vtreat 1.4.2 start initial treatment design Sun Jul 7 19:33:32 2019"
## [1] " start cross frame work Sun Jul 7 19:33:39 2019"
## [1] " vtreat::mkCrossFrameNExperiment done Sun Jul 7 19:33:44 2019"
Let’s extract the newly created variables related to alley.
## varName varMoves rsq sig needsSplit
## 1 alley_catN TRUE 0.03013298 6.629171e-09 TRUE
## 2 alley_lev_x_Grvl TRUE 0.03003143 7.033336e-09 FALSE
## 3 alley_lev_x_None TRUE 0.01883076 4.834244e-06 FALSE
## extraModelDegrees origName code
## 1 2 alley catN
## 2 0 alley lev
## 3 0 alley lev
As expected, the ‘catP’ and ‘catD’ variables have not been retained.
Also note that one level of the alley variable has not been retained: its significance level was indeed above our threshold of 0.3.
We can now filter the variables based on significance.
## [1] 178
# Select only variables with a significance value lower than our previously defined value 'vtreat_prune_sig'
treatments <- vtreat_plan$treatments  # treatment plan of the cross-frame experiment
newvars <- treatments$scoreFrame$varName[treatments$scoreFrame$sig <= vtreat_prune_sig]
length(newvars)
## [1] 147
We now have 147 new variables returned by the treatment plan.
The new values are stored in the crossFrame attribute.
# New treated training dataset
train_treated <- vtreat_plan$crossFrame[, c(newvars, "log_sale_price")]
## Observations: 1,102
## Variables: 148
## $ ms_sub_class_catN <dbl> 0.30299208, 0.30299208, -0.24959678…
## $ ms_zoning_catN <dbl> 0.06550683, 0.06550683, 0.06550683,…
## $ lot_frontage <dbl> 68.00000, 84.00000, 85.00000, 75.00…
## $ lot_area <dbl> 11250, 14260, 14115, 10084, 10382, …
## $ alley_catN <dbl> 0.01187456, 0.01187456, 0.01187456,…
## $ lot_shape_catN <dbl> 0.14188506, 0.14188506, 0.14188506,…
## $ land_contour_catN <dbl> -0.007902741, -0.007902741, -0.0079…
## $ lot_config_catN <dbl> -0.02452583, 0.00000000, -0.0245258…
## $ neighborhood_catN <dbl> 0.13697891, 0.65789741, -0.09717128…
## $ condition1_catN <dbl> 0.02096058, 0.02096058, 0.02096058,…
## $ bldg_type_catN <dbl> 0.02699036, 0.02699036, 0.02699036,…
## $ house_style_catN <dbl> 0.14962377, 0.14962377, -0.25994606…
## $ overall_qual <dbl> 7, 8, 5, 8, 7, 7, 5, 5, 9, 7, 6, 4,…
## $ year_built <dbl> 2001, 2000, 1993, 2004, 1973, 1931,…
## $ year_remod_add <dbl> 2002, 2000, 1995, 2005, 1973, 1950,…
## $ roof_style_catN <dbl> -0.04491734, -0.04491734, -0.044917…
## $ exterior1st_catN <dbl> 0.15687324, 0.15687324, 0.15687324,…
## $ exterior2nd_catN <dbl> 0.1587478, 0.1587478, 0.1587478, 0.…
## $ mas_vnr_type_catN <dbl> 0.1464440, 0.1464440, -0.1204298, 0…
## $ mas_vnr_area <dbl> 162, 350, 0, 186, 240, 0, 0, 0, 286…
## $ exter_qual <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ foundation_catN <dbl> 0.2352084, 0.2352084, 0.2393833, 0.…
## $ bsmt_qual <dbl> 5, 5, 5, 6, 5, 4, 4, 4, 6, 5, 4, 1,…
## $ bsmt_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1,…
## $ bsmt_exposure_catN <dbl> 0.10091461, 0.10104746, -0.07571532…
## $ bsmt_fin_type1 <dbl> 7, 7, 7, 7, 6, 2, 7, 4, 7, 2, 5, 1,…
## $ bsmt_fin_sf1 <dbl> 486, 655, 732, 1369, 859, 0, 851, 9…
## $ bsmt_unf_sf <dbl> 434, 490, 64, 317, 216, 952, 140, 1…
## $ total_bsmt_sf <dbl> 920, 1145, 796, 1686, 1107, 952, 99…
## $ heating_catN <dbl> 0.009119800, 0.009119800, 0.0091198…
## $ heating_qc <dbl> 5, 5, 5, 5, 5, 4, 5, 5, 5, 5, 3, 3,…
## $ electrical <dbl> 4, 4, 4, 4, 4, 2, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf <dbl> 920, 1145, 796, 1694, 1107, 1022, 1…
## $ x2nd_flr_sf <dbl> 866, 1053, 566, 0, 983, 752, 0, 0, …
## $ gr_liv_area <dbl> 1786, 2198, 1362, 1694, 2090, 1774,…
## $ bsmt_full_bath <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0,…
## $ full_bath <dbl> 2, 2, 1, 2, 2, 2, 1, 1, 3, 2, 1, 2,…
## $ half_bath <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ bedroom_abv_gr <dbl> 3, 4, 1, 3, 3, 2, 2, 3, 4, 3, 2, 2,…
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2,…
## $ kitchen_qual <dbl> 4, 4, 3, 4, 3, 3, 3, 3, 5, 4, 3, 3,…
## $ tot_rms_abv_grd <dbl> 6, 9, 5, 7, 7, 8, 5, 5, 11, 7, 5, 6…
## $ functional <dbl> 8, 8, 8, 8, 8, 7, 8, 8, 8, 8, 8, 8,…
## $ fireplaces <dbl> 1, 1, 0, 1, 2, 2, 2, 0, 2, 1, 1, 0,…
## $ fireplace_qu <dbl> 4, 4, 1, 5, 4, 4, 4, 1, 5, 5, 3, 1,…
## $ garage_type_catN <dbl> 0.1404059, 0.1404059, 0.1404059, 0.…
## $ garage_finish_catN <dbl> 0.1666591, 0.1666591, -0.1968145, 0…
## $ garage_cars <dbl> 2, 3, 2, 2, 2, 2, 1, 1, 3, 3, 1, 2,…
## $ garage_area <dbl> 608, 836, 480, 636, 484, 468, 205, …
## $ garage_qual <dbl> 4, 4, 4, 4, 4, 3, 5, 4, 4, 4, 4, 4,…
## $ garage_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN <dbl> 0.03148809, 0.03148809, 0.03148809,…
## $ wood_deck_sf <dbl> 0, 192, 40, 255, 235, 90, 0, 0, 147…
## $ open_porch_sf <dbl> 42, 84, 30, 57, 204, 0, 4, 0, 21, 3…
## $ enclosed_porch <dbl> 0, 0, 0, 0, 228, 205, 0, 0, 0, 0, 1…
## $ screen_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ pool_qc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence <dbl> 1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 3, 1,…
## $ misc_feature_catN <dbl> 0.005977216, 0.005977216, -0.181971…
## $ sale_type_catN <dbl> -0.02942813, -0.02942813, -0.029428…
## $ sale_condition_catN <dbl> -0.01765492, -0.01765492, -0.017654…
## $ has_garage <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_remod <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_rare <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_160 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_50 <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60 <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ ms_sub_class_lev_x_90 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ ms_zoning_lev_x_FV <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ ms_zoning_lev_x_RM <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_Grvl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1 <dbl> 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0,…
## $ lot_shape_lev_x_IR2 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg <dbl> 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ land_contour_lev_x_Bnk <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_Inside <dbl> 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1,…
## $ land_slope_lev_x_Gtl <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_CollgCr <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ neighborhood_lev_x_Crawfor <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ neighborhood_lev_x_NoRidge <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ neighborhood_lev_x_OldTown <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1,…
## $ neighborhood_lev_x_Somerst <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Norm <dbl> 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ bldg_type_lev_x_1Fam <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0,…
## $ bldg_type_lev_x_Duplex <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ bldg_type_lev_x_Twnhs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1_5Fin <dbl> 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1,…
## $ house_style_lev_x_2Story <dbl> 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ house_style_lev_x_SFoyer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable <dbl> 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1,…
## $ roof_style_lev_x_Hip <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,…
## $ roof_matl_lev_x_CompShg <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior1st_lev_x_VinylSd <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,…
## $ exterior2nd_lev_x_VinylSd <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ mas_vnr_type_lev_x_None <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1,…
## $ mas_vnr_type_lev_x_Stone <dbl> 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0,…
## $ foundation_lev_x_BrkTil <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,…
## $ foundation_lev_x_CBlock <dbl> 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,…
## $ foundation_lev_x_PConc <dbl> 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Av <dbl> 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Gd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No <dbl> 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,…
## $ bsmt_exposure_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ heating_lev_x_GasA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ central_air_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_type_lev_x_Attchd <dbl> 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_type_lev_x_BuiltIn <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_type_lev_x_Detchd <dbl> 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ garage_type_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ garage_finish_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0,…
## $ garage_finish_lev_x_Unf <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,…
## $ paved_drive_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0,…
## $ misc_feature_lev_x_Shed <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1,…
## $ sale_type_lev_x_COD <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_New <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ sale_type_lev_x_WD <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ log_sale_price <dbl> 12.31717, 12.42922, 11.87060, 12.63…
We can see that the id variable, which we strongly suspected of having no relation with the sale price, has been removed from the treated training set.
We can now prepare the validation dataset with the same treatment plan as the training set; the pruneSig argument of prepare() prunes the variables directly.
# apply treatment plan to validation set
valid_treated <- prepare(treatmentplan = treatments,
                         dframe = vtreat_valid,
                         pruneSig = vtreat_prune_sig)
## Observations: 357
## Variables: 148
## $ ms_sub_class_catN <dbl> 0.31707790, 0.02487946, -0.06736683…
## $ ms_zoning_catN <dbl> 0.06141452, 0.06141452, 0.06141452,…
## $ lot_frontage <dbl> 65.0000, 80.0000, 60.0000, 69.9967,…
## $ lot_area <dbl> 8450, 9600, 9550, 12968, 6120, 1124…
## $ alley_catN <dbl> 0.01468384, 0.01468384, 0.01468384,…
## $ lot_shape_catN <dbl> -0.09031732, -0.09031732, 0.1527919…
## $ land_contour_catN <dbl> -0.005418798, -0.005418798, -0.0054…
## $ lot_config_catN <dbl> -0.02389266, 0.00000000, 0.00000000…
## $ neighborhood_catN <dbl> 0.11086832, -0.09532982, 0.23254295…
## $ condition1_catN <dbl> 0.0209811, -0.2327314, 0.0209811, 0…
## $ bldg_type_catN <dbl> 0.02452499, 0.02452499, 0.02452499,…
## $ house_style_catN <dbl> 0.16419373, -0.03114034, 0.16419373…
## $ overall_qual <dbl> 7, 6, 7, 5, 7, 6, 8, 8, 5, 8, 4, 5,…
## $ year_built <dbl> 2003, 1976, 1915, 1962, 1929, 1970,…
## $ year_remod_add <dbl> 2003, 1976, 1970, 1962, 2001, 1970,…
## $ roof_style_catN <dbl> -0.04345073, -0.04345073, -0.043450…
## $ exterior1st_catN <dbl> 0.17180733, -0.16217432, -0.1741279…
## $ exterior2nd_catN <dbl> 0.17412141, -0.15994012, -0.1003386…
## $ mas_vnr_type_catN <dbl> 0.1514415, -0.1297748, -0.1297748, …
## $ mas_vnr_area <dbl> 196, 0, 0, 0, 0, 180, 380, 281, 0, …
## $ exter_qual <dbl> 4, 3, 3, 3, 3, 3, 4, 4, 3, 4, 3, 3,…
## $ foundation_catN <dbl> 0.2294760, -0.1589739, -0.3096192, …
## $ bsmt_qual <dbl> 5, 5, 4, 4, 4, 4, 6, 5, 5, 6, 4, 4,…
## $ bsmt_cond <dbl> 4, 4, 5, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ bsmt_exposure_catN <dbl> -0.08084167, 0.35620572, -0.0808416…
## $ bsmt_fin_type1 <dbl> 7, 6, 6, 6, 2, 6, 2, 2, 7, 7, 2, 2,…
## $ bsmt_fin_sf1 <dbl> 706, 978, 216, 737, 0, 578, 0, 0, 8…
## $ bsmt_unf_sf <dbl> 150, 284, 540, 175, 832, 426, 1158,…
## $ total_bsmt_sf <dbl> 856, 1262, 756, 912, 832, 1004, 115…
## $ heating_catN <dbl> 0.008521347, 0.008521347, 0.0085213…
## $ heating_qc <dbl> 5, 5, 4, 3, 5, 5, 5, 5, 3, 5, 2, 4,…
## $ electrical <dbl> 4, 4, 4, 4, 3, 4, 4, 4, 4, 4, 4, 4,…
## $ x1st_flr_sf <dbl> 856, 1262, 961, 912, 854, 1004, 115…
## $ x2nd_flr_sf <dbl> 854, 0, 756, 0, 0, 0, 1218, 0, 0, 0…
## $ gr_liv_area <dbl> 1710, 1262, 1717, 912, 854, 1004, 2…
## $ bsmt_full_bath <dbl> 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0,…
## $ full_bath <dbl> 2, 2, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1,…
## $ half_bath <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ bedroom_abv_gr <dbl> 3, 3, 3, 2, 2, 2, 4, 3, 3, 3, 1, 3,…
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ kitchen_qual <dbl> 4, 3, 4, 3, 3, 3, 4, 4, 3, 4, 2, 4,…
## $ tot_rms_abv_grd <dbl> 8, 6, 7, 4, 5, 5, 9, 7, 6, 7, 4, 6,…
## $ functional <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8,…
## $ fireplaces <dbl> 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0,…
## $ fireplace_qu <dbl> 1, 4, 5, 1, 1, 4, 5, 5, 4, 5, 1, 1,…
## $ garage_type_catN <dbl> 0.1336612, 0.1336612, -0.2537943, -…
## $ garage_finish_catN <dbl> 0.1513195, 0.1513195, -0.2014492, -…
## $ garage_cars <dbl> 2, 2, 3, 1, 2, 2, 3, 2, 2, 3, 1, 1,…
## $ garage_area <dbl> 548, 460, 642, 352, 576, 480, 853, …
## $ garage_qual <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 3, 4,…
## $ garage_cond <dbl> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ paved_drive_catN <dbl> 0.03529355, 0.03529355, 0.03529355,…
## $ wood_deck_sf <dbl> 0, 298, 0, 140, 48, 0, 240, 171, 10…
## $ open_porch_sf <dbl> 61, 0, 35, 0, 112, 0, 154, 159, 110…
## $ enclosed_porch <dbl> 0, 0, 272, 0, 0, 0, 0, 0, 0, 0, 87,…
## $ screen_porch <dbl> 0, 0, 0, 176, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ fence <dbl> 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, 1, 4,…
## $ misc_feature_catN <dbl> 0.006052505, 0.006052505, 0.0060525…
## $ sale_type_catN <dbl> -0.02981943, -0.02981943, -0.029819…
## $ sale_condition_catN <dbl> -0.02056441, -0.02056441, -0.193405…
## $ has_garage <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ garage_yr_same_built <dbl> 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1,…
## $ garage_yr_same_remod <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ ms_sub_class_lev_rare <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_120 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ ms_sub_class_lev_x_160 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_30 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ ms_sub_class_lev_x_50 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_60 <dbl> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ ms_sub_class_lev_x_90 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_FV <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ms_zoning_lev_x_RL <dbl> 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1,…
## $ ms_zoning_lev_x_RM <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0,…
## $ alley_lev_x_Grvl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ alley_lev_x_None <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ lot_shape_lev_x_IR1 <dbl> 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1,…
## $ lot_shape_lev_x_IR2 <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_shape_lev_x_Reg <dbl> 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0,…
## $ land_contour_lev_x_Bnk <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_HLS <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ land_contour_lev_x_Low <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lot_config_lev_x_CulDSac <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ lot_config_lev_x_Inside <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ land_slope_lev_x_Gtl <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ neighborhood_lev_x_BrkSide <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ neighborhood_lev_x_CollgCr <dbl> 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Crawfor <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Edwards <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NAmes <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NoRidge <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_NridgHt <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ neighborhood_lev_x_OldTown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Sawyer <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1,…
## $ neighborhood_lev_x_Somerst <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ neighborhood_lev_x_Timber <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Artery <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ condition1_lev_x_Feedr <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ condition1_lev_x_Norm <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ bldg_type_lev_x_1Fam <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
## $ bldg_type_lev_x_Duplex <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bldg_type_lev_x_Twnhs <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1_5Fin <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_1Story <dbl> 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1,…
## $ house_style_lev_x_2Story <dbl> 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ house_style_lev_x_SFoyer <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ roof_style_lev_x_Gable <dbl> 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,…
## $ roof_style_lev_x_Hip <dbl> 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ roof_matl_lev_x_CompShg <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ exterior1st_lev_x_CemntBd <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ exterior1st_lev_x_MetalSd <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior1st_lev_x_VinylSd <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
## $ exterior1st_lev_x_Wd_Sdng <dbl> 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
## $ exterior2nd_lev_x_MetalSd <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ exterior2nd_lev_x_VinylSd <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
## $ exterior2nd_lev_x_Wd_Sdng <dbl> 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_BrkFace <dbl> 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0,…
## $ mas_vnr_type_lev_x_None <dbl> 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ mas_vnr_type_lev_x_Stone <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ foundation_lev_x_BrkTil <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ foundation_lev_x_CBlock <dbl> 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ foundation_lev_x_PConc <dbl> 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,…
## $ bsmt_exposure_lev_x_Av <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_Gd <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_exposure_lev_x_No <dbl> 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ bsmt_exposure_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ heating_lev_x_GasA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ central_air_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ central_air_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1,…
## $ garage_type_lev_x_Attchd <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,…
## $ garage_type_lev_x_BuiltIn <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ garage_type_lev_x_Detchd <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0,…
## $ garage_type_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_Fin <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_None <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ garage_finish_lev_x_RFn <dbl> 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0,…
## $ garage_finish_lev_x_Unf <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ paved_drive_lev_x_N <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ paved_drive_lev_x_Y <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_None <dbl> 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,…
## $ misc_feature_lev_x_Shed <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_COD <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_New <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ sale_type_lev_x_WD <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ sale_condition_lev_x_Abnorml <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ sale_condition_lev_x_Normal <dbl> 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,…
## $ sale_condition_lev_x_Partial <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ log_sale_price <dbl> 12.24769, 12.10901, 11.84940, 11.87…
We now have two datasets with the same structure, containing only numeric variables and no missing values.
## [1] TRUE
## [1] TRUE
## [1] TRUE
## [1] 0
## [1] 0
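The code behind these checks is not shown in the notebook. As a minimal sketch of how such checks might be written in base R, the example below uses two small stand-in data frames with made-up values in place of the real treated sets; the names of the intermediate objects are illustrative assumptions, not taken from the original code.

```r
# Hypothetical stand-ins for the treated training and validation sets
train_treated <- data.frame(lot_area       = c(8450, 9600, 11250),
                            overall_qual   = c(7, 6, 7),
                            log_sale_price = c(12.25, 12.11, 12.32))
valid_treated <- data.frame(lot_area       = c(9550, 14260),
                            overall_qual   = c(7, 8),
                            log_sale_price = c(11.85, 12.43))

# Same variables, in the same order, in both datasets
same_columns <- identical(names(train_treated), names(valid_treated))

# Only numeric variables in each dataset
train_all_numeric <- all(vapply(train_treated, is.numeric, logical(1)))
valid_all_numeric <- all(vapply(valid_treated, is.numeric, logical(1)))

# Total number of missing values in each dataset
train_na_count <- sum(is.na(train_treated))
valid_na_count <- sum(is.na(valid_treated))

print(same_columns)
print(train_all_numeric)
print(valid_all_numeric)
print(train_na_count)
print(valid_na_count)
```

On the real data, the same five expressions would be applied to the treated training and validation sets produced by prepare().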
Let’s recap the different steps performed to prepare the data:
sale_price variable
mo_sold variable
garage_yr_blt variable
Only the first step will not be performed on the test dataset, since sale_price is the variable of interest and therefore does not appear in the test data.
To close this Part 2, let’s save all necessary elements.