The “House Price Prediction” competition from Kaggle challenges us to predict the final sale price of houses in Ames, Iowa.
Each house is described by a set of 79 features (year built, location and neighborhood, type of heating, …), and a training dataset is available with the sale price of 1460 houses.
The goal is to predict the final sale price of the 1459 houses described in the test dataset.

Note that submissions are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted price and the logarithm of the observed sale price.
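
As a minimal sketch of this metric (the function name `rmse_log` and the toy prices are illustrative and not part of the competition material), the score could be computed as follows:

```r
# Root-mean-squared error between log(predicted) and log(observed) prices.
# 'rmse_log' is a hypothetical helper name used for illustration only.
rmse_log <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed),
            all(predicted > 0), all(observed > 0))
  sqrt(mean((log(predicted) - log(observed))^2))
}

# Toy values: both predictions are 10% above the observed price
score <- rmse_log(predicted = c(110000, 550000), observed = c(100000, 500000))
score
```

Because the error is taken on the log scale, a 10% over-estimate contributes the same amount, log(1.1), for a cheap house as for an expensive one.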

1 Load data and libraries

library(tidyverse)   # data manipulation
library(skimr)       # summary statistics 
library(kableExtra)  # customize table output
library(formattable) # customize table output

Load the data and take a first look at it.

# Import data from 'train.csv' and 'test.csv'
train <- read_csv("train.csv")
test <- read_csv("test.csv")

# Merge the 2 datasets and clean the column names
full <- bind_rows(train = train, test = test, .id = "df_id") %>% 
  janitor::clean_names()

# delete original data
rm(train, test)

# Structure of the full dataset.
glimpse(full)

## Observations: 2,919
## Variables: 82
## $ df_id           <chr> "train", "train", "train", "train", "train", "tr…
## $ id              <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ ms_sub_class    <dbl> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,…
## $ ms_zoning       <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", …
## $ lot_frontage    <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, …
## $ lot_area        <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10…
## $ street          <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ alley           <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lot_shape       <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg",…
## $ land_contour    <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",…
## $ utilities       <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub"…
## $ lot_config      <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Ins…
## $ land_slope      <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl",…
## $ neighborhood    <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoR…
## $ condition1      <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm",…
## $ condition2      <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ bldg_type       <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
## $ house_style     <chr> "2Story", "1Story", "2Story", "2Story", "2Story"…
## $ overall_qual    <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, …
## $ overall_cond    <dbl> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, …
## $ year_built      <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, …
## $ year_remod_add  <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, …
## $ roof_style      <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Ga…
## $ roof_matl       <chr> "CompShg", "CompShg", "CompShg", "CompShg", "Com…
## $ exterior1st     <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "Vin…
## $ exterior2nd     <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "Vin…
## $ mas_vnr_type    <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace",…
## $ mas_vnr_area    <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, …
## $ exter_qual      <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", …
## $ exter_cond      <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ foundation      <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "…
## $ bsmt_qual       <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", …
## $ bsmt_cond       <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", …
## $ bsmt_exposure   <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", …
## $ bsmt_fin_type1  <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ",…
## $ bsmt_fin_sf1    <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,…
## $ bsmt_fin_type2  <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf",…
## $ bsmt_fin_sf2    <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_unf_sf     <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,…
## $ total_bsmt_sf   <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,…
## $ heating         <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ heating_qc      <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", …
## $ central_air     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ electrical      <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SB…
## $ x1st_flr_sf     <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022…
## $ x2nd_flr_sf     <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, …
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area     <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, …
## $ bsmt_full_bath  <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, …
## $ bsmt_half_bath  <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath       <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, …
## $ half_bath       <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ bedroom_abv_gr  <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, …
## $ kitchen_abv_gr  <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual    <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", …
## $ tot_rms_abv_grd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,…
## $ functional      <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ",…
## $ fireplaces      <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, …
## $ fireplace_qu    <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA"…
## $ garage_type     <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd"…
## $ garage_yr_blt   <dbl> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, …
## $ garage_finish   <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn",…
## $ garage_cars     <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, …
## $ garage_area     <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205…
## $ garage_qual     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ garage_cond     <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ paved_drive     <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ wood_deck_sf    <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, …
## $ open_porch_sf   <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, …
## $ enclosed_porch  <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, …
## $ x3ssn_porch     <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ screen_porch    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0…
## $ pool_area       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc         <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ fence           <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA,…
## $ misc_feature    <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, …
## $ misc_val        <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,…
## $ mo_sold         <dbl> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, …
## $ yr_sold         <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, …
## $ sale_type       <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", …
## $ sale_condition  <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal…
## $ sale_price      <dbl> 208500, 181500, 223500, 140000, 250000, 143000, …

2 Cleaning `NA` values

The dataset contains NA values that, according to the description of the data, are not missing values but “Not Applicable” (or None) values.
However, we can’t replace every NA with a new value such as None, because some of them are genuinely missing.
So for each variable, we have to decide whether an NA can be recoded as None or must stay a real NA.
We do this by looking at related variables and checking for inconsistencies between them.

Note : the next steps are not imputation; they only inspect the data to detect NAs that are not really missing. Imputation will be discussed in another notebook.
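
The idea of checking an NA against a related variable can be sketched with a cross-tabulation. The example below uses a toy data frame with illustrative values (not taken from the Ames files); the column names follow the cleaned names used in this notebook.

```r
# Toy data (illustrative values only)
toy <- data.frame(
  pool_area = c(0, 0, 512, 368, 0),
  pool_qc   = c(NA, NA, "Gd", NA, NA),
  stringsAsFactors = FALSE
)

# Cross-tabulate "quality is NA" against "area is zero"
na_check <- table(qc_is_na = is.na(toy$pool_qc),
                  area_is_zero = toy$pool_area == 0)
na_check

# Rows with an NA quality but a non-zero area are candidates
# for genuine missing values
suspect <- which(is.na(toy$pool_qc) & toy$pool_area != 0)
suspect
```

In this toy example, the NAs that go with a zero area can be read as "no pool", while the single NA with a non-zero area is inconsistent and should stay NA.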

2.1 `basement` related variables

basement_full <- full %>% 
  select(df_id, id, contains("bsmt"))

Let’s look at our “missing values” in this subset.
First, check the numeric variables.

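# define numeric and character column names
# (each vector is prefixed so that it starts with the two identifier columns,
#  'df_id' and 'id', which can then be dropped with [-(1:2)])
num_names <- c("df_id", basement_full %>% select_if(is.numeric) %>% names())
char_names <- c("id", basement_full %>% select_if(is.character) %>% names())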

# filter observations in which there is at least 1 missing value in a numeric column
basement_full %>% 
  filter_at(.vars = num_names, any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px")

df_id	id	bsmt_qual	bsmt_cond	bsmt_exposure	bsmt_fin_type1	bsmt_fin_sf1	bsmt_fin_type2	bsmt_fin_sf2	bsmt_unf_sf	total_bsmt_sf	bsmt_full_bath	bsmt_half_bath
test	2121	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
test	2189	NA	NA	NA	NA	0	NA	0	0	0	NA	NA

These NAs look like data-entry gaps rather than unknown values : the variables are numeric, not character, and both observations apparently have no basement, so they should hold 0 instead of NA.
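
As a side note, `filter_at()` and `any_vars()` still work but have been superseded in dplyr 1.0 and later. A sketch of the same row filter written with `if_any()` (available from dplyr 1.0.4) is shown below on a three-row extract; the object names `toy`, `num_cols` and `toy_na` are illustrative.

```r
library(dplyr)

# Three-row extract of the basement subset
toy <- tibble(
  df_id          = c("train", "test", "test"),
  id             = c(1, 2121, 2189),
  bsmt_qual      = c("Gd", NA, NA),
  bsmt_fin_sf1   = c(706, NA, 0),
  bsmt_full_bath = c(1, NA, NA)
)

# Names of the numeric columns, excluding the identifier
num_cols <- toy %>% select(where(is.numeric), -id) %>% names()

# Keep rows with at least one NA in a numeric column
toy_na <- toy %>% filter(if_any(all_of(num_cols), is.na))
toy_na
```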

Now character variables.

# filter observations in which there is at least 1 missing value in a character column
basement_full %>% 
  filter_at(.vars = char_names, any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px", height = "400px")

df_id	id	bsmt_qual	bsmt_cond	bsmt_exposure	bsmt_fin_type1	bsmt_fin_sf1	bsmt_fin_type2	bsmt_fin_sf2	bsmt_unf_sf	total_bsmt_sf	bsmt_full_bath	bsmt_half_bath
train	18	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	40	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	91	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	103	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	157	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	183	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	260	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	333	Gd	TA	No	GLQ	1124	NA	479	1603	3206	1	0
train	343	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	363	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	372	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	393	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	521	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	533	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	534	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	554	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	647	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	706	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	737	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	750	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	779	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	869	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	895	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	898	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	949	Gd	TA	NA	Unf	0	Unf	0	936	936	0	0
train	985	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1001	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1012	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1036	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1046	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1049	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1050	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1091	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1180	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1217	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1219	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1233	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1322	NA	NA	NA	NA	0	NA	0	0	0	0	0
train	1413	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1488	Gd	TA	NA	Unf	0	Unf	0	1595	1595	0	0
test	1586	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1594	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1730	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1779	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1815	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1848	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1849	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1857	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1858	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1859	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1861	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	1916	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2041	Gd	NA	Mn	GLQ	1044	Rec	382	0	1426	1	0
test	2051	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2067	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2069	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2121	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA
test	2123	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2186	TA	NA	No	BLQ	1033	Unf	0	94	1127	0	1
test	2189	NA	NA	NA	NA	0	NA	0	0	0	NA	NA
test	2190	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2191	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2194	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2217	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2218	NA	Fa	No	Unf	0	Unf	0	173	173	0	0
test	2219	NA	TA	No	Unf	0	Unf	0	356	356	0	0
test	2225	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2349	Gd	TA	NA	Unf	0	Unf	0	725	725	0	0
test	2388	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2436	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2453	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2454	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2491	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2499	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2525	TA	NA	Av	ALQ	755	Unf	0	240	995	0	0
test	2548	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2553	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2565	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2579	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2600	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2703	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2764	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2767	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2804	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2805	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2825	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2892	NA	NA	NA	NA	0	NA	0	0	0	0	0
test	2905	NA	NA	NA	NA	0	NA	0	0	0	0	0

Most of these observations have a NA value in every character column (and 0 in the numeric ones), which should correspond to No Basement. So, they will be recoded as None.

Let’s proceed to our changes for these variables.

# Replace 'NA' in numeric variables with '0'
basement_full <- basement_full %>% 
  mutate_at(.vars = num_names[-(1:2)], .funs = replace_na, 0)

# Create a new dataframe, with only the identifier and character variables :
# replace 'NA' in character variables with 'None', but only if all of them are 'NA' on the same observation
# Other 'NA' will remain 'NA' for now
basement_char <- basement_full %>% 
  select(all_of(char_names)) %>% 
  filter_at(.vars = char_names[-(1:2)], all_vars(is.na(.))) %>%   # keep rows in which all character columns are 'NA'
  mutate_at(.vars = char_names[-(1:2)], .funs = replace_na, "None")   # in these rows only, replace 'NA' with 'None' (the numeric 'id' is left untouched)

# Replace the corresponding rows in 'basement_full'
# (indexing rows by 'id' works here because 'id' equals the row number)
basement_full[basement_char$id, char_names] <- basement_char

# Let's check our results.
basement_full %>% 
  filter_all(any_vars(is.na(.))) %>% 
  mutate_all(function(x) {
    cell_spec(x, 
              background = ifelse(is.na(x), "orange", "default"),
              color = ifelse(is.na(x), "white", "default"))
  }) %>% 
  kable(format = "html", escape = F) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px", height = "400px")

df_id	id	bsmt_qual	bsmt_cond	bsmt_exposure	bsmt_fin_type1	bsmt_fin_sf1	bsmt_fin_type2	bsmt_fin_sf2	bsmt_unf_sf	total_bsmt_sf	bsmt_full_bath	bsmt_half_bath
train	333	Gd	TA	No	GLQ	1124	NA	479	1603	3206	1	0
train	949	Gd	TA	NA	Unf	0	Unf	0	936	936	0	0
test	1488	Gd	TA	NA	Unf	0	Unf	0	1595	1595	0	0
test	2041	Gd	NA	Mn	GLQ	1044	Rec	382	0	1426	1	0
test	2186	TA	NA	No	BLQ	1033	Unf	0	94	1127	0	1
test	2218	NA	Fa	No	Unf	0	Unf	0	173	173	0	0
test	2219	NA	TA	No	Unf	0	Unf	0	356	356	0	0
test	2349	Gd	TA	NA	Unf	0	Unf	0	725	725	0	0
test	2525	TA	NA	Av	ALQ	755	Unf	0	240	995	0	0

In these 9 remaining observations, the NAs appear to be genuine missing values.
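
The row replacement above uses `id` as a row index, which relies on `id` being equal to the row position. A sketch of a variant based on a logical mask, which does not need that assumption, is given below on a toy data frame (illustrative values; `char_cols` and `all_na` are hypothetical names).

```r
# Toy basement subset; note that 'id' does NOT match the row position here
toy <- data.frame(
  id        = c(101, 102, 103),
  bsmt_qual = c(NA, "Gd", NA),
  bsmt_cond = c(NA, "TA", "TA"),
  stringsAsFactors = FALSE
)
char_cols <- c("bsmt_qual", "bsmt_cond")

# TRUE for rows in which every character column is NA (probably no basement)
all_na <- rowSums(is.na(toy[char_cols])) == length(char_cols)

# Recode only those rows; partially missing rows keep their NA
toy[all_na, char_cols] <- "None"
toy
```

Only the first row is recoded; the third row keeps its single NA because it is not missing in every character column.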

Let’s copy our new variables back into the original dataset.

full[, colnames(basement_full)] <- basement_full

# Clean
rm(basement_full, basement_char, char_names, num_names)

2.2 `pool_qc`

We can look at the variable pool_area to detect incongruities.

pool_full <- full %>% 
  select(df_id, id, pool_area, pool_qc)

Which are the observations in which we have a NA value for pool_qc and a value different from 0 for pool_area ?

pool_full %>% 
  filter(is.na(pool_qc), pool_area != 0) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	pool_area	pool_qc
test	2421	368	NA
test	2504	444	NA
test	2600	561	NA

Three observations seem to be legit missing values ; all other observations with NA for the variable pool_qc also have a 0 value for pool_area, meaning that there is probably no pool.

# Change the missing values into 'None' for character variables
pool_full$pool_qc[is.na(pool_full$pool_qc) & pool_full$pool_area == 0] <- "None"

# Double check our results
pool_full %>% 
  filter_all(any_vars(is.na(.))) %>% 
  mutate(pool_qc = cell_spec(pool_qc, "html", 
                             color = ifelse(is.na(pool_qc), "white", "default"),
                             background = ifelse(is.na(pool_qc), "orange", "default"))) %>% 
  kable(escape = FALSE) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	pool_area	pool_qc
test	2421	368	NA
test	2504	444	NA
test	2600	561	NA
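
The same rule (an NA in a categorical column is recoded as None only when the related numeric column is 0) applies to several variable pairs in this notebook. A minimal sketch of a reusable helper follows; `recode_absent` and its arguments are hypothetical names, and the toy values are illustrative.

```r
# Hypothetical helper: recode NA in a categorical column as "None"
# when the related numeric column is 0 (feature absent).
# All other NA values are left untouched.
recode_absent <- function(df, char_col, num_col, label = "None") {
  absent <- is.na(df[[char_col]]) & !is.na(df[[num_col]]) & df[[num_col]] == 0
  df[[char_col]][absent] <- label
  df
}

toy <- data.frame(
  pool_area = c(0, 512, 368),
  pool_qc   = c(NA, "Ex", NA),
  stringsAsFactors = FALSE
)

toy <- recode_absent(toy, char_col = "pool_qc", num_col = "pool_area")
toy$pool_qc
```

The explicit `!is.na()` test on the numeric column keeps rows with an unknown area out of the recoding, so they remain NA for later inspection.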

Let’s copy our new variables back into the original dataset.

full[, colnames(pool_full)] <- pool_full

# Clean
rm(pool_full)

2.3 `misc_feature`

We can look at the variable misc_val to detect incongruities.

misc_full <- full %>% 
  select(df_id, id, contains("misc"))

Which are the observations in which we have a NA value for misc_feature and a value different from 0 for misc_val ?

misc_full %>% 
  filter(is.na(misc_feature), misc_val != 0) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	misc_feature	misc_val
test	2550	NA	17000

Only one observation (n°2550) is an actual missing value ; all others have a misc_val of 0, meaning the probable absence of miscellaneous feature.

# Change the missing values into 'None' for character variables.
misc_full$misc_feature[is.na(misc_full$misc_feature) & misc_full$misc_val == 0] <- "None"

# Double check our results.
misc_full %>% 
  filter_all(any_vars(is.na(.))) %>% 
  mutate(misc_feature = cell_spec(misc_feature, "html", 
                                  color = ifelse(is.na(misc_feature), "white", "default"),
                                  background = ifelse(is.na(misc_feature), "orange", "default"))) %>% 
  kable(escape = FALSE) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	misc_feature	misc_val
test	2550	NA	17000

# Copy our new values into the original dataset.
full[, colnames(misc_full)] <- misc_full

# Clean : remove 'misc_full'.
rm(misc_full)

2.4 `garage` related variables

garage_full <- full %>% 
  select(1:2, contains("garage"))

# missing value for all garage variables
garage_full %>% 
  select(-id, -df_id) %>% 
  sapply(function(x) sum(is.na(x)))

##   garage_type garage_yr_blt garage_finish   garage_cars   garage_area 
##           157           159           159             1             1 
##   garage_qual   garage_cond 
##           159           159

There seem to be a lot of missing values, and they could be related.
Let’s check with garage_type first.

garage_full %>%
  filter(is.na(garage_type)) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px", height = "400px")

df_id	id	garage_type	garage_yr_blt	garage_finish	garage_qual	garage_cond
train	40	NA	NA	NA	NA	NA
train	49	NA	NA	NA	NA	NA
train	79	NA	NA	NA	NA	NA
train	89	NA	NA	NA	NA	NA
train	90	NA	NA	NA	NA	NA
train	100	NA	NA	NA	NA	NA
train	109	NA	NA	NA	NA	NA
train	126	NA	NA	NA	NA	NA
train	128	NA	NA	NA	NA	NA
train	141	NA	NA	NA	NA	NA
train	149	NA	NA	NA	NA	NA
train	156	NA	NA	NA	NA	NA
train	164	NA	NA	NA	NA	NA
train	166	NA	NA	NA	NA	NA
train	199	NA	NA	NA	NA	NA
train	211	NA	NA	NA	NA	NA
train	242	NA	NA	NA	NA	NA
train	251	NA	NA	NA	NA	NA
train	288	NA	NA	NA	NA	NA
train	292	NA	NA	NA	NA	NA
train	308	NA	NA	NA	NA	NA
train	376	NA	NA	NA	NA	NA
train	387	NA	NA	NA	NA	NA
train	394	NA	NA	NA	NA	NA
train	432	NA	NA	NA	NA	NA
train	435	NA	NA	NA	NA	NA
train	442	NA	NA	NA	NA	NA
train	465	NA	NA	NA	NA	NA
train	496	NA	NA	NA	NA	NA
train	521	NA	NA	NA	NA	NA
train	529	NA	NA	NA	NA	NA
train	534	NA	NA	NA	NA	NA
train	536	NA	NA	NA	NA	NA
train	563	NA	NA	NA	NA	NA
train	583	NA	NA	NA	NA	NA
train	614	NA	NA	NA	NA	NA
train	615	NA	NA	NA	NA	NA
train	621	NA	NA	NA	NA	NA
train	636	NA	NA	NA	NA	NA
train	637	NA	NA	NA	NA	NA
train	639	NA	NA	NA	NA	NA
train	650	NA	NA	NA	NA	NA
train	706	NA	NA	NA	NA	NA
train	711	NA	NA	NA	NA	NA
train	739	NA	NA	NA	NA	NA
train	751	NA	NA	NA	NA	NA
train	785	NA	NA	NA	NA	NA
train	827	NA	NA	NA	NA	NA
train	844	NA	NA	NA	NA	NA
train	922	NA	NA	NA	NA	NA
train	943	NA	NA	NA	NA	NA
train	955	NA	NA	NA	NA	NA
train	961	NA	NA	NA	NA	NA
train	969	NA	NA	NA	NA	NA
train	971	NA	NA	NA	NA	NA
train	977	NA	NA	NA	NA	NA
train	1010	NA	NA	NA	NA	NA
train	1012	NA	NA	NA	NA	NA
train	1031	NA	NA	NA	NA	NA
train	1039	NA	NA	NA	NA	NA
train	1097	NA	NA	NA	NA	NA
train	1124	NA	NA	NA	NA	NA
train	1132	NA	NA	NA	NA	NA
train	1138	NA	NA	NA	NA	NA
train	1144	NA	NA	NA	NA	NA
train	1174	NA	NA	NA	NA	NA
train	1180	NA	NA	NA	NA	NA
train	1219	NA	NA	NA	NA	NA
train	1220	NA	NA	NA	NA	NA
train	1235	NA	NA	NA	NA	NA
train	1258	NA	NA	NA	NA	NA
train	1284	NA	NA	NA	NA	NA
train	1324	NA	NA	NA	NA	NA
train	1326	NA	NA	NA	NA	NA
train	1327	NA	NA	NA	NA	NA
train	1338	NA	NA	NA	NA	NA
train	1350	NA	NA	NA	NA	NA
train	1408	NA	NA	NA	NA	NA
train	1450	NA	NA	NA	NA	NA
train	1451	NA	NA	NA	NA	NA
train	1454	NA	NA	NA	NA	NA
test	1514	NA	NA	NA	NA	NA
test	1532	NA	NA	NA	NA	NA
test	1540	NA	NA	NA	NA	NA
test	1553	NA	NA	NA	NA	NA
test	1557	NA	NA	NA	NA	NA
test	1559	NA	NA	NA	NA	NA
test	1561	NA	NA	NA	NA	NA
test	1591	NA	NA	NA	NA	NA
test	1594	NA	NA	NA	NA	NA
test	1595	NA	NA	NA	NA	NA
test	1615	NA	NA	NA	NA	NA
test	1616	NA	NA	NA	NA	NA
test	1718	NA	NA	NA	NA	NA
test	1722	NA	NA	NA	NA	NA
test	1788	NA	NA	NA	NA	NA
test	1809	NA	NA	NA	NA	NA
test	1811	NA	NA	NA	NA	NA
test	1812	NA	NA	NA	NA	NA
test	1820	NA	NA	NA	NA	NA
test	1823	NA	NA	NA	NA	NA
test	1832	NA	NA	NA	NA	NA
test	1835	NA	NA	NA	NA	NA
test	1837	NA	NA	NA	NA	NA
test	1840	NA	NA	NA	NA	NA
test	1848	NA	NA	NA	NA	NA
test	1894	NA	NA	NA	NA	NA
test	2011	NA	NA	NA	NA	NA
test	2082	NA	NA	NA	NA	NA
test	2091	NA	NA	NA	NA	NA
test	2094	NA	NA	NA	NA	NA
test	2097	NA	NA	NA	NA	NA
test	2100	NA	NA	NA	NA	NA
test	2105	NA	NA	NA	NA	NA
test	2136	NA	NA	NA	NA	NA
test	2152	NA	NA	NA	NA	NA
test	2154	NA	NA	NA	NA	NA
test	2190	NA	NA	NA	NA	NA
test	2191	NA	NA	NA	NA	NA
test	2192	NA	NA	NA	NA	NA
test	2193	NA	NA	NA	NA	NA
test	2194	NA	NA	NA	NA	NA
test	2213	NA	NA	NA	NA	NA
test	2239	NA	NA	NA	NA	NA
test	2247	NA	NA	NA	NA	NA
test	2354	NA	NA	NA	NA	NA
test	2355	NA	NA	NA	NA	NA
test	2399	NA	NA	NA	NA	NA
test	2400	NA	NA	NA	NA	NA
test	2423	NA	NA	NA	NA	NA
test	2427	NA	NA	NA	NA	NA
test	2553	NA	NA	NA	NA	NA
test	2554	NA	NA	NA	NA	NA
test	2558	NA	NA	NA	NA	NA
test	2576	NA	NA	NA	NA	NA
test	2580	NA	NA	NA	NA	NA
test	2604	NA	NA	NA	NA	NA
test	2610	NA	NA	NA	NA	NA
test	2692	NA	NA	NA	NA	NA
test	2694	NA	NA	NA	NA	NA
test	2709	NA	NA	NA	NA	NA
test	2768	NA	NA	NA	NA	NA
test	2772	NA	NA	NA	NA	NA
test	2790	NA	NA	NA	NA	NA
test	2792	NA	NA	NA	NA	NA
test	2800	NA	NA	NA	NA	NA
test	2860	NA	NA	NA	NA	NA
test	2863	NA	NA	NA	NA	NA
test	2871	NA	NA	NA	NA	NA
test	2889	NA	NA	NA	NA	NA
test	2892	NA	NA	NA	NA	NA
test	2893	NA	NA	NA	NA	NA
test	2894	NA	NA	NA	NA	NA
test	2910	NA	NA	NA	NA	NA
test	2914	NA	NA	NA	NA	NA
test	2915	NA	NA	NA	NA	NA
test	2918	NA	NA	NA	NA	NA

For each observation with a NA value for garage_type, all other garage variables are also NA or 0, which indeed indicates that there is no garage.
The changes will be the same as previously : None for character variables and 0 for numeric variables.
The one exception is garage_yr_blt, where the best course of action is less clear. If the house has no garage, how can we say when it was built?
For now, we can work around this by converting garage_yr_blt to a character variable and using the value None for the corresponding NAs.

# transform the 'garage_yr_blt' as a character variable
garage_full$garage_yr_blt <- as.character(garage_full$garage_yr_blt)
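
An alternative, which is not the approach used in this notebook, would be to keep the year numeric and record the absence of a garage in a separate indicator. A minimal sketch on toy data, where `has_garage` is a hypothetical column name:

```r
# Toy garage subset (illustrative values only)
toy <- data.frame(
  garage_type   = c("Attchd", NA, "Detchd"),
  garage_yr_blt = c(2003, NA, NA),
  stringsAsFactors = FALSE
)

# Indicator derived from garage_type: an NA type is read as "no garage"
toy$has_garage <- !is.na(toy$garage_type)

# garage_yr_blt stays numeric; an NA year now means either "no garage" (row 2)
# or "garage with unknown year" (row 3), and has_garage tells them apart
toy
```

The trade-off is that the year column still contains NAs, which many models cannot use directly, whereas the character recoding gives up the numeric nature of the year.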

Let’s proceed to our changes for these variables.

# define numeric and character column names
num_names <- garage_full %>% select_if(is.numeric) %>% names()
char_names <- garage_full %>% select_if(is.character) %>% names()

# filter observations which have 'NA' in 'garage_type', and replace
# 'NA's with 0 for numerical and 'None' for character variables
garage_na_type <- garage_full %>% 
  filter(is.na(garage_type)) %>% 
  mutate_at(.vars = num_names, replace_na, 0) %>% 
  mutate_at(.vars = char_names, replace_na, "None")

# Replace the corresponding rows in 'garage_full'
garage_full[garage_na_type$id, ] <- garage_na_type

# Let's check our results.
garage_full %>% 
  filter_all(any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left") %>% 
  scroll_box(width = "800px")

df_id	id	garage_type	garage_yr_blt	garage_finish	garage_cars	garage_area	garage_qual	garage_cond
test	2127	Detchd	NA	NA	1	360	NA	NA
test	2577	Detchd	NA	NA	NA	NA	NA	NA

For observation n°2127, these seem to be legit missing values.

For observation n°2577, we have a strange situation, since garage_type is the only variable that is filled in.
We might assume that this is a mistake ; we will therefore change garage_type into None for this observation, so that all its garage variables are None or 0.

garage_full[garage_full$id == 2577, char_names[-1]] <- "None"
garage_full[garage_full$id == 2577, num_names[-1]] <- 0

# Copy our new values into the original dataset.
full[, colnames(garage_full)] <- garage_full

# Clean
rm(garage_full, garage_na_type, char_names, num_names)

2.5 `fireplace_qu`

We can look at the variable fireplaces to detect incongruities.

fire_full <- full %>% 
  select(df_id, id, contains("fire"))

Which are the observations in which we have a NA value for fireplace_qu and a value different from 0 for fireplaces ?

fire_full %>% 
  filter(is.na(fireplace_qu), fireplaces != 0) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	fireplaces	fireplace_qu

The result is empty : every NA in fireplace_qu corresponds to a house with 0 fireplaces, so it is actually a None value.

# Change the missing values into 'None' for character variables
fire_full$fireplace_qu[is.na(fire_full$fireplace_qu)] <- "None"

# Double check our results.
fire_full %>% 
  filter_all(any_vars(is.na(.))) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	fireplaces	fireplace_qu

# Copy our new values into the original dataset.
full[, colnames(fire_full)] <- fire_full

# Clean : remove 'fire_full'.
rm(fire_full)

2.6 `mas_vnr_type`

We can look at the variable mas_vnr_area to detect incongruities.

mas_full <- full %>% 
  select(df_id, id, contains("mas"))

Which are the observations in which we have a NA value for mas_vnr_type and a value different from 0 for mas_vnr_area ?

mas_full %>% 
  filter(is.na(mas_vnr_type), mas_vnr_area != 0) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	mas_vnr_type	mas_vnr_area
test	2611	NA	198

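Only one observation (n°2611) is an actual missing value for mas_vnr_type. For the other NAs, mas_vnr_area is missing as well, so we assume that these houses have no masonry veneer.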

# Change the missing values into 'None' for character variables and '0' for numeric variables.
mas_full$mas_vnr_type[is.na(mas_full$mas_vnr_type) & is.na(mas_full$mas_vnr_area)] <- "None"
mas_full$mas_vnr_area[is.na(mas_full$mas_vnr_area)] <- 0

# Double check our results.
mas_full %>% 
  filter_all(any_vars(is.na(.))) %>% 
  mutate(mas_vnr_type = cell_spec(mas_vnr_type, "html",
                                  color = ifelse(is.na(mas_vnr_type), "white", "default"),
                                  background = ifelse(is.na(mas_vnr_type), "orange", "default"))) %>% 
  kable(escape = FALSE) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	mas_vnr_type	mas_vnr_area
test	2611	NA	198

The remaining missing value seems to be legit.
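
The first filter of this section returns only rows with a known, non-zero area: a comparison with NA yields NA, and `filter()`, like base `subset()`, keeps only the rows for which the condition is TRUE. A small base R illustration on toy values:

```r
# Toy masonry subset (illustrative values only)
toy <- data.frame(
  mas_vnr_type = c(NA, NA, "BrkFace"),
  mas_vnr_area = c(NA, 198, 196),
  stringsAsFactors = FALSE
)

# Comparing NA with a number gives NA, not TRUE or FALSE
NA != 0

# subset() drops rows whose condition evaluates to NA, as dplyr::filter() does
flagged <- subset(toy, is.na(mas_vnr_type) & mas_vnr_area != 0)
flagged
```

The first toy row, where both the type and the area are NA, is therefore not flagged; only the row with a missing type and a known non-zero area is returned.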

# Copy our new values into the original dataset.
full[, colnames(mas_full)] <- mas_full

# Clean : remove 'mas_full'.
rm(mas_full)

2.7 `alley`, `fence`

These 2 variables don’t seem to be related to any other variable in the dataset (as per description file).
We will therefore change all NA into None without looking at other variables.
Note that we thus assume that there is no real missing value for these variables!

full$alley[is.na(full$alley)] <- "None"
full$fence[is.na(full$fence)] <- "None"

2.8 other variables

Only one variable remains with a lot of NA values : lot_frontage.
This variable is numeric, and its true value is probably not equal to 0.
So we might have to impute these missing values.

full %>% 
  sapply(function(x) sum(is.na(x))) %>% 
  enframe(name = "variable", value = "missing_nb") %>% 
  arrange(-missing_nb) %>% 
  filter(missing_nb != 0, variable != "sale_price") %>% 
  mutate(missing_nb = color_bar("lightgreen")(missing_nb)) %>% 
  kable(escape = FALSE) %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

variable	missing_nb
lot_frontage	486
ms_zoning	4
bsmt_cond	3
bsmt_exposure	3
pool_qc	3
utilities	2
bsmt_qual	2
functional	2
exterior1st	1
exterior2nd	1
mas_vnr_type	1
bsmt_fin_type2	1
electrical	1
kitchen_qual	1
garage_yr_blt	1
garage_finish	1
garage_qual	1
garage_cond	1
misc_feature	1
sale_type	1

3 Other useful transformations

Now that we have cleaned the data, let’s have a look at the remaining variables to spot incongruities.

3.1 `ms_sub_class`, `overall_qual`, `overall_cond`

These variables are numeric variables to be treated as character/categorical variables (as per description file).

full$ms_sub_class <- as.character(full$ms_sub_class)
full$overall_qual <- as.character(full$overall_qual)
full$overall_cond <- as.character(full$overall_cond)

3.2 factor variables

All character variables can now be transformed into factors.

full <- full %>% 
  mutate_if(is.character, as.factor)

3.3 `garage_yr_blt`

There is an observation with a year equal to 2207, clearly a typo.

table(full$garage_yr_blt, useNA = "always")

## 
## 1895 1896 1900 1906 1908 1910 1914 1915 1916 1917 1918 1919 1920 1921 1922 
##    1    1    6    1    1   10    2    7    6    2    3    1   33    5    8 
## 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937 
##    6    8   15   15    5    7    2   27    4    4    1    4    8    7    6 
## 1938 1939 1940 1941 1942 1943 1945 1946 1947 1948 1949 1950 1951 1952 1953 
##   11   21   25   14    6    1   10    9    5   19   14   51   17   16   23 
## 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968 
##   37   24   41   34   42   36   37   31   35   34   35   34   39   36   48 
## 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 
##   32   32   24   27   29   35   28   50   66   41   35   32   15    9   11 
## 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 
##   19   18   12   18   20   19   26   17   27   49   39   35   40   44   58 
## 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2207 None <NA> 
##   54   55   41   53   92   99  142  115  115   61   29    5    1  158    1

full %>% 
  filter(garage_yr_blt == "2207") %>% 
  select(df_id, id, year_built, year_remod_add, garage_yr_blt) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	year_built	year_remod_add	garage_yr_blt
test	2593	2006	2007	2207

It seems this value should be replaced with the year 2007.

full$garage_yr_blt[full$garage_yr_blt == "2207"] <- "2007"
full$garage_yr_blt <- fct_drop(full$garage_yr_blt)
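
The second line is needed because replacing a value in a factor does not remove the level that has become unused. A short base R sketch on a toy factor (illustrative values), using `droplevels()` as the base R counterpart of `forcats::fct_drop()`:

```r
# Toy factor of garage years with one obvious typo
yr <- factor(c("2006", "2007", "2207", "None"))

# "2007" is already a level, so the assignment is valid for a factor
yr[which(yr == "2207")] <- "2007"
levels(yr)   # "2207" is still listed although no observation uses it

# Drop the unused level
yr <- droplevels(yr)
levels(yr)
```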

3.4 `year_built`, `year_remod_add`

One observation has a remodeling year earlier than its construction year, which is clearly a typo.

full %>% 
  select(df_id, id, year_built, year_remod_add, garage_yr_blt) %>% 
  filter(year_built > year_remod_add) %>% 
  kable() %>% 
  kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

df_id	id	year_built	year_remod_add	garage_yr_blt
test	1877	2002	2001	2002

# change the 'year_remod_add' value
full$year_remod_add[full$id == 1877] <- 2002

3.5 Ordinal variables

Some features have specific labels that can be considered as ordinal. Indeed, an “Excellent” quality is generally better than a “Poor” or “Fair” quality.
These variables can thus be recoded as ordinal variables, for which label ordering matters.

# transform following features as ordinal variables
full$overall_qual <- ordered(x = full$overall_qual, levels = as.character(1:10))
full$overall_cond <- ordered(x = full$overall_cond, levels = as.character(1:10))

full$exter_qual <- ordered(x = full$exter_qual, levels = c("Po", "Fa", "TA", "Gd", "Ex"))
full$exter_cond <- ordered(x = full$exter_cond, levels = c("Po", "Fa", "TA", "Gd", "Ex"))

full$bsmt_qual <- ordered(x = full$bsmt_qual, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$bsmt_cond <- ordered(x = full$bsmt_cond, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$bsmt_fin_type1 <- ordered(x = full$bsmt_fin_type1, levels = c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"))
full$bsmt_fin_type2 <- ordered(x = full$bsmt_fin_type2, levels = c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"))

full$heating_qc <- ordered(x = full$heating_qc, levels = c("Po", "Fa", "TA", "Gd", "Ex"))

full$kitchen_qual <- ordered(x = full$kitchen_qual, levels = c("Po", "Fa", "TA", "Gd", "Ex"))

full$functional <- ordered(x = full$functional, levels = c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"))

full$fireplace_qu <- ordered(x = full$fireplace_qu, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))

full$garage_qual <- ordered(x = full$garage_qual, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$garage_cond <- ordered(x = full$garage_cond, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))

full$pool_qc <- ordered(x = full$pool_qc, levels = c("None", "Fa", "TA", "Gd", "Ex"))
full$fence <- ordered(x = full$fence, levels = c("None", "MnWw", "GdWo", "MnPrv", "GdPrv"))
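As a quick illustration of what the recoding above buys us, here is a minimal sketch on a toy vector (the values are made up, and “Excellent” is deliberately not a valid label). It shows that ordered factors support comparisons, that their integer codes follow the declared order, and that any label missing from levels silently becomes NA, which is worth checking after each call to ordered().

```r
# Toy vector (hypothetical values); "Excellent" is not one of the declared levels
quality <- c("TA", "Gd", "Ex", "Po", "Excellent")

quality_ord <- ordered(x = quality, levels = c("Po", "Fa", "TA", "Gd", "Ex"))

# Ordered factors can be compared against a label
at_least_good <- quality_ord >= "Gd"

# Integer codes follow the declared order (Po = 1, ..., Ex = 5)
codes <- as.integer(quality_ord)

# Labels absent from 'levels' are turned into NA without any warning,
# so compare the number of NA values before and after the recoding
na_before <- sum(is.na(quality))
na_after  <- sum(is.na(quality_ord))
```

If na_after is larger than na_before, at least one label was misspelled or left out of levels.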

Special case: electrical. The distribution table shows that only 1 observation has the Mix label.

# distribution table for 'electrical' feature
table(full$electrical)

## 
## FuseA FuseF FuseP   Mix SBrkr 
##   188    50     8     1  2671

The description file is not very clear about the “rank” of the Mix label.
Since this observation belongs to the train set, we will delete it (and its label) so that the ordinal variable matches the levels listed in the description file.

# which dataset contains the 'Mix' value
full %>% 
  filter(electrical == "Mix") %>% 
  select(df_id, id, electrical)

## # A tibble: 1 x 3
##   df_id    id electrical
##   <fct> <dbl> <fct>     
## 1 train   399 Mix

# remove observation 399 (the only 'Mix' value)
# the is.na() test keeps the rows where 'electrical' is NA; see https://github.com/tidyverse/dplyr/issues/3196
full <- full %>% 
  filter(is.na(electrical) | electrical != "Mix")

# relevel the 'electrical' factor
full$electrical <- ordered(x = full$electrical, levels = c("FuseP", "FuseF", "FuseA", "SBrkr"))
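The is.na() test in the filter above matters because filter() drops every row for which the condition evaluates to NA. A minimal sketch on a toy tibble (made-up values) shows the difference between the naive filter and the NA-aware one:

```r
library(dplyr)

# Toy data (hypothetical values) with one 'Mix' label and one missing value
toy <- tibble(
  id         = 1:4,
  electrical = factor(c("SBrkr", "Mix", NA, "FuseA"))
)

# Naive filter: the NA row is dropped together with 'Mix'
naive <- toy %>% 
  filter(electrical != "Mix")

# NA-aware filter: only the 'Mix' row is dropped
na_aware <- toy %>% 
  filter(is.na(electrical) | electrical != "Mix")

nrow(naive)     # 2
nrow(na_aware)  # 3
```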

3.6 Other variables

Some other variables contain inconsistencies whose origin is harder (and slower) to track down.
As an example, consider the ms_sub_class feature, some of whose labels refer to the construction year:

20 : 1-STORY 1946 & NEWER ALL STYLES
30 : 1-STORY 1945 & OLDER
60 : 2-STORY 1946 & NEWER
70 : 2-STORY 1945 & OLDER
120 : 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
160 : 2-STORY PUD - 1946 & NEWER

We can add a new feature indicating whether the house was built in 1946 or later, and compare it with the above labels.

full %>% 
  mutate(built_after_1946 = year_built >= 1946) %>% 
  filter(ms_sub_class %in% c("20", "30", "60", "70", "120", "160")) %>% 
  select(ms_sub_class, built_after_1946) %>% 
  table()

##             built_after_1946
## ms_sub_class FALSE TRUE
##          120     0  182
##          150     0    0
##          160     0  128
##          180     0    0
##          190     0    0
##          20      2 1077
##          30    136    2
##          40      0    0
##          45      0    0
##          50      0    0
##          60      3  572
##          70    127    1
##          75      0    0
##          80      0    0
##          85      0    0
##          90      0    0

The previous table shows inconsistencies. For example, the label 30 is supposed to represent houses built in 1945 or earlier, but 2 observations have a year_built value of 1946 or later.
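To look at the offending rows rather than only count them, a filter along the following lines could be used. This is a sketch on a toy tibble with made-up values; the vectors older_classes and newer_classes are hypothetical helpers built from the six labels listed above.

```r
library(dplyr)

# Toy rows (hypothetical values) with the same column names as 'full'
toy <- tibble(
  id           = 1:5,
  ms_sub_class = factor(c("20", "30", "30", "60", "70")),
  year_built   = c(1940, 1930, 1950, 2001, 1946)
)

older_classes <- c("30", "70")                # "1945 & OLDER"
newer_classes <- c("20", "60", "120", "160")  # "1946 & NEWER"

# Keep the rows whose label contradicts the construction year
mismatches <- toy %>% 
  filter(
    (ms_sub_class %in% older_classes & year_built >= 1946) |
      (ms_sub_class %in% newer_classes & year_built < 1946)
  )

mismatches
```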

Tracking down every inconsistency of this kind would be very time-consuming, so we will use the data as is for the rest of the analysis.

To close this Part 1, let’s save the full dataset.

# Export the 'full' dataset to an RDS file
saveRDS(object = full, file = "01-full_train_test.rds")
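One reason to prefer an RDS file over a CSV here is that it preserves column types, including the ordered factors defined earlier. The sketch below checks this on a toy data frame (made-up values) written to temporary files; the CSV export is only there for comparison and is not part of the original workflow.

```r
# Toy data frame (hypothetical values) with one ordered factor
toy <- data.frame(
  id           = 1:3,
  kitchen_qual = ordered(c("TA", "Ex", "Gd"), levels = c("Po", "Fa", "TA", "Gd", "Ex"))
)

rds_path <- tempfile(fileext = ".rds")
csv_path <- tempfile(fileext = ".csv")

saveRDS(object = toy, file = rds_path)
write.csv(toy, file = csv_path, row.names = FALSE)

from_rds <- readRDS(rds_path)
from_csv <- read.csv(csv_path)

# The RDS round trip keeps the ordered factor and its levels
is.ordered(from_rds$kitchen_qual)
# The CSV round trip loses the ordering information
is.ordered(from_csv$kitchen_qual)
```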

House Price Predictions - Part 1 : Data Cleaning
