The “House Price Prediction” competition from Kaggle challenges us to predict the final sale price of houses in Ames, Iowa.
A training dataset describes 1460 houses through 79 features (year built, location and neighborhood, type of heating, …), together with their sale price.
The goal is to predict the final sale price of the 1459 houses described in the test dataset.
Note that submissions are evaluated on the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sale price.
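To make the metric concrete, here is a minimal sketch of how it could be computed. The helper name `rmse_log` and the toy prices are assumptions made for illustration; they are not part of the competition material.

```r
# Root-mean-squared error between log(predicted) and log(observed).
# `rmse_log` is a hypothetical helper name.
rmse_log <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed),
            all(predicted > 0), all(observed > 0))
  sqrt(mean((log(predicted) - log(observed))^2))
}

# Toy prices (made up): every prediction is 10% too high
observed  <- c(100000, 200000, 400000)
predicted <- observed * 1.1
score <- rmse_log(predicted, observed)
```

Because the error is measured on the log scale, a 10% miss on a cheap house weighs as much as a 10% miss on an expensive one; here `score` is simply `log(1.1)`.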
library(tidyverse) # data manipulation
library(skimr) # summary statistics
library(kableExtra) # customize table output
library(formattable) # customize table output

Load and first inspection of the data.
# Import data from 'train.csv' and 'test.csv'
train <- read_csv("train.csv")
test <- read_csv("test.csv")

# Merge the 2 datasets and clean the column names
full <- bind_rows(train = train, test = test, .id = "df_id") %>%
janitor::clean_names()
# delete original data
rm(train, test)

## Observations: 2,919
## Variables: 82
## $ df_id <chr> "train", "train", "train", "train", "train", "tr…
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 1…
## $ ms_sub_class <dbl> 60, 20, 60, 70, 60, 50, 20, 60, 50, 190, 20, 60,…
## $ ms_zoning <chr> "RL", "RL", "RL", "RL", "RL", "RL", "RL", "RL", …
## $ lot_frontage <dbl> 65, 80, 68, 60, 84, 85, 75, NA, 51, 50, 70, 85, …
## $ lot_area <dbl> 8450, 9600, 11250, 9550, 14260, 14115, 10084, 10…
## $ street <chr> "Pave", "Pave", "Pave", "Pave", "Pave", "Pave", …
## $ alley <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ lot_shape <chr> "Reg", "Reg", "IR1", "IR1", "IR1", "IR1", "Reg",…
## $ land_contour <chr> "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl", "Lvl",…
## $ utilities <chr> "AllPub", "AllPub", "AllPub", "AllPub", "AllPub"…
## $ lot_config <chr> "Inside", "FR2", "Inside", "Corner", "FR2", "Ins…
## $ land_slope <chr> "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl", "Gtl",…
## $ neighborhood <chr> "CollgCr", "Veenker", "CollgCr", "Crawfor", "NoR…
## $ condition1 <chr> "Norm", "Feedr", "Norm", "Norm", "Norm", "Norm",…
## $ condition2 <chr> "Norm", "Norm", "Norm", "Norm", "Norm", "Norm", …
## $ bldg_type <chr> "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", "1Fam", …
## $ house_style <chr> "2Story", "1Story", "2Story", "2Story", "2Story"…
## $ overall_qual <dbl> 7, 6, 7, 7, 8, 5, 8, 7, 7, 5, 5, 9, 5, 7, 6, 7, …
## $ overall_cond <dbl> 5, 8, 5, 5, 5, 5, 5, 6, 5, 6, 5, 5, 6, 5, 5, 8, …
## $ year_built <dbl> 2003, 1976, 2001, 1915, 2000, 1993, 2004, 1973, …
## $ year_remod_add <dbl> 2003, 1976, 2002, 1970, 2000, 1995, 2005, 1973, …
## $ roof_style <chr> "Gable", "Gable", "Gable", "Gable", "Gable", "Ga…
## $ roof_matl <chr> "CompShg", "CompShg", "CompShg", "CompShg", "Com…
## $ exterior1st <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Sdng", "Vin…
## $ exterior2nd <chr> "VinylSd", "MetalSd", "VinylSd", "Wd Shng", "Vin…
## $ mas_vnr_type <chr> "BrkFace", "None", "BrkFace", "None", "BrkFace",…
## $ mas_vnr_area <dbl> 196, 0, 162, 0, 350, 0, 186, 240, 0, 0, 0, 286, …
## $ exter_qual <chr> "Gd", "TA", "Gd", "TA", "Gd", "TA", "Gd", "TA", …
## $ exter_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ foundation <chr> "PConc", "CBlock", "PConc", "BrkTil", "PConc", "…
## $ bsmt_qual <chr> "Gd", "Gd", "Gd", "TA", "Gd", "Gd", "Ex", "Gd", …
## $ bsmt_cond <chr> "TA", "TA", "TA", "Gd", "TA", "TA", "TA", "TA", …
## $ bsmt_exposure <chr> "No", "Gd", "Mn", "No", "Av", "No", "Av", "Mn", …
## $ bsmt_fin_type1 <chr> "GLQ", "ALQ", "GLQ", "ALQ", "GLQ", "GLQ", "GLQ",…
## $ bsmt_fin_sf1 <dbl> 706, 978, 486, 216, 655, 732, 1369, 859, 0, 851,…
## $ bsmt_fin_type2 <chr> "Unf", "Unf", "Unf", "Unf", "Unf", "Unf", "Unf",…
## $ bsmt_fin_sf2 <dbl> 0, 0, 0, 0, 0, 0, 0, 32, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ bsmt_unf_sf <dbl> 150, 284, 434, 540, 490, 64, 317, 216, 952, 140,…
## $ total_bsmt_sf <dbl> 856, 1262, 920, 756, 1145, 796, 1686, 1107, 952,…
## $ heating <chr> "GasA", "GasA", "GasA", "GasA", "GasA", "GasA", …
## $ heating_qc <chr> "Ex", "Ex", "Ex", "Gd", "Ex", "Ex", "Ex", "Ex", …
## $ central_air <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ electrical <chr> "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SBrkr", "SB…
## $ x1st_flr_sf <dbl> 856, 1262, 920, 961, 1145, 796, 1694, 1107, 1022…
## $ x2nd_flr_sf <dbl> 854, 0, 866, 756, 1053, 566, 0, 983, 752, 0, 0, …
## $ low_qual_fin_sf <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ gr_liv_area <dbl> 1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, …
## $ bsmt_full_bath <dbl> 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, …
## $ bsmt_half_bath <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ full_bath <dbl> 2, 2, 2, 1, 2, 1, 2, 2, 2, 1, 1, 3, 1, 2, 1, 1, …
## $ half_bath <dbl> 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, …
## $ bedroom_abv_gr <dbl> 3, 3, 3, 3, 4, 1, 3, 3, 2, 2, 3, 4, 2, 3, 2, 2, …
## $ kitchen_abv_gr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 1, 1, …
## $ kitchen_qual <chr> "Gd", "TA", "Gd", "Gd", "Gd", "TA", "Gd", "TA", …
## $ tot_rms_abv_grd <dbl> 8, 6, 6, 7, 9, 5, 7, 7, 8, 5, 5, 11, 4, 7, 5, 5,…
## $ functional <chr> "Typ", "Typ", "Typ", "Typ", "Typ", "Typ", "Typ",…
## $ fireplaces <dbl> 0, 1, 1, 1, 1, 0, 1, 2, 2, 2, 0, 2, 0, 1, 1, 0, …
## $ fireplace_qu <chr> NA, "TA", "TA", "Gd", "TA", NA, "Gd", "TA", "TA"…
## $ garage_type <chr> "Attchd", "Attchd", "Attchd", "Detchd", "Attchd"…
## $ garage_yr_blt <dbl> 2003, 1976, 2001, 1998, 2000, 1993, 2004, 1973, …
## $ garage_finish <chr> "RFn", "RFn", "RFn", "Unf", "RFn", "Unf", "RFn",…
## $ garage_cars <dbl> 2, 2, 2, 3, 3, 2, 2, 2, 2, 1, 1, 3, 1, 3, 1, 2, …
## $ garage_area <dbl> 548, 460, 608, 642, 836, 480, 636, 484, 468, 205…
## $ garage_qual <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ garage_cond <chr> "TA", "TA", "TA", "TA", "TA", "TA", "TA", "TA", …
## $ paved_drive <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y"…
## $ wood_deck_sf <dbl> 0, 298, 0, 0, 192, 40, 255, 235, 90, 0, 0, 147, …
## $ open_porch_sf <dbl> 61, 0, 42, 35, 84, 30, 57, 204, 0, 4, 0, 21, 0, …
## $ enclosed_porch <dbl> 0, 0, 0, 272, 0, 0, 0, 228, 205, 0, 0, 0, 0, 0, …
## $ x3ssn_porch <dbl> 0, 0, 0, 0, 0, 320, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ screen_porch <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 176, 0, 0, 0…
## $ pool_area <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ pool_qc <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ fence <chr> NA, NA, NA, NA, NA, "MnPrv", NA, NA, NA, NA, NA,…
## $ misc_feature <chr> NA, NA, NA, NA, NA, "Shed", NA, "Shed", NA, NA, …
## $ misc_val <dbl> 0, 0, 0, 0, 0, 700, 0, 350, 0, 0, 0, 0, 0, 0, 0,…
## $ mo_sold <dbl> 2, 5, 9, 2, 12, 10, 8, 11, 4, 1, 2, 7, 9, 8, 5, …
## $ yr_sold <dbl> 2008, 2007, 2008, 2006, 2008, 2009, 2007, 2009, …
## $ sale_type <chr> "WD", "WD", "WD", "WD", "WD", "WD", "WD", "WD", …
## $ sale_condition <chr> "Normal", "Normal", "Normal", "Abnorml", "Normal…
## $ sale_price <dbl> 208500, 181500, 223500, 140000, 250000, 143000, …
NA values

The dataset contains NA values that are not actually missing: according to the data description, they mean “Not Applicable” (or None). Such NAs should not be treated as missing values.
However, we can’t blindly replace every NA with a new value such as None, because some of them really are missing values.
So for each variable, we have to decide whether an NA can be recoded as None or must stay a real NA.
This can be done by looking at related variables and checking for inconsistencies between them.
Note: the next steps are not imputation, just an inspection of the data to detect wrongly coded NAs. Imputation will be discussed in another notebook.
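The general pattern can be sketched on toy data: take the quality variable together with a related "companion" variable, and relabel an NA only when the companion confirms that the feature is absent. The tibble `toy`, the subset `pool_toy` and all values are made up for illustration; the construction of the per-variable subsets used in this notebook is not shown, so this is only an assumption about how they might be built.

```r
library(dplyr)

# Toy stand-in for the full dataset (values are made up)
toy <- tibble(
  id        = 1:4,
  pool_area = c(0, 0, 512, 368),
  pool_qc   = c(NA, NA, "Ex", NA)
)

# Hypothetical helper subset: the quality variable plus its companion
pool_toy <- toy %>% select(id, pool_area, pool_qc)

# NA with a zero area: most likely "no pool", so relabel it as "None".
# NA with a non-zero area: a genuinely missing value, left as NA.
pool_toy <- pool_toy %>%
  mutate(pool_qc = if_else(is.na(pool_qc) & pool_area == 0, "None", pool_qc))
```

In this toy example only the last row keeps its NA, because a pool area of 368 contradicts the idea that there is no pool.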
pool_qc

We can look at the variable pool_area to detect incongruities.
Which observations have an NA value for pool_qc and a non-zero value for pool_area?
pool_full %>%
filter(is.na(pool_qc), pool_area != 0) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | pool_area | pool_qc |
|---|---|---|---|
| test | 2421 | 368 | NA |
| test | 2504 | 444 | NA |
| test | 2600 | 561 | NA |
Three observations seem to be genuine missing values; all other observations with NA for pool_qc also have a pool_area of 0, meaning that there is probably no pool.
# Change the missing values into 'None' for character variables
pool_full$pool_qc[is.na(pool_full$pool_qc) & pool_full$pool_area == 0] <- "None"

# Double check our results
pool_full %>%
filter_all(any_vars(is.na(.))) %>%
mutate(pool_qc = cell_spec(pool_qc, "html",
color = ifelse(is.na(pool_qc), "white", "default"),
background = ifelse(is.na(pool_qc), "orange", "default"))) %>%
kable(escape = FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | pool_area | pool_qc |
|---|---|---|---|
| test | 2421 | 368 | NA |
| test | 2504 | 444 | NA |
| test | 2600 | 561 | NA |
Let’s put the corrected variables back into the original dataset.
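The code for this step is not shown. One way it could look, sketched on toy data, is to drop the old column and join the corrected one back by `id`. The names `full_toy` and `pool_fixed` and all values are hypothetical.

```r
library(dplyr)

# Toy stand-ins (made up) for the full dataset and the corrected subset
full_toy <- tibble(
  id        = 1:3,
  lot_area  = c(8450, 9600, 11250),
  pool_area = c(0, 0, 368),
  pool_qc   = c(NA, NA, NA_character_)
)
pool_fixed <- tibble(
  id      = 1:3,
  pool_qc = c("None", "None", NA)
)

# Drop the old column, then join the corrected one back by `id`
full_toy <- full_toy %>%
  select(-pool_qc) %>%
  left_join(pool_fixed, by = "id")
```

Joining on `id` rather than assigning by position keeps the step safe even if the rows of the subset were reordered in between.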
misc_feature

We can look at the variable misc_val to detect incongruities.
Which observations have an NA value for misc_feature and a non-zero value for misc_val?
misc_full %>%
filter(is.na(misc_feature), misc_val != 0) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | misc_feature | misc_val |
|---|---|---|---|
| test | 2550 | NA | 17000 |
Only one observation (n°2550) is an actual missing value; all others have a misc_val of 0, meaning that there is probably no miscellaneous feature.
# Change the missing values into 'None' for character variables.
misc_full$misc_feature[is.na(misc_full$misc_feature) & misc_full$misc_val == 0] <- "None"

# Double check our results.
misc_full %>%
filter_all(any_vars(is.na(.))) %>%
mutate(misc_feature = cell_spec(misc_feature, "html",
color = ifelse(is.na(misc_feature), "white", "default"),
background = ifelse(is.na(misc_feature), "orange", "default"))) %>%
kable(escape = FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | misc_feature | misc_val |
|---|---|---|---|
| test | 2550 | NA | 17000 |
fireplace_qu

We can look at the variable fireplaces to detect incongruities.
Which observations have an NA value for fireplace_qu and a non-zero value for fireplaces?
fire_full %>%
filter(is.na(fireplace_qu), fireplaces != 0) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | fireplaces | fireplace_qu |
|---|---|---|---|
Every NA is actually a None value.
# Change the missing values into 'None' for character variables
fire_full$fireplace_qu[is.na(fire_full$fireplace_qu)] <- "None"

# Double check our results.
fire_full %>%
filter_all(any_vars(is.na(.))) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | fireplaces | fireplace_qu |
|---|---|---|---|
mas_vnr_type

We can look at the variable mas_vnr_area to detect incongruities.
Which observations have an NA value for mas_vnr_type and a non-zero value for mas_vnr_area?
mas_full %>%
filter(is.na(mas_vnr_type), mas_vnr_area != 0) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | mas_vnr_type | mas_vnr_area |
|---|---|---|---|
| test | 2611 | NA | 198 |
Only one observation (n°2611) is an actual missing value for mas_vnr_type.
# Change the missing values into 'None' for character variables and '0' for numeric variables.
mas_full$mas_vnr_type[is.na(mas_full$mas_vnr_type) & is.na(mas_full$mas_vnr_area)] <- "None"
mas_full$mas_vnr_area[is.na(mas_full$mas_vnr_area)] <- 0

# Double check our results.
mas_full %>%
filter_all(any_vars(is.na(.))) %>%
mutate(mas_vnr_type = cell_spec(mas_vnr_type, "html",
color = ifelse(is.na(mas_vnr_type), "white", "default"),
background = ifelse(is.na(mas_vnr_type), "orange", "default"))) %>%
kable(escape = FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | mas_vnr_type | mas_vnr_area |
|---|---|---|---|
| test | 2611 | NA | 198 |
The remaining missing value seems to be legit.
alley, fence

These 2 variables don’t seem to be related to any other variable in the dataset (as per the description file).
We will therefore change all NA values into None without looking at other variables.
Note that we thus assume that there are no missing values for these variables!
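A minimal sketch of this blanket replacement, on a made-up tibble `toy` (the actual code used here is not shown, so the approach below is an assumption):

```r
library(dplyr)

# Toy data (made up): both columns contain NA meaning "no alley" / "no fence"
toy <- tibble(
  id    = 1:3,
  alley = c(NA, "Grvl", NA),
  fence = c("MnPrv", NA, NA)
)

# Replace every NA in the two columns with the label "None"
toy <- toy %>%
  mutate(across(c(alley, fence), ~ coalesce(.x, "None")))
```

`coalesce()` keeps existing labels untouched and only fills the NA positions.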
Only one variable remains with a lot of NA values: lot_frontage.
This variable is numeric, and is probably not equal to 0.
So we might have to impute these missing values.
full %>%
sapply(function(x) sum(is.na(x))) %>%
enframe(name = "variable", value = "missing_nb") %>%
arrange(-missing_nb) %>%
filter(missing_nb != 0, variable != "sale_price") %>%
mutate(missing_nb = color_bar("lightgreen")(missing_nb)) %>%
kable(escape = FALSE) %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| variable | missing_nb |
|---|---|
| lot_frontage | 486 |
| ms_zoning | 4 |
| bsmt_cond | 3 |
| bsmt_exposure | 3 |
| pool_qc | 3 |
| utilities | 2 |
| bsmt_qual | 2 |
| functional | 2 |
| exterior1st | 1 |
| exterior2nd | 1 |
| mas_vnr_type | 1 |
| bsmt_fin_type2 | 1 |
| electrical | 1 |
| kitchen_qual | 1 |
| garage_yr_blt | 1 |
| garage_finish | 1 |
| garage_qual | 1 |
| garage_cond | 1 |
| misc_feature | 1 |
| sale_type | 1 |
Now that we have cleaned the data, let’s have a look at the remaining variables to spot incongruities.
ms_sub_class, overall_qual, overall_cond

These are numeric variables that should be treated as character/categorical variables (as per the description file).
All character variables can now be transformed into factors.
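The two conversions could be sketched as follows on a toy tibble (column values are made up, and the exact code used in this notebook is not shown):

```r
library(dplyr)

# Toy data (made up): two numeric codes, one character and one true numeric column
toy <- tibble(
  ms_sub_class = c(60, 20, 70),
  overall_qual = c(7, 6, 7),
  ms_zoning    = c("RL", "RM", "RL"),
  lot_area     = c(8450, 9600, 9550)
)

toy <- toy %>%
  # numeric codes that are really categories become character first
  mutate(across(c(ms_sub_class, overall_qual), as.character)) %>%
  # then every character column becomes a factor
  mutate(across(where(is.character), as.factor))
```

Genuinely numeric columns such as `lot_area` are left untouched by the second step.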
garage_yr_blt

There is an observation with a year equal to 2207, clearly a typo.
##
## 1895 1896 1900 1906 1908 1910 1914 1915 1916 1917 1918 1919 1920 1921 1922
## 1 1 6 1 1 10 2 7 6 2 3 1 33 5 8
## 1923 1924 1925 1926 1927 1928 1929 1930 1931 1932 1933 1934 1935 1936 1937
## 6 8 15 15 5 7 2 27 4 4 1 4 8 7 6
## 1938 1939 1940 1941 1942 1943 1945 1946 1947 1948 1949 1950 1951 1952 1953
## 11 21 25 14 6 1 10 9 5 19 14 51 17 16 23
## 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964 1965 1966 1967 1968
## 37 24 41 34 42 36 37 31 35 34 35 34 39 36 48
## 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983
## 32 32 24 27 29 35 28 50 66 41 35 32 15 9 11
## 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998
## 19 18 12 18 20 19 26 17 27 49 39 35 40 44 58
## 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2207 None <NA>
## 54 55 41 53 92 99 142 115 115 61 29 5 1 158 1
full %>%
filter(garage_yr_blt == "2207") %>%
select(df_id, id, year_built, year_remod_add, garage_yr_blt) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | year_built | year_remod_add | garage_yr_blt |
|---|---|---|---|---|
| test | 2593 | 2006 | 2007 | 2207 |
It seems this value should be replaced with the year 2007.
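A minimal sketch of that replacement on toy data. It assumes the column is stored as character at this point (the frequency table above contains the label None); if it is already a factor, it would need `as.character()` first. The tibble `toy` and its rows are made up.

```r
library(dplyr)

# Toy data (made up), with the year stored as text
toy <- tibble(
  id            = c(2592, 2593, 2594),
  year_built    = c(2005, 2006, 1970),
  garage_yr_blt = c("2005", "2207", "None")
)

# Replace the impossible year with the presumed intended one
toy <- toy %>%
  mutate(garage_yr_blt = if_else(garage_yr_blt == "2207", "2007", garage_yr_blt))
```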
year_built, year_remod_add

One observation has a remodeling year earlier than the year built, which is clearly a typo.
full %>%
select(df_id, id, year_built, year_remod_add, garage_yr_blt) %>%
filter(year_built > year_remod_add) %>%
kable() %>%
kable_styling(bootstrap_options = "striped", full_width = F, position = "left")

| df_id | id | year_built | year_remod_add | garage_yr_blt |
|---|---|---|---|---|
| test | 1877 | 2002 | 2001 | 2002 |
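How this row is corrected is not stated. One possible repair, purely an assumption, is to force the remodeling year to be at least the year built. The sketch below uses a made-up tibble `toy`.

```r
library(dplyr)

# Toy data (made up): the second row has a remodeling year before the build year
toy <- tibble(
  id             = c(1876, 1877),
  year_built     = c(1999, 2002),
  year_remod_add = c(2004, 2001)
)

# Assumed fix: a house cannot be remodeled before it is built
toy <- toy %>%
  mutate(year_remod_add = pmax(year_remod_add, year_built))
```

Rows that are already consistent are unchanged by `pmax()`.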
Some features have specific labels that can be considered as ordinal. Indeed, an “Excellent” quality is generally better than a “Poor” or “Fair” quality.
These variables can thus be recoded as ordinal variables, for which label ordering matters.
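As a small illustration of what the ordering brings, here is a toy quality vector (values made up) declared as an ordered factor in base R:

```r
# Levels are listed from worst to best
qual <- ordered(c("Gd", "TA", "Ex", "Fa"),
                levels = c("Po", "Fa", "TA", "Gd", "Ex"))

# Comparisons follow the declared level order, not alphabetical order
at_least_good <- qual >= "Gd"

# The underlying integer codes give the rank of each label
ranks <- as.integer(qual)
```

With a plain (unordered) factor, the comparison `qual >= "Gd"` would not be meaningful and would return NA with a warning.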
# transform following features as ordinal variables
full$overall_qual <- ordered(x = full$overall_qual, levels = as.character(1:10))
full$overall_cond <- ordered(x = full$overall_cond, levels = as.character(1:10))
full$exter_qual <- ordered(x = full$exter_qual, levels = c("Po", "Fa", "TA", "Gd", "Ex"))
full$exter_cond <- ordered(x = full$exter_cond, levels = c("Po", "Fa", "TA", "Gd", "Ex"))
full$bsmt_qual <- ordered(x = full$bsmt_qual, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$bsmt_cond <- ordered(x = full$bsmt_cond, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$bsmt_fin_type1 <- ordered(x = full$bsmt_fin_type1, levels = c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"))
full$bsmt_fin_type2 <- ordered(x = full$bsmt_fin_type2, levels = c("None", "Unf", "LwQ", "Rec", "BLQ", "ALQ", "GLQ"))
full$heating_qc <- ordered(x = full$heating_qc, levels = c("Po", "Fa", "TA", "Gd", "Ex"))
full$kitchen_qual <- ordered(x = full$kitchen_qual, levels = c("Po", "Fa", "TA", "Gd", "Ex"))
full$functional <- ordered(x = full$functional, levels = c("Sal", "Sev", "Maj2", "Maj1", "Mod", "Min2", "Min1", "Typ"))
full$fireplace_qu <- ordered(x = full$fireplace_qu, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$garage_qual <- ordered(x = full$garage_qual, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$garage_cond <- ordered(x = full$garage_cond, levels = c("None", "Po", "Fa", "TA", "Gd", "Ex"))
full$pool_qc <- ordered(x = full$pool_qc, levels = c("None", "Fa", "TA", "Gd", "Ex"))
full$fence <- ordered(x = full$fence, levels = c("None", "MnWw", "GdWo", "MnPrv", "GdPrv"))

Special case: electrical. The distribution table shows there is only 1 observation for the Mix label.
##
## FuseA FuseF FuseP Mix SBrkr
## 188 50 8 1 2671
The description file is not very clear about the “rank” of the Mix label.
Since this observation is in the train set, the label (and the observation) will be deleted so that a clear ordinal variable can be defined, as in the description file.
# which dataset contains the 'Mix' value
full %>%
filter(electrical == "Mix") %>%
select(df_id, id, electrical)

## # A tibble: 1 x 3
## df_id id electrical
## <fct> <dbl> <fct>
## 1 train 399 Mix
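One detail matters when dropping this row: `dplyr::filter()` keeps only the rows for which the condition is TRUE, so a bare `electrical != "Mix"` also discards the row whose `electrical` is NA. A toy sketch (the tibble `toy` is made up):

```r
library(dplyr)

# Toy data (made up): one regular value, the label to drop, and one NA
toy <- tibble(id = 1:3, electrical = c("SBrkr", "Mix", NA))

# NA != "Mix" evaluates to NA, so the NA row is silently dropped as well
naive <- toy %>% filter(electrical != "Mix")

# Guarding with is.na() keeps the NA row for later inspection
guarded <- toy %>% filter(is.na(electrical) | electrical != "Mix")
```

Here `naive` keeps a single row while `guarded` keeps rows 1 and 3.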
# remove the observation n°399
# note concerning filtering NA values in 'electrical' : https://github.com/tidyverse/dplyr/issues/3196
full <- full %>%
filter(is.na(electrical) | electrical != "Mix")
# relevel the 'electrical' factor
full$electrical <- ordered(x = full$electrical, levels = c("FuseP", "FuseF", "FuseA", "SBrkr"))

Some other variables have inconsistencies, but it will be harder (and longer) to find where the errors come from.
As an example, we can look at the following feature: ms_sub_class, and the specific labels corresponding to the construction year:

- 20 : 1-STORY 1946 & NEWER ALL STYLES
- 30 : 1-STORY 1945 & OLDER
- 60 : 2-STORY 1946 & NEWER
- 70 : 2-STORY 1945 & OLDER
- 120 : 1-STORY PUD (Planned Unit Development) - 1946 & NEWER
- 160 : 2-STORY PUD - 1946 & NEWER

We can add a new feature to check if the construction year is before or after 1946, and compare to the above labels.
full %>%
mutate(built_after_1946 = year_built >= 1946) %>%
filter(ms_sub_class %in% c("20", "30", "60", "70", "120", "160")) %>%
select(ms_sub_class, built_after_1946) %>%
table()

## built_after_1946
## ms_sub_class FALSE TRUE
## 120 0 182
## 150 0 0
## 160 0 128
## 180 0 0
## 190 0 0
## 20 2 1077
## 30 136 2
## 40 0 0
## 45 0 0
## 50 0 0
## 60 3 572
## 70 127 1
## 75 0 0
## 80 0 0
## 85 0 0
## 90 0 0
The previous table shows inconsistencies. For example, the label 30 is supposed to represent houses built in 1945 or earlier; but 2 observations have a year_built value of 1946 or later.
This kind of incongruity is very time-consuming to track down, so we will use the data as is for the rest of the analysis.
To close this Part 1, let’s save the full dataset.
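The file format used for saving is not stated. One option, sketched here as an assumption, is R's native RDS format, which preserves factor levels and their ordering (a CSV export would not). The file name `full.rds` and the data frame `toy_full` are hypothetical.

```r
# Toy stand-in (made up) for the cleaned dataset, with one ordered factor
toy_full <- data.frame(
  id    = 1:2,
  fence = ordered(c("None", "GdPrv"),
                  levels = c("None", "MnWw", "GdWo", "MnPrv", "GdPrv"))
)

# Hypothetical file name, written to a temporary directory for this sketch
path <- file.path(tempdir(), "full.rds")
saveRDS(toy_full, path)

# Reading it back restores the ordered factor exactly
restored <- readRDS(path)
```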