This file demonstrates a typical process of using R package “cleandata” to prepare data for machine learning.
A collection of functions that work with data frame to inspect, impute, and encode data. The functions for imputation and encoding can produce log files to help you keep track of data manipulation process.
Available on CRAN: https://cran.r-project.org/package=cleandata
Source Code on GitHub: https://github.com/sherrisherry/cleandata
Issues and Suggestions: https://github.com/sherrisherry/cleandata/issues
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa for predicting housing prices, this dataset is a typical example of what a business analyst encounters everyday.
According to the description of this dataset, the “NA”s in some columns aren’t missing value. To prevent R from comfusing them with true missing values, read in the data files without converting any value to the NA in R.
The train set should have only one more column SalePrice than the test set.
# import 'cleandata' package.
library('cleandata')
# read in the training and test datasets without converting 'NA's to missing values.
train <- read.csv('data/train.csv', na.strings = "", strip.white = TRUE)
test <- read.csv('data/test.csv', na.strings = "", strip.white = TRUE)
# summarize the training set and test set
cat(paste('train: ', nrow(train), 'obs. ', ncol(train), 'cols\ncolumn names:\n', toString(colnames(train)),
'\n\ntest: ', nrow(test), 'obs. ', ncol(test), 'cols\ncolumn names:\n', toString(colnames(test)), '\n'))
## train: 1460 obs. 81 cols
## column names:
## Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, Electrical, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence, MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition, SalePrice
##
## test: 1459 obs. 80 cols
## column names:
## Id, MSSubClass, MSZoning, LotFrontage, LotArea, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, YearBuilt, YearRemodAdd, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, MasVnrArea, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinSF1, BsmtFinType2, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, Heating, HeatingQC, CentralAir, Electrical, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, KitchenQual, TotRmsAbvGrd, Functional, Fireplaces, FireplaceQu, GarageType, GarageYrBlt, GarageFinish, GarageCars, GarageArea, GarageQual, GarageCond, PavedDrive, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, PoolQC, Fence, MiscFeature, MiscVal, MoSold, YrSold, SaleType, SaleCondition
To ensure consistency in the following imputation and encoding process across the train set and the test set, I appended the test set to the train set. The SalePrice values of the rows of the test set was set to NA to distinguish them from the rows of the train set. The resulting data frame was called df.
# filling the target columns of the test set with NA then combining test and training sets
test$SalePrice <- NA
df <- rbind(train, test)
rm(train, test)
Function
inspect_na
inspect_na() counts the number of NAs in each column and sort them in descending order. In the following operation, inspect_na() returned the top 5 columns with missing values. If you want to see the number of missing values in every column, leave parameter top as default. As supposed, only SalePrice contained missing values, which equaled to the number of rows in the test set.
inspect_na(df, top = 5)
## SalePrice Id MSSubClass MSZoning LotFrontage
## 1459 0 0 0 0
The NAs in the columns listed in NAisNoA were what was refered to as ‘none’-but-not-‘NA’ values. In these columns, NA had only one possible value - “not applicable”. I replaced these NAs with NoA to prevent imputing them later.
# in the 'NAisNoA' columns, NA means this attribute doesn't apply to them, not missing.
NAisNoA <- c('Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2',
'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond',
'PoolQC', 'Fence', 'MiscFeature')
for(i in NAisNoA){levels(df[, i])[levels(df[, i]) == "NA"] <- "NoA"}
At this stage, I reconstructed the data frame df to inspect the true missing values.
We can see that only LotFrontage had about 20% missing values. The other columns had few to no missing value.
# write the dataset into a csv file then read this file back to df to reconstruct df
write.csv(df, file = 'data/data.csv', row.names = FALSE)
df <- read.csv('data/data.csv', na.strings = "NA", strip.white = TRUE)
# see which predictors have most NAs
inspect_na(df[, -ncol(df)], top = 25)
## LotFrontage GarageYrBlt MasVnrType MasVnrArea MSZoning
## 486 159 24 23 4
## Utilities BsmtFullBath BsmtHalfBath Functional Exterior1st
## 2 2 2 2 1
## Exterior2nd BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
## 1 1 1 1 1
## Electrical KitchenQual GarageCars GarageArea SaleType
## 1 1 1 1 1
## Id MSSubClass LotArea Street Alley
## 0 0 0 0 0
Function
inspect_map
inspect_map() classifies the columns of a data frame. Before I further explain this function, I’d like to introduce ‘scheme’. In package “cleandata”, a scheme refers to a set of all the possible values of an enumerator. The factor objects in R are enumerators.
Function inspect_map returns a list of factor_cols (list), factor_levels (list), num_cols (vector), char_cols (vector), ordered_cols (vector), and other_cols (vector).
common for more information about scheme.In the following codes, I specified that 2 factorial columns share the same scheme if their levels had more than 2 same values by setting the common parameter to 2. By default, the common parameter is 0, which means every level of 2 factorial columns should be the same for them to share the same scheme.
# create a map for imputation and encoding
data_map <- inspect_map(df[, -ncol(df)], common = 2)
## Id integer factors: 0 nums: 1 chars: 0 ordered: 0 others: 0
## MSSubClass integer factors: 0 nums: 2 chars: 0 ordered: 0 others: 0
## MSZoning factor factors: 1 nums: 2 chars: 0 ordered: 0 others: 0
## LotFrontage integer factors: 1 nums: 3 chars: 0 ordered: 0 others: 0
## LotArea integer factors: 1 nums: 4 chars: 0 ordered: 0 others: 0
## Street factor factors: 2 nums: 4 chars: 0 ordered: 0 others: 0
## Alley factor factors: 3 nums: 4 chars: 0 ordered: 0 others: 0
## LotShape factor factors: 4 nums: 4 chars: 0 ordered: 0 others: 0
## LandContour factor factors: 5 nums: 4 chars: 0 ordered: 0 others: 0
## Utilities factor factors: 6 nums: 4 chars: 0 ordered: 0 others: 0
## LotConfig factor factors: 7 nums: 4 chars: 0 ordered: 0 others: 0
## LandSlope factor factors: 8 nums: 4 chars: 0 ordered: 0 others: 0
## Neighborhood factor factors: 9 nums: 4 chars: 0 ordered: 0 others: 0
## Condition1 factor factors: 10 nums: 4 chars: 0 ordered: 0 others: 0
## Condition2 factor factors: 11 nums: 4 chars: 0 ordered: 0 others: 0
## BldgType factor factors: 12 nums: 4 chars: 0 ordered: 0 others: 0
## HouseStyle factor factors: 13 nums: 4 chars: 0 ordered: 0 others: 0
## OverallQual integer factors: 13 nums: 5 chars: 0 ordered: 0 others: 0
## OverallCond integer factors: 13 nums: 6 chars: 0 ordered: 0 others: 0
## YearBuilt integer factors: 13 nums: 7 chars: 0 ordered: 0 others: 0
## YearRemodAdd integer factors: 13 nums: 8 chars: 0 ordered: 0 others: 0
## RoofStyle factor factors: 14 nums: 8 chars: 0 ordered: 0 others: 0
## RoofMatl factor factors: 15 nums: 8 chars: 0 ordered: 0 others: 0
## Exterior1st factor factors: 16 nums: 8 chars: 0 ordered: 0 others: 0
## Exterior2nd factor factors: 17 nums: 8 chars: 0 ordered: 0 others: 0
## MasVnrType factor factors: 18 nums: 8 chars: 0 ordered: 0 others: 0
## MasVnrArea integer factors: 18 nums: 9 chars: 0 ordered: 0 others: 0
## ExterQual factor factors: 19 nums: 9 chars: 0 ordered: 0 others: 0
## ExterCond factor factors: 20 nums: 9 chars: 0 ordered: 0 others: 0
## Foundation factor factors: 21 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtQual factor factors: 22 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtCond factor factors: 23 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtExposure factor factors: 24 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtFinType1 factor factors: 25 nums: 9 chars: 0 ordered: 0 others: 0
## BsmtFinSF1 integer factors: 25 nums: 10 chars: 0 ordered: 0 others: 0
## BsmtFinType2 factor factors: 26 nums: 10 chars: 0 ordered: 0 others: 0
## BsmtFinSF2 integer factors: 26 nums: 11 chars: 0 ordered: 0 others: 0
## BsmtUnfSF integer factors: 26 nums: 12 chars: 0 ordered: 0 others: 0
## TotalBsmtSF integer factors: 26 nums: 13 chars: 0 ordered: 0 others: 0
## Heating factor factors: 27 nums: 13 chars: 0 ordered: 0 others: 0
## HeatingQC factor factors: 28 nums: 13 chars: 0 ordered: 0 others: 0
## CentralAir factor factors: 29 nums: 13 chars: 0 ordered: 0 others: 0
## Electrical factor factors: 30 nums: 13 chars: 0 ordered: 0 others: 0
## X1stFlrSF integer factors: 30 nums: 14 chars: 0 ordered: 0 others: 0
## X2ndFlrSF integer factors: 30 nums: 15 chars: 0 ordered: 0 others: 0
## LowQualFinSF integer factors: 30 nums: 16 chars: 0 ordered: 0 others: 0
## GrLivArea integer factors: 30 nums: 17 chars: 0 ordered: 0 others: 0
## BsmtFullBath integer factors: 30 nums: 18 chars: 0 ordered: 0 others: 0
## BsmtHalfBath integer factors: 30 nums: 19 chars: 0 ordered: 0 others: 0
## FullBath integer factors: 30 nums: 20 chars: 0 ordered: 0 others: 0
## HalfBath integer factors: 30 nums: 21 chars: 0 ordered: 0 others: 0
## BedroomAbvGr integer factors: 30 nums: 22 chars: 0 ordered: 0 others: 0
## KitchenAbvGr integer factors: 30 nums: 23 chars: 0 ordered: 0 others: 0
## KitchenQual factor factors: 31 nums: 23 chars: 0 ordered: 0 others: 0
## TotRmsAbvGrd integer factors: 31 nums: 24 chars: 0 ordered: 0 others: 0
## Functional factor factors: 32 nums: 24 chars: 0 ordered: 0 others: 0
## Fireplaces integer factors: 32 nums: 25 chars: 0 ordered: 0 others: 0
## FireplaceQu factor factors: 33 nums: 25 chars: 0 ordered: 0 others: 0
## GarageType factor factors: 34 nums: 25 chars: 0 ordered: 0 others: 0
## GarageYrBlt integer factors: 34 nums: 26 chars: 0 ordered: 0 others: 0
## GarageFinish factor factors: 35 nums: 26 chars: 0 ordered: 0 others: 0
## GarageCars integer factors: 35 nums: 27 chars: 0 ordered: 0 others: 0
## GarageArea integer factors: 35 nums: 28 chars: 0 ordered: 0 others: 0
## GarageQual factor factors: 36 nums: 28 chars: 0 ordered: 0 others: 0
## GarageCond factor factors: 37 nums: 28 chars: 0 ordered: 0 others: 0
## PavedDrive factor factors: 38 nums: 28 chars: 0 ordered: 0 others: 0
## WoodDeckSF integer factors: 38 nums: 29 chars: 0 ordered: 0 others: 0
## OpenPorchSF integer factors: 38 nums: 30 chars: 0 ordered: 0 others: 0
## EnclosedPorch integer factors: 38 nums: 31 chars: 0 ordered: 0 others: 0
## X3SsnPorch integer factors: 38 nums: 32 chars: 0 ordered: 0 others: 0
## ScreenPorch integer factors: 38 nums: 33 chars: 0 ordered: 0 others: 0
## PoolArea integer factors: 38 nums: 34 chars: 0 ordered: 0 others: 0
## PoolQC factor factors: 39 nums: 34 chars: 0 ordered: 0 others: 0
## Fence factor factors: 40 nums: 34 chars: 0 ordered: 0 others: 0
## MiscFeature factor factors: 41 nums: 34 chars: 0 ordered: 0 others: 0
## MiscVal integer factors: 41 nums: 35 chars: 0 ordered: 0 others: 0
## MoSold integer factors: 41 nums: 36 chars: 0 ordered: 0 others: 0
## YrSold integer factors: 41 nums: 37 chars: 0 ordered: 0 others: 0
## SaleType factor factors: 42 nums: 37 chars: 0 ordered: 0 others: 0
## SaleCondition factor factors: 43 nums: 37 chars: 0 ordered: 0 others: 0
summary(data_map)
## Length Class Mode
## factor_cols 31 -none- list
## factor_levels 31 -none- list
## num_cols 37 -none- character
## char_cols 0 -none- NULL
## ordered_cols 0 -none- NULL
## other_cols 0 -none- NULL
This dataset only had factorial and numeric columns. I unpacked data_map before heading to imputation and encoding.
factor_cols <- data_map$factor_cols
factor_levels <- data_map$factor_levels
num_cols <- data_map$num_cols
rm(data_map)
The functions for imputation and encoding keep track of your process by producing log files. This feature is by default disabled. To enable log files, I passed a list of arguments to the log parameter in every imputation or encoding function.
# create a list of arguments for producing a log file
log_arg <- list(file = 'log.txt', append = TRUE, split = FALSE)
log_arg stroed the arguments I passed to parameter log. The log file was named “log.txt”, new information was appended to the file, and the contents to the log file weren’t printed to the standard output.
To prevent leakage, I instructed the imputation functions to use only rows of the train set to calculate the imputation values by passing an index to parameter idx.
Function
impute_mode,impute_median,impute_mean
impute_mode() works with both numerical, string, and factorial columns. It impute NAs by the modes of their corresponding columns.
impute_median() and impute_mean() only work with numerical columns. They impute NAs by medians and means respectively.
# impute NAs in factorial columns by the mode of corresponding columns
lst <- unlist(factor_cols)
df <- impute_mode(df, cols = lst, idx = !is.na(df$SalePrice), log = log_arg)
# impute NAs in numerical columns by the median of corresponding columns
lst <- num_cols
df <- impute_median(df, cols = lst, idx = !is.na(df$SalePrice), log = log_arg)
# check the result
inspect_na(df[, -ncol(df)], top = 5)
## Id MSSubClass MSZoning LotFrontage LotArea
## 0 0 0 0 0
Every encoding function prints summary of the columns before and after encoding by default. The output of every funcion is by default factorial. If you want numerical output, set parameter out.int to TRUE after making sure no missing value in the input.
In this demo, I kept the encoded columns factorial because I intended to save the dataset into a csv file, which doesn’t distinguish between factorial and numerical columns.
In business datasets, we can often find ratings, which are ordinal and use similar schemes. Based on my experience, if many columns share the same scheme, they are likely to be ratings.
summary(factor_cols)
## Length Class Mode
## MSZoning 1 -none- character
## Street 1 -none- character
## Alley 1 -none- character
## LotShape 1 -none- character
## LandContour 1 -none- character
## Utilities 1 -none- character
## LotConfig 1 -none- character
## LandSlope 1 -none- character
## Neighborhood 1 -none- character
## Condition1 2 -none- character
## BldgType 1 -none- character
## HouseStyle 1 -none- character
## RoofStyle 1 -none- character
## RoofMatl 1 -none- character
## Exterior1st 2 -none- character
## MasVnrType 1 -none- character
## ExterQual 10 -none- character
## Foundation 1 -none- character
## BsmtExposure 1 -none- character
## BsmtFinType1 2 -none- character
## Heating 1 -none- character
## CentralAir 1 -none- character
## Electrical 1 -none- character
## Functional 1 -none- character
## GarageType 1 -none- character
## GarageFinish 1 -none- character
## PavedDrive 1 -none- character
## Fence 1 -none- character
## MiscFeature 1 -none- character
## SaleType 1 -none- character
## SaleCondition 1 -none- character
In our dataset ExterQual and other 9 columns share the same scheme. After I checked their scheme and the description file, I was sure that they were ordinal.
factor_levels$ExterQual
## [1] "Ex" "Fa" "Gd" "NoA" "Po" "TA"
“Po”: poor, “Fa”: fair, “TA”: typical/average, “Gd”: good, “Ex”: excellent
Function
encode_ordinal
encode_ordinal() encodes ordinal data into sequential integers by a given order. The argument passed to none is always encoded to 0. The 1st member of the vector passed to order is encoded to 1.
# encoding ordinal columns
i <- 'ExterQual'; lst <- c('Po', 'Fa', 'TA', 'Gd', 'Ex')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]]], order = lst, none = 'NoA', log = log_arg)
## ExterQual ExterCond BsmtQual BsmtCond HeatingQC KitchenQual
## Ex: 107 Ex: 12 Ex : 258 Fa : 104 Ex:1493 Ex: 205
## Fa: 35 Fa: 67 Fa : 88 Gd : 122 Fa: 92 Fa: 70
## Gd: 979 Gd: 299 Gd :1209 NoA: 82 Gd: 474 Gd:1151
## TA:1798 Po: 3 NoA: 81 Po : 5 Po: 3 TA:1493
## TA:2538 TA :1283 TA :2606 TA: 857
##
## FireplaceQu GarageQual GarageCond PoolQC
## Ex : 43 Ex : 3 Ex : 3 Ex : 4
## Fa : 74 Fa : 124 Fa : 74 Fa : 2
## Gd : 744 Gd : 24 Gd : 15 Gd : 4
## NoA:1420 NoA: 159 NoA: 159 NoA:2909
## Po : 46 Po : 5 Po : 14
## TA : 592 TA :2604 TA :2654
## coded 10 cols 5 levels
## ExterQual ExterCond BsmtQual BsmtCond HeatingQC KitchenQual FireplaceQu
## 5: 107 5: 12 5: 258 2: 104 5:1493 5: 205 5: 43
## 2: 35 2: 67 2: 88 4: 122 2: 92 2: 70 2: 74
## 4: 979 4: 299 4:1209 0: 82 4: 474 4:1151 4: 744
## 3:1798 1: 3 0: 81 1: 5 1: 3 3:1493 0:1420
## 3:2538 3:1283 3:2606 3: 857 1: 46
## 3: 592
## GarageQual GarageCond PoolQC
## 5: 3 5: 3 5: 4
## 2: 124 2: 74 2: 2
## 4: 24 4: 15 4: 4
## 0: 159 0: 159 0:2909
## 1: 5 1: 14
## 3:2604 3:2654
# removing encoded columns from the map
factor_levels[[i]] <- NULL
factor_cols[[i]] <- NULL
The Utilities column was binary according the dataset.
factor_levels$Utilities
## [1] "AllPub" "NoSeWa"
levels(df$Utilities)
## [1] "AllPub" "NoSeWa"
However, the description file indicates that it has 4 possible values: ‘ELO’, ‘NoSeWa’, ‘NoSewr’, ‘AllPub’. Therefore, I encoded it as having 4 levels.
# in dataset only "AllPub" "NoSeWa", with 2 NAs
i <- 'Utilities'; lst <- c('ELO', 'NoSeWa', 'NoSewr', 'AllPub')
df[, factor_cols[[i]]] <- encode_ordinal(df[, factor_cols[[i]], drop=FALSE], order = lst, log = log_arg)
## Utilities
## AllPub:2918
## NoSeWa: 1
## coded 1 cols 4 levels
## Utilities
## 4:2918
## 2: 1
factor_levels[[i]]<-NULL
factor_cols[[i]]<-NULL
# find all the 2-level columns
lst <- lapply(factor_levels, length)
lst <- as.data.frame(lst)
colnames(lst[, lst == 2])
## [1] "Street" "CentralAir"
Function
encode_binary
encode_binary() encodes binary data into 0 and 1, regardless of order.
# encode all the 2-level columns
i <- c(factor_cols$Street, factor_cols$CentralAir)
df[, i] <- encode_binary(df[, i, drop=FALSE], log = log_arg)
## Street CentralAir
## Grvl: 12 N: 196
## Pave:2907 Y:2723
## coded 1 cols
## coded 1 cols
## Street CentralAir
## 0: 12 0: 196
## 1:2907 1:2723
Let’s check the log file at the end.
cat(paste(readLines('log.txt'), collapse = '\n'))
## Columns Imputed by Mode:
## MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC, Foundation, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, CentralAir, Electrical, Functional, GarageType, GarageFinish, PavedDrive, Fence, MiscFeature, SaleType, SaleCondition
##
## Columns Imputed by Median:
## Id, MSSubClass, LotFrontage, LotArea, OverallQual, OverallCond, YearBuilt, YearRemodAdd, MasVnrArea, BsmtFinSF1, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, X1stFlrSF, X2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, X3SsnPorch, ScreenPorch, PoolArea, MiscVal, MoSold, YrSold
##
## Columns:
## ExterQual, ExterCond, BsmtQual, BsmtCond, HeatingQC, KitchenQual, FireplaceQu, GarageQual, GarageCond, PoolQC
## Scheme:
## NoA Po Fa TA Gd Ex
## 0 1 2 3 4 5
##
## Columns:
## Utilities
## Scheme:
## ELO NoSeWa NoSewr AllPub
## 0 1 2 3 4
##
## Columns:
## Street
## Scheme:
## Grvl Pave
## 0 1
##
## Columns:
## CentralAir
## Scheme:
## N Y
## 0 1
=== end ===