Data: Preprocessing

Preliminary

We will use the DescTools and caret packages to preprocess our data. If either package is not already installed, install it first using the install.packages() function (the example below installs caret). Both packages have useful and comprehensive vignettes, which can be accessed through the help documentation in RStudio.

install.packages("caret")

Once installed, we load the packages for use in the R session using the library() function.

library(DescTools)
library(caret)

In the lesson that follows we will use the “csdata.RData” file. We use the load() function to import the objects in the RData file from our working directory into our workspace for use in the current session.

load("csdata.RData")

We can view the names of the objects that we have imported using the ls() function.

ls()

## [1] "cs"   "facs" "nums" "ords"

Preprocessing

Aggregation

We may want to aggregate either rows or columns to reduce the dimensionality and variability in the data. For instance, we may want to create an AvgPrice variable, which contains the average of the Price and CompPrice variables, and use this variable instead of the two individual variables as input to our analysis. We can use the rowMeans() function to achieve this.

cs$AvgPrice <- rowMeans(cs[ ,c("Price", "CompPrice")])
head(cs[ , c("Price", "CompPrice", "AvgPrice")])

##   Price CompPrice AvgPrice
## 1   120       138    129.0
## 2    83       111     97.0
## 3    80       113     96.5
## 4    97       117    107.0
## 5   128       141    134.5
## 6    72       124     98.0
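
As a minimal, self-contained sketch of the same idea, the toy data frame below (hypothetical values, not the cs data) shows how rowMeans() averages across columns and how a missing value affects the result. The na.rm = TRUE variant is shown only as an option; whether ignoring missing values is appropriate depends on the analysis.

```r
# Toy data frame (hypothetical values, not the cs data)
toy <- data.frame(Price     = c(120, 83, NA),
                  CompPrice = c(138, 111, 113))

# Row-wise mean of the two price columns; a row with an NA yields NA
toy$AvgPrice <- rowMeans(toy[ , c("Price", "CompPrice")])

# With na.rm = TRUE, the row mean is computed from the non-missing values
toy$AvgPrice_narm <- rowMeans(toy[ , c("Price", "CompPrice")], na.rm = TRUE)

toy
```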

Sampling

A common technique to reduce the observation (row) dimensionality of a dataset is sampling. Simple random sampling can be performed with replacement (replace = TRUE) or without replacement (replace = FALSE, the default) using the sample() function. To create a reproducible sample, we first use the set.seed() function to initialize the random seed.

First, we create a random sample without replacement (the default) containing 90% of the observations in the cs dataframe. When the x argument is a single positive number n, sample() chooses from the integers 1 to n, so passing nrow(cs) samples from the row index numbers. The size argument indicates the number of items to randomly select. We then subset the row dimension of the cs dataframe to include the row index numbers selected by the sample() function.

set.seed(831)
cs_sample <- cs[sample(x = nrow(cs),
                       size = nrow(cs) * 0.9), ]

Next, we can create a random sample of 75% of the observations in the cs dataframe with replacement (rows can be included more than once in the sample) by specifying replace = TRUE.

cs_sample2 <- cs[sample(x = nrow(cs),
                        size = nrow(cs) * 0.75,
                        replace = TRUE), ]
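
The difference between the two sampling modes can be checked on a small example. The sketch below uses a hypothetical 20-row data frame (not the cs data) and wraps the size in floor() so that it is an explicit whole number; that choice is an assumption of this sketch, not part of the code above.

```r
# Toy data frame with 20 rows (hypothetical, not the cs data)
toy <- data.frame(id = 1:20, value = (1:20)^2)

set.seed(831)                        # makes the draws below reproducible
n_keep <- floor(nrow(toy) * 0.9)     # explicit whole-number sample size (18)

# Without replacement: every selected row index is unique
idx_wo <- sample(x = nrow(toy), size = n_keep)
toy_wo <- toy[idx_wo, ]

# With replacement: the same row index may be drawn more than once
idx_w <- sample(x = nrow(toy), size = n_keep, replace = TRUE)
toy_w <- toy[idx_w, ]

anyDuplicated(idx_wo)     # 0 means no index was repeated
length(unique(idx_w))     # number of distinct rows drawn, at most n_keep
```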

In some cases, we may want to preserve the distribution of a categorical (factor) variable in our sample. For this, we can use the createDataPartition() function in the caret package. The function outputs the row index numbers to include in a sample that preserves the distribution of the categorical variable specified in the y argument. The p argument indicates the proportion of the total rows to include in the sample, and list = FALSE returns the index numbers as a matrix rather than a list, which allows us to use the output to create a subset of the original dataframe by indexing.

samp <- createDataPartition(y = cs$Sales_Lev,
                            p = 0.7,
                            list = FALSE)
cs_sample3 <- cs[samp, ]
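
To illustrate what preserving the class distribution means, the following base R sketch samples the same share of rows within each class of a hypothetical factor. It is a simplified illustration of stratified sampling under stated assumptions (toy data, a made-up grp variable), not a reproduction of how createDataPartition() works internally.

```r
# Toy data: a factor with an uneven class distribution (hypothetical)
toy <- data.frame(id  = 1:100,
                  grp = factor(rep(c("High", "Low"), times = c(30, 70))))

set.seed(831)
p <- 0.7

# Split the row indices by class, then sample a share p within each class
idx_by_class <- split(seq_len(nrow(toy)), toy$grp)
samp <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(length(idx) * p))]
}), use.names = FALSE)
toy_strat <- toy[samp, ]

# Class proportions in the full data and in the stratified sample
prop.table(table(toy$grp))
prop.table(table(toy_strat$grp))
```

Comparing the two proportion tables is also a quick way to check a sample produced by any other method.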

Discretization

We use the cut() function to perform discretization, which breaks a numerical variable into categories and thereby converts it to a categorical variable. In this case, we break the Age variable into 5 categories. The x argument indicates the variable to discretize, the breaks argument identifies the number of class levels to create, the labels argument provides the names of the class levels, and ordered_result = TRUE indicates that the factor variable created should be an ordered factor.

cs$Age_disc <- cut(x = cs$Age,
                   breaks = 5,
                   labels = c(1, 2, 3, 4, 5),
                   ordered_result = TRUE)
head(cs[ , c("Age", "Age_disc")])

##   Age Age_disc
## 1  42        2
## 2  65        4
## 3  59        4
## 4  55        3
## 5  38        2
## 6  78        5
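
When breaks is a single number, cut() divides the range of the variable into that many equal-width intervals. The breaks argument also accepts a vector of explicit cut points. The sketch below uses hypothetical ages and hypothetical group boundaries to show that alternative.

```r
# Toy ages (hypothetical values)
ages <- c(25, 30, 42, 55, 70, 78)

# Explicit cut points instead of a number of equal-width bins.
# By default intervals are closed on the right:
# (0,30], (30,50], (50,70], (70,Inf]
age_grp <- cut(x = ages,
               breaks = c(0, 30, 50, 70, Inf),
               labels = c("<=30", "31-50", "51-70", ">70"),
               ordered_result = TRUE)

data.frame(ages, age_grp)
table(age_grp)
```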

Binarization

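For binary variables (categorical variables with 2 class levels), we can use the class2ind() function from the caret package with drop2nd = TRUE to create a single dummy variable. The retained dummy variable indicates the first class level, so in the output below No is coded as 1 and Yes as 0. The code below demonstrates binarization of the Urban variable.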

cs$Urban_bin <- class2ind(cs$Urban, drop2nd = TRUE)
head(cs[ ,c("Urban", "Urban_bin")])

##   Urban Urban_bin
## 1   Yes         0
## 2   Yes         0
## 3   Yes         0
## 4   Yes         0
## 5   Yes         0
## 6    No         1
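
In the output above, Yes is coded as 0 and No as 1, which is consistent with the dummy variable marking the first factor level. The base R sketch below (toy data, not caret's implementation) shows how a single indicator can be built for either level, in case the opposite coding is preferred.

```r
# Toy two-level factor; levels are sorted alphabetically: "No", "Yes"
urban <- factor(c("Yes", "Yes", "No", "Yes"))
levels(urban)

# Indicator for the first level ("No"): 1 when urban is "No", else 0
urban_first <- as.integer(urban == levels(urban)[1])

# Indicator for the second level ("Yes"), if that coding is preferred
urban_second <- as.integer(urban == levels(urban)[2])

data.frame(urban, urban_first, urban_second)
```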

To convert all categorical variables with 2 class levels to binary variables at the same time, we can use the lapply() function.

cs[,c("Urban_bin", "US_bin", "Sales_Lev")] <- lapply(cs[ ,c(facs, "Sales_Lev")], FUN = class2ind, drop2nd = TRUE)

For categorical variables with more than 2 class levels, we can use the dummyVars() function from the caret package. Ordinal (ordered) factor variables must be converted to unordered factor variables before binarization.

cs$ShelveLoc <- factor(cs$ShelveLoc, ordered = FALSE)

We use the dummyVars() and predict() functions to create the ‘dummy’ variables. The formula argument contains the variable or variables that you would like to binarize, following the format (~ x1 + x2 + ... + xn) to convert n variables. The data argument specifies the dataframe containing the original variable(s).

cats <- dummyVars(formula =  ~ ShelveLoc,
                  data = cs)

To create the matrix of dummy variables, the predict() function is used. The object argument contains the object created using the dummyVars() function and the newdata argument contains the original dataframe.

cats_dums <- predict(object = cats, 
                     newdata = cs)

To create a new dataframe that combines the original and binarized variables, we can use the cbind() function, which combines columns.

cs_dum <- cbind(cs, cats_dums)
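
For comparison, base R can produce a similar indicator matrix with model.matrix(). The sketch below uses a hypothetical toy data frame; it is an alternative technique shown for illustration, and its column names may differ from those produced by dummyVars().

```r
# Toy data frame with a three-level unordered factor (hypothetical values)
toy <- data.frame(Sales     = c(9.5, 11.2, 10.1, 7.4),
                  ShelveLoc = factor(c("Bad", "Good", "Medium", "Medium"),
                                     levels = c("Bad", "Medium", "Good")))

# "- 1" removes the intercept so that every level gets its own 0/1 column
dums <- model.matrix(~ ShelveLoc - 1, data = toy)

# Combine the original columns with the indicator columns
toy_dum <- cbind(toy, dums)
toy_dum
```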

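In the new dataframe, we remove the original factor variables. Here they are excluded by column position using negative indexing, so the index numbers must match the column order of cs_dum.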

cs_dum <- cs_dum[ ,-c(6, 9:11, 13)]
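
Because column positions change whenever variables are added or reordered, it can be safer to drop columns by name or by type. The sketch below shows both options on a hypothetical toy data frame; the column names are illustrative and are not taken from the cs data.

```r
# Toy data frame mixing numeric, factor, and dummy columns (hypothetical)
toy_dum <- data.frame(Sales         = c(9.5, 11.2, 10.1),
                      ShelveLoc     = factor(c("Bad", "Good", "Medium")),
                      Urban         = factor(c("Yes", "Yes", "No")),
                      ShelveLoc.Bad = c(1, 0, 0))

# Option 1: drop columns by name
drop_cols <- c("ShelveLoc", "Urban")
toy_by_name <- toy_dum[ , !(names(toy_dum) %in% drop_cols)]

# Option 2: drop every column that is a factor
is_fac <- vapply(toy_dum, is.factor, logical(1))
toy_no_fac <- toy_dum[ , !is_fac]

names(toy_by_name)
names(toy_no_fac)
```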

For ordinal variables, another approach is to convert the ordered factor levels to integer values. We can use the as.numeric() function.

cs$ShelveLoc_int <- as.numeric(cs$ShelveLoc)
head(cs[ , c("ShelveLoc", "ShelveLoc_int")])

##   ShelveLoc ShelveLoc_int
## 1       Bad             1
## 2      Good             3
## 3    Medium             2
## 4    Medium             2
## 5       Bad             1
## 6       Bad             1
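
The integer values come from the order of the factor levels, not from the labels themselves. The sketch below, using hypothetical values, shows this on a small ordered factor and also illustrates a common pitfall with factors whose labels look like numbers (such as a discretized variable labeled 1 to 5).

```r
# Ordered factor with an explicit level order (hypothetical values)
shelf <- factor(c("Bad", "Good", "Medium"),
                levels = c("Bad", "Medium", "Good"),
                ordered = TRUE)
as.integer(shelf)             # codes follow the level order: 1 3 2

# Pitfall: for a factor whose labels look like numbers, as.numeric()
# returns the level codes, not the labels
f <- factor(c("10", "20", "10"))
as.numeric(f)                 # 1 2 1
as.numeric(as.character(f))   # 10 20 10
```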

Variable Transformation

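We can apply basic mathematical transformations to a numeric variable using functions such as exp(), log(), and sqrt(). Below, we take the natural logarithm of the Advertising variable.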

adv_log <- log(cs$Advertising)
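
One thing to watch for with log() is that it returns -Inf for zero values. The sketch below uses hypothetical values to show this, along with log1p(), which computes log(1 + x) and is one possible alternative when a variable contains zeros. Whether that substitution suits a given analysis is a judgment call.

```r
# Toy values including a zero (hypothetical)
adv <- c(0, 3, 10, 16)

log(adv)      # the first element is -Inf because log(0) is -Inf
log1p(adv)    # log(1 + x), which is 0 for x = 0
sqrt(adv)     # square root, defined at 0
```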

We can use the preProcess() function from the caret package (and the predict() function) to apply transformations to our data. Available transformations (method) include: BoxCox, YeoJohnson, expoTrans, center, scale, range, pca, ica. Note: the transformations will only be applied to numeric variables.

Standardization

Setting method = c("center", "scale") in the preProcess() function performs standardization (subtracting the mean and dividing by the standard deviation of each variable) on the numeric variables in the dataframe identified in the x argument. The preProcess() function only estimates the transformation; the predict() function applies it to create the transformed variables, which we save as a new dataframe.

cen_sc <- preProcess(x = cs,
                     method = c("center", "scale"))
cs_sc <- predict(object = cen_sc,
                 newdata = cs)
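
The arithmetic behind standardization can be checked by hand. The following base R sketch uses a hypothetical toy data frame and a helper named standardize() (introduced here for illustration only) to compute z-scores, then compares them with base R's scale() function.

```r
# Toy numeric data (hypothetical values)
toy <- data.frame(Price = c(120, 83, 80, 97, 128),
                  Age   = c(42, 65, 59, 55, 38))

# z-score by hand: subtract the column mean, divide by the column sd
standardize <- function(x) (x - mean(x)) / sd(x)
toy_sc <- as.data.frame(lapply(toy, standardize))

# Base R's scale() performs the same centering and scaling by default
toy_sc2 <- as.data.frame(scale(toy))

round(colMeans(toy_sc), 10)   # column means are approximately 0
sapply(toy_sc, sd)            # column standard deviations are 1
```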

Min-Max Normalization

Setting method = "range" in the preProcess() function performs min-max normalization (rescaling the values of each variable to lie between 0 and 1) on the numeric variables in the dataframe identified in the x argument. Using the predict() function creates the transformed variables, which we save as a new dataframe.

cen_ran <- preProcess(x = cs,
                      method = "range")
cs_ran <- predict(object = cen_ran,
                  newdata = cs)
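
Min-max normalization can also be written out directly. The base R sketch below uses a hypothetical toy data frame and a helper named min_max() (introduced here for illustration only); it assumes each column has at least two distinct values, since a constant column would lead to division by zero.

```r
# Toy numeric data (hypothetical values)
toy <- data.frame(Price = c(120, 83, 80, 97, 128),
                  Age   = c(42, 65, 59, 55, 38))

# Rescale each column so that its minimum maps to 0 and its maximum to 1
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
toy_ran <- as.data.frame(lapply(toy, min_max))

toy_ran
sapply(toy_ran, range)   # each column spans 0 to 1
```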