In almost all statistical and machine learning analyses, the raw (but tidy and clean!) data need some form of transformation (e.g., scaling, centering, standardisation or normalisation) before they can be used for modelling.
Data transformation is often a prerequisite for further statistical analysis. Below are the situations where we might need transformations:
We may need to change the scale of a variable or standardise its values for easier interpretation.
We may need to transform complex non-linear relationships into linear ones, since many methods are easier to apply and interpret on a linear scale.
In statistical inference, a symmetric (normal) distribution is preferred over a skewed one. Also, some statistical techniques (e.g., parametric tests and linear regression) require normally distributed variables and homogeneity of variances. So, whenever we have a skewed distribution and/or heterogeneous variances, we can use transformations to reduce the skewness and/or the heterogeneity of variances.
The logarithmic transformation compresses high values and spreads out low values by expressing them as orders of magnitude.
# log_salary <- log10(salary$salary)   # base-10 logarithm
# OR
# ln_salary <- log(salary$salary)      # natural logarithm
The square root transformation is milder and has the advantage that it can be applied to zero values.
# sqrt_salary <- sqrt(salary$salary)
The reciprocal transformation is a very strong transformation with a drastic effect on the distribution shape: it compresses large values into much smaller ones.
# recip_salary <- 1/salary$salary
The square transformation spreads out the high values relative to the smaller ones.
It reduces left skewness and works best when the distribution is only moderately left skewed.
# sq_salary <- salary$salary^2
The Box-Cox transformation estimates a power parameter (lambda) and applies the corresponding power transformation, which can bring many skewed distributions close to normal.
# install.packages("forecast")
# library(forecast)
# boxcox_salary <- BoxCox(salary$salary, lambda = "auto")
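If we want to inspect the estimated transformation parameter itself, the forecast package also provides BoxCox.lambda(). A minimal sketch, run on the same hypothetical salary column as above:
# lambda_hat <- BoxCox.lambda(salary$salary)   # estimated lambda value
# lambda_hat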
To reduce right skewness, taking roots, logarithms or reciprocals works well.
To reduce left skewness, taking squares, cubes or higher powers works well.
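A quick way to judge whether a transformation has helped is to compare the skewness of the variable before and after it is applied. The sketch below assumes the same hypothetical salary column and uses the e1071 package (any skewness function would do):
# install.packages("e1071")
# library(e1071)
# skewness(salary$salary)         # skewness of the original (right-skewed) values
# skewness(log(salary$salary))    # skewness after a log transformation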
Some statistical analysis methods are sensitive to the scale of the variables: in the same data set, one variable might range from 1 to 10 while another ranges from 1 to 10,000,000.
Especially for distance-based methods in machine learning, this can hurt prediction accuracy. In such cases, we may need to normalise or scale the variables so that they fall within a common range.
Centering (a.k.a. mean-centering) involves subtracting the variable's average from the data.
If we have more than one variable to center, we can calculate the average value of each variable and then subtract it from the data.
This implies that each column will be transformed in such a way that the resulting variable will have a zero mean.
# center_x <- scale(df, center = TRUE, scale = FALSE)
In the output, the new centered values for each column are given along with the column averages (returned as the "scaled:center" attribute).
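Equivalently, centering can be done by hand by subtracting the column means, which makes explicit what scale() is doing here. A minimal sketch, assuming df contains only numeric columns:
# col_means <- colMeans(df, na.rm = TRUE)               # average of each column
# center_x_manual <- sweep(df, 2, col_means, FUN = "-") # subtract each column's mean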
Scaling involves dividing the values of each variable by a measure of its spread, typically the standard deviation.
# scale_x1 <- scale(df, center = FALSE, scale = TRUE)
Note that with center = FALSE, scale = TRUE divides each column by its root mean square rather than its standard deviation. If we want to scale by the standard deviations without centering, we can use the following:
# scale_x2 <- scale(df, center = FALSE, scale = apply(df, 2, sd, na.rm = TRUE))
Usually, centering and scaling are not used individually.
Instead, they are applied together, and this combination is called z-score standardisation.
# z_x <- scale(df, center = TRUE, scale = TRUE)
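For a single variable, the same standardisation can be written out explicitly, which makes the z-score formula clear (x here is a hypothetical numeric column of df):
# z_manual <- (df$x - mean(df$x, na.rm = TRUE)) / sd(df$x, na.rm = TRUE)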
An alternative approach to z-score standardisation is the Min-Max normalisation technique, which rescales each variable to the [0, 1] range.
# minmaxnormalise <- function(x){ (x - min(x)) / (max(x) - min(x)) }
# lapply(df, minmaxnormalise)
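Note that lapply() returns a list; if a data frame is preferred, the result can be wrapped back up, as in this small usage note under the same assumptions as above:
# df_minmax <- as.data.frame(lapply(df, minmaxnormalise))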
Sometimes we may need to discretise numeric values because some analysis methods require discrete input or output variables (e.g., most versions of Naive Bayes and CHAID analysis).
Binning or discretisation methods transform numerical variables into categorical counterparts.
Binning is also useful for dealing with possible outliers: it controls or mitigates their impact on the model by placing them in the first or last category.
In equal-width binning, the variable is divided into n intervals of equal size.
# install.packages("infotheo")
# library(infotheo)
# ew_binned <- discretize(versicolor, disc = "equalwidth")
In the equal-depth (equal-frequency) binning method, the variable is divided into n intervals, each containing approximately the same number of observations.
# install.packages("infotheo")
# library(infotheo)
# ed_binned <- discretize(versicolor, disc = "equalfreq")
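Base R offers a similar facility without an extra package: cut() with a fixed number of breaks gives equal-width bins, while cutting at sample quantiles gives approximately equal-depth bins. A minimal sketch on a hypothetical numeric vector x (e.g., a single column of the data used above):
# x <- versicolor[[1]]                            # hypothetical: first numeric column
# ew_base <- cut(x, breaks = 4)                   # 4 equal-width intervals
# ed_base <- cut(x, breaks = quantile(x, probs = seq(0, 1, 0.25)),
#                include.lowest = TRUE)           # ~equal-frequency intervals
# table(ew_base)
# table(ed_base)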
For large data sets, a common problem called the “curse of dimensionality” occurs, as these data sets contain a huge number of variables (a.k.a. features or dimensions).
Mainly, there are two ways of reducing dimensionality: feature selection and feature extraction.
In feature selection (also called feature filtering), redundant features are filtered out and the ones that are most useful or most relevant for the problem are kept.
This method involves ranking the features according to an importance criterion and selecting those above a defined threshold; the technique is therefore also called feature ranking.
Features are ranked according to a statistical criterion (e.g., a chi-square test, a correlation test, entropy-based tests, random forest importance) and are either kept or removed from the data set.
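As a simple illustration of this ranking idea, one could score numeric features by the absolute value of their correlation with the target and keep only those above a chosen threshold. The sketch below uses the built-in mtcars data and a threshold of 0.5 purely as placeholders:
# target <- mtcars$mpg                                    # placeholder target variable
# predictors <- mtcars[, setdiff(names(mtcars), "mpg")]   # remaining numeric features
# scores <- sapply(predictors, function(x) abs(cor(x, target)))
# sort(scores, decreasing = TRUE)                         # ranked features
# selected <- names(scores)[scores > 0.5]                 # keep features above the threshold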
Feature extraction reduces the data from a high-dimensional space to a lower-dimensional space, i.e., a space with fewer dimensions.
Principal Component Analysis (PCA), the most common feature extraction method, is an unsupervised algorithm that creates linear combinations of the original features.
Note that the advantage of this technique is that it is fast and simple to implement, and it works well in practice.
However, the new principal components are not interpretable, because they are linear combinations of the original features.
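A common way to carry out PCA in R is prcomp(). A minimal sketch on the numeric columns of the built-in iris data (used here purely as a placeholder), with centering and scaling applied first as discussed above:
# pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)
# summary(pca)           # proportion of variance explained by each component
# head(pca$x[, 1:2])     # the first two principal components (the new features)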