In almost all statistical and machine learning analyses, it is necessary to perform some form of data transformation (e.g., rescaling, centering, standardisation or normalisation) on the raw (but tidy and clean!) data before it can be used for modelling.

Data Transformation

Data transformation is often a prerequisite for further statistical analysis. Below are situations where we might need transformations:

  • We may need to change the scale of a variable or standardise the values of a variable for better understanding.

  • We may need to transform complex non-linear relationships into linear ones; a suitable transformation can convert a non-linear relation into a linear one.

  • In statistical inference, a symmetric (normal) distribution is preferred over a skewed distribution. Also, some statistical techniques (e.g., parametric tests, linear regression) require normally distributed variables and homogeneity of variances. So, whenever we have a skewed distribution and/or heterogeneous variances, we can use transformations to reduce the skewness and/or the heterogeneity of variances.

Reduce Right Skewness

Log Transformation

This compresses high values and spreads low values by expressing the values as orders of magnitude.

  • It cannot be applied to zero or negative values directly.
  • In order to apply the log transformation to zero or negative values, we can add a constant large enough to make every observation positive and then take the logarithm (see the sketch below).
# log_salary <- log10(salary$salary)

#OR

# ln_salary <- log(salary$salary)
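
If the salary values contained zeros or negative values, a minimal sketch of the shift-then-log idea (shift and shifted_log_salary are hypothetical names) would be:

# shift <- 1 - min(salary$salary, na.rm = TRUE)      # makes the smallest value equal to 1
# shifted_log_salary <- log10(salary$salary + shift)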

Square Root Transformation

This has the advantage that it can be applied to zero values.

# sqrt_salary <- sqrt(salary$salary)

Reciprocal Transformation

This is a very strong transformation with a drastic effect on the distribution shape. It compresses large values into much smaller ones.

# recip_salary <- 1/salary$salary

Reduce Left Skewness

Square Transformation

It spreads out the high values relative to the smaller values.

It works to reduce left skewness only when the distribution is moderately left skewed.

# sq_salary <- salary$salary^2

Fix Either Skew

Box-Cox Transformation

This transformation estimates a power parameter (lambda) and applies the corresponding power transformation, making the distribution as close to normal as possible regardless of the direction of skew.

# install.packages("forecast")
# library(forecast)

# boxcox_salary <- BoxCox(salary$salary, lambda = "auto")
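
If we want to inspect the estimated power parameter, a possible variation (still using the forecast package; lambda_hat is a hypothetical name) is to estimate lambda explicitly with BoxCox.lambda() and pass it in:

# lambda_hat <- BoxCox.lambda(salary$salary)
# boxcox_salary <- BoxCox(salary$salary, lambda = lambda_hat)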

Right vs Left Skew

  • To reduce right skewness in the distribution, taking roots, logarithms or reciprocals works well.

  • To reduce left skewness, taking squares, cubes or higher powers works well.

Data Normalisation

Some statistical analysis methods are sensitive to the scale of the variables, and it is common to find data sets where the values of one variable range between 1 and 10 while the values of another range from 1 to 10,000,000.

Especially for distance-based methods in machine learning, this can hurt prediction accuracy. In such cases, we may need to normalise or scale the values of the different variables so that they fall within a common range.

Centering

Centering (a.k.a. mean-centering) involves the subtraction of the variable average from the data.

  • If we have more than one variable to center, we can calculate the average value of each variable and then subtract it from the data.

  • This implies that each column will be transformed in such a way that the resulting variable will have a zero mean.

# center_x <- scale(df, center = TRUE, scale = FALSE)

In the output, the new centred values of each column are given, along with the column (variable) averages stored as an attribute.
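
For example, those averages (and the fact that the centred columns now have zero means) can be checked as follows; this is just a quick sketch using the centred object from above:

# attr(center_x, "scaled:center")   # the column averages that were subtracted
# round(colMeans(center_x), 10)     # approximately zero for every column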

Scaling

Scaling involves dividing the values of each variable by its standard deviation.

# scale_x1 <- scale(df, center = FALSE, scale = TRUE)
  • Note that, when we scale values without centering, the scale() function divides the values by the column root-mean-square value instead of the standard deviation.
  • Therefore, in this output the new scaled variables are actually scaled by the column root-mean-square values.

If we want to scale by the standard deviations without centering, we can use the following:

# scale_x2 <- scale(df, center = FALSE, scale = apply(df, 2, sd, na.rm = TRUE))
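
As a quick check of the two variants above (a sketch, assuming df contains only numeric columns):

# rms <- apply(df, 2, function(x) sqrt(sum(x^2, na.rm = TRUE) / (sum(!is.na(x)) - 1)))
# all.equal(attr(scale_x1, "scaled:scale"), rms)                             # scaled by root-mean-square
# all.equal(attr(scale_x2, "scaled:scale"), apply(df, 2, sd, na.rm = TRUE))  # scaled by standard deviation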

When to Use Centering vs Scaling?

Usually they are not used individually.

Instead, they are used together, and the combination is called z-score standardisation.

z-score Standardisation

This is the combination of centering and scaling.

  • In the z-score transformation, the mean of the observations is first subtracted from each individual data point, and the result is then divided by the standard deviation of all points.
  • The resulting transformed data values would have a zero mean and one standard deviation.
# z_x <- scale(df, center = TRUE, scale = TRUE)
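
As a sanity check (again assuming an all-numeric df), the standardised columns should now have zero mean and unit standard deviation:

# round(colMeans(z_x), 10)   # approximately zero means
# apply(z_x, 2, sd)          # all equal to 1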

Min-Max Normalisation (aka 0-1 normalisation)

An alternative approach to z-score standardisation is the Min-Max normalisation technique.

  • In this approach, the data is scaled to a fixed range - usually 0 to 1. This is why sometimes this method is called (0-1) normalisation.
  • In contrast to z-score standardisation, this normalisation can suppress the effect of outliers.

Write a function

# minmaxnormalise <- function(x){ (x - min(x)) / (max(x) - min(x)) }

Use lapply() to apply it to a data frame

# lapply(df, minmaxnormalise)
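
Note that lapply() returns a list; if we want to keep the data frame structure, one option (assuming all columns of df are numeric; df_minmax is a hypothetical name) is:

# df_minmax <- as.data.frame(lapply(df, minmaxnormalise))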

Binning (aka Discretisation)

Sometimes we may need to discretise numeric values, as some analysis methods require discrete input or output variables (e.g., most versions of Naive Bayes and CHAID analysis).

  • Binning or discretisation methods transform numerical variables into categorical counterparts.

  • Binning is also useful for dealing with possible outliers. It controls or mitigates the impact of outliers on the model by placing them in the first or last category.

Equal width (distance) binning

In equal-width binning, the variable is divided into n intervals of equal size.

# install.packages("infotheo")
# library(infotheo)

# ew_binned <- discretize(versicolor, disc = "equalwidth")

Equal depth (frequency) binning

In equal-depth binning method, the variable is divided into n intervals, each containing approximately the same number of observations (frequencies).

# install.packages("infotheo")
# library(infotheo)

# ed_binned <- discretize(versicolor, disc = "equalfreq")
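
By default discretize() derives the number of bins from the sample size; if a specific number of bins is needed (4 here, purely as an illustration), it can be passed explicitly:

# ed_binned_4 <- discretize(versicolor, disc = "equalfreq", nbins = 4)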

Dimension Reduction

For large data sets, a common problem called the “curse of dimensionality” occurs, as these data sets have a huge number of variables (a.k.a. features/dimensions).

  • This high dimensionality increases the computational complexity and increases the risk of overfitting.

Mainly, there are two ways of reducing dimensions: feature selection and feature extraction.

Feature Selection

Feature Filtering

In feature filtering, redundant features are filtered out and the ones that are most useful or most relevant for the problem are selected.

  • Feature filtering methods include removing features with zero or near-zero variance and removing highly correlated variables (e.g., those with a correlation greater than 0.8), as sketched below.
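
As an illustration only, the caret package provides helpers for both filters; the sketch below assumes an all-numeric data frame df and a 0.8 correlation cutoff:

# install.packages("caret")
# library(caret)

# nzv_cols <- nearZeroVar(df)                            # zero/near-zero variance columns
# df_f <- if (length(nzv_cols) > 0) df[, -nzv_cols] else df
# high_cor <- findCorrelation(cor(df_f), cutoff = 0.8)   # highly correlated columns
# df_filtered <- if (length(high_cor) > 0) df_f[, -high_cor] else df_f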

Feature Ranking

In this technique, features are ranked according to a statistical importance criterion (e.g., a chi-square test, a correlation test, entropy-based measures, random forest importance) and those above a defined threshold are kept while the rest are removed from the data set (see the sketch below).
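
A minimal sketch of this idea, assuming a numeric data frame df, a numeric target y and an arbitrary threshold of 0.3 on the absolute correlation:

# importance <- sapply(df, function(x) abs(cor(x, y, use = "complete.obs")))
# ranked <- sort(importance, decreasing = TRUE)   # rank features by the criterion
# selected <- names(ranked[ranked > 0.3])         # keep those above the threshold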

Feature Extraction

Feature extraction reduces data in a high-dimensional space to a lower-dimensional space, i.e., a space with a smaller number of dimensions.

  • Note that feature extraction is different from feature selection. Both methods seek to reduce the number of attributes in the data set, but feature extraction methods do so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them.

Principal Component Analysis (PCA)

This method is an unsupervised algorithm that creates linear combinations of the original features.

  • The new extracted features are orthogonal, which means that they are uncorrelated.
  • The extracted components are ranked in order of their “explained variance”.
  • For example, the first principal component (PC1) explains the most variance in the data, PC2 explains the second-most variance, and so on.
  • Then you can decide to keep only as many principal components as needed to reach a target cumulative explained variance, for example 90%, as in the sketch below.
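
A minimal sketch with base R's prcomp(), assuming a numeric data frame df and a 90% cumulative-variance target:

# pca <- prcomp(df, center = TRUE, scale. = TRUE)
# cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)   # cumulative explained variance
# k <- which(cum_var >= 0.90)[1]                    # number of components to keep
# df_reduced <- pca$x[, 1:k]                        # scores on the retained components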

Note that this technique has the advantage of being fast and simple to implement, and it works well in practice.

However, the new principal components are not directly interpretable, because they are linear combinations of the original features.