5.1.1 Normalisation

Normalisation techniques

There are three main types of normalisation techniques:

Centring and scaling

Centring (also known as mean-centring) involves the subtraction of the variable average from the data.

Let \(y\) denote the variable at the original scale and \(\bar{y}\) be its average. The centred variable \(y'\) is defined as:

\[ y' = y - \bar{y} \]

If we have more than one variable to centre, we can calculate the average value of each variable and then subtract it from that variable's values. As a result, every transformed column has a mean of zero.
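As a minimal sketch of column-wise centring with a user-defined function, the example below uses a small made-up data frame. The names toy and centre are hypothetical and chosen only for this illustration.

```r
# Illustrative data (assumed values, not taken from the text)
toy <- data.frame(a = c(1, 2, 3, 6), b = c(10, 20, 30, 40))

# Hypothetical helper: subtract the variable's own average from each value
centre <- function(y) y - mean(y)

# Apply the helper to every column and keep the result as a data frame
toy_centred <- as.data.frame(lapply(toy, centre))
toy_centred

# Each centred column should now have a mean of zero
colMeans(toy_centred)
```

Applying the helper with lapply() treats each column separately, which mirrors the description above: one average per variable.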

We can use simple user-defined functions or built-in R functions to centre variables. One function that applies mean-centring is scale() in base R. The scale() function has the following arguments:

x: a numeric object
center: if TRUE, the column means of x are subtracted from the values in those columns (ignoring NAs); if FALSE, centring is not performed.
scale: if TRUE, the centred column values are divided by the column’s standard deviation (when center is also TRUE), or the uncentred values are divided by the column’s root-mean-square (when center is FALSE). If scale = FALSE, scaling is not performed.

Scaling involves dividing the values by their standard deviation (or root-mean-square value).

Let \(y\) denote the variable at the original scale and \(SD_y\) be the standard deviation of the variable. The scaled variable \(y'\) is defined as:

\[ y' = \frac{y}{SD_y} \]
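The scaling formula can be tried on a single vector as a quick check. This is only a sketch with made-up values; y and y_scaled are illustrative names.

```r
# Illustrative values (assumed for this sketch)
y <- c(2, 4, 6, 8)

# Divide every value by the sample standard deviation of the variable
y_scaled <- y / sd(y)
y_scaled

# After scaling, the standard deviation is 1; the mean is NOT zero,
# because no centring was applied
sd(y_scaled)
mean(y_scaled)
```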

z-score standardisation

You have already seen z-scores, which we used to detect outliers. In the z-score transformation, the mean of the observations is first subtracted from each individual data point, and the result is then divided by the standard deviation of all points. The transformed values have a mean of zero and a standard deviation of one. The z-score transformation can be applied using the following equation:

\[ z = \frac{\left( y - \bar{y} \right)}{SD_{y}} \]

In the equation above, \(y\) denotes the values of observations, and \(\bar{y}\) and \(SD_{y}\) are the sample mean and standard deviation, respectively.

The z-score transformation can also be applied using the scale() function with center = TRUE and scale = TRUE arguments.
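To illustrate that the equation and scale() agree, the sketch below computes z-scores by hand for one illustrative vector and compares them with the scale() result. The names z_manual and z_scale are hypothetical.

```r
# Illustrative values (assumed for this sketch)
y <- c(10, 20, 40, 50, 10)

# z-scores straight from the equation: subtract the mean, divide by the SD
z_manual <- (y - mean(y)) / sd(y)

# Same transformation via scale(); as.vector() drops the matrix shape
# and the "scaled:center" / "scaled:scale" attributes
z_scale <- as.vector(scale(y, center = TRUE, scale = TRUE))

# TRUE if both approaches give the same values (up to rounding error)
all.equal(z_manual, z_scale)
```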

Min–max normalisation (a.k.a. range or (0–1) normalisation)

An alternative approach to z-score standardisation is the min–max normalisation technique, which applies the following formula to each value of the feature being normalised:

\[ y' = \frac{y - y_{min}}{y_{max} - y_{min}} \]

where \(y\) is the variable at the original scale, \(y_{min}\) is the minimum value of the variable, and \(y_{max}\) is the maximum value of the variable.

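In this approach, the data are scaled to a fixed range, usually 0 to 1, which is why the method is sometimes called (0–1) normalisation. Because the range is bounded, the normalised values have smaller standard deviations than z-scores, which can suppress the effect of outliers. Bear in mind, however, that \(y_{min}\) and \(y_{max}\) are themselves set by the most extreme observations.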
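One way to explore how min–max normalisation behaves when an extreme value is present is to normalise the same values with and without one. The sketch below uses made-up vectors; min_max, y and y_out are illustrative names only.

```r
# Min-max formula as a small helper (hypothetical name)
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Illustrative values: the second vector replaces 5 with an extreme value
y     <- c(1, 2, 3, 4, 5)
y_out <- c(1, 2, 3, 4, 100)

# Without the extreme value, results are spread evenly over [0, 1]
min_max(y)

# With it, the result still spans [0, 1], but the four ordinary values
# are squeezed into a narrow band just above 0
min_max(y_out)
```

Both results stay inside the 0 to 1 range, so it is worth inspecting the data for extreme values before relying on the normalised scale.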

In R, min–max normalisation can be applied in many ways; the simplest is to write a function such as the following:

function(x) {(x - min(x)) / (max(x) - min(x))}


Practical Example

Let’s apply these normalisation techniques using R. Follow the steps below:

Start by applying centring and scaling in R.

Step 1: Create the data frame given below:

df <- data.frame(x1 = c(10, 20, 40, 50, 10), 
                 x2 = c(1000, 5000, 3000, 2000, 1500), 
                 x3 = c(0.1, 0.12, 0.11, 0.14, 0.16), 
                 x4 = c(2.5, 4.2, 3.2, 4.5, 3.8))

Step 2: To apply mean-centring, the scale() function can be used as follows:

centre_x <- scale(df, center = TRUE, scale = FALSE)
centre_x

##       x1    x2     x3    x4
## [1,] -16 -1500 -0.026 -1.14
## [2,]  -6  2500 -0.006  0.56
## [3,]  14   500 -0.016 -0.44
## [4,]  24  -500  0.014  0.86
## [5,] -16 -1000  0.034  0.16
## attr(,"scaled:center")
##       x1       x2       x3       x4 
##   26.000 2500.000    0.126    3.640

You can see in the output that the new centred values for each column are given along with the column (variable) averages.

Now, apply scaling in R.

Step 3: Use the same data frame.

In order to apply just scaling (without centring) to the data frame you can use the center = FALSE and scale = TRUE arguments as follows:

scale_x1 <- scale(df, center = FALSE, scale = TRUE)
scale_x1

##           x1        x2        x3        x4
## [1,] 0.29173 0.3113996 0.6997114 0.6027159
## [2,] 0.58346 1.5569979 0.8396537 1.0125628
## [3,] 1.16692 0.9341987 0.7696826 0.7714764
## [4,] 1.45865 0.6227992 0.9795960 1.0848887
## [5,] 0.29173 0.4670994 1.1195383 0.9161282
## attr(,"scaled:scale")
##           x1           x2           x3           x4 
##   34.2782730 3211.3081447    0.1429161    4.1478910

Note that when you scale values without centring, the scale() function divides the values by the root-mean-square value instead of the standard deviation. Therefore, in this output, the variables are scaled by the column root-mean-square values, which scale() computes as \(\sqrt{\sum y^2 / (n - 1)}\), where \(n\) is the number of non-missing values.
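The root-mean-square values reported by scale() can be reproduced by hand. The sketch below assumes the same data frame as in Step 1; the helper name rms is hypothetical.

```r
# Same data frame as in Step 1
df <- data.frame(x1 = c(10, 20, 40, 50, 10),
                 x2 = c(1000, 5000, 3000, 2000, 1500),
                 x3 = c(0.1, 0.12, 0.11, 0.14, 0.16),
                 x4 = c(2.5, 4.2, 3.2, 4.5, 3.8))

# Root-mean-square as used by scale(): note the n - 1 denominator
rms <- function(x) sqrt(sum(x^2) / (length(x) - 1))

# Column-wise root-mean-square values computed by hand
rms_manual <- sapply(df, rms)
rms_manual

# Values stored by scale() in the "scaled:scale" attribute
rms_scale <- attr(scale(df, center = FALSE, scale = TRUE), "scaled:scale")

# TRUE if the two agree
all.equal(rms_manual, rms_scale)
```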

Step 4: If you want to scale by the standard deviations without centring, you can use the following:

scale_x2 <- scale(df, center = FALSE, scale = apply(df, 2, sd, na.rm = TRUE)) 
scale_x2

##             x1        x2       x3       x4
## [1,] 0.5504819 0.6324555 4.152274 3.117701
## [2,] 1.1009638 3.1622777 4.982729 5.237738
## [3,] 2.2019275 1.8973666 4.567501 3.990658
## [4,] 2.7524094 1.2649111 5.813184 5.611863
## [5,] 0.5504819 0.9486833 6.643638 4.738906
## attr(,"scaled:scale")
##           x1           x2           x3           x4 
## 1.816590e+01 1.581139e+03 2.408319e-02 8.018728e-01

The output above now reports the values scaled by the standard deviation, along with the column standard deviations. Note that these differ from the root-mean-square values used in Step 3, so the scaled values differ as well.

Now, apply z-score standardisation in R.

Step 5: Apply the z-score transformation using scale() with the center = TRUE and scale = TRUE arguments. Use the same data frame created above.

z_x <- scale(df, center = TRUE, scale = TRUE) 
z_x

##              x1         x2         x3         x4
## [1,] -0.8807710 -0.9486833 -1.0795912 -1.4216719
## [2,] -0.3302891  1.5811388 -0.2491364  0.6983651
## [3,]  0.7706746  0.3162278 -0.6643638 -0.5487155
## [4,]  1.3211565 -0.3162278  0.5813184  1.0724893
## [5,] -0.8807710 -0.6324555  1.4117732  0.1995329
## attr(,"scaled:center")
##       x1       x2       x3       x4 
##   26.000 2500.000    0.126    3.640 
## attr(,"scaled:scale")
##           x1           x2           x3           x4 
## 1.816590e+01 1.581139e+03 2.408319e-02 8.018728e-01
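As a quick sanity check, the standardised columns should have a mean of zero and a standard deviation of one. The sketch below assumes the same data frame as in Step 1 and rebuilds z_x so that it runs on its own.

```r
# Same data frame as in Step 1
df <- data.frame(x1 = c(10, 20, 40, 50, 10),
                 x2 = c(1000, 5000, 3000, 2000, 1500),
                 x3 = c(0.1, 0.12, 0.11, 0.14, 0.16),
                 x4 = c(2.5, 4.2, 3.2, 4.5, 3.8))

# z-score standardisation of every column
z_x <- scale(df, center = TRUE, scale = TRUE)

# Column means: zero up to floating-point rounding error
round(colMeans(z_x), 10)

# Column standard deviations: one for every column
apply(z_x, 2, sd)
```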

You can also use functions from other packages (e.g. scores()) to get the same result. Try it for yourself!

Lastly, apply min–max normalisation in R.

Step 6: Apply min-max normalisation using the function provided:

min_max_norm <- function(x) {
  (x - min(x)) / (max(x) - min(x)) 
  
}

# Bear in mind that this function will output NA if any values in the variable are NA
# (Because min(1, 2, 3, 4, 5, NA) is NA) 
# Therefore, to use this function, you will need to exclude or impute missing values 
# before using the function, otherwise you will end up with NA. 

# Alternative approach might be to include the na.rm = TRUE argument for all min and max functions
# Better yet, allow user to specify whether to include or exclude missing values. By not excluding 
# as a default, it may remind the user to check missing values before proceeding if it outputs NA, 
# but also allows user to exclude missing values if they wish. 

# We can do this by adding an additional function argument (and setting a default value) 

min_max_norm <- function(x, na.rm = FALSE) {
  (x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm)) 
  
}

# This adds an na.rm argument to our function and sets it to FALSE by default. 
# Inside min() and max() we write na.rm = na.rm: the name on the left is the 
# argument of min() or max(), and the name on the right is the value the user 
# passed to the na.rm argument of our function.
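The effect of the na.rm argument can be seen on a small vector containing a missing value. This sketch repeats the function definition so that it runs on its own; the vector y is made up for illustration.

```r
# Min-max function with an na.rm argument, as defined above
min_max_norm <- function(x, na.rm = FALSE) {
  (x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm))
}

# Illustrative vector with one missing value
y <- c(1, 2, 3, 4, 5, NA)

# Default (na.rm = FALSE): min() and max() return NA, so every result is NA
min_max_norm(y)

# na.rm = TRUE: the NA is ignored when finding the minimum and maximum,
# but it stays NA in the output because NA minus a number is still NA
min_max_norm(y, na.rm = TRUE)
```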

Step 7: Then, use lapply() to apply this function to a data frame:

lapply(df, min_max_norm)

## $x1
## [1] 0.00 0.25 0.75 1.00 0.00
## 
## $x2
## [1] 0.000 1.000 0.500 0.250 0.125
## 
## $x3
## [1] 0.0000000 0.3333333 0.1666667 0.6666667 1.0000000
## 
## $x4
## [1] 0.00 0.85 0.35 1.00 0.65

# Note, if we assigned this to an object, we would have a list: 
min_max_norm_x <- lapply(df, min_max_norm)

Step 8: If you would like to store the normalised values as a data frame, you may also use the as.data.frame() function:

min_max_norm_df <- as.data.frame(lapply(df, min_max_norm)) 

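# Alternative solution using the pipe operator 
# (%>% comes from the magrittr package and is also available after loading dplyr): 
library(magrittr)
min_max_norm_df <- lapply(df, min_max_norm) %>% as.data.frame()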

min_max_norm_df

##     x1    x2        x3   x4
## 1 0.00 0.000 0.0000000 0.00
## 2 0.25 1.000 0.3333333 0.85
## 3 0.75 0.500 0.1666667 0.35
## 4 1.00 0.250 0.6666667 1.00
## 5 0.00 0.125 1.0000000 0.65
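Another base R option, shown here only as a sketch, is to assign the lapply() result back into the columns of a copy of the data frame with the empty-bracket form df_norm[] <- ..., which keeps the data frame structure without calling as.data.frame(). The name df_norm is hypothetical.

```r
# Same data frame as in Step 1
df <- data.frame(x1 = c(10, 20, 40, 50, 10),
                 x2 = c(1000, 5000, 3000, 2000, 1500),
                 x3 = c(0.1, 0.12, 0.11, 0.14, 0.16),
                 x4 = c(2.5, 4.2, 3.2, 4.5, 3.8))

# Min-max function as defined in Step 6
min_max_norm <- function(x, na.rm = FALSE) {
  (x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm))
}

# Work on a copy so the original df is left untouched
df_norm <- df

# Empty brackets replace the column contents but keep the data frame itself
df_norm[] <- lapply(df_norm, min_max_norm)
df_norm
```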