The goal of this tutorial is to learn the basics of normalization.
library(ggplot2)
# In this tutorial we are going to use the iris dataset
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# We plot petal length against petal width, with points colored by species
ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length)) + geom_point(aes(color = Species))
# Variables measured on different scales are hard to compare directly, so we often normalize them first
# Imagine that we have the age and the salary of a person: salaries are numerically much larger than ages
# Without normalization, scale-sensitive predictive models (for example, distance-based ones) can give salary far more weight than age
# Min-max normalization rescales each variable to the range [0, 1]: (x - min(x)) / (max(x) - min(x))
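As a minimal sketch of the formula above, the snippet below wraps it in a small helper and applies it to an age/salary table. The name `min_max` and the values in `people` are assumptions made up for illustration; they are not part of the iris example.

```r
# Min-max scaling: the smallest value maps to 0 and the largest to 1.
# `min_max` is a helper name chosen for this sketch, not a base R function.
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Hypothetical values, invented purely for illustration.
people <- data.frame(
  age    = c(25, 40, 55),
  salary = c(20000, 35000, 80000)
)

# Before scaling, the salary column is numerically far larger than age;
# afterwards both columns lie between 0 and 1.
people_norm <- as.data.frame(lapply(people, min_max))
print(people_norm)
```

After scaling, a person halfway through the age range gets 0.5 for age regardless of the units, so age and salary can be compared on equal footing.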
# We normalize only the four numeric columns; Species is a factor, so we add it back afterwards
iris_norm <- as.data.frame(apply(iris[, 1:4], 2, function(x) (x - min(x))/(max(x)-min(x))))
iris_norm$Species <- iris$Species
str(iris_norm)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 0.2222 0.1667 0.1111 0.0833 0.1944 ...
## $ Sepal.Width : num 0.625 0.417 0.5 0.458 0.667 ...
## $ Petal.Length: num 0.0678 0.0678 0.0508 0.0847 0.0678 ...
## $ Petal.Width : num 0.0417 0.0417 0.0417 0.0417 0.0417 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris_norm)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.2222 1st Qu.:0.3333 1st Qu.:0.1017 1st Qu.:0.08333
## Median :0.4167 Median :0.4167 Median :0.5678 Median :0.50000
## Mean :0.4287 Mean :0.4406 Mean :0.4675 Mean :0.45806
## 3rd Qu.:0.5833 3rd Qu.:0.5417 3rd Qu.:0.6949 3rd Qu.:0.70833
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
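The column-wise result can be checked, and the same scaling can be written without `apply()`. The sketch below is one possible alternative, not the tutorial's method: it uses `lapply()`, which works on the data frame column by column instead of converting it to a matrix first. The names `min_max`, `via_apply` and `via_lapply` are assumptions introduced here.

```r
# Helper for min-max scaling (name chosen for this sketch).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Version used in the tutorial: apply() converts the columns to a matrix first.
via_apply <- as.data.frame(apply(iris[, 1:4], 2, min_max))

# Alternative: lapply() scales each column of the data frame in place,
# so the Species column is kept without having to add it back.
via_lapply <- iris
via_lapply[1:4] <- lapply(iris[1:4], min_max)

# Compare the numeric values produced by the two approaches.
same <- isTRUE(all.equal(via_apply, via_lapply[1:4], check.attributes = FALSE))
print(same)

# Show the minimum and maximum of every normalized column.
print(sapply(via_lapply[1:4], range))
```

Printing the ranges is a quick sanity check that every normalized column spans 0 to 1, which matches the `Min.` and `Max.` rows of the summary.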
# Repeating the previous plot shows the same point pattern, but both axes now run from 0 to 1
ggplot(data = iris_norm, aes(x = Petal.Width, y = Petal.Length)) + geom_point(aes(color = Species))
# We can also normalize by rows, which rescales the four measurements of each flower relative to one another
# The actual values no longer matter, only how each variable compares with the others in the same row: the largest becomes 1 and the smallest becomes 0
# When apply() works by rows (MARGIN = 1), it returns a matrix in which each original row has become a column, so the result is transposed
# We therefore transpose it back with t() to keep the original structure
iris_norm <- as.data.frame(t(apply(iris[1:4], 1, function(x) (x - min(x))/(max(x)-min(x)))))
# Sepal.Length is now always 1 because it is the largest value in every row
# Likewise, Petal.Width is always 0 because it is the smallest value in every row
summary(iris_norm)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :1 Min. :0.05556 Min. :0.1786 Min. :0
## 1st Qu.:1 1st Qu.:0.24111 1st Qu.:0.2766 1st Qu.:0
## Median :1 Median :0.32321 Median :0.6273 Median :0
## Mean :1 Mean :0.39693 Mean :0.5560 Mean :0
## 3rd Qu.:1 3rd Qu.:0.62908 3rd Qu.:0.7442 3rd Qu.:0
## Max. :1 Max. :0.78431 Max. :0.9211 Max. :0
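The pattern in this summary can be confirmed directly on the original data. The sketch below counts, for each flower, which measurement is the largest and which is the smallest. It also notes a caveat of the formula: a row whose values are all equal would make the denominator zero. The names `min_max`, `iris_rows`, `largest` and `smallest` are assumptions introduced for this example.

```r
# Row-wise min-max scaling, written as in the tutorial.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
iris_rows <- as.data.frame(t(apply(iris[1:4], 1, min_max)))

# For each flower, find which measurement is largest and which is smallest
# in the original (unscaled) data.
largest  <- names(iris)[apply(iris[1:4], 1, which.max)]
smallest <- names(iris)[apply(iris[1:4], 1, which.min)]

print(table(largest))
print(table(smallest))

# Caveat: if all values in a row were equal, max(x) - min(x) would be 0
# and the formula would return NaN. This checks whether that happened.
print(anyNA(iris_rows))
```

If a dataset can contain constant rows (or constant columns, for the column-wise version), the helper would need a guard for the zero-range case before dividing.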
In this tutorial we have learnt how to normalize a table by rows or by columns using the apply function.