1 Goal


The goal of this tutorial is to learn the basics of min-max normalization in R: rescaling the numeric columns (or rows) of a data frame to the range from 0 to 1.


2 Data preparation


library(ggplot2)

# In this tutorial we are going to use the iris dataset
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# We plot petal length against petal width, colored by species
ggplot(data = iris, aes(x = Petal.Width, y = Petal.Length)) + geom_point(aes(color = Species))


3 Why normalize data


# Variables measured on different scales are hard to compare directly, so we sometimes need to normalize them first
# Imagine that we have the age and the salary of a person: age spans tens of units, salary spans thousands
# Without normalization, the variable with the larger range (salary) can dominate scale-sensitive models, such as those based on distances
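
The age and salary case can be illustrated with a small sketch. The data below are hypothetical values made up for illustration, and `min_max` is a helper name introduced here, not part of the original tutorial. It compares Euclidean distances between people before and after rescaling.

```r
# Hypothetical data: four people described by age (years) and salary (currency units)
people <- data.frame(
  age    = c(25, 60, 27, 45),
  salary = c(30000, 30500, 32000, 60000)
)

# Min-max normalization: rescales a numeric vector to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# Euclidean distances between people on the raw scale
dist_raw <- as.matrix(dist(people))

# The same distances after rescaling every column to [0, 1]
people_norm <- as.data.frame(lapply(people, min_max))
dist_norm <- as.matrix(dist(people_norm))

# Raw scale: person 1 looks closer to person 2 (35 years older)
# than to person 3 (2 years older), because salary dominates
dist_raw[1, 2] < dist_raw[1, 3]

# Normalized scale: person 1 is closer to person 3
dist_norm[1, 3] < dist_norm[1, 2]
```

Both comparisons evaluate to TRUE for these values: on the raw scale a salary gap of 2000 outweighs an age gap of 35 years, while after normalization both variables contribute on a comparable scale.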

4 Normalizing the dataframe by columns


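# Min-max normalization rescales each value with (x - min(x)) / (max(x) - min(x)), so every column ends up in the range [0, 1]
# We normalize only the numeric columns (1 to 4); Species is a factor and is added back afterwards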
iris_norm <- as.data.frame(apply(iris[, 1:4], 2, function(x) (x - min(x))/(max(x)-min(x))))
iris_norm$Species <- iris$Species
str(iris_norm)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  0.2222 0.1667 0.1111 0.0833 0.1944 ...
##  $ Sepal.Width : num  0.625 0.417 0.5 0.458 0.667 ...
##  $ Petal.Length: num  0.0678 0.0678 0.0508 0.0847 0.0678 ...
##  $ Petal.Width : num  0.0417 0.0417 0.0417 0.0417 0.0417 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(iris_norm)
##   Sepal.Length     Sepal.Width      Petal.Length     Petal.Width     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.2222   1st Qu.:0.3333   1st Qu.:0.1017   1st Qu.:0.08333  
##  Median :0.4167   Median :0.4167   Median :0.5678   Median :0.50000  
##  Mean   :0.4287   Mean   :0.4406   Mean   :0.4675   Mean   :0.45806  
##  3rd Qu.:0.5833   3rd Qu.:0.5417   3rd Qu.:0.6949   3rd Qu.:0.70833  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
# We can now repeat the previous plot: both axes run from 0 to 1, while the relative positions of the points are unchanged
ggplot(data = iris_norm, aes(x = Petal.Width, y = Petal.Length)) + geom_point(aes(color = Species))
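
The column-wise step can also be wrapped in a reusable function. The following is a minimal sketch, not the tutorial's own method: `min_max_normalize` and `iris_norm2` are hypothetical names, and returning zeros for a constant column is an assumption about how that edge case should be handled (the plain formula would divide by zero there).

```r
# Hypothetical helper; the name min_max_normalize is introduced here for illustration
min_max_normalize <- function(x) {
  rng <- range(x, na.rm = TRUE)
  # A constant column has max == min, which would divide by zero; return zeros instead
  if (rng[1] == rng[2]) {
    return(rep(0, length(x)))
  }
  (x - rng[1]) / (rng[2] - rng[1])
}

data("iris")

# Select the numeric columns by type instead of by position
num_cols <- sapply(iris, is.numeric)

# Replace only the numeric columns, so Species is kept as it is
iris_norm2 <- iris
iris_norm2[num_cols] <- lapply(iris[num_cols], min_max_normalize)

# Every numeric column should now have minimum 0 and maximum 1
sapply(iris_norm2[num_cols], range)
```

Using `lapply` on the data frame keeps each column as a separate vector, so the factor column does not need to be removed and added back.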


5 Normalizing the dataframe by rows


# Normalizing by rows shows how the variables compare within each observation, for example which one is the largest
# The actual values matter less than their size relative to the other variables in the same row

# With MARGIN = 1, apply() returns its results as columns, so the output is transposed (one column per original row)
# We transpose it back with t() to keep the original structure

# Note that this overwrites iris_norm and does not include the Species column
iris_norm <- as.data.frame(t(apply(iris[, 1:4], 1, function(x) (x - min(x))/(max(x)-min(x)))))

# Sepal.Length is always 1 because it is the maximum value in every row
# Likewise, Petal.Width is always 0 because it is the minimum value in every row
summary(iris_norm)
##   Sepal.Length  Sepal.Width       Petal.Length     Petal.Width
##  Min.   :1     Min.   :0.05556   Min.   :0.1786   Min.   :0   
##  1st Qu.:1     1st Qu.:0.24111   1st Qu.:0.2766   1st Qu.:0   
##  Median :1     Median :0.32321   Median :0.6273   Median :0   
##  Mean   :1     Mean   :0.39693   Mean   :0.5560   Mean   :0   
##  3rd Qu.:1     3rd Qu.:0.62908   3rd Qu.:0.7442   3rd Qu.:0   
##  Max.   :1     Max.   :0.78431   Max.   :0.9211   Max.   :0
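
The two points above, the transposed output of `apply` and the fact that one variable is always the largest in its row, can be checked directly. This is a small sketch; `min_max`, `by_rows_raw`, and `by_rows` are names introduced here for illustration.

```r
data("iris")
min_max <- function(x) (x - min(x)) / (max(x) - min(x))

# With MARGIN = 1, apply() returns one column per input row: 4 x 150 here
by_rows_raw <- apply(iris[, 1:4], 1, min_max)
dim(by_rows_raw)

# Transposing restores the original orientation: 150 x 4
by_rows <- as.data.frame(t(by_rows_raw))
dim(by_rows)

# Count which variable holds the largest and the smallest value in each row
table(names(iris)[apply(iris[, 1:4], 1, which.max)])
table(names(iris)[apply(iris[, 1:4], 1, which.min)])
```

For iris, the two tables should each contain a single entry, matching the summary above: Sepal.Length is the largest value and Petal.Width the smallest in all 150 rows.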

6 Conclusion


In this tutorial we have learnt how to apply min-max normalization to a data frame by columns or by rows using the apply function.