1 Goal


The goal of this exercise is to find which variables are highly correlated without having to read the whole correlation matrix. To do so, we list the pairs of variables whose correlation is higher than 0.85. We will use the iris dataset for this example. This method is most useful when there are too many variables to inspect the correlation matrix by eye.


2 Checking the dataset


# First we load the libraries
library(ggplot2)
library(reshape2)
library(corrplot)

# Then we load the dataset
data("iris")
str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

3 Now we keep only the numerical variables


iris_set <- iris[, -5]
str(iris_set)
## 'data.frame':    150 obs. of  4 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
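
Dropping column 5 by position works for iris, but on a wider dataset it may be more robust to select the numerical columns by type. The following is a minimal sketch using base R only; the names `is_num` and `iris_numeric` are illustrative and not part of the original exercise.

```r
data("iris")

# Flag numeric columns by type instead of hard-coding the position of Species
is_num <- vapply(iris, is.numeric, logical(1))
iris_numeric <- iris[, is_num, drop = FALSE]

# Inspect the result; it should contain the same four columns as iris[, -5]
str(iris_numeric)
```

The `drop = FALSE` argument keeps the result a data frame even if only one numeric column is found.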

4 Now we find the high correlations with a loop


cor_matrix <- cor(iris_set)
cor_matrix
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
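# For each row, report the other variables whose absolute correlation exceeds 0.85
for (i in 1:nrow(cor_matrix)){
  # Exclude the diagonal, where every variable correlates perfectly with itself
  correlations <- which((abs(cor_matrix[i,]) > 0.85) & (cor_matrix[i,] != 1))

  if(length(correlations) > 0){
    print(colnames(iris_set)[i])
    print(correlations)
  }
}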
## [1] "Sepal.Length"
## Petal.Length 
##            3 
## [1] "Petal.Length"
## Sepal.Length  Petal.Width 
##            1            4 
## [1] "Petal.Width"
## Petal.Length 
##            3
corrplot(cor_matrix, method = "ellipse")

Figure 1: Correlation plot. The plot shows which pairs of variables are most strongly linearly correlated. Because the correlation matrix is symmetric, we only need to look above the diagonal, where we find the same two highly correlated pairs that the loop reported.
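
The symmetry argument can also be applied in code. As a rough sketch, assuming base R only, the upper triangle of the matrix can be masked with `upper.tri()` and the matching cells collected in a small table; the names `high`, `idx` and `high_pairs` are illustrative.

```r
data("iris")
cor_matrix <- cor(iris[, -5])

# Keep only cells strictly above the diagonal with absolute correlation over 0.85
high <- upper.tri(cor_matrix) & abs(cor_matrix) > 0.85

# arr.ind = TRUE returns one (row, col) index pair per matching cell
idx <- which(high, arr.ind = TRUE)

# Collect the variable names and the correlation of each matching pair
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[idx[, "row"]],
  var2 = colnames(cor_matrix)[idx[, "col"]],
  correlation = cor_matrix[idx]
)
high_pairs
```

Because the mask excludes the diagonal and the lower triangle, each pair appears once and no variable is paired with itself.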


5 Using a user-defined function to solve the problem


# We repeat the same exercise with a user-defined function, corr_check, that
# prints each pair of variables whose absolute correlation exceeds the threshold.
corr_check <- function(Dataset, threshold){
  cor_matrix <- cor(Dataset)

  # Stop at the second-to-last row: the last variable has no later partner
  for (i in seq_len(nrow(cor_matrix) - 1)){
    # Look only above the diagonal so that each pair is reported once
    later_cols <- (i + 1):ncol(cor_matrix)
    correlations <- later_cols[which(abs(cor_matrix[i, later_cols]) > threshold)]

    for (j in correlations){
      cat(colnames(cor_matrix)[i], "with", colnames(cor_matrix)[j], "\n")
    }
  }
}

corr_check(iris_set, 0.85)
## Sepal.Length with Petal.Length 
## Petal.Length with Petal.Width 
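
The function compares the absolute value of each correlation with the threshold, which matters when variables are strongly but negatively related. The sketch below illustrates this with a hypothetical transformation of iris, negating one variable so that its correlations change sign; `iris_flipped` is an illustrative name and not part of the original exercise.

```r
data("iris")
iris_flipped <- iris[, -5]

# Hypothetical transformation: negate one variable so its correlations change sign
iris_flipped$Petal.Width <- -iris_flipped$Petal.Width

r <- cor(iris_flipped$Petal.Length, iris_flipped$Petal.Width)

signed_hit   <- r > 0.85        # signed comparison misses a strong negative relationship
absolute_hit <- abs(r) > 0.85   # comparing the magnitude flags it

c(signed = signed_hit, absolute = absolute_hit)
```

The strength of the linear relationship is unchanged by the sign flip, so a check on the magnitude is usually the safer choice when screening for redundant variables.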

6 Conclusion


In this tutorial we have learnt how to create correlation plots and how to find the most highly correlated variables directly. This approach becomes more useful as the number of variables grows and correlations become harder to spot in a correlation plot.