The goal of this exercise is to find which pairs of variables in a correlation matrix are highly correlated. We do so by listing every pair of variables whose correlation is higher than 0.85. We will use the iris dataset for this example. This method is useful when there are too many variables to inspect the correlation matrix by eye.
# First we load the libraries
library(ggplot2)
library(reshape2)
library(corrplot)
# Then we load the dataset
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
# Drop the fifth column (Species, a factor) so that only numeric variables remain
iris_set <- iris[, -5]
str(iris_set)
## 'data.frame': 150 obs. of 4 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
# Compute the pairwise correlation matrix (Pearson by default)
cor_matrix <- cor(iris_set)
cor_matrix
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
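# For each variable, print the other variables whose correlation with it exceeds 0.85
for (i in 1:nrow(cor_matrix)){
  # Exclude the diagonal, where every variable has correlation 1 with itself
  correlations <- which((cor_matrix[i,] > 0.85) & (cor_matrix[i,] != 1))
  if(length(correlations)> 0){
    print(colnames(iris_set)[i])
    print(correlations)
  }
}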
## [1] "Sepal.Length"
## Petal.Length
## 3
## [1] "Petal.Length"
## Sepal.Length Petal.Width
## 1 4
## [1] "Petal.Width"
## Petal.Length
## 3
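The same pairs can also be collected without a loop. The sketch below is one possible alternative, not the approach used in this exercise: it masks the matrix with upper.tri() so that each pair appears only once, and stores the result in a data frame. The names high and high_pairs are illustrative.

```r
data("iris")
cor_matrix <- cor(iris[, -5])

# Keep only the cells above the diagonal: the matrix is symmetric, so this
# visits each pair once and skips the self-correlations on the diagonal.
high <- which(abs(cor_matrix) > 0.85 & upper.tri(cor_matrix), arr.ind = TRUE)

# Translate the row/column positions back into variable names
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  correlation = cor_matrix[high]
)
high_pairs
```

Returning a data frame instead of printing makes it easier to sort or filter the pairs afterwards.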
corrplot(cor_matrix, method = "ellipse")
Figure 1: Correlation plot.
The plot shows which pairs of variables are most strongly linearly correlated. Because the correlation matrix is symmetric, we only need to look above the diagonal, where we find the same 2 highly correlated pairs that the code reports.
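Since only the cells above the diagonal carry distinct information, the plot itself can be limited to that triangle. A minimal sketch, assuming the corrplot package is installed; type and diag are existing arguments of corrplot():

```r
library(corrplot)

data("iris")
cor_matrix <- cor(iris[, -5])

# Draw only the upper triangle and hide the diagonal of self-correlations
corrplot(cor_matrix, method = "ellipse", type = "upper", diag = FALSE)
```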
# We repeat the same exercise by defining a user function called corr_check that automatically prints the pairs of variables that are highly correlated.
corr_check <- function(Dataset, threshold){
  cor_matrix <- cor(Dataset)
  for (i in seq_len(nrow(cor_matrix) - 1)){
    # Only look to the right of the diagonal, so each pair is reported once
    # and a variable is never compared with itself
    cols <- (i + 1):ncol(cor_matrix)
    # Index cols directly so that the positions refer to the full matrix
    correlations <- cols[abs(cor_matrix[i, cols]) > threshold]
    for (j in correlations){
      cat(colnames(cor_matrix)[i], "with", colnames(cor_matrix)[j], "\n")
    }
  }
}
corr_check(iris_set, 0.85)
## Sepal.Length with Petal.Length
## Petal.Length with Petal.Width
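Note that corr_check compares absolute values, whereas the earlier loop compared signed values. At 0.85 the two agree on iris, but at a lower cutoff they differ. The sketch below uses an illustrative threshold of 0.4, which is not part of the original exercise, to show the effect:

```r
data("iris")
cor_matrix <- cor(iris[, -5])
upper <- upper.tri(cor_matrix)  # TRUE only above the diagonal

threshold <- 0.4  # illustrative value, chosen to expose a negative correlation

# Signed comparison: strongly negative pairs are missed
signed_hits <- sum(cor_matrix[upper] > threshold)

# Absolute comparison: strong pairs are kept regardless of sign
absolute_hits <- sum(abs(cor_matrix[upper]) > threshold)

cat("signed:", signed_hits, " absolute:", absolute_hits, "\n")
```

The extra pair found by the absolute comparison is Sepal.Width with Petal.Length, whose correlation is -0.4284401 in the matrix printed earlier.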
In this tutorial we have learnt how to create correlation plots and how to directly find the most correlated variables. This exercise becomes more useful as the number of variables grows and strong correlations become harder to spot in a correlation plot.