Using Mahalanobis Distance to Find Outliers

Question: 3. Using the Mahalanobis distance find the outliers in the following multidimensional dataset: This dataset contains the height (cm), weight (Kg) and the blood pressure (systolic) values of 25 patients diagnosed with type II diabetes and are under controlled treatment with the metformin drug.

R’s mahalanobis() function provides a simple means of detecting outliers in multidimensional data.

reading the data:

setwd("F:\\DATA SCIENCE\\STUDY MATERIAL\\R\\working_directory")
drug = read.csv("drug.csv")
head(drug)
##   Height Weight Blood.Pressure
## 1    164     70            100
## 2    167     72            125
## 3    180     76            120
## 4    168     68            139
## 5    156     54            140
## 6    147     67            158
dim(drug)
## [1] 25  3
summary(drug)
##      Height          Weight      Blood.Pressure 
##  Min.   :134.0   Min.   :54.00   Min.   :100.0  
##  1st Qu.:154.0   1st Qu.:70.00   1st Qu.:125.0  
##  Median :167.0   Median :76.00   Median :136.0  
##  Mean   :166.6   Mean   :74.92   Mean   :136.5  
##  3rd Qu.:179.0   3rd Qu.:82.00   3rd Qu.:148.0  
##  Max.   :196.0   Max.   :89.00   Max.   :167.0
names(drug)
## [1] "Height"         "Weight"         "Blood.Pressure"

Now let’s give mahalanobis() a spin:

we will find 1 outlier at a time and if makes sense we will exclude it from data set and re-calculate mahalanobis() to find nex outlier for the remaining data set.

the Mahalanobis distance equation only accounts for linear relationships with normal distributions of variables.

let’s first check the linearity relationships among variables in “drug” dataframe.

Linear relationship check:

library(GGally)
ggpairs(drug)

based on above plot, none of the variable have a strong linear relationship with other variables.

qqplot(mahalanobis(drug, colMeans(drug), cov(drug)), drug$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")

qqplot(mahalanobis(drug, colMeans(drug), cov(drug)), drug$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")

qqplot(mahalanobis(drug, colMeans(drug), cov(drug)), drug$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")

above three plots are plots of each variable in data set vs. the mahalanobis_distance.

from the plots, it is very clear that there is no linear relationship among variables. and because of that method of finding outliers using mahalanobis_distance calculations must be applied very carefully.

Normality check:

qqnorm(mahalanobis(drug, colMeans(drug), cov(drug)), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug, colMeans(drug), cov(drug)), col="red", lwd=2)

hist(mahalanobis(drug, colMeans(drug), cov(drug)),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")

based on above normal Q-Q plot and histogram of mahalanobis_distance, assumption of normal distribution is also not quite satisfying.

below is the process of identifying outlier using mahalanobis():

first, we will find the one most extreme outlier.

Finding 1st outlier:

n.outliers   <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug, colMeans(drug), cov(drug)), decreasing=TRUE)
is.outlier   <- rep(FALSE, nrow(drug))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1

plot3d(drug$Height, drug$Weight, drug$Blood.Pressure, type="s", col=col)

plot(drug, col=col, pch=19, cex=2)

As you can see from the above code, the mahalanobis() function calculates the Mahalanobis distance of a dataframe using a supplied vector of means and a supplied covariance matrix.

The above code marks as outlier the most extreme point according to their Mahalanobis distance (also known as the generalised squared distance).

This is the distance of each point (the rows of the dataframe) from the centre of the data that the dataframe comprises, normalised by the standard deviation of each of the variables (the columns of the dataframe) and adjusted for the covariances of those variables.

It may be thought of as the multidimensional analogue of the t-statistic.

you can see the outlier in red color data point in above plots.

now, we will check linear plots of variables, normal q-q plot and histogram of mahalanobis_distance to make sure that after removing the outlier that we had detected in earlier step is improving things.

qqplot(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), drug[-25,]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")

qqplot(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), drug[-25,]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")

qqplot(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), drug[-25,]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")

qqnorm(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), col="red", lwd=2)

hist(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")

based on 5 above plots, it looks like the removal of one found outlier using mahalanobis_distance is improving linearity as well as normality plots.

now we will try to find one more outlier and repeat the above check process again until we see no improvement in plots.

Finding 2nd outlier:

n.outliers   <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), decreasing=TRUE)
is.outlier   <- rep(FALSE, nrow(drug[-25,]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1

plot3d(drug[-25,]$Height, drug[-25,]$Weight, drug[-25,]$Blood.Pressure, type="s", col=col)

plot(drug[-25,], col=col, pch=19, cex=2)

you can see the outlier in red color data point in above plots.

now, we will check linear plots of variables, normal q-q plot and histogram of mahalanobis_distance to make sure that after removing the outlier that we had detected in earlier step is improving things.

qqplot(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), drug[-c(24:25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")

qqplot(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), drug[-c(24:25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")

qqplot(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), drug[-c(24:25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")

qqnorm(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), col="red", lwd=2)

hist(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")

based on 5 above plots, it looks like the removal of one found outlier using mahalanobis_distance is improving linearity as well as normality plots.

now we will try to find one more outlier and repeat the above check process again until we see no improvement in plots.

Finding 3rd outlier:

n.outliers   <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), decreasing=TRUE)
is.outlier   <- rep(FALSE, nrow(drug[-c(24:25),]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1

plot3d(drug[-c(24:25),]$Height, drug[-c(24:25),]$Weight, drug[-c(24:25),]$Blood.Pressure, type="s", col=col)

plot(drug[-c(24:25),], col=col, pch=19, cex=2)

you can see the outlier in red color data point in above plots.

now, we will check linear plots of variables, normal q-q plot and histogram of mahalanobis_distance to make sure that after removing the outlier that we had detected in earlier step is improving things.

qqplot(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), drug[-c(5,24,25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")

qqplot(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), drug[-c(5,24,25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")

qqplot(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), drug[-c(5,24,25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")

qqnorm(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), col="red", lwd=2)

hist(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")

based on 5 above plots, it looks like the removal of one found outlier using mahalanobis_distance is improving linearity as well as normality plots.

now we will try to find one more outlier and repeat the above check process again until we see no improvement in plots.

Finding 4th outlier:

n.outliers   <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), decreasing=TRUE)
is.outlier   <- rep(FALSE, nrow(drug[-c(5,24,25),]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1

plot3d(drug[-c(5,24,25),]$Height, drug[-c(5,24,25),]$Weight, drug[-c(5,24,25),]$Blood.Pressure, type="s", col=col)

plot(drug[-c(5,24,25),], col=col, pch=19, cex=2)

you can see the outlier in red color data point in above plots.

now, we will check linear plots of variables, normal q-q plot and histogram of mahalanobis_distance to make sure that after removing the outlier that we had detected in earlier step is improving things.

qqplot(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), drug[-c(1,5,24,25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")

qqplot(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), drug[-c(1,5,24,25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")

qqplot(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), drug[-c(1,5,24,25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")

qqnorm(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), col="red", lwd=2)

hist(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")

based on 5 above plots, it looks like the removal of one found outlier using mahalanobis_distance is improving linearity as well as normality plots.

now we will try to find one more outlier and repeat the above check process again until we see no improvement in plots.

Finding 5th outlier:

n.outliers   <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), decreasing=TRUE)
is.outlier   <- rep(FALSE, nrow(drug[-c(5,24,25),]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1

plot3d(drug[-c(5,24,25),]$Height, drug[-c(5,24,25),]$Weight, drug[-c(5,24,25),]$Blood.Pressure, type="s", col=col)

plot(drug[-c(5,24,25),], col=col, pch=19, cex=2)

you can see the outlier in red color data point in above plots.

now, we will check linear plots of variables, normal q-q plot and histogram of mahalanobis_distance to make sure that after removing the outlier that we had detected in earlier step is improving things.

qqplot(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), drug[-c(23,1,5,24,25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")

qqplot(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), drug[-c(23,1,5,24,25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")

qqplot(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), drug[-c(23,1,5,24,25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")

qqnorm(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), col="red", lwd=2)

hist(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")

based on 5 above plots, it looks like the removal of one found outlier using mahalanobis_distance is not improving linearity or normality plots. so we can stop here.

so in conclusion, through the Mahalanobis method using mahalanobis_distance calculations we found total 4 outliers. 5th Outlier did not improve things so we did not consider it as outlier.

below are these final 4 Outliers:

##    Height Weight Blood.Pressure
## 25    190     59            156
## 24    189     57            121
## 5     156     54            140
## 1     164     70            100