Question: 3. Using the Mahalanobis distance find the outliers in the following multidimensional dataset: This dataset contains the height (cm), weight (Kg) and the blood pressure (systolic) values of 25 patients diagnosed with type II diabetes and are under controlled treatment with the metformin drug.
R’s mahalanobis() function provides a simple means of detecting outliers in multidimensional data.
reading the data:
setwd("F:\\DATA SCIENCE\\STUDY MATERIAL\\R\\working_directory")
drug = read.csv("drug.csv")
head(drug)
## Height Weight Blood.Pressure
## 1 164 70 100
## 2 167 72 125
## 3 180 76 120
## 4 168 68 139
## 5 156 54 140
## 6 147 67 158
dim(drug)
## [1] 25 3
summary(drug)
## Height Weight Blood.Pressure
## Min. :134.0 Min. :54.00 Min. :100.0
## 1st Qu.:154.0 1st Qu.:70.00 1st Qu.:125.0
## Median :167.0 Median :76.00 Median :136.0
## Mean :166.6 Mean :74.92 Mean :136.5
## 3rd Qu.:179.0 3rd Qu.:82.00 3rd Qu.:148.0
## Max. :196.0 Max. :89.00 Max. :167.0
names(drug)
## [1] "Height" "Weight" "Blood.Pressure"
Now let’s give mahalanobis() a spin:
we will find 1 outlier at a time and if makes sense we will exclude it from data set and re-calculate mahalanobis() to find nex outlier for the remaining data set.
the Mahalanobis distance equation only accounts for linear relationships with normal distributions of variables.
let’s first check the linearity relationships among variables in “drug” dataframe.
library(GGally)
ggpairs(drug)
based on above plot, none of the variable have a strong linear relationship with other variables.
qqplot(mahalanobis(drug, colMeans(drug), cov(drug)), drug$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")
qqplot(mahalanobis(drug, colMeans(drug), cov(drug)), drug$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")
qqplot(mahalanobis(drug, colMeans(drug), cov(drug)), drug$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")
above three plots are plots of each variable in data set vs. the mahalanobis_distance.
from the plots, it is very clear that there is no linear relationship among variables. and because of that method of finding outliers using mahalanobis_distance calculations must be applied very carefully.
qqnorm(mahalanobis(drug, colMeans(drug), cov(drug)), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug, colMeans(drug), cov(drug)), col="red", lwd=2)
hist(mahalanobis(drug, colMeans(drug), cov(drug)),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")
based on above normal Q-Q plot and histogram of mahalanobis_distance, assumption of normal distribution is also not quite satisfying.
first, we will find the one most extreme outlier.
n.outliers <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug, colMeans(drug), cov(drug)), decreasing=TRUE)
is.outlier <- rep(FALSE, nrow(drug))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1
plot3d(drug$Height, drug$Weight, drug$Blood.Pressure, type="s", col=col)
plot(drug, col=col, pch=19, cex=2)
As you can see from the above code, the mahalanobis() function calculates the Mahalanobis distance of a dataframe using a supplied vector of means and a supplied covariance matrix.
The above code marks as outlier the most extreme point according to their Mahalanobis distance (also known as the generalised squared distance).
This is the distance of each point (the rows of the dataframe) from the centre of the data that the dataframe comprises, normalised by the standard deviation of each of the variables (the columns of the dataframe) and adjusted for the covariances of those variables.
It may be thought of as the multidimensional analogue of the t-statistic.
qqplot(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), drug[-25,]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")
qqplot(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), drug[-25,]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")
qqplot(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), drug[-25,]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")
qqnorm(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), col="red", lwd=2)
hist(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")
n.outliers <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-25,], colMeans(drug[-25,]), cov(drug[-25,])), decreasing=TRUE)
is.outlier <- rep(FALSE, nrow(drug[-25,]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1
plot3d(drug[-25,]$Height, drug[-25,]$Weight, drug[-25,]$Blood.Pressure, type="s", col=col)
plot(drug[-25,], col=col, pch=19, cex=2)
qqplot(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), drug[-c(24:25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")
qqplot(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), drug[-c(24:25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")
qqplot(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), drug[-c(24:25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")
qqnorm(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), col="red", lwd=2)
hist(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")
n.outliers <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-c(24:25),], colMeans(drug[-c(24:25),]), cov(drug[-c(24:25),])), decreasing=TRUE)
is.outlier <- rep(FALSE, nrow(drug[-c(24:25),]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1
plot3d(drug[-c(24:25),]$Height, drug[-c(24:25),]$Weight, drug[-c(24:25),]$Blood.Pressure, type="s", col=col)
plot(drug[-c(24:25),], col=col, pch=19, cex=2)
qqplot(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), drug[-c(5,24,25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")
qqplot(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), drug[-c(5,24,25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")
qqplot(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), drug[-c(5,24,25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")
qqnorm(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), col="red", lwd=2)
hist(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")
n.outliers <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), decreasing=TRUE)
is.outlier <- rep(FALSE, nrow(drug[-c(5,24,25),]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1
plot3d(drug[-c(5,24,25),]$Height, drug[-c(5,24,25),]$Weight, drug[-c(5,24,25),]$Blood.Pressure, type="s", col=col)
plot(drug[-c(5,24,25),], col=col, pch=19, cex=2)
qqplot(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), drug[-c(1,5,24,25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")
qqplot(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), drug[-c(1,5,24,25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")
qqplot(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), drug[-c(1,5,24,25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")
qqnorm(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])), col="red", lwd=2)
hist(mahalanobis(drug[-c(1,5,24,25),], colMeans(drug[-c(1,5,24,25),]), cov(drug[-c(1,5,24,25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")
n.outliers <- 1 # Mark as outliers the 1 the most extreme point.
m.dist.order <- order(mahalanobis(drug[-c(5,24,25),], colMeans(drug[-c(5,24,25),]), cov(drug[-c(5,24,25),])), decreasing=TRUE)
is.outlier <- rep(FALSE, nrow(drug[-c(5,24,25),]))
is.outlier[m.dist.order[1:n.outliers]] <- TRUE
col <- is.outlier * 1 + 1
plot3d(drug[-c(5,24,25),]$Height, drug[-c(5,24,25),]$Weight, drug[-c(5,24,25),]$Blood.Pressure, type="s", col=col)
plot(drug[-c(5,24,25),], col=col, pch=19, cex=2)
qqplot(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), drug[-c(23,1,5,24,25),]$Height, cex=2, col="orange", pch=19, xlab = "mahalanobis_distance", ylab="Height")
qqplot(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), drug[-c(23,1,5,24,25),]$Weight, cex=2, col="blue", pch=19, xlab = "mahalanobis_distance", ylab="Weight")
qqplot(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), drug[-c(23,1,5,24,25),]$Blood.Pressure, cex=2, col="red", pch=19, xlab = "mahalanobis_distance", ylab="Blood Pressure")
qqnorm(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), col="blue", cex=2, pch=19)
qqline(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])), col="red", lwd=2)
hist(mahalanobis(drug[-c(23,1,5,24,25),], colMeans(drug[-c(23,1,5,24,25),]), cov(drug[-c(23,1,5,24,25),])),xlab = "mahalanobis_distance", col="blue", main = "Histogram of Mahalanobis_Distance")
## Height Weight Blood.Pressure
## 25 190 59 156
## 24 189 57 121
## 5 156 54 140
## 1 164 70 100