Detecting Outliers

Introduction

The wine quality data set is available at the UCI Machine Learning Repository. The goal is to find the relationship between the quality of a red wine and its characteristics.

In this post I’ll look for outliers in the data set. An outlier is a data point that differs markedly from the rest of the data.

Outlier Detection

# Visualize the data set to identify the outliers
# (assumes `wine` is a data frame holding the red wine data)
library(ggplot2)
ggplot(stack(wine), aes(x = ind, y = values))+
  geom_boxplot(fill='rosybrown', color="darkred") +
  coord_flip()

Total sulfur dioxide seems to have two outliers. The other variables are compressed by the shared axis and will have to be plotted again with a better scale. I’ll have a look at the ‘total.sulfur.dioxide’ column first.
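
One way to get a better scale is to standardize each column before stacking, so that every variable is shown in units of its own standard deviation. The sketch below uses base R graphics and a small made-up data frame (`demo` is hypothetical and stands in for `wine`). It is one possible approach, not necessarily the one used for the plots in this post.

```r
# Hypothetical stand-in for the wine data: columns on very different scales
demo <- data.frame(
  density = c(0.994, 0.996, 0.997, 0.998, 0.999, 1.001),
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 14.0),
  total.sulfur.dioxide = c(20, 34, 38, 46, 60, 289)
)

# Standardize every column to mean 0 and standard deviation 1
demo_scaled <- as.data.frame(scale(demo))

# Stack into long format (columns `values` and `ind`), as in the plot above
demo_long <- stack(demo_scaled)

# Horizontal box plots, now on a comparable scale
boxplot(values ~ ind, data = demo_long, horizontal = TRUE,
        col = "rosybrown", border = "darkred", las = 1)
```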

# Run the Shapiro-Wilk test to test the normality of total sulfur dioxide data
shapiro.test(wine$total.sulfur.dioxide)

## 
##  Shapiro-Wilk normality test
## 
## data:  wine$total.sulfur.dioxide
## W = 0.87322, p-value < 2.2e-16

The null hypothesis of the Shapiro-Wilk test is that the data are normally distributed. The p-value is below 2.2e-16, so the null hypothesis is rejected: the data are not consistent with a normal distribution. However, this might be due to the outliers.
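
To see how a single extreme value can drive this test, the following sketch applies `shapiro.test` to an artificial sample. The data are made up: `x` holds evenly spaced normal quantiles, so it is close to normal by construction, and `x_out` adds one extreme point.

```r
# Artificial, near-normal sample: 200 evenly spaced quantiles of N(0, 1)
x <- qnorm(ppoints(200))

# Same sample with one extreme value appended
x_out <- c(x, 10)

p_clean <- shapiro.test(x)$p.value
p_out   <- shapiro.test(x_out)$p.value

# Compare the p-values without and with the extreme point
c(clean = p_clean, with_outlier = p_out)
```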

# Q-Q plot to check the normality of the data
qqnorm(wine$total.sulfur.dioxide)
qqline(wine$total.sulfur.dioxide, col = "rosybrown", lwd = 2)

# Histogram of total sulfur dioxide
ggplot(wine, aes(x = total.sulfur.dioxide)) +
  geom_histogram(fill = 'rosybrown', color = "darkred", bins = 20)

# Box plot of total sulfur dioxide with the outliers highlighted
ggplot(wine, aes(x = total.sulfur.dioxide)) +
  geom_boxplot(fill = 'rosybrown', color = "darkred",
               outlier.colour = "red", outlier.shape = 8,
               outlier.size = 4)

library(outliers)

# Run the Grubbs' test for two outliers on opposite tails
grubbs.test(wine$total.sulfur.dioxide, type = 11)

## 
##  Grubbs test for two opposite outliers
## 
## data:  wine$total.sulfur.dioxide
## G = 8.60305, U = 0.96502, p-value = 0.001236
## alternative hypothesis: 6 and 289 are outliers

# Run the Grubbs' test for one outlier on the upper tail
grubbs.test(wine$total.sulfur.dioxide, type = 10)

## 
##  Grubbs test for one outlier
## 
## data:  wine$total.sulfur.dioxide
## G = 7.37285, U = 0.96596, p-value = 8.326e-11
## alternative hypothesis: highest value 289 is an outlier

# Repeat the test on the opposite (lower) tail
grubbs.test(wine$total.sulfur.dioxide, type = 10, opposite = TRUE)

## 
##  Grubbs test for one outlier
## 
## data:  wine$total.sulfur.dioxide
## G = 1.23020, U = 0.99905, p-value = 1
## alternative hypothesis: lowest value 6 is an outlier

The Grubbs test with type = 11 has a low p-value (0.001236), suggesting that 6 and 289 might both be outliers. With type = 10, the p-value is very low (8.326e-11), suggesting that the highest value, 289, is an outlier. For the other tail the p-value is 1, so there is no evidence that the lowest value, 6, is an outlier.
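
For reference, the one-sided Grubbs statistic reported as G is the distance of the most extreme value from the sample mean, measured in sample standard deviations. A minimal base R sketch, using a made-up vector rather than the wine data:

```r
# Made-up sample with one suspiciously large value
x <- c(22, 30, 34, 38, 41, 47, 52, 60, 62, 289)

# Grubbs statistic for the highest value: (max - mean) / sd
G_high <- (max(x) - mean(x)) / sd(x)

# Grubbs statistic for the lowest value: (mean - min) / sd
G_low <- (mean(x) - min(x)) / sd(x)

c(G_high = G_high, G_low = G_low)
```

Read the other way round, the reported G = 7.37285 for the maximum of 289, together with the column mean of 46.47, implies a standard deviation of about 32.9.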

# Copy the column into a working vector and summarize it
total.so2 <- wine$total.sulfur.dioxide
summary(total.so2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

wine[which.max(total.so2),]

##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1082           7.9              0.3        0.68            8.3      0.05
##      free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1082                37.5                  289 0.99316 3.01      0.51    12.3
##      quality
## 1082       7

# Remove the largest value from the working vector
total.so2 <- total.so2[-which.max(total.so2)]

The wine with the largest value contains 289 mg/dm^3 of SO2 (roughly 289 ppm). Sulfur dioxide acts as an antioxidant and protects the wine against browning. However, too much SO2 can prevent healthy fermentation, and at concentrations above 50 ppm it becomes evident in the nose and taste of the wine. The maximum allowable concentration of SO2 in wine is 350 ppm. Considering these two points, this observation is either a very bad wine or an error in the data set.

# Checking the second highest point
grubbs.test(total.so2, type = 10)

## 
##  Grubbs test for one outlier
## 
## data:  total.so2
## G = 7.16384, U = 0.96784, p-value = 4.112e-10
## alternative hypothesis: highest value 278 is an outlier

grubbs.test(total.so2, type = 11)

## 
##  Grubbs test for two opposite outliers
## 
## data:  total.so2
## G = 8.41044, U = 0.96688, p-value = 0.002868
## alternative hypothesis: 6 and 278 are outliers

Both tests have low p-values, so the Grubbs test suggests that the second highest point, 278, is also an outlier.

# Remove the second highest point
total.so2 <- total.so2[-which.max(total.so2)]

grubbs.test(total.so2, type = 10)

## 
##  Grubbs test for one outlier
## 
## data:  total.so2
## G = 3.73365, U = 0.99126, p-value = 0.1462
## alternative hypothesis: highest value 165 is an outlier

With a p-value of 0.1462, there is no strong evidence that the next highest point, 165, is an outlier.
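
The remove-and-retest steps above can be written as a loop. The sketch below is a base R approximation of that procedure, not the `outliers` package's implementation: `grubbs_crit` and `strip_high_outliers` are hypothetical helper names, the cut-off uses the standard t-distribution formula for the one-sided Grubbs critical value, and the significance level of 0.05 is an assumption.

```r
# One-sided Grubbs critical value for sample size n at level alpha
grubbs_crit <- function(n, alpha = 0.05) {
  t2 <- qt(alpha / n, df = n - 2, lower.tail = FALSE)^2
  (n - 1) / sqrt(n) * sqrt(t2 / (n - 2 + t2))
}

# Repeatedly drop the highest value while its Grubbs statistic exceeds
# the critical value; return the kept values and the removed values
strip_high_outliers <- function(x, alpha = 0.05) {
  removed <- numeric(0)
  while (length(x) >= 3) {  # the test needs at least 3 points
    G <- (max(x) - mean(x)) / sd(x)
    if (G <= grubbs_crit(length(x), alpha)) break
    removed <- c(removed, max(x))
    x <- x[-which.max(x)]
  }
  list(kept = x, removed = removed)
}

# Artificial example: near-normal values plus two extreme points
x <- c(50 + 10 * qnorm(ppoints(100)), 150, 160)
res <- strip_high_outliers(x)
res$removed
```

Repeated single-outlier tests can miss points when several extreme values sit close together (masking), so a loop like this is a screening aid rather than a final verdict.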

# Looking at the data set with the two outliers removed.
hist(total.so2, col = 'rosybrown', border = "darkred")

boxplot(total.so2, col = 'rosybrown', border = "darkred")

# Test for normality again, with the two outliers removed
shapiro.test(total.so2)

## 
##  Shapiro-Wilk normality test
## 
## data:  total.so2
## W = 0.8901, p-value < 2.2e-16

Even with the two points removed, the Shapiro-Wilk test still rejects normality, so the departure from normality is not explained by these outliers alone.

Conclusion

Two data points are outliers in terms of total sulfur dioxide. The next step would be to investigate these two points further by looking at their other variables, and to look for possible outliers in the other columns.
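
As a starting point for the other columns, a sketch like the following counts, for each numeric column, the points that fall beyond the box plot whiskers (more than 1.5 times the interquartile range from the box). It uses base R's `boxplot.stats` on a made-up data frame (`demo` is hypothetical and stands in for `wine`).

```r
# Hypothetical stand-in for the wine data
demo <- data.frame(
  alcohol = c(9.4, 9.8, 10.0, 10.5, 11.2, 11.5),
  total.sulfur.dioxide = c(20, 34, 38, 46, 60, 289)
)

# Values flagged by the 1.5 * IQR whisker rule, per column
flagged <- lapply(demo, function(col) boxplot.stats(col)$out)

# Number of flagged points in each column
n_flagged <- sapply(flagged, length)
n_flagged
```

Columns with a non-zero count are candidates for the same plot-and-test treatment applied to total sulfur dioxide above.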