Outlier detection
Primary check
Primary data evaluation looks for outliers with descriptive statistics and visual tools; we use it to set the general direction of the outlier search. It includes comparing the median with the mean, computing the range and computing the skewness. We use the skewness function from the e1071 library; the computation follows the type 1 method described in [2]:
\[ g_1 = \frac{m_3}{(m_2)^\frac{3}{2}} \]
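For reference, the type 1 formula can be reproduced directly from the sample central moments. A minimal sketch (the helper name skewness_type1 is ours, for illustration only):
skewness_type1 <- function(x) {
  m2 <- mean((x - mean(x))^2)   # second central moment
  m3 <- mean((x - mean(x))^3)   # third central moment
  m3 / m2^(3/2)
}
# should agree with e1071::skewness(x, type = 1)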
All three features are positively skewed, and there are outliers above the third quartile.
Sales are skewed to the right: the mean and the median are much closer to the minimal value than to the maximal value. The skewness is greater than 1, which indicates positive skew. The maximal value is about 90 times the median, which suggests outliers above the third quartile.
Price also has positive skewness. A large positive asymmetry coefficient shows that the bulk of the observations sits to the left while the distribution has a long right tail. The maximum value exceeds the median by more than three orders of magnitude, and the right tail drags the mean to the right.
The volume distribution is similar to sales: the skewness and the range are very close for both.
# Descriptive statistics: skewness, mean, median, max and min for each feature
# (requires the tibble and e1071 packages for tribble and skewness)
t1 <- tribble(~feature, ~coef_skewness, ~mean,
              "sales",  skewness(mydata$sales),  mean(mydata$sales),
              "price",  skewness(mydata$price),  mean(mydata$price),
              "volume", skewness(mydata$volume), mean(mydata$volume))
t2 <- tribble(~median, ~max, ~min,
              median(mydata$sales),  max(mydata$sales),  min(mydata$sales),
              median(mydata$price),  max(mydata$price),  min(mydata$price),
              median(mydata$volume), max(mydata$volume), min(mydata$volume))
cbind(t1, t2)
## feature coef_skewness mean median max min
## 1 sales 7.882118 4392.37855 4400.0000 397100 1100.0000
## 2 price 32.414350 37.95974 1.2698 8000 0.1305
## 3 volume 9.772788 1530.98368 1008.7000 203467 1.1000
We can check the hypothesis that there are outliers by estimating the dispersion of the observations. The coefficient of variation shows how homogeneous the population is; it is the ratio of the standard deviation to the mean [3]:
\[ c_v = \frac{\sigma}{\mu} \]
The coefficients of variation of sales, price and volume confirm that these variables are heterogeneous. Since all three take only positive values, it is reasonable to suppose that large positive deviations from the means cause the heterogeneity.
# Coefficient of variation, in percent
tribble(~feature, ~coeff.variation,
        "sales",  sd(mydata$sales)  / mean(mydata$sales)  * 100,
        "price",  sd(mydata$price)  / mean(mydata$price)  * 100,
        "volume", sd(mydata$volume) / mean(mydata$volume) * 100)
## # A tibble: 3 x 2
## feature coeff.variation
## <chr> <dbl>
## 1 sales 86.4
## 2 price 124.
## 3 volume 184.
Visual inspection of the histograms shows that the data is heavily right-skewed. The sales histogram covers values up to 4e+05, yet there are so few observations with sales above 1e+04 that they do not show up on the plot; the span of the sales axis alone indicates outliers. The volume histogram confirms the presence of outliers. Strangely, the price histogram also shows a wide spread of prices with a concentration around zero, which suggests that the outliers in sales and in volume do not occur in the same observations.
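One way to reproduce such histograms is with ggplot2; a minimal sketch, assuming mydata is already loaded:
library(dplyr)
library(tidyr)
library(ggplot2)
# Put the three features into long format and draw one histogram per feature
mydata %>%
  select(sales, price, volume) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 50) +
  facet_wrap(~feature, scales = "free")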

Now that we know there is a good chance of finding outliers in our features, it is time to detect the exact points. We do this with three methods: Cook’s distance, the interquartile range (IQR) and the Mahalanobis distance.
Cook’s distance
Cook’s distance (CD) measures the influence of an observation on a regression model: it estimates the effect of deleting that observation from the model [4]. Cook’s distance is applicable to multivariate regression models. Computing Cook’s distance involves a linear regression model [5]:
\[ y = X\beta + \epsilon \] where \(\epsilon \sim N(0, \sigma^2 I)\) is the error term, \(\beta\) is the coefficient vector \([\beta_0\ \beta_1\ \dots\ \beta_{p-1}]^T\), \(p\) is the number of estimated parameters (including the constant) and \(X\) is the design matrix with a constant column.
The regression is solved by minimizing the sum of squared residuals. The least squares estimator is given in [5]:
\[ b = (X^T X)^{-1}X^T y \]
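As a quick sanity check of the formula, the estimator can be computed by hand on toy data (mtcars is used purely for illustration):
# b = (X'X)^(-1) X'y; the result should match coef(lm(...))
X <- cbind(1, mtcars$wt, mtcars$hp)   # design matrix with a constant column
y <- mtcars$mpg
b <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(manual = b, lm = coef(lm(mpg ~ wt + hp, data = mtcars)))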
Consequently, the predicted values of the mean of \(y\) are:
\[ \widehat{y} = Xb = X(X^T X)^{-1}X^Ty = Hy \]
where \(H\) is the projection (hat) matrix. The diagonal elements of the projection matrix are the leverage values of the observations [6]:
\[ h_{ii} = x_i^T (X^T X)^{-1} x_i \]
Residuals are denoted by \(e_i\):
\[ e = y - \widehat{y} = (I - H) \cdot y \]
The influence of observation \(i\) is defined by the extent to which the model changes when that observation is removed; the observation with the largest change is the most influential. Cook’s distance of observation \(i\) is then the sum of all the changes in the predicted values after observation \(i\) has been removed:
\[ D_i = \frac{\sum_j (\widehat{y}_j - \widehat{y}_{j(i)})^2}{ps^2} \]
where \(\widehat{y}_{j(i)}\) is the predicted value for observation \(j\) when observation \(i\) is excluded and \(s^2\) is the mean squared error (MSE) of the model:
\[ s^2 = \frac{\sum_i e_i^2}{n - p} \]
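The same distances can be computed from the leverages and residuals through the algebraically equivalent form \(D_i = \frac{e_i^2 h_{ii}}{p s^2 (1 - h_{ii})^2}\); a sketch on a toy model (mtcars is illustrative only):
mod_toy <- lm(mpg ~ wt + hp, data = mtcars)
p  <- length(coef(mod_toy))                              # number of parameters
s2 <- sum(residuals(mod_toy)^2) / df.residual(mod_toy)   # MSE
h  <- hatvalues(mod_toy)                                 # leverages h_ii
e  <- residuals(mod_toy)
d_manual <- e^2 * h / (p * s2 * (1 - h)^2)
all.equal(unname(d_manual), unname(cooks.distance(mod_toy)))   # should be TRUE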
The last thing to decide is at what distance an observation becomes an outlier. There are different opinions on how to set a cut-off value. Cook and Weisberg (1982) proposed that an observation with \(D_i > 1\) is an outlier [7]. This universal approach does not take the size of the data into account. Belsley, Kuh and Welsch (1980) proposed a size-adjusted cut-off for large \(n\) [8]:
\[ c_o = \frac{4}{n} \]
Even though this is a good rule of thumb for identifying outliers, a researcher has to make the final judgement based on an understanding of the data.
Computing Cook’s distance in R requires a linear model. In our case it is natural to model sales as a function of volume and price. We then use the cooks.distance function to calculate the distances and add a marker column to the dataset.
# Linear model: log of sales explained by price and volume
form <- as.formula(log(sales) ~ price + volume)
mod <- lm(form, data = mydata)
cooksd <- cooks.distance(mod)
# Size-adjusted cut-off 4 / n
cut_off <- 4 / nrow(mydata)
cooks_data <- data.frame(cooksd, cut_off, test = cooksd >= cut_off)
# TRUE marks observations flagged as outliers by Cook's distance
mydata <- mydata %>% mutate(cooksd = cooks_data$test)
rm(mod, cooksd, cut_off, form, cooks_data)
The method flagged about 1.8 percent of the observations as outliers.
summary(mydata$cooksd)
## Mode FALSE TRUE
## logical 2332076 42143
Filtering out the observations flagged by Cook’s distance removes about 42k rows from the dataset. The volume range shrinks roughly five-fold, and so do the price and sales ranges. However, judging by the histograms, many outliers remain.

The Cook’s distance method did not detect many outliers because of the linear model we used: the fit accommodates the most extreme sales values and misses outliers in the 20–60k sales range. One way to solve this problem is the interquartile range.
Interquartile range
The interquartile range (IQR) method is a basic method for detecting outliers. The method is univariate: each variable is considered independently of the others. IQR considers a data point an outlier if it falls outside the range \([q_1 - 1.5\cdot iqr,\ q_3 + 1.5 \cdot iqr]\), where \(q_1\) is the first quartile, \(q_3\) is the third quartile and \(iqr\) stands for the interquartile range [9]:
\[ iqr = q_3 - q_1 \]
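A minimal sketch of the fences for a single variable (the helper name iqr_fences is ours; note that boxplot.stats, used below, works with hinges rather than quantiles, so its limits can differ very slightly):
iqr_fences <- function(x) {
  q <- unname(quantile(x, c(0.25, 0.75)))   # first and third quartiles
  r <- q[2] - q[1]                          # interquartile range
  c(lower = q[1] - 1.5 * r, upper = q[2] + 1.5 * r)
}
# points outside these fences are candidate outliers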
A drawback of the IQR method is that it is only a rule of thumb and does not apply equally well to every case, so a researcher has to examine each potential outlier in the context of the dataset. Such examination can be a problem for large datasets.
In our case we need to compute the IQR for three variables. R has the boxplot.stats function, which returns the outliers detected by the IQR rule. The detection rate of the IQR method is much better than that of Cook’s distance because quartiles, unlike the linear model, are not biased by extreme values.
# Outliers detected by the IQR rule for each variable
out_sales <- boxplot.stats(mydata$sales)$out
out_price <- boxplot.stats(mydata$price)$out
out_volume <- boxplot.stats(mydata$volume)$out
# Flag an observation if any of the three variables is an outlier
mydata$iqr <- mydata$sales %in% out_sales |
  mydata$price %in% out_price |
  mydata$volume %in% out_volume
rm(out_sales, out_price, out_volume)
summary(mydata$iqr)
## Mode FALSE TRUE
## logical 2226781 147438
A manual check of the detected outliers is easy in our case: visual inspection of the sales histogram shows that the upper limit is about six times lower than after the CD filtering. Flagging outliers variable by variable identifies more of them than the Cook’s distance approach. However, there are still too many outliers in the data: the rule worked well on sales, but volume and price require further exploration.

Mahalanobis distance
The Mahalanobis distance (MD) measures how far a data point is from the bulk of the distribution. MD is a multivariate generalization of counting how many standard deviations an observation lies from the mean of the dataset. The most common definition is [10]:
\[ D_M = \sqrt {(x_i - \mu) \cdot S^{-1} \cdot (x_i - \mu)^T} \]
where \(x_i\) is the vector of values for observation \(i\), \(\mu\) is the vector of column means of \(x\) (of length \(p\), the number of columns) and \(S\) is the covariance matrix of \(x\).
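A sketch of the computation for a single observation on toy data (mtcars is illustrative only). Note that R’s mahalanobis function returns the squared distances \(D_M^2\), which is what we compare to the \(\chi^2\) cut-off below:
X  <- as.matrix(mtcars[, c("mpg", "wt", "hp")])
mu <- colMeans(X)
S  <- cov(X)
x1 <- X[1, ]
d2_manual <- as.numeric(t(x1 - mu) %*% solve(S) %*% (x1 - mu))   # squared distance
c(manual = d2_manual, mahalanobis = mahalanobis(X, mu, S)[1])    # should match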
An advantage of the MD method is that a researcher can set the cut-off value formally, without subjective judgement, which makes the research easier to reproduce. The cut-off value is based on the \(\chi^2\) distribution with a p-value of 0.001 [11]: an MD larger than the critical \(\chi^2\) value for df = k (the number of variables in the model) at a critical alpha of 0.001 indicates the presence of one or more multivariate outliers.
Generally, the critical value of \(\chi^2\) can be obtained from a table; in R it can be found with the qchisq function:
# Take the three features; log-transform sales as in the Cook's distance model
mah_data <- mydata[, c("sales", "price", "volume")]
mah_data$sales <- log(mah_data$sales)
# Critical chi-squared value at alpha = 0.001 with df = number of variables
alpha <- 0.001
cutoff <- qchisq(p = 1 - alpha, df = ncol(mah_data))
MD detects fewer outliers than CD and roughly six times fewer than the interquartile range method. The reason, as with Cook’s distance, is the impact of the extremely large, likely misprinted, values in sales and volume.
# Squared Mahalanobis distance for every observation
mahal_r <- mahalanobis(mah_data, colMeans(mah_data), cov(mah_data))
mah_data <- mah_data %>% mutate(mhlbs = mahal_r,
                                crit = cutoff,
                                test = mhlbs >= crit)
# TRUE marks observations flagged as outliers by the Mahalanobis distance
mydata$mhlbs <- mah_data$test
rm(mah_data, mahal_r)
summary(mydata$mhlbs)
## Mode FALSE TRUE
## logical 2350482 23737
Here we tested three methods of outlier detection. These basic methods are distance-based and lend themselves well to interpreting the detected points. However, none of them alone was powerful enough to properly detect what is in the data, which is why knowing the data is always the first step of the detection process.
References
[1] https://thethrivingsmallbusiness.com/developing-and-managing-a-budget/
[2] D. N. Joanes and C. A. Gill (1998), Comparing measures of sample skewness and kurtosis. The Statistician, 47, 183–189.
[3] Everitt, Brian (1998). The Cambridge Dictionary of Statistics. Cambridge, UK New York: Cambridge University Press. ISBN 978-0521593465.
[4] Cook, R. Dennis (March 1979). “Influential Observations in Linear Regression”. Journal of the American Statistical Association. American Statistical Association. 74 (365): 169–174.
[5] Mendenhall, William; Sincich, Terry (1996). A Second Course in Statistics: Regression Analysis (5th ed.). Upper Saddle River, NJ: Prentice-Hall. p. 422.
[6] Hayashi, Fumio (2000). Econometrics. Princeton University Press. pp. 21–23.
[7] Cook, R. Dennis; Weisberg, Sanford (1982). Residuals and Influence in Regression. New York, NY: Chapman & Hall. ISBN 0-412-24280-X.
[8] Baltagi, Badi H. Econometrics (5th ed.). New York, NY: Springer. ISBN 978-3-642-20058-8.
[9] Upton, Graham; Cook, Ian (1996). Understanding Statistics. Oxford University Press. p. 55. ISBN 0-19-914391-9.
[10] De Maesschalck, R.; Jouan-Rimbaud, D.; Massart, D.L. “The Mahalanobis distance”. Chemometrics and Intelligent Laboratory Systems. 50 (1): 1–18.
[11] Tabachnick, B.G., & Fidell, L.S. (2007). Using Multivariate Statistics (5th Ed.). Boston: Pearson. (p. 74).