Outlier Analysis in R

Author

Dr. Danilo Petti University of Essex

Univariate Outlier Analysis in R

options(repos = c(CRAN = "https://cloud.r-project.org"))
utils::install.packages("outliers")

utils::install.packages('EnvStats')

library('outliers')
library('EnvStats')

Attaching package: 'EnvStats'
The following objects are masked from 'package:stats':

    predict, predict.lm

chisq.out.test

set.seed(1234)
n <- 1000
Normal_sample = rnorm(n)
hist(Normal_sample, main = 'Normal Sample', xlab ='')

pareto_sample <- rpareto(n, location = 1, shape = 1)
hist(pareto_sample, main = 'Pareto Sample')

The chisq.out.test() function implements the chi-squared method as discussed in Dixon, W.J. (1950). Analysis of Extreme Values. Ann. Math. Stat. 21(4), 488-506. This method is primarily used for evaluating a numeric vector of observations to detect potential outliers.

Limitations

While it was innovative at the time, the chi-squared method has significant limitations:

  1. Lack of Power: The method is not as sensitive or reliable as modern techniques in detecting outliers, especially in datasets with non-normal distributions or larger sample sizes.

  2. Assumptions: It assumes that the data is normally distributed, which may not always hold true in practice.

  3. Modern Alternatives: More powerful and flexible methods are now available in the outliers package, such as grubbs.test, dixon.test, and others, which provide more reliable results.

Function specification

  • x: A numeric vector containing the data to be analyzed for outlier detection.
  • variance: The variance of the data. If not specified, the function automatically calculates the variance of x using var(x).

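  • opposite: A logical parameter (TRUE or FALSE). By default (FALSE), the function tests the extreme value that lies farthest from the mean; if set to TRUE, it tests the opposite extreme instead (the lowest value if the most suspicious one is the highest, and vice versa).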

Theory

First, any missing values in the data are removed. The data vector is then sorted in ascending order, \(x_{(1)} \le \dots \le x_{(n)}\), where \(x_{(1)}\) is the smallest value and \(x_{(n)}\) is the largest value. By default (\(\texttt{opposite = FALSE}\)) the function tests whichever of these two extremes lies farther from the mean; with \(\texttt{opposite = TRUE}\) it tests the other extreme. The following statistic is calculated: \[\text{Statistic} = \begin{cases} \frac{(x_{(n)} - \bar{x})^2}{s^2}, & \text{when the largest value is tested} \\[10pt] \frac{(\bar{x} - x_{(1)})^2}{s^2}, & \text{when the smallest value is tested} \end{cases} \]

Where:

  • \(x_{(n)}\), \(x_{(1)}\): Largest and smallest values in the dataset.

  • \(\bar{x}\): Mean of the data.

  • \(s^2\): Variance of the data.

Next, the probability to the right of the calculated statistic is computed under a \(\chi^2_1\) distribution (chi-squared distribution with 1 degree of freedom); this upper-tail probability is the p-value of the test. A large statistic, and therefore a small p-value, indicates that the tested observation lies far from the center of the distribution, and it is flagged as a potential outlier.
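The calculation can be reproduced by hand in base R. The following is a minimal sketch rather than the package's own source: the helper name chisq_outlier_by_hand is hypothetical, and it assumes (as the printed results for Normal_sample suggest) that the default tests whichever extreme lies farther from the mean.

```r
# Hypothetical helper: computes the chi-squared outlier statistic by hand.
chisq_outlier_by_hand <- function(x, variance = var(x), opposite = FALSE) {
  x <- sort(x[!is.na(x)])                 # drop missing values, sort ascending
  n <- length(x)
  m <- mean(x)
  # TRUE when the smallest value lies farther from the mean than the largest
  lowest_is_farther <- (m - x[1]) > (x[n] - m)
  # opposite = TRUE switches to the other extreme (assumed behaviour)
  test_lowest <- xor(lowest_is_farther, opposite)
  suspect <- if (test_lowest) x[1] else x[n]
  statistic <- (suspect - m)^2 / variance
  # upper-tail probability under a chi-squared distribution with 1 df
  p_value <- pchisq(statistic, df = 1, lower.tail = FALSE)
  list(suspect = suspect, statistic = statistic, p_value = p_value)
}

set.seed(1234)
chisq_outlier_by_hand(rnorm(1000))
```

With the same seed and sample size as Normal_sample, the returned statistic and p-value should agree with the chisq.out.test() output for that sample.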

chisq.out.test(Normal_sample, variance = var(Normal_sample))

    chi-squared test for outlier

data:  Normal_sample
X-squared = 11.414, p-value = 0.0007289
alternative hypothesis: lowest value -3.39606353457436 is an outlier
chisq.out.test(Normal_sample, opposite = TRUE, variance = var(Normal_sample))

    chi-squared test for outlier

data:  Normal_sample
X-squared = 10.44, p-value = 0.001233
alternative hypothesis: highest value 3.19590119785739 is an outlier

dixon.test

Another test for determining whether a data point in a dataset can be considered an extreme value is the Dixon Test, as introduced by Dixon, W.J. (1951) in “Ratios Involving Extreme Values,” Annals of Mathematical Statistics, 22(1), 68-78.

The Dixon Test is specifically designed for small datasets and works by comparing the difference between the extreme value (the smallest or largest) and its nearest neighbor, relative to the range of the dataset. This makes it particularly useful for identifying outliers in small samples, typically when the number of observations is between 3 and 30.

The method assumes that the data follows a normal distribution and is most appropriate for datasets with a single suspected outlier. While the test was a foundational approach in statistical outlier detection, it has some limitations, such as reduced power with larger datasets and sensitivity to deviations from normality. Despite its age, the Dixon Test remains a widely referenced method in statistical analysis for detecting outliers in small samples.

Function Specification

  • x: A numeric vector containing the data values you want to analyze to identify potential outliers.

  • opposite: A logical parameter (TRUE or FALSE) that reverses the test’s behavior.

    • If set to TRUE, instead of checking the value with the largest difference from the mean (either the highest or the lowest), the test focuses on the opposite value (e.g., the lowest instead of the highest, or vice versa).

    • Default is FALSE.

  • type: An integer that specifies which variant of the Dixon test to use. Possible values are:

    • 10, 11, 12, 20, 21: Corresponding to the variants proposed by Dixon (1950).

    • If set to 0, the function automatically selects the appropriate variant based on the sample size:

      • 10 for sample sizes \(3 \le n \le 7\).

      • 11 for \(8 \le n \le 10\).

      • 21 for \(11 \le n \le 13\).

      • 22 for \(n \ge 14\).

      • The extreme value (either the smallest or the largest) is selected automatically, but this can be reversed using the opposite parameter.

  • two.sided: A logical parameter (TRUE or FALSE) that indicates whether the test should be treated as two-sided (i.e., checks for both extremely high and low values).

    • Default is TRUE.
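The sample-size rule described for type = 0 can be written as a small lookup. This is only an illustrative sketch: dixon_variant_for_n is a hypothetical helper, not a function from the outliers package, and the upper limit of 30 is taken from the sample-size range quoted in this section.

```r
# Hypothetical helper mirroring the sample-size rule described for type = 0.
dixon_variant_for_n <- function(n) {
  stopifnot(is.numeric(n), length(n) == 1, n >= 3, n <= 30)
  if (n <= 7) {
    10
  } else if (n <= 10) {
    11
  } else if (n <= 13) {
    21
  } else {
    22
  }
}

# Variants chosen for a few sample sizes
sapply(c(5, 10, 12, 20), dixon_variant_for_n)
```

For a sample of 10 observations the rule selects variant 11, so a default call on such a sample does not use the simplest ratio (variant 10).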

Theory

Dixon’s Q Test is a statistical test used to identify univariate outliers in small datasets, where the sample size \(n\) is typically between \(3\) and \(30\). Because it is designed for small samples, its practical utility is limited to specific scenarios.

Assumptions of the Test

  • The vector \(x\) must be numeric

  • The sample size must be small, \(3 \le n \le 30\)

  • The statistic is defined on the data sorted in ascending order \((x_{(1)}, x_{(2)}, \dots, x_{(n)})\); dixon.test() sorts the vector internally

  • The data should come from a normal distribution (or be approximately normally distributed)

The test statistic \(Q\) is calculated as:

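\[Q =\begin{cases} \frac{x_{(2)} - x_{(1)}}{x_{(n)} - x_{(1)}}, & \text{when the smallest value } x_{(1)} \text{ is tested}, \\[10pt]\frac{x_{(n)} - x_{(n-1)}}{x_{(n)} - x_{(1)}}, & \text{when the largest value } x_{(n)} \text{ is tested}.\end{cases}\]

By default (\(\texttt{opposite = FALSE}\)) the function tests whichever extreme lies farther from the mean; with \(\texttt{opposite = TRUE}\) it tests the other extreme. The ratio above is the simplest variant (\(\texttt{type = 10}\)); the other variants replace some of these order statistics with neighbouring ones.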

The value of \(Q\) is then compared to a critical value from a probability distribution. This critical value, \(Q_{\text{threshold}}\), varies depending on the sample size.
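The ratio can be checked by hand in base R. The sketch below is an illustration under stated assumptions, not the package implementation: dixon_q_by_hand is a hypothetical helper, the side to test is passed explicitly, and variant 11 is Dixon's \(r_{11}\) ratio, which leaves the opposite extreme out of the denominator.

```r
# Hypothetical helper: Dixon's ratio for the smallest or largest value.
# variant = 10 uses the full range in the denominator;
# variant = 11 (Dixon's r11) excludes the opposite extreme from it.
dixon_q_by_hand <- function(x, side = c("lowest", "highest"), variant = 10) {
  side <- match.arg(side)
  x <- sort(x[!is.na(x)])                 # drop missing values, sort ascending
  n <- length(x)
  stopifnot(n >= 3, variant %in% c(10, 11))
  if (side == "lowest") {
    gap <- x[2] - x[1]                    # distance to the nearest neighbour
    spread <- if (variant == 10) x[n] - x[1] else x[n - 1] - x[1]
  } else {
    gap <- x[n] - x[n - 1]
    spread <- if (variant == 10) x[n] - x[1] else x[n] - x[2]
  }
  gap / spread
}

set.seed(1234)
x <- rnorm(10)
dixon_q_by_hand(x, "lowest", variant = 11)   # variant selected for n = 10
dixon_q_by_hand(x, "highest", variant = 11)
```

Because the sample has 10 observations, the automatic choice is variant 11, so these two ratios (rather than the variant 10 ratio) are the ones to compare with the Q values reported by dixon.test() for the same seed.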

set.seed(1234)
Normal_sample <- rnorm(10)

dixon.test(Normal_sample)

    Dixon test for outliers

data:  Normal_sample
Q = 0.39927, p-value = 0.2187
alternative hypothesis: lowest value -2.34569770262935 is an outlier
dixon.test(Normal_sample,opposite=TRUE)

    Dixon test for outliers

data:  Normal_sample
Q = 0.2524, p-value = 0.6484
alternative hypothesis: highest value 1.08444117668306 is an outlier

grubbs.test

The Grubbs’ test is a statistical method used to identify univariate outliers in a set of continuous data. It is particularly useful for determining whether an extreme value (either the largest or the smallest) is significantly different from the rest of the dataset, under the assumption that the data follows a normal distribution.

Function Specification

  • x:
    A numeric vector containing the data to be analyzed for outlier detection.
  • type:
    An integer parameter specifying the type of Grubbs’ test to perform. Possible values are:

    • 10: Test for a single outlier (the side is detected automatically and can be reversed with opposite).

    • 11: Test for two outliers on opposite tails (the smallest and the largest value together).

    • 20: Test for two outliers in the same tail.

    • Default is 10.

  • opposite:
    A logical parameter (TRUE or FALSE) that determines whether to reverse the direction of the test.

    • If set to TRUE, the test focuses on the opposite extreme (e.g., the smallest value instead of the largest, or vice versa).

    • Default is FALSE.

Theory

The test uses the following test statistic:

\[ G = \frac{|x_{Extreme}- \bar{x}|}{s} \]

Where:

  • \(x_{Extreme}\): The largest or smallest value in the dataset.

  • \(\bar{x}\) : The mean of the dataset.

  • \(s\): The standard deviation of the dataset.

The calculated \(G\) value is then compared to a critical value \(G_{\text{threshold}}\) derived from the Student’s t-distribution. If \(G > G_{\text{threshold}}\), the extreme value is considered an outlier.
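The statistic and its critical value can be sketched in base R. This is a minimal illustration, not the package's code: grubbs_by_hand is a hypothetical helper, the significance level alpha = 0.05 is an assumed default, and the threshold uses the standard one-sided Grubbs critical-value formula based on the t-distribution with \(n - 2\) degrees of freedom.

```r
# Hypothetical helper: Grubbs' statistic for the most extreme value,
# with a one-sided critical value derived from the t-distribution.
grubbs_by_hand <- function(x, alpha = 0.05) {
  x <- x[!is.na(x)]
  n <- length(x)
  stopifnot(n >= 3)
  dev <- abs(x - mean(x))
  suspect <- x[which.max(dev)]            # value farthest from the mean
  g <- max(dev) / sd(x)
  # upper alpha/n quantile of the t-distribution with n - 2 df
  t_crit <- qt(alpha / n, df = n - 2, lower.tail = FALSE)
  g_threshold <- (n - 1) / sqrt(n) * sqrt(t_crit^2 / (n - 2 + t_crit^2))
  list(suspect = suspect, G = g, G_threshold = g_threshold,
       is_outlier = g > g_threshold)
}

set.seed(1234)
grubbs_by_hand(rnorm(10))
```

The helper reports a decision at a fixed level instead of a p-value; for the same seed, its G should agree with the value printed by grubbs.test().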

set.seed(1234)
Normal_sample = rnorm(10)
grubbs.test(Normal_sample)

    Grubbs test for one outlier

data:  Normal_sample
G = 1.97084, U = 0.52047, p-value = 0.1323
alternative hypothesis: lowest value -2.34569770262935 is an outlier
grubbs.test(Normal_sample,type=20)

    Grubbs test for two outliers

data:  Normal_sample
U = 0.3836, p-value = 0.2459
alternative hypothesis: lowest values -2.34569770262935 , -1.20706574938542 are outliers
grubbs.test(Normal_sample,type=11)

    Grubbs test for two opposite outliers

data:  Normal_sample
G = 3.44465, U = 0.32364, p-value = 0.195
alternative hypothesis: -2.34569770262935 and 1.08444117668306 are outliers