Remove outliers

The goal is to remove outliers (by variable) by marking them as NA and keeping a record of which were outliers.

Data

First, lets create a sample data set

set.seed(20130828)
data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000), 
    rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))

Here you can see part of the data

head(data)

##         X      Y      Z
## 1      NA 0.4228 2.2784
## 2 -0.1435 0.8740 1.4466
## 3  2.2522 0.4640 2.3158
## 4 -0.4158 0.4046 2.8864
## 5  0.4667 0.8338 1.5126
## 6 -2.5872 0.9596 0.8744

Notice for example that the first observation in variable X is a NA. Meaning that we will be dealing with original NAs and new NAs.

Find outliers

We will mark an outlier any observation outside 3 sd. The next function finds the cells of the matrix that are considered as outliers.

findOutlier <- function(data, cutoff = 3) {
    ## Calculate the sd
    sds <- apply(data, 2, sd, na.rm = TRUE)
    ## Identify the cells with value greater than cutoff * sd (column wise)
    result <- mapply(function(d, s) {
        which(d > cutoff * s)
    }, data, sds)
    result
}

outliers <- findOutlier(data)
outliers

## $X
## [1] 1003 1008 1010 1011 1013 1017 1018
## 
## $Y
##  [1] 1001 1002 1003 1004 1005 1006 1007 1008 1010 1011 1012 1013 1014 1015
## [15] 1017 1018 1020
## 
## $Z
##  [1]  14  43  92 104 107 136 151 158 211 215 223 274 332 367 400 427 454
## [18] 475 544 574 579 594 657 675 766 799 803 805 865 877 884 910 922 968
## [35] 978 981

Remove outliers

Next we can remove the ouliers.

removeOutlier <- function(data, outliers) {
    result <- mapply(function(d, o) {
        res <- d
        res[o] <- NA
        return(res)
    }, data, outliers)
    return(as.data.frame(result))
}

dataFilt <- removeOutlier(data, outliers)

Here is how the data looks after the filtering step. Use the information from the outliers to find the data entries that were filtered. For example, you can see entries 1,001 to 1,010.

## Example of several entries with filtered values.
dataFilt[1001:1010, ]

##             X      Y       Z
## 1001   0.1867     NA      NA
## 1002  -5.7693     NA 0.05927
## 1003       NA     NA 0.89794
## 1004 -14.7085     NA 0.50114
## 1005 -19.7741     NA 0.43058
## 1006  -8.9698     NA 0.47140
## 1007 -10.2305     NA 0.86841
## 1008       NA     NA 0.09225
## 1009 -17.3372 0.7872 0.23583
## 1010       NA     NA 0.77418

Iterate

If you want to, you can iterate the procedure. However, note that the standard deviations of the filtered data will be smaller than in the original data set, thus potentially finding many more outliers.

outliers2 <- findOutlier(dataFilt)
outliers2

## $X
## integer(0)
## 
## $Y
##   [1]   2   6  20  26  27  28  37  45  46  51  53  54  55  56  59  62  68
##  [18]  75  89  97 108 129 131 142 147 150 162 168 175 187 201 202 203 207
##  [35] 222 223 237 245 248 250 255 261 272 277 287 306 307 313 332 346 351
##  [52] 353 354 372 375 381 388 409 412 424 443 446 459 477 489 496 499 505
##  [69] 514 526 545 552 562 566 567 568 591 597 600 601 603 604 626 631 646
##  [86] 658 681 683 689 720 729 732 739 755 762 771 772 780 783 792 804 817
## [103] 825 837 854 879 887 894 913 918 922 931 938 953 955 959 973 978 981
## [120] 984 988 990
## 
## $Z
##  [1]   4 297 328 334 368 392 436 614 632 682 809 888 914 941

dataFilt2 <- removeOutlier(dataFilt, outliers2)

Here is the result after two iterations.

head(dataFilt2)

##         X      Y      Z
## 1      NA 0.4228 2.2784
## 2 -0.1435     NA 1.4466
## 3  2.2522 0.4640 2.3158
## 4 -0.4158 0.4046     NA
## 5  0.4667 0.8338 1.5126
## 6 -2.5872     NA 0.8744

Reproducibility

Sys.time()

## [1] "2013-08-30 14:52:20 EDT"

proc.time()

##    user  system elapsed 
##   0.742   0.042   0.773

sessionInfo()

## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.4.1
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.3   evaluate_0.4.7 formatR_0.9    stringr_0.6.2 
## [5] tools_3.0.1

This report written by L. Collado Torres and was generated using knitrBootstrap.