The goal is to remove outliers (by variable) by marking them as NA and keeping a record of which were outliers.
First, lets create a sample data set
set.seed(20130828)
data <- data.frame(X = c(NA, rnorm(1000), runif(20, -20, 20)), Y = c(runif(1000),
rnorm(20, 2), NA), Z = c(rnorm(1000, 1), NA, runif(20)))
Here you can see part of the data
head(data)
## X Y Z
## 1 NA 0.4228 2.2784
## 2 -0.1435 0.8740 1.4466
## 3 2.2522 0.4640 2.3158
## 4 -0.4158 0.4046 2.8864
## 5 0.4667 0.8338 1.5126
## 6 -2.5872 0.9596 0.8744
Notice for example that the first observation in variable X is a NA. Meaning that we will be dealing with original NAs and new NAs.
We will mark an outlier any observation outside 3 sd. The next function finds the cells of the matrix that are considered as outliers.
findOutlier <- function(data, cutoff = 3) {
## Calculate the sd
sds <- apply(data, 2, sd, na.rm = TRUE)
## Identify the cells with value greater than cutoff * sd (column wise)
result <- mapply(function(d, s) {
which(d > cutoff * s)
}, data, sds)
result
}
outliers <- findOutlier(data)
outliers
## $X
## [1] 1003 1008 1010 1011 1013 1017 1018
##
## $Y
## [1] 1001 1002 1003 1004 1005 1006 1007 1008 1010 1011 1012 1013 1014 1015
## [15] 1017 1018 1020
##
## $Z
## [1] 14 43 92 104 107 136 151 158 211 215 223 274 332 367 400 427 454
## [18] 475 544 574 579 594 657 675 766 799 803 805 865 877 884 910 922 968
## [35] 978 981
Next we can remove the ouliers.
removeOutlier <- function(data, outliers) {
result <- mapply(function(d, o) {
res <- d
res[o] <- NA
return(res)
}, data, outliers)
return(as.data.frame(result))
}
dataFilt <- removeOutlier(data, outliers)
Here is how the data looks after the filtering step. Use the information from the outliers to find the data entries that were filtered. For example, you can see entries 1,001 to 1,010.
## Example of several entries with filtered values.
dataFilt[1001:1010, ]
## X Y Z
## 1001 0.1867 NA NA
## 1002 -5.7693 NA 0.05927
## 1003 NA NA 0.89794
## 1004 -14.7085 NA 0.50114
## 1005 -19.7741 NA 0.43058
## 1006 -8.9698 NA 0.47140
## 1007 -10.2305 NA 0.86841
## 1008 NA NA 0.09225
## 1009 -17.3372 0.7872 0.23583
## 1010 NA NA 0.77418
If you want to, you can iterate the procedure. However, note that the standard deviations of the filtered data will be smaller than in the original data set, thus potentially finding many more outliers.
outliers2 <- findOutlier(dataFilt)
outliers2
## $X
## integer(0)
##
## $Y
## [1] 2 6 20 26 27 28 37 45 46 51 53 54 55 56 59 62 68
## [18] 75 89 97 108 129 131 142 147 150 162 168 175 187 201 202 203 207
## [35] 222 223 237 245 248 250 255 261 272 277 287 306 307 313 332 346 351
## [52] 353 354 372 375 381 388 409 412 424 443 446 459 477 489 496 499 505
## [69] 514 526 545 552 562 566 567 568 591 597 600 601 603 604 626 631 646
## [86] 658 681 683 689 720 729 732 739 755 762 771 772 780 783 792 804 817
## [103] 825 837 854 879 887 894 913 918 922 931 938 953 955 959 973 978 981
## [120] 984 988 990
##
## $Z
## [1] 4 297 328 334 368 392 436 614 632 682 809 888 914 941
dataFilt2 <- removeOutlier(dataFilt, outliers2)
Here is the result after two iterations.
head(dataFilt2)
## X Y Z
## 1 NA 0.4228 2.2784
## 2 -0.1435 NA 1.4466
## 3 2.2522 0.4640 2.3158
## 4 -0.4158 0.4046 NA
## 5 0.4667 0.8338 1.5126
## 6 -2.5872 NA 0.8744
Sys.time()
## [1] "2013-08-30 14:52:20 EDT"
proc.time()
## user system elapsed
## 0.742 0.042 0.773
sessionInfo()
## R version 3.0.1 (2013-05-16)
## Platform: x86_64-apple-darwin10.8.0 (64-bit)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.4.1
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.3 evaluate_0.4.7 formatR_0.9 stringr_0.6.2
## [5] tools_3.0.1
This report written by L. Collado Torres and was generated using knitrBootstrap.