Robust methods for dealing with outliers

Note: This tutorial was inspired by a Facebook post by Pr. Nguyen Van Tuan (24/03/2017). In that post, Pr. Tuan introduced Yuen's test for dealing with outliers in data. This test belongs to a family of statistical techniques called "robust methods". As these methods have never been introduced in Vietnam, we will include them in our ESMS project, which aims to help Vietnamese medical students develop their data analysis skills in R.

Introduction

Most conventional statistical tests rely on several assumptions about the data, but real-world data rarely meet them. Frequent violations include outliers, skewed or multimodal distributions, unequal sample sizes and heterogeneous variances. A null hypothesis about the mean is no longer reliable for such data. Several robust approaches exist for dealing with these violations: non-parametric tests, comparisons of trimmed means, medians or multiple quantiles, and bootstrapping. Because many concerns and limitations have been raised about non-parametric methods (Wilcox 2012), they are no longer considered the first-line tool for data analysis. Data transformation could also be considered, but it alters the original scale and complicates the interpretation of results. Bootstrapping remains a very powerful method, particularly for small samples.

In the present tutorial, the author introduces some robust methods for comparing two independent samples when the data contain outliers, have unequal sample sizes and/or are not normally distributed. These methods belong to a branch of statistical analysis called "robust methods" and are all supported by the WRS2 package in R.

The WRS2 package, developed by Patrick Mair (Harvard University) and Rand Wilcox (University of Southern California) in 2016, provides a comprehensive toolbox for robust statistical analysis. In this tutorial we will focus only on the Yuen test and associated methods, with or without bootstrapping, as alternatives to the t-test.

Context

In this study, we use the "Heart disease" dataset, which includes clinical data of more than 700 patients from 4 cardiology centers (Cleveland, Budapest, Long Beach and Zurich). The goal of the original study was to diagnose heart disease from 14 explanatory variables, including age, sex, chest pain type, serum cholesterol, fasting blood sugar, the ST slope induced by exercise relative to rest and the maximum heart rate achieved during CPET. The present tutorial, however, focuses only on the cholesterol variable. Our objective is to compare the cholesterol level between patients with and without heart disease.

Materials and method

The following packages must be installed and loaded in RStudio (some other packages might also be required during the analysis), and we will customize the aesthetics of our ggplot2 figures along the way.

library(tidyverse) # for data management
library(WRS2) # Robust statistical methods
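
If these packages are not yet present, both are available from CRAN; a minimal installation sketch:

install.packages(c("tidyverse", "WRS2")) # one-time installation from CRAN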

Preparing the data

library(tidyverse)

# Read the four center files from the UCI repository ("?" marks missing values)
va=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()
hu=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()
sw=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()
cl=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()

# Pool the four centers and name the 14 variables
df=rbind(va,hu,sw,cl)
names(df)=c("Age","Sex","ChestPain","RestBP","Chol","FBS","RestECG","MaxHR","CPETAngina","Oldpeak","Slope","CA","Thal","Class")

# Drop Slope, CA and Thal (columns 11-13) and remove Chol values coded as 0
data=df[,-c(11,12,13)]%>%filter(.,Chol!=0)

# Drop remaining missing values and recode Class as a two-level factor
data=na.omit(data)
library(magrittr) # needed for the compound assignment pipe %<>%
data$Class%<>%as.factor()%>%recode_factor(.,`0` = "Negative", `1` = "Positive",`2` = "Positive", `3` = "Positive",`4` = "Positive")

# Keep only RestBP, Chol, MaxHR and Class
df=data[,c(4,5,8,11)]

Data exploration

df
## # A tibble: 661 × 4
##    RestBP  Chol MaxHR    Class
##     <dbl> <dbl> <dbl>   <fctr>
## 1     140   260   112 Positive
## 2     130   209   127 Negative
## 3     132   218   140 Positive
## 4     142   228   149 Positive
## 5     110   213    99 Negative
## 6     150   236   105 Positive
## 7     160   267   157 Positive
## 8     126   166   140 Negative
## 9     120   220    86 Negative
## 10    170   177    84 Positive
## # ... with 651 more rows
psych::describeBy(df$Chol,df$Class)
## $Negative
##    vars   n   mean    sd median trimmed   mad min max range skew kurtosis
## X1    1 347 240.06 54.26    233  236.79 47.44  85 564   479 1.11      4.1
##      se
## X1 2.91
## 
## $Positive
##    vars   n  mean   sd median trimmed   mad min max range skew kurtosis
## X1    1 314 253.5 60.4    248  248.99 50.41 100 603   503 1.48     5.59
##      se
## X1 3.41
## 
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)

The skewness and kurtosis of the cholesterol data suggest a violation of the normality assumption. To verify this, we can use D'Agostino's test (dagoTest() in the fBasics package) as follows:

library(fBasics) # provides dagoTest()
df$Chol%>%split(df$Class)%>%map(~dagoTest(.)) # run the test separately in each group
## $Negative
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 90.5277
##     Z3  | Skewness: 7.1435
##     Z4  | Kurtosis: 6.2847
##   P VALUE:
##     Omnibus  Test: < 2.2e-16 
##     Skewness Test: 9.095e-13 
##     Kurtosis Test: 3.285e-10 
## 
## Description:
##  Fri Mar 24 22:34:09 2017 by user: Admin
## 
## 
## $Positive
## 
## Title:
##  D'Agostino Normality Test
## 
## Test Results:
##   STATISTIC:
##     Chi2 | Omnibus: 115.6027
##     Z3  | Skewness: 8.3194
##     Z4  | Kurtosis: 6.811
##   P VALUE:
##     Omnibus  Test: < 2.2e-16 
##     Skewness Test: < 2.2e-16 
##     Kurtosis Test: 9.691e-12 
## 
## Description:
##  Fri Mar 24 22:34:09 2017 by user: Admin

D'Agostino's test confirms the first assumption violation in our data: cholesterol is not normally distributed in either sub-group, and the violation involves both skewness and kurtosis.

p1=df%>%ggplot(aes(x=Class,y=Chol,fill=Class))+geom_boxplot(alpha=0.5)+theme_bw()+coord_flip()

p2=df%>%ggplot(aes(x=Chol,fill=Class))+geom_rug(aes(color=Class))+geom_density(alpha=0.5)+theme_bw()+geom_vline(xintercept = c(233,248),linetype="dashed",color=c("red","blue")) # dashed lines at the two group medians

library(gridExtra)

grid.arrange(p1,p2,ncol=1)

A simple box plot reveals another assumption violation in the cholesterol data: there are many outliers on both sides. Nevertheless, the medians (shown as the red and blue dashed lines) appear to differ between the 2 groups. The question is: how can we prove this statistically?
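
As a quick numerical check of what the box plot shows, we can count the points flagged by the usual boxplot rule (beyond 1.5 times the IQR from the quartiles). This is a minimal base-R sketch; count_outliers is a small helper written here for illustration, not a WRS2 function:

count_outliers=function(x){
  qs=quantile(x,c(0.25,0.75),na.rm=TRUE) # lower and upper quartiles
  iqr=qs[2]-qs[1]
  sum(x<qs[1]-1.5*iqr | x>qs[2]+1.5*iqr,na.rm=TRUE) # points beyond the whiskers
}
tapply(df$Chol,df$Class,count_outliers) # outlier count in each group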

The YUEN TEST

In 1974, Yuen introduced a statistical test for comparing two trimmed means. Under the null hypothesis of equal trimmed means, the test statistic follows a t-distribution, but the Yuen test is more robust than the conventional t-test because it handles outliers and unequal variances.
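
For reference, here is a sketch of the statistic in its standard form:

$$t_y = \frac{\bar{X}_{t1} - \bar{X}_{t2}}{\sqrt{d_1 + d_2}}, \qquad d_j = \frac{(n_j - 1)\, s_{wj}^2}{h_j (h_j - 1)}, \qquad \nu = \frac{(d_1 + d_2)^2}{d_1^2/(h_1 - 1) + d_2^2/(h_2 - 1)}$$

where $\bar{X}_{tj}$ is the trimmed mean, $s_{wj}^2$ the Winsorised variance, $n_j$ the sample size and $h_j$ the number of observations left after trimming in group $j$; the degrees of freedom $\nu$ are estimated with a Welch-type formula.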

The mean can be trimmed by up to 20% on each side. With no trimming at all, the Yuen test reduces to Welch's t-test.
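
To see what trimming does in practice, here is a minimal base-R sketch: mean() with the trim argument drops the stated fraction of observations from each end of the sorted data before averaging.

tapply(df$Chol,df$Class,mean,trim=0.2) # 20% trimmed mean of Chol in each group
tapply(df$Chol,df$Class,mean,trim=0) # untrimmed mean, for comparison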

The WRS2 package implements the Yuen test in the yuen() function. The package also provides a bootstrap version of the test, which is useful for groups with unequal sample sizes.

For example, we can perform a Yuen test on 20% trimmed mean as follows:

formula="Chol~Class"

library(WRS2)

yuen(formula=formula, data=df,tr = 0.2)
## Call:
## yuen(formula = formula, data = df, tr = 0.2)
## 
## Test statistic: 3.3622 (df = 387.19), p-value = 0.00085
## 
## Trimmed mean difference:  -13.53828 
## 95 percent confidence interval:
## -21.4551     -5.6214

In terms of effect size, Algina, Keselman and Penfield (2005) proposed a robust version of the conventional Cohen's d (see the previous tutorial on the Student t-test), based on 20% trimmed means and Winsorised variances. WRS2 supports this analysis through the akp.effect() function. The same interpretation rule as for Cohen's d applies: absolute values of 0.2, 0.5 and 0.8 correspond to small, medium and large effects, respectively. When the argument EQVAR is set to FALSE, the function applies the Algina (2005) method, which computes 2 effect sizes: one with group 1's Winsorised variance in the denominator and one with group 2's.

akp.effect(formula, data=df, EQVAR = FALSE, tr = 0.2)
## [1] -0.2888242 -0.2703434

The results suggest that the 20% trimmed means differ significantly between patients with and without heart disease (95% CI: -21.45 to -5.62; p = 0.00085; small effect size), i.e. patients with heart disease have a significantly higher trimmed mean cholesterol than the control group.

One might ask: how much should the data be trimmed? The authors suggest that the trimming level can vary between 0 (no trimming) and 0.2 (20% trimming). To determine the optimal trimming level for our data, we will perform a simulation experiment as follows:

# Yuen test across trimming levels from 0% to 25%

trim=seq(0,0.25,by=0.01)
pval=rep(NA,length(trim))
dif=rep(NA,length(trim))
LL=rep(NA,length(trim))
UL=rep(NA,length(trim))
Effect1=rep(NA,length(trim))
Effect2=rep(NA,length(trim))

for (i in 1:length(trim)) {
  tr=trim[i]
  yt=yuen(formula=formula, data=df,tr=tr)
  ef=akp.effect(formula, data=df, EQVAR = FALSE, tr = tr)
  pval[i]=yt$p.value # p-value at this trimming level
  dif[i]=yt$diff # trimmed mean difference
  LL[i]=yt$conf.int[1] # lower bound of the 95% CI
  UL[i]=yt$conf.int[2] # upper bound of the 95% CI
  Effect1[i]=ef[1] # AKP effect size with group 1's Winsorised variance
  Effect2[i]=ef[2] # AKP effect size with group 2's Winsorised variance
}

ytdf=cbind.data.frame(dif,LL,UL,pval,Effect1,Effect2)%>%as_tibble()%>%mutate(.,Trim=trim)
ytdf$Trim=as.factor(ytdf$Trim)

ytdf%>%ggplot(aes(x=Trim))+geom_errorbar(aes(ymax=UL,ymin=LL),size=1,color="grey")+geom_path(aes(y=dif,group=1),size=1,color="red4")+geom_point(aes(y=dif),shape=21,size=3,fill="red",color="black",show.legend = F)+theme_bw()

ytdf%>%ggplot(aes(x=Trim))+geom_path(aes(y=pval,group=1),show.legend = F,size=1,color="purple")+geom_point(aes(y=pval),shape=21,size=3,fill="gold",color="black",show.legend = F)+theme_bw()

ytdf%>%ggplot(aes(x=Trim))+geom_path(aes(y=Effect1,group=1),size=1,color="red3")+geom_point(aes(y=Effect1),shape=21,size=3,fill="red",color="black")+geom_path(aes(y=Effect2,group=1),size=1,color="blue3")+geom_point(aes(y=Effect2),shape=21,size=3,fill="blue",color="black")+theme_bw()

Our simulation uses a loop in R to compute the Yuen test outcome at each of the 26 trimming levels from 0% to 25%.

The graphical results indicate that the trimmed mean difference changes little as the trimming level increases from 0% to 25%: it stays around -12.5, with a 95% CI from about -20 to -5. The p-value gradually drops once the trimming level exceeds 9%, and it remains below 0.05 throughout, so the null hypothesis can be rejected at every level. The effect sizes are steady for trimming levels between 0 and 10% but increase as the trimming level rises from 12% to 21%. Taken together, the optimal trimming level for our data seems to be 20 to 21%: it yields a consistently large mean contrast, the lowest p-value and the largest effect size.

YUEN TEST WITH BOOTSTRAPPING
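
The bootstrap-t version of the test is provided by the yuenbt() function; below we run it with 1000 bootstrap samples, together with the explanatory effect size computed by yuen.effect.ci() (see the note after the output).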

yuenbt(formula, data=df, tr = 0.2, nboot = 1000)
## Call:
## yuenbt(formula = formula, data = df, tr = 0.2, nboot = 1000)
## 
## Test statistic: -3.344 (df = NA), p-value = 0.001
## 
## Trimmed mean difference:  -13.53828 
## 95 percent confidence interval:
## -21.4854     -5.5911
yuen.effect.ci(formula, data=df, tr = 0.2,nboot = 1000,alpha = 0.05)
## $effsize
## [1] 0.1878498
## 
## $CI
## [1] 0.08597846 0.28624654

Note: when Algina's 2 effect sizes do not lead to the same conclusion, we can use an alternative measure proposed by Wilcox and Tian (2011): an explanatory measure of effect size, implemented in the yuen.effect.ci() function. The thresholds for interpreting Wilcox's effect size are 0.1 (small), 0.3 (medium) and 0.5 (large).

We can replicate the same simulation experiment with these 2 functions, with the number of bootstrap iterations fixed at 1000:

#Bootstrapped Yuen test across trimming levels from 0% to 25%

trim=seq(0,0.25,by=0.01)
pval=rep(NA,length(trim))
dif=rep(NA,length(trim))
LL=rep(NA,length(trim))
UL=rep(NA,length(trim))
Effect=rep(NA,length(trim))
EffUL=rep(NA,length(trim))
EffLL=rep(NA,length(trim))

bootnum=1000
for (i in 1:length(trim)) {
  tr=trim[i]
  yt=yuenbt(formula, data=df, tr = tr, nboot = bootnum)
  yef=yuen.effect.ci(formula, data=df, tr = tr,nboot = bootnum,alpha = 0.05)
  pval[i]=yt$p.value
  dif[i]=yt$diff
  LL[i]=yt$conf.int[1]
  UL[i]=yt$conf.int[2]
  Effect[i]=yef$effsize # explanatory effect size
  EffUL[i]=yef$CI[2] # upper bound of its CI
  EffLL[i]=yef$CI[1] # lower bound of its CI
}

ytdf=cbind.data.frame(dif,LL,UL,pval,Effect,EffLL,EffUL)%>%as_tibble()%>%mutate(.,Trim=trim)
ytdf$Trim=as.factor(ytdf$Trim)

ytdf%>%ggplot(aes(x=Trim))+geom_errorbar(aes(ymax=UL,ymin=LL),size=1,color="grey")+geom_path(aes(y=dif,group=1),size=1,color="red4")+geom_point(aes(y=dif),shape=21,size=3,fill="red",color="black",show.legend = F)+theme_bw()

ytdf%>%ggplot(aes(x=Trim))+geom_errorbar(aes(ymax=EffUL,ymin=EffLL),size=1,color="grey")+geom_path(aes(y=Effect,group=1),size=1,color="blue4")+geom_point(aes(y=Effect),shape=21,size=3,fill="blue",color="black",show.legend = F)+theme_bw()

ytdf%>%ggplot(aes(x=Trim))+geom_path(aes(y=pval,group=1),show.legend = F,size=1,color="purple")+geom_point(aes(y=pval),shape=21,size=3,fill="gold",color="black",show.legend = F)+theme_bw()

With bootstrapping, the effect size is more stable, and the p-value no longer follows a clear pattern, though it tends to be lower than for the Yuen test without bootstrapping. The lowest p-value is obtained at a trimming level of 20%.

QUANTILE COMPARISON

We might also consider a more robust measure of location that can be applied to any quantile, as proposed by Harrell and Davis (1982). The corresponding function in the WRS2 package is qcomhd(). Several quantiles can be compared at once: instead of setting a single null hypothesis on the mean or the median, the method replicates the null-hypothesis test across multiple quantiles, so the normality assumption is no longer a concern.
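
To make the estimator concrete, here is a minimal base-R sketch (hd_est is a small helper written here for illustration, not a WRS2 function): the Harrell-Davis estimate of quantile q is a weighted mean of all order statistics, with weights derived from a Beta((n+1)q, (n+1)(1-q)) distribution.

hd_est=function(x,q=0.5){
  x=sort(x[!is.na(x)])
  n=length(x)
  i=seq_len(n)
  # Beta-distribution weights over the order statistics
  w=pbeta(i/n,(n+1)*q,(n+1)*(1-q))-pbeta((i-1)/n,(n+1)*q,(n+1)*(1-q))
  sum(w*x)
}
tapply(df$Chol,df$Class,hd_est,q=0.5) # compare with est1/est2 at q = 0.5 below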

QC=qcomhd(formula, data=df, q = c(0.025,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.975), nboot = 1000, alpha = 0.05, ADJ.CI = TRUE)

QCdf=QC$partable

QCdf%>%ggplot(aes(x=q))+geom_errorbar(aes(ymax=ci.up,ymin=ci.low),size=1,color="grey")+geom_path(aes(y=`est1-est.2`,group=1),size=1,color="blue4")+geom_point(aes(y=`est1-est.2`),shape=21,size=3,fill="blue",color="black",show.legend = F)+theme_bw()

QCdf%>%ggplot(aes(x=q))+geom_path(aes(y=p.crit,group=1),size=1,color="red3")+geom_point(aes(y=p.crit),shape=21,size=3,fill="red",color="black")+geom_path(aes(y=p.value,group=1),size=1,color="blue3")+geom_point(aes(y=p.value),shape=21,size=3,fill="blue",color="black")+theme_bw()+scale_y_continuous(breaks=c(0,0.01,0.02,0.05,0.1,0.2,0.3))

QC$partable
##       q  n1  n2     est1     est2 est1-est.2     ci.low     ci.up
## 1 0.025 347 314 150.5298 161.5751 -11.045370  -32.68379 11.256361
## 2 0.050 347 314 165.5248 172.4942  -6.969357  -22.71261  8.947613
## 3 0.100 347 314 181.9048 187.8101  -5.905309  -17.39914  5.372083
## 4 0.250 347 314 205.3756 215.2149  -9.839321  -20.42474  1.336347
## 5 0.500 347 314 232.8511 248.6772 -15.826183  -29.88414 -1.137301
## 6 0.750 347 314 269.3097 283.6409 -14.331199  -27.94715 -1.222559
## 7 0.900 347 314 307.1308 321.9023 -14.771486  -35.70154  9.194554
## 8 0.950 347 314 330.4469 347.6793 -17.232380  -66.32777 16.320784
## 9 0.975 347 314 361.1796 396.8749 -35.695315 -127.08273 30.020520
##        p.crit p.value
## 1 0.025000000   0.272
## 2 0.012500000   0.228
## 3 0.050000000   0.306
## 4 0.007142857   0.020
## 5 0.006250000   0.006
## 6 0.005555556   0.008
## 7 0.008333333   0.102
## 8 0.010000000   0.148
## 9 0.016666667   0.224

The quantile comparison shows that only the null hypothesis on the median leads to a positive result: the 95% CI of the difference in estimated medians runs from -29.9 to -1.1 and the p-value (0.006) falls below its critical level. The other quartiles (0.25 and 0.75) also show a clear contrast (differences of -9.8 and -14.3) but are not statistically significant, as their p-values exceed the adjusted critical levels. Thus, a null hypothesis based on the median yields a favorable result.

CONCLUSION

In this tutorial, we introduced robust methods for comparing two independent samples when one or more assumptions are violated, particularly in the presence of outliers. The Yuen test, based on trimmed means, is very robust against these problems, and it performs even better when combined with bootstrapping. A simulation over trimming levels and/or a quantile exploration can help determine the optimal trimming level for the Yuen test. We recommend these robust tests as an alternative to the t-test, rather than the non-parametric tests.

You can also practice these methods on the 2 remaining variables in the dataset: resting blood pressure (RestBP) and maximum heart rate during the exercise test (MaxHR). Thanks, and see you in the next tutorial.

P.S: This study is dedicated to Pr. Nguyen Van Tuan, who inspired me and many generations of medical students to become true data scientists.