Note: This tutorial was inspired by a Facebook post by Prof. Nguyen Van Tuan (24/03/2017). In that post, Prof. Tuan introduced Yuen's test for dealing with outliers in data. This test belongs to a family of statistical techniques called "robust methods". As these methods have never been introduced in Vietnam, we will include them in our ESMS project, which aims to help Vietnamese medical students develop their data analysis skills in R.
Introduction
Most conventional statistical tests impose multiple assumptions on the data, but real-world data rarely meet them. Frequent violations include outliers, skewed or multimodal distributions, unequal sample sizes and heterogeneous variances. In such data, a null hypothesis set on the mean is no longer reliable. Several robust approaches exist for dealing with these violations, such as non-parametric tests, comparisons of trimmed means, medians or multiple quantiles, and bootstrapping. Because many concerns and limitations have been raised about non-parametric methods (Wilcox 2012), they are no longer considered the first-line tool for data analysis. Data transformation could also be considered, but it may alter the original scale and the inference of the results. Bootstrapping remains a very powerful method, particularly for small samples.
In the present tutorial, the author would like to introduce some robust methods that can be used to compare two independent samples with outliers, unequal sample sizes and/or non-normal distributions. These methods belong to the field of "robust statistics" and are all supported by the WRS2 package in R.
The WRS2 package, developed by Patrick Mair (Harvard University) and Rand Wilcox (University of Southern California) in 2016, provides a comprehensive toolbox for robust statistical analysis. In this tutorial we will focus only on the Yuen test and associated methods, with or without bootstrapping, as alternatives to the t-test.
Context
In this study, we use the "Heart disease" dataset, which includes clinical data from more than 700 patients from 4 cardiology centers (Cleveland, Budapest, Long Beach and Zurich). The goal of the original study was to diagnose heart disease based on 14 explanatory variables, including age, sex, chest pain type, serum cholesterol, fasting blood sugar, ST slope induced by exercise relative to rest, and maximum heart rate achieved during CPET. The present tutorial, however, will focus only on the cholesterol variable. Our objective is to compare the cholesterol level between patients with and without heart disease.
Materials and methods
The following packages must be installed and loaded in RStudio (some other packages might also be required during the analysis), and we will personalize the aesthetic theme for ggplot2, as sketched below.
library(tidyverse) # for data management
library(WRS2) # Robust statistical methods
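As a hypothetical example of such personalization (the tutorial's own theme code is not shown), one could set a global ggplot2 theme; note that the plots below also apply theme_bw() explicitly.
theme_set(theme_bw()) # assumed global theme; redundant with the per-plot theme_bw() calls below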
Preparing the data
library(tidyverse)
library(magrittr) # provides the %<>% assignment pipe used below
# Read the four center files; "?" marks missing values
va=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()
hu=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.hungarian.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()
sw=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.switzerland.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()
cl=read.table("https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data", sep =",",na.strings="?",strip.white=TRUE, fill = TRUE)%>%as_tibble()
# Stack the four centers and name the 14 variables
df=rbind(va,hu,sw,cl)
names(df)=c("Age","Sex","ChestPain","RestBP","Chol","FBS","RestECG","MaxHR","CPETAgina","Oldpeak","Slope","CA","Thal","Class")
# Drop Slope, CA and Thal, then remove zero-cholesterol rows and missing values
data=df[,-c(11,12,13)]%>%filter(.,Chol!=0)
data=na.omit(data)
# Recode the outcome: 0 = no heart disease, 1-4 = heart disease
data$Class%<>%as.factor()%>%recode_factor(.,`0` = "Negative", `1` = "Positive",`2` = "Positive", `3` = "Positive",`4` = "Positive")
# Keep only RestBP, Chol, MaxHR and Class
df=data[,c(4,5,8,11)]
Data exploration
df
## # A tibble: 661 × 4
## RestBP Chol MaxHR Class
## <dbl> <dbl> <dbl> <fctr>
## 1 140 260 112 Positive
## 2 130 209 127 Negative
## 3 132 218 140 Positive
## 4 142 228 149 Positive
## 5 110 213 99 Negative
## 6 150 236 105 Positive
## 7 160 267 157 Positive
## 8 126 166 140 Negative
## 9 120 220 86 Negative
## 10 170 177 84 Positive
## # ... with 651 more rows
psych::describeBy(df$Chol,df$Class)
## $Negative
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 347 240.06 54.26 233 236.79 47.44 85 564 479 1.11 4.1
## se
## X1 2.91
##
## $Positive
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 314 253.5 60.4 248 248.99 50.41 100 603 503 1.48 5.59
## se
## X1 3.41
##
## attr(,"call")
## by.default(data = x, INDICES = group, FUN = describe, type = type)
The skewness and kurtosis of the cholesterol data suggest a violation of the normality assumption. To verify this, we can use the D'Agostino test as follows:
library(fBasics)
df$Chol%>%split(df$Class)%>%map(~dagoTest(.))
## $Negative
##
## Title:
## D'Agostino Normality Test
##
## Test Results:
## STATISTIC:
## Chi2 | Omnibus: 90.5277
## Z3 | Skewness: 7.1435
## Z4 | Kurtosis: 6.2847
## P VALUE:
## Omnibus Test: < 2.2e-16
## Skewness Test: 9.095e-13
## Kurtosis Test: 3.285e-10
##
## Description:
## Fri Mar 24 22:34:09 2017 by user: Admin
##
##
## $Positive
##
## Title:
## D'Agostino Normality Test
##
## Test Results:
## STATISTIC:
## Chi2 | Omnibus: 115.6027
## Z3 | Skewness: 8.3194
## Z4 | Kurtosis: 6.811
## P VALUE:
## Omnibus Test: < 2.2e-16
## Skewness Test: < 2.2e-16
## Kurtosis Test: 9.691e-12
##
## Description:
## Fri Mar 24 22:34:09 2017 by user: Admin
The D'Agostino test confirms the first assumption violation in our data: cholesterol is not normally distributed in either sub-group, and the violation involves both skewness and kurtosis.
p1=df%>%ggplot(aes(x=Class,y=Chol,fill=Class))+
  geom_boxplot(alpha=0.5)+
  theme_bw()+
  coord_flip()
p2=df%>%ggplot(aes(x=Chol,fill=Class))+
  geom_rug(aes(color=Class))+
  geom_density(alpha=0.5)+
  theme_bw()+
  geom_vline(xintercept = c(233,248),linetype="dashed",color=c("red","blue")) # group medians
library(gridExtra)
grid.arrange(p1,p2,ncol=1)
A simple box plot reveals another assumption violation in the cholesterol data: there are many outliers on both sides. Nevertheless, the medians of the two groups (marked by the red and blue dashed lines) appear to differ. The question is: how can we prove this statistically?
THE YUEN TEST
In 1974, K.K. Yuen introduced a statistical test for comparing two trimmed means. Under the null hypothesis of equal trimmed means, the test statistic follows a t-distribution. The Yuen test is more robust than the conventional t-test in the presence of outliers or unequal variances.
The mean can be trimmed by up to 20% on each side. With no trimming, the Yuen test reduces to Welch's t-test.
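To make this concrete, here is a minimal base-R sketch (for illustration only; this is not the internal code of WRS2) of the quantities the Yuen test is built on: the trimmed means, the Winsorised variances and the test statistic itself. Up to its sign, the result should match the statistic reported by yuen() below.
# Ingredients of the Yuen test (illustrative sketch, not WRS2 internals)
tr=0.2
x=df$Chol[df$Class=="Negative"]
y=df$Chol[df$Class=="Positive"]
# Trimmed means: discard the lowest and highest 20% of observations
mean(x,trim=tr); mean(y,trim=tr)
# Winsorised variance: clamp the tails to the trimming quantiles instead of discarding them
winvar=function(z,tr=0.2){
  z=sort(z); n=length(z)
  lo=z[floor(tr*n)+1]; hi=z[n-floor(tr*n)]
  var(pmin(pmax(z,lo),hi))
}
# Yuen's statistic: difference of trimmed means over a standard error
# built from the Winsorised variances and the effective (post-trimming) sample sizes
hx=length(x)-2*floor(tr*length(x)); hy=length(y)-2*floor(tr*length(y))
d1=(length(x)-1)*winvar(x,tr)/(hx*(hx-1))
d2=(length(y)-1)*winvar(y,tr)/(hy*(hy-1))
(mean(x,trim=tr)-mean(y,trim=tr))/sqrt(d1+d2)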
The WRS2 package supports the Yuen test via the yuen() function, and also provides a bootstrap version, which is useful when the groups have unequal sample sizes.
For example, we can perform a Yuen test on the 20% trimmed means as follows:
formula="Chol~Class"
library(WRS2)
yuen(formula=formula, data=df,tr = 0.2)
## Call:
## yuen(formula = formula, data = df, tr = 0.2)
##
## Test statistic: 3.3622 (df = 387.19), p-value = 0.00085
##
## Trimmed mean difference: -13.53828
## 95 percent confidence interval:
## -21.4551 -5.6214
In terms of effect size, Algina, Keselman and Penfield (2005) proposed a robust version of the conventional Cohen's d (see the previous tutorial on the Student t-test), based on 20% trimmed means and Winsorised variances. This analysis is supported in WRS2 by the akp.effect() function. We can apply the same interpretation rule as for Cohen's d: absolute values of 0.2, 0.5 and 0.8 correspond to small, medium and large effects, respectively. When the argument EQVAR is set to FALSE, the function applies the Algina (2005) method, which calculates two effect sizes: one with group 1's Winsorised variance in the denominator and one with group 2's.
akp.effect(formula, data=df, EQVAR = FALSE, tr = 0.2)
## [1] -0.2888242 -0.2703434
The results suggest that the 20% trimmed means differed significantly between patients with and without heart disease (95% CI: -21.45 to -5.62; p=0.00085; small effect size), i.e. patients with heart disease had a significantly higher trimmed cholesterol mean than the control group.
One might ask: how much of the data can be trimmed? The authors suggest that the trimming level can vary from 0 (no trimming) to 0.2 (20% per side). To determine the optimal trimming level for our data, we will run a simulation experiment as follows:
# Yuen test and robust effect size across trimming levels from 0 to 25%
trim=seq(0,0.25,by=0.01)
pval=rep(NA,length(trim))
dif=rep(NA,length(trim))
LL=rep(NA,length(trim))
UL=rep(NA,length(trim))
Effect1=rep(NA,length(trim))
Effect2=rep(NA,length(trim))
for (i in 1:length(trim)) {
  tr=trim[i]
  yt=yuen(formula=formula, data=df,tr=tr)
  ef=akp.effect(formula, data=df, EQVAR = FALSE, tr = tr)
  pval[i]=yt$p.value
  dif[i]=yt$diff
  LL[i]=yt$conf.int[1]
  UL[i]=yt$conf.int[2]
  Effect1[i]=ef[1]
  Effect2[i]=ef[2]
}
ytdf=cbind.data.frame(dif,LL,UL,pval,Effect1,Effect2)%>%as_tibble()%>%mutate(.,Trim=trim)
ytdf$Trim=as.factor(ytdf$Trim)
ytdf%>%ggplot(aes(x=Trim))+
  geom_errorbar(aes(ymax=UL,ymin=LL),size=1,color="grey")+
  geom_path(aes(y=dif,group=1),size=1,color="red4")+
  geom_point(aes(y=dif),shape=21,size=3,fill="red",color="black",show.legend = F)+
  theme_bw()
ytdf%>%ggplot(aes(x=Trim))+
  geom_path(aes(y=pval,group=1),show.legend = F,size=1,color="purple")+
  geom_point(aes(y=pval),shape=21,size=3,fill="gold",color="black",show.legend = F)+
  theme_bw()
ytdf%>%ggplot(aes(x=Trim))+
  geom_path(aes(y=Effect1,group=1),size=1,color="red3")+
  geom_point(aes(y=Effect1),shape=21,size=3,fill="red",color="black")+
  geom_path(aes(y=Effect2,group=1),size=1,color="blue3")+
  geom_point(aes(y=Effect2),shape=21,size=3,fill="blue",color="black")+
  theme_bw()
Our simulation uses a loop in R to compute the Yuen test outcome at each of 26 trimming levels from 0% to 25%.
The graphical results indicate that the trimmed mean difference changed little as the trimming level increased from 0% to 25%: its value stayed around -12.5, with a 95% CI of roughly -20 to -5. The p-value gradually dropped once the trimming level exceeded 9%, and p-values were consistently below 0.05, indicating sufficient evidence to reject the null hypothesis at every level. The effect sizes were steady for trimming levels between 0 and 10% but increased as the trimming level rose from 12% to 21%. Taken together, the optimal trimming level for our data appears to be 20 to 21%: it yields a consistently high mean contrast, the lowest p-values and the largest effect sizes.
YUEN TEST WITH BOOTSTRAPPING
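The yuenbt() function performs the same comparison but calibrates the test by bootstrap resampling within each group (here 1000 resamples) rather than relying on the t-distribution: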
yuenbt(formula, data=df, tr = 0.2, nboot = 1000)
## Call:
## yuenbt(formula = formula, data = df, tr = 0.2, nboot = 1000)
##
## Test statistic: -3.344 (df = NA), p-value = 0.001
##
## Trimmed mean difference: -13.53828
## 95 percent confidence interval:
## -21.4854 -5.5911
yuen.effect.ci(formula, data=df, tr = 0.2,nboot = 1000,alpha = 0.05)
## $effsize
## [1] 0.1878498
##
## $CI
## [1] 0.08597846 0.28624654
Note: when Algina's two effect sizes do not lead to the same conclusion, we can use an alternative method proposed by Wilcox and Tian (2011): an explanatory measure of effect size, supported by the yuen.effect.ci() function. The thresholds for interpreting Wilcox's effect size are 0.1 (small), 0.3 (medium) and 0.5 (large).
We can also replicate the same simulation experiment with these two functions, with the number of bootstrap iterations fixed at 1000:
# Bootstrapped Yuen test and effect size across the same trimming levels
trim=seq(0,0.25,by=0.01)
pval=rep(NA,length(trim))
dif=rep(NA,length(trim))
LL=rep(NA,length(trim))
UL=rep(NA,length(trim))
Effect=rep(NA,length(trim))
EffUL=rep(NA,length(trim))
EffLL=rep(NA,length(trim))
bootnum=1000
for (i in 1:length(trim)) {
  tr=trim[i]
  yt=yuenbt(formula, data=df, tr = tr, nboot = bootnum)
  yef=yuen.effect.ci(formula, data=df, tr = tr,nboot = bootnum,alpha = 0.05)
  pval[i]=yt$p.value
  dif[i]=yt$diff
  LL[i]=yt$conf.int[1]
  UL[i]=yt$conf.int[2]
  Effect[i]=yef$effsize
  EffUL[i]=yef$CI[2]
  EffLL[i]=yef$CI[1]
}
ytdf=cbind.data.frame(dif,LL,UL,pval,Effect,EffLL,EffUL)%>%as_tibble()%>%mutate(.,Trim=trim)
ytdf$Trim=as.factor(ytdf$Trim)
ytdf%>%ggplot(aes(x=Trim))+
  geom_errorbar(aes(ymax=UL,ymin=LL),size=1,color="grey")+
  geom_path(aes(y=dif,group=1),size=1,color="red4")+
  geom_point(aes(y=dif),shape=21,size=3,fill="red",color="black",show.legend = F)+
  theme_bw()
ytdf%>%ggplot(aes(x=Trim))+
  geom_errorbar(aes(ymax=EffUL,ymin=EffLL),size=1,color="grey")+
  geom_path(aes(y=Effect,group=1),size=1,color="blue4")+
  geom_point(aes(y=Effect),shape=21,size=3,fill="blue",color="black",show.legend = F)+
  theme_bw()
ytdf%>%ggplot(aes(x=Trim))+
  geom_path(aes(y=pval,group=1),show.legend = F,size=1,color="purple")+
  geom_point(aes(y=pval),shape=21,size=3,fill="gold",color="black",show.legend = F)+
  theme_bw()
With bootstrapping, the effect size was more stable, and the p-value no longer followed a clear pattern, though it tended to be lower than for the Yuen test without bootstrap. The lowest p-value was obtained at a trimming level of 20%.
QUANTILE COMPARISON
We might also consider a more robust measure of location that can be applied to any quantile, as proposed by Harrell and Davis (1982). The corresponding function in the WRS2 package is qcomhd(). Multiple quantiles can be compared at once: instead of setting a single null hypothesis on the mean or the median, the method replicates the null-hypothesis test over multiple quantiles. This way, we no longer need to worry about the normality assumption.
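To see what qcomhd() estimates in each group, here is a minimal base-R sketch of the Harrell-Davis estimator (for illustration only; hd_quantile is a hypothetical helper, not part of WRS2): the q-th quantile is estimated by a weighted sum of the order statistics, with weights taken from a Beta distribution.
# Harrell-Davis estimator (sketch): Beta-weighted sum of order statistics
hd_quantile=function(x,q=0.5){
  x=sort(x[!is.na(x)])
  n=length(x)
  a=(n+1)*q; b=(n+1)*(1-q)
  w=pbeta((1:n)/n,a,b)-pbeta((0:(n-1))/n,a,b) # weight of each order statistic
  sum(w*x)
}
# Estimated median cholesterol in each group
df$Chol%>%split(df$Class)%>%map_dbl(~hd_quantile(.,q=0.5))
We now run the actual comparison over nine quantiles: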
QC=qcomhd(formula, data=df, q = c(0.025,0.05,0.1,0.25,0.5,0.75,0.9,0.95,0.975), nboot = 1000, alpha = 0.05, ADJ.CI = TRUE)
QCdf=QC$partable
QCdf%>%ggplot(aes(x=q))+
  geom_errorbar(aes(ymax=ci.up,ymin=ci.low),size=1,color="grey")+
  geom_path(aes(y=`est1-est.2`,group=1),size=1,color="blue4")+
  geom_point(aes(y=`est1-est.2`),shape=21,size=3,fill="blue",color="black",show.legend = F)+
  theme_bw()
QCdf%>%ggplot(aes(x=q))+
  geom_path(aes(y=p.crit,group=1),size=1,color="red3")+
  geom_point(aes(y=p.crit),shape=21,size=3,fill="red",color="black")+
  geom_path(aes(y=p.value,group=1),size=1,color="blue3")+
  geom_point(aes(y=p.value),shape=21,size=3,fill="blue",color="black")+
  theme_bw()+
  scale_y_continuous(breaks=c(0,0.01,0.02,0.05,0.1,0.2,0.3))
QC$partable
## q n1 n2 est1 est2 est1-est.2 ci.low ci.up
## 1 0.025 347 314 150.5298 161.5751 -11.045370 -32.68379 11.256361
## 2 0.050 347 314 165.5248 172.4942 -6.969357 -22.71261 8.947613
## 3 0.100 347 314 181.9048 187.8101 -5.905309 -17.39914 5.372083
## 4 0.250 347 314 205.3756 215.2149 -9.839321 -20.42474 1.336347
## 5 0.500 347 314 232.8511 248.6772 -15.826183 -29.88414 -1.137301
## 6 0.750 347 314 269.3097 283.6409 -14.331199 -27.94715 -1.222559
## 7 0.900 347 314 307.1308 321.9023 -14.771486 -35.70154 9.194554
## 8 0.950 347 314 330.4469 347.6793 -17.232380 -66.32777 16.320784
## 9 0.975 347 314 361.1796 396.8749 -35.695315 -127.08273 30.020520
## p.crit p.value
## 1 0.025000000 0.272
## 2 0.012500000 0.228
## 3 0.050000000 0.306
## 4 0.007142857 0.020
## 5 0.006250000 0.006
## 6 0.005555556 0.008
## 7 0.008333333 0.102
## 8 0.010000000 0.148
## 9 0.016666667 0.224
The quantile comparison shows that only the null hypothesis on the median yields a positive result (difference in estimated medians: -15.83; 95% CI: -29.88 to -1.14; p=0.006, below the critical level of 0.00625). The other quartiles (0.25 and 0.75) also show a clear contrast (differences of -9.84 and -14.33), but they are not statistically significant: their p-values (0.020 and 0.008) exceed the adjusted critical levels. Thus, a null hypothesis set on the median gives a favorable result.
CONCLUSION
In this tutorial, we have introduced two robust methods for comparing two independent samples when the data violate one or more assumptions, particularly in the presence of outliers. The Yuen test based on trimmed means deals robustly with such problems, and performs even better when combined with bootstrapping. A simulation and/or quantile exploration can help determine the optimal trimming level for the Yuen test. We recommend these robust tests as alternatives to the t-test, rather than the non-parametric tests.
You can also practice these methods on the two remaining variables in the dataset: resting blood pressure and maximum heart rate during the exercise test. Thanks, and see you in the next tutorial.
P.S.: This tutorial is dedicated to Prof. Nguyen Van Tuan, who has inspired me and many generations of medical students to become true data scientists.