Nonparametric Methods for Two Independent Samples in R: A Set of Useful Functions

Author

Andrii Bova

Published

November 1, 2022

Introduction

This tutorial contains some functions from several libraries of the R programming language, which are provided with brief comments. Nonparametric methods are used in one or more cases: the distribution of the dependent variable in the groups is not normal; outliers are present in the data; the residuals of the model are not normally distributed. Nonparametric methods are used with an ordinal or interval scale of the outcome variable.

The median and other quartiles better describe differences between groups with skewed distributions than the mean. The tests aim to detect differences in the central location and differences in the variation in the data. Some nonparametric tests are based on the rank transformation of scales. General linear models can serve as a generalization of nonparametric tests.

The design of a quantitative study, without regard to statistical test type, is based on questions: null vs alternative hypothesis, a sample size of the research, a type I error (alpha) and a type II error (beta), one-tailed test hypothesis or two-tailed test hypothesis, an effect size of a test. Therefore, this tutorial will focus on effect size, sample size, and power.

I identify three approaches to nonparametric analysis.

Nonparametric tests. These are the Wilcoxon-Mann-Whitney test, the Van der Waerden test, the modified Wilcoxon-Mann-Whitney test (the Fligner-Policello test), the Generalized Wilcoxon test (the Brunner-Munzel test), the quantile test.
Permutation tests and bootstrap.
General linear models such as OLS regression on ranks, rank-based regression, and quantile regression.

All examples are based on data from an online survey “Student youth of Kyiv under the corona crisis”. Period: February 24 - April 18, 2021. Participants: 253 students of the National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”. The survey was conducted via Google Forms.

This study attempted a Ukrainian-language adaptation of the the Fear of COVID-19 Scale (FCV-19S) (Ahorsu et al. 2022). FCV-19S includes such items.

I am most afraid of Corona.
It makes me uncomfortable to think about Corona.
My hands become clammy when I think about Corona.
I am afraid of losing my life because of Corona.
When I watch news and stories about Corona on social media, I become nervous or anxious.
I cannot sleep because I’m worrying about getting Corona.
My heart races or palpitates when I think about getting Corona.

Respondents used the five-point Likert scale to answer. Level of Agreement. 1 - strongly disagree, 2 - disagree, 3 - neutral, 4 - agree, 5 - strongly agree.

The index was calculated as the sum of item scores.

In the published article I used Welch’s t-test and the effect size of Cohen’s d to identify the difference in mean scores of the Fear of COVID-19 Scale (FCV-19S) across gender (Bova 2022).
Here I would like to share with esteemed enthusiasts another perspective on analyzing data using nonparametric statistics. All hypotheses were two-sided and tested at the significance level alpha = 0.05.

So let’s proceed with the analysis. Let’s load libraries and datasets and make the necessary transformations.

library(haven)          # read data set
library (statpsych)     # Sample Size of WMV test
                        # CI for a 2-group median difference)
library(MKpower)        # Sample Size of WMV test
library(samplesize)     # sample size of WMV test for ordinal scale
library(ggplot2)        # EDF, histograms
library(DescTools)      # check outliers
library(rempsyc)        # check normality
library(summarytools)   # summary statistics
library(kableExtra)     # tables format
library(ggstatsplot)    # box/violin plots with WMW & ES (rrb)
library(sjstats)        # Mann-Whitney U Test & ES (r)
library(stats)          # WMW test
library(effectsize)     # ES (rrb, WMW odds, Cohen's d)
library(asht)           # WMW test & ES (p)
library(coin)           # Wilcoxon test with Hodges-Lehmann estimation,
                        # van der Waerden test,
                        # Fligner-Killeen & Ansari-Bradley tests
library(rstatix)        # Wilcoxon test & ES (r)
library(robustrank)     # combine WMW and Fligner-Policello test
library(brunnermunzel)  # Brunner-Munzel test & ES (p)
library(wmwpow)         # power of the WMW
library(snpar)          # Quantile test
library(quantreg)       # Quantile regression
library (nptest)        # Studentized Wilcoxon test
library (Rfit)          # Rank-based regression
library(infer)          # bootstrap of CI for difference in the median
library(car)            # Levene's test
library(jtools)         # table for regression model
library(tidyverse)      # data manipulation
library(report)         # reporting of results

setwd("D:/NonparametricTests/")
# Load data set
df <- read_sav("Stud2021.sav")

1. Transforming Data

Let’s calculate the total score for the seven items measured on the Likert scale, which had five gradations each. Higher scores indicate a greater fear of the coronavirus.

# Sum of items

df$FCV <- rowSums(df[(42:48)])

Let’s transform the variable “sex” into the factor variable “gender” with two categories - women and men.

# Gender as factor variable

df$gender <- factor(df$Sex,
levels = c(1, 2),
labels = c("women", "men"))

Let’s calculate the ranks of the additive index.

# Rank of FCV-19S scores

df$rFCV <- rank(df$FCV)

We will select observations separately for women and men. It needs for the Shapiro-Wilk test and the Burner-Munzel test.

# Select variables in new dataset

stud <- df[c("FCV", "gender", "rFCV")]

Divide the scale into 4 equal groups to calculate a sample size with an ordinal outcome variable.

# Categorize the scale based on the 25th, 50th, and 75th percentiles

stud$FCV_rec <- cut(stud$FCV,
include.lowest = FALSE,
right = TRUE, dig.lab = 4,
breaks = c(5, 11, 14, 17, 35))

Table 1 shows cross-tabulation

xtabs(~ gender + FCV_rec, stud) %>%
prop.table(1) %>%
kable(digits = 2) %>%
kable_classic(full_width = F) %>%
kable_styling(font_size = 15)

Table 1: The categorized FCV-19S by gender
	(5,11]	(11,14]	(14,17]	(17,35]
women	0.20	0.27	0.21	0.32
men	0.45	0.24	0.16	0.16

We will make a stratified random sub-sample of 50 respondents. It needs for the exact WMW test.

# Select cases for tests

women <- stud[stud$gender=="women",]$FCV
men <- stud[stud$gender=="men",]$FCV

# Stratifed random sample from a data frame for permutation test

set.seed(1)
strat_50 <- stud %>%
group_by(gender) %>%
sample_n(size = 25)

2. Exploratory Data Analysis

Exploratory data analysis involves visualizing and calculating descriptive statistics for the sample and the two groups. I always start by examining plots. They allow me to see the features of the distribution in the whole sample and the differences between the two groups and then formulate research hypotheses.

2.1. Check Outliers and Normality of Distribution

On the histogram, you can see the right-hand skewness. On the left side are concentrated respondents who had no fear of the coronavirus, and on the right side are respondents who were very afraid of the coronavirus.

On average, Kyiv students had a very low level of fear of the coronavirus. A distribution has a moderate right (or positive) skewness. Three outliers are present. This is consistent with the methodology. It was to be expected that the sample would include a few respondents who had a high fear of the coronavirus. Outliers cannot be excluded from the sample.

Figure 1 shows the histogram of FCV-19S (scores)

# Visualization of outcome


hist(stud$FCV, col = "white", main = NULL,
freq = T, xlab = "FCV-19S (scores)",
ylab = "Freqlency", breaks = "FD")
abline(v = median (stud$FCV),
col = "blue",lwd = 3)
text(x = median(stud$FCV) * 1.3,
y = median(stud$FCV) * 2.7,
paste("Median =", median(stud$FCV)),
col = "black", cex = 1)

# Check outliers

DescTools:: Outlier(stud$FCV, method = "boxplot")

[1] 35 30 32

stud %>% select(FCV, gender) %>%
filter(FCV > 29)

# A tibble: 3 x 2
    FCV gender
  <dbl> <fct> 
1    35 men   
2    30 women 
3    32 women

For men, the values on the scale are concentrated more on the left pole of the scale. On the QQ plot, the distribution of scores in the group of women is close to normal. But the Shapiro-Wilk test reject the null hypothesis. The distribution of FCV-19S scores in both the women and men samples differs from normal.

Figure 2 shows the density plots and histograms of FCV-19S (scores) by gender

# Density plots {rempsyc}
# Freedman-Diaconis rule for bins

range <-25
IQR<-6
n1 <-122
FD_bins<-round((n1^(1/3)*range)/(2*IQR), 0)

nice_density(
data = stud,
variable = "FCV",
group = "gender",
xtitle = "FCV-19S (scores)",
ytitle = "Density (vs. Normal Distribution)",
grid = FALSE,
shapiro = F,
histogram = T,
breaks.auto = F,
bins = FD_bins,
title = NULL ) + 
              theme(axis.title = element_text(size = 14),
              axis.text = element_text(size = 12))

Figure 2: Density plot and histogram of FCV-19S (scores) by gender

?@fig-qq shows QQ plot of FCV-19S by gender

# QQ plots per group {rempsyc}

np<-nice_qq(
data = stud,
variable = "FCV",
group = "gender",
grid = FALSE,
shapiro = F,
title = NULL) + 
  theme(axis.title = element_text(size = 14),
              axis.text = element_text(size = 12))

# Shapiro-Wilk test of normality

shapiro.test(women)


    Shapiro-Wilk normality test

data:  women
W = 0.96726, p-value = 0.002978

shapiro.test(men)


    Shapiro-Wilk normality test

data:  men
W = 0.91103, p-value = 6.286e-07

2.2. Descriptive Analysis of the Data

The percentage of women and men in the sample corresponds to the in the proportions population. The mean (14.09) and median (14) of FCV-19S scores in the sample were equal. The mean, first, second, and third quartiles were higher in the women’s group than in the men’s. The variability is most likely equal. This can be seen from the values of standard deviations and interquartile ranges in the two groups.

Table 2 shows frequency of women and men in the sample

# Frequency table {summarytools}

freq(stud$gender)%>%
            kable(digits = 1) %>%
            kable_classic(full_width = F) %>%
            kable_styling(font_size = 15)

Table 2: Frequency table of women and men in the sample
	Freq	% Valid	% Valid Cum.	% Total	% Total Cum.
women	131	51.8	51.8	51.8	51.8
men	122	48.2	100.0	48.2	100.0
<NA>	0	NA	NA	0.0	100.0
Total	253	100.0	100.0	100.0	100.0

Table 3 shows descriptive statistics of FCV-19S on sample data and by gender

# Descriptive statistics on sample data and by gender {summarytools}

table_sample <-descr(stud$FCV)
tabl_descriptive<- by(data = stud$FCV, INDICES = stud$gender,
FUN = descr)
tabl_descr<-cbind(table_sample,
tabl_descriptive$women, tabl_descriptive$men)

colnames(tabl_descr)[1] ="sample"
colnames(tabl_descr)[2] ="women"
colnames(tabl_descr)[3] ="men"

tabl_descr %>% kable(digits = 2) %>%
              kable_classic(full_width = F) %>%
              kable_styling(font_size = 15)

Table 3: Descriptive statistics of FCV-19S on sample data and by gender
	sample	women	men
Mean	14.09	15.18	12.92
Std.Dev	4.79	4.49	4.85
Min	7.00	7.00	7.00
Q1	11.00	12.00	9.00
Median	14.00	15.00	12.00
Q3	17.00	18.00	16.00
Max	35.00	32.00	35.00
MAD	4.45	4.45	4.45
IQR	6.00	6.00	7.00
CV	0.34	0.30	0.38
Skewness	0.79	0.58	1.19
SE.Skewness	0.15	0.21	0.22
Kurtosis	1.28	0.88	2.49
N.Valid	253.00	131.00	122.00
Pct.Valid	100.00	100.00	100.00

2.3. Visualization of Group Differences

The empirical distribution function plot shows the differences in the sample distributions of the two groups. The blue line is shifted to the left. Men score less on FCV-19S than women. The distribution of scores in the women group dominates the distribution of scores in the man group. It is possible that at the 90th percentile the differences in the scale between the groups are leveled.

Another common visualization of differences between groups is the box plot with the violin plot. With a visual analysis, you can outline not only the approximate shape of the distribution or outliers but also hypothesize about differences in shift, scale, or shape. The graph shows the distribution shapes of two groups of six-number summaries of a set of data. They are minimum, first quartile, median, mean, third quartile, and maximum. The women’s group has a larger median and mean of FCV-19S scores than the men’s group. The graph also contains Wilcoxon statistics, the p-value, rank-biserial correlation coefficient, which shows the strength of the relationship between the two variables.

Figure 3 shows an empirical cumulative distribution plot of FCV-19S (scores) by gender

ggplot(stud, aes(x = FCV)) +
      stat_ecdf(aes(color = gender, linetype = gender),
      geom = "step", size = 1.5) +
      scale_color_manual(values = c("#FF9999", "#3399FF")) +
      labs(y = "f(FCV-19S)") +
      labs(x = "FCV-19S (scores)") +
      theme_minimal() +
      theme(axis.title = element_text(size = 14),
      axis.text = element_text(size = 12))

Figure 3: Empirical cumulative distribution plot of FCV-19S (scores) by gender

Figure 4 shows box plots with violin plots of FCV-19S (scores) by gender

# Box plot with the violin plot with WMW test and rrb {ggstatsplot}

ggbetweenstats(
              data = stud,
              x = gender,
              y = FCV,
              point.args = list(alpha = 0),
              type = "nonparametric",
              centrality.plotting = T,
              results.subtitle = T,
              xlab = "Gender",
              ylab = "FCV-19S (scores)") + coord_flip() +
              theme(
              axis.title = element_text(
              size = 12,
              face = "plain"),
              axis.text = element_text(size = 12),
              plot.subtitle = element_text(size = 11))

Figure 4: box plots with violin plots of FCV-19S (scores) by gender

3. Nonparametric Tests

3.1. Wilcoxon-Mann-Whitney Test

3.1.1.Assumptions of the Wilcoxon-Mann-Whitney Test

The most popular nonparametric test (distribution-free test) to compare outcome variable between two independent groups is the Mann-Whitney U test or the equivalent Wilcoxon rang-sum test. In the Wilcoxon-Mann-Whitney (abbreviated MWW) test the null hypothesis is that, for randomly selected values X and Y from two populations, the probability of X being greater than Y is equal to the probability of Y being greater than X.

Depending on the nature of the data and our a priori assumptions, three hypotheses can be tested using the WMW test: equality of distributions, equality of medians, and stochastic equality (Karch 2021, 3).

The assumptions of the WMW test are:

1. The outcome variable is ordinal or interval.

2. The independent variable is nominal with two classes (dichotomous variable).

3. Observations from the sample must be taken independently and randomly.

Additional the WMW test for testing equivalence of medians.

4. The shapes of the two distributions are identical.

5. Distributions must have equal variances.

Sample sizes may be unequal. The WMW test is often used when the sample size is not very large. There are more than ten functions in R for calculating the WMW criterion, its generalizations, and modifications. Some functions allow us to give both asymptotic and exact estimates of the p-value. For example, the functions “wilcox_test {coin}”, “wilcox.test {stats}” and “brunnermunzel.test {brunnermunzel}” are exact (permutation) versions of the tests. Some functions allow to check as a null hypothesis equality to a certain value of the Mann-Whitney parameter (“wmwTest {asht}”), others are equality to a certain value of the Hodges-Lehmann estimation (“wilcox_test {coin}”, “wilcox_test {rstatix}”).

When testing the hypothesis of equivalence of distributions in the two groups using the WMW test, ranks are used. The order of the means ranks in the two groups is not necessarily the same as the order of the medians or means. For example, the medians in the two distributions will be equal, but the mean ranks will not. The graph shows histograms of rank distributions in the women’s group (on the top) and in the men’s group (on the bottom). The men’s scores are shifted to the left, while the women’s are shifted to the center and right.

Let’s look at the values of average ranks in the two independent groups and in the sample as a whole. The women’s group has a higher average rank than the men’s group.

I supplemented the graph with the Mann-Whitney parameter, its significance, and 95% confidence intervals.

Figure 5 shows the histograms of FCV-19S (ranks) by gender

# Distribution of ranks in two groups {ggplot}

ggplot(stud, aes(x = rFCV))+
      geom_histogram(fill = "white", colour = "black", 
                     bins = 10) +
      theme(panel.background = element_rect(fill = "white"))+
      facet_grid(gender ~ .)+ labs(x = "FCV-19S (ranks)",
      y = "count") + 
      labs(subtitle = 
      "Mann-Whitney est = 0.65, 95% CI [0.59, 0.72]") +
      theme(
      axis.title = element_text(
      size = 14,
      face = "plain"),
      axis.text = element_text(size = 12),
      plot.subtitle = element_text(size = 11))

Figure 5: Histograms of FCV-19S (ranks) by gender

3.1.2. Effect Size of Wilcoxon-Mann-Whitney Test

The practical significance of differences between distributions is shown by such measures of effect size.

The standardized U statistics divided by the total number of observations or the Rosenthal correlation coefficient (r);
The Rank-biserial correlation coefficient (rrb);
The Mann-Whitney parameter (p) or AUC;
The Wilcoxon-Mann-Whitney odds;
The Hodges-Lehmann estimation.

Correlation coefficients measure the strength of the relationship between variables. For correlation coefficients, there is a rule an empirical rule for interpreting the effect size according to Cohen (rrb = 0.1 - small, rrb = 0.3 - medium, rrb = 0.5 - large effect size). The probability of superiority quantifies the stochastic dominance of one distribution over another. The Mann-Whitney parameter, the Wilcoxon-Mann-Whitney odds and the rank-biserial correlation are related. From the value of one, you can calculate the values of the others. The Hodges-Lehmann estimator is used to estimate the location shift. This is essentially the median of the distances between each pair of numbers from the two samples.

Note that tests can test the hypothesis about the magnitude of the shift (the rank-biserial correlation coefficient, the Mann-Whitney parameter, and the Hodges-Lehmann estimator). After all, the researcher is primarily interested not in the question of the equality of the difference between two values to zero, but in the difference from the value that has at least minimal practical value.

When interpreting the effect, it is important to consider which category is the reference. The order of the groups being compared has an impact on the direction of the relationship, the value of the probability index, or the sign of the Hodges-Lehmann estimator.

This variant of the WMW test gives an r as effect size.

### WMW with ES
### Two-sample Mann-Whitney U test with r {sjstats}

mwu(stud, FCV, gender)


# Mann-Whitney-U-Test

Groups 1 = women (n = 131) | 2 = men (n = 122):
  U = 19099.000, W = 10453.000, p < .001, Z = 4.243
  effect-size r =   0.267
   rank-mean(1) = 145.79
   rank-mean(2) = 106.82

Calculation r from standardized U statistics

round(4.243/sqrt(253),3)

[1] 0.267

According to the results of the analysis of the survey data WMW test was statistically significant. It is possible to reject the null hypothesis at the chosen level of significance (alpha = 0.05) and accept the alternative hypothesis. The distributions of FCV-19S score in the groups are different. Women, in general, had a higher score on the scale than men. The effect was positive and medium (rrb = 0.31).

Table 4 shows the rank-biserial correlation coefficient

# rank-biserial correlation {effectsize}

rank_biserial(FCV ~ gender, data=stud)%>%
              kable(digits = 2) %>%
              kable_classic(full_width = F)%>%
              kable_styling(font_size = 15)

Table 4: Rank-biserial correlation coefficient with 95% CI
r_rank_biserial	CI	CI_low	CI_high
0.31	0.95	0.17	0.43

# Effect size of rrb 
# rrb = 0.1 - small, rrb = 0.3 - medium, rrb = 0.5 - large

In addition, we show the probabilistic index calculation and odds (Grissom and Kim 2012). In the literature, it may be referred to by the letters “f”, “phi” or “p”. If p = 0.5, then there is an equal probability of FCV scores in women and men. Based on the results, the probability that a randomly selected woman would have a higher FCV-19S score than a randomly selected man would be 0.154.

### Mann-Whitney parameter

sizep <- function(U, n1, n2) { f <- U / (n1 * n2)}  

U <- 10453; n1 <- 131; n2 <- 122

p_women <-round(sizep(U, n1, n2),3)
p_women

[1] 0.654

p_men <- 1 - p_women
p_men

[1] 0.346

Table 5 shows WMW odds

# WMW odds {effectsize}

wmw_odds(FCV ~ gender, data=stud) %>%
        kable(digits = 2) %>%
        kable_classic(full_width = F)%>%
        kable_styling(font_size = 15)

Table 5: WMW odds with 95% CI
WMW_odds	CI	CI_low	CI_high
1.89	0.95	1.42	2.51

# Calculation WMW odds

round((p_women/p_men), 3)

[1] 1.89

This variant of the WMW test shows the Mann-Whitney parameter with 95% CI (Fay and Malinovsky 2018).

### WMW test with CI on Mann-Whitney Parameter {asht}

wmwTest(FCV ~ gender,
        data = stud,
        alternative = "two.sided",
        method = "asymptotic")


    Wilcoxon-Mann-Whitney test with continuity correction (confidence
    interval requires proportional odds assumption, but test does not)

data:  FCV by gender
Mann-Whitney estimate = 0.34595, tie factor = 0.99508, p-value =
2.21e-05
alternative hypothesis: two distributions are not equal
95 percent confidence interval:
 0.2831733 0.4159026
sample estimates:
Mann-Whitney estimate 
            0.3459517

3.1.3. Hodges-Lehmann Estimation

Another way to estimate the location shift is the Hodges-Lehmann estimator. The Hodges-Lehmann estimation is not the median difference but rather the median of the difference between a sample from x and a sample from y. Let us first calculate the Hodges-Lehmann estimate of location shift for two-sample data.

### Calculation of pairwise differences

diff <- outer(women,men,"-")
quantile(diff, probs=c(.25, .50, .75))

25% 50% 75% 
 -2   2   7

Figure 6 shows distribution pairwise differences and its median

hist (diff, col = "white", main = " ",
      xlab = "Pairwise differences distribution",
      ylab = "Freqlency", breaks = "FD")
      abline(v = 2, col = "blue", lwd = 3)
      text(x = -3.7*5 , y = 17*50,
      paste("W = 10453, p < 0.001,
      Diff in Location = 2, 95% CI [1, 4]"),
      col = "black", cex = 1)

Figure 6: Distribution of the pairwise differences

WMW test with Hodges-Lehmann Estimation and the report on the effect size.

# WWW test with Hodges-Lehmann Estimation  {stats}
# reporting WMW test wit ES (rrb) {report}

wt<-stats::wilcox.test(stud$FCV~stud$gender,
                               conf.int = T)
wt


    Wilcoxon rank sum test with continuity correction

data:  stud$FCV by stud$gender
W = 10453, p-value = 2.21e-05
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 1.000029 3.999974
sample estimates:
difference in location 
              2.000053

report (wt)

Effect sizes were labelled following Funder's (2019) recommendations.

The Wilcoxon rank sum test with continuity correction testing the difference in
ranks between stud$FCV and stud$gender suggests that the effect is positive,
statistically significant, and large (W = 10453.00, p < .001; r (rank biserial)
= 0.31, 95% CI [0.17, 0.43])

3.1.4. Difference in Medians

Table 6 shows difference in medians

### CI for a 2-group median difference {statpsych}

ci.median2(.05, women, men)%>%
          kable(digits = 2) %>%
          kable_classic(full_width = F)%>%
          kable_styling(font_size = 15)

Table 6: Difference in Medians with 95% CI
Median1	Median2	Median1-Median2	SE	LL	UL
15	12	3	0.86	1.31	4.69

The median of the difference is 2, and difference of population medians on FCV-19S scores is 3.

3.1.5. Sample Size and Power of the Wilcoxon-Mann-Whitney Test

There are several libraries in R that allow you to calculate the necessary sample size based on the estimated effect size and test power for a variable that is an interval or ordinal. In the study, I planned to achieve a moderate effect size. This is typical in most psychological research. According to a meta-analysis of studies for attitudes, the effect size is average (r = 0.29) (Paterson et al. 2016, 8). A rank-biserial correlation coefficient of 0.3 corresponds to the average strength of the relationship of the variables. Using the formula from the rank-biserial coefficient, we obtained the Mann-Whitney parameter (p = 0.65). With significance level alpha = 0.05 and power 0.8, a two-sided hypothesis requires interviewing 59 respondents in each group. A total of 118 respondents.

### Sample size for WMW test
# The sample size for continuous data
# The sample size for a Mann-Whitney test {statpsych}

# Arguments
# alpha alpha level for hypothesis test
# pow desired power
# p planning value of Mann-Whitney parameter
# alpha is 0.05
# power is 0.8

# Effect size of rrb 
# rrb = 0.1 - small, rrb = 0.3 - medium, rrb = 0.5 - large
# p = 0.5 + rb:2
# Planning rb = 0.3 (moderate effect size)
# p = 0.5 + 0.3:2 = 0.5 + 0.15 = 0.65

size.test.mann(.05, .80, .65)

     Sample size per group
[1,]                    59

The smaller the effect is planned to be detected, the larger the sample size required (Brunner, Bathke, and Konietschke 2018, 156).

Based on simulations, 70 respondents in each group should be interviewed. The total is 140 respondents.

# Simulate the empirical power of Wilcoxon rank sum test {MKpower}
rx <- function(n) rnorm(n, mean = 15.18, sd = 4.49)
ry <- function(n) rnorm(n, mean = 12.92, sd = 4.85)

sim.ssize.wilcox.test(
                      rx = rx, ry = ry,
                      n.min = 10, n.max = 100,
                      step.size = 10)


     Wilcoxon rank sum test 

              n = 10, 20, 30, 40, 50, 60, 70, 80
             rx = rnorm(n, mean = 15.18, sd = 4.49)
             ry = rnorm(n, mean = 12.92, sd = 4.85)
      sig.level = 0.05
      emp.power = 0.1550, 0.3024, 0.4333, 0.5424, 0.6412, 0.7284, 0.7915, 0.8451
    alternative = two.sided

NOTE: n is number in *each* group

If the scale were ordinal with four categories and the distribution of respondents presented in the cross-tab above, 48 males and 52 females (100 respondents total) would need to be interviewed.

# The sample size for discrete data {samplesize}

n.wilcox.ord(
            power = .8, alpha = .05,
            t = .52, p = c(.20, .27, .21, .32),
            q = c(.45, .24, .16, .15))

$`total sample size`
[1] 100

$m
[1] 48

$n
[1] 52

According to the results of the analysis, the power of the test is very high (0.99). This is due to the increased sample size (n = 253).

### Power of the WMW test for a continuous outcome {wmwpow}

shiehpow(
          n = 131,
          m = 122,
          p = .65,
          alpha = .05,
          dist = "norm",
          sides = "two.sided")

Distribution: norm
Sample sizes: 131 and 122
p: 0.65
WMW odds: 1.857
sides: Two-sided
alpha: 0.05

Shieh Power: 0.989

3.2. Van der Waerden Test

The van der Waerden normal quantiles test has good asymptotic efficiency when the underlying data are normal. The WMW test and the Van der Warden’s test are the rank homogeneity tests for two independent samples (Bagdonavičius, Kruopis, and Nikulin 2011). Based on van der Waerden test, the null hypothesis should be rejected and the alternative hypothesis accepted.

# Two-Sample van der Waerden test {coin}

normal_test(FCV ~ gender, stud)


    Asymptotic Two-Sample van der Waerden (Normal Quantile) Test

data:  FCV by gender (women, men)
Z = 4.1104, p-value = 3.95e-05
alternative hypothesis: true mu is not equal to 0

3.3. Tests on Homogeneity of Group Variances

Two tests the homogeneity of group variances. Equality of variances is one of the assumptions for meaningful inferences of equality of medians. The Fligner-Killeen test and the Ansari-Bradley test showed equal variances in FCV-19S scores in the two groups.

# Fligner-Killeen test of homogeneity of variances {coin}

fligner_test(FCV ~ gender, stud,
            distribution = "asymptotic")


    Asymptotic Two-Sample Fligner-Killeen Test

data:  FCV by gender (women, men)
Z = -0.1808, p-value = 0.8565
alternative hypothesis: true ratio of scales is not equal to 1

# Ansari-Bradley two-sample test for a difference in scale parameters
# for possibly tied observations {coin}

ansari_test(FCV ~ gender, stud,
            distribution = "asymptotic")


    Asymptotic Two-Sample Ansari-Bradley Test

data:  FCV by gender (women, men)
Z = 1.1599, p-value = 0.2461
alternative hypothesis: true ratio of scales is not equal to 1

It is visually evident that the shapes of distributions in the two groups are different. Therefore, it is reasonable to use modifications or generalizations of the WMW test. The Fligner-Policello test is used when the distributions have different shapes and unequal variances. It assumes that the distributions are symmetric (Fong and Huang 2019). The Bruner-Munzel test is used when observed heterogeneity in variances or the distributions are asymmetrical as an analog of the Welch t-test for the parametric procedure (Karch 2021, 11).

Based on the tests, the null hypothesis should be rejected and the alternative hypothesis accepted.

3.4. Nonparametric Tests with Unequal Variance

# Fligner-Policello test


mod.wmw.test(women, men, method="fp")

[1] 7.860652e-06

# Brunner-Munzel test {brunnermunzel}

brunnermunzel.test(women, men)


    Brunner-Munzel Test

data:  women and men
Brunner-Munzel Test Statistic = -4.479, df = 237.92, p-value =
1.165e-05
95 percent confidence interval:
 0.2781975 0.4137059
sample estimates:
P(X<Y)+.5*P(X=Y) 
       0.3459517

The quantile test allows us to compare distributions not only by the third quartile, but also by any of the quantiles. For example, in the 0.90 quantile we cannot reject the null hypothesis about equal the FCV-19S scores in the two groups.

### Quantile test {snpar}

quantile(men,probs = .90)

 90% 
18.9

quantile(women,probs = .90)

90% 
 20

quant.test(men, women, p = .90)


    Approximate two-sample (Brown-Mood) quantile test

data:  men and women
Difference = -1.1, p-value = 0.4582
alternative hypothesis: true location shift is not equal to 0

4. Permutation Tests and Bootstrap

Permutation tests are appropriate for calculating p-values in small samples. Bootstrap is less sensitive to unequal variance.

Let’s run the WMW test on the whole sample and the subsample. In the first case Monte Carlo resampling is used, in the second one exact distribution is used. In the first case, we should reject the null hypothesis and accept the alternative hypothesis, and in the second case, the null hypothesis cannot be rejected. Apparently, the sample of 50 respondents is insufficient to detect differences in the FCV-19S scores of the two groups.

### Wilcoxon test with the Hodges-Lehmann estimation 
###  by Monte Carlo simulation {coin}

options(scipen = 999)

# n = 253

coin:: wilcox_test(FCV ~ gender, stud,
       alternative = "two.sided",
       distribution = "approximate",
       conf.int = T)


    Approximative Wilcoxon-Mann-Whitney Test

data:  FCV by gender (women, men)
Z = 4.2434, p-value < 0.0001
alternative hypothesis: true mu is not equal to 0
95 percent confidence interval:
 1 4
sample estimates:
difference in location 
                     2

# Exact Wilcoxon test with Fisher-Pitman permutation test
# n = 50

library (coin)
coin::wilcox_test(FCV ~ gender, strat_50,
                  alternative = "two.sided",
                  distribution = "exact",
                  conf.int = T)


    Exact Wilcoxon-Mann-Whitney Test

data:  FCV by gender (women, men)
Z = 1.6647, p-value = 0.0972
alternative hypothesis: true mu is not equal to 0
95 percent confidence interval:
 0 5
sample estimates:
difference in location 
                     2

Resampling can be used for permutation tests, and bootstrap can be used to calculate location shift or confidence intervals of the WMW statistics or r.

# WMW test
options(scipen = 999)
rstatix::wilcox_test(stud, FCV ~ gender)%>%
                      kable(digits = 5) %>%
                      kable_classic(full_width = F)%>%
                      kable_styling(font_size = 15)

.y.	group1	group2	n1	n2	statistic	p
FCV	women	men	131	122	10453	0.00002

Table 7 shows effect size for U Mann-Whitney test

### Confidence intervals for r 
### based on 1000 resamplings {rstatix}

rstatix :: wilcox_effsize(
                      stud, FCV ~ gender,
                      ci = T, conf.level = 0.95,
                      ci.type = "perc",
                      nboot = 1000) %>%
                      kable(digits = 2) %>%
                      kable_classic(full_width = F)%>%
                      kable_styling(font_size = 15)

Table 7: Effect size for U Mann-Whitney test with 95% CI
.y.	group1	group2	effsize	n1	n2	conf.low	conf.high	magnitude
FCV	women	men	0.27	131	122	0.15	0.38	small

Studentized Wilcoxon test with resamples for the permutation test

Bootstrap can also be used to estimate the CI of difference in two medians.

# Confidence intervals of difference in medians {infer}

set.seed(1)
diff_med_ci <- stud %>%
                specify(FCV ~ gender) %>%
                generate(reps = 5000, type = "bootstrap") %>%
                calculate("diff in medians", order = c("women", "men"))
                diff_med_ci %>%
                summarize(
                l = quantile(stat, .025),
                u = quantile(stat, .975)
                )

# A tibble: 1 x 2
      l     u
  <dbl> <dbl>
1     1     4

5. General Linear Models

General linear models are another way to show differences between groups.

«Nonparametric Mann-Whitney-Wilcoxon test is not recommended under any circumstances. Among the disadvantages of this test are that (a) the critical values are not extensively tabled, (b) tied ranks can affect the results and no optimal procedure has yet been developed (Wilcox, 1986), and (c) Type I error appears to be inflated regardless of the status of the assumptions (Zimmerman, 2003)» (Hahs-Vaughn and Lomax 2020, 235).

Let’s look again at the mean ranks and standard deviations in the groups.

Table 8 shows mean and standard deviation of FCV ranks by gender

stud %>%
        group_by(gender) %>%
        summarise_at(vars(rFCV),
        list("ranks mean" = mean, "ranks sd" = sd)) %>%
        kable(digits = 2) %>%
        kable_classic(full_width = F)%>%
        kable_styling(font_size = 15)

Table 8: Mean and standard deviation of the FCV-19S (ranks) by gender
gender	ranks mean	ranks sd
women	145.79	67.84
men	106.82	73.21

Let’s run the OLS regression with the dependent variable of the FCV-19S converted to ranks. The reference category is women. The intercept in the regression equation is the mean rank of the FCV-19S score in the women’s group. If we subtract the value of the regression coefficient from the intercept, we obtain the mean rank in the men’s group. The regression coefficient shows the difference between the mean ranks. Compared to women, the effect is negative and statistical significant.

Table 9 shows OLS regression model

### OLS regression for rank difference {stats}

fit_lm <- lm(rFCV ~ gender, data = stud)
summ(fit_lm, digits = 3)

Table 9: Gender as prdictor of FCV-19S (ranks), OLS regression
Observations	253
Dependent variable	rFCV
Type	OLS linear regression

F(1,251)	19.315
R	0.071
Adj. R	0.068

	Est.	S.E.	t val.	p
(Intercept)	145.794	6.158	23.675	0.000
gendermen	-38.974	8.868	-4.395	0.000
Standard errors: OLS

The effect was medium (d = 0.55).

# Effect size of Cohen's d
# d = 0.2 - small, d = 0.5 - medium, d = 0.8 - large
# Cohen's d for ranks FCV{effectsize}

cohens_d(rFCV ~ gender, data = stud)%>%
        kable(digits = 2) %>%
        kable_classic(full_width = F) %>%
        kable_styling(font_size = 15)

.y.	group1	group2	effsize	n1	n2	magnitude
rFCV	women	men	0.55	131	122	moderate

If the dependent variable is ordinal you can use rank-based regression.

### Rank-based estimates of regression coefficients {Rfit}

fit_rfit <- rfit(FCV ~ gender, data = stud)
summary (fit_rfit)

Call:
rfit.default(formula = FCV ~ gender, data = stud)

Coefficients:
            Estimate Std. Error t.value               p.value    
(Intercept) 15.00000    0.37035 40.5023 < 0.00000000000000022 ***
gendermen   -2.00000    0.55373 -3.6118             0.0003671 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multiple R-squared (Robust): 0.06997226 
Reduction in Dispersion Test: 18.88442 p-value: 0.00002

Quantile regression allows finding differences in groups by quantiles. The findings are consistent with the quantile test. In the 90th of the FCV-19S quantile, scores are equivalent in the women’s and men’s groups.

Table 10 shows quantile linear regression model

### Quantile linear regression {quantreg}

# Compare 90th percentile using quantile linear regression

fit_rq <- rq(FCV~gender,data=stud,tau=.90,method='fn')
summ(fit_rq,digits = 3)

Table 10: Gender as prdictor of FCV-19S (ranks), quantile regression
Observations	253
Dependent variable	FCV
Type	Quantile regression
Quantile (tau)	0.9
Method	Frisch-Newton

	Est.	S.E.	t val.	p
(Intercept)	20.000	0.719	27.828	0.000
gendermen	-1.000	1.226	-0.816	0.415
Standard errors: Sandwich (Huber)

Conclusions

For skewed distributions that contain outliers, it is better to use nonparametric methods. They allow you to compare not only medians, but also other quantiles in groups. This can be important for solving practical tasks. In social studies, ordinal scales are widely used, for which nonparametric methods of analysis are more adequate. In some cases, particularly in the present study, outliers are important data (perhaps the most important). They cannot be excluded from the sample or winsorization.

The tutorial shows how to test the hypothesis of stochastic equivalence of distributions, the equivalence of variation in two groups, the calculation of sample size, and the estimation statistical power of the test. Different measures of effect size are given. The results of the WMW test can be interpreted both in a probabilistic sense and in a correlation sense.

Data from a survey of students in Kyiv by using the FCV-19S in a group of women and men were analyzed using nonparametric tests and linear models. Women had a higher score on this scale than men. The effect was positive, statistically significant and medium. In the group with a very high level of fear of the coronavirus, there are no differences in scores among the two genders.

References

Ahorsu, Daniel Kwasi, Chung-Ying Lin, Vida Imani, Mohsen Saffari, Mark D. Griffiths, and Amir H. Pakpour. 2022. “The Fear of COVID-19 Scale: Development and Initial Validation.” International Journal of Mental Health and Addiction 20 (3): 1537–45. https://doi.org/10.1007/s11469-020-00270-8.

Bagdonavičius, V., Julius Kruopis, and M. S. Nikulin. 2011. Non-Parametric Tests for Complete Data. London ; Hoboken, NJ: ISTE/Wiley.

Bova, Andrii. 2022. “The Fear of COVID-19: Gender Differences Among Kyiv Students [in Ukrainian].” Вісник Науки Та Освіти, no. 3(3) (October). https://doi.org/10.52058/2786-6165-2022-3(3)-244-257.

Brunner, Edgar, Arne C. Bathke, and Frank Konietschke. 2018. Rank and Pseudo-Rank Procedures for Independent Observations in Factorial Designs: Using R and SAS. Springer Series in Statistics. Cham: Springer International Publishing. https://doi.org/10.1007/978-3-030-02914-2.

Fay, Michael P., and Yaakov Malinovsky. 2018. “Confidence Intervals of the Mann-Whitney Parameter That Are Compatible with the Wilcoxon-Mann-Whitney Test.” Statistics in Medicine 37 (27): 3991–4006. https://doi.org/10.1002/sim.7890.

Fong, Youyi, and Ying Huang. 2019. “Modified Wilcoxon-Mann-Whitney Test and Power Against Atrong Null.” The American Statistician 73 (1): 43–49. https://doi.org/10.1080/00031305.2017.1328375.

Grissom, Robert J., and John J. Kim. 2012. Effect Sizes for Research: Univariate and Multivariate Applications. 2nd ed. New York: Routledge.

Hahs-Vaughn, Debbie L., and Richard G. Lomax. 2020. An Introduction to Statistical Concepts. Fourth edition. New York, NY: Routledge, Taylor & Francis Group.

Karch, Julian D. 2021. “Psychologists Should Use Brunner-Munzel’s Instead of Mann-Whitney’s Test as the Default Nonparametric Procedure.” Advances in Methods and Practices in Psychological Science 4 (2): 251524592199960. https://doi.org/10.1177/2515245921999602.

Paterson, Ted A., P. D. Harms, Piers Steel, and Marcus Credé. 2016. “An Assessment of the Magnitude of Effect Sizes: Evidence from 30 Years of Meta-Analysis in Management.” Journal of Leadership & Organizational Studies 23 (1): 66–81. https://doi.org/10.1177/1548051815614321.

Citation

BibTeX citation:

@online{bova2022,
  author = {Andrii Bova},
  editor = {},
  title = {Nonparametric {Methods} for {Two} {Independent} {Samples} in
    {R:} {A} {Set} of {Useful} {Functions} (Tutorial)},
  date = {2022-11-01},
  url = {https://rpubs.com/abova/nptest},
  langid = {en}
}

For attribution, please cite this work as:

Andrii Bova. 2022. “Nonparametric Methods for Two Independent Samples in R: A Set of Useful Functions (Tutorial).” November 1, 2022. https://rpubs.com/abova/nptest.