Statistical tests for differences between groups

General Testing

If our sample is random, we know the distribution of specific sample characteristics. Hence, if and only if our sample is random, we know the probability for the random occurrence of these characteristics, such as differences in means of subsamples. If the probability for a variation being random is very low (say below 5%), it seems unlikely that our sample is random. Thus, in addition to mere random effects, at least one systematic effect exists that accounts for this variation.

General process for hypothesis testing:

Define H₀ (null-hypothesis): “All variation we observe in our sample is due to random effects. It doesn’t exist in the population.” The alternative hypothesis H_A claims everything not included in H₀.
Define the maximum accepted probability of making an error if the null-hypothesis is rejected. This is the significance level \(\alpha\), i.e. the probability of rejecting H₀ although H₀ is true (of making an \(\alpha\)-error).
Calculate the significance \(p\), i.e. the probability that the observed (or an even larger) variation occurs under the condition of H₀.
If \(p\) is less than the significance level \(\alpha\), reject H₀, i.e. accept H_A. Otherwise, consider H₀ provisionally valid.
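The four steps can be sketched with an exact permutation test, which is not the test used later in this text but makes the idea of "probability under H₀" directly visible: under H₀ every split of the pooled values into two groups is equally likely, so \(p\) is simply the share of splits that produce a difference at least as large as the observed one. The data values and the function name `permutation_p_value` are invented for illustration.

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    count = 0
    # Under H0 every split of the pooled values into groups of
    # these sizes is equally likely.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count


# Hypothetical scores, invented for illustration only.
group_a = [2.0, 3.0, 3.5, 4.0]
group_b = [3.0, 4.5, 5.0, 5.5]

alpha = 0.05                               # step 2: significance level
p = permutation_p_value(group_a, group_b)  # step 3: significance p
decision = "reject H0" if p < alpha else "keep H0 provisionally"  # step 4
print(round(p, 4), decision)
```

Enumerating all splits is only feasible for very small samples; the tests below replace the enumeration with a known theoretical distribution.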

In the following, this process will be outlined in detail by referring to different hypotheses. We work with a small sample (n=50) from ESS8 data, namely the gender of respondent (gndr), self-rated health (health), highest level of education (edlv), tolerance towards gays and lesbians (multi-item scale tolerance), and a weighting variable (w).

t-Test: Test if two population means are different

Prerequisites

Independent variable is binary
Dependent variable is metric and approximately normally distributed
Sample is large rather than small

Null hypothesis

H₀: There is no difference between the group means in the population.

H_A: Group means are different in the population.

General test idea

If two population means do not differ, two samples A and B from this population should have two (more or less) equal means: Mean_A\(\approx\)Mean_B. In other words: (Mean_A - Mean_B)\(\approx\) 0.

If we (theoretically) drew all samples of the same size as A and B, the differences (Mean_A - Mean_B) would be symmetrically distributed with mean 0. We divide these differences by their standard error SE (i.e. the standard deviation of this theoretical distribution of differences) to standardize them. The resulting theoretical distribution is known as the t-distribution.

It describes the probability for the random occurrence of a given difference between two sample means under the condition that both samples were drawn from the same population, i.e. that their difference in the population is actually zero.

In other words, if we assume both samples were drawn from the same population, we can calculate the likelihood for the random occurrence of our sampled difference between two means. Most differences are very close to zero. The more the differences deviate from zero towards negative or positive values, the less likely the respective sample pairs are (the less often they occur when drawn randomly from the same population).

In practice, we define the maximum error probability we are willing to accept in our analysis. This user-defined threshold is called the significance level \(\alpha\). Based on this we can clearly decide whether our sampled values are too unlikely to be drawn randomly. We then calculate the actual probability of drawing our sample randomly from a homogeneous population where no difference between the sampled means can be observed. This probability is called the significance \(p\) of our sample. It can be determined by applying the theoretical t-distribution as a “lookup-table” for the sample-specific t-value.

In social sciences, significance levels of 5% or 1% are quite common. At a significance level of \(\alpha\) = 5% we wrongly reject a true H₀ in one out of every 20 tests on average. However, if you build a nuclear power plant or perform heart surgery, then you should work with considerably smaller significance levels.
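A small simulation can make the meaning of \(\alpha\) tangible. The sketch below repeatedly draws two groups from the same population, so H₀ is true by construction, and counts how often a test nevertheless rejects H₀. To stay within the Python standard library it uses a z-test with known standard deviation instead of the t-test; the function name, sample sizes, and seed are assumptions chosen for illustration.

```python
import random
from statistics import NormalDist


def false_positive_rate(n_tests=2000, n=30, alpha=0.05, seed=1):
    """Share of tests that reject H0 although both groups share one population."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    # Standard error of the mean difference for two groups of size n, sigma = 1.
    se = (2 / n) ** 0.5
    rejections = 0
    for _ in range(n_tests):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        z = (sum(a) / n - sum(b) / n) / se
        p = 2 * (1 - std_normal.cdf(abs(z)))  # two-sided significance
        if p < alpha:
            rejections += 1
    return rejections / n_tests


rate = false_positive_rate()
print(rate)  # should land near alpha = 0.05
```

The share of wrong rejections should be close to the chosen significance level, which is exactly what "maximum accepted error probability" means.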

If the significance is greater than the significance level, there is nothing against the initial assumption that our sample was drawn randomly from a homogeneous population.

If the significance is lower than the significance level, we have to question our initial premise: We made an assumption and must now acknowledge that our result (our sample) is quite unlikely under this assumption. Then we have two options: Either we doubt our own sample (for example, we may have made methodological errors in the sampling process), or we doubt the assumption that led to this result. The former is always reasonable, of course. However, if we find no errors, then we must reject the basic assumption that our sample comes randomly from a homogeneous population. Consequently, we must acknowledge that our sample was not drawn randomly from a homogeneous population, i.e. that the differences in our sample means are not random, that there is a significant difference between the groups in the population, and that group membership has a significant effect on the variable in question.

The initial assumption that everything we observe in our sample is just random is called the null-hypothesis (H₀). In most cases we are interested in the rejection of H₀, because we want to show that our observations are not random.

Note that H₀ cannot be proven but only rejected. In other words, the fact that H₀ cannot be rejected does not necessarily mean that H₀ is true. If a court sends me to jail for murder, it is sufficiently certain that I am guilty. However, if it doesn’t jail me for want of evidence, this doesn’t mean that I am innocent. The court is simply not sure enough to jail me, although I might still be a murderer. The same holds for H₀: If we can’t reject it, it may still be false, i.e. the effect we’re after might still exist.

Example: Does tolerance towards gays and lesbians differ between gender?

##          gndr tolerance        se
## Male     Male  3.588056 0.4195371
## Female Female  3.134684 0.4194148

We observe a clear difference in tolerance levels between males (3.588) and females (3.135) in our sample, indicating that females are less tolerant than males. The difference in means is 3.135 - 3.588 = -.453. Does this difference deviate far enough from zero to say that it is not random? The t-Test provides an answer:

## 
##  Design-based t-test
## 
## data:  tolerance ~ gndr
## t = -0.76425, df = 48, p-value = 0.4485
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
##  -1.6461384  0.7393936
## sample estimates:
## difference in mean 
##         -0.4533724

Although we observe a clear difference in tolerance levels between males and females in our sample, the standardized difference of -.76425 is quite likely to occur randomly (\(p\)=.449). More formally, the probability of randomly drawing two subsamples of males and females that differ by at least .76425 standard errors, although they do not differ in the population, is .449.
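The "lookup" of the t-value in the t-distribution can be retraced numerically. The following sketch integrates the density of the t-distribution with Simpson's rule to obtain the two-sided significance for the reported t-value and degrees of freedom. The helper names `t_pdf` and `two_sided_p` are hypothetical, and the numerical integration is a stand-in for the lookup a statistics package performs internally.

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson integration of the density from 0 to |t|."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3  # probability mass between 0 and |t|
    return 1 - 2 * central   # mass in both tails beyond |t|


# t-value and degrees of freedom as reported in the test output above.
p = two_sided_p(-0.76425, 48)
print(round(p, 4))
```

The result should agree with the reported p-value up to rounding, because both tails beyond ±.76425 are counted.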

As a kind of “cookbook recipe” we can simply argue as follows: If we test at a significance level of \(\alpha\) = 5% = .05, we see that p > \(\alpha\), indicating that there is no significant difference in tolerance between genders in the population.

To illustrate this situation differently: If H₀ is true, a difference at least as large as the one we observed occurs in about 45% of all random samples (p=0.4485). This is a good reason to consider H₀ provisionally valid: The observed differences in our subsamples can easily be explained by simple random variations in the sampling process.

Degrees of freedom

Note that the t-distribution for our test depends on the size of our sample (n=50). More precisely, the shape of the theoretical distribution depends on the number of degrees of freedom (df=50-2=48) of our sample. What does this mean?

Imagine a sample of n=3, e.g. tolerance levels of three individuals, namely 3, 4, and 5. This results in a sample mean of 4 and a standard deviation of about 0.82 (square root of 2/3). We can freely choose who we include in our survey, so all three sample values are free, i.e. we had three degrees of freedom when collecting our sample data.

Now, if we draw conclusions about a population, we must take all parameters into account that are used in our argumentation. As we know that our sample mean is 4 and we freely draw two answers, say 3 and 4, the third answer must be 5 in order to match the constraint that the average of all three answers is 4. If we have a sample of three with a mean of 4, we can freely choose two sample values. Estimating the population mean thus decreases the number of freely varying sample values by one. The same is true for the second parameter: When estimating the standard error we make use of the sample standard deviation. In a sample of three with a given mean and standard error, we can freely choose one element.
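The constraint described above can be checked in a few lines. The sketch fixes the sample mean, chooses two values freely, and derives the third; the variable names are illustrative only.

```python
from statistics import mean, pstdev

sample = [3, 4, 5]
sample_mean = mean(sample)   # 4
sample_sd = pstdev(sample)   # square root of 2/3 (sum of squares divided by n)

# Once the mean is fixed, only n - 1 values can be chosen freely:
# the last value follows from the constraint n * mean = sum of all values.
free_values = [3, 4]
last_value = len(sample) * sample_mean - sum(free_values)

print(sample_mean, round(sample_sd, 4), last_value)
```

Whatever two values are chosen freely, the third one is always determined by the mean, which is why one degree of freedom is lost.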

In general, the degrees of freedom are equal to the sample size minus the number of estimated parameters. Compared to a standard normal distribution, t-distributions have greater standard deviations and are slightly flatter. From about 30 degrees of freedom onwards, the t-distribution closely approximates a normal distribution.

ANOVA: Test if three or more population means are different

Prerequisites

Independent variable is categorical with three or more categories
Dependent variable is metric and approximately normally distributed
Sample is large rather than small

Null hypothesis

H₀: There is no difference between group means in the population, i.e. all group means are equal.

H_A: At least two group means in the population differ.

General test idea

If three or more samples A, B, C … were drawn from a homogeneous population, the sample means should be more or less equal: Mean_A\(\approx\)Mean_B\(\approx\)Mean_C… In contrast to the two-sample case (t-Test) we cannot refer to a single difference when comparing more than two sample means. Thus, we need an alternative approach.

A representation of the data processed by ANOVA could look like this:

When comparing the groups we see that groups A and C overlap widely, while group B has much greater values on average. In addition, B looks more compact, i.e. it has a smaller standard deviation or variance, respectively. This is what the Analysis of Variance looks for.

ANOVA decomposes the total variation of all data points across all groups from the grand mean of all values into two parts:

Variation within groups (blue) is the variation of values from the group means. It describes the proportion of total variance which cannot be traced back to the effect of the grouping variable. This variance is independent of group membership and must be regarded as error variance.
Variation between groups (orange) is the variation of group means from the grand mean (i.e. the mean of the group means in case of equally sized groups). It describes the proportion of total variance which depends on group membership and can thus be regarded as systematic variance.

The question now is which of these variances is greater: within-groups or between-groups variance, error or systematic variance?

Variances can be compared by the F-Test. If we (theoretically) drew all samples of the same size as A, B, C … the quotient (systematic variance)/(error variance) would follow an F-distribution.

F-distributions are more or less right-skewed depending on their degrees of freedom, one for the numerator and a second for the denominator.
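The decomposition into systematic and error variation can be sketched as follows. The function `one_way_anova` and the three small groups are hypothetical and only serve to show that the between-groups and within-groups sums of squares add up to the total variation, and how the F-value is formed from them.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (ss_between, ss_within, f_value) for a dict of name -> values."""
    all_values = [x for values in groups.values() for x in values]
    grand_mean = mean(all_values)
    # Systematic variation: group means around the grand mean.
    ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2
                     for v in groups.values())
    # Error variation: single values around their own group mean.
    ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f_value


# Hypothetical data, invented for illustration only.
groups = {
    "A": [2.0, 3.0, 4.0],
    "B": [6.0, 7.0, 8.0],
    "C": [3.0, 4.0, 5.0],
}
ss_between, ss_within, f_value = one_way_anova(groups)
print(ss_between, ss_within, f_value)
```

Dividing each sum of squares by its degrees of freedom yields the two variances whose quotient is the F-value.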

Example: Does tolerance towards gays and lesbians differ with education?

##          edlv tolerance        se
## low       low  3.795818 0.4020106
## medium medium  3.008001 0.2795561
## high     high  3.132578 0.8115710

##             mean     SE
## tolerance 3.3546 0.3123

We observe a clear difference in tolerance levels between the low educated group (3.796) and the higher educated groups (3.008 and 3.133) in our sample, suggesting that education affects tolerance. ANOVA can be used to test the null hypothesis “All group means are equal in the European population”.

##              Df Sum Sq Mean Sq F value Pr(>F)
## essData$edlv  2   7.86   3.930    2.18  0.124
## Residuals    47  84.74   1.803

Between-group variance (i.e. the mean sum of squares, or sum of squares divided by df) is 3.930. It is 2.18 times greater than the error variance. In theory, however, 12.4% of all samples that can be drawn randomly will show this or an even higher ratio. This is not unlikely enough to allow us to reject the null hypothesis stating that all group means are equal. Between-group differences are not sufficient to trump within-group error. There is no significant difference in tolerance levels between differently educated groups.

Mann-Whitney-U-Test (Wilcoxon-Test): Test if two population medians are different

Prerequisites

Independent variable is binary
Dependent variable is metric or ordinal
No claims about distribution characteristics
Well suited for small samples

Null hypothesis

H₀: There is no difference between the group medians in the population.

H_A: Group medians are different in the population.

General test idea

If two population medians do not differ, two samples A and B from this population should have two (more or less) equal medians: Median_A\(\approx\)Median_B. In other words: (Median_A - Median_B)\(\approx\) 0.

The medians, however, are not accessed directly. The test compares the ranks of the dependent variable:

If two population medians do not differ, two samples A and B from this population should be (more or less) ranked equally: Ranks_A\(\approx\)Ranks_B. In other words: (Mean of Ranks_A - Mean of Ranks_B)\(\approx\) 0.

In other words, the null-hypothesis states that if we sort the data set by values of the dependent variable, the position (row numbers) of both groups is random. The alternative hypothesis claims that, after sorting, one group is located in the upper, the second group in the lower part of the data set.

A mini example data set could look as follows:

##         x        y
## 1 Group A 7.498804
## 2 Group A 5.433394
## 3 Group A 5.546485
## 4 Group A 2.597701
## 5 Group B 7.124686
## 6 Group B 7.361563
## 7 Group B 8.005519
## 8 Group B 6.881874
## 9 Group B 7.061795

We calculate the ranks of the dependent variable, i.e. the lowest value is ranked 1., the second lowest 2. and so on.

##         x        y rank_y
## 1 Group A 7.498804      8
## 2 Group A 5.433394      2
## 3 Group A 5.546485      3
## 4 Group A 2.597701      1
## 5 Group B 7.124686      6
## 6 Group B 7.361563      7
## 7 Group B 8.005519      9
## 8 Group B 6.881874      4
## 9 Group B 7.061795      5

To illustrate, we sort the data set by ranks. Note that this is not required, as ranks are already computed.

##         x        y rank_y
## 4 Group A 2.597701      1
## 2 Group A 5.433394      2
## 3 Group A 5.546485      3
## 8 Group B 6.881874      4
## 9 Group B 7.061795      5
## 5 Group B 7.124686      6
## 6 Group B 7.361563      7
## 1 Group A 7.498804      8
## 7 Group B 8.005519      9

Finally, we calculate mean ranks.

## Group A Group B 
##     3.5     6.2

The M-W-Test asks if these mean ranks are equal (H₀) or unequal (H_A) in the population. Based on the ranks we can calculate a test statistic U. Its distribution under H₀ is known exactly for small samples and can be approximated by the normal distribution for larger samples.
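The ranking steps for the mini example can be retraced as follows. The sketch ranks the pooled values, computes the mean ranks per group, and derives U with the textbook formula U = (rank sum) - n(n+1)/2. The simple ranking assumes that there are no tied values, and the helper name `u_statistic` is hypothetical.

```python
from statistics import mean

# Values of the mini example above.
group_a = [7.498804, 5.433394, 5.546485, 2.597701]
group_b = [7.124686, 7.361563, 8.005519, 6.881874, 7.061795]

# Rank all values jointly; this simple ranking assumes there are no ties.
pooled = sorted(group_a + group_b)
rank = {value: position for position, value in enumerate(pooled, start=1)}

ranks_a = [rank[v] for v in group_a]
ranks_b = [rank[v] for v in group_b]


def u_statistic(ranks, n):
    """Rank sum of a group minus its smallest possible rank sum."""
    return sum(ranks) - n * (n + 1) / 2


u_a = u_statistic(ranks_a, len(group_a))
u_b = u_statistic(ranks_b, len(group_b))
u = min(u_a, u_b)

print(mean(ranks_a), mean(ranks_b), u)
```

The two U-values always add up to the product of the group sizes, so a small U for one group means that this group occupies mostly low ranks.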

Example: Does tolerance towards gays and lesbians differ between gender?

We have addressed this question above by employing the t-test. If we look at the frequency distribution of the tolerance variable we must, however, admit that this violates the prerequisite of our dependent variable being normally distributed:

Because of this we should employ the M-W-test (Wilcoxon-test) as a more robust non-parametric alternative.

## 
##  Design-based KruskalWallis test
## 
## data:  tolerance ~ gndr
## t = -0.84359, df = 48, p-value = 0.4031
## alternative hypothesis: true difference in mean rank score is not equal to 0
## sample estimates:
## difference in mean rank score 
##                    -0.1220612

The difference between mean ranks is -.122. This is quite close to zero, indicating that there is not much of a difference between the ranks of the groups. If we argue at a 5% significance level, we conclude that there is no significant difference in tolerance between genders because p = .4031 > .05.

Differences to the t-test

While the t-test compares mean values of the dependent variable, the M-W-test compares mean ranks. As the latter does not refer to parameters like mean and standard deviation, it is called a non-parametric test. While the t-test requires normally distributed values, the M-W-test does not make any assumption about the distribution of the dependent variable. The latter, thus, is more robust (rejection of H₀ is more valid), while the t-test has greater power (it is easier to reject H₀). The power of the t-test, however, is only guaranteed if the precondition (normality of the dependent variable) is met. If the dependent variable is skewed or otherwise not normal, choose the less powerful M-W-test. With non-parametric tests you are always on the safe side.

Kruskal-Wallis-H-Test: Test if three or more population medians are different

Prerequisites

Independent variable is categorical with three or more categories
Dependent variable is metric or ordinal
No claims about distribution characteristics
Well suited for small samples

Null hypothesis

H₀: There is no difference between group medians in the population, i.e. all group medians are equal.

H_A: At least two group medians in the population differ.

General test idea

If three or more samples A, B, C … were drawn from a homogeneous population, the sample medians should be more or less equal: Median_A\(\approx\)Median_B\(\approx\)Median_C… In contrast to the two-sample case (M-W-test) we cannot refer to a single difference when comparing more than two sample medians. Thus, we need an alternative approach.

The K-W-test compares the ranks of the dependent variable. If group population medians do not differ, samples A, B, C… from this population should be (more or less) ranked equally: Ranks_A\(\approx\)Ranks_B\(\approx\)Ranks_C.

In other words, the null-hypothesis states that if we sort the data set by values of the dependent variable, the position (row numbers) of all groups is random. The alternative hypothesis claims that, after sorting, the groups are separated, i.e. more or less clustered in the upper, middle, or lower part of the data set.

Analogous to ANOVA, the K-W-test decomposes the total variation of all ranks across all groups from the grand mean of all ranks into systematic and error variation. Its test statistic H relates the systematic variation of the ranks to their total variation. If we (theoretically) drew all samples of the same size as A, B, C … H would approximately follow a \(\chi^2\)-distribution.

A mini example data set could look as follows:

##          x        y
## 1  Group A 7.498804
## 2  Group A 5.433394
## 3  Group A 5.546485
## 4  Group A 2.597701
## 5  Group B 7.124686
## 6  Group B 7.361563
## 7  Group B 8.005519
## 8  Group B 6.881874
## 9  Group B 7.061795
## 10 Group C 4.712030
## 11 Group C 2.523318
## 12 Group C 5.132940
## 13 Group C 3.828854
## 14 Group C 3.814692
## 15 Group C 6.698059

We calculate the ranks of the dependent variable, i.e. the lowest value is ranked 1., the second lowest 2. and so on.

##          x        y rank_y
## 1  Group A 7.498804     14
## 2  Group A 5.433394      7
## 3  Group A 5.546485      8
## 4  Group A 2.597701      2
## 5  Group B 7.124686     12
## 6  Group B 7.361563     13
## 7  Group B 8.005519     15
## 8  Group B 6.881874     10
## 9  Group B 7.061795     11
## 10 Group C 4.712030      5
## 11 Group C 2.523318      1
## 12 Group C 5.132940      6
## 13 Group C 3.828854      4
## 14 Group C 3.814692      3
## 15 Group C 6.698059      9

To illustrate, we sort the data set by ranks. Note that this is not required, as ranks are already computed.

##          x        y rank_y
## 11 Group C 2.523318      1
## 4  Group A 2.597701      2
## 14 Group C 3.814692      3
## 13 Group C 3.828854      4
## 10 Group C 4.712030      5
## 12 Group C 5.132940      6
## 2  Group A 5.433394      7
## 3  Group A 5.546485      8
## 15 Group C 6.698059      9
## 8  Group B 6.881874     10
## 9  Group B 7.061795     11
## 5  Group B 7.124686     12
## 6  Group B 7.361563     13
## 1  Group A 7.498804     14
## 7  Group B 8.005519     15

Finally, we calculate mean ranks.

##   Group A   Group B   Group C 
##  7.750000 12.200000  4.666667

The K-W-Test asks if these mean ranks are all equal (H₀) or if at least one of them differs from the others (H_A) in the population. Based on the ranks we can calculate a test statistic which approximately follows a \(\chi^2\)-distribution (“chi square”, see below).
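For the mini example, the rank-based statistic can be computed with the textbook formula H = 12/(N(N+1)) · Σ(R_j²/n_j) - 3(N+1), where R_j is the rank sum and n_j the size of group j. The sketch assumes that there are no tied values; for two degrees of freedom (three groups) the upper tail of the \(\chi^2\)-distribution has the simple closed form exp(-H/2), which is used here instead of a lookup table.

```python
import math

# Values of the mini example above.
data = {
    "Group A": [7.498804, 5.433394, 5.546485, 2.597701],
    "Group B": [7.124686, 7.361563, 8.005519, 6.881874, 7.061795],
    "Group C": [4.712030, 2.523318, 5.132940, 3.828854, 3.814692, 6.698059],
}

# Rank all values jointly; this simple ranking assumes there are no ties.
pooled = sorted(v for values in data.values() for v in values)
rank = {value: position for position, value in enumerate(pooled, start=1)}
n_total = len(pooled)

# Textbook Kruskal-Wallis statistic built from the rank sums per group.
h = 12 / (n_total * (n_total + 1)) * sum(
    sum(rank[v] for v in values) ** 2 / len(values)
    for values in data.values()
) - 3 * (n_total + 1)

# For df = 2 the chi-square upper tail is exactly exp(-h / 2).
p = math.exp(-h / 2)

print(round(h, 4), round(p, 4))
```

In this constructed example the groups are clearly separated after sorting, so H is large and the approximate p-value is small.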

Example: Does tolerance towards gays and lesbians differ by education?

We have addressed this question above by employing ANOVA. As our dependent variable tolerance is not normally distributed, we should employ the K-W-test as a more robust non-parametric alternative.

## 
##  Design-based KruskalWallis test
## 
## data:  tolerance ~ edlv
## df = 2, Chisq = 3.1934, p-value = 0.2134

If all groups have equal mean ranks we expect a \(\chi^2\)-value of zero. The test provides a \(\chi^2\)-value of 3.193. This is not zero but still too small to reject the null-hypothesis. If we argue at a 5% significance level, we conclude that there is no significant difference in tolerance between education levels because p = .2134 > .05.

Differences to ANOVA

While ANOVA compares mean values of the dependent variable, the K-W-test compares mean ranks. As the latter does not refer to parameters like mean and standard deviation, it is called a non-parametric test. While ANOVA requires normally distributed values, the K-W-test does not make any assumption about the distribution of the dependent variable. The latter, thus, is more robust (rejection of H₀ is more valid), while ANOVA has greater power (it is easier to reject H₀). The power of ANOVA, however, is only guaranteed if the precondition (normality of the dependent variable) is met. If the dependent variable is skewed or otherwise not normal, choose the less powerful K-W-test. With non-parametric tests you are always on the safe side.

\(\chi^2\)-Test of independence: Test if two categorical variables are associated

Prerequisites

Both independent and dependent variables are categorical
No claims about distribution characteristics
Well suited for small samples
Requires sufficient table margins to guarantee expected frequencies above 5

Null hypothesis

H₀: There is no association between the variables in the population.

H_A: Variables in the population are associated.

General test idea

Any analysis of categorical variables is based on frequency tables, so-called cross tables. If two variables are not associated, i.e. if they are independent of each other, we are able to estimate the frequencies of the value combinations of both variables (the “inner” frequencies in a cross table) based on the single frequency distributions of both variables (the “outer” frequencies, i.e. the margins in a cross table). These estimates can be regarded as expected frequencies under the condition that there is no association (H₀). We can demonstrate the dependency of the variables by showing that empirical and expected frequencies differ significantly.

A mini example could look like this:

##          x
## y         Group D Group E Group F sum
##   Group A      10       4       4  18
##   Group B       3       9       0  12
##   Group C       1       0      11  12
##   sum          14      13      15  42

From these observed frequencies we estimate the expected frequencies which should be observed if both variables are independent. The calculation is simple, e.g. if 18 out of 42 individuals belong to Group A, Group D should contain the same relative number of A-members, i.e. 14\(\cdot\)(18/42) = 6. In general, the expected frequencies can be calculated as row sum \(\cdot\) column sum / total sum.

##          x
## y         Group D   Group E   Group F sum
##   Group A       6  5.571429  6.428571  18
##   Group B       4  3.714286  4.285714  12
##   Group C       4  3.714286  4.285714  12
##   sum          14 13.000000 15.000000  42

Now we calculate the residuals, i.e. the differences between observed and expected frequencies. If H₀ holds, these residuals should all be close to zero.

##          x
## y           Group D   Group E   Group F
##   Group A  4.000000 -1.571429 -2.428571
##   Group B -1.000000  5.285714 -4.285714
##   Group C -3.000000 -3.714286  6.714286

The sample parameter \(\chi^2\) is calculated by normalizing the squared residuals to the expected frequencies. Each residual is squared and divided by its respective expected frequency:

##          x
## y            Group D    Group E    Group F
##   Group A  2.6666667  0.4432234  0.9174603
##   Group B  0.2500000  7.5219780  4.2857143
##   Group C  2.2500000  3.7142857 10.5190476

These normalized residuals are finally summed up to compute the sample parameter \(\chi^2\).

## [1] 32.56838
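The whole calculation, from observed frequencies to the sample \(\chi^2\), can be retraced in a few lines. The sketch uses the observed table of the mini example; the variable names are illustrative only.

```python
# Observed frequencies of the mini example (rows: Groups A-C, columns: Groups D-F).
observed = [
    [10, 4, 4],
    [3, 9, 0],
    [1, 0, 11],
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Expected frequency under independence: row sum * column sum / total sum.
expected = [[r * c / total for c in col_sums] for r in row_sums]

# Squared residuals, normalized by the expected frequencies and summed up.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(round(chi2, 5), df)
```

Each cell contributes its own normalized squared residual, so the cells with the largest contributions show where observed and expected frequencies diverge most.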

This parameter describes the discrepancy between observed and expected frequencies. If both variables are independent, this parameter follows a \(\chi^2\)-distribution and is expected to be close to zero. To support the alternative hypothesis we must show that the test value \(\chi^2\) of our sample is very unlikely under the condition of independence.

The sample \(\chi^2\) varies with table size. The number of degrees of freedom is equal to (number of table columns - 1) \(\cdot\) (number of table rows - 1), in our case (3-1)\(\cdot\)(3-1) = 4. For more than 30 degrees of freedom the \(\chi^2\)-distribution approaches a normal distribution.

Example: Do education levels differ between gender?

##         
##          low medium high sum
##   Male     7      8    6  21
##   Female   7     14    8  29
##   sum     14     22   14  50

## 
##  Pearson's Chi-squared test
## 
## data:  t
## X-squared = 0.078669, df = 2, p-value = 0.9614

The observed frequencies look unstructured and the \(\chi^2\)-value is close to zero, indicating that variables are independent. On a 5%-significance level we cannot reject our null-hypothesis because p=.961.
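The step from a \(\chi^2\)-value to its significance can be sketched without a lookup table when the number of degrees of freedom is even, because the upper tail of the \(\chi^2\)-distribution then has a closed form. The function name `chi2_upper_tail_even_df` is hypothetical, and the restriction to even degrees of freedom is a simplification chosen for this illustration.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a chi-square variable with an even number of df."""
    if df <= 0 or df % 2:
        raise ValueError("this closed form only covers even, positive df")
    half = x / 2
    term, total = 1.0, 1.0
    # Sum of (x/2)^k / k! for k = 0 .. df/2 - 1.
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


# Test value and df of the gender-by-education example above.
p_example = chi2_upper_tail_even_df(0.078669, 2)

# Test value and df of the constructed mini example (3 x 3 table).
p_mini = chi2_upper_tail_even_df(32.56838, 4)

print(round(p_example, 4), p_mini)
```

The first value should match the p-value in the test output, while the mini example, which was constructed with a strong association, yields a significance far below any common significance level.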

Nils Mevenkamp

2025-12-03
