If our sample is random, we know the distribution of specific sample characteristics. Hence, if, and only if, our sample is random, we know the probability for the random occurrence of these characteristics, such as differences in means of subsamples. If the probability that a variation is purely random is very low (say below 5%), it seems unlikely that chance alone produced it. Thus, in addition to mere random effects, at least one systematic effect exists that accounts for this variation.
General process for hypothesis testing:
In the following, this process will be outlined in detail by referring to different hypotheses. We work with a small sample (n=50) from ESS8 data containing the following variables: gender of respondent (gndr), self-rated health (health), highest level of education (edlv), tolerance towards gays and lesbians (multi-item scale tolerance), and a weighting variable (w).
H0: There is no difference between the group means in the population.
HA: Group means are different in the population.
If two population means do not differ, two samples A and B from this population should have two (more or less) equal means: MeanA\(\approx\)MeanB. In other words: (MeanA - MeanB)\(\approx\) 0.
If we (theoretically) drew all possible samples of the same size as A and B, the differences (MeanA - MeanB) would be symmetrically distributed around a mean of 0. We divide these differences by their standard error SE (i.e. the standard deviation of this theoretical distribution of differences) to standardize them. The resulting theoretical distribution is known as the t-distribution.
It describes the probability for the random occurrence of a given difference between two sample means under the condition that both samples were drawn from the same population, i.e. that their difference in the population is actually zero.
In other words, if we assume both samples were drawn from the same population, we can calculate the likelihood for the random occurrence of our sampled difference between two means. Most differences are very close to zero. The more the differences deviate from zero towards negative or positive values, the less likely the respective sample pairs are (the less often they occur when drawn randomly from the same population).
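The thought experiment of drawing many sample pairs from one and the same population can be imitated with a short simulation. The following Python sketch is only an illustration: the population mean and standard deviation, the sample sizes, and the function name `simulate_mean_differences` are assumptions chosen for this example, not values taken from the ESS8 data.

```python
import random
import statistics

# Hypothetical population values (assumptions for illustration only).
POP_MEAN = 3.3
POP_SD = 1.4


def simulate_mean_differences(n_a=25, n_b=25, draws=5000, seed=1):
    """Draw many sample pairs from ONE population; return MeanA - MeanB per pair."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(draws):
        sample_a = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n_a)]
        sample_b = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n_b)]
        diffs.append(statistics.mean(sample_a) - statistics.mean(sample_b))
    return diffs


diffs = simulate_mean_differences()
mean_diff = statistics.mean(diffs)

# Share of sample pairs whose means differ only slightly or very strongly.
share_small = sum(abs(d) < 0.5 for d in diffs) / len(diffs)
share_large = sum(abs(d) > 1.0 for d in diffs) / len(diffs)

print(f"average difference:      {mean_diff:.3f}")
print(f"share with |diff| < 0.5: {share_small:.3f}")
print(f"share with |diff| > 1.0: {share_large:.3f}")
```

Under these assumptions, the differences cluster around zero, and large differences in either direction occur only rarely.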
In practice, we define the maximum error probability we are willing to accept in our analysis. This user-defined threshold is called the significance level \(\alpha\). Based on this threshold we can clearly decide whether our sampled values are too unlikely to have been drawn randomly. We then calculate the actual probability of randomly drawing a sample like ours from a homogeneous population in which there is no difference between the group means. This probability is called the significance \(p\) of our sample. It can be determined by using the theoretical t-distribution as a “lookup table” for the sample-specific t-value.
In the social sciences, significance levels of 5% or 1% are quite common. At a significance level of \(\alpha\) = 5% we are wrong in one out of every 20 test decisions on average whenever there is in fact no difference in the population. However, if you build a nuclear power plant or perform heart surgery, you should work with considerably smaller significance levels.
If the significance is greater than the significance level, nothing speaks against the initial assumption that our sample was drawn randomly from a homogeneous population.
If the significance is lower than the significance level, we have to question our initial premise: we made an assumption and must now acknowledge that our result (our sample) is quite unlikely under this assumption. Then we have two options: either we doubt our own sample (for example, we may have made methodological errors in the sampling process), or we doubt the assumption that led to this result. The former is always reasonable, of course. However, if we find no errors, then we must reject the basic assumption that our sample comes randomly from a homogeneous population. In that case, we must acknowledge that our sample was not drawn randomly from a homogeneous population, i.e. that the differences in our sample means are not random, that there is a significant difference between the groups in the population, and that group membership has a significant effect on the variable in question.
The initial assumption that everything we observe in our sample is just random is called the null-hypothesis (H0). In most cases we are interested in the rejection of H0, because we want to show that our observations are not random.
Note that H0 cannot be proven but only rejected. In other words, the fact that H0 cannot be rejected does not necessarily mean that H0 is true. If the judiciary sends me to jail for murder, they are sufficiently certain that I am guilty. However, if they don’t jail me for want of evidence, it doesn’t mean that I am innocent. They simply are not sure enough to jail me, although I might still be a murderer. The same holds for H0: if we can’t reject it, it may still be false, i.e. the effect we’re after might still exist.
## gndr tolerance se
## Male Male 3.588056 0.4195371
## Female Female 3.134684 0.4194148
We observe a clear difference in tolerance levels between males (3.588) and females (3.135) in our sample, indicating that females are less tolerant than males. The difference in means is 3.135 - 3.588 = -.453. Does this difference deviate sufficiently far from zero to be able to say that it is not random? The t-test provides an answer:
##
## Design-based t-test
##
## data: tolerance ~ gndr
## t = -0.76425, df = 48, p-value = 0.4485
## alternative hypothesis: true difference in mean is not equal to 0
## 95 percent confidence interval:
## -1.6461384 0.7393936
## sample estimates:
## difference in mean
## -0.4533724
Although we observe a clear difference in tolerance levels between males and females in our sample, a standardized difference of -.76425 is quite likely to occur randomly (\(p\)=.449). More formally, the likelihood of randomly drawing two subsamples of males and females that differ by at least .76425 standard errors, although they do not differ in the population, is .449.
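The numbers in the test output hang together arithmetically: the t-value is the difference in means divided by its standard error, and the confidence interval is centred on that difference. The standard error itself is not printed, so the following Python sketch recovers it from the reported values. It is a plausibility check on the printed output, not a re-computation of the design-based test.

```python
# Values copied from the test output above.
diff = -0.4533724                       # difference in mean
t_value = -0.76425                      # reported t
ci_low, ci_high = -1.6461384, 0.7393936

# t = diff / SE, hence SE = diff / t.
se = diff / t_value

# The 95% confidence interval is diff +/- (critical t) * SE.
ci_midpoint = (ci_low + ci_high) / 2
half_width = (ci_high - ci_low) / 2
critical_t = half_width / se

print(f"standard error of the difference: {se:.4f}")
print(f"midpoint of the interval:         {ci_midpoint:.4f}")
print(f"implied critical t-value:         {critical_t:.4f}")
```

The implied critical value of roughly 2.01 is what one would expect for a two-sided 5% test with 48 degrees of freedom.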
As a kind of “cookbook recipe” we can simply argue as follows: if we test at a significance level of \(\alpha\) = 5% = .05, we see that p > \(\alpha\), indicating that there is no significant difference in tolerance between genders in the population.
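This decision rule is simple enough to be written as a tiny function. The Python sketch below is illustrative only; the function name `decide` and its return strings are made up for this example.

```python
def decide(p_value, alpha=0.05):
    """Compare the significance p with the chosen significance level alpha."""
    if p_value < alpha:
        return "reject H0"
    return "do not reject H0"


# p-value of the t-test on tolerance by gender, tested at the 5% level.
print(decide(0.4485))
# The same p-value leads to the same decision at the stricter 1% level.
print(decide(0.4485, alpha=0.01))
```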
To illustrate this situation differently: if H0 were true, we would still observe a standardized difference at least as large as ours in about 45% of all random samples (p = 0.4485). This is a good reason to consider H0 provisionally valid: the observed differences in our subsamples can easily be explained by simple random variation in the sampling process.
Note that the t-distribution for our test depends on the size of our sample (n=50). More precisely, the shape of the theoretical distribution depends on the number of degrees of freedom (df=50-2=48) of our sample. What does this mean?
Imagine a sample of n=3, e.g. tolerance levels of three individuals, namely 3, 4, and 5. This results in a sample mean of 4 and a standard deviation of 0.82 (the square root of 2/3, i.e. dividing the sum of squared deviations by n). We can freely choose whom we include in our survey, so all three sample values are free, i.e. we had three degrees of freedom when collecting our sample data.
Now, if we draw conclusions about a population, we must take all parameters into account that are used in our argumentation. As we know that our sample mean is 4 and we freely draw two answers, say 3 and 4, the third answer must be 5 in order to match the constraint that the average of all three answers is 4. If we have a sample of three with a mean of 4, we can freely choose two sample values. Estimating the population mean thus decreases the number of freely varying sample values by one. The same is true for the second parameter: when estimating the standard error we make use of the sample standard deviation. In a sample of three with a given mean and standard deviation, we can freely choose only one element.
In general, the degrees of freedom are equal to the sample size minus the number of estimated parameters. Compared to a standard normal distribution, t-distributions have greater standard deviations and are slightly flatter. From 30 degrees of freedom onwards, the t-distribution closely approximates a normal distribution.
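The three-person example can be retraced in a few lines of Python. This is a minimal sketch; the helper names `third_value` and `degrees_of_freedom` are hypothetical and only serve the illustration.

```python
import statistics

sample = [3, 4, 5]

sample_mean = statistics.mean(sample)   # 4
sd_div_n = statistics.pstdev(sample)    # divides by n:     sqrt(2/3)
sd_div_n1 = statistics.stdev(sample)    # divides by n - 1: sqrt(2/2)


def third_value(mean, first, second, n=3):
    """Once the mean is fixed, the last value is no longer free."""
    return n * mean - first - second


def degrees_of_freedom(n, estimated_parameters):
    """Sample size minus the number of estimated parameters."""
    return n - estimated_parameters


print(f"mean = {sample_mean}, sd (n) = {sd_div_n:.4f}, sd (n-1) = {sd_div_n1:.4f}")
print("forced third value:", third_value(4, 3, 4))
print("df for the t-test above:", degrees_of_freedom(50, 2))
```

Whether the standard deviation is divided by n or by n - 1 matters for small samples: here it gives about 0.82 in the first case and exactly 1 in the second.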
H0: There is no difference between group means in the population, i.e. all group means are equal.
HA: At least one group mean in the population differs from all others.
If three or more samples A, B, C … were drawn from a homogeneous population, the sample means should be more or less equal: MeanA\(\approx\)MeanB\(\approx\)MeanC… In contrast to the two-sample case (t-test) we cannot refer to a single difference when comparing more than two sample means. Thus, we need an alternative approach.
A representation of the data processed by ANOVA could look like this:
When comparing the groups we must state that groups A and C widely overlap while group B has much greater values on average. In addition B looks more compact, i.e. it has a smaller standard deviation or variance, respectively. This is what the Analysis of Variance looks for.
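ANOVA decomposes the total variation of all data points across all groups around the grand mean of all values into two parts: the variation between groups (systematic variance) and the variation within groups (error variance).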
The question now is which of these variances is greater: the within-groups or the between-groups variance, the error or the systematic variance?
Variances can be compared by the F-test. If we (theoretically) drew all possible samples of the same size as A, B, C … the quotient (systematic variance)/(error variance) would follow an F-distribution.
F-distributions are more or less right-skewed depending on their two degrees of freedom, one for the numerator and one for the denominator.
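The decomposition into between-group and within-group variation can be shown on a toy example. The three groups below are hypothetical numbers invented for this sketch (they are not ESS8 data), and the plain-Python computation is only meant to make the bookkeeping visible.

```python
import statistics

# Hypothetical toy data: three groups with three values each.
groups = {"A": [2, 3, 4], "B": [6, 7, 8], "C": [3, 4, 5]}

all_values = [x for values in groups.values() for x in values]
grand_mean = statistics.mean(all_values)

# Systematic part: group means around the grand mean, weighted by group size.
ss_between = sum(
    len(values) * (statistics.mean(values) - grand_mean) ** 2
    for values in groups.values()
)
# Error part: single values around their own group mean.
ss_within = sum(
    (x - statistics.mean(values)) ** 2
    for values in groups.values()
    for x in values
)
# Total variation: single values around the grand mean.
ss_total = sum((x - grand_mean) ** 2 for x in all_values)

df_between = len(groups) - 1               # numerator degrees of freedom
df_within = len(all_values) - len(groups)  # denominator degrees of freedom
f_value = (ss_between / df_between) / (ss_within / df_within)

print(f"SS between = {ss_between:.2f}, SS within = {ss_within:.2f}, SS total = {ss_total:.2f}")
print(f"F = {f_value:.2f} with df = ({df_between}, {df_within})")
```

In this toy example the two parts add up to the total variation, and the systematic variance is many times larger than the error variance.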
## edlv tolerance se
## low low 3.795818 0.4020106
## medium medium 3.008001 0.2795561
## high high 3.132578 0.8115710
## mean SE
## tolerance 3.3546 0.3123
We observe a clear difference in tolerance levels between the low-education group (3.796) and the higher-education groups (3.008 and 3.133) in our sample, suggesting that education affects tolerance, with the low-education group scoring highest on the scale. ANOVA can be used to test the null hypothesis “All group means are equal in the European population”.
## Df Sum Sq Mean Sq F value Pr(>F)
## essData$edlv 2 7.86 3.930 2.18 0.124
## Residuals 47 84.74 1.803
The between-group variance (i.e. the mean sum of squares, or the sum of squares divided by df) is 3.930. It is 2.18 times greater than the error variance. In theory, however, 12.4% of all samples that can be drawn randomly will show this or an even higher ratio. This is not unlikely enough to allow us to reject the null hypothesis stating that none of the groups behaves significantly differently from the others. Between-group differences are not sufficient to outweigh within-group error. There is no significant difference in tolerance levels between differently educated groups.
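The columns of the ANOVA table are linked by simple arithmetic, which can be retraced from the printed values. The following Python sketch only re-derives the mean squares and the F value from the reported sums of squares and degrees of freedom; it does not reproduce the p-value.

```python
# Values copied from the ANOVA table above.
ss_between, df_between = 7.86, 2    # row essData$edlv
ss_within, df_within = 84.74, 47    # row Residuals

# Mean Sq = Sum Sq / Df
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# F value = systematic variance / error variance
f_value = ms_between / ms_within

print(f"Mean Sq between = {ms_between:.3f}")
print(f"Mean Sq within  = {ms_within:.3f}")
print(f"F value         = {f_value:.2f}")
```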
H0: There is no difference between the group medians in the population.
HA: Group medians are different in the population.
If two population medians do not differ, two samples A and B from this population should have two (more or less) equal medians: MedianA\(\approx\)MedianB. In other words: (MedianA - MedianB)\(\approx\) 0.
The medians, however, are not accessed directly. The test compares the ranks of the dependent variable:
If two population medians do not differ, two samples A and B from this population should be (more or less) ranked equally: RanksA\(\approx\)RanksB. In other words: (Mean of RanksA - Mean of RanksB)\(\approx\) 0.
In other words, the null-hypothesis states that if we sort the data set by values of the dependent variable, the position (row numbers) of both groups is random. The alternative hypothesis claims that, after sorting, one group is located in the upper, the second group in the lower part of the data set.
A mini example data set could look as follows:
## x y
## 1 Group A 7.498804
## 2 Group A 5.433394
## 3 Group A 5.546485
## 4 Group A 2.597701
## 5 Group B 7.124686
## 6 Group B 7.361563
## 7 Group B 8.005519
## 8 Group B 6.881874
## 9 Group B 7.061795
We calculate the ranks of the dependent variable, i.e. the lowest value is ranked 1., the second lowest 2. and so on.
## x y rank_y
## 1 Group A 7.498804 8
## 2 Group A 5.433394 2
## 3 Group A 5.546485 3
## 4 Group A 2.597701 1
## 5 Group B 7.124686 6
## 6 Group B 7.361563 7
## 7 Group B 8.005519 9
## 8 Group B 6.881874 4
## 9 Group B 7.061795 5
To illustrate, we sort the data set by ranks. Note that this is not required, as ranks are already computed.
## x y rank_y
## 4 Group A 2.597701 1
## 2 Group A 5.433394 2
## 3 Group A 5.546485 3
## 8 Group B 6.881874 4
## 9 Group B 7.061795 5
## 5 Group B 7.124686 6
## 6 Group B 7.361563 7
## 1 Group A 7.498804 8
## 7 Group B 8.005519 9
Finally, we calculate mean ranks.
## Group A Group B
## 3.5 6.2
The M-W-test asks if these mean ranks are equal (H0) or unequal (HA) in the population. Based on the ranks we can calculate a test statistic U whose distribution under H0 is known and which can be approximated by the normal distribution for larger samples.
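The ranking steps of the mini example can be retraced in plain Python. The sketch below recomputes the ranks and mean ranks and adds the U statistic in its textbook form (rank sum minus n(n+1)/2); it assumes that there are no tied values, which holds for this data set, and it is not the design-based routine used for the survey data.

```python
# Mini example data from above: (group, value of the dependent variable).
data = [
    ("Group A", 7.498804), ("Group A", 5.433394),
    ("Group A", 5.546485), ("Group A", 2.597701),
    ("Group B", 7.124686), ("Group B", 7.361563),
    ("Group B", 8.005519), ("Group B", 6.881874),
    ("Group B", 7.061795),
]

# Rank 1 for the lowest value, rank 2 for the second lowest, and so on.
# Assumption: no ties, so every value has exactly one position.
ordered = sorted(value for _, value in data)
rank_of = {value: position + 1 for position, value in enumerate(ordered)}

rank_sums, sizes = {}, {}
for group, value in data:
    rank_sums[group] = rank_sums.get(group, 0) + rank_of[value]
    sizes[group] = sizes.get(group, 0) + 1

mean_ranks = {g: rank_sums[g] / sizes[g] for g in rank_sums}

# U per group: rank sum minus the smallest rank sum the group could have.
u_values = {g: rank_sums[g] - sizes[g] * (sizes[g] + 1) / 2 for g in rank_sums}

print("mean ranks:", mean_ranks)
print("U values:  ", u_values)
```

The two U values always add up to the product of the two group sizes, here 4 · 5 = 20.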
We have addressed this question above by employing the t-test. If we look at the frequency distribution of the tolerance variable we must, however, admit that this violates the prerequisite of our dependent variable being normally distributed:
Because of this we should employ the M-W-test (Wilcoxon test) as a more robust non-parametric alternative.
##
## Design-based KruskalWallis test
##
## data: tolerance ~ gndr
## t = -0.84359, df = 48, p-value = 0.4031
## alternative hypothesis: true difference in mean rank score is not equal to 0
## sample estimates:
## difference in mean rank score
## -0.1220612
The difference between mean rank scores is -.122. This is quite close to zero, indicating that there is not much of a difference between the ranks of the groups. If we argue at a 5% significance level, we conclude that there is no significant difference in tolerance between genders because p = .4031 > .05.
While the t-test compares mean values of the dependent variable, the M-W-test compares mean ranks. As the latter does not refer to parameters like mean and standard deviation, it is called a non-parametric test. While the t-test assumes normally distributed values, the M-W-test does not require the dependent variable to be normally distributed. The latter, thus, is more robust (a rejection of H0 is more trustworthy), while the t-test has greater power (it is easier to reject H0). The power of the t-test, however, is only guaranteed if the precondition (normality of the dependent variable) is met. If the dependent variable is skewed or otherwise not normal, choose the less powerful M-W-test. With non-parametric tests you are always on the safe side.
H0: There is no difference between group medians in the population, i.e. all group medians are equal.
HA: At least one group median in the population differs from all others.
If three or more samples A, B, C … were drawn from a homogeneous population, the sample medians should be more or less equal: MedianA\(\approx\)MedianB\(\approx\)MedianC… In contrast to the two-sample case (M-W-test) we cannot refer to a single difference when comparing more than two sample medians. Thus, we need an alternative approach.
The K-W-test compares the ranks of the dependent variable. If group population medians do not differ, samples A, B, C… from this population should be (more or less) ranked equally: RanksA\(\approx\)RanksB\(\approx\)RanksC.
In other words, the null-hypothesis states that if we sort the data set by values of the dependent variable, the position (row numbers) of the groups is random. The alternative hypothesis claims that, after sorting, the groups are separated, i.e. more or less clustered in the upper, middle, or lower part of the data set.
The K-W-test decomposes the total variation of all ranks across all groups around the grand mean of all ranks into systematic and error variation. Variances of the ranks can be compared by the F-test. If we (theoretically) drew all possible samples of the same size as A, B, C … the quotient (systematic variance of ranks)/(error variance of ranks) would follow an F-distribution.
A mini example data set could look as follows:
## x y
## 1 Group A 7.498804
## 2 Group A 5.433394
## 3 Group A 5.546485
## 4 Group A 2.597701
## 5 Group B 7.124686
## 6 Group B 7.361563
## 7 Group B 8.005519
## 8 Group B 6.881874
## 9 Group B 7.061795
## 10 Group C 4.712030
## 11 Group C 2.523318
## 12 Group C 5.132940
## 13 Group C 3.828854
## 14 Group C 3.814692
## 15 Group C 6.698059
We calculate the ranks of the dependent variable, i.e. the lowest value is ranked 1., the second lowest 2. and so on.
## x y rank_y
## 1 Group A 7.498804 14
## 2 Group A 5.433394 7
## 3 Group A 5.546485 8
## 4 Group A 2.597701 2
## 5 Group B 7.124686 12
## 6 Group B 7.361563 13
## 7 Group B 8.005519 15
## 8 Group B 6.881874 10
## 9 Group B 7.061795 11
## 10 Group C 4.712030 5
## 11 Group C 2.523318 1
## 12 Group C 5.132940 6
## 13 Group C 3.828854 4
## 14 Group C 3.814692 3
## 15 Group C 6.698059 9
To illustrate, we sort the data set by ranks. Note that this is not required, as ranks are already computed.
## x y rank_y
## 11 Group C 2.523318 1
## 4 Group A 2.597701 2
## 14 Group C 3.814692 3
## 13 Group C 3.828854 4
## 10 Group C 4.712030 5
## 12 Group C 5.132940 6
## 2 Group A 5.433394 7
## 3 Group A 5.546485 8
## 15 Group C 6.698059 9
## 8 Group B 6.881874 10
## 9 Group B 7.061795 11
## 5 Group B 7.124686 12
## 6 Group B 7.361563 13
## 1 Group A 7.498804 14
## 7 Group B 8.005519 15
Finally, we calculate mean ranks.
## Group A Group B Group C
## 7.750000 12.200000 4.666667
The K-W-test asks if these mean ranks are all equal (H0) or if at least one of them differs from all others (HA) in the population. Based on the ranks we can calculate a test statistic which follows a \(\chi^2\)-distribution (“chi square”, see below).
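For the three-group mini example, the mean ranks and a rank-based test statistic can be computed by hand. The Python sketch below uses the textbook Kruskal-Wallis statistic H, which is referred to a \(\chi^2\)-distribution with (number of groups - 1) degrees of freedom. It assumes no tied values, and the closed-form p-value used here is valid for exactly two degrees of freedom only. It is not the design-based variant applied to the survey data.

```python
import math

# Mini example data from above.
data = {
    "Group A": [7.498804, 5.433394, 5.546485, 2.597701],
    "Group B": [7.124686, 7.361563, 8.005519, 6.881874, 7.061795],
    "Group C": [4.712030, 2.523318, 5.132940, 3.828854, 3.814692, 6.698059],
}

# Rank all values jointly (assumption: no ties).
pooled = sorted(value for values in data.values() for value in values)
rank_of = {value: position + 1 for position, value in enumerate(pooled)}

rank_sums = {g: sum(rank_of[v] for v in values) for g, values in data.items()}
mean_ranks = {g: rank_sums[g] / len(values) for g, values in data.items()}

# Textbook Kruskal-Wallis statistic H.
n_total = len(pooled)
h_value = (
    12 / (n_total * (n_total + 1))
    * sum(rank_sums[g] ** 2 / len(values) for g, values in data.items())
    - 3 * (n_total + 1)
)
df = len(data) - 1

# Chi-square upper tail probability; this closed form holds for df = 2 only.
p_value = math.exp(-h_value / 2)

print("mean ranks:", mean_ranks)
print(f"H = {h_value:.4f}, df = {df}, p = {p_value:.4f}")
```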
We have addressed this question above by employing ANOVA. As our dependent variable tolerance is not normally distributed, we should employ the K-W-test as a more robust non-parametric alternative.
##
## Design-based KruskalWallis test
##
## data: tolerance ~ edlv
## df = 2, Chisq = 3.1934, p-value = 0.2134
If all groups have equal mean ranks, we expect a \(\chi^2\)-value of zero. The test provides a \(\chi^2\)-value of 3.193. This is not zero but still too small to reject the null-hypothesis. If we argue at a 5% significance level, we conclude that there is no significant difference in tolerance between education levels because p = .2134 > .05.
While ANOVA compares mean values of the dependent variable, the K-W-test compares mean ranks. As the latter does not refer to parameters like mean and standard deviation, it is called a non-parametric test. While ANOVA assumes normally distributed values, the K-W-test does not require the dependent variable to be normally distributed. The latter, thus, is more robust (a rejection of H0 is more trustworthy), while ANOVA has greater power (it is easier to reject H0). The power of ANOVA, however, is only guaranteed if the precondition (normality of the dependent variable) is met. If the dependent variable is skewed or otherwise not normal, choose the less powerful K-W-test. With non-parametric tests you are always on the safe side.
H0: There is no association between the variables in the population.
HA: Variables in the population are associated.
Any analysis of categorical variables is based on frequency tables, so-called cross tables. If two variables are not associated, i.e. if they are independent of each other, we are able to estimate the frequencies of the value combinations of both variables (the “inner” frequencies in a cross table) based on the single frequency distributions of both variables (the “outer” frequencies, i.e. the margins in a cross table). These estimates can be regarded as expected frequencies under the condition that there is no association (H0). We can show the dependency of the variables by showing that empirical and expected frequencies differ significantly.
A mini example could look like this:
## x
## y Group D Group E Group F sum
## Group A 10 4 4 18
## Group B 3 9 0 12
## Group C 1 0 11 12
## sum 14 13 15 42
From these observed frequencies we estimate the expected frequencies which should be observed if both variables are independent. The calculation is simple, e.g. if 18 out of 42 individuals belong to Group A, Group D should contain the same relative number of A-members, i.e. 14\(\cdot\)(18/42) = 6. In general, the expected frequencies can be calculated as row sum \(\cdot\) column sum / total sum.
## x
## y Group D Group E Group F sum
## Group A 6 5.571429 6.428571 18
## Group B 4 3.714286 4.285714 12
## Group C 4 3.714286 4.285714 12
## sum 14 13.000000 15.000000 42
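The rule “row sum · column sum / total sum” translates directly into code. The following Python sketch recomputes the table of expected frequencies from the observed mini example; variable names are chosen freely for this illustration.

```python
# Observed frequencies of the mini example (rows: Group A-C, columns: Group D-F).
observed = [
    [10, 4, 4],
    [3, 9, 0],
    [1, 0, 11],
]

row_sums = [sum(row) for row in observed]           # 18, 12, 12
col_sums = [sum(col) for col in zip(*observed)]     # 14, 13, 15
total = sum(row_sums)                               # 42

# Expected frequency under independence: row sum * column sum / total sum.
expected = [[r * c / total for c in col_sums] for r in row_sums]

for row in expected:
    print([round(cell, 6) for cell in row])
```

Note that the expected frequencies reproduce the margins of the observed table: each row and each column still adds up to its original sum.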
Now we calculate the residuals, i.e. the differences between observed and expected frequencies. If H0 holds, these residuals should all be close to zero.
## x
## y Group D Group E Group F
## Group A 4.000000 -1.571429 -2.428571
## Group B -1.000000 5.285714 -4.285714
## Group C -3.000000 -3.714286 6.714286
The sample parameter \(\chi^2\) is calculated by normalizing the squared residuals to the expected frequencies. Each residual is squared and divided by its respective expected frequency:
## x
## y Group D Group E Group F
## Group A 2.6666667 0.4432234 0.9174603
## Group B 0.2500000 7.5219780 4.2857143
## Group C 2.2500000 3.7142857 10.5190476
These normalized residuals are finally summed up to compute the sample parameter \(\chi^2\).
## [1] 32.56838
This parameter describes the discrepancy between observed and expected frequencies. If both variables are independent, this parameter follows a \(\chi^2\)-distribution and is expected to be close to zero. To support the alternative hypothesis we must show that the test value \(\chi^2\) of our sample is very unlikely under the condition of independence.
The sample \(\chi^2\) varies with table size. The number of degrees of freedom is equal to (number of table columns - 1) \(\cdot\) (number of table rows - 1), in our case (3-1)\(\cdot\)(3-1) = 4. For more than 30 degrees of freedom the \(\chi^2\)-distribution approaches a normal distribution.
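The complete calculation for the mini example, from observed frequencies to a p-value, fits into a short Python sketch. The helper `chi2_upper_tail_even_df` is a hypothetical name; it uses a closed-form expression for the upper tail of the \(\chi^2\)-distribution that is valid for an even number of degrees of freedom only, which is sufficient here (df = 4).

```python
import math

# Observed frequencies of the mini example.
observed = [
    [10, 4, 4],
    [3, 9, 0],
    [1, 0, 11],
]

row_sums = [sum(row) for row in observed]
col_sums = [sum(col) for col in zip(*observed)]
total = sum(row_sums)

# Sum of squared residuals, each divided by its expected frequency.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_sums[i] * col_sums[j] / total
        chi2 += (obs - exp) ** 2 / exp

# Degrees of freedom: (rows - 1) * (columns - 1).
df = (len(observed) - 1) * (len(observed[0]) - 1)


def chi2_upper_tail_even_df(x, df):
    """P(chi-square >= x); closed form, valid for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of df")
    half_x = x / 2
    return math.exp(-half_x) * sum(
        half_x ** k / math.factorial(k) for k in range(df // 2)
    )


p_value = chi2_upper_tail_even_df(chi2, df)
print(f"chi2 = {chi2:.5f}, df = {df}, p = {p_value:.2e}")
```

For the mini example the resulting p-value is far below any common significance level, so independence would be rejected. As a cross-check, `chi2_upper_tail_even_df(0.078669, 2)` gives roughly 0.9614.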
##
## low medium high sum
## Male 7 8 6 21
## Female 7 14 8 29
## sum 14 22 14 50
##
## Pearson's Chi-squared test
##
## data: t
## X-squared = 0.078669, df = 2, p-value = 0.9614
The observed frequencies look unstructured and the \(\chi^2\)-value is close to zero, indicating that variables are independent. On a 5%-significance level we cannot reject our null-hypothesis because p=.961.