T-tests: Comparing Two Means

This document was composed from Dr. Snopkowski’s ANTH 504 Week 8 lecture and Danielle Navarro’s 2021 Learning statistics with R Chapter 13.

Statistical Tests

For each statistical test we discuss, I want you to note:

  1. What type of variables do we need for this test?

  2. What are the null and alternative hypotheses?

  3. What test (and R code) do we need to run the statistical analysis?

  4. What are the assumptions of the test?

_ How do we check the assumptions of the test?

_ What alternative tests do we run if the assumptions are not met?

Independent T-test

_ The simplest form of experiment is one with a single independent variable, manipulated in only two ways, and a single measured outcome.

  • More often than not the manipulation of the independent variable involves having an experimental condition and a control.

  • E.g., Is the movie Scream 2 scarier than the original Scream? We could measure heart rates (which indicate anxiety) during both films and compare them.

_ This situation can be analysed with a t-test

  1. What type of data do we have?

The independent variable is the movie, which has two categories; the dependent variable is heart rate, which is continuous.

Two Types of T-test

_ Dependent t-test

  • Compares two means based on related data.

  • E.g., Data from the same people measured at different times.

  • Data from ‘matched’ samples.

_ Independent t-test

  • Compares two means based on independent data

  • E.g., data from different groups of people

Rationale for the t-test

_ Two samples of data are collected and the sample means calculated. These means might differ by either a little or a lot.

_ If the samples come from the same population, then we expect their means to be roughly equal. Although it is possible for their means to differ by chance alone, we would expect large differences between sample means to occur very infrequently.

_ What is our null and alternative hypothesis?

Ho: mu1 = mu2

Ha: mu1 ≠ mu2

Rationale for the t-test

_ We compare the difference between the sample means that we collected to the difference between the sample means that we would expect to obtain if there were no effect

  • (i.e. if the null hypothesis were true).

_ We use the standard error as a gauge of the variability between sample means.

_ If the difference between the samples we have collected is larger than what we would expect based on the standard error, then we can assume one of two things:

  • There is no effect and sample means in our population fluctuate a lot and we have, by chance, collected two samples that are atypical of the population from which they came.

  • The two samples come from different populations but are typical of their respective parent population. In this scenario, the difference between samples represents a genuine difference between the samples (and so the null hypothesis is incorrect).

_ As the observed difference between the sample means gets larger, the more confident we become that the second explanation is correct (i.e. that the null hypothesis should be rejected). If the null hypothesis is incorrect, then we gain confidence that the two sample means differ because of the different experimental manipulation imposed on each sample.

The Independent t-test

t = (observed difference between sample means − expected difference between population means (if null hypothesis is true)) / (estimate of the standard error of the difference between two sample means)
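Written out for the pooled-variance (Student) version, with the expected difference under the null hypothesis equal to zero, this is (standard formulas, and they match the hand calculation below):

$$
t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},
$$

with \(n_1 + n_2 - 2\) degrees of freedom.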

Assumptions of the t-test

Both the independent t-test and the dependent t-test are parametric tests based on the normal distribution. Therefore, they assume:

  • The sampling distribution is normally distributed.

  • Data are measured at least at the interval level. (Continuous variable is interval)

  • Variances in these populations are roughly equal (homogeneity of variance).

  • Scores in different treatment conditions are independent (because they come from different people).

Independent Samples student t-test

Question: Are the grades of students who are taught by different instructors / TAs significantly different?

  • We have 2 TAs: Anastasia & Bernadette

  • Dataset = “harpo.Rdata”

Data comes from: https://learningstatisticswithr.com/

Variables?

_ What are our variables? What type of variables are they? (Binary, Categorical, Continuous)?

_ The independent variable is “tutor”, which has two categories. The dependent variable is “grade”, which is continuous.

_ How might we visually display (or conduct descriptive statistics) to see the differences (if any exist) between the TAs?

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   1.0.1 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
load("harpo.Rdata")
# Get summary statistics for the continuous column for each category
harpo_summary <- harpo %>%
  group_by(tutor) %>%
  summarise(n = n(),
    mean_grade = mean(grade),
    sd_grade = sd(grade),
    min_grade = min(grade),
    max_grade = max(grade)
  )
harpo_summary 
## # A tibble: 2 × 6
##   tutor          n mean_grade sd_grade min_grade max_grade
##   <fct>      <int>      <dbl>    <dbl>     <dbl>     <dbl>
## 1 Anastasia     15       74.5     9.00        55        90
## 2 Bernadette    18       69.1     5.77        56        79
# Create a histogram of the grade variable
ggplot(harpo, aes(x = grade, fill = tutor)) +
  geom_histogram(position = "dodge", bins = 10) +
  facet_wrap(~tutor) +
  labs(title = "Histogram of Grade by Tutor", x = "Grade", y = "Count") +
  theme_bw()

# Create a box plot of the continuous variable
ggplot(harpo, aes(x = tutor, y = grade, fill = tutor)) +
  geom_boxplot() +
  labs(title = "Box plot of Grade by Tutor", x = "Tutor", y = "Grade") +
  theme_bw()

# Create a customized density plot of the continuous variable
ggplot(harpo, aes(x = grade, fill = tutor)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density plot of Value by Category", x = "Grade", y = "Density") +
  theme_bw()

Calculating by hand

#separate each set of data by tutor
ana <- harpo %>%
  filter(tutor=="Anastasia")
bern <- harpo %>%
  filter(tutor=="Bernadette")
#calculate standard deviations
sd(ana$grade)
## [1] 8.998942
sd(bern$grade)
## [1] 5.774918
#calculate means
m_a = mean(ana$grade)
m_b = mean(bern$grade)
#get n (sample size)
n_a <- length(ana$grade)
n_a
## [1] 15
n_b <- length(bern$grade)
n_b
## [1] 18
#calculate pooled standard deviation
numerator = (n_a-1)*sd(ana$grade)^2 + (n_b-1)*sd(bern$grade)^2
sp = sqrt(numerator/(n_a+n_b-2))
sp
## [1] 7.406792
#calculate denominator of t-statistic
denominator = sqrt((sp^2/n_a) + (sp^2/n_b))
denominator
## [1] 2.589436
#calculate t-statistic
t = (m_a - m_b) / denominator
t
## [1] 2.115432
#use t-distribution to get corresponding p-value
pt(t, n_a + n_b - 2)
## [1] 0.9787353
(1-pt(t, n_a + n_b - 2))
## [1] 0.02126474
2*(1-pt(t, n_a + n_b - 2))
## [1] 0.04252949

We use the pt() function, which is similar to pnorm(). We give it a value and it returns the area under the curve to the left of that value. For example, pnorm(1.95) gives the area under the standard normal curve to the left of z = 1.95. In pt() we give it our t-statistic, 2.11, plus the degrees of freedom; the t-distribution always requires the degrees of freedom. In this case, the degrees of freedom are the number of Anastasia’s students plus the number of Bernadette’s students minus 2. Usually we subtract one, but because we have two groups here, we subtract 2.

pt() of that value gives 0.9787, which is the area to the left of it. But we want to know the area under the tail, so we take 1 minus that value, which is 0.02: the area to the right.
Because we are doing a two-tailed test, we also need to include the area under the other tail. This is why we multiply by 2 (equivalently, we add the two tail areas together). Our p-value is 0.0425.
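As a small illustration of the analogy between pnorm() and pt() (just the two distribution functions used above):

# Both functions return the area under the curve to the LEFT of the value you give them
pnorm(1.95)                  # area to the left of z = 1.95 under the standard normal (~0.974)
pt(2.115, df = 31)           # area to the left of t = 2.115 under a t-distribution with 31 df (~0.979)
1 - pt(2.115, df = 31)       # one tail: area to the right
2 * (1 - pt(2.115, df = 31)) # both tails: the two-sided p-value (~0.043)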

Calculating this in R

#install.packages("lsr")
library(lsr)
independentSamplesTTest(formula = grade ~ tutor, data=harpo, var.equal=TRUE) 
## 
##    Student's independent samples t-test 
## 
## Outcome variable:   grade 
## Grouping variable:  tutor 
## 
## Descriptive statistics: 
##             Anastasia Bernadette
##    mean        74.533     69.056
##    std dev.     8.999      5.775
## 
## Hypotheses: 
##    null:        population means equal for both groups
##    alternative: different population means in each group
## 
## Test results: 
##    t-statistic:  2.115 
##    degrees of freedom:  31 
##    p-value:  0.043 
## 
## Other information: 
##    two-sided 95% confidence interval:  [0.197, 10.759] 
##    estimated effect size (Cohen's d):  0.74

OR

t.test(grade ~ tutor, data=harpo, var.equal=TRUE) 
## 
##  Two Sample t-test
## 
## data:  grade by tutor
## t = 2.1154, df = 31, p-value = 0.04253
## alternative hypothesis: true difference in means between group Anastasia and group Bernadette is not equal to 0
## 95 percent confidence interval:
##   0.1965873 10.7589683
## sample estimates:
##  mean in group Anastasia mean in group Bernadette 
##                 74.53333                 69.05556

The general form is t.test(*dependent* ~ *independent*, data = name_data, var.equal = TRUE). Because the confidence interval does not include 0, and the p-value is less than 0.05, we can conclude that there is a difference between the means.

We are 95% confident that the population difference between the groups is between 0.197 and 10.759.
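If you want to pull the interval (or the group means) out of the result programmatically rather than reading them off the printout, t.test() returns a list whose components you can index; a small sketch using the same call as above:

res <- t.test(grade ~ tutor, data = harpo, var.equal = TRUE)
res$conf.int   # the 95% confidence interval for the difference in means
res$estimate   # the two group means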

Assumptions of the t-test

  1. Normality – We assume that both groups are normally distributed

Create a histogram. The Shapiro-Wilk test tests the null hypothesis that the data are normal; if the p-value is significant, then you have significant deviations from normality. This test has some challenges. If you have a large sample, you are more likely to get statistical significance in the Shapiro-Wilk test, even though with a big sample the sampling distribution of the mean is more likely to be normal anyway (central limit theorem). If you have a small sample, the Shapiro-Wilk test is less likely to tell you that you are deviating from normality.

  2. Independence – We assume that each observation is independent of the others

Use your brain

  3. Homogeneity of Variance (also known as homoscedasticity) – the variance is the same across both groups. Use leveneTest() to check. If the variances are not equal, remove var.equal=TRUE from t.test(*dependent* ~ *independent*, name_data, var.equal=TRUE); R then automatically runs a Welch t-test.
  • If the homogeneity of variance assumption is not met, then you can run the Welch t-test – which has the same assumptions, except homogeneity of variance.

Testing normality in R

  1. Visually – create a histogram for each group (we are checking normality within each group)
hist(ana$grade)

hist(bern$grade)

  2. Perform a Shapiro-Wilk test; the null hypothesis is that the data are normal (so significant p-values mean that you have significant deviations from normality). Note: large samples are likely to be significant simply because of the large sample size…so in this case, utilize a histogram instead.
ana <- harpo %>% filter(tutor == "Anastasia")
bern <- harpo %>% filter(tutor == "Bernadette")
shapiro.test(ana$grade)
## 
##  Shapiro-Wilk normality test
## 
## data:  ana$grade
## W = 0.98186, p-value = 0.9806
shapiro.test(bern$grade)
## 
##  Shapiro-Wilk normality test
## 
## data:  bern$grade
## W = 0.96908, p-value = 0.7801

Homogeneity of Variance

  1. Make two histograms to see if the variances look about the same

  2. Compare the standard deviations across the groups.

  3. Run a test: There are a few different tests out there, but we’ll use Levene’s test, which is most frequently used in the literature.

  • The null hypothesis is that the variances in the different groups are equal – meaning that a significant result from Levene’s test could indicate a problem. Ho: sigma1 = sigma2; Ha: sigma1 ≠ sigma2.
Note: large sample sizes may give you a significant result simply because of the big sample size.
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
leveneTest(grade ~ tutor, data=harpo)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  2.1287 0.1546
##       31

If the variances are not equal, run the Welch t-test.

Welch t-test

_ The standard error of the difference between the two sample means is calculated differently: SE = sqrt(s1^2/n1 + s2^2/n2), using each group’s own variance rather than a pooled estimate.

And the degrees of freedom are calculated as (and don’t have to be an integer value):

df = (s1^2/n1 + s2^2/n2)^2 / [ (s1^2/n1)^2/(n1 − 1) + (s2^2/n2)^2/(n2 − 1) ]
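As a sketch (reusing the ana, bern, m_a, m_b, n_a, and n_b objects from the hand calculation above), the Welch standard error and degrees of freedom can be computed like this:

# Welch: each group keeps its own variance (no pooling)
v_a <- sd(ana$grade)^2 / n_a
v_b <- sd(bern$grade)^2 / n_b
se_welch <- sqrt(v_a + v_b)                # standard error of the difference
t_welch  <- (m_a - m_b) / se_welch         # ~2.034
df_welch <- (v_a + v_b)^2 /
  (v_a^2 / (n_a - 1) + v_b^2 / (n_b - 1))  # Welch-Satterthwaite df, ~23.0 (not an integer)
2 * (1 - pt(t_welch, df_welch))            # two-sided p-value, ~0.054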

Running the Welch test in R

Same as the Student t-test, except we omit var.equal=TRUE:

independentSamplesTTest(formula = grade ~ tutor, data=harpo)
## 
##    Welch's independent samples t-test 
## 
## Outcome variable:   grade 
## Grouping variable:  tutor 
## 
## Descriptive statistics: 
##             Anastasia Bernadette
##    mean        74.533     69.056
##    std dev.     8.999      5.775
## 
## Hypotheses: 
##    null:        population means equal for both groups
##    alternative: different population means in each group
## 
## Test results: 
##    t-statistic:  2.034 
##    degrees of freedom:  23.025 
##    p-value:  0.054 
## 
## Other information: 
##    two-sided 95% confidence interval:  [-0.092, 11.048] 
##    estimated effect size (Cohen's d):  0.724

OR

t.test(grade ~ tutor, data=harpo)
## 
##  Welch Two Sample t-test
## 
## data:  grade by tutor
## t = 2.0342, df = 23.025, p-value = 0.05361
## alternative hypothesis: true difference in means between group Anastasia and group Bernadette is not equal to 0
## 95 percent confidence interval:
##  -0.09249349 11.04804904
## sample estimates:
##  mean in group Anastasia mean in group Bernadette 
##                 74.53333                 69.05556

Communication

  1. Descriptive Statistics

  2. Description of the null hypothesis

  3. A “stat” block

  4. The results are interpreted

On average, Anastasia’s students performed better (M = 74.5, SD = 9.0) than Bernadette’s students (M = 69.1, SD = 5.77). We conducted an independent-samples t-test to test whether the means for the two TAs were significantly different. This difference, 5.4, was significant, t(31) = 2.115, p = .043. Based on this result, we can conclude that Anastasia’s students performed significantly better on average than Bernadette’s students.
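If you prefer not to retype the numbers in the “stat block”, they can be pulled straight out of the t.test() object; a minimal sketch, assuming the harpo data and the Student t-test above:

res <- t.test(grade ~ tutor, data = harpo, var.equal = TRUE)
# format the stat block as t(df) = t-statistic, p = p-value
sprintf("t(%g) = %.3f, p = %.3f", res$parameter, res$statistic, res$p.value)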

The Wilcoxon rank-sum test and Mann–Whitney test

  • These tests are the non-parametric equivalent of the independent t-test, meaning that you can utilize this test if the data are NOT normal.

  • Use to test differences between two conditions in which different participants have been used.

Ranking Data

  • The tests in this lecture work on the principle of ranking the data for each group:

    • Lowest score = a rank of 1,

    • Next highest score = a rank of 2, and so on.

    • Tied ranks are given the same rank: the average of the potential ranks (see the rank() sketch after this list).

  • For an unequal group size

    • The test statistic (Ws) = sum of ranks in the group that contains the fewest people.
  • For an equal group size

    • Ws = the value of the smaller summed rank.
  • Add up the ranks for the two groups and take the lowest of these sums to be our test statistic.

  • The analysis is carried out on the ranks rather than the actual data.
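To make the ranking scheme (and the handling of ties) concrete, here is a tiny sketch with made-up numbers using base R’s rank(), which by default assigns tied values the average of the ranks they span:

x <- c(5, 2, 8, 2, 9)
rank(x)
## [1] 3.0 1.5 4.0 1.5 5.0   (the two 2s share ranks 1 and 2, so each gets 1.5)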

Zirconium color rank

Let’s use the data: zirconium_content.csv

# Read in the data from the CSV file
zirconium_content <- read.csv("zirconium_content.csv", header = FALSE, sep = "\t")
head(zirconium_content)
##                        V1
## 1 zirconium_content color
## 2                131.5\t0
## 3                131.5\t0
## 4                131.6\t0
## 5                131.7\t0
## 6                131.8\t0
# Separate the single column into two columns
zirconium_content <- separate(zirconium_content, col = 1, into = c("zirconium_content", "color"), sep = "\t")
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 1 rows [1].
head(zirconium_content)
##         zirconium_content color
## 1 zirconium_content color  <NA>
## 2                   131.5     0
## 3                   131.5     0
## 4                   131.6     0
## 5                   131.7     0
## 6                   131.8     0
zirconium_content <- zirconium_content[-1,]


# View the resulting data frame
zirconium_content
##    zirconium_content color
## 2              131.5     0
## 3              131.5     0
## 4              131.6     0
## 5              131.7     0
## 6              131.8     0
## 7              131.9     0
## 8              132.1     0
## 9              138.4     0
## 10             138.6     0
## 11             138.8     0
## 12             139.1     0
## 13             140.3     0
## 14             140.9     0
## 15             144.4     0
## 16             145.5     0
## 17             145.5     0
## 18             146.8     0
## 19             128.2     1
## 20             130.1     1
## 21             130.3     1
## 22             131.5     1
## 23             132.6     1
## 24             135.1     1
## 25             135.2     1
## 26             135.7     1
## 27             135.9     1
## 28             136.2     1
## 29             136.8     1
## 30             136.9     1
## 31               137     1
## 32             138.9     1
## 33             139.2     1
## 34             139.7     1
## 35             140.1     1
## 36             142.2     1
## 37             142.2     1
sapply(zirconium_content, class)
## zirconium_content             color 
##       "character"       "character"
# Convert the zirconium_content column to numeric
zirconium_content <- mutate(zirconium_content, zirconium_content = as.numeric(zirconium_content))

# Convert the color column to a factor
zirconium_content <- mutate(zirconium_content, color = factor(color))
  1. Are the data normally distributed? Visually – create a histogram
# Check normality (here for all values pooled; ideally also check within each color group)
hist(zirconium_content$zirconium_content)

  2. Is there homogeneity of variance?
leveneTest(zirconium_content ~ color, data=zirconium_content)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  3.2741 0.07923 .
##       34                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

_ Levene’s Test for Homogeneity of Variance compares the variance of the zirconium_content values between the two levels of the color factor. The Df value indicates the degrees of freedom for the test, which is one in this case since there are two groups being compared. The F value is the test statistic, which is used to test the null hypothesis that the variances are equal. The Pr(>F) value is the p-value of the test, which indicates the probability of obtaining the observed test statistic under the null hypothesis of equal variances.

In this output, the p-value is 0.07923, which is greater than the significance level of 0.05. This indicates that there is no significant evidence to reject the null hypothesis of equal variances at the 5% level of significance. Therefore, we can assume that the variances of the two groups are equal.

Can calculate the test statistic by hand

Our test statistic W = (sum of ranks in a group) − (minimum possible rank sum for that group)

N is the sample size of one group; the minimum possible rank sum for a group of size N is N*(N+1)/2 (i.e., 1 + 2 + … + N). The code below calls this quantity Mean_rank.

N_b = 17 #for black

N_g = 19 #for gray
#Minimum possible rank sum for a group of size N = N*(N+1)/2 (called Mean_rank below)
Mean_rankblack <- 17*(18)/2 
Mean_rankblack
## [1] 153
Mean_rankgray <- 19*(20)/2 
Mean_rankgray
## [1] 190

We then subtract this minimum possible rank sum from the observed sum of ranks in each group (343 for black, 323 for gray).

Wblack <- 343-153
Wblack
## [1] 190

This all happens under the hood in R. Our test statistic is 190

Wgray <- 323-190
Wgray
## [1] 133
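The raw rank sums of 343 and 323 used above come from ranking all 36 zirconium values together and then summing the ranks within each color group; a quick sketch:

# Joint ranks across both groups, then the rank sum per group
zirconium_content %>%
  mutate(joint_rank = rank(zirconium_content)) %>%
  group_by(color) %>%
  summarise(n = n(), rank_sum = sum(joint_rank))
## color 0 (black): n = 17, rank_sum = 343; color 1 (gray): n = 19, rank_sum = 323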

Running Wilcoxon rank-sum test

wilcox.test(zirconium_content ~ color, data=zirconium_content, paired=FALSE, exact=F)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  zirconium_content by color
## W = 190, p-value = 0.3748
## alternative hypothesis: true location shift is not equal to 0

The wilcox.test() function is used to perform a Wilcoxon rank sum test (also known as the Mann-Whitney U test), which is a nonparametric test to compare the distribution of two independent groups. The function takes several arguments to specify the variables and parameters of the test:

zirconium_content ~ color: specifies the variables to be compared in the formula notation. In this case, we want to compare the zirconium_content variable grouped by the levels of the color factor. data = zirconium_content: specifies the data frame that contains the variables. paired = FALSE: specifies that the groups are independent (i.e., not paired). exact = FALSE: specifies whether to compute exact p-values or use asymptotic approximations. In this case, we use the asymptotic approximation, which is appropriate for larger sample sizes.

If there are ties, you need to set ‘exact’ (whether to calculate an exact p-value) to FALSE.

The null hypothesis for the Wilcoxon rank sum test is that the two groups have the same distribution, or equivalently, that the location parameter of the two groups is the same – loosely, that the mean ranks (and medians) are equal. The alternative hypothesis is that the two groups have different location parameters.

The test statistic W is 190 and the p-value is 0.3748. We fail to reject the null hypothesis.

Reporting the Results

Zirconium in black obsidian artifacts (Mdn = 138.6) did not differ significantly from zirconium in gray obsidian (Mdn = 136.2), W = 190, p = .375.
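The medians reported here can be pulled out the same way we got the descriptives earlier; a small sketch:

zirconium_content %>%
  group_by(color) %>%
  summarise(median_zr = median(zirconium_content))
## color 0 (black): 138.6; color 1 (gray): 136.2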

Dependent t-test

_ Sometimes called the “matched-samples” or “paired-samples” t-test

Example

_ Are invisible people mischievous?

  • 12 Participants

_ Manipulation

  • Placed participants in an enclosed community riddled with hidden cameras.

  • For the first week, participants’ normal behaviour was observed.

  • For the second week, participants were given an invisibility cloak.

_ Outcome

  • measured how many mischievous acts participants performed in week 1 and week 2.

Rationale for the dependent t-test

_ We will take the difference of the two scores for each participant. If there is no effect of the treatment (e.g., being invisible), then we expect the difference scores to be approximately 0, on average.

_ What is our null & alternative hypotheses?

Ho: mu = 0

Ha: mu ≠ 0

Calculation of test statistic (t value)

Focuses on the difference scores: t = mean(diff) / (sd(diff) / sqrt(n)), where diff is the set of difference scores and n is the number of pairs.

Assumptions of the dependent t-test

  1. The sampling distribution is normally distributed. In the dependent t-test this means that the sampling distribution of the differences between scores should be normal, not the scores themselves.

  2. Data are measured at least at the interval level. [same assumption as independent t-test]

Data

Running the code in R (by hand)

data <- c(3, 1, 5, 4, 6, 4, 6, 2, 0, 5, 4, 5)  # no cloak (week 1)
data2 <- c(4, 3, 6, 6, 8, 5, 5, 4, 2, 5, 7, 5) # cloak (week 2)
#calculate difference scores
diff <- data - data2
mean(diff) #mean of differences
## [1] -1.25
sd(diff) #standard deviation of differences
## [1] 1.13818
t <- mean(diff) / (sd(diff)/sqrt(length(diff)))
#pt- area to the left on the t-distribution curve
pt(t, df=11)
## [1] 0.001460396
#2 sided test - need to multiply by 2
2*pt(t, df=11)
## [1] 0.002920793

R code

  1. Check assumption of normality
#check assumption - normality
hist(diff)

shapiro.test(diff)
## 
##  Shapiro-Wilk normality test
## 
## data:  diff
## W = 0.91231, p-value = 0.2284
  2. Run dependent t-test
#run t-test
t.test(data, data2, paired=T)
## 
##  Paired t-test
## 
## data:  data and data2
## t = -3.8044, df = 11, p-value = 0.002921
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -1.9731653 -0.5268347
## sample estimates:
## mean difference 
##           -1.25

What if the assumptions are not met?

_ The Wilcoxon signed-rank test, which utilizes ranks of the difference scores

Steps to creating the ranks (a sketch of these steps in R follows the list):

  1. Rank the absolute values of the difference scores (1 for the smallest; differences of zero are dropped)

  2. Then sum up the positive ranks

  3. Sum up the negative ranks

  4. Your test statistic is the smaller of the two sums
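A minimal sketch of these steps applied to the cloak data (reusing the data and data2 vectors defined above):

d <- data - data2   # difference scores
d <- d[d != 0]      # zero differences are dropped
r <- rank(abs(d))   # rank the absolute differences (ties get the average rank)
sum(r[d > 0])       # sum of positive ranks = 2.5 (this is what wilcox.test() reports as V)
sum(r[d < 0])       # sum of negative ranks = 52.5; the smaller of the two sums is the test statistic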

R code

#Wilcoxon signed-rank test
wilcox.test(data, data2, paired=T, exact=F)
## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  data and data2
## V = 2.5, p-value = 0.01085
## alternative hypothesis: true location shift is not equal to 0