Comparing Two Means, the t-test

Earlier sections introduced the basics of a sampling distribution using the sample mean. When the interest is in comparing two means, the t-test is useful, and the sampling distribution of the mean difference between two groups drives the analyses.

The mean of the sampling distribution of \(\bar{Y_1}-\bar{Y_2}\) \((\mu_{\bar{Y_1}-\bar{Y_2}})\) is always equal to \(\mu_1 - \mu_2\), but the standard deviation of the sampling distribution \((\sigma_{\bar{Y_1}-\bar{Y_2}})\) depends on the design used to collect the data.

Example: Consider an example in which the tensile strength of wounds closed by Suture and Tape is compared. The design for conducting this study will have one factor, Method of Wound Closure, with two levels, Tape and Suture. The following are two designs for conducting the study:

Within-subjects design. Incisions are made on both sides of the spine for each of 10 rats. Tape is used to close one of the wounds; the other is sutured. For each rat, the wound closed by tape is determined randomly. This design is called within-subjects because the measurements under tape and suture are made on the same rat; rats are the subjects in the study.

Between-subjects design. Beginning with 20 rats, 10 are randomly assigned to have a wound closed by tape and the other 10 rats have a wound closed by suture. For each rat an incision is made on one side of the spine. The side is determined randomly for each rat. (Half of the rats assigned to each closure method have the incision on the left side of the spine and half on the right side. We ignore side of the spine as a factor in this example.) This design is called between-subjects because the measurements under tape and suture are made on different rats. An additional requirement for classifying the design as between-subjects is that no attempt was made to match the rats prior to random assignment. For example, if the 20 rats were from 10 litters with different parents, the rats might have been matched on litter prior to random assignment.
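The random assignment in the two designs can be sketched in R as follows. This is only an illustration: the rat IDs, the seed, and the data frame names (`within`, `between`) are hypothetical and do not come from an actual study.

```r
# Sketch of the random assignment in the two designs (all names are hypothetical)
set.seed(1)  # arbitrary seed, used only for reproducibility

# Within-subjects: 10 rats, each receives both closure methods;
# the side of the spine closed by tape is chosen at random for each rat.
within <- data.frame(rat = 1:10,
                     tape_side = sample(c("left", "right"), 10, replace = TRUE))
within$suture_side <- ifelse(within$tape_side == "left", "right", "left")

# Between-subjects: 20 rats, 10 randomly assigned to each closure method
between <- data.frame(rat = 1:20,
                      method = sample(rep(c("tape", "suture"), each = 10)))

# Within each method, 5 rats get the incision on the left and 5 on the right,
# in random order
between$side <- NA
for (m in c("tape", "suture")) {
  idx <- which(between$method == m)
  between$side[idx] <- sample(rep(c("left", "right"), each = 5))
}

table(between$method, between$side)
```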

One can imagine a population mean and a population standard deviation under each closure method.
For example the population mean under tape closure is the mean for an indefinitely large group of rats all of which have a wound closed by tape.

In the following comparison it is assumed that the population mean for tape closing will be the same in the within-subjects and the between-subjects design and that the population standard deviation will be the same in the within-subjects and the between-subjects design.

The corresponding assumptions for the population mean and standard deviation for the suture closing are made.

The following are the symbols for these population parameters.

Parameter for Population	Tape	Suture
Mean	\(\mu_T\)	\(\mu_S\)
Standard deviation	\(\sigma_T\)	\(\sigma_S\)
Sample size	\(n_T\)	\(n_S\)

Note. More generally, we use \(\mu_1\) and \(\mu_2\) for the population means of the two treatments and \(\sigma_1\) and \(\sigma_2\) for the population standard deviations of the two treatments.

Parameter for Sampling Distribution	Between-Subjects	Within-Subjects
Mean (\(\mu_{\bar{Y_T}-\bar{Y_S}}\))	\(\mu_T-\mu_S\)	\(\mu_T-\mu_S\)
Std deviation (\(\sigma_{\bar{Y_T}-\bar{Y_S}}\))	\(\sqrt{\frac{\sigma_T^2+\sigma_S^2}{n}}\)	\(\sqrt{\frac{\sigma_T^2+\sigma_S^2-2\sigma_T \sigma_S \rho_{TS}}{n}}\)

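\(\rho_{TS}\) is the correlation between the tensile strength scores in the tape and suture treatments in the within-subjects design, and \(n\) is the common sample size per treatment (\(n_T=n_S=n\)).
The difference in the standard errors is due to \(\rho_{TS}\). If this correlation is zero the designs result in the same standard error; if it is positive, the within-subjects design results in the smaller standard error.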
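The role of \(\rho_{TS}\) can be illustrated with a small numeric sketch. The population values below (\(\sigma_T=\sigma_S=10\), \(n=10\)) and the helper `se_within` are hypothetical and chosen only to show how the within-subjects standard error shrinks as the correlation increases.

```r
# Hypothetical population values, chosen only for illustration
sigma_T <- 10
sigma_S <- 10
n <- 10

# Standard error of the mean difference in the between-subjects design
se_between <- sqrt((sigma_T^2 + sigma_S^2) / n)

# Standard error in the within-subjects design as a function of rho_TS
se_within <- function(rho) {
  sqrt((sigma_T^2 + sigma_S^2 - 2 * sigma_T * sigma_S * rho) / n)
}

rho <- c(0, 0.25, 0.5, 0.75)
data.frame(rho = rho, se_between = se_between, se_within = se_within(rho))
```

With \(\rho_{TS}=0\) the two columns agree; for larger positive correlations the within-subjects value is smaller.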

An important goal in designing a study is to make the standard error as small as possible. When the standard error is small the statistic in which we are interested will tend to be close in numeric value to the parameter we are estimating.

In data analysis we must select a formula for a standard error (or for the error variance). Selecting the wrong formula is a critical error in data analysis.

In practice the standard error is selected by classifying the design as between-subjects or within-subjects. This means that incorrectly classifying the design is a critical error in data analysis.

Between-Subjects t-test (The Independent Groups t-test)

The gender attitudes scores for college graduates vs non-college graduates in the city of USAK are compared. The density plot for each group’s gender attitudes scores is shown below.


# load csv from an online repository
urlfile='https://raw.githubusercontent.com/burakaydin/materyaller/gh-pages/ARPASS/dataWBT.csv'
dataWBT=read.csv(urlfile)

#remove URL 
rm(urlfile)
dataWBT_USAK=dataWBT[dataWBT$city=="USAK",]


# We explained the functions 'factor' and 'droplevels' in section 5.2.4
# here we create a factor, Higher Education Factor (HEF). 
# it is labeled as 'non-college' when the higher_ed variable equals 0, 
# 'college' when equals to 1.
# if you don't use the droplevels function, you might have an empty level 
dataWBT_USAK$HEF=droplevels(factor(dataWBT_USAK$higher_ed, 
                    levels = c(0,1), 
                    labels = c("non-college", "college")))

library(ggplot2)
plotdata=na.omit(dataWBT_USAK[,c("gen_att","HEF")])
ggplot(plotdata, aes(x = gen_att)) +
  geom_histogram(aes(y=after_stat(density)),col="black",binwidth = 0.2,alpha=0.7) + 
  # linewidth replaces the size aesthetic for lines (ggplot2 >= 3.4.0)
  geom_density(linewidth=2) +
  theme_bw()+labs(x = "Gender Attitude by HEF in USAK")+ facet_wrap(~ HEF)+
  theme(axis.text=element_text(size=15),
        axis.title=element_text(size=14,face="bold"))

Gender Attitudes by Treatment Group

R code for the independent groups t-test

The following are the steps for conducting the independent groups t-test, along with R code for implementing them:

1. Create descriptive statistics
2. Calculate the test statistic

\[t=\frac{\bar{Y_1}-\bar{Y_2}}{S_p \sqrt{\frac{1}{n_1}+\frac{1}{n_2}}}\]

\[ S_p = \sqrt{\frac{(n_1-1)S_1^2 + (n_2-1)S_2^2 }{n_1+n_2-2}} \]

3. Find the critical value \(\pm t_{\alpha/2,n_1+n_2-2}\) to test

\[H_0:\mu_1-\mu_2=0\] \[H_1:\mu_1-\mu_2 \neq 0\]


library(psych)
descIDT=with(dataWBT_USAK,describeBy(gen_att, HEF,mat=T,digits = 2))
descIDT
##     item      group1 vars  n mean   sd median trimmed  mad min max range skew
## X11    1 non-college    1 86 1.83 0.54    1.8    1.80 0.59   1 3.8   2.8 0.72
## X12    2     college    1 51 1.64 0.61    1.6    1.54 0.59   1 3.4   2.4 1.19
##     kurtosis   se
## X11     0.90 0.06
## X12     1.09 0.09
#write.csv(descIDT,file="independent_t_test_desc.csv")

# Pooled sd
sp=sqrt((85*.543^2 + 50*.608^2)/(86+51-2))

# t-statistic
tstatistic=(1.832-1.635)/(sp*sqrt(1/86+1/51))

# critical value for alpha=0.05
qt(.975,df=135)
## [1] 1.977692

Since the hand-calculated t statistic of 1.963 is smaller than the critical value of \(t_{.975,135}=1.978\), \(H_0\) is retained.

For \(H_1:\mu_1-\mu_2 > 0\), the critical value is \(t_{.95,135}=1.66\), which would yield the rejection of \(H_0\) given that 1.963 is greater than 1.66.

For \(H_1:\mu_1-\mu_2 < 0\), the critical value is \(t_{.05,135}=-1.66\), which would yield the retaining of \(H_0\) given that 1.963 is not lower than -1.66.
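Instead of comparing the statistic with a critical value, the same decisions can be reached through p-values. The following sketch repeats the hand calculation with the rounded summary statistics used above and obtains the p-values from `pt`; the variable names are illustrative. Because the inputs are rounded, the results differ slightly from what a test on the raw data would give.

```r
# Rounded summary statistics used in the hand calculation above
n1 <- 86; m1 <- 1.832; s1 <- 0.543   # non-college
n2 <- 51; m2 <- 1.635; s2 <- 0.608   # college

df <- n1 + n2 - 2

# Pooled standard deviation and t statistic
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df)
tstat <- (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))

# p-values for the three alternative hypotheses
p_two_sided <- 2 * pt(abs(tstat), df, lower.tail = FALSE)  # H1: mu1 - mu2 != 0
p_greater   <- pt(tstat, df, lower.tail = FALSE)           # H1: mu1 - mu2 > 0
p_less      <- pt(tstat, df)                               # H1: mu1 - mu2 < 0

round(c(t = tstat, two.sided = p_two_sided,
        greater = p_greater, less = p_less), 3)
```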

More convenient R code is:


# The dataWBT does not have HEF factor, 
# you should define it as it is given a few lines above.

t.test(gen_att~HEF,data=dataWBT_USAK,var.equal=T,
                                     alternative="two.sided",
                                     conf.level=0.95)
## 
##  Two Sample t-test
## 
## data:  gen_att by HEF
## t = 1.9587, df = 135, p-value = 0.05221
## alternative hypothesis: true difference in means between group non-college and group college is not equal to 0
## 95 percent confidence interval:
##  -0.001903268  0.394880924
## sample estimates:
## mean in group non-college     mean in group college 
##                  1.831783                  1.635294

# greater
t.test(gen_att~HEF,data=dataWBT_USAK,var.equal=T,
                                     alternative="greater",
                                     conf.level=0.95)
## 
##  Two Sample t-test
## 
## data:  gen_att by HEF
## t = 1.9587, df = 135, p-value = 0.0261
## alternative hypothesis: true difference in means between group non-college and group college is greater than 0
## 95 percent confidence interval:
##  0.03034529        Inf
## sample estimates:
## mean in group non-college     mean in group college 
##                  1.831783                  1.635294


# less
t.test(gen_att~HEF,data=dataWBT_USAK,var.equal=T,
                                     alternative="less",
                                     conf.level=0.95)
## 
##  Two Sample t-test
## 
## data:  gen_att by HEF
## t = 1.9587, df = 135, p-value = 0.9739
## alternative hypothesis: true difference in means between group non-college and group college is less than 0
## 95 percent confidence interval:
##       -Inf 0.3626324
## sample estimates:
## mean in group non-college     mean in group college 
##                  1.831783                  1.635294

Write up for non-directional test:

An independent groups t-test showed that in the city of USAK, the gender attitudes scores for the college graduates (n=51, mean=1.64, SD=0.61, skew=1.19, kurtosis=1.09) were not statistically different from those of the non-college graduates (n=86, mean=1.83, SD=0.54, skew=0.72, kurtosis=0.90), t(135)=1.96, p=0.052. The 95% confidence interval was [-0.002,0.395].¹

Write up for directional test:

A directional independent groups t-test showed that in the city of USAK, the gender attitudes scores for the college graduates (n=51, mean=1.64, SD=0.61, skew=1.19, kurtosis=1.09) were significantly lower than those of the non-college graduates (n=86, mean=1.83, SD=0.54, skew=0.72, kurtosis=0.90), t(135)=1.96, p=0.026. The one-sided 95% confidence interval was [0.030,\(\infty\)).

Assumptions of the independent groups t-test

Three assumptions should be met to claim statistical validity for a conventional between-subjects t-test.

Independence. The scores in each group should be independently distributed. The validity of this assumption is questionable when (a) scores for participants within a group are collected over time or (b) the participants within a group work together in a manner such that a participant’s response could have been influenced by another participant in the study. (See ANOVA assumptions for additional discussion.)
Normality. The scores within each group are drawn from a normal distribution. However, Myers et al. (2013) state that when the two groups are equal in size and the total sample size is 40 or larger, departures from normality can be tolerated unless the scores are drawn from extremely skewed distributions. As noted earlier, the authors of the current book are hesitant to conduct tests for normality. However, the use of robust procedures is advised when there is doubt about normality.
Equal variance. This assumption is also called the homogeneity of variance assumption and means that the samples in the two groups are assumed to be drawn from two populations with equal variances. Myers et al. (2013) state that when the sample sizes are equal and larger than 5, even with very large variance ratios (\(s_1^2/s_2^2=100\)) the conventional t-test leads to acceptable Type-I error rates. However, this is not the case with unequal sample sizes. Field et al. (2012) state that tests for variance homogeneity, e.g. Levene’s test, might not perform well with small and unequal sample sizes. The problems with tests on variance are that they are not powerful enough to detect inequality of variance even when it is large enough to cause problems with the t test, and that most are less robust to non-normality than the t test is. The t.test function, by default, does not assume equal variances and uses Welch’s t-test.

The summary of the assumptions of the independent groups t-test above is only introductory. For example, we did not discuss violating equal variance and normality simultaneously. The discussion of what is “acceptable” is another limitation of our brief summary. For example, when n1 = n2 = 10 we estimated the Type I error rate for \(\alpha = .01\) and a non-directional test to be .018 based on 100,000 replications. Most people would see .018 as liberal with \(\alpha = .01\).
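A Type I error rate of this kind can be estimated by Monte Carlo simulation. The conditions behind the .018 estimate are not stated here, so the sketch below does not attempt to reproduce it: the normal populations, the standard deviations, the seed, the number of replications, and the helper `type1_rate` are all assumptions made for illustration.

```r
# Monte Carlo sketch of the Type I error rate of the pooled-variance t-test.
# All simulation settings are illustrative assumptions, not the settings
# that produced the .018 estimate quoted in the text.
set.seed(123)

type1_rate <- function(n1, n2, sd1, sd2, alpha = 0.01, reps = 5000) {
  rejections <- replicate(reps, {
    y1 <- rnorm(n1, mean = 0, sd = sd1)  # H0 is true: equal population means
    y2 <- rnorm(n2, mean = 0, sd = sd2)
    t.test(y1, y2, var.equal = TRUE)$p.value < alpha
  })
  mean(rejections)  # proportion of false rejections
}

# Equal variances: the estimate should be close to alpha
r_equal <- type1_rate(10, 10, sd1 = 1, sd2 = 1)

# Equal n but unequal variances (variance ratio of 100, chosen arbitrarily)
r_unequal <- type1_rate(10, 10, sd1 = 1, sd2 = 10)

c(equal_variances = r_equal, unequal_variances = r_unequal)
```

Increasing `reps` gives a more precise estimate at the cost of a longer run time.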

There is an enormous literature on the effects of violating the assumptions of the independent samples t test on both Type I error rate and power, and a great deal is known about when the independent samples t test works well and when it does not. However, because that literature is so large, it is difficult to summarize it in a way that will allow data analysts to decide in every situation whether the independent samples t test should be used. Perhaps a reasonable summary is that if independence appears to be violated, an appropriate alternative to the independent samples t test should be used. If independence does not appear to be violated, then when the sample sizes are equal and at least 20 in each group and the scores are approximately normally distributed, the independent samples t test can be used. In other situations alternatives to the independent samples t test should be used.

Using Welch’s t test

Welch’s t-test can be conveniently implemented in R and is a reasonable choice for comparing means for independent groups when normality is not severely violated, the groups have different sample sizes, each group’s sample size is reasonably large (e.g. > 20), and the homogeneity of variance assumption is not made.


t.test(gen_att~HEF,data=dataWBT_USAK,var.equal=F,
                                     alternative="two.sided",
                                     conf.level=0.95)
## 
##  Welch Two Sample t-test
## 
## data:  gen_att by HEF
## t = 1.9028, df = 95.885, p-value = 0.06006
## alternative hypothesis: true difference in means between group non-college and group college is not equal to 0
## 95 percent confidence interval:
##  -0.008484626  0.401462282
## sample estimates:
## mean in group non-college     mean in group college 
##                  1.831783                  1.635294

Write up for non-directional Welch’s t-test:

An independent groups Welch’s t-test showed that in the city of USAK, the gender attitudes scores for the college graduates (n=51, mean=1.64, SD=0.61, skew=1.19, kurtosis=1.09) were not statistically different than the non-college graduates (n=86, mean=1.83, SD=0.54, skew=0.72, kurtosis=0.90), t(95.89)=1.90, p=0.06. The 95% confidence interval was [-0.008,0.402].
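To see where the fractional degrees of freedom come from, Welch’s statistic and the Welch-Satterthwaite degrees of freedom can be computed by hand. The sketch below uses the rounded summary statistics from the descriptive table, so the results are only approximately equal to the t.test output; the variable names are illustrative.

```r
# Rounded summary statistics from the descriptive table
n1 <- 86; m1 <- 1.832; s1 <- 0.543   # non-college
n2 <- 51; m2 <- 1.635; s2 <- 0.608   # college

# Estimated variance of each sample mean
v1 <- s1^2 / n1
v2 <- s2^2 / n2

# Welch's t statistic: no pooling of the two variances
se_welch <- sqrt(v1 + v2)
t_welch  <- (m1 - m2) / se_welch

# Welch-Satterthwaite approximation to the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Non-directional p-value
p_welch <- 2 * pt(abs(t_welch), df_welch, lower.tail = FALSE)

round(c(t = t_welch, df = df_welch, p = p_welch), 3)
```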

When departures from normality are severe, especially when the groups demonstrate substantially different distributions, a percentile bootstrap procedure is effective (Wilcox, 2012, p. 171).

#Calculate 95% CI using bootstrap (normality is not assumed)
set.seed(04012025)
B=5000       # number of bootstraps
alpha=0.05   # alpha

# define groups
GroupCollege=na.omit(dataWBT_USAK[dataWBT_USAK$HEF=="college","gen_att"])
GroupNONcollege=na.omit(dataWBT_USAK[dataWBT_USAK$HEF=="non-college","gen_att"])

output=numeric(B)   # preallocate the vector of bootstrap mean differences
for (i in 1:B){

  x1=mean(sample(GroupCollege,replace=T,size=length(GroupCollege)))
  x2=mean(sample(GroupNONcollege,replace=T,size=length(GroupNONcollege)))
  output[i]=x2-x1
  }
output=sort(output)

## non-directional 
# D star lower
output[as.integer(B*alpha/2)+1]
## [1] -0.01147971

# D star upper
output[B-as.integer(B*alpha/2)]
## [1] 0.3896147

##Directional x2>x1
# D star lower
output[as.integer(B*alpha)+1]
## [1] 0.0246124

#wrong direction x2<x1
# D star upper
output[as.integer(B*(1-alpha))]
## [1] 0.3589489

Write up for percentile bootstrap method:

In the city of USAK, the gender attitudes scores for the college graduates (n=51, mean=1.64, SD=0.61, skew=1.19, kurtosis=1.09) were not statistically different from those of the non-college graduates (n=86, mean=1.83, SD=0.54, skew=0.72, kurtosis=0.90) given that the 95% percentile bootstrap confidence interval was [-0.011,0.390].²

For a directional test: When the direction is appropriately stated in the alternative hypothesis, the lower limit of the 95% CI is 0.025 and yields the rejection of the null hypothesis of \(H_0:\mu_{non-college} = \mu_{college}\) in favor of \(H_1:\mu_{non-college}-\mu_{college} > 0\).

For a directional test: When the direction is NOT appropriately stated in the alternative hypothesis, the upper limit of the 95% CI is 0.359 and yields the retaining of the null hypothesis of \(H_0:\mu_{non-college} = \mu_{college}\) against \(H_1:\mu_{non-college}-\mu_{college} < 0\).

Effect size for the independent groups t-test

A t statistic tells whether the mean difference is large in a statistical sense but not in a substantive sense. To judge whether a mean difference is large in a substantive sense one can use an effect size. Cohen’s effect size is the difference between the means divided by the pooled standard deviation and can be computed from the t statistic using: \[ES=\frac{t}{\sqrt{\frac{n_1n_2}{n_1+n_2}}}\]

Effect sizes are often judged in terms of criteria suggested by Cohen (1962).

Effect Size	Description
.2	Small
.5	Medium
.8	Large

##  the normality and the equal variances assumptions are made 
## given the robust procedures provided roughly the same results
n1=51
n2=86
tval=1.96

ES=tval/sqrt((n1*n2)/(n1+n2))
ES
## [1] 0.3464033

# or by the lsr package from Danielle Navarro
library(lsr)
cohensD(gen_att~HEF,data=dataWBT_USAK,method = "pooled")
## [1] 0.346177
# try other values of the method argument (see ?cohensD)

The lsr package by Navarro (2015) reported an effect size of 0.35.
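As a check on the formula, the effect size can also be computed directly as the mean difference divided by the pooled standard deviation. The sketch below uses the rounded summary statistics from the descriptive table (the variable names are illustrative) and shows that the direct route and the route through the t statistic agree.

```r
# Rounded summary statistics from the descriptive table
n1 <- 86; m1 <- 1.832; s1 <- 0.543   # non-college
n2 <- 51; m2 <- 1.635; s2 <- 0.608   # college

# Pooled standard deviation
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Route 1: mean difference divided by the pooled standard deviation
d_direct <- (m1 - m2) / sp

# Route 2: from the pooled-variance t statistic
tstat <- (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
d_from_t <- tstat / sqrt(n1 * n2 / (n1 + n2))

c(d_direct = d_direct, d_from_t = d_from_t)
```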

Extra: Practical significance vs statistical significance

There are a number of points to keep in mind about practical significance versus statistical significance. (A term similar to practical significance is clinical significance.)

What do these terms mean? In treatment studies, statistically significant means large enough to be unlikely to have occurred by sampling error if the population means are equal whereas practically significant means large enough to be judged as practically important. Note then that significant has a different meaning in the two terms.

In treatment studies, practical significance can be measured by the mean difference or, when the scale of measurement is not well understood, by the effect size.

The claim is sometimes made that an effect can be practically significant but not statistically significant. This would mean that the effect is judged to be large but is not statistically significant. The problem with this claim is that an effect that is large but not statistically significant can only occur in a small study. Therefore the effect will be imprecisely estimated, which undermines the credibility of the claim that the effect is practically significant.

Another claim sometimes made is that an effect can be statistically significant, but not practically significant. This claim can be correct. For example, suppose there were 400 participants in an experiment, resulting in 200 participants in each group. The researcher found a small ES of 0.20, which is significantly different from zero (t = 2, p < .05). If we regard an effect size of .2 as not practically significant, then we have an effect that is statistically, but not practically, significant.
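The numbers in this example follow from rearranging the effect size formula to \(t = ES\sqrt{n_1n_2/(n_1+n_2)}\). A short sketch of the arithmetic, with illustrative variable names and assuming a non-directional pooled-variance test:

```r
# Values from the example: 200 participants per group, ES = 0.20
n1 <- 200
n2 <- 200
ES <- 0.20

# Rearranged effect size formula: t = ES * sqrt(n1 * n2 / (n1 + n2))
tstat <- ES * sqrt(n1 * n2 / (n1 + n2))   # 0.20 * 10 = 2

# Non-directional p-value with n1 + n2 - 2 degrees of freedom
df <- n1 + n2 - 2
p <- 2 * pt(abs(tstat), df, lower.tail = FALSE)

c(t = tstat, df = df, p = p)
```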

Missing data techniques for the independent groups t-test

To be added

Supportive graphs for the independent groups t-test