Note

If you find this document helpful for your research, please cite the references listed in it. For the document itself, you can cite Aydın et al. (2018):

Aydın, B., Algina, J., Leite, W. L., & Atılgan, H. (2018). An R Companion: A Compact Introduction for Social Scientists. Ankara: Anı Yayıncılık.

The dependent groups t-test (Within-subjects t-test)

To examine whether surgical tape or suture is the better method for closing wounds, incisions were made on both sides of the spine of each of 20 rats. One wound was closed with tape and the other was sutured; the side closed with tape was determined at random. After 10 days, the tensile strength of each wound was measured. The data are as follows.

wounds=data.frame(ratid=1:20,
                  tape=c(6.59,9.84 ,3.97,5.74,4.47,4.79,6.76,7.61,6.47,5.77,
                         7.36,10.45,4.98,5.85,5.65,5.88,7.77,8.84,7.68,6.89),
                  suture=c(4.52,5.87,4.60,7.87,3.51,2.77,2.34,5.16,5.77,5.13,
                           5.55,6.99,5.78,7.41,4.51,3.96,3.56,6.22,6.72,5.17))

# Reshape to long format for plotting
library(tidyr)
plotdata=gather(wounds, method, strength, tape:suture, factor_key=TRUE)

library(ggplot2)
# Histogram with a density overlay for each closure method
ggplot(plotdata, aes(x = strength)) +
  geom_histogram(aes(y = after_stat(density)),col="black",alpha=0.7) + 
  geom_density(linewidth=2) +
  theme_bw()+labs(x = "strength")+ facet_wrap(~ method)+
  theme(axis.text=element_text(size=15),
        axis.title=element_text(size=14,face="bold"))
Wounds example


R code for the dependent groups t-test

The following are the steps for conducting the dependent groups t-test, followed by R code that implements them.

  1. Create descriptive statistics
  2. Calculate the test statistic

\[t=\frac{\bar{Y}_1-\bar{Y}_2}{\sqrt{\frac{S_1^2+S_2^2-2S_1 S_2 r_{12}}{n}}}\]

  3. Find the critical value \(\pm t_{\alpha/2,n-1}\) to test \[H_0:\mu_1-\mu_2=0\] \[H_1:\mu_1-\mu_2 \neq 0\]
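
As a brief aside, the denominator of the test statistic is the estimated standard error of the mean difference. Writing \(D_i = Y_{1i} - Y_{2i}\) for the difference scores (the symbols \(D_i\), \(\bar{D}\) and \(S_D\) are introduced here for illustration only), the sample variance of the differences satisfies the identity below, so the same statistic can be obtained from the difference scores alone:

\[S_D^2 = S_1^2+S_2^2-2S_1 S_2 r_{12}, \qquad t=\frac{\bar{D}}{S_D/\sqrt{n}}, \qquad \bar{D}=\bar{Y}_1-\bar{Y}_2\]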

library(psych)
descDT=with(wounds,describe(cbind(tape,suture)))
descDT
##        vars  n mean   sd median trimmed  mad  min   max range  skew kurtosis
## tape      1 20 6.67 1.71   6.53    6.54 1.45 3.97 10.45  6.48  0.55    -0.45
## suture    2 20 5.17 1.49   5.16    5.19 1.30 2.34  7.87  5.53 -0.08    -0.87
##          se
## tape   0.38
## suture 0.33

corDT=with(wounds,cor(tape,suture,use="complete.obs"))
corDT
## [1] 0.3536491


# estimated standard error
ese=sqrt(((1.71^2+1.49^2)-(2*1.71*1.49*corDT))/(20))

# t-statistic
tstatistic=(6.67-5.17)/ese

# critical value for alpha=0.05
qt(.975,df=19)
## [1] 2.093024

Given that the t-statistic of 3.67 is greater than the critical value of \(t_{.975,19}=2.09\), \(H_0\) is rejected.
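
The calculation above plugs in rounded descriptive statistics. As a minimal sketch, assuming the same wounds data, the statistic can also be computed from unrounded quantities, once with the formula for the test statistic and once from the difference scores. The two results should agree with each other and, up to rounding, with the hand calculation. The object names (se_formula, t_formula, d, t_diff) are illustrative choices and not part of the original code.

```r
# Tensile strength data from the wounds example
tape   <- c(6.59, 9.84, 3.97, 5.74, 4.47, 4.79, 6.76, 7.61, 6.47, 5.77,
            7.36, 10.45, 4.98, 5.85, 5.65, 5.88, 7.77, 8.84, 7.68, 6.89)
suture <- c(4.52, 5.87, 4.60, 7.87, 3.51, 2.77, 2.34, 5.16, 5.77, 5.13,
            5.55, 6.99, 5.78, 7.41, 4.51, 3.96, 3.56, 6.22, 6.72, 5.17)
n <- length(tape)

# Standard error from the two SDs and their correlation (unrounded)
se_formula <- sqrt((var(tape) + var(suture) -
                      2 * sd(tape) * sd(suture) * cor(tape, suture)) / n)
t_formula  <- (mean(tape) - mean(suture)) / se_formula

# Same statistic computed from the difference scores
d      <- tape - suture
t_diff <- mean(d) / (sd(d) / sqrt(n))

c(t_formula = t_formula, t_diff = t_diff)
```

Working from unrounded values avoids small discrepancies that can arise when rounded means, SDs and correlations are typed in by hand.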

A more convenient approach is to use the t.test() function:

with(wounds, t.test(tape,suture,paired=TRUE,
                    alternative="two.sided",
                    conf.level=0.95))
## 
##  Paired t-test
## 
## data:  tape and suture
## t = 3.6678, df = 19, p-value = 0.001636
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.6429426 2.3520574
## sample estimates:
## mean difference 
##          1.4975

Write-up for a non-directional dependent groups t-test:

A dependent groups t-test showed that the mean tensile strength of wounds closed with surgical tape (mean=6.67, SD=1.71, skew=0.55, kurtosis=-0.45) was statistically significantly different from that of wounds closed with suture (mean=5.17, SD=1.49, skew=-0.08, kurtosis=-0.87), t(19)=3.67, p=0.002, r=0.35. The 95% confidence interval for the mean difference was [0.64, 2.35].
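
The numbers in a write-up like this can be pulled from the object returned by t.test() rather than copied by hand. The following is a minimal sketch under that assumption; the object name res is a hypothetical choice.

```r
# Tensile strength data from the wounds example
tape   <- c(6.59, 9.84, 3.97, 5.74, 4.47, 4.79, 6.76, 7.61, 6.47, 5.77,
            7.36, 10.45, 4.98, 5.85, 5.65, 5.88, 7.77, 8.84, 7.68, 6.89)
suture <- c(4.52, 5.87, 4.60, 7.87, 3.51, 2.77, 2.34, 5.16, 5.77, 5.13,
            5.55, 6.99, 5.78, 7.41, 4.51, 3.96, 3.56, 6.22, 6.72, 5.17)

# Store the paired t-test result instead of only printing it
res <- t.test(tape, suture, paired = TRUE,
              alternative = "two.sided", conf.level = 0.95)

res$statistic   # t value
res$parameter   # degrees of freedom
res$p.value     # two-sided p value
res$conf.int    # 95% confidence interval for the mean difference
res$estimate    # mean difference

# Rounded pieces for reporting
round(c(t = unname(res$statistic), df = unname(res$parameter),
        p = res$p.value), 3)
```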

Assumptions for the dependent groups t-test

The difference scores (\(Y_{1i} - Y_{2i}\)) should be normally distributed and independent of one another. However, the dependent groups t-test is expected to be robust to violations of normality when the sample size is large.
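
One way to examine the normality assumption is to work with the difference scores directly. The sketch below assumes that a Shapiro-Wilk test and a normal Q-Q plot are acceptable diagnostics; the source does not prescribe a specific check, and the object names d and sw are illustrative.

```r
# Tensile strength data from the wounds example
tape   <- c(6.59, 9.84, 3.97, 5.74, 4.47, 4.79, 6.76, 7.61, 6.47, 5.77,
            7.36, 10.45, 4.98, 5.85, 5.65, 5.88, 7.77, 8.84, 7.68, 6.89)
suture <- c(4.52, 5.87, 4.60, 7.87, 3.51, 2.77, 2.34, 5.16, 5.77, 5.13,
            5.55, 6.99, 5.78, 7.41, 4.51, 3.96, 3.56, 6.22, 6.72, 5.17)

# The assumption concerns the differences, not the two variables separately
d <- tape - suture

# Shapiro-Wilk test of normality for the difference scores
sw <- shapiro.test(d)
sw

# Normal Q-Q plot with a reference line
qqnorm(d, main = "Normal Q-Q plot of difference scores")
qqline(d)
```

With only 20 pairs, a formal test has limited power to detect non-normality, so the plot is usually read alongside the test rather than replaced by it.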

Robust estimation for the dependent groups t-test

When the departure from normality is severe, a percentile bootstrap procedure can be employed (Wilcox, 2012, p. 201).

#Calculate 95% CI using bootstrap (normality is not assumed)
set.seed(04012017)
B=5000       # number of bootstraps
alpha=0.05   # alpha

wounds=data.frame(ratid=1:20,
                  tape=c(6.59,9.84 ,3.97,5.74,4.47,4.79,6.76,7.61,6.47,5.77,
                         7.36,10.45,4.98,5.85,5.65,5.88,7.77,8.84,7.68,6.89),
                  suture=c(4.52,5.87,4.60,7.87,3.51,2.77,2.34,5.16,5.77,5.13,
                           5.55,6.99,5.78,7.41,4.51,3.96,3.56,6.22,6.72,5.17))

output=c()
for (i in 1:B){
  #sample rows
  bs_rows=sample(wounds$ratid,replace=T,size=nrow(wounds))
  bs_sample=wounds[bs_rows,]
  mean1=mean(bs_sample$tape)
  mean2=mean(bs_sample$suture)
  output[i]=mean1-mean2
  }
output=sort(output)

## Non-directional 
# d star lower
output[as.integer(B*alpha/2)+1]
## [1] 0.717

# d star upper
output[B-as.integer(B*alpha/2)]
## [1] 2.2755

##Directional x2>x1
# d star lower
output[as.integer(B*alpha)+1]
## [1] 0.8485

#wrong direction x2<x1
# d star upper
output[as.integer(B*(1-alpha))]
## [1] 2.1495
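
As a compact alternative sketch, the non-directional interval can be obtained by resampling the difference scores, which is equivalent to resampling rows and subtracting the two means. The seed below is arbitrary, and quantile() interpolates between order statistics, so the bounds may differ slightly from those obtained with the indexing rule used above. The names d, boot_means and ci are illustrative.

```r
# Tensile strength data from the wounds example
tape   <- c(6.59, 9.84, 3.97, 5.74, 4.47, 4.79, 6.76, 7.61, 6.47, 5.77,
            7.36, 10.45, 4.98, 5.85, 5.65, 5.88, 7.77, 8.84, 7.68, 6.89)
suture <- c(4.52, 5.87, 4.60, 7.87, 3.51, 2.77, 2.34, 5.16, 5.77, 5.13,
            5.55, 6.99, 5.78, 7.41, 4.51, 3.96, 3.56, 6.22, 6.72, 5.17)

set.seed(2017)   # arbitrary seed (an assumption) for reproducibility
B     <- 5000    # number of bootstrap samples
alpha <- 0.05

# Difference score for each rat
d <- tape - suture

# Mean difference in each bootstrap sample of rats
boot_means <- replicate(B, mean(sample(d, replace = TRUE)))

# Non-directional percentile interval
ci <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2))
ci
```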

Write-up for a non-directional percentile bootstrap method:

The mean tensile strength of wounds closed with surgical tape (mean=6.67, SD=1.71, skew=0.55, kurtosis=-0.45) was statistically significantly different from that of wounds closed with suture (mean=5.17, SD=1.49, skew=-0.08, kurtosis=-0.87), given that the 95% percentile bootstrap confidence interval for the mean difference, [0.717, 2.2755], did not include zero.¹

Effect size for the dependent groups t-test

A simple effect size formula for a dependent groups t-test is (Equation 7 in Lakens, 2013)²:

\[ES=\frac{t}{\sqrt{n}}\]

# The normality and equal-variance assumptions are retained here,
# given that the robust procedure gave roughly the same results
n=20
tval=3.6678

ES=tval/sqrt(n)
ES
## [1] 0.820145

library(effsize)
cohen.d(wounds$tape,wounds$suture, 
        paired=T, conf.level=0.95,noncentral=F)
## 
## Cohen's d
## 
## d estimate: 0.9324681 (large)
## 95 percent confidence interval:
##     lower     upper 
## 0.3159907 1.5489454

The manual calculation gives an effect size of 0.820. The effsize package (Torchiano, 2016) reported a Cohen's d of 0.932 with a 95% CI of [0.316, 1.549].
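
Footnote 2 points to Equation 10 in Lakens (2013) as a more appropriate alternative. The sketch below computes both versions from the raw data, assuming the formulas exactly as stated in this document. The labels d_z (mean difference divided by the SD of the differences, which equals \(t/\sqrt{n}\)) and d_av (mean difference divided by the average of the two SDs) are used here for convenience.

```r
# Tensile strength data from the wounds example
tape   <- c(6.59, 9.84, 3.97, 5.74, 4.47, 4.79, 6.76, 7.61, 6.47, 5.77,
            7.36, 10.45, 4.98, 5.85, 5.65, 5.88, 7.77, 8.84, 7.68, 6.89)
suture <- c(4.52, 5.87, 4.60, 7.87, 3.51, 2.77, 2.34, 5.16, 5.77, 5.13,
            5.55, 6.99, 5.78, 7.41, 4.51, 3.96, 3.56, 6.22, 6.72, 5.17)

d <- tape - suture
n <- length(d)

# Equation 7 form: t / sqrt(n), i.e. mean difference over SD of differences
d_z <- mean(d) / sd(d)

# Equation 10 form: mean difference over the average of the two SDs
d_av <- mean(d) / ((sd(tape) + sd(suture)) / 2)

c(d_z = d_z, d_av = d_av)
```

Because d_z depends on the SD of the differences, it grows as the correlation between the two measures increases, whereas d_av depends only on the two SDs.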

Missing data techniques for the dependent groups t-test

To be added

Supportive graphs for the dependent groups t-test

To be added

Power calculations for the dependent groups t-test

The basics of statistical power were provided in earlier slides.

#power.t.test
power.t.test(delta=.35, sd=.6,sig.level=0.05, power=0.95, 
             type="paired", alternative="two.sided")
## 
##      Paired t test power calculation 
## 
##               n = 40.16447
##           delta = 0.35
##              sd = 0.6
##       sig.level = 0.05
##           power = 0.95
##     alternative = two.sided
## 
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs

This illustration shows that, given a mean difference of 0.35, a standard deviation of the differences of 0.6, an alpha level of 0.05, a non-directional test and a desired power of 0.95, the required sample size (number of pairs) is 41, which is 40.16 rounded up. In other words, with 41 pairs the probability of rejecting the null hypothesis \(H_0:\mu_1-\mu_2=0\) is at least .95 when the true mean difference is 0.35, the SD of the differences is 0.6, alpha=0.05 and a non-directional paired t-test is used.
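
The rounding step can be checked by fixing the number of pairs and letting power.t.test() solve for power instead. This is a minimal sketch using the same inputs as above; the object names pw40 and pw41 are illustrative.

```r
# Power with 40 pairs (just below the computed n of 40.16)
pw40 <- power.t.test(n = 40, delta = 0.35, sd = 0.6, sig.level = 0.05,
                     type = "paired", alternative = "two.sided")$power

# Power with 41 pairs (the computed n rounded up)
pw41 <- power.t.test(n = 41, delta = 0.35, sd = 0.6, sig.level = 0.05,
                     type = "paired", alternative = "two.sided")$power

c(n40 = pw40, n41 = pw41)
```

Rounding up rather than to the nearest integer keeps the achieved power at or above the target.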

Common Designs

We first present examples of designs commonly used in studies in the social and behavioral sciences to compare two means. The steps used in such studies are

  1. obtain scores under each of the two treatments
  2. compute the mean for each treatment, and
  3. compare the means using a statistical hypothesis test.

An important distinction in selecting a statistical test is whether the scores in the two treatments are correlated or independent, so we classify the designs on that basis. Then we turn to terminology for describing designs. This terminology facilitates discussing designs and determining the correct data analysis procedure to use with a design.
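
To illustrate why the distinction matters, the sketch below reuses the wounds data from the earlier example and compares the standard error of the mean difference computed as if the two sets of scores were independent with the one that accounts for their correlation. The variable names are illustrative.

```r
# Tensile strength data from the wounds example
tape   <- c(6.59, 9.84, 3.97, 5.74, 4.47, 4.79, 6.76, 7.61, 6.47, 5.77,
            7.36, 10.45, 4.98, 5.85, 5.65, 5.88, 7.77, 8.84, 7.68, 6.89)
suture <- c(4.52, 5.87, 4.60, 7.87, 3.51, 2.77, 2.34, 5.16, 5.77, 5.13,
            5.55, 6.99, 5.78, 7.41, 4.51, 3.96, 3.56, 6.22, 6.72, 5.17)
n <- length(tape)
r <- cor(tape, suture)

# Standard error that ignores the pairing (treats scores as independent)
se_independent <- sqrt((var(tape) + var(suture)) / n)

# Standard error that accounts for the correlation between paired scores
se_correlated <- sqrt((var(tape) + var(suture) -
                         2 * sd(tape) * sd(suture) * r) / n)

c(r = r, se_independent = se_independent, se_correlated = se_correlated)
```

When the scores are positively correlated, as here, the standard error that accounts for the correlation is smaller, which is why choosing the wrong procedure changes the test result.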

Designs in which Scores in the Two Treatments are Correlated

We want to be able to determine whether the scores used to compute one mean are likely to be correlated with the scores used to compute the second mean. While this goal would seem to require analyzing the data, the surface characteristics of the design used to collect the data can be used to determine whether or not the scores are likely to be correlated.

Repeated measures designs

These are designs in which multiple measurements of the same variables are made on the same subjects.

  1. Subjects as own control design: To examine whether activation of a concept in semantic memory increases accessibility of related concepts, 100 college students were asked to read pairs of words. The first member of each pair was either a weapon word (such as “dagger” or “bullet”) or a non-weapon word. The second member was always an aggressive word (such as “destroy” or “wound”). On each of 192 trials, a computer presented a priming stimulus word (either a weapon or non-weapon word) for 1.25 seconds, a blank screen for 0.5 seconds, and then the target aggressive word. The experimenter instructed the participants to read the first word to themselves and then to read the second word out loud as quickly as they could. The computer recorded how long it took to read the second word. Average reaction time was computed for each participant under each type of prime word. The data could be recorded in a table like the following:

| Subject | Prime word: Weapon | Prime word: Non-weapon |
|---------|--------------------|------------------------|
| 1       |                    |                        |
| 2       |                    |                        |
| ...     |                    |                        |
| 100     |                    |                        |

Based on the idea that some participants read more quickly than others, we would expect the reaction times under the two types of prime words to be correlated.

  2. Longitudinal designs: Mathematics achievement is measured twice for 48 6th grade students: at the beginning of the school year and at the end of the school year. The purpose is to test whether or not the means change over time. The data could be recorded in a table like the following:

| Subject | Time: Beginning | Time: End |
|---------|-----------------|-----------|
| 1       |                 |           |
| 2       |                 |           |
| ...     |                 |           |
| 48      |                 |           |

Because the same students are measured on each occasion we expect the scores to be correlated over time.

Blocking designs

These are designs in which participants are placed in pairs; the members of each pair are expected to perform similarly.

  1. Randomized Block Design: A study was conducted to examine the effects of metacognitive instruction on reading. Thirty second-grade students were administered a reading test and placed in pairs based on the results.

| Pair | Ranks on Reading Pretest |
|------|--------------------------|
| 1    | 1, 2                     |
| 2    | 3, 4                     |
| ...  | ...                      |
| 15   | 29, 30                   |

As shown, the students with the two highest scores were in the first pair, the students with the second highest scores were in the second pair, and so forth. From within each pair one student was randomly assigned to the metacognitive training and one to the control treatment.
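
The pairing and within-pair random assignment described above can be sketched as follows. The pretest scores are simulated purely for illustration, and the seed, column names and score distribution are assumptions rather than part of the original study.

```r
set.seed(123)   # arbitrary seed so the sketch is reproducible

n_students <- 30
n_pairs    <- n_students / 2

# Hypothetical pretest scores (simulated, not real data)
students <- data.frame(id      = seq_len(n_students),
                       pretest = round(rnorm(n_students, mean = 50, sd = 10)))

# Order from highest to lowest pretest score and form consecutive pairs
students      <- students[order(-students$pretest), ]
students$pair <- rep(seq_len(n_pairs), each = 2)

# Within each pair, randomly assign one student to each treatment
students$treatment <- as.vector(
  replicate(n_pairs, sample(c("metacognitive", "control")))
)

# Each pair should contain exactly one student per treatment
table(students$pair, students$treatment)
```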

Following completion of training, the students were tested again on reading. The purpose was to determine whether or not type of training affected mean reading. The data can be recorded in a table like the following:

| Pair | Training: Metacognitive | Training: Control |
|------|-------------------------|-------------------|
| 1    |                         |                   |
| 2    |                         |                   |
| ...  |                         |                   |
| 15   |                         |                   |

Clearly the scores on the reading pretest will be correlated for pairs of students. However, the scores that are to be analyzed are the scores on the reading posttest. Will these be correlated? Because the students within the first pair have the two highest reading pretest scores, we would expect the student assigned from this pair to the metacognitive treatment to have among the highest scores on the reading posttest; similarly for the student assigned to the control treatment. The students within the last pair have the two lowest reading pretest scores. Therefore we would expect the student assigned from this pair to the metacognitive treatment to have among the lowest scores on the reading posttest; similarly for the student assigned to the control treatment.

The term block is a more general term than pair. It refers to a group of subjects who are homogeneous on some variable. When there are just two treatments, a randomized block design (RBD) can be diagrammed as follows:

| Block | Treatment 1 | Treatment 2 |
|-------|-------------|-------------|
| 1     |             |             |
| 2     |             |             |
| ...   |             |             |
| n     |             |             |

Each block is a pair of subjects. One member of the block is exposed to treatment 1 and the other is exposed to treatment 2.

  2. Nonrandomized block design: A study is conducted to investigate state anxiety levels of physically abused children in a stressful situation. A control group consists of non-abused children matched (matched is a synonym for blocked when each block consists of a pair of subjects) on trait anxiety with the abused children. There were 20 abused children in the study. The data could be recorded in a table like the following:

| Pair | Type of child: Abused | Type of child: Control |
|------|-----------------------|------------------------|
| 1    |                       |                        |
| 2    |                       |                        |
| ...  |                       |                        |
| 20   |                       |                        |

We expect the state anxiety scores to be correlated because of the matching on trait anxiety.

  3. Familial Designs: Twenty-five pairs of mothers and adult daughters are surveyed about their political views. The purpose is to test for mean differences between mothers and daughters. The data could be recorded in a table like the following:

| Pair | Type of person: Mother | Type of person: Daughter |
|------|------------------------|--------------------------|
| 1    |                        |                          |
| 2    |                        |                          |
| ...  |                        |                          |
| 25   |                        |                          |

We expect the political views of mothers and daughters to be at least somewhat correlated.

  4. Dyad Designs: Fifty pairs of African-American and European-American students are formed. The pairs complete a task involving cooperation. Following completion of the task, subjects rate the cooperativeness of their partner. The data could be recorded in a table like the following:

| Pair | Ethnic background: African American | Ethnic background: European American |
|------|-------------------------------------|--------------------------------------|
| 1    |                                     |                                      |
| 2    |                                     |                                      |
| ...  |                                     |                                      |
| 25   |                                     |                                      |

We expect the cooperativeness scores for members of a pair to be related.

Designs in which Scores in the Two Treatments are Independent

  1. Completely Randomized Design: It has been proposed that pain can be treated with magnetic fields. Fifty patients experiencing arthritic pain were recruited. Half of the patients were randomly assigned to be treated with an active magnetic device and half were assigned to be treated with an inactive device. All patients rated their pain after application of the device. The purpose is to determine whether or not type of device affects mean pain ratings. The data can be recorded in a table like the following:

| Device: Magnetic | Device: Inactive |
|------------------|------------------|
| ...              | ...              |
| ...              | ...              |
| ...              | ...              |

Note that there is no way to pair the scores and therefore the scores cannot be correlated.
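
For contrast with the blocked designs, random assignment in a completely randomized design ignores any pairing. A minimal sketch, with a hypothetical patient identifier and an arbitrary seed:

```r
set.seed(456)   # arbitrary seed so the sketch is reproducible

# Fifty hypothetical patients
patients <- data.frame(id = 1:50)

# Shuffle 25 "magnetic" and 25 "inactive" labels across all patients
patients$device <- sample(rep(c("magnetic", "inactive"), each = 25))

table(patients$device)
```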

  2. Nonrandomized Design: Fifty 8th grade boys and 50 8th grade girls take a test on two-digit addition. The test is computer generated and measures the amount of time taken to answer each question. The purpose is to determine whether or not there are gender differences in mean time to respond. Again, there is no way to pair the scores, and therefore the scores cannot be correlated.

References

Aydın, B., Algina, J., Leite, W., & Atılgan, H. (2018). An R companion: A compact introduction for social scientists. Anı Yayıncılık.
Lakens, D. (2013). Calculating and reporting effect sizes to facilitate cumulative science: A practical primer for t-tests and ANOVAs. Frontiers in Psychology, 4, 863. https://doi.org/10.3389/fpsyg.2013.00863
R Core Team. (2016). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/
Revelle, W. (2016). psych: Procedures for psychological, psychometric, and personality research. https://CRAN.R-project.org/package=psych
Torchiano, M. (2016). effsize: Efficient effect size computation. https://CRAN.R-project.org/package=effsize
Wilcox, R. R. (2012). Introduction to robust estimation and hypothesis testing (3rd ed.). Academic Press.

  1. The descriptive statistics were calculated with the psych package (Revelle, 2016), and the non-directional percentile bootstrap method with 5000 replications was conducted with base R (R Core Team, 2016).

  2. This effect size goes to infinity as r goes to 1, even when the means are very similar. Equation 10 in Lakens (2013), \(\frac{\text{mean difference}}{(SD_1+SD_2)/2}\), is more appropriate.