Clarification time!

(Note that you can run all this code yourself, too! Just copy paste the code from the grey sections into R and run it.)

Example

Let’s say we have a group of husbands (n=100) and a group of wives (n=100), and we measure both groups on how much they enjoy Cheetos, on a scale from 1 (I hate them) to 10 (Cheetos are totally my favorite food ever). Here are their scores:

set.seed(123) # this syncs up my computer's random number generator, so if you set this same seed, you can get exactly the same random results as me. Like when they sync up watches before the big heist in a cool action movie!!

men <- runif(n=100, min=1, max=10) # this generates some random numbers for me
women <- .9*men + runif(n=100, min=-1, max=1) # this generates more random numbers for me, such that husbands' and wives' scores are strongly correlated (the true underlying correlation works out to about r=.97)

# IST (independent-samples t-test)
t.test(men, women, paired=FALSE, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  men and women
## t = 1.5035, df = 198, p-value = 0.1343
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1621259  1.2026326
## sample estimates:
## mean of x mean of y 
##  5.487031  4.966778
# Plot it
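df <- data.frame(men=men, women=women)
library(ggplot2)
library(reshape2)
df_long <- melt(df, measure.vars=c("men", "women")) # stack into long format: one row per score, plus a column saying whose score it is
ggplot(df_long, aes(x=value, fill=variable)) +
  geom_histogram(position="identity", alpha=.5, bins=30) + # overlay the two histograms, semi-transparent
  scale_x_continuous(name="How much do you like Cheetos?")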

The IST finds no significant difference between the two groups (men and women) in how much they like Cheetos (p = .13).
The effect size here will be small.

est <- (mean(men)-mean(women)) # point estimate
est
## [1] 0.5202533
pooledSD <- sqrt((var(men) + var(women))/2) # exactly equal n's here, so we can just average the variances
pooledSD
## [1] 2.446809
d <- est/pooledSD
d
## [1] 0.2126252
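
Averaging the two variances only works because the groups are exactly the same size. If they weren’t, you’d weight each variance by its degrees of freedom instead. Here’s a rough sketch of that more general version; `pooled_sd` is a helper name made up for illustration, not something built into R.

```r
# Same data as above
set.seed(123)
men <- runif(n=100, min=1, max=10)
women <- .9*men + runif(n=100, min=-1, max=1)

# pooled_sd is a hypothetical helper (not a base R function):
# weight each group's variance by its degrees of freedom (n - 1)
pooled_sd <- function(x, y) {
  n_x <- length(x)
  n_y <- length(y)
  sqrt(((n_x - 1)*var(x) + (n_y - 1)*var(y)) / (n_x + n_y - 2))
}

pooledSD_general <- pooled_sd(men, women)
pooledSD_simple <- sqrt((var(men) + var(women))/2)
all.equal(pooledSD_general, pooledSD_simple) # with equal n's, the two versions agree
```

With unequal n’s the two versions would drift apart, and the weighted one is the one to trust.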

But so far we’ve been treating the two samples as though they are independent, when there is almost certainly a relationship between them! Surely, shared love or hatred for Cheetos is the foundation of any happy marriage. Husbands and wives probably have very similar scores, and most of the variability is just between couples.
Take a peek at the data and see what you notice:

head(df)
##        men    women
## 1 3.588198 3.429356
## 2 8.094746 6.950919
## 3 4.680792 4.189939
## 4 8.947157 8.961389
## 5 9.464206 8.483590
## 6 1.410008 2.049708

See how women who love Cheetos (higher scores) tend to also have husbands with high scores? And see how the Cheeto haters flock together as well?

# plot the correlation as a scatterplot
ggplot(df, aes(x=men, y=women)) +
  geom_point() +
  scale_x_continuous(name="How much do the fellas love Cheetos?") +
  scale_y_continuous(name="How much do the ladies love Cheetos?")
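
If you’d rather have a number than eyeball the scatterplot, `cor()` gives the sample correlation between the two sets of scores. This is just a quick check on the same simulated data, and it should come out very high.

```r
# Same data as above
set.seed(123)
men <- runif(n=100, min=1, max=10)
women <- .9*men + runif(n=100, min=-1, max=1)

r <- cor(men, women) # Pearson correlation between husbands' and wives' scores
r
```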

Let’s conduct a DST (dependent-samples t-test) to control for the couple-to-couple variability in Cheeto attitudes, and just see if within each couple there is a difference between men’s and women’s attitudes toward Cheetos.

df$diff <- men-women # calculate difference scores
head(df)
##        men    women        diff
## 1 3.588198 3.429356  0.15884185
## 2 8.094746 6.950919  1.14382754
## 3 4.680792 4.189939  0.49085316
## 4 8.947157 8.961389 -0.01423199
## 5 9.464206 8.483590  0.98061576
## 6 1.410008 2.049708 -0.63969959
# Plot it
ggplot(df, aes(x=diff)) +
  geom_histogram() +
  scale_x_continuous(limits=c(-5,5), name="Difference in Cheeto scores for husbands and wives")

Note the scale on the x-axis. These are all pretty small differences, given that it’s a 10-point scale.

t.test(men, women, paired=TRUE)
## 
##  Paired t-test
## 
## data:  men and women
## t = 8.5794, df = 99, p-value = 1.355e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3999312 0.6405755
## sample estimates:
## mean of the differences 
##               0.5202533
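
In case the paired test feels like a black box: it’s really just a one-sample t-test on the difference scores, testing whether their mean is 0. The sketch below runs it all three ways (paired, one-sample on the differences, and by hand) so you can see that the t statistics match. The variable names `diffs` and `t_by_hand` are just mine.

```r
# Same data as above
set.seed(123)
men <- runif(n=100, min=1, max=10)
women <- .9*men + runif(n=100, min=-1, max=1)
diffs <- men - women # difference scores, one per couple

paired <- t.test(men, women, paired=TRUE)
one_sample <- t.test(diffs, mu=0) # one-sample t-test on the difference scores

# t by hand: mean difference divided by its standard error
t_by_hand <- mean(diffs) / (sd(diffs)/sqrt(length(diffs)))

c(paired=unname(paired$statistic),
  one_sample=unname(one_sample$statistic),
  by_hand=t_by_hand)
```

Notice that the only SD in that formula is the SD of the difference scores. The couple-to-couple variability never enters into it.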

Controlling for couple-to-couple differences, men definitely like Cheetos more than their wives do.
How to express the size of that effect?

est # from above
## [1] 0.5202533
mean(df$diff) # the mean of the differences is the same as the difference of the means, so no matter what we do with IST vs. DST, the point estimate of the difference will be identical
## [1] 0.5202533
SDdiff <- sd(df$diff)
SDdiff
## [1] 0.6063961
d_prime <- est/SDdiff # this is a large effect. 
d_prime
## [1] 0.857943

So d is a small effect and d prime is a large effect? That’s because even though there’s a lot of variability in how much people like Cheetos, there’s actually very little variability in the difference scores for husbands and wives (i.e. whether they like Cheetos a lot or a little, they tend to agree about it).
We rely mostly on d for this class because d prime can be misleading. In this case, you might be tempted to say that men like Cheetos a LOT more than their wives do, based on the large effect size, but remember what our data actually look like:

head(df)
##        men    women        diff
## 1 3.588198 3.429356  0.15884185
## 2 8.094746 6.950919  1.14382754
## 3 4.680792 4.189939  0.49085316
## 4 8.947157 8.961389 -0.01423199
## 5 9.464206 8.483590  0.98061576
## 6 1.410008 2.049708 -0.63969959

The effect size estimate is only so big because the correlation is so tight; the difference between men and their wives is actually pretty small, but there’s so little variance in those difference scores that even a small difference is kind of a big deal. In general, husbands and wives are pretty even on how much they like Cheetos (as reflected by the small d effect size estimate).
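
To see exactly how the correlation drives this, you can rebuild the SD of the difference scores from the two group SDs and r: the variance of a difference is var(men) + var(women) - 2*r*sd(men)*sd(women). The sketch below checks that identity on our data, then tries a what-if with a weaker correlation. The value `r_weak = .5` is a made-up number purely for illustration.

```r
# Same data as above
set.seed(123)
men <- runif(n=100, min=1, max=10)
women <- .9*men + runif(n=100, min=-1, max=1)
est <- mean(men) - mean(women)
r <- cor(men, women)

# SD of the difference scores, rebuilt from the group SDs and the correlation
SDdiff_from_r <- sqrt(var(men) + var(women) - 2*r*sd(men)*sd(women))
all.equal(SDdiff_from_r, sd(men - women)) # same as computing it directly

# Hypothetical what-if: same means and SDs, but a weaker correlation
r_weak <- .5 # made-up value, just for illustration
SDdiff_weak <- sqrt(var(men) + var(women) - 2*r_weak*sd(men)*sd(women))

est/SDdiff_from_r # d prime with the real (tight) correlation
est/SDdiff_weak   # d prime if the correlation were only .5: much smaller
```

The tighter the correlation, the smaller the SD of the differences, and the bigger d prime gets, even though the raw difference between husbands and wives hasn’t changed at all.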

How does this relate to power calculations?

Remember, power is the probability of rejecting the null, assuming that it is indeed false. Power is based on alpha, n, and effect size. Power goes up when alpha goes up, when n goes up, or when effect size goes up. The DST is clearly more powerful than the IST in this case. Why?
Alpha doesn’t change, and n actually changes in the wrong direction (n is smaller in the DST since we’re just looking at one set of difference scores instead of 2 samples). If n were the only change, the DST would be less powerful than the IST, not more.
It’s because of d prime. That’s the measure of effect size that is important to the DST. In this case, d prime is big and d is small, and that’s why the DST is significant and the IST isn’t. While d might be a better way to conceptually characterize what’s going on in our data (which is why we might prefer to report it), d prime is the one that actually matters behind the scenes for DST power.
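
You can put rough numbers on this with R’s built-in `power.t.test()`. The sketch below treats our sample estimates as if they were the true population values, which is an assumption made just for illustration (in real life you’d plan power from effect sizes you expect, not ones you already observed).

```r
# Same data as above
set.seed(123)
men <- runif(n=100, min=1, max=10)
women <- .9*men + runif(n=100, min=-1, max=1)
est <- mean(men) - mean(women)
pooledSD <- sqrt((var(men) + var(women))/2)
SDdiff <- sd(men - women)

# IST: n is the number of people per group, sd is the pooled SD (so delta/sd = d)
power_ist <- power.t.test(n=100, delta=est, sd=pooledSD,
                          sig.level=.05, type="two.sample")$power

# DST: n is the number of couples, sd is the SD of the differences (so delta/sd = d prime)
power_dst <- power.t.test(n=100, delta=est, sd=SDdiff,
                          sig.level=.05, type="paired")$power

c(IST=power_ist, DST=power_dst)
```

Same alpha, same 100 couples, same raw difference. The only thing that changes between the two calls is which SD goes in the denominator of the effect size.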

# Present it in a nice table
table <- data.frame(Estimate=c(est, mean(df$diff)), SD=c(pooledSD, SDdiff), EffectSize=c(d, d_prime), row.names=c("Based on sample scores", "Based on difference scores"))
library(knitr)
kable(table)
|                           |  Estimate|        SD| EffectSize|
|:--------------------------|---------:|---------:|----------:|
|Based on sample scores     | 0.5202533| 2.4468090|  0.2126252|
|Based on difference scores | 0.5202533| 0.6063961|  0.8579430|

In a nutshell

  • This is a DST issue; there’s no weirdness like this for IST. That’s just d, plain and simple.
  • When reporting effect size, you might prefer to use d instead of d prime for DST (this varies a little bit by researcher and/or field). They convey different information, so depending on what you’re trying to explain, one or the other may be more appropriate. For this class, we’ll pretty much always want you to report d for both IST and DST.
  • When calculating power (for example, in G*Power), you need to make sure you’re providing the effect size estimate the program is expecting. For a DST in G*Power, you can either enter the raw info (group means, group SDs, and r), or you can enter some slightly less raw info (estimate of the difference and the SD of the difference; note that it’s NOT the pooled SD here), or you can just plug in d prime (G*Power labels it dz). Do not plug in d instead of d prime. G*Power is expecting d prime, and if you disappoint it, it will punish you with inaccurate power estimates.
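
To get a feel for how badly things go when you plug in the wrong effect size, here’s a sketch that asks how many couples you’d need for 80% power in a paired design. It uses R’s built-in `power.t.test()` as a stand-in (not G*Power itself), and again treats our sample estimates as the true values, purely for illustration.

```r
# Same data as above
set.seed(123)
men <- runif(n=100, min=1, max=10)
women <- .9*men + runif(n=100, min=-1, max=1)
est <- mean(men) - mean(women)
d <- est / sqrt((var(men) + var(women))/2) # based on the pooled SD
d_prime <- est / sd(men - women)           # based on the SD of the differences

# Couples needed for 80% power, using the right effect size (d prime).
# With sd=1, delta is already in standardized units.
n_right <- power.t.test(delta=d_prime, sd=1, power=.80,
                        sig.level=.05, type="paired")$n

# Same calculation, but mistakenly feeding in d where d prime is expected
n_wrong <- power.t.test(delta=d, sd=1, power=.80,
                        sig.level=.05, type="paired")$n

ceiling(c(with_d_prime=n_right, with_d=n_wrong)) # round up to whole couples
```

Feeding in d instead of d prime makes the program think the effect is much smaller than it is for a paired design, so it tells you to recruit far more couples than you actually need.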