Example from:
As bad as these decontextualized criteria are, the other widely used way to evaluate effect size is arguably even worse. This method is to take the reported r and square it. For example, an r of .30, squared, yields the number .09 as the “proportion of variance explained,” and this conversion, when reported, often includes the word “only,” as in “the .30 correlation explained only 9% of the variance.”
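As a quick check of that arithmetic, squaring the illustrative correlation of .30 in R reproduces the "only 9% of the variance" figure:

r <- 0.30
r^2
## [1] 0.09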
We suggest that this calculation has become widespread for three reasons. First, it is easy arithmetic that gives the illusion of adding information to a statistic. Second, the common terminology of “variance explained” makes the number sound as if it does precisely what one would want it to do, the word “explained” evoking a particularly virtuous response. Third, the context in which this calculation is often deployed allows writers to disparage certain findings that they find incompatible with their own theoretical predilections. One prominent example is found in Mischel’s (1968) classic critique of personality psychology, in which he complained that the “personality coefficient” of .30, described by him as the highest correlation empirically found between trait measurements and behavior, “accounts for less than 10 percent of the relevant variance” (p. 38). As Abelson (1985) observed, “it is usually an effective criticism when one can highlight the explanatory weakness of an investigator’s pet variables in percentage terms” (p. 129).
The computation of variance involves squaring the deviations of a variable from its mean. However, squared deviations produce squared units that are less interpretable than raw units (e.g., squared conscientiousness units). As a consequence, r² is also less interpretable than r because it reflects the proportion of variance in one variable accounted for by another. One can search statistics textbook after textbook without finding any attempt to explain why (as opposed to assert that) r² is an appropriate effect-size measure. Although r² has some utility as a measure for model fit and model comparison, the original, unsquared r is the equivalent of a regression slope when both variables are standardized, and this slope is like a z score, in standard-deviation units instead of squared units.
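A minimal R sketch of that last point, using two simulated variables (x and y are invented here purely for illustration): the sample correlation is identical to the slope obtained when both variables are standardized.

# r equals the regression slope when both variables are standardized
set.seed(2)
x <- rnorm(1000)
y <- 0.3 * x + rnorm(1000)
cor(x, y)                         # the correlation r
coef(lm(scale(y) ~ scale(x)))[2]  # standardized slope; identical to r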
Consider the difference in value between nickels and dimes. An example introduced by Darlington (1990) shows how this difference can be distorted by traditional analyses. Imagine a coin-tossing game in which one flips a nickel and then a dime, and receives a 5¢ or 10¢ payoff (respectively) if the coin comes up heads.
From the payoff matrix in Table 1, correlations can be calculated between the nickel column and the payoff column (r = .4472) and between the dime column and the payoff column (r = .8944). If one squares these correlations to calculate the traditional percentage of variance explained, the result is that nickels explain exactly 20% of the variance in payoff, and dimes explain 80%. And indeed, these two numbers do sum neatly to 100%, which helps to explain the attractiveness of this method in certain analytic contexts. But if they lead to the conclusion that dimes matter 4 times as much as nickels, these numbers have obviously been misleading. The two rs afford a more informative comparison, as .8944 is exactly twice as much as .4472. Similarly, a correlation of .4 reveals an effect twice as large as a correlation of .2; moreover, half of a perfect association is .5, not .707 (Ozer, 1985, 2007). Squaring the r is not merely uninformative; for purposes of evaluating effect size, the practice is actively misleading.
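Before turning to a simulation, the exact figures can be reproduced in R from the four equally likely outcomes of the game (a hand-built stand-in for the payoff matrix in Table 1, which is not reproduced here):

nickel <- c(0, 0, 1, 1)            # 0 = tails, 1 = heads
dime   <- c(0, 1, 0, 1)
payoff <- nickel * 5 + dime * 10   # payoff in cents
cor(nickel, payoff)                # 0.4472
cor(dime, payoff)                  # 0.8944
cor(dime, payoff) / cor(nickel, payoff)      # exactly 2
cor(dime, payoff)^2 / cor(nickel, payoff)^2  # exactly 4

The same pattern appears in a simulation of 10,000 rounds: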
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Simulate 10,000 rounds of the nickel-and-dime game (0 = tails, 1 = heads)
set.seed(1)
data <- tibble(
  nickel = sample(c(0, 1), size = 10000, replace = TRUE),
  dime   = sample(c(0, 1), size = 10000, replace = TRUE),
  payoff = nickel * 5 + dime * 10  # payoff in cents
)
# Correlations of each coin with the payoff
cor(data)
## nickel dime payoff
## nickel 1.000000000 0.002477526 0.4490858
## dime 0.002477526 1.000000000 0.8945985
## payoff 0.449085775 0.894598529 1.0000000
# Squared correlations ("variance explained")
cor(data)^2
## nickel dime payoff
## nickel 1.000000e+00 6.138137e-06 0.2016780
## dime 6.138137e-06 1.000000e+00 0.8003065
## payoff 2.016780e-01 8.003065e-01 1.0000000
# Linear regression: coefficients recover the 5-cent and 10-cent payoffs
lm(payoff ~ nickel + dime, data = data) %>% summary()
##
## Call:
## lm(formula = payoff ~ nickel + dime, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.945e-12 -8.000e-15 -6.000e-15 9.000e-15 8.145e-11
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8.143e-13 1.404e-14 -5.802e+01 <2e-16 ***
## nickel 5.000e+00 1.633e-14 3.062e+14 <2e-16 ***
## dime 1.000e+01 1.633e-14 6.122e+14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.165e-13 on 9997 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.347e+29 on 2 and 9997 DF, p-value: < 2.2e-16
# ANOVA effect sizes (the fit is essentially perfect, so the F-tests warn)
lm(payoff ~ nickel + dime, data = data) %>% anova() %>% sjstats::anova_stats()
## Warning in anova.lm(.): ANOVA F-tests on an essentially perfect fit are
## unreliable
## Warning in pf(qf(sig.level, u, v, lower = FALSE), u, v, lambda, lower = FALSE):
## NaNs produced
So we can see the absurd conclusion: if we relied on the variance explained by each variable, we would conclude that the dime throw is 4 times as important as the nickel throw; using the regression coefficients or the unsquared correlations, we get the correct result that the dime throw is 2 times as important.
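The same comparison, taken directly from the simulated data above:

r_nickel <- cor(data$nickel, data$payoff)
r_dime   <- cor(data$dime, data$payoff)
r_dime / r_nickel      # roughly 2: dimes matter twice as much as nickels
r_dime^2 / r_nickel^2  # roughly 4: the squared version overstates the contrast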