library(pacman); p_load(psych, dagitty)
Bates & Gignac (2022) assessed the effect of a small financial incentive on IQ scores in order to test whether effort affects IQ. First, they assessed the cross-sectional relationship between IQ and effort (r's of .29, .27, and .27; corrected for unreliability, .34, .39, and .43). Then, they conducted a series of studies in which people were given incentives to encourage additional effort when taking an IQ test. They interpreted their findings to indicate little evidence for a causal interpretation of the cross-sectional association.
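Presumably the correction referred to is the standard disattenuation formula, shown here for reference (the reliabilities \(r_{xx}\) and \(r_{yy}\) are whatever estimates the authors used):
\[
r_{\text{corrected}} = \frac{r_{\text{observed}}}{\sqrt{r_{xx}\, r_{yy}}}
\]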
There was, however, an unfortunate disconnect between their methods and their conclusions. The reason is simple and can be explained with DAGs. This first plot is what they wanted to interpret: an effect of an intervention on effort, and the effect of that boost in effort on IQ.
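For concreteness, here is a minimal dagitty sketch of the two graphs plotted below; the structures are my reading of the setup, not taken from the paper:
Desired <- dagitty("dag{ Intervention -> Effort -> IQ }")
Interpreted <- dagitty("dag{ Intervention -> IQ }")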
plot(Desired)
Instead, what they actually interpreted was the effect of the intervention on IQ, rather than the effect of the incentive-induced effort boost on IQ.
plot(Interpreted)
In other words, \(P(\text{IQ}|\text{Effort}) > P(\text{IQ})\) (equivalently, \(P(\text{Effort}|\text{IQ}) > P(\text{Effort})\)) was assessed, but \(P(\text{IQ}|\text{do}(\text{Effort})) > P(\text{IQ})\) was not. Instead of testing \(P(\text{IQ}|\text{do}(\text{Effort}))\), they tested \(P(\text{IQ}|\text{do}(\text{Intervention}))\).
Getting at whether the IQ-effort correlation in the general population is causal requires, at a minimum, manipulating effort. That was the point of the incentives. However, the incentives did not yield significant effects on effort. To recover the desired effect, we can treat the incentive as an instrumental variable and estimate the causal effect in units of score per unit of effort. Borrowing and modifying some code from Tailcalled (https://archive.ph/NYoTF), let's estimate how large those effects were.
I used the ivreg() function from the ivreg package to estimate p-values for these effects. The OLS estimates (effort predicting IQ) were all significant, while the IV estimates were all nonsignificant. But this is not consequential: IV estimation suffers from high variance, so the nonsignificance mostly signals low power.
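Concretely, the quantity computed below is a Wald-type IV estimate built from two difference-in-differences; the notation is mine, not the paper's:
\[
\hat{\beta}_{\text{IV}} = \frac{\Delta_{\text{DiD}}\text{Score}}{\Delta_{\text{DiD}}\text{Effort}}, \qquad
\hat{r}_{\text{implied}} = \hat{\beta}_{\text{IV}} \times \frac{SD_{\text{Effort}}}{SD_{\text{IQ}}}
\]
where each \(\Delta_{\text{DiD}}\) is the intervention group's time-1 to time-2 change minus the control group's, and the SDs are the time-1 (cross-sectional) standard deviations used in the code below.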
import numpy as np

def did(treated, control):
    # Difference-in-differences: the treated group's change from time 1
    # to time 2, minus the control group's change over the same period.
    # Each argument is an array of [time 1 mean, time 2 mean].
    return (treated[1] - treated[0]) - (control[1] - control[0])
# Study 2: group means at [time 1, time 2]
score_control = np.array([5.36, 5.59])
effort_control = np.array([4.17, 4.13])
score_intervention = np.array([5.37, 5.97])
effort_intervention = np.array([4.06, 4.20])
effect_score = did(score_intervention, score_control)
effect_effort = did(effort_intervention, effort_control)
IV_effect = effect_score / effect_effort
implied_correlation = IV_effect * 0.61 / 2.26 # time-1 SDs (effort SD = 0.61, score SD = 2.26), because we want the cross-sectional units; post-intervention SDs no longer reflect them
print("Implied correlation between effort and IQ, based on the IV estimate of the effort -> IQ effect:")
## Implied correlation between effort and IQ, based on the IV estimate of the effort -> IQ effect:
print(np.round(implied_correlation, 2))
## 0.55
# Study 3: group means at [time 1, time 2]
score_control = np.array([4.93, 5.36])
effort_control = np.array([4.21, 4.20])
score_intervention = np.array([4.96, 5.63])
effort_intervention = np.array([4.13, 4.29])
effect_score = did(score_intervention, score_control)
effect_effort = did(effort_intervention, effort_control)
IV_effect = effect_score / effect_effort
implied_correlation = IV_effect * 0.62 / 2.33 # time-1 SDs for Study 3
print("Implied correlation between effort and IQ, based on the IV estimate of the effort -> IQ effect:")
## Implied correlation between effort and IQ, based on the IV estimate of the effort -> IQ effect:
print(np.round(implied_correlation, 2))
## 0.38
# Study 4: group means at [time 1, time 2]
score_control = np.array([5.08, 5.44])
effort_control = np.array([4.20, 4.17])
score_2incentive = np.array([5.10, 5.74])
effort_2incentive = np.array([4.11, 4.26])
score_10incentive = np.array([5.07, 5.64])
effort_10incentive = np.array([4.13, 4.28])
# Pool the two incentive arms (ns = 604 and 150) with N-weighted means,
# since measurement invariance is attainable between the intervention groups
score_intervention = (score_2incentive * 604 + score_10incentive * 150)/(604+150)
effort_intervention = (effort_2incentive * 604 + effort_10incentive * 150)/(604+150)
effect_score = did(score_intervention, score_control)
effect_effort = did(effort_intervention, effort_control)
IV_effect = effect_score / effect_effort
implied_correlation = IV_effect * 0.63 / 2.32 # time-1 SDs for Study 4
print("Implied correlation between effort and IQ, based on the IV estimate of the effort -> IQ effect:")
## Implied correlation between effort and IQ, based on the IV estimate of the effort -> IQ effect:
print(np.round(implied_correlation, 2))
## 0.4
The implied correlations between effort and IQ are in line with the cross-sectional ones, especially given that the CIs on these estimates are very wide. This is the point that matters: an intervention may produce only, say, a 2-point IQ gain, because neither IQ nor effort exists in a vacuum that the manipulation fully fills. The intervention is a new source of variance, so it should account for only part of the total variance in effort, and it should accordingly move IQ only partially as well. If the intervention's effect on scores runs entirely through effort, so that it is merely manipulating effort, the effort measure should show strict measurement invariance. Without that ensured, different procedures need to be used, but that is a different subject.
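As a sketch of what checking that would look like, here is a minimal lavaan strict-invariance test; the one-factor model, indicator names (e1-e3), data frame (df), and grouping variable (condition) are all hypothetical:
p_load(lavaan)
effort_model <- 'EffortF =~ e1 + e2 + e3' # hypothetical effort indicators
configural <- cfa(effort_model, data = df, group = "condition")
strict <- cfa(effort_model, data = df, group = "condition",
              group.equal = c("loadings", "intercepts", "residuals"))
anova(configural, strict) # strict invariance holds if fit does not degrade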
Michel Nivard addressed the issue of power with this method in a GitHub Gist (https://archive.ph/2Sn7J) and estimated that we would need more than 3,000 participants to be well powered to assess whether the effort-IQ relationship is causal with this sort of data. I modified his code to use the observed reliability of effort (r = .752) and to treat the practice effect (d \(\approx\) .15) as reliably estimated with their data, even though the use of different forms means it was not. This yields somewhat more favorable, but still unfortunate, numbers: a required N of 1,633 for 80% power at an alpha of .05, using only the observed (and thus very imprecise!) effects (pre-post, runs = 10,000, seed = 1). To be well powered against a smaller true effect, we could power the study for half the observed effect, but the required sample then becomes prohibitively expensive.
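For intuition, here is a toy version of that kind of simulation-based power analysis. It is not Nivard's code and will not reproduce his numbers exactly: it assumes a two-arm pre-post design, unit-variance scores whose gain-score variance is inflated by imperfect test-retest reliability, and a simple t-test on gain scores:
power_sim <- function(n, d, reliability = .752, runs = 10000, alpha = .05) {
  n_per <- floor(n / 2) # participants per arm
  # SD of gain scores when pre and post have unit variance and
  # correlate at the test-retest reliability
  sd_gain <- sqrt(2 * (1 - reliability))
  hits <- replicate(runs, {
    control <- rnorm(n_per, mean = 0, sd = sd_gain)
    treated <- rnorm(n_per, mean = d, sd = sd_gain) # d: assumed true gain difference
    t.test(treated, control)$p.value < alpha
  })
  mean(hits) # proportion of significant runs = simulated power
}
set.seed(1)
power_sim(n = 1633, d = .15)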
There may be other datasets out there that can be used to test the effect of effort on IQ scores. Going through Duckworth et al.'s (2011) sources is a good place to start. Applicable studies need (1) baseline effort and IQ scores and SDs, and (2) a control group. I have omitted their cited studies with only posttest results. Breuning's results were omitted because he is a convicted fraudster and his results are suspiciously large and accurate.
Study | Sample Size | Effect (g) | Control Group | Usable |
---|---|---|---|---|
Benton (1936) | 50 | -.09 | Yes | No (Missing Effort Measure) |
Bergan et al. (1971) - A | 16 | .19 | Yes | No (Missing Effort Measure) |
Bergan et al. (1971) - B | 16 | -.06 | Yes | No (Missing Effort Measure) |
Blanding et al. (1994) | 29 | .82 | Yes | No (Missing Effort Measure) |
Bradley-Johnson et al. (1984) - A | 22 | .17 | Yes | No (Missing Effort Measure) |
Bradley-Johnson et al. (1984) - B | 22 | .67 | Yes | No (Missing Effort Measure) |
Bradley-Johnson et al. (1986) - A | 20 | .85 | Yes | No (Missing Effort Measure) |
Bradley-Johnson et al. (1986) - B | 20 | .80 | Yes | No (Missing Effort Measure) |
Clingman & Fowler (1976) - A | 16 | -.06 | Yes | No (Missing Effort Measure) |
Clingman & Fowler (1976) - B | 16 | -.40 | Yes | No (Missing Effort Measure) |
Clingman & Fowler (1976) - C | 16 | 1.42 | Yes | No (Missing Effort Measure) |
Devers & Bradley-Johnson (1994) | 25 | .91 | Yes | No (Missing Effort Measure) |
Edlund (1972) | 22 | .89 | Yes | No (Missing Effort Measure) |
Ferguson (1937) | 156 | .03 | Yes | No (Missing Effort Measure) |
Galbraith et al. (1986) | 30 | .73 | Yes | No (Missing Effort Measure) |
Graham (1971) | 128 | -.14 | Unknown | No (Inaccessible Manuscript) |
Lloyd & Zylla (1988) - A | 16 | 1.04 | Yes | No (Missing Effort Measure) |
Lloyd & Zylla (1988) - B | 16 | .62 | Yes | No (Missing Effort Measure) |
Weiss (1981) - A | 10 | .57 | Yes | No (Missing Effort Measure) |
Weiss (1981) - B | 10 | 1.25 | Yes | No (Missing Effort Measure) |
Weiss (1981) - C | 10 | 3.64 | Yes | No (Missing Effort Measure) |
Willis & Shibata (1978) | 20 | .59 | Yes | No (Missing Effort Measure) |
The Weiss (1981) data might have been usable by exploiting the variable “frequency of ‘please’ and ‘thank you’ from baseline to manipulation”, but values were not available for the whole sample or all studies, and the standard deviation was not available at all. Silm, Pedaste, & Täht (2020) meta-analytically reviewed the literature on the relationship between performance and effort, but none of their studies had usable data either. Gignac (2018) was also unusable because of its design.
As far as I can tell, Bates & Gignac (2022) was the first study with data usable for estimating the causality of the effort-IQ association. They had (1) a pre-post design, with (2) a control group, and (3) effort measures, with (4) no incentives for either the control or intervention group initially, followed by (5) incentives for the treatment group. An active control of some sort would enhance inference while probably reducing power, and a better effort measure would probably make the study stronger, too. It is unfortunate that, as far as I can tell, no one else has conducted an appropriate study of the relationship between effort and IQ scoring. Instead, researchers have almost universally interpreted the effect of incentives on IQ, rather than the effect of incentives on effort and of that effort boost on IQ.
Perhaps we could increase the size of the incentive and thus the effort boost, or handle this more rigorously psychometrically with latent variable modeling if the same instruments were used for ability testing. Getting a better measure of effort might also help. Using different rewards, such as candy and money, or, as in Bergan et al. (1971), social reinforcement, would also be worthwhile for achieving an active control group and garnering plausible variation in the degree of effort inducement. Regardless, Bates & Gignac (2022) was underpowered to detect their desired effect; when that effect is estimated, it does not significantly differ from the cross-sectional one, which is consistent with causality for the IQ-effort relationship. Given its low power, their study does not bring evidence against a causal relationship, nor could it. A causal relationship is also not evidenced; it is merely on the table, and the methods to find it are now known. Previous work that has similarly failed to estimate the correct quantity, which is to say all other published work in this area, should be treated with the same suspicion.
No existing work on the relationship between effort or motivation and IQ has attempted to estimate the correct estimand. Instead of estimating \(P(\text{IQ}|\text{do}(\text{Effort})) > P(\text{IQ})\), studies have estimated \(P(\text{IQ}|\text{do}(\text{Incentive})) > P(\text{IQ})\), in a world where incentive effects on effort appear to be small. Future studies should estimate the correct quantity and conduct their power analyses to detect that quantity. This means fielding much larger samples. Improvements in the measurement of effort/motivation can help in this endeavor, as can improved measurement of ability.
Bates, T. C., & Gignac, G. E. (2022). Effort impacts IQ test scores in a minor way: A multi-study investigation with healthy adult volunteers. Intelligence, 92, 101652. https://doi.org/10.1016/j.intell.2022.101652
Duckworth, A. L., Quinn, P. D., Lynam, D. R., Loeber, R., & Stouthamer-Loeber, M. (2011). Role of test motivation in intelligence testing. Proceedings of the National Academy of Sciences, 108(19), 7716–7720. https://doi.org/10.1073/pnas.1018601108
Benton, A. L. (1936). Influence of incentives upon intelligence test scores of school children. The Pedagogical Seminary and Journal of Genetic Psychology. https://www.tandfonline.com/doi/abs/10.1080/08856559.1936.10533790
Bergan, A., Manis, D. L. M., & Melchert, P. A. (1971). Effects of Social and Token Reinforcement on WISC Block Design Performance. Perceptual and Motor Skills, 32(3), 871–880. https://doi.org/10.2466/pms.1971.32.3.871
Blanding, K. M., Richards, J., Bradley-Johnson, S., & Johnson, C. M. (1994). The Effect of Token Reinforcement on McCarthy Scale Performance for White Preschoolers of Low and High Social Position. Journal of Behavioral Education, 4(1), 33–39.
Bradley-Johnson, S., et al. (1984). Effects of token reinforcement on WISC-R performance of Black and White, low socioeconomic second graders. Behavioral Assessment, 6(4), 365–373.
Bradley-Johnson, S., et al. (1986). Token reinforcement on WISC-R performance for White, low-socioeconomic, upper and lower elementary-school-age students. Journal of School Psychology, 24(1), 73–79.
Clingman, J., & Fowler, R. L. (1976). The effects of primary reward on the I.Q. performance of grade-school children as a function of initial I.Q. level. Journal of Applied Behavior Analysis, 9(1), 19–23. https://doi.org/10.1901/jaba.1976.9-19
Devers, R., & Bradley-Johnson, S. (1994). The effect of token reinforcement on WISC-R performance for fifth- through ninth-grade American Indians. The Psychological Record, 44(3), 441–450.
Edlund, C. V. (1972). The effect on the test behavior of children, as reflected in the I.Q. scores, when reinforced after each correct response. Journal of Applied Behavior Analysis, 5(3), 317–319. https://doi.org/10.1901/jaba.1972.5-317
Ferguson, H. H. (1937). Incentives and an intelligence test. Australasian Journal of Psychology and Philosophy, 15(1), 39–53. https://doi.org/10.1080/00048403708541086
Graham, G. A. (1971). The effects of material and social incentives on the performance on intelligence test tasks by lower class and middle class Negro preschool children (PhD dissertation, George Washington University, Washington, DC).
Lloyd, M. E., & Zylla, T. M. (1988). Effect of Incentives Delivered for Correctly Answered Items on the Measured IQS of Children of Low and High IQ. Psychological Reports, 63(2), 555–561. https://doi.org/10.2466/pr0.1988.63.2.555
Weiss, R. (1980). Effects of Reinforcement on the IQ Scores of Preschool Children as a Function of Initial IQ. All Graduate Theses and Dissertations. https://doi.org/10.26076/109e-7e06
Willis, J., & Shibata, B. (1978). A comparison of tangible reinforcement and feedback effects on the WPPSI I.Q. scores of nursery school children. Education and Treatment of Children, 1(2), 31–45.
Silm, G., Pedaste, M., & Täht, K. (2020). The relationship between performance and test-taking effort when measured with self-report or time-based instruments: A meta-analytic review. Educational Research Review, 31, 100335. https://doi.org/10.1016/j.edurev.2020.100335
Gignac, G. E. (2018). A moderate financial incentive can increase effort, but not intelligence test performance in adult volunteers. British Journal of Psychology, 109(3), 500–516. https://doi.org/10.1111/bjop.12288
This is the plot for Study 2.
p_load(ggplot2, ggthemr, ggpubr)
ggthemr("dust")
DiDData <- data.frame(
"Time" = c("Time 1", "Time 2", "Time 1", "Time 2"),
"Group" = c("Control", "Control", "Intervention", "Intervention"),
"Ability" = c(5.36, 5.59, 5.37, 5.97),
"Effort" = c(4.17, 4.13, 4.06, 4.20))
DiDData$Time <- factor(DiDData$Time, levels = c("Time 1", "Time 2"))
control_diff <- mean(DiDData$Ability[DiDData$Group == "Control" & DiDData$Time == "Time 2"]) -
mean(DiDData$Ability[DiDData$Group == "Control" & DiDData$Time == "Time 1"])
intervention_diff <- mean(DiDData$Ability[DiDData$Group == "Intervention" & DiDData$Time == "Time 2"]) -
mean(DiDData$Ability[DiDData$Group == "Intervention" & DiDData$Time == "Time 1"])
diff_in_diff <- intervention_diff - control_diff
Ability <- ggplot(DiDData, aes(x = Time, y = Ability, group = Group, color = Group)) +
geom_line() +
geom_point() +
geom_segment(aes(x = "Time 1",
xend = "Time 2",
y = mean(Ability[Group == "Control" & Time == "Time 1"]),
yend = mean(Ability[Group == "Control" & Time == "Time 2"])),
color = "black", linetype = "dashed") +
geom_segment(aes(x = "Time 1",
xend = "Time 2",
y = mean(Ability[Group == "Intervention" & Time == "Time 1"]),
yend = mean(Ability[Group == "Intervention" & Time == "Time 2"])),
color = "black", linetype = "dotted") +
# white backdrop box behind the annotation text
geom_rect(aes(xmin = .41 + .629, xmax = .785 + .654,
ymin = 5.92 - .052,
ymax = 5.98),
fill = "white", color = "black", size = .5, alpha = .8) +
annotate("text", x = .6 + .64,
y = 5.8 + 0.13,
label = paste0("Control Difference = ", round(control_diff, 2)),
color = "#5b4f4c", size = 3.5) +
annotate("text", x = .6 + .64,
y = 5.8 + 0.15,
label = paste0("Intervention Difference = ", round(intervention_diff, 2)),
color = "#5b4f4c", size = 3.5) +
annotate("text", x = .6 + .64,
y = 5.8 + 0.17,
label = paste0("Difference-in-Differences = ", round(diff_in_diff, 2)),
color = "#5b4f4c", size = 3.5) +
theme(legend.position = c(.38, .835),
legend.background = element_blank(),
legend.title = element_blank())
control_diff <- mean(DiDData$Effort[DiDData$Group == "Control" & DiDData$Time == "Time 2"]) -
mean(DiDData$Effort[DiDData$Group == "Control" & DiDData$Time == "Time 1"])
intervention_diff <- mean(DiDData$Effort[DiDData$Group == "Intervention" & DiDData$Time == "Time 2"]) -
mean(DiDData$Effort[DiDData$Group == "Intervention" & DiDData$Time == "Time 1"])
diff_in_diff <- intervention_diff - control_diff
Effort <- ggplot(DiDData, aes(x = Time, y = Effort, group = Group, color = Group)) +
geom_line() +
geom_point() +
geom_segment(aes(x = "Time 1",
xend = "Time 2",
y = mean(Effort[Group == "Control" & Time == "Time 1"]),
yend = mean(Effort[Group == "Control" & Time == "Time 2"])),
color = "black", linetype = "dashed") +
geom_segment(aes(x = "Time 1",
xend = "Time 2",
y = mean(Effort[Group == "Intervention" & Time == "Time 1"]),
yend = mean(Effort[Group == "Intervention" & Time == "Time 2"])),
color = "black", linetype = "dotted") +
# white backdrop box behind the annotation text
geom_rect(aes(xmin = .41 + .629, xmax = .785 + .655,
ymin = 4.191 - .018,
ymax = 4.22 - .021),
fill = "white", color = "black", size = .5, alpha = .8) +
annotate("text", x = .6 + .64,
y = 4.207 - .02,
label = paste0("Control Difference = ", round(control_diff, 2)),
color = "#5b4f4c", size = 3.5) +
annotate("text", x = .6 + .64,
y = 4.212 - .02,
label = paste0("Intervention Difference = ", round(intervention_diff, 2)),
color = "#5b4f4c", size = 3.5) +
annotate("text", x = .6 + .64,
y = 4.217 - .02,
label = paste0("Difference-in-Differences = ", round(diff_in_diff, 2)),
color = "#5b4f4c", size = 3.5) +
theme(legend.position = c(.38, .825),
legend.background = element_blank(),
legend.title = element_blank())
ggarrange(Ability, Effort,
nrow = 1, ncol = 2)
Here is the plot for Study 3, made by swapping Study 3's means into DiDData and rerunning the plotting code above (the hard-coded annotation positions need minor nudges):
DiDData$Ability <- c(4.93, 5.36, 4.96, 5.63)
DiDData$Effort <- c(4.21, 4.20, 4.13, 4.29)
# rebuild the Ability and Effort plots with the code above before arranging
ggarrange(Ability, Effort,
nrow = 1, ncol = 2)
And here is the plot for Study 4, using the N-weighted intervention means computed earlier:
DiDData$Ability <- c(5.08, 5.44, 5.09, 5.72)
DiDData$Effort <- c(4.20, 4.17, 4.11, 4.26)
# rebuild the Ability and Effort plots with the code above before arranging
ggarrange(Ability, Effort,
nrow = 1, ncol = 2)
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19045)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggpubr_0.6.0 ggthemr_1.1.0 ggplot2_3.4.1 dagitty_0.3-1
## [5] psych_2.2.9 pacman_0.5.1 modelsummary_1.3.0 ivreg_0.6-1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 here_1.0.1 lattice_0.20-45 tidyr_1.3.0
## [5] png_0.1-8 zoo_1.8-11 rprojroot_2.0.3 digest_0.6.31
## [9] lmtest_0.9-40 utf8_1.2.3 V8_4.2.2 R6_2.5.1
## [13] backports_1.4.1 evaluate_0.20 highr_0.10 pillar_1.8.1
## [17] rlang_1.0.6 curl_5.0.0 rstudioapi_0.14 car_3.1-1
## [21] jquerylib_0.1.4 Matrix_1.5-1 reticulate_1.28 rmarkdown_2.20
## [25] labeling_0.4.2 munsell_0.5.0 broom_1.0.3 compiler_4.2.2
## [29] xfun_0.37 pkgconfig_2.0.3 mnormt_2.1.1 htmltools_0.5.4
## [33] tidyselect_1.2.0 tibble_3.1.8 fansi_1.0.4 dplyr_1.1.0
## [37] withr_2.5.0 tables_0.9.10 MASS_7.3-58.1 grid_4.2.2
## [41] nlme_3.1-160 jsonlite_1.8.4 gtable_0.3.1 lifecycle_1.0.3
## [45] magrittr_2.0.3 scales_1.2.1 cli_3.6.0 cachem_1.0.6
## [49] carData_3.0-5 farver_2.1.1 ggsignif_0.6.4 bslib_0.4.2
## [53] generics_0.1.3 vctrs_0.5.2 cowplot_1.1.1 boot_1.3-28
## [57] Formula_1.2-4 tools_4.2.2 glue_1.6.2 purrr_1.0.1
## [61] abind_1.4-5 parallel_4.2.2 fastmap_1.1.0 yaml_2.3.7
## [65] colorspace_2.1-0 rstatix_0.7.2 knitr_1.42 sass_0.4.5