Setup

Packages

library(pacman)
p_load(dplyr, DT, meta, ggplot2, dmetar)

Data

#As the authors coded it in their supporting information with the Breuning studies removed.

dc <- read.csv("DuckworthSupportCode.csv")
#dc2 <- subset(dc, Hedge.s.g < 2) #If you want to remove Weiss (1980)

datatable(dc, extensions = c("Buttons"), options = list(dom = 'Bfrtip', buttons = c('copy', 'csv', 'print'), scrollX = T))
#Full data

d <- read.csv("DuckworthFull.csv")
#df <- subset(d, Fraud == 0) #denotes studies conducted by Stephen Breuning
datatable(d, extensions = c("Buttons"), options = list(dom = 'Bfrtip', buttons = c('copy', 'csv', 'print'), scrollX = T))

Rationale

Someone asked me how much stock I put in Duckworth et al. (2011) after it cropped up in a Twitter thread they read. My answer is ‘very little’. There are several reasons for this.

  1. There’s no measurement whatsoever, nor are there any measurement-based criteria for study inclusion.
  2. Three of the studies were conducted by a known fraud.
  3. There is strong motivation to publish positive results.
  4. There is a modest to strong relationship between study standard errors and effect sizes.
  5. Sumscores are not properly calculated and several scores are not reasonably interpretable as IQs.

The first point is the most important for me, since I detest conflating observed scores with abilities, skills, or whatever the intended measure is. The most tried and true means of increasing scores is giving people all of the answers to a test, and yet no one treats those gains as meaningful. Why? They may have just as much importance as these “motivation” gains, in that both can be equally psychometrically hollow. Gains can, of course, sometimes be ‘real’ in the sense that they’re attributable to one of the modeled common factors, but glancing at IQ sumscores does not tell us about the nature of those gains; in fact, there have been many instances where gains are very misleading. Take Jensen (1991), for example. In that study, Jensen trained students on certain parts of a test, but the training did not transfer (a highly replicable result); as such, the IQ sumscore for that sample no longer represented the same components it did prior to the training or, quite probably, the same components as in an untrained group. Lievens, Reeve & Heggestad (2007) provide another good example. They administered a test and then allowed participants to retake it. As is typically the case, there were gains. However, these gains did not reflect g, as they would have if they were ‘real’; instead, they reflected the memory group factor, and the scores no longer possessed any criterion validity. Psychometric meta-analyses of gains which fail to account for their nature - including measurement bias and the specific nature of their effects - risk making strong statements that are of no worth to clinicians or others aiming to implement interventions or forge nomological networks.
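
To make that point concrete, here is a toy simulation (the loadings and the size of the boost are illustrative assumptions, not estimates from any of these studies): a gain placed entirely on a memory-like group factor inflates the sumscore even though g is unchanged.

set.seed(1)
n <- 1000
g <- rnorm(n)   #general factor; unchanged by the "intervention"
mem <- rnorm(n) #memory-like group factor
make_sumscore <- function(mem_shift = 0) {
  m <- mem + mem_shift
  s1 <- 0.7*g + 0.4*m + rnorm(n, sd = sqrt(1 - 0.7^2 - 0.4^2))
  s2 <- 0.6*g + 0.5*m + rnorm(n, sd = sqrt(1 - 0.6^2 - 0.5^2))
  s3 <- 0.8*g + 0.2*m + rnorm(n, sd = sqrt(1 - 0.8^2 - 0.2^2))
  s1 + s2 + s3
}
pre <- make_sumscore(0); post <- make_sumscore(0.5) #0.5 SD boost to the memory factor only
(mean(post) - mean(pre))/sd(pre)                    #apparent "IQ" gain despite no change in g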

A relationship between SEs and ESs is a signature of publication bias. The typical bias towards significant results means that smaller studies must have larger effects in order to reach significance; small studies selected for their significance therefore often pollute the literature. This is no less true for the work on test-taking incentives/motivation. I’ll show that below, alongside two other things:

  1. The meta-analytic effect size from the study (they provided this as well, except without removing Breuning’s studies; they removed only those with effect sizes > 2. This is sensible, since the study by Weiss, with an effect size of g = 3.64, or 54 IQ points in the SD = 15 metric (n = 5 intervention/5 control, low-ability samples), is hard to believe, especially for such small incentives).
  2. The trim-and-fill funnel plot and other indications of bias.

Anyone should feel free to download the data and extend the result if they want. The data can be downloaded as a .csv file from the table above. The data were retrieved from https://www.pnas.org/highwire/filestream/606036/field_highwire_adjunct_files/1/sd01.xls (I added the ‘Fraud’ column to the complete data to denote studies by Breuning). A few later studies should be added (multiple recent ones, in particular, indicate motivation-related gains to scores without gains to g or other common factors), but I’m not doing that here because the point is to assess how convincing the Duckworth et al. (2011) study is rather than to update it (I might do that in the future).

Analysis

Evidence for Bias

The authors claim to have conducted four tests for publication bias, described rather haphazardly in their supporting information. The first test was correlating sample sizes with effect sizes; they did not report the result, presumably because the correlation was insignificant. It is curious that they did not correlate the standard errors with the effect sizes, which is more typical, more informative, and actually likely to indicate anything about bias (since a relatively low N can be precise or imprecise, and precision is what we are interested in). Their second test was Egger’s (1997) regression; this yielded an intercept of 0.01 (p = 0.99). The third test for publication bias was not a test for publication bias at all - it was simply the fail-safe N needed to bring the aggregate effect size down to g = 0 or 0.2. For those familiar with this “test”, it is really stunning that people think it tests publication bias at all, but there it is. An analysis they did not describe in their supporting information, but whose result they reported, found that article type (published versus dissertation) moderated the effect of incentives on scores (strangely, they say “on IQ scores” rather than “on IQ gains”, though they then proceed to discuss the difference in gain effect sizes). They speculate that the difference is due to the higher average IQs of the dissertation samples, suggesting diminishing returns to motivation at higher IQs, confounding of higher IQs with motivation, or moderation by level of IQ; in any case, this implies a probable lack of invariance by IQ level, since invariance in the factor structure implies metric invariance for influences as well. Finally, they performed a trim-and-fill analysis but claimed that no adjustment was necessary. This is baffling given the egregious outliers in their data, though it is probably a consequence of how small these samples were and of how the studies by Breuning influenced the result. I agree with their take on the difference between dissertations and published research, so I’ll attempt to replicate their three actual publication bias analyses and add one more:

  1. Correlating effect and sample sizes, to be followed by the more proper test, relating the SE to effect sizes.
  2. Egger et al.’s (1997) regression.
  3. Duval & Tweedie’s (2000a, b) Trim-and-fill.
  4. Simonsohn, Nelson & Simmons’ (2014) p-curve.

N and Standard Error versus Effect Size

#Hedge's g versus N

summary(lm(N ~ Hedge.s.g, dc)); cor(dc$N, dc$Hedge.s.g)
## 
## Call:
## lm(formula = N ~ Hedge.s.g, data = dc)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -33.89 -16.57  -6.72   5.86 112.53 
## 
## Coefficients:
##             Estimate Std. Error t value     Pr(>|t|)    
## (Intercept)    43.92       5.79    7.58 0.0000000025 ***
## Hedge.s.g     -14.93       6.49   -2.30        0.027 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.2 on 41 degrees of freedom
## Multiple R-squared:  0.114,  Adjusted R-squared:  0.0926 
## F-statistic: 5.29 on 1 and 41 DF,  p-value: 0.0267
## [1] -0.33797
#Hedge's g versus SE

summary(lm(SE ~ Hedge.s.g, dc)); cor(dc$SE, dc$Hedge.s.g)
## 
## Call:
## lm(formula = SE ~ Hedge.s.g, data = dc)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1763 -0.0816 -0.0192  0.0586  0.2161 
## 
## Coefficients:
##             Estimate Std. Error t value  Pr(>|t|)    
## (Intercept)   0.3314     0.0225   14.74   < 2e-16 ***
## Hedge.s.g     0.1324     0.0252    5.25 0.0000049 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.113 on 41 degrees of freedom
## Multiple R-squared:  0.402,  Adjusted R-squared:  0.388 
## F-statistic: 27.6 on 1 and 41 DF,  p-value: 0.00000492
## [1] 0.63439

Note: if the extreme result by Weiss is also removed, the correlation drops to r = 0.3479 (p = 0.024) for r(SE, ES) and r = -0.3466 (p = 0.025) for r(N, ES).
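
For completeness, this note can be reproduced with the subset commented out in the Data section (a sketch; cor.test supplies the p-values):

#Reproducing the note above: drop the Weiss outlier (Hedge's g > 2) and re-correlate
dc2 <- subset(dc, Hedge.s.g < 2)
cor.test(dc2$SE, dc2$Hedge.s.g) #r(SE, ES) without Weiss
cor.test(dc2$N, dc2$Hedge.s.g)  #r(N, ES) without Weiss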

ggplot(dc, aes(x = SE, y = Hedge.s.g)) + geom_point(aes(size = N)) + geom_smooth(method = lm, color = "orangered")+ labs(title = "Relationship between Error and Effect", x = "Standard Error", y = "Hedge's g") + theme_minimal() + theme(legend.position = "none", text = element_text(size = 12, family = "serif"), plot.title = element_text(hjust = 0.5))
## `geom_smooth()` using formula 'y ~ x'

For Egger’s test, trim-and-fill, and the p-curve, dcmeta is the random-effects meta-analysis of the data without Breuning’s studies (the code defining it appears in the Random Effects Meta-Analysis section below).

eggers.test(x = dcmeta)
trimfill(dcmeta)
##                                             SMD             95%-CI %W(random)
## Benton (1936)                           -0.0900 [-0.6388;  0.4588]        2.0
## Bergan et al. (1971)                     0.1900 [-0.7312;  1.1112]        1.6
## Bergan et al. (1971)                    -0.0600 [-1.0008;  0.8808]        1.6
## Blanding et al. (1994)                   0.8200 [ 0.0752;  1.5648]        1.8
## Blanding et al. (1994)                   1.3300 [ 0.4480;  2.2120]        1.7
## Blanding et al. (1994)                   0.4400 [-0.4028;  1.2828]        1.7
## Bradley-Johnson et al. (1984)            0.1700 [-0.6336;  0.9736]        1.7
## Bradley-Johnson et al. (1984)            0.6700 [-0.1532;  1.4932]        1.7
## Bradley-Johnson et al. (1986)            0.8500 [-0.0320;  1.7320]        1.7
## Bradley-Johnson et al. (1986)            0.8000 [-0.0820;  1.6820]        1.7
## Clingman & Fowler (1976)                -0.0600 [-0.9812;  0.8612]        1.6
## Clingman & Fowler (1976)                -0.4000 [-1.3408;  0.5408]        1.6
## Clingman & Fowler (1976)                 1.4200 [ 0.3616;  2.4784]        1.5
## Devers & Bradley-Johnson (1994)          0.9100 [ 0.1064;  1.7136]        1.7
## Dickstein & Ayers (1973)                 0.4900 [-0.1960;  1.1760]        1.9
## Edlund (1972)                            0.8900 [ 0.0472;  1.7328]        1.7
## Ferguson (1937)                          0.0300 [-0.2836;  0.3436]        2.1
## Galbraith et al. (1986)                  0.7300 [ 0.0048;  1.4552]        1.8
## Gerwell (1981)                           0.5800 [ 0.0900;  1.0700]        2.0
## Graham (1971)                           -0.1400 [-0.4928;  0.2128]        2.1
## Holt & Hobbs (1979)                      1.0300 [ 0.3832;  1.6768]        1.9
## Kapenis (1979)                           0.5100 [-0.2152;  1.2352]        1.8
## Kapenis (1979)                          -0.1700 [-0.8952;  0.5552]        1.8
## Kapenis (1979)                           0.0900 [-0.6352;  0.8152]        1.8
## Kieffer & Goh (1981)                     0.6600 [-0.0260;  1.3460]        1.9
## Kieffer & Goh (1981)                    -0.2600 [-0.9460;  0.4260]        1.9
## Lloyd & Zylla (1988)                     1.0400 [ 0.0404;  2.0396]        1.6
## Lloyd & Zylla (1988)                     0.6200 [-0.3404;  1.5804]        1.6
## Saigh & Antoun (1983)                    0.9300 [ 0.2244;  1.6356]        1.8
## Steinweg (1979)                          0.1700 [-0.9472;  1.2872]        1.5
## Steinweg (1979)                          0.2000 [-0.9172;  1.3172]        1.5
## Sweet & Ringness (1971)                  0.3900 [-0.2568;  1.0368]        1.9
## Sweet & Ringness (1971)                  1.1700 [ 0.5624;  1.7776]        1.9
## Sweet & Ringness (1971)                  0.1300 [-0.3208;  0.5808]        2.0
## Terrell et al. (1980)                    1.1800 [ 0.4156;  1.9444]        1.8
## Terrell et al. (1980)                    1.4000 [ 0.6160;  2.1840]        1.8
## Tiber (1963)                             0.1100 [-0.3212;  0.5412]        2.1
## Tiber (1963)                            -0.3400 [-0.7712;  0.0912]        2.1
## Tiber (1963)                             0.0000 [-0.4312;  0.4312]        2.1
## Weiss (1981)                             0.5700 [-0.5864;  1.7264]        1.4
## Weiss (1981)                             1.2500 [-0.0044;  2.5044]        1.3
## Weiss (1981)                             3.6400 [ 1.6800;  5.6000]        0.9
## Willis & Shibata (1978)                  0.5900 [-0.2724;  1.4524]        1.7
## Filled: Bradley-Johnson et al. (1986)   -0.5551 [-1.4371;  0.3269]        1.7
## Filled: Blanding et al. (1994)          -0.5751 [-1.3199;  0.1697]        1.8
## Filled: Bradley-Johnson et al. (1986)   -0.6051 [-1.4871;  0.2769]        1.7
## Filled: Edlund (1972)                   -0.6451 [-1.4879;  0.1977]        1.7
## Filled: Devers & Bradley-Johnson (1994) -0.6651 [-1.4687;  0.1385]        1.7
## Filled: Saigh & Antoun (1983)           -0.6851 [-1.3907;  0.0205]        1.8
## Filled: Holt & Hobbs (1979)             -0.7851 [-1.4319; -0.1383]        1.9
## Filled: Lloyd & Zylla (1988)            -0.7951 [-1.7947;  0.2045]        1.6
## Filled: Sweet & Ringness (1971)         -0.9251 [-1.5327; -0.3175]        1.9
## Filled: Terrell et al. (1980)           -0.9351 [-1.6995; -0.1707]        1.8
## Filled: Weiss (1981)                    -1.0051 [-2.2595;  0.2493]        1.3
## Filled: Blanding et al. (1994)          -1.0851 [-1.9671; -0.2031]        1.7
## Filled: Terrell et al. (1980)           -1.1551 [-1.9391; -0.3711]        1.8
## Filled: Clingman & Fowler (1976)        -1.1751 [-2.2335; -0.1167]        1.5
## Filled: Weiss (1981)                    -3.3951 [-5.3551; -1.4351]        0.9
## 
## Number of studies combined: k = 58 (with 15 added studies)
## 
##                         SMD            95%-CI    t p-value
## Random effects model 0.1614 [-0.0632; 0.3859] 1.44  0.1557
## Prediction interval         [-1.4587; 1.7814]             
## 
## Quantifying heterogeneity:
##  tau^2 = 0.6414 [0.3019; 0.9707]; tau = 0.8009 [0.5495; 0.9852];
##  I^2 = 72.1% [63.7%; 78.5%]; H = 1.89 [1.66; 2.16]
## 
## Test of heterogeneity:
##       Q d.f.  p-value
##  204.14   57 < 0.0001
## 
## Details on meta-analytical method:
## - Inverse variance method
## - Sidik-Jonkman estimator for tau^2
## - Q-profile method for confidence interval of tau^2 and tau
## - Hartung-Knapp adjustment for random effects model
## - Trim-and-fill method to adjust for funnel plot asymmetry
pcurve(dcmeta)

## P-curve analysis 
##  ----------------------- 
## - Total number of provided studies: k = 43 
## - Total number of p<0.05 studies included into the analysis: k = 14 (32.56%) 
## - Total number of studies with p<0.025: k = 9 (20.93%) 
##    
## Results 
##  ----------------------- 
##                     pBinomial  zFull pFull  zHalf pHalf
## Right-skewness test     0.212 -2.862 0.002 -3.518     0
## Flatness test           0.370  0.329 0.629  4.038     1
## Note: p-values of 0 or 1 correspond to p<0.001 and p>0.999, respectively.   
## Power Estimate: 40% (12.8%-69.1%)
##    
## Evidential value 
##  ----------------------- 
## - Evidential value present: yes 
## - Evidential value absent/inadequate: no

These results give way to a pithy summary: fraud makes the difference. Without Breuning’s studies, less reliable estimates clearly have larger effects, the trim-and-fill adjustment renders the effect insignificant, the p-curve power estimate (40%) is too low for that analysis to be acceptable, and Egger’s test is no longer insignificant. Breuning was sentenced in 1988 and his studies are outliers; there is no excuse for their appearance in this meta-analysis. It is noteworthy that I did not use a mixed-effects meta-analysis, but I don’t believe the authors should have either, since the variability is limited, minor, and probably more affected by error (reinforced by the small number of samples, their tiny Ns, considerable measure heterogeneity, and the use of ranges rather than precise figures) than by systematic variance.
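
To show the role of the fraudulent studies directly, the pooled estimate can be compared with and without Breuning’s work using the full data and its ‘Fraud’ column (a sketch mirroring the metagen call in the next section; it assumes the full file shares the column names of the support file):

#Sketch: pooled estimate with and without Breuning's studies, using the 'Fraud' indicator in d.
#Assumes the full data use the same column names as the support file (Hedge.s.g, SE, Study).
dmeta_all <- metagen(Hedge.s.g, SE, data = d, studlab = Study,
                     comb.fixed = F, comb.random = T, method.tau = "SJ",
                     hakn = T, sm = "SMD")
dmeta_nofraud <- metagen(Hedge.s.g, SE, data = subset(d, Fraud == 0), studlab = Study,
                         comb.fixed = F, comb.random = T, method.tau = "SJ",
                         hakn = T, sm = "SMD")
c(with_Breuning = dmeta_all$TE.random, without_Breuning = dmeta_nofraud$TE.random)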

Random Effects Meta-Analysis

dcmeta <- metagen(Hedge.s.g, SE, data = dc,
                  studlab = Study, 
                  comb.fixed = F,
                  comb.random = T, 
                  method.tau = "SJ", 
                  hakn = T, prediction = T, sm = "SMD")

dcf <- forest(dcmeta, sortvar = TE, xlim = c(-1, 3),
              rightlabs = c("Effect Size", "95% CI", "Weight"),
              leftcols = c("Study"), leftlabs = c("Study"),
              pooled.totals = F, smlab = "", text.random = "Overall Effect",
              print.tau2 = F, col.diamond = "orangered", col.diamond.lines = "black",
              col.predict = "black", print.I2.ci = F, digits.sd = 2,
              comb.fixed = F, col.square = "#00348E", overall = T)

This yields a decently large effect size (ignoring heterogeneity in the source of the “IQ” estimates and whatnot).
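
For reference, since the text above uses the SD = 15 metric, the pooled g can be expressed in IQ points directly:

dcmeta$TE.random * 15 #pooled gain in IQ points (SD = 15 metric)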

Trim-and-Fill

TF <- trimfill(dcmeta) #trim-and-fill on the random-effects model
dfG <- data.frame(TF$seTE, TF$TE); dfG #effect sizes and SEs, observed plus imputed
estimateG = TF$TE.random; seG = TF$seTE.random; estimateGNC = dcmeta$TE.random; seGNC = dcmeta$seTE.random #adjusted and unadjusted pooled estimates and SEs
(1-(pnorm(estimateG/seG)))*2; (1-(pnorm(estimateGNC/seGNC)))*2 #two-sided p-values: adjusted, then unadjusted
## [1] 0.15022
## [1] 0.000000036446

The result is clearly significant on its own (i.e., without the Breuning studies; it also remains significant without the Weiss outlier). However, the effect after trim-and-fill is not significant.

se.seq = seq(0, max(dfG$TF.seTE), 0.001) #range of standard errors for the funnel contours
ll95 = estimateG - (1.96*se.seq) #95% pseudo-confidence contours around the adjusted estimate
ul95 = estimateG + (1.96*se.seq)
ll95a = TF$lower.random #95% CI of the adjusted pooled effect
ul95a = TF$upper.random
ll99 = estimateG - (3.29*se.seq) #p < .001 contours
ul99 = estimateG + (3.29*se.seq)
ll99a = 1.67857*TF$lower.random #scaled 95% bounds (3.29/1.96); not used in the plot below
ul99a = 1.67857*TF$upper.random
meanll95 = estimateG - (1.96*seG)
meanul95 = estimateG + (1.96*seG)
dfGCI <- data.frame(ll95, ul95, ll99, ul99, se.seq, estimateG, meanll95, meanul95, ll95a, ul95a, ll99a, ul99a)

ggplot(aes(x = TF.seTE, y = TF.TE), data = dfG) + 
  geom_point(shape = 16, size = 3, colour = "#164D07") + #sizing by sample size might be more informative 
  xlab('Standard Error') + ylab('Effect Size') + 
  geom_line(aes(x = se.seq, y = ll95), linetype = 'dotted', colour = "#666666", size = 1, data = dfGCI) +
  geom_line(aes(x = se.seq, y = ul95), linetype = 'dotted', colour = "#666666", size = 1, data = dfGCI) +
  geom_line(aes(x = se.seq, y = ll99), linetype = 'dashed', colour = "#666666", size = 1, data = dfGCI) +
  geom_line(aes(x = se.seq, y = ul99), linetype = 'dashed', colour = "#666666", size = 1, data = dfGCI) +
  geom_segment(aes(x = min(se.seq), y = estimateG, xend = max(se.seq), yend = estimateG), linetype='dotted', colour = "#B72020", size = 1, data=dfGCI) +
  geom_segment(aes(x = min(se.seq), y = ll95a, xend = max(se.seq), yend = ll95a), linetype='dotted' , colour = "orangered", size = 1, data=dfGCI) +
  geom_segment(aes(x = min(se.seq), y = ul95a, xend = max(se.seq), yend = ul95a), linetype='dotted' , colour = "orangered", size = 1, data=dfGCI) +
  scale_x_reverse() +
  coord_flip() + 
  theme_bw() + 
  theme(text = element_text(family = "serif", size = 12))

Conclusion

The final estimate is small to modest and, presumably, if measurement were a focus, would be even smaller. To clarify, most of these studies have too few people in them to satisfy something like Gorsuch’s (rather liberal) rule regarding how many people should be required for a factor analysis (that is, five per indicator and, for a multi-group analysis, per group). Pruning all samples that fail to meet this criterion and, obviously, redacting Breuning’s studies, only six studies (and possibly not Tiber’s, depending on the actual number of subtests) would have qualified as usable for the purposes of the meta-analysis. I’ve posted their information in the table below.

Study             Total N        g      SE       p
Ferguson (1937)       156   0.0336  0.1594  0.8331
Gerwell (1981)         64   0.5821  0.2538  0.0218
Graham (1971)         128  -0.1445  0.1760  0.4116
Tiber (1963a)          80   0.1080  0.2216  0.6260
Tiber (1963b)          80  -0.3360  0.2230  0.1319
Tiber (1963c)          80   0.0026  0.2214  0.9906
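
Using just the values in the table above, the pooled random-effects estimate for these six qualifying studies can be recomputed (a sketch with the same metagen settings as before):

#Sketch: pooled estimate restricted to the six qualifying studies listed in the table above.
qual <- data.frame(
  Study = c("Ferguson (1937)", "Gerwell (1981)", "Graham (1971)",
            "Tiber (1963a)", "Tiber (1963b)", "Tiber (1963c)"),
  g  = c(0.0336, 0.5821, -0.1445, 0.1080, -0.3360, 0.0026),
  SE = c(0.1594, 0.2538, 0.1760, 0.2216, 0.2230, 0.2214))
qualmeta <- metagen(g, SE, data = qual, studlab = Study,
                    comb.fixed = F, comb.random = T, method.tau = "SJ",
                    hakn = T, sm = "SMD")
qualmeta$TE.random; qualmeta$pval.random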

A major problem is the manner of aggregating studies with multiple subtests (i.e., Gerwell, 1981; Bergan, McManis, & Melchert, 1971; Saigh & Antoun, 1983; Dickstein & Ayers, 1973) and the treatment of studies with only a few subtests as equivalent to studies with standard full-scale IQ measures. Simple averaging produces an incorrect sumscore and, if gains are associated with, say, g, will lead to effect underestimation (and vice-versa if they are negatively associated). To probe this, I tried averaging as the original authors did (using their provided values), averaging weighted by a presumed g loading (verbal = 0.75, performance = 0.66, etc.; this is akin to the Woodcock-Johnson’s GIA weighting, except that uses empirical values, which were unavailable here), and so on, but the result did not change with the aggregation method; I may attempt a multivariate or three-level meta-analysis to assess this later. The meta-analytic result with dependency flagrantly displayed - keeping subtest scores in the meta-analysis and treating them as independent results - is a barely significant effect (~0.15, p = 0.025 with trim-and-fill) with publication bias still indicated. Treating subscales like full-scale IQs means that some of the effects will conflate, perhaps, more specific motivational effects with the more general ones that should be expected from ‘real’ full-scale IQ modification - an error, but one that can’t be corrected with such limited data. Really addressing this issue requires modeling the data to tease out the sources of gains (or, indeed, whether they are real at all).
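
A minimal sketch of the weighting just described (the loadings are the presumed values from the text; the subtest gains are hypothetical placeholders, since the study-level subtest values are not reproduced here):

#Sketch: g-loading-weighted aggregation of subtest gains versus simple averaging.
#Loadings are the presumed values above; the subtest gains are hypothetical placeholders.
subtest_g <- c(verbal = 0.40, performance = 0.20) #hypothetical per-subtest gains
g_loading <- c(verbal = 0.75, performance = 0.66) #presumed g loadings
c(unweighted = mean(subtest_g),
  g_weighted = sum(subtest_g * g_loading) / sum(g_loading))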

If raw data were available for all of the studies, only those which achieved measurement invariance should have been kept (unless the interest is in likely useless score changes, changes unlike group differences, specificities, or something along those lines; regardless, invariance is requisite for meaningful interpretation, and it is probably not achieved - note the differences in pre- and post-test SD changes for the intervention and control groups that have those measures available). Unless the effect of incentives on motivation is an invariant effect on motivation, which itself must have a sizable impact on IQ scores (or, if we’re taking measurement seriously, some common factor score, a specificity that would become a common factor with more indicators, or at the very least something with some criterion validity), you can bet there would be an invariance violation. It is shocking that no psychometricians were involved in this analysis or its review, but far worse that it continues to be cited uncritically, both in research publications and informally.
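
Concretely, the invariance check described here could be run along these lines (a sketch with lavaan, which is not loaded above; the single-factor model, the variable names, and the ‘group’ column are all hypothetical):

#Sketch: configural vs. metric invariance across groups with lavaan (not loaded above).
#The model, variable names, and grouping variable are hypothetical placeholders.
library(lavaan)
mod <- 'G =~ x1 + x2 + x3 + x4'
configural <- cfa(mod, data = rawdata, group = "group")
metric <- cfa(mod, data = rawdata, group = "group", group.equal = "loadings")
anova(configural, metric) #a significant degradation suggests the loadings are not invariant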

References

Duckworth, A. L., Quinn, P. D., Lynam, D. R., Loeber, R., & Stouthamer-Loeber, M. (2011). Role of test motivation in intelligence testing. Proceedings of the National Academy of Sciences, 108(19), 7716-7720. https://doi.org/10.1073/pnas.1018601108

Jensen, A. R. (1991). Spearman’s g and the Problem of Educational Equality. Oxford Review of Education, 17(2), 169-187. https://doi.org/10.1080/0305498910170205

Lievens, F., Reeve, C. L., & Heggestad, E. D. (2007). An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92(6), 1672-1682. https://doi.org/10.1037/0021-9010.92.6.1672

Egger, M., Smith, G. D., Schneider, M., & Minder, C. (1997). Bias in meta-analysis detected by a simple, graphical test. BMJ, 315(7109), 629-634. https://doi.org/10.1136/bmj.315.7109.629

Duval, S., & Tweedie, R. (2000). A Nonparametric “Trim and Fill” Method of Accounting for Publication Bias in Meta-Analysis. Journal of the American Statistical Association, 95(449), 89-98. https://doi.org/10.1080/01621459.2000.10473905

Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel-plot-based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56(2), 455-463. https://doi.org/10.1111/j.0006-341x.2000.00455.x

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). p-Curve and Effect Size: Correcting for Publication Bias Using Only Significant Results. Perspectives on Psychological Science: A Journal of the Association for Psychological Science, 9(6), 666-681. https://doi.org/10.1177/1745691614553988

Ferguson, H. H. (1937). Incentives and an intelligence test. Australasian Journal of Psychology and Philosophy, 15(1), 39-53. https://doi.org/10.1080/00048403708541086

Graham, G. A. (1971). The effects of material and social incentives on the performance on intelligence test tasks by lower class and middle class Negro preschool children. Dissertation Abstracts International, 31(7-B), 4311.

Tiber, N. (1963). The effects of incentives on intelligence test performance (Doctoral dissertation). Florida State University, Tallahassee, FL.

Benton, A. L. (1936). Influence of incentives upon intelligence test scores of school children. The Journal of Genetic Psychology, 49, 494-497.

Example studies to include in future meta-analyses:

https://onlinelibrary.wiley.com/doi/abs/10.1111/bjop.12288

https://www.tandfonline.com/doi/abs/10.1080/13803395.2018.1551519

An ending note about meta-analyses:

They still seem to produce overestimated effects, which is sensible if they’re, like this one, an aggregate of a bunch of tiny and highly selected studies. Methods like trim-and-fill, 3PSM, and even PET-PEESE (which does not seem to work) may still produce overestimated effects because of this and other issues (including, notably, measurement problems). It is unfortunate that a MASEM cannot be conducted on these studies, because they used heterogeneous instruments and, worse, did not provide enough data to analyze them anyway. Additionally, it should go without saying that meta-analyses need to be comprehensive or they can be seriously biased.

https://www.nature.com/articles/s41562-019-0787-z

https://pubmed.ncbi.nlm.nih.gov/31464485/
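
For reference, the PET-PEESE mentioned above amounts to inverse-variance-weighted regressions of the effect sizes on the standard error (PET) or its square (PEESE), with the intercept read as the bias-corrected estimate; a sketch on the support data:

#Sketch: PET and PEESE on the support data (a method the text notes may not work well).
pet <- lm(Hedge.s.g ~ SE, data = dc, weights = 1/SE^2)
peese <- lm(Hedge.s.g ~ I(SE^2), data = dc, weights = 1/SE^2)
coef(pet)[1]; coef(peese)[1] #intercepts read as the 'corrected' estimates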