Group differences in standardized test scores are a product of several factors. Without invariance, bias is among them; with it, it is not. For the sake of this exercise, invariance will be taken for granted and, as such, differences between groups primarily reflect differences in ability. To the extent that groups differ in how much they use test prep and in the quality of that prep, some part of their observed score gap may be due to it. After accounting for differences in the quality of test prep, the difference due to test prep (D) is a simple function of the effects of test prep (E) and the proportions using it (P) for groups A and B:
\[(E_A \times P_A) - (E_B \times P_B) = D\]
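If there are several types of prep, each with its own proportion of users, and we assume additivity and a lack of diminishing returns to practice (something not observed in practice), the equation extends to a sum over types, with each term being one prep type's effect multiplied by the proportion of the group using it: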
\[(D1_A + D2_A + \cdots + DN_A) - (D1_B + D2_B + \cdots + DN_B) = D\]
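The extended equation can be sketched in R as a sum over prep types. This is only an illustration under the additivity assumption stated above: `PrepPointsMulti` is a hypothetical helper and the input values are made up, not taken from any study.

```r
# Hypothetical helper: gap contribution summed over N types of prep,
# assuming additive effects and no diminishing returns.
# PA, PB: proportions of groups A and B using each prep type
# EA, EB: effect (in points) of each prep type in groups A and B
PrepPointsMulti <- function(PA, PB, EA, EB) {
  stopifnot(length(PA) == length(EA), length(PB) == length(EB))
  sum(EA * PA) - sum(EB * PB)
}

# Made-up example with two prep types (say, a course and tutoring)
D_example <- PrepPointsMulti(PA = c(0.2, 0.1), PB = c(0.1, 0.1),
                             EA = c(20, 30),   EB = c(20, 30))
D_example  # (4 + 3) - (2 + 3) = 2 points
```

With a single prep type this reduces to the two-term equation given first.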
Note that much of the apparent effect of test preparation in the general population is confounded and some varieties of preparation, whilst appearing cross-sectionally related to scores, have no effect in experiments.
If two groups differ in the effectiveness of prep and in their rates of use, a part of their difference can be computed from the equation above:
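# Points of the A-B gap due to prep: (EA * PA) - (EB * PB)
PrepPoints <- function(PA, PB, EA = 15, EB = 15) {
  (EA * PA) - (EB * PB)
}
# Usage in group A rises from 0 to 1 while usage in group B falls from 1 to 0
PAL <- seq(0, 1, 0.01); PBL <- seq(1, 0, -0.01)
PrepPoints(PA = PAL, PB = PBL, EA = 15, EB = 10)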
## [1] -10.00 -9.75 -9.50 -9.25 -9.00 -8.75 -8.50 -8.25 -8.00 -7.75
## [11] -7.50 -7.25 -7.00 -6.75 -6.50 -6.25 -6.00 -5.75 -5.50 -5.25
## [21] -5.00 -4.75 -4.50 -4.25 -4.00 -3.75 -3.50 -3.25 -3.00 -2.75
## [31] -2.50 -2.25 -2.00 -1.75 -1.50 -1.25 -1.00 -0.75 -0.50 -0.25
## [41] 0.00 0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.00 2.25
## [51] 2.50 2.75 3.00 3.25 3.50 3.75 4.00 4.25 4.50 4.75
## [61] 5.00 5.25 5.50 5.75 6.00 6.25 6.50 6.75 7.00 7.25
## [71] 7.50 7.75 8.00 8.25 8.50 8.75 9.00 9.25 9.50 9.75
## [81] 10.00 10.25 10.50 10.75 11.00 11.25 11.50 11.75 12.00 12.25
## [91] 12.50 12.75 13.00 13.25 13.50 13.75 14.00 14.25 14.50 14.75
## [101] 15.00
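In the grid above, PB is set to 1 - PA, so the sign of the result flips where EA × PA = EB × (1 - PA), that is, at PA = EB / (EA + EB). A minimal sketch of that check, using the same effects as the call above; the names `gap` and `PA_zero` are introduced here for illustration only.

```r
EA <- 15; EB <- 10

# Gap contribution when group B's usage is the complement of group A's
gap <- function(PA) (EA * PA) - (EB * (1 - PA))

# Closed-form crossover: EA * PA = EB * (1 - PA)
PA_zero <- EB / (EA + EB)
PA_zero

# Numerical check with base R's root finder
uniroot(gap, interval = c(0, 1))$root
```

With these effects the crossover sits at PA = 0.4, which corresponds to the 0.00 entry at position 41 of the output.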
The College Board (linked below) provided a datapoint on preparation effects but it lacked the detail to make it worth more than a passing mention. Buchmann, Condron & Roscigno (2010) provided a series of heavily controlled models to assess the effects of different types of preparation and their use by social class. They arrived at insignificant gains for books, videos or computer software, a 26 point gain for high school prep courses, a 30 point gain for private courses, and a 37 point gain for private tutors. They note:
These estimates are much smaller than the gains of 100 points or more that test prep companies advertise. They are also more in line with results of studies that similarly account for potentially confounding factors and smaller scale studies using data on pre- and post-test preparation SAT scores that find score gains in the range of 20-30 points. (p. 450)
They also note, in a similar fashion to Byun & Park (2012), Alon (2010) and Devine-Eller (2012), that
Although these models reveal some lingering racial disadvantages, particularly for blacks, it is interesting that the gap between black and white students’ SAT scores increases when we account for test preparation. This makes sense given that blacks engage in test preparation more often than whites net of other factors. (pp. 451-452)
Others have done large evaluations that come to similar conclusions, with different types of evidence about the efficacy of test preparation, for the SAT (Briggs, 2001; Domingue & Briggs, 2009), ACT (Moore, Sanchez & San Pedro, 2018; Sanchez & Harnisher, 2018) and, in usually smaller investigations, for tests like Raven’s Progressive Matrices (Denney & Heidrich, 1990) or the GRE (Wilson, 1988). There are many studies I’m omitting here because this is not a review, but the general tenor of the conclusions is that the effects of studying are small, test-retest effects are modest but larger, and both show diminishing returns. Invariance across levels of test preparation (e.g., Arendasy et al., 2016) and across race, ethnicity and sex is found with few exceptions, and those exceptions are mostly unnotable because of bias canceling across tests (see, e.g., Roznowski & Reith, 1999). Note that, although different levels of test preparation might yield invariance (perhaps due to their minor and similar effects as well as diminishing returns), the results of test-familiar people have not been found to be similar to those of test-unfamiliar people. The bias from test familiarity can be considerable and it tends to have a strongly negative impact on the criterion validity of the test, perhaps because tests no longer seem to assess intelligence and, instead, assess memory once they are familiar to the test taker (e.g., Reeve, Heggestad & Lievens, 2009; Lievens, Reeve & Heggestad, 2007; cf. Hausknecht, Trevor & Farr, 2002 for a different type of validity and Van Iddekinge et al., 2011 for the opposite result). It is worth noting that invariance alongside score gains without factor mean differences is not unexpected: it may signal low power or effects on specificities, both of which are not uncommonly the case.
This may be a good reason for schools to adjust the scores of those who retake tests, although this could be hard to do in practice for reasons including not knowing how many times tests like the PSAT were taken.
In Byun & Park’s data, 12% of students take commercial test preparation courses; 30% of East Asians take them, 15% of “Other” Asians take them, 10% of Whites take them, 16% of Blacks take them and 11% of Hispanics do as well. For private one-to-one tutoring, these numbers are 9, 11, 7, 7, 17 and 7%, respectively. Within the Asian groups, prior achievement is related to the likelihood of taking a commercial preparation course but within Whites and Hispanics, the relationship is slightly negative; there is no relationship in the Black group. In all groups, the relationship between prior achievement and the likelihood of participating in a private test preparation course is negative, although not significantly so in the Asian groups. Commercial preparation effects differ in this sample between the East Asian and non-East Asian groups, with East Asians netting 68.835 points, other Asians netting 23.842, Whites getting 12.286, Blacks earning 14.987 and Hispanics obtaining 24.625 (some of these are insignificant). Private one-to-one tutoring effects were considerably smaller and not significant in any group, with effects of 13.402, 2.1760, 9.266, -7.501 and 12.145 points for those groups, respectively. We can assess the effects on group gaps.
EAW <- PrepPoints(0.3, 0.1, 68.835, 12.286); EAW
## [1] 19.4219
WB <- PrepPoints(0.1, 0.16, 12.286, 14.987); WB
## [1] -1.16932
WH <- PrepPoints(0.1, 0.11, 12.286, 24.625); WH
## [1] -1.48015
There are more gaps to be discussed but the impacts of tutoring on them here are obvious. It is worth noting that, in this sample, the gap between East Asians and Whites is increased by 19.422 points by commercial test preparation, and the gaps between Whites and Blacks and between Whites and Hispanics are reduced by 1.169 and 1.480 points, respectively, because Blacks and Hispanics use test prep at higher rates and gain more from it than Whites do. Most of the differences in use and effect are not significant and a one point difference is certainly not significant, but the direction of test prep impacts being as they are is interesting. Buchmann, Condron & Roscigno (2010), Alon (2010) and Devine-Eller (2012) allow similar conclusions. No one, to my knowledge, has discussed misreporting regarding the use of test preparation. Some (e.g., Alon) have advanced the idea that differences between groups in the extent of test preparation may, to some degree, reflect a desire to “have the same options as whites” (p. 11).
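The remaining comparisons can be run through the same formula. The sketch below uses only the rates and effects quoted above from Byun & Park's data, computes each group's expected points from commercial courses and from one-to-one tutoring, and differences them against Whites. The variable names are introduced here for illustration, the `total` column assumes the two types of prep are additive, and several of the underlying effects (all of the tutoring ones) were not significant, so the results are descriptive at best.

```r
grp <- c("East Asian", "Other Asian", "White", "Black", "Hispanic")

# Usage rates and effects (points) as quoted in the text
p_course <- c(0.30, 0.15, 0.10, 0.16, 0.11)
e_course <- c(68.835, 23.842, 12.286, 14.987, 24.625)
p_tutor  <- c(0.11, 0.07, 0.07, 0.17, 0.07)
e_tutor  <- c(13.402, 2.176, 9.266, -7.501, 12.145)

# Expected points per group member from each type of prep (rate * effect)
course_pts <- setNames(p_course * e_course, grp)
tutor_pts  <- setNames(p_tutor * e_tutor, grp)

# Contribution to each group's gap with Whites (group minus White)
gaps <- data.frame(
  course = course_pts - course_pts[["White"]],
  tutor  = tutor_pts - tutor_pts[["White"]],
  row.names = grp
)
gaps$total <- gaps$course + gaps$tutor  # assumes additivity across prep types
round(gaps, 3)
```

Because the rows are expressed as group minus White, the `course` entries for Blacks and Hispanics are the negatives of `WB` and `WH` above, while the East Asian entry equals `EAW`.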
Briggs (2009) provided a summary of preparation effects with a list of studies and some reviews. Among these, studies with randomization had effects ranging from -1 points to 121. Without the 121 point gain outlier, which was likely due to the “severe sample attrition” and lack of an “attempt… to control for group differences statistically”, the range becomes -1 to 57 points. DerSimonian & Laird (1983) conducted a meta-analysis of coaching studies and provided a critical piece of nuance that also applies to similar findings: uncontrolled trials have larger effects than controlled trials, which have larger effects than matched or randomized ones. For the mathematics section, these gains were 40.6 points (39 with nonparametric estimates) for uncontrolled studies, 15.3 points (14.7) for controlled studies and 10.1 points (10.3) for matched or randomized ones; for the verbal section, these were 53.8 (54.3), 15.6 (16.3) and 9.8 (10.1) points. In other areas, it’s common to see that controlled studies produce overestimated effects when the control group is passive; active control group studies typically move effects towards zero.
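To make the shrinkage across study designs concrete, the parametric estimates quoted above from DerSimonian & Laird can be expressed as a share of the uncontrolled estimate. This is a simple arithmetic sketch; the data frame and column names are introduced here for illustration.

```r
gains <- data.frame(
  design = c("uncontrolled", "controlled", "matched/randomized"),
  math   = c(40.6, 15.3, 10.1),
  verbal = c(53.8, 15.6, 9.8)
)

# Share of the uncontrolled estimate that remains under each design
gains$math_share   <- round(gains$math / gains$math[1], 2)
gains$verbal_share <- round(gains$verbal / gains$verbal[1], 2)
gains
```

On these figures, matched or randomized studies retain roughly a quarter of the uncontrolled mathematics gain and under a fifth of the uncontrolled verbal gain.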
Meaningful portions of racial/ethnic, sex, social class or, really, any major group differences are not in any clear way attributable to differences in the use of test preparation services. The test preparation industry's frequent claims of massive gains do not seem to be supported by the evidence.
Buchmann, C., Condron, D. J., & Roscigno, V. J. (2010). Shadow Education, American Style: Test Preparation, the SAT and College Enrollment. Social Forces, 89(2), 435–461. https://doi.org/10.1353/sof.2010.0105
Byun, S., & Park, H. (2012). The Academic Success of East Asian American Youth: The Role of Shadow Education. Sociology of Education, 85(1), 40–60. https://doi.org/10.1177/0038040711417009
Alon, S. (2010). Racial Differences in Test Preparation Strategies: A commentary on Shadow Education, American Style: Test Preparation, the SAT and College Enrollment. Social Forces, 89(2), 463–474.
Devine‐Eller, A. (2012). Timing Matters: Test Preparation, Race, and Grade Level. Sociological Forum, 27(2), 458–480. https://doi.org/10.1111/j.1573-7861.2012.01326.x
Briggs, D. C. (2001). The Effect of Admissions Test Preparation: Evidence from NELS-88. https://nepc.colorado.edu/publication/the-effect-admissions-test-preparation-evidence-nels-88, https://archive.ph/tti1b, https://web.archive.org/web/20230209224735/https://nepc.colorado.edu/publication/the-effect-admissions-test-preparation-evidence-nels-88
Domingue, B., & Briggs, D. C. (2009). Using Linear Regression and Propensity Score Matching to Estimate the Effect of Coaching on the SAT. Multiple Linear Regression Viewpoints, 35(1), 12.
Moore, R., Sanchez, E., & San Pedro, M. O. (2017). Investigating Test Prep Impact on Score Gains Using Quasi-Experimental Propensity Score Matching (No. ED593130). ERIC; https://archive.ph/2RxC2, https://web.archive.org/web/20210424050252/https://files.eric.ed.gov/fulltext/ED593130.pdf. https://files.eric.ed.gov/fulltext/ED593130.pdf
Sanchez, E., & Harnisher, J. (2018). The Impact of ACT Kaplan Online Prep Live on ACT Score Gains (Technical Brief No. R1705; p. 20). ACT; https://archive.ph/sV4Un, https://web.archive.org/web/20220819030533/https://www.act.org/content/dam/act/unsecured/documents/R1705-test-prep-impact-2018-07.pdf. https://www.act.org/content/dam/act/unsecured/documents/R1705-test-prep-impact-2018-07.pdf
Also related, a summary of Sanchez & Harnisher (2018): https://www.act.org/content/dam/act/unsecured/documents/R1705-test-prep-study-2018-08.pdf, https://archive.ph/8YGlI, https://web.archive.org/web/20230209220035/https://www.act.org/content/dam/act/unsecured/documents/R1705-test-prep-study-2018-08.pdf
Denney, N. W., & Heidrich, S. M. (1990). Training effects on Raven’s Progressive Matrices in young, middle-aged, and elderly adults. Psychology and Aging, 5, 144–145. https://doi.org/10.1037/0882-7974.5.1.144
Wilson, K. M. (1988). A study of the long-term stability of GRE general test scores. Research in Higher Education, 29(1), 3–40. https://doi.org/10.1007/BF00992141
Arendasy, M. E., Sommer, M., Gutiérrez-Lobos, K., & Punter, J. F. (2016). Do individual differences in test preparation compromise the measurement fairness of admission tests? Intelligence, 55, 44–56. https://doi.org/10.1016/j.intell.2016.01.004
Roznowski, M., & Reith, J. (1999). Examining the Measurement Quality of Tests Containing Differentially Functioning Items: Do Biased Items Result in Poor Measurement? Educational and Psychological Measurement, 59(2), 248–269. https://doi.org/10.1177/00131649921969839
Reeve, C. L., Heggestad, E. D., & Lievens, F. (2009). Modeling the impact of test anxiety and test familiarity on the criterion-related validity of cognitive ability tests. Intelligence, 37(1), 34–41. https://doi.org/10.1016/j.intell.2008.05.003
Lievens, F., Reeve, C. L., & Heggestad, E. D. (2007). An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology, 92(6), 1672–1682. https://doi.org/10.1037/0021-9010.92.6.1672
Hausknecht, J. P., Trevor, C. O., & Farr, J. L. (2002). Retaking ability tests in a selection setting: Implications for practice effects, training performance, and turnover. Journal of Applied Psychology, 87, 243–254. https://doi.org/10.1037/0021-9010.87.2.243
Van Iddekinge, C. H., Morgeson, F. P., Schleicher, D. J., & Campion, M. A. (2011). Can I retake it? Exploring subgroup differences and criterion-related validity in promotion retesting. Journal of Applied Psychology, 96, 941–955. https://doi.org/10.1037/a0023562
Briggs, D. C. (2009). Preparation for College Admission Exams. ERIC; https://archive.ph/JKNlf, https://web.archive.org/web/20230209220843/https://files.eric.ed.gov/fulltext/ED505529.pdf. https://files.eric.ed.gov/fulltext/ED505529.pdf
DerSimonian, R., & Laird, N. (1983). Evaluating the Effect of Coaching on SAT Scores: A Meta-Analysis. Harvard Educational Review, 53(1), 1–15. https://doi.org/10.17763/haer.53.1.n06j5h5356217648
sessionInfo()
## R version 4.2.1 (2022-06-23 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.29 R6_2.5.1 jsonlite_1.8.2 magrittr_2.0.3
## [5] evaluate_0.17 stringi_1.7.8 cachem_1.0.6 rlang_1.0.6
## [9] cli_3.4.1 rstudioapi_0.14 jquerylib_0.1.4 bslib_0.4.0
## [13] rmarkdown_2.17 tools_4.2.1 stringr_1.4.1 xfun_0.33
## [17] yaml_2.3.5 fastmap_1.1.0 compiler_4.2.1 htmltools_0.5.3
## [21] knitr_1.40 sass_0.4.2