Prosocial evaluation: A new meta-analysis

Author

Alvin W.M. Tan

Published

October 15, 2024

NOTE: This was updated on 2024-10-15 to correct some coding errors.

Motivation

For PSYCH 211 (Developmental Psychology) in Winter Quarter 2023, we were assigned to read the seminal paper by Hamlin, Wynn, and Bloom (2007) on infants’ evaluation of prosocial and antisocial agents, and to write a reading response. As I was looking up relevant research papers, I found a neat meta-analysis by Margoni and Surian (2018) on this exact topic, and then recalled that this was one of the datasets on MetaLab (Bergmann et al. 2018), so it was possible to access the underlying coded data, graciously provided by the authors. I also found a number of studies that had been published since 2017 (when the meta-analysis was conducted), and wanted to see how the field had progressed in its understanding of prosocial evaluation since then. As a result, I ended up retrieving the data using metalabr (Iverson and El-Shawa 2023), and appending new studies that had emerged since 2017, leading to a frantic one-day meta-analysis push that I incorporated into my reading response. Eventually, I became curious about other related effects (e.g., what about neutral agents?) as well as a whole class of looking time studies, which the authors had initially excluded to reduce methodological heterogeneity. The last ~5 years have seen a number of studies measuring such effects, thereby ensuring that there were sufficient data to investigate them meaningfully. This write-up is a culmination of this process, intended as an approachable meta-analytic view of the domain (and which perhaps might be reshaped into a manuscript at a future date).

Acknowledgements at this point must be given to Francesco Margoni and Laura Franchin, who very kindly shared additional details of their studies and provided access to some of the papers included. Thanks also to my fantastic advisor, Mike Frank, whose commitment to open and collaborative science made this possible (and who has immense patience for his grad students working on numerous unrelated side projects, haha). Finally, thanks to Ellen Markman and Carol Dweck for the lovely class we’ve had this quarter, and for bringing this interesting domain of study to my attention in the first place.¹

Background

There are several excellent reviews of prosocial evaluation (e.g., Holvoet et al. 2016; Lavoie et al. 2022), so I will direct the interested reader to those for a more thorough treatment, and paint only a broad picture here. The phenomenon is based on the observation that social cognition relies strongly on rapid evaluation of whether another agent is “friend” or “foe”, “good” or “bad”, and so on. In order to study how early such a capacity emerges, developmental scientists have probed young infants’ ability to distinguish between and show a preference for agents that are more prosocial (e.g., helpers, givers, fair distributors, sharers, defenders) than agents that are more antisocial (e.g., hinderers, takers, unfair distributors, hoarders, bystanders). This is often measured using a manual task (e.g., choosing between the two agents, selective helping, offering a gift or reward), or a looking time task (e.g., violation of expectation regarding socially relevant events, preferential looking between the two agents, anticipatory looking for expectation of event completion).

The domain of prosocial evaluation was arguably inaugurated by Hamlin, Wynn, and Bloom (2007) (although see Kuhlmeier, Wynn, and Bloom (2003)), and has attracted a great amount of subsequent interest. Notably, a number of attempted replications (Salvadori et al. 2015; Schlingloff, Csibra, and Tatone 2020) have failed to replicate the original findings. This has motivated a meta-analysis by Margoni and Surian (2018), the seed of the present project, as well as an ongoing large-scale multi-lab replication attempt (Lucca et al. 2021) that is due to complete data collection later this year. For now, I use a meta-analytic approach to characterise and understand the evidence within this domain, which will hopefully be an informative summary of the phenomenon of prosocial evaluation.

Method

Search and inclusion

I adopted the search criteria from Margoni and Surian (2018), specifically searching the PsycInfo database with the search terms: infant* AND (moral* OR help* OR hinder* OR good* OR fair), but limiting the publication date range to 2017–2023. I also conducted an additional search with the same set of search terms plus looking time, with the date range 2007–2017. Finally, I also conducted a forward search from Hamlin, Wynn, and Bloom (2007) in Google Scholar to scoop up any missed items (especially unpublished grey literature), again searching through all records from 2017–2023 as well as records including looking time from 2007–2017.

Because I was interested in other potential effects not captured by the original study, I expanded the inclusion criteria. Here I list the original criteria, along with the expanded criteria used in the present project, using the SPIDER framework (Cooke, Smith, and Booth 2012).

Sample
- Original: Typically-developing infants aged 4–36 months
- New: Typically-developing infants aged 3–36 months (I expanded this to include a number of looking time studies with 3-month-olds)
Phenomenon of interest
- Original and new: Evaluation of prosocial agents
Design
- Original: Expression of preference between “morally good” and “morally bad” characters, with within-participants measures
- New: Expression of preference between “morally good” and “morally bad” characters, or between either of those and “morally neutral/ambiguous” characters, with within- or between-participants measures
Evaluation
- Original: Manual task (manual choice, selective helping, gift offering)
- New: Manual task (as above) or looking time task (violation of expectation, preferential looking, or anticipatory looking)
Research type
- Original and new: Experimental studies

The original meta-analysis included 26 papers with 62 effect sizes. The replication included an additional 23 papers with 55 effect sizes. The looking time extension included an additional 19 papers and an additional 72 effect sizes. The neutral and other effect extension included an additional 2 papers and an additional 28 effect sizes. In total, this meta-analysis included 70 papers and 217 effect sizes. Note that the extensions also included some new effect sizes from previously included papers (e.g., a new looking time effect size from a paper that had been previously coded for a manual effect size).

Coding

I included all the coded variables from Margoni and Surian (2018):

- Sample size
- Sample mean age
- Type of scenario (help/hinder, fair/unfair, give/take)
- Modality of stimulus presentation (live show, movies)
- Stimulus type (real, cartoon)
- Choice object (puppets, shapes, experimenters, cartoon people, cartoon animals)
- Dependent variable (reaching, offering help or reward, violation of expectation, preferential looking, anticipatory looking)
- Lab of origin (Hamlin, other)

I added a number of variables which could potentially affect effect sizes (these will be further explained below when they are actually tested for moderation):

- Intent valence (measuring agents’ intentions; positive/negative, positive/neutral, neutral/negative)
- Outcome valence (measuring actual outcome; opposite [positive/negative], neutral [positive/neutral or neutral/negative], same [positive/positive, neutral/neutral, negative/negative], reversed [negative/positive])
- In manual tasks, number of exclusions due to non-choice
- In looking time tasks, look target (preference, event itself, third-party approach, third-party reward)

Analysis

I first conducted a replication of the study by Margoni and Surian (2018), including only manual tasks with opposite intent valence (positive/negative). I then examined looking time studies, both by themselves and in conjunction with the manual tasks. Finally, I looked at both manual and looking time studies, and included studies with non-opposite intent valence (positive/neutral or neutral/negative). In all cases, I used the metalabr package (Iverson and El-Shawa 2023) to calculate effect sizes and the metafor package (Viechtbauer 2010) to run mixed-effects meta-analyses.

Effect sizes were log odds ratios for the manual tasks, and standardised mean differences for the looking time tasks. The random effects were sample group nested within papers (since some papers included multiple effect sizes from the same sample).
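As a rough illustration of the two effect size metrics, here is a minimal Python sketch. It assumes the simplest textbook forms: the log odds of choosing the prosocial agent computed from raw choice counts, and a standardised mean difference using a pooled standard deviation. The function names and the example numbers are hypothetical; the actual effect sizes in this project were computed with metalabr, whose exact formulas (e.g., any small-sample or within-participant corrections) may differ.

```python
import math


def log_odds_from_counts(n_prosocial, n_total):
    """Log odds of choosing the prosocial agent, plus its sampling variance.

    Assumes a simple two-option choice where every included infant
    picked exactly one agent (hypothetical helper, not from metalabr).
    """
    n_other = n_total - n_prosocial
    if n_prosocial <= 0 or n_other <= 0:
        raise ValueError("both choice counts must be positive for this formula")
    estimate = math.log(n_prosocial / n_other)
    variance = 1 / n_prosocial + 1 / n_other
    return estimate, variance


def standardised_mean_difference(mean_a, mean_b, sd_a, sd_b):
    """Mean difference divided by the pooled SD (equal-n approximation)."""
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Hypothetical study: 12 of 16 infants reach for the prosocial agent.
manual_estimate, manual_variance = log_odds_from_counts(12, 16)

# Hypothetical looking times (seconds) for two event types.
looking_d = standardised_mean_difference(10.0, 8.0, 4.0, 4.0)

print(manual_estimate, manual_variance, looking_d)
```

A log odds of 0 corresponds to chance responding (half of the infants choosing each agent), so positive values indicate a preference for the prosocial agent.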

Results and discussion

Instead of printing one forest plot per section, I’ll only print the full forest plot at the end (after both extensions; i.e., when all effect sizes have been included).

Replication

In the replication, I included only manual task studies with opposite intent valence (one agent was positive, and one was negative), to ensure comparability with the original meta-analysis. Model results suggested a significantly positive estimate of \(\beta =\) 0.34 (95% CI [0.26, 0.43], \(p\) < .001), which is equivalent to a proportion of 0.58 (95% CI [0.56, 0.61]) of infants choosing the prosocial agent over the antisocial agent.
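The conversion from the log odds estimate to a proportion is the inverse logit, \(p = 1 / (1 + e^{-\beta})\). A minimal Python sketch, using the estimate and confidence bounds reported above (the function name is my own):

```python
import math


def log_odds_to_proportion(log_odds):
    """Inverse logit: map a log odds value onto a proportion in (0, 1)."""
    return 1 / (1 + math.exp(-log_odds))


# Estimate and 95% CI bounds on the log odds scale, as reported above.
estimate, ci_low, ci_high = 0.34, 0.26, 0.43

converted = [round(log_odds_to_proportion(x), 2) for x in (estimate, ci_low, ci_high)]
print(converted)  # [0.58, 0.56, 0.61]
```

Because the inverse logit is monotonic, the confidence bounds can be converted directly in the same way as the point estimate.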

Lab and sample moderators

As in the original meta-analysis, I included lab of origin as a moderator, finding a non-significant tendency for papers from the Hamlin lab to report larger effects, \(\beta =\) 0.16, \(Q_M\)(1) \(=\) 2.80, \(p=\) .094.

There was also no effect of age, \(Q_M\)(1) \(=\) 1.95, \(p=\) .163.

When looking through the studies, I also noticed that there was some variability in the number of exclusions due to infants’ “misperforming” during the choice task—either they chose neither object, or they chose both objects. These were excluded due to uninterpretability, but one could conceivably consider a zero or both choice to be a legitimate choice; excluding these infants would possibly inflate the proportion of infants that chose the prosocial agent. Thus, I included this specific type of exclusion (zero or both choice) as a moderator, operationalised as the proportion of this exclusion to the full sample size. Note that some of these exclusions were estimated (e.g., if overall exclusions were given without breaking down into per-condition exclusions), and a few papers which did not report exclusions were dropped. Regardless, proportion of excluded participants was not a significant predictor, \(Q_M\)(1) \(=\) 0.60, \(p=\) .439.
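For readers less familiar with the \(Q_M\) statistic: it is an omnibus test of the moderator(s), referred to a chi-square distribution with the stated degrees of freedom. The sketch below shows how the reported \(p\)-values follow from the reported \(Q_M\) values. It is a standard-library Python implementation of the chi-square upper tail for integer degrees of freedom, written for illustration only; the analyses themselves were run in metafor.

```python
import math


def chi_square_p_value(q, df):
    """Upper-tail probability of a chi-square statistic with integer df."""
    if q < 0 or df < 1:
        raise ValueError("q must be non-negative and df a positive integer")
    x = q / 2
    if df % 2 == 1:
        a = 0.5
        p = math.erfc(math.sqrt(x))  # regularised upper gamma Q(1/2, x)
    else:
        a = 1.0
        p = math.exp(-x)  # regularised upper gamma Q(1, x)
    # Recurrence: Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1)
    while a < df / 2:
        p += x ** a * math.exp(-x) / math.gamma(a + 1)
        a += 1
    return p


# Q_M values reported in this section (all with 1 degree of freedom).
for label, q in [("lab of origin", 2.80), ("age", 1.95), ("exclusions", 0.60)]:
    print(label, round(chi_square_p_value(q, 1), 3))
```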

Scenario and stimuli moderators

There was an effect of scenario type, \(Q_M\)(3) \(=\) 13.05, \(p=\) .005. In particular, the help/hinder scenario had smaller effect sizes than the give/take scenario, as found by the original meta-analysis.

The dependent variable (reaching, helping), stimuli (real, cartoon), and choice object (shapes, puppets, people, experimenters) did not significantly moderate the effect size, all \(p\) > .15. However, in a deviation from the original meta-analysis, the presentation modality (movie, live show) did significantly moderate the effect size, \(Q_M\)(1) \(=\) 7.29, \(p=\) .007. Specifically, live shows elicited larger effects than movies.

Power analysis

A post-hoc power analysis suggested that a sample of 224 infants is required to reliably detect an effect size of the meta-analytic magnitude at 80% power. This is clearly much larger than any sample in any of the included papers, and perhaps serves as a signal to exercise more caution when designing, conducting, and interpreting manual task studies.
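To give a sense of where a number of this size can come from, here is a minimal Python sketch of one common way to do such a calculation. Everything about the method is an assumption on my part rather than a description of what was actually run: it converts the log odds estimate to Cohen's \(d\) using the standard logistic approximation \(d = \beta\sqrt{3}/\pi\), and then applies the normal-approximation sample size formula for a two-sided test at \(\alpha = .05\). The function names are hypothetical.

```python
import math
from statistics import NormalDist


def log_odds_to_d(log_odds):
    """Logistic approximation for converting a log odds (ratio) to Cohen's d."""
    return log_odds * math.sqrt(3) / math.pi


def required_sample_size(d, alpha=0.05, power=0.80):
    """Normal-approximation sample size for a two-sided, one-sample test.

    Assumed method for illustration; a t-based calculation would give
    a slightly larger number.
    """
    standard_normal = NormalDist()
    z_alpha = standard_normal.inv_cdf(1 - alpha / 2)
    z_power = standard_normal.inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / d) ** 2)


d_manual = log_odds_to_d(0.34)  # meta-analytic estimate for manual tasks
n_manual = required_sample_size(d_manual)
print(round(d_manual, 3), n_manual)
```

Under these assumptions the result is in line with the figure reported above, but that agreement should not be read as confirmation that this was the procedure used.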

Interim discussion

Broadly, the replication reproduced the effects found in the original meta-analysis. I found a small but positive overall effect, which was moderated by scenario type. I also found that the Hamlin lab tended to produce larger effects than other labs, although this difference was not significant. Surprisingly, presentation modality did moderate effect size; future studies on prosocial evaluation should consider using live shows to more reliably elicit the desired effect.

Looking time extension

Looking time has become a ubiquitous measure for infant studies, since it requires relatively little motor or linguistic capability and can thus be used with younger infants. It can also potentially offer more fine-grained measurements, since responses are continuous rather than binary (as in the manual case). Although there are some debates around looking time paradigms (e.g., Bergmann, Rabagliati, and Tsuji 2019), they have been used extensively to study prosocial evaluation, motivating a meta-analysis of such studies.

Indeed, a meta-analysis of looking time studies demonstrated a larger effect size of \(\beta =\) 0.57 (95% CI [0.41, 0.74], \(p\) < .001), although the variance in effect sizes was larger, with significant heterogeneity (\(Q\) test with 65 degrees of freedom, \(p\) < .001).

Lab and sample moderators

The Hamlin lab only produced two codable looking time estimates, thus it was not possible to establish whether lab of origin was a significant predictor of effect sizes.

Nonetheless, age remained a non-significant predictor, \(Q_M\)(1) \(=\) 0.30, \(p=\) .582. This was again surprising, as I had thought that the greater sensitivity of looking time measures may have revealed an effect. Perhaps, as has been argued elsewhere (Margoni and Surian 2018 among others), there is truly no developmental change from 3 to 36 months, meaning that either infants have an innate preference towards prosocial agents (even when they themselves are not the target of the prosocial action), or that any development must have occurred in the first three months of life. It’s probably quite difficult to probe this in infants any younger than three months of age using current methodologies, so we may need alternative strategies to examine prosocial evaluation in even younger infants.

Scenario and stimuli moderators

There was no effect of scenario type, \(Q_M\)(3) \(=\) 3.14, \(p=\) .371, in contrast to the manual tasks. My guess is that this is driven largely by the greater heterogeneity and uncertainty in looking time results.

Stimuli type and choice object again did not have any effect, both \(p\) > .15. Furthermore, presentation modality did not have an effect either, \(Q_M\)(1) \(=\) 1.10, \(p=\) .294, although the direction of the estimate was the same as that of manual tasks. Method of measurement did not have an effect, \(Q_M\)(2) \(=\) 3.52, \(p=\) .172, although it appeared that violation of expectation had a tendency towards larger effects than preferential looking. To investigate this further, I reran the meta-analysis, excluding studies using anticipatory looking (as only 3 effect sizes used this measure). This resulted in a trend towards larger effects for violation of expectation than for preferential looking, \(\beta =\) 0.31, \(p=\) .059.

I also broke down violation of expectation into three target types:

- The event itself (expectation for agents in general to be prosocial)
- Third-party approach (expectation for third parties to approach prosocial agents)
- Third-party reward (expectation for third parties to reward prosocial agents)

This did not explain any more variance in effect sizes, \(Q_M\)(3) \(=\) 2.04.

Participant design

The looking time studies also differed according to their participant design, with four studies (accounting for 10 effect sizes) using between-participant designs and 24 studies (accounting for 56 effect sizes) using within-participant designs. Within-participant designs had significantly smaller effect sizes than between-participant designs, \(\beta =\) -0.58, \(Q_M\)(1) \(=\) 6.30, \(p=\) .012.

Power analysis

A post-hoc power analysis suggested that a sample of 23 infants is required to reliably detect an effect size of the meta-analytic magnitude at 80% power. This is notably much smaller than that for manual tasks, due to the larger estimated effect size. It seems reasonable, then, to suggest that prosocial evaluation research should consider using looking time studies as a slightly more sensitive mode of measurement.

Combined meta-analysis

Combining the manual and looking time tasks showed that response mode did not actually significantly moderate the effect size, \(Q_M\)(1) \(=\) 2.94, \(p=\) .086.

Interim discussion

Looking time studies supported young infants’ ability to conduct prosocial evaluation, and did not significantly differ from manual tasks in their estimation of this effect size. These studies also showed some patterns in similar directions to manual task studies, including the fact that age was not a significant moderator. However, they demonstrated a different set of relationships with moderators such as scenario type and method of measurement. As such, even if manual and looking time studies were truly measuring the same underlying construct, they seem to involve different intervening processes and linking hypotheses (see Cao, Lewis, and Frank submitted). It would be interesting to examine only studies that included both manual and looking time metrics on the same set of participants (noting that there are 13 pairs of effect sizes that fulfil this criterion), but that is an analysis for another day.

Neutral agent extension

A number of studies of prosocial evaluation have examined not just prosocial and antisocial agents, but neutral agents (e.g., agents who neither perform a prosocial nor an antisocial action) or ambiguous agents (e.g., agents who act inconsistently); I classified both such categories as “neutral”. Some studies also manipulated the observed outcome of attempted actions, such that socially valenced events sometimes led to neutral outcomes (e.g., if the event was not played to completion so the outcome is unknown) or even reversed outcomes (e.g., with failed attempts to help or hinder that inadvertently caused the opposite outcome). This motivated an extension which looked at the effects of intent and outcome valence to determine whether and how they affect prosocial evaluations.

Nonetheless, neither intent nor outcome valence appeared to significantly moderate effect size (all \(p\) > .35). This is not too surprising for outcome valence, which feels like it truly shouldn’t affect one’s evaluation of an agent’s prosociality. But this is somewhat surprising for intent valence: it suggests that there is no difference between a prosocial–antisocial comparison and one with a neutral agent, which is intuitively surprising. It also seems to run counter to the suggestion that infants have a negativity bias—that they perceive neutral and antisocial agents as being more dissimilar than positive and neutral agents (Chae and Song 2018; Hamlin, Wynn, and Bloom 2007). These results seem to instead suggest that prosocial evaluation can be modelled as an estimation of relative degree of prosociality, which is then thresholded, such that a sufficient quantity of relative difference is enough to trigger differentiated responding. One way to probe this would be to run studies that have more than two alternatives (e.g., very prosocial, somewhat prosocial, and neutral agents) to determine whether infants consistently choose the most prosocial of the presented options, and further, whether there is a graded response towards the agents that is proportional to their prosociality. Of course, this would rely strongly on infants’ working memory capabilities (as they would need to maintain yet another agent representation), so we may only see an effect emerge for older infants. In fact, it would be good to probe this in adults as well—we may observe non-linearities in the response function that could be interesting to explore.

Cumulative meta-analysis

A cumulative meta-analysis of all the results shows how effect size has decreased over time, although it seems to have stabilised around \(d=\) 0.5 from 2018. Note that estimates are based on a fixed effects model, as the cumulative meta-analysis does not take random effects.
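The idea behind a cumulative fixed effects meta-analysis can be sketched in a few lines of Python: studies are sorted by year, and after each one is added, the pooled estimate is recomputed as the inverse-variance weighted mean of everything included so far. The function name and the three example studies below are hypothetical and are not taken from the dataset.

```python
def cumulative_fixed_effect(studies):
    """Running inverse-variance pooled estimates, in order of publication.

    `studies` is an iterable of (year, effect_size, variance) tuples.
    Returns a list of (year, pooled_estimate, pooled_standard_error).
    """
    sum_weights = 0.0
    sum_weighted_effects = 0.0
    results = []
    for year, effect, variance in sorted(studies, key=lambda s: s[0]):
        weight = 1.0 / variance
        sum_weights += weight
        sum_weighted_effects += weight * effect
        pooled = sum_weighted_effects / sum_weights
        pooled_se = (1.0 / sum_weights) ** 0.5
        results.append((year, pooled, pooled_se))
    return results


# Hypothetical effect sizes: an early large, noisy estimate followed by
# smaller, more precise ones.
example = [(2018, 0.5, 0.25), (2007, 1.0, 0.5), (2012, 0.5, 0.25)]
for year, pooled, pooled_se in cumulative_fixed_effect(example):
    print(year, round(pooled, 3), round(pooled_se, 3))
```

In this toy example the pooled estimate shrinks as more precise studies accumulate, which is the same qualitative pattern described above.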

Forest plot

Finally, the plot that you’ve doubtlessly been waiting for: the forest plot, made with meta (Balduzzi, Rücker, and Schwarzer 2019).

Final thoughts

This has been a really fun exercise in running a meta-analysis, perhaps made easier as I had fewer analytic and coding decisions to make (as many of them had been made for me by Margoni and Surian (2018)). It took me about a week and change to perform the search, do the coding, and write up this report, which is rather speedy for a meta-analysis. Hopefully, this has been an informative look into the field of prosocial evaluation, as well as related methodological issues including behavioural versus looking time measures, the major author effect, and the effect of stimuli. Perhaps, too, it is a reaffirmation that meta-analyses can be an extremely useful tool to summarise evidence from a field and understand possible factors underlying observed variance. There are many other interesting things to study that this meta-analysis has prompted, but I have to admit that I’m not really a social development researcher, so I’ll leave those questions for my excellent colleagues to answer. For now, I’m glad that I have had the chance to practice my meta-analytic skills, and I hope you’ve found this investigation interesting too.

References

Balduzzi, Sara, Gerta Rücker, and Guido Schwarzer. 2019. “How to Perform a Meta-Analysis with R: A Practical Tutorial.” Evidence-Based Mental Health 22: 153–60.

Bergmann, Christina, Hugh Rabagliati, and Sho Tsuji. 2019. “What’s in a Looking Time Preference?” March 1, 2019. https://doi.org/10.31234/osf.io/6u453.

Bergmann, Christina, Sho Tsuji, Page E. Piccinini, Molly L. Lewis, Mika Braginsky, Michael C. Frank, and Alejandrina Cristia. 2018. “Promoting Replicability in Developmental Research Through Meta-analyses: Insights From Language Acquisition Research.” Child Development 89 (6): 1996–2009. https://doi.org/10.1111/cdev.13079.

Cao, Anjie, Molly L. Lewis, and Michael C. Frank. submitted. “A Synthesis of Early Cognitive and Language Development Using (Meta-)meta-Analysis.”

Chae, Joanna Joo Kyung, and Hyun-joo Song. 2018. “Negativity Bias in Infants’ Expectations about Agents’ Dispositions.” British Journal of Developmental Psychology 36 (4): 620–33. https://doi.org/10.1111/bjdp.12246.

Cooke, Alison, Debbie Smith, and Andrew Booth. 2012. “Beyond PICO: The SPIDER Tool for Qualitative Evidence Synthesis.” Qualitative Health Research 22 (10): 1435–43. https://doi.org/10.1177/1049732312452938.

Hamlin, J. Kiley, Karen Wynn, and Paul Bloom. 2007. “Social Evaluation by Preverbal Infants.” Nature 450 (7169): 557–59. https://doi.org/10.1038/nature06288.

Holvoet, Claire, Céline Scola, Thomas Arciszewski, and Delphine Picard. 2016. “Infants’ Preference for Prosocial Behaviors: A Literature Review.” Infant Behavior and Development 45 (November): 125–39. https://doi.org/10.1016/j.infbeh.2016.10.008.

Iverson, Erik, and Sara El-Shawa. 2023. Metalabr: R Package for Accessing MetaLab Data. Manual.

Kuhlmeier, Valerie, Karen Wynn, and Paul Bloom. 2003. “Attribution of Dispositional States by 12-Month-Olds.” Psychological Science 14 (5): 402–8. https://doi.org/10.1111/1467-9280.01454.

Lavoie, Jennifer, Aja L. Murray, Guy Skinner, and Emilia Janiczek. 2022. “Measuring Morality in Infancy: A Scoping Methodological Review.” Infant and Child Development 31 (3): e2298. https://doi.org/10.1002/icd.2298.

Lucca, Kelsey, Arthur Capelier-Mourguy, Laura Cirelli, Krista Byers-Heinlein, Rodrigo Dal Ben, Michael C. Frank, Annette M. E. Henderson, et al. 2021. “Infants’ Social Evaluation of Helpers and Hinderers: A Large-Scale, Multi-Lab, Coordinated Replication Study.” June 28, 2021. https://doi.org/10.31234/osf.io/qhxkm.

Margoni, Francesco, and Luca Surian. 2018. “Infants’ Evaluation of Prosocial and Antisocial Agents: A Meta-Analysis.” Developmental Psychology 54 (8): 1445–55. https://doi.org/10.1037/dev0000538.

Salvadori, Eliala, Tatiana Blazsekova, Agnes Volein, Zsuzsanna Karap, Denis Tatone, Olivier Mascaro, and Gergely Csibra. 2015. “Probing the Strength of Infants’ Preference for Helpers over Hinderers: Two Replication Attempts of Hamlin and Wynn (2011).” PLOS ONE 10 (11): e0140570. https://doi.org/10.1371/journal.pone.0140570.

Schlingloff, Laura, Gergely Csibra, and Denis Tatone. 2020. “Do 15-Month-Old Infants Prefer Helpers? A Replication of Hamlin et al. (2007).” Royal Society Open Science 7 (4): 191795. https://doi.org/10.1098/rsos.191795.

Viechtbauer, Wolfgang. 2010. “Conducting Meta-Analyses in R with the metafor Package.” Journal of Statistical Software 36 (3): 1–48. https://doi.org/10.18637/jss.v036.i03.

Footnotes

¹ A prettier version of this document can be found online on RPubs.