## Introduction

This study supports a model of optimal integration of multisensory stimuli. In a spatial localization task, the authors claim that visual and auditory spatial likelihood distributions are multiplicatively combined to form a posterior distribution over the location of the combined event. They claim that the weighting of the two input distributions is governed by their relative reliability, defined by the localization error of each stimulus (measured as the width of the Gaussian perceptual likelihood distribution of each unimodal stimulus). They show a reduction in uncertainty for bimodal cues compared to unimodal cues, in line with an optimally integrating observer. They also show that the percept of spatially discrepant bimodal cues falls between the two unimodal locations, pulled toward the more reliable cue in proportion to the cues' relative reliabilities.
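For reference, the optimal-integration (maximum-likelihood) account described above makes two quantitative predictions; in standard notation (the symbols are mine, not taken from the paper):

$$\hat{S}_{VA} = w_V\,\hat{S}_V + w_A\,\hat{S}_A,\qquad w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_A^2},\qquad w_A = 1 - w_V,$$

$$\sigma_{VA}^2 = \frac{\sigma_V^2\,\sigma_A^2}{\sigma_V^2 + \sigma_A^2} \le \min(\sigma_V^2,\ \sigma_A^2),$$

where $\hat{S}_V$ and $\hat{S}_A$ are the unimodal location estimates, $\sigma_V$ and $\sigma_A$ their localization uncertainties, and $\hat{S}_{VA}$ and $\sigma_{VA}$ the bimodal estimate and its uncertainty.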

## Methods

### Power Analysis

The authors fit psychometric functions to 2AFC task data in order to estimate the perceptual likelihood distribution of each stimulus, summarized by the mean and standard deviation (uncertainty) of the fitted function. Confidence intervals for the mean and uncertainty are determined via 500 bootstrap resamples of the original data, drawn with replacement. One-tailed t-tests on these bootstrapped values determine significance. In this project, I will be replicating the localization error thresholds (Figure 3 of the original paper).
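As a concrete illustration of this step, here is a minimal MATLAB sketch of the fit-and-bootstrap procedure, assuming a cumulative-Gaussian psychometric function fit via a probit GLM (glmfit/prctile from the Statistics Toolbox); the variable names are mine and this is not the original authors' code:

```matlab
% Minimal sketch: fit a cumulative-Gaussian psychometric function to 2AFC
% data and bootstrap its parameters. Illustrative only, not the original code.
% x     : per-trial probe offsets (deg), column vector
% respR : per-trial 0/1 responses ("probe judged rightward")
x = x(:); respR = respR(:);
nBoot = 500;
fitFn = @(xi, ri) glmfit(xi, ri, 'binomial', 'link', 'probit');  % [intercept; slope]

b     = fitFn(x, respR);
mu    = -b(1) / b(2);    % point of subjective equality (mean of the Gaussian)
sigma =  1    / b(2);    % localization uncertainty (SD of the Gaussian)

bootMu = zeros(nBoot, 1); bootSigma = zeros(nBoot, 1);
for iBoot = 1:nBoot
    idx = randi(numel(x), numel(x), 1);        % resample trials with replacement
    bb  = fitFn(x(idx), respR(idx));
    bootMu(iBoot)    = -bb(1) / bb(2);
    bootSigma(iBoot) =  1     / bb(2);
end
ci95 = prctile([bootMu bootSigma], [2.5 97.5]);  % bootstrap CIs for mean and SD
```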

### Planned Sample

Five subjects with normal or corrected-to-normal vision and hearing, recruited through Stanford Psychology’s SONA participant pool.

### Materials

Visual stimuli are presented on a 60 Hz Vpixx monitor. Auditory stimuli are delivered through over-ear headphones. Gaussian blob visual stimuli are created with mgl (Gardner, 2018) and MATLAB. Auditory clicks are localized with head-related transfer functions from the MIT Media Lab.

### Procedure

“Observers were required to localize in space brief light “blobs” or sound “clicks,” presented either separately (unimodally) or together (bimodally). In a given trial, two sets of stimuli were presented successively (separated by a 500 ms pause) and observers were asked to indicate which appeared to be more to the left, guessing if unsure. The visual stimuli were low-contrast (10%) Gaussian blobs of various widths, back-projected for 15 ms onto a translucent perspex screen (80 × 110 cm) subtending 90° × 108° when viewed binocularly from 40 cm. The auditory stimuli were brief (1.5 ms) clicks presented through two visible high-quality speakers at the edge of the screen, with the apparent position of the sound controlled by interaural time differences.

“With unimodal thresholds established, we then measured localization for a blob and click presented simultaneously. Observers were asked to envisage each presentation as a single event, like a ball hitting the screen, and report which of the two presentations was more leftward. For one presentation (randomly first or second), the visual and auditory stimuli were in conflict, with the visual stimulus displaced Δ degrees rightward and the auditory stimulus displaced Δ degrees leftward (SV – SA = 2Δ, where SV and SA are the spatial positions of the visual and auditory stimuli). The average position of this stimulus was always zero, as in previous studies [9]. On the other (non-conflict) presentation, the two modalities were covaried to the left or right of center by the amount shown in the abscissa (with positive meaning rightward)” (Alais & Burr, 2004).

### Analysis Plan

Fit psychometric curves to the data. Obtain confidence intervals from a 500-repetition bootstrap. Use a one-tailed t-test to compare the measured estimates to the model predictions (single-value mean and variance measures). I will do this for each individual subject, as well as for aggregated data, as in the original paper.
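For concreteness, a minimal sketch of the comparison against the optimal-integration prediction, assuming the unimodal uncertainties and bootstrapped bimodal estimates from the step above; again, the names are illustrative and this is not the original analysis code:

```matlab
% Sketch: is the measured bimodal uncertainty reliably worse (larger) than
% the optimal-integration prediction? Variable names are illustrative.
% sigV, sigA  : fitted unimodal localization uncertainties (visual, auditory)
% bootSigmaVA : bootstrapped estimates of the bimodal localization uncertainty
sigPred = sqrt((sigV^2 * sigA^2) / (sigV^2 + sigA^2));   % optimal prediction

% One-tailed, one-sample t-test of the bootstrapped bimodal estimates
% against the single-valued model prediction
[~, p] = ttest(bootSigmaVA, sigPred, 'Tail', 'right');
```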

I will also test alternative models to see whether they, too, pass (or fail) the same significance tests. In addition, I will evaluate the models via a model-based approach, comparing the models’ abilities to explain the trial-level data rather than comparing summary statistics of the fitted distributions.
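To be explicit about the model-based criterion (a standard choice on my part, not something specified in the original paper): for a model with $k$ free parameters and maximized likelihood $\hat{L}$ over the trial-level responses,

$$\mathrm{AIC} = 2k - 2\ln\hat{L},$$

and candidate models are ranked by their AIC differences, with lower values indicating a better trade-off between fit and complexity.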

### Differences from Original Study

We will display the Gaussian stimuli on noisy backgrounds to make the visual condition slightly harder, and we will use smaller stimuli due to screen constraints. We will test three offset conditions (as opposed to five) in order to reduce the time it takes to collect full data sets. Our sample will also differ (Stanford undergraduates, and a different sample size).

### Actual Sample

3 undergraduates from the Stanford SONA pool with normal or corrected-to-normal vision and hearing.

### Differences from pre-data collection methods plan

We displayed our stimuli on noisy backgrounds to make the task more difficult, and used smaller stimulus sizes to fit our screen.

## Results

### Data preparation

Data preparation followed the analysis plan.

See ‘alaisReplicationAnalysis.m’ for the analysis code; all analyses were done in MATLAB.

### Confirmatory analysis

The analyses were conducted as specified in the analysis plan.

[Figure: original analyses.]

[Figure: my data.]

### Exploratory analyses

I implemented a model-based comparison against alternative models. Blue and green points are individual subjects, and the bar graph is the average. Values are differences in AIC scores relative to a null model with permuted stimulus labels. The original model does not perform as well as other candidate models, a difference we would not have detected with the original analysis. I excluded one subject who was aware of the purpose of the study and whose performance may have been affected in the discrepant condition (this subject was included in the confirmatory analysis above, where we did not analyze this specific condition).

[Figure: model comparison.]
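A minimal sketch of the ΔAIC computation described above, assuming a hypothetical helper negLogLikFn that returns the negative log likelihood of the trial-level responses under a given model (the model and data fields are placeholders, not the contents of ‘alaisReplicationAnalysis.m’):

```matlab
% Sketch of the Delta-AIC comparison against a permuted-label null model.
% negLogLikFn, model, and the data fields are hypothetical placeholders.
aicFn = @(nll, k) 2*k + 2*nll;                  % AIC = 2k - 2*ln(L)

nllModel = negLogLikFn(model, data);            % candidate model, real labels
aicModel = aicFn(nllModel, model.nParams);

dataNull = data;                                % null: permute stimulus labels
dataNull.stimLabel = data.stimLabel(randperm(numel(data.stimLabel)));
nllNull = negLogLikFn(model, dataNull);
aicNull = aicFn(nllNull, model.nParams);

deltaAIC = aicNull - aicModel;                  % > 0 favors the candidate model
```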

## Discussion

### Summary of Replication Attempt

The original study found that the measured localization errors did not deviate significantly from the optimal integration model predictions in any subject (P > .2 for all 6 subjects). I had a different result: all of the subjects I ran had P values below .2. One subject differed clearly and significantly from the model prediction (P < .01), one was marginal (P = .05), and one did not differ at conventional significance levels (P = .18). I would consider this a partial replication.

### Commentary

Generally, I don’t think this is the appropriate way to compare behavioral models. Looking for statistical significance between likelihood parameters is a poor way to arbitrate between different accounts of behavior. A more appropriate path forward is model comparison: comparing the likelihood that different models produced the data. I did some exploration with that approach and was fairly easily able to find better-fitting models than the one proposed in the paper. The original analyses would not have been able to arbitrate between these models in the same way. I hesitate to comment on the “replicability” of this project, because I think the initial analysis is simply the wrong thing to do.