Experiment 1: Director-Matcher Task
Methods
For this experiment, we recruited 15 native speakers of Xhosa and 15 native speakers of Afrikaans. Each participant described 25 sung tone pairs to a confederate matcher.
Hypotheses
- Afrikaans speakers will mainly use the ‘height’ metaphor in speech.
- Xhosa speakers will use both ‘height’ and ‘size’ flexibly.
- Based on previous findings, we expect gestures accompanying ‘height’ to consistently converge with the spatial mappings invoked by expressions like “high” and “low”.
- Conversely, we expect gestures accompanying ‘size’ in speech to indicate ‘size’ to a lesser extent, and sometimes reveal spatial mappings consistent with ‘height’ metaphors.
Gesture frequency
In the table and figure below, we see that Xhosa speakers gestured more when using metaphors in speech. Within the two groups, the by-metaphor gesture rates were comparable.
Speech-gesture convergence
Here, we only consider the spatial metaphors ‘height’ and ‘size’.
Gestures were coded for dimension (in terms of movement and location) and handshape (i.e. flat hand, “grip”). Speech-gesture pairs were then coded as “yes” (convergent), “no” (divergent, e.g. ‘size’ in speech with vertical gestures), “mixed” (both ‘height’ and ‘size’ mappings expressed in gesture) or “n/a” (gestures not clearly expressing spatial mappings).

Summary
- We see that in both languages, the ‘height’ metaphor is consistently accompanied by gestures indicating a vertical space-pitch mapping.
- As predicted, this was also the case with ‘size’ metaphors in Xhosa, which were mostly accompanied by similar gestures expressing verticality.
Experiment 2: Implicit Associations Task
Methods
For this experiment, we recruited 30 native speakers of Xhosa and 30 native speakers of Afrikaans.
Participants performed an RT task targeting implicit space-pitch associations by pairing circles differing in vertical position (high/low, height condition) or size (small/big, size condition) with a high- or low-pitched voice. Participants were asked to indicate with button presses whether the sound in each trial was high- or low-pitched.
There were 16 blocks (8 for each condition), with 20 trials in each (320 trials per participant). The order of blocks was randomised.
Stimulus pairs were presented for 200 msec, and participants used button presses to indicate whether the sound was high- or low-pitched.
Half of the stimulus pairs were “incongruent” in terms of the space-pitch mapping.
Hypotheses
In experiment one, Afrikaans speakers described pitch in terms of ‘height’, whereas Xhosa speakers used ‘size’ in addition to ‘height’. Based on these findings and previous work, we would expect the following:
- Afrikaans speakers’ RTs would be slower (in response to incongruent stimuli) in the ‘height’ condition than in the ‘size’ condition.
- Xhosa speakers’ RTs would be slower in response to incongruent stimuli in both conditions.
We thus expected a three-way interaction effect between language, condition and congruence.
Trimming and filtering the data
RT data are generally difficult to handle for a number of reasons, and the literature proposes a number of procedures to trim and filter the data prior to statistical analyses. In the next sections, we go through each step discussed in the relevant literature.
Accuracy
Ideally, we would want to remove data from participants performing at chance level by pressing buttons at random. To my knowledge, there isn’t a specific threshold for accuracy that’s widely agreed upon for this or similar tasks. However, inspecting the distribution of overall accuracies for each participant, we see that a few participants (n=5) clearly stand out from the rest. Data from these participants are left out in the later analyses.
Another method, used by Abutalebi et al., is to compute confidence intervals and set the threshold at the lower CI bound for each language group. I tried this and found that it would have excluded data from a further four participants.
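The two accuracy criteria can be illustrated with a minimal sketch. It assumes that the CI is a normal-approximation 95% interval around the group mean accuracy, which is only one reading of the procedure; the function names and the toy data are hypothetical and not taken from the analysis itself.

```python
from math import sqrt
from statistics import mean, stdev


def accuracy_by_participant(trials):
    """Map (participant, correct) pairs to {participant: proportion correct}."""
    totals = {}
    for pid, correct in trials:
        hits, n = totals.get(pid, (0, 0))
        totals[pid] = (hits + int(correct), n + 1)
    return {pid: hits / n for pid, (hits, n) in totals.items()}


def lower_ci_bound(accuracies, z=1.96):
    """Lower bound of a normal-approximation 95% CI around the group mean (assumed)."""
    values = list(accuracies)
    return mean(values) - z * stdev(values) / sqrt(len(values))


# Hypothetical toy data: three accurate participants and one close to chance.
trials = (
    [("p1", True)] * 19 + [("p1", False)] * 1
    + [("p2", True)] * 18 + [("p2", False)] * 2
    + [("p3", True)] * 19 + [("p3", False)] * 1
    + [("p4", True)] * 11 + [("p4", False)] * 9
)

acc = accuracy_by_participant(trials)
threshold = lower_ci_bound(acc.values())
excluded = sorted(pid for pid, a in acc.items() if a < threshold)
print(acc, round(threshold, 3), excluded)
```

In a real analysis the threshold would be computed separately within each language group, as described above.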

Setting an upper RT threshold
In the distribution of the response times below, we see a very long right tail. The slowest response is 28 seconds!

Setting an upper threshold will affect RT estimates, but leaving such extreme values in would distort the estimates even more and make them unreliable.
I therefore propose “mild” initial trimming of the data, excluding responses slower than five seconds, as indicated in the figures below.

I’ve also checked for responses that are faster than 100 msec (which are generally considered to be errors), but found none. The fastest recorded RT is 160 msec.
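As a minimal sketch of this first trimming step, the function below keeps responses between 100 and 5000 msec and reports how many were dropped. The function name and the example RTs are hypothetical.

```python
def trim_absolute(rts_ms, lower=100, upper=5000):
    """Keep RTs (in msec) within [lower, upper]; return kept values and number removed."""
    kept = [rt for rt in rts_ms if lower <= rt <= upper]
    return kept, len(rts_ms) - len(kept)


# Hypothetical RTs, including two extreme slow responses.
rts = [160, 480, 612, 950, 5400, 28000]
kept, n_removed = trim_absolute(rts)
print(kept, n_removed)
```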
Individual thresholds
There appears to be wide agreement that individual thresholds should be set based on the overall mean and standard deviation for each participant, but authors advocate different levels. Baayen & Milin argue that the frequently used limit of 2 (perhaps also 2.5) standard deviations is too aggressive and propose an upper limit of 3 standard deviations above the mean, coupled with minimal trimming based on residuals after fitting a model. The latter is essentially what they call “model criticism”.
Following this suggestion, we set the threshold for individual RTs at three SDs above individual means.
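The individual threshold can be sketched as follows. This is an illustration with hypothetical names and toy data, assuming the sample standard deviation is used; it only removes values above the upper cutoff, as in the text.

```python
from statistics import mean, stdev


def trim_by_participant(rts_by_participant, n_sd=3.0):
    """Drop RTs above each participant's own mean + n_sd * SD."""
    trimmed = {}
    for pid, rts in rts_by_participant.items():
        cutoff = mean(rts) + n_sd * stdev(rts)
        trimmed[pid] = [rt for rt in rts if rt <= cutoff]
    return trimmed


# Hypothetical data: p1 has one very slow response, p2 has none.
data = {
    "p1": [550 + 10 * i for i in range(19)] + [3000],
    "p2": [600, 620, 640, 660, 680],
}
trimmed = trim_by_participant(data)
print({pid: len(rts) for pid, rts in trimmed.items()})
```

Note that with very few trials per participant a single outlier inflates the SD so much that it can never exceed the cutoff, so this criterion only makes sense with a reasonable number of trials.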
The table below indicates the amount of data removed in each step. The amount of discarded data seems reasonable and in line with what is generally considered acceptable. Note that the majority of the discarded data is due to poor performance.
Statistical approaches
The classical central tendency approach to analysing RTs is using ANOVAs on the by-participant aggregated means. However, this technique assumes normally distributed data, in which case participant means would offer a reliable summary of the data. The problem is that RTs rarely follow a normal distribution, but rather a positively skewed distribution resembling an ex-Gaussian distribution characterized by a long right tail. As the below plot shows, this is also the case here.

To deal with this issue, we’ll try out the following approaches:
- The Ex-Gaussian approach with separate analyses for the \(\mu\) and \(\tau\) parameters in RT distributions
- Linear regression/ANOVA on the conflict effect and coefficient of variability
- Mixed-effects linear regression with model criticism, checking for autocorrelated RT lags.
The ex-Gaussian approach
In this approach, we compute the three parameters, \(\mu\) (mu), \(\sigma\) (sigma) and \(\tau\) (tau), that describe an ex-Gaussian distribution, which itself is a convolution of a normal and an exponential distribution. The idea is that potentially interesting effects may hide in the long right tail (\(\tau\) component), which is ignored in standard central tendency tests. \(\mu\) and \(\sigma\) reflect the mean and standard deviation of the Gaussian component, whereas \(\tau\) reflects both the mean and the standard deviation of the exponential component.
This procedure has been used e.g. by Abutalebi et al. and Calabria et al.
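To make the three parameters concrete, here is a rough sketch that recovers them from a sample by the method of moments. This is not necessarily how the parameters in this report were estimated (maximum likelihood is the more common choice); it relies only on the known ex-Gaussian moments: mean \(\mu + \tau\), variance \(\sigma^2 + \tau^2\) and third central moment \(2\tau^3\). The function name and the simulated data are hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def exgauss_moments(rts):
    """Method-of-moments estimates (mu, sigma, tau) for an ex-Gaussian sample."""
    n = len(rts)
    m = mean(rts)
    m2 = sum((x - m) ** 2 for x in rts) / n   # second central moment
    m3 = sum((x - m) ** 3 for x in rts) / n   # third central moment
    if m3 <= 0:
        raise ValueError("sample is not positively skewed")
    tau = (m3 / 2) ** (1 / 3)                 # since m3 = 2 * tau^3
    sigma_sq = m2 - tau ** 2                  # since m2 = sigma^2 + tau^2
    if sigma_sq <= 0:
        raise ValueError("moment estimates are inconsistent for this sample")
    return m - tau, sqrt(sigma_sq), tau


# Simulated RTs: Gaussian(600, 100) plus exponential with mean 100 (hypothetical values).
rng = random.Random(1)
sample = [rng.gauss(600, 100) + rng.expovariate(1 / 100) for _ in range(20000)]
mu, sigma, tau = exgauss_moments(sample)
print(round(mu), round(sigma), round(tau))
```

Moment estimates are noisy for small samples because they depend on the third moment, which is one reason likelihood-based fitting is usually preferred for per-participant data.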
The figure below shows the distributions of the three parameters.

We already know that the grouped distributions appear to follow ex-Gaussian distributions, but we can go further and inspect the distributions for each participant in the figure below, ordered by mean RT.

From this, it is perhaps less clear that the ex-Gaussian distribution provides the best fit for our data.
We can quantify this by creating simulated data based on the aggregated normal and ex-Gaussian parameters for each participant and then performing Kolmogorov-Smirnov tests to see whether the real and simulated distributions are significantly different.
The p-values in the below figure tell us how many participants follow an ex-Gaussian vs. a normal distribution. There are more values above .05 in the ex-Gaussian panel. We therefore conclude that this distribution provides a better fit for the majority of our data.
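The logic of this comparison can be sketched for a single hypothetical participant. Instead of p-values, the sketch compares the two-sample Kolmogorov-Smirnov statistic D against the usual large-sample critical value at \(\alpha = .05\), \(1.358\sqrt{(n+m)/(nm)}\). All data are simulated, and the ex-Gaussian parameters are simply assumed to be known rather than fitted.

```python
import random
from math import sqrt
from statistics import mean, stdev


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_critical(n, m, c_alpha=1.358):
    """Large-sample critical value for D (c_alpha = 1.358 corresponds to alpha = .05)."""
    return c_alpha * sqrt((n + m) / (n * m))


rng = random.Random(42)
n = 2000
# Hypothetical "observed" RTs drawn from an ex-Gaussian(600, 50, 200).
observed = [rng.gauss(600, 50) + rng.expovariate(1 / 200) for _ in range(n)]
obs_mean, obs_sd = mean(observed), stdev(observed)
sim_normal = [rng.gauss(obs_mean, obs_sd) for _ in range(n)]
sim_exgauss = [rng.gauss(600, 50) + rng.expovariate(1 / 200) for _ in range(n)]

d_normal = ks_statistic(observed, sim_normal)
d_exgauss = ks_statistic(observed, sim_exgauss)
critical = ks_critical(n, n)
print(round(d_normal, 3), round(d_exgauss, 3), round(critical, 3))
```

A D value above the critical value corresponds to p < .05, i.e. the simulated distribution differs significantly from the observed one.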

Before fitting models to our data, we can inspect the following plots which give us a better idea of possible interactions between factors within each parameter. There’s no clear evidence for interactions in the estimated parameters, but there might be main effects of language.

The \(\mu\) parameter
Fitting a full model with all predictors and interactions reveals a marginally significant effect of language (p = .059), i.e. the Xhosa group tended to have shorter RTs in the Gaussian component. No effect reached significance at the .05 level.
Call:
lm(formula = mu ~ language * condition * congruent, data = params)
Residuals:
Min 1Q Median 3Q Max
-131.82 -50.05 -13.16 51.48 253.78
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 592.737 13.468 44.010 <2e-16 ***
languageXhosa -37.173 19.588 -1.898 0.0591 .
conditionsize -13.829 19.047 -0.726 0.4686
congruentTRUE -21.541 19.047 -1.131 0.2593
languageXhosa:conditionsize 3.375 27.702 0.122 0.9031
languageXhosa:congruentTRUE 18.597 27.702 0.671 0.5027
conditionsize:congruentTRUE 27.784 26.936 1.031 0.3035
languageXhosa:conditionsize:congruentTRUE -41.386 39.177 -1.056 0.2920
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 72.53 on 212 degrees of freedom
Multiple R-squared: 0.07613, Adjusted R-squared: 0.04562
F-statistic: 2.496 on 7 and 212 DF, p-value: 0.01751
The \(\tau\) parameter
As with the \(\mu\) parameter, fitting a full model with all predictors and interactions reveals only a marginally significant effect of language (p = .095).
Call:
lm(formula = tau ~ language * condition * congruent, data = params)
Residuals:
Min 1Q Median 3Q Max
-170.61 -72.48 -20.88 52.21 468.38
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 147.355 19.775 7.451 2.31e-12 ***
languageXhosa 48.309 28.762 1.680 0.0945 .
conditionsize 8.576 27.966 0.307 0.7594
congruentTRUE 7.628 27.966 0.273 0.7853
languageXhosa:conditionsize 7.118 40.675 0.175 0.8613
languageXhosa:congruentTRUE -8.017 40.675 -0.197 0.8439
conditionsize:congruentTRUE -11.452 39.550 -0.290 0.7724
languageXhosa:conditionsize:congruentTRUE 15.138 57.524 0.263 0.7927
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 106.5 on 212 degrees of freedom
Multiple R-squared: 0.06082, Adjusted R-squared: 0.02981
F-statistic: 1.961 on 7 and 212 DF, p-value: 0.06172
In sum, we found that Xhosa speakers produced marginally smaller \(\mu\). Otherwise, splitting the Gaussian and exponential components in the RTs did not allow us to identify effects that might have been hiding in the right tails.
Calabria et al. also ran correlation analyses on the \(\mu\) and \(\tau\) parameters of their groups. I’m not completely sure if this is truly interesting or relevant, but doing so yields a small, but significant negative correlation of \(\mu\) and \(\tau\) for Afrikaans speakers and no correlation for Xhosa speakers.
Conflict effect
We can also think of the dependent variable as the mean difference in RTs between congruent and incongruent trials, which Calabria et al. call the conflict effect.
Below we’ll plot the conflict effect and run the analysis. Note that the congruence variable is contained in the dependent variable, so we fit a model with language and condition as the only factors.
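As a small illustration of how this dependent variable is built, the sketch below computes one conflict score per participant and condition. The direction of the subtraction (incongruent minus congruent) is an assumption, since the text does not specify it, and the function name and data are hypothetical.

```python
from statistics import mean


def conflict_effect(trials):
    """Map (participant, condition, congruent, rt) rows to
    {(participant, condition): mean incongruent RT - mean congruent RT}."""
    cells = {}
    for pid, condition, congruent, rt in trials:
        cell = cells.setdefault((pid, condition), {True: [], False: []})
        cell[congruent].append(rt)
    return {key: mean(c[False]) - mean(c[True]) for key, c in cells.items()}


# Hypothetical trials for one participant.
trials = [
    ("p1", "height", True, 600), ("p1", "height", True, 620),
    ("p1", "height", False, 650), ("p1", "height", False, 670),
    ("p1", "size", True, 610), ("p1", "size", False, 600),
]
effects = conflict_effect(trials)
print(effects)
```

Under this sign convention, positive values mean slower responses to incongruent stimuli.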

Call:
lm(formula = conflict ~ language * condition, data = conflict_df)
Residuals:
Min 1Q Median 3Q Max
-108.086 -22.213 1.279 16.963 144.360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 13.913 7.616 1.827 0.0706 .
languageXhosa -10.580 11.078 -0.955 0.3417
conditionsize -16.332 10.771 -1.516 0.1324
languageXhosa:conditionsize 26.248 15.666 1.675 0.0968 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 41.02 on 106 degrees of freedom
Multiple R-squared: 0.02899, Adjusted R-squared: 0.001504
F-statistic: 1.055 on 3 and 106 DF, p-value: 0.3716
There seems to be too much variability to detect any significant effects, though the interaction lines suggest opposite trends, which is reflected in the marginally significant interaction effect between language and condition (p = .097).
Coefficient of variability
It might be interesting to explore whether there are patterns in the variability of response latencies. We can compute the coefficient of variability for each participant by dividing the individual SDs by the means.
We plot the data and fit a linear model as before, this time with language, condition and congruence as predictors.
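The coefficient itself is a one-line computation; a minimal sketch with hypothetical RTs, assuming the sample standard deviation:

```python
from statistics import mean, stdev


def coefficient_of_variation(rts):
    """SD divided by mean: a unit-free measure of RT variability."""
    return stdev(rts) / mean(rts)


# Hypothetical RTs for one participant in one cell of the design.
cv = coefficient_of_variation([500, 600, 700])
print(round(cv, 3))
```

Because it is scaled by the mean, the coefficient lets us compare variability between participants who differ in overall speed.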

Call:
lm(formula = lm(coef_var ~ language * condition * congruent,
data = params))
Residuals:
Min 1Q Median 3Q Max
-0.18572 -0.06823 -0.00593 0.04981 0.35430
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.2329422 0.0178153 13.075 <2e-16 ***
languageXhosa 0.0418901 0.0259112 1.617 0.107
conditionsize 0.0014727 0.0251946 0.058 0.953
congruentTRUE 0.0020427 0.0251946 0.081 0.935
languageXhosa:conditionsize 0.0141439 0.0366440 0.386 0.700
languageXhosa:congruentTRUE 0.0037418 0.0366440 0.102 0.919
conditionsize:congruentTRUE -0.0067914 0.0356306 -0.191 0.849
languageXhosa:conditionsize:congruentTRUE 0.0007974 0.0518224 0.015 0.988
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09594 on 212 degrees of freedom
Multiple R-squared: 0.07045, Adjusted R-squared: 0.03975
F-statistic: 2.295 on 7 and 212 DF, p-value: 0.02833
From the regression output, we see that no effects reached significance.
Mixed-effects regression
The downside of the previous analyses is that they require the data to be aggregated. For this analysis, we’ll follow Baayen & Milin’s suggestions and do the following:
- Compare distributions and transformations of the RTs
- Fit models with different random structures
- Apply model criticism and refit models
- Account for autocorrelation in RTs
Distributions
We’ll compare and determine whether to use untransformed RTs, a log or an inverse Gaussian transformation.
The values for skewness and kurtosis in the table below suggest that all options result in skewed and kurtotic distributions, but with considerable improvements with transformations.
However, these measures are known to be unreliable with larger samples (n > 200).
Instead, we’ll inspect quantile-quantile plots for the goodness of fit of theoretical distributions. Also shown are the correlation coefficients of the observed and theoretical distributions.
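The comparison can be sketched as follows on simulated data. The inverse transformation is assumed here to be 1000/RT, which may differ from the scaling actually used for `RTinv`; the helper names and the simulated RTs are hypothetical. The QQ correlation is the Pearson correlation between the sorted data and the corresponding standard normal quantiles.

```python
import random
from math import log, sqrt
from statistics import NormalDist, mean


def skewness(xs):
    """Moment-based sample skewness."""
    m, n = mean(xs), len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def qq_correlation(xs):
    """Pearson r between sorted data and standard normal quantiles."""
    n = len(xs)
    ys = sorted(xs)
    std_normal = NormalDist()
    qs = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    my, mq = mean(ys), mean(qs)
    cov = sum((y - my) * (q - mq) for y, q in zip(ys, qs))
    var_y = sum((y - my) ** 2 for y in ys)
    var_q = sum((q - mq) ** 2 for q in qs)
    return cov / sqrt(var_y * var_q)


# Simulated, positively skewed RTs in msec (hypothetical parameters).
rng = random.Random(7)
rts = [rng.gauss(550, 60) + rng.expovariate(1 / 200) for _ in range(2000)]
candidates = {
    "raw": rts,
    "log": [log(rt) for rt in rts],
    "inverse": [1000 / rt for rt in rts],  # assumed form of the inverse transformation
}
for name, values in candidates.items():
    print(name, round(skewness(values), 2), round(qq_correlation(values), 3))
```

A QQ correlation closer to 1 indicates that the transformed RTs line up better with a normal distribution.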

Based on the output, we proceed with the inverse Gaussian transformation. Later model criticism will further improve the goodness of fit.
Below we see QQ-plots for individual participants. It is clear that both groups have a few participants deviating from the expected pattern causing later points to rise above the line in the center panel above.

Before fitting regression models, we’ll plot the data grouped by our independent variables.

Random intercept model
We first fit a random intercept model. From the output we can observe a significant three-way interaction between language, condition and congruence (p = .02).
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: RTinv ~ language * condition * congruent + (1 | participant) + (1 | item)
Data: df
REML criterion at convergence: 10686.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.2278 -0.5713 0.0162 0.5620 14.7918
Random effects:
Groups Name Variance Std.Dev.
participant (Intercept) 4.128e-02 0.203168
item (Intercept) 5.032e-05 0.007093
Residual 1.103e-01 0.332078
Number of obs: 16392, groups: participant, 55; item, 8
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.457e+00 3.872e-02 5.675e+01 37.633 <2e-16 ***
languageXhosa 7.514e-03 5.587e-02 5.593e+01 0.134 0.8935
conditionsize 1.229e-02 1.227e-02 8.310e+00 1.001 0.3451
congruentTRUE 2.619e-02 1.233e-02 8.450e+00 2.125 0.0645 .
languageXhosa:conditionsize -1.786e-02 1.482e-02 1.633e+04 -1.205 0.2282
languageXhosa:congruentTRUE -1.666e-02 1.482e-02 1.633e+04 -1.125 0.2607
conditionsize:congruentTRUE -3.397e-02 1.739e-02 8.361e+00 -1.954 0.0849 .
languageXhosa:conditionsize:congruentTRUE 4.828e-02 2.085e-02 1.633e+04 2.316 0.0206 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) lnggXh cndtns cnTRUE lnggX: lX:TRU c:TRUE
languageXhs -0.681
conditionsz -0.159 0.074
congrntTRUE -0.159 0.074 0.502
lnggXhs:cnd 0.088 -0.133 -0.552 -0.278
lnggXh:TRUE 0.088 -0.133 -0.279 -0.556 0.502
cndtns:TRUE 0.113 -0.052 -0.707 -0.709 0.390 0.395
lnggX::TRUE -0.063 0.095 0.393 0.396 -0.712 -0.711 -0.556
We then calculate \(R^2\) indicating how well the model fits our data.
[1] 0.2670519
We then apply “model criticism” following Baayen & Milin’s recommendations. This means minimal trimming of standardized residuals above 2.5.
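The trimming step can be sketched independently of the modelling software. The sketch assumes that residuals are centred and scaled by their standard deviation and that observations are dropped when the absolute standardized residual exceeds 2.5; the function name and the toy values are hypothetical.

```python
from statistics import mean, stdev


def model_criticism_mask(observed, fitted, limit=2.5):
    """Return True for observations whose standardized residual is within +/- limit."""
    residuals = [o - f for o, f in zip(observed, fitted)]
    centre, spread = mean(residuals), stdev(residuals)
    return [abs((r - centre) / spread) <= limit for r in residuals]


# Hypothetical fitted values and observations: 19 small residuals, one large one.
fitted = [1.45] * 20
observed = [1.45 + r for r in [0.1, -0.1] * 9 + [0.1, 2.0]]
keep = model_criticism_mask(observed, fitted)
print(sum(keep), "of", len(keep), "observations kept")
```

The rows flagged `True` are the ones passed on to the refitted model.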
We then refit the model and see that the interaction is still significant.
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: RTinv ~ language * condition * congruent + (1 | participant) + (1 | item)
Data: df2
REML criterion at convergence: 6830.4
Scaled residuals:
Min 1Q Median 3Q Max
-2.87973 -0.61361 0.03524 0.63555 2.98165
Random effects:
Groups Name Variance Std.Dev.
participant (Intercept) 4.243e-02 0.20599
item (Intercept) 8.227e-05 0.00907
Residual 8.754e-02 0.29587
Number of obs: 16173, groups: participant, 55; item, 8
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.448e+00 3.930e-02 5.688e+01 36.836 < 2e-16 ***
languageXhosa 9.256e-03 5.642e-02 5.529e+01 0.164 0.87030
conditionsize 1.180e-02 1.275e-02 6.666e+00 0.925 0.38712
congruentTRUE 2.993e-02 1.279e-02 6.741e+00 2.340 0.05319 .
languageXhosa:conditionsize -1.645e-02 1.329e-02 1.611e+04 -1.238 0.21588
languageXhosa:congruentTRUE -2.742e-02 1.328e-02 1.611e+04 -2.064 0.03906 *
conditionsize:congruentTRUE -3.451e-02 1.806e-02 6.695e+00 -1.911 0.09947 .
languageXhosa:conditionsize:congruentTRUE 5.753e-02 1.871e-02 1.611e+04 3.075 0.00211 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) lnggXh cndtns cnTRUE lnggX: lX:TRU c:TRUE
languageXhs -0.678
conditionsz -0.163 0.056
congrntTRUE -0.162 0.056 0.501
lnggXhs:cnd 0.077 -0.118 -0.474 -0.239
lnggXh:TRUE 0.078 -0.118 -0.240 -0.478 0.501
cndtns:TRUE 0.115 -0.040 -0.707 -0.709 0.336 0.339
lnggX::TRUE -0.055 0.084 0.338 0.340 -0.712 -0.710 -0.478
The \(R^2\) value now indicates a considerably better fit.
[1] 0.3208119
Random intercepts and slopes
We’ll also try fitting a model with “maximal random structure” including both random intercepts and slopes as is allowed by the design. We see that the interaction effect is retained in the model.
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: RTinv ~ language * condition * congruent + (1 + condition + congruent | participant) + (1 | item)
Data: df
REML criterion at convergence: 10603.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.2198 -0.5668 0.0171 0.5619 14.6442
Random effects:
Groups Name Variance Std.Dev. Corr
participant (Intercept) 4.357e-02 0.208723
conditionsize 3.872e-03 0.062228 -0.21
congruentTRUE 1.397e-03 0.037370 -0.10 -0.09
item (Intercept) 5.222e-05 0.007226
Residual 1.090e-01 0.330184
Number of obs: 16392, groups: participant, 55; item, 8
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.456e+00 3.973e-02 5.474e+01 36.649 <2e-16 ***
languageXhosa 7.739e-03 5.733e-02 5.387e+01 0.135 0.8931
conditionsize 1.324e-02 1.689e-02 2.359e+01 0.784 0.4408
congruentTRUE 2.662e-02 1.419e-02 1.354e+01 1.876 0.0823 .
languageXhosa:conditionsize -1.772e-02 2.236e-02 8.650e+01 -0.792 0.4303
languageXhosa:congruentTRUE -1.545e-02 1.788e-02 1.211e+02 -0.864 0.3891
conditionsize:congruentTRUE -3.440e-02 1.744e-02 8.261e+00 -1.972 0.0829 .
languageXhosa:conditionsize:congruentTRUE 4.615e-02 2.076e-02 1.626e+04 2.223 0.0262 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) lnggXh cndtns cnTRUE lnggX: lX:TRU c:TRUE
languageXhs -0.682
conditionsz -0.253 0.148
congrntTRUE -0.184 0.095 0.289
lnggXhs:cnd 0.162 -0.239 -0.617 -0.136
lnggXh:TRUE 0.109 -0.163 -0.143 -0.588 0.235
cndtns:TRUE 0.110 -0.050 -0.516 -0.619 0.255 0.323
lnggX::TRUE -0.061 0.092 0.284 0.342 -0.470 -0.586 -0.552
\(R^2\):
[1] 0.2783126
We apply model criticism and see that the interaction is still significant.
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: RTinv ~ language * condition * congruent + (1 + condition + congruent | participant) + (1 | item)
Data: df3
REML criterion at convergence: 6807.6
Scaled residuals:
Min 1Q Median 3Q Max
-2.81499 -0.60413 0.03325 0.63064 2.97502
Random effects:
Groups Name Variance Std.Dev. Corr
participant (Intercept) 4.457e-02 0.211109
conditionsize 3.611e-03 0.060095 -0.23
congruentTRUE 8.312e-04 0.028831 -0.12 0.22
item (Intercept) 9.442e-05 0.009717
Residual 8.685e-02 0.294711
Number of obs: 16180, groups: participant, 55; item, 8
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 1.446e+00 4.030e-02 5.544e+01 35.868 < 2e-16 ***
languageXhosa 1.064e-02 5.778e-02 5.369e+01 0.184 0.854607
conditionsize 1.499e-02 1.729e-02 1.695e+01 0.867 0.398196
congruentTRUE 3.141e-02 1.428e-02 8.598e+00 2.199 0.056791 .
languageXhosa:conditionsize -2.008e-02 2.096e-02 8.247e+01 -0.958 0.340960
languageXhosa:congruentTRUE -2.633e-02 1.537e-02 1.354e+02 -1.713 0.089030 .
conditionsize:congruentTRUE -3.750e-02 1.869e-02 6.429e+00 -2.006 0.088469 .
languageXhosa:conditionsize:congruentTRUE 6.175e-02 1.866e-02 1.605e+04 3.309 0.000937 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) lnggXh cndtns cnTRUE lnggX: lX:TRU c:TRUE
languageXhs -0.677
conditionsz -0.268 0.140
congrntTRUE -0.197 0.080 0.408
lnggXhs:cnd 0.165 -0.246 -0.564 -0.179
lnggXh:TRUE 0.107 -0.161 -0.202 -0.499 0.360
cndtns:TRUE 0.116 -0.037 -0.540 -0.657 0.205 0.282
lnggX::TRUE -0.054 0.082 0.248 0.304 -0.450 -0.612 -0.460
The new \(R^2\) indicates a much better fit.
[1] 0.3297167
Autocorrelation of RTs
In the plots below, we see that there are no clear patterns in RTs over trials or over the duration of the experiments.
Still, in this very fast-paced experiment with many trials, it might be the case that response latencies are dependent on RTs in previous trials, particularly at lag\(_{t-1}\).

We therefore need to check for and possibly control for the RT variable being correlated with itself.
However, the following autocorrelation plots with a subsample of thirty participants suggest that there’s no evidence for significant autocorrelation between RTs for most of the participants. We can therefore leave RTs at t-1 out of the model.
Quantiles to be plotted:
0% 3.448276% 6.896552% 10.34483% 13.7931% 17.24138% 20.68966% 24.13793% 27.58621% 31.03448%
-0.586671730 -0.397174279 -0.321054119 -0.277102594 -0.237523948 -0.207245033 -0.182002424 -0.158051947 -0.130596415 -0.113514347
34.48276% 37.93103% 41.37931% 44.82759% 48.27586% 51.72414% 55.17241% 58.62069% 62.06897% 65.51724%
-0.097340640 -0.065118054 -0.041524692 -0.022997554 -0.002101595 0.023375358 0.045085477 0.062544878 0.083833946 0.105303793
68.96552% 72.41379% 75.86207% 79.31034% 82.75862% 86.2069% 89.65517% 93.10345% 96.55172% 100%
0.131932183 0.153693915 0.178259327 0.202648323 0.238750555 0.271158576 0.311439933 0.369299480 0.432781245 0.705182612
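For reference, the lag-1 check described above can be sketched for a single participant's trial sequence. The sketch uses the standard sample autocorrelation and the approximate 95% bound \(\pm 1.96/\sqrt{n}\) for an uncorrelated series; the function name and data are hypothetical.

```python
import random
from math import sqrt
from statistics import mean


def autocorrelation(xs, lag=1):
    """Sample autocorrelation of a sequence at the given lag."""
    m = mean(xs)
    denom = sum((x - m) ** 2 for x in xs)
    num = sum((xs[t] - m) * (xs[t - lag] - m) for t in range(lag, len(xs)))
    return num / denom


# A strongly alternating sequence versus independent noise (both hypothetical).
alternating = [500, 700] * 50
rng = random.Random(3)
noise = [rng.gauss(600, 80) for _ in range(1000)]

r_alt = autocorrelation(alternating)
r_noise = autocorrelation(noise)
bound = 1.96 / sqrt(len(noise))
print(round(r_alt, 2), round(r_noise, 3), round(bound, 3))
```

If a participant's lag-1 value fell clearly outside the bound, the previous trial's RT could be added to the model as a covariate.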

Summary
Methods:
- The ex-Gaussian approach failed to detect significant effects in the \(\mu\) and \(\tau\) parameters. However, more aggressive trimming might yield significant results in some of these analyses.
- This is an interesting approach, but also quite limited in that it requires aggregating the data.
- The same may be true for the conflict effect and the coefficient of variability.
- A three-way interaction was significant in all mixed-effects regression models varying in their random structures before and after applying model criticism.
The three-way interaction suggests that:
- Afrikaans speakers’ RTs were more affected by congruence in the ‘height’ condition
- Xhosa speakers were more affected by congruence in the ‘size’ condition
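To see how this reading follows from the coefficients, the sketch below recombines the fixed effects of the final model (maximal random structure, after model criticism) into one congruence effect per group and condition. The coefficient names show treatment coding with Afrikaans, ‘height’ and incongruent trials as reference levels. Values are on the transformed `RTinv` scale, and the helper name is hypothetical.

```python
# Fixed-effect estimates copied from the final model output above.
coef = {
    "congruentTRUE": 0.03141,
    "languageXhosa:congruentTRUE": -0.02633,
    "conditionsize:congruentTRUE": -0.03750,
    "languageXhosa:conditionsize:congruentTRUE": 0.06175,
}


def congruence_effect(xhosa, size):
    """Congruent minus incongruent difference in RTinv for one language/condition cell."""
    return (
        coef["congruentTRUE"]
        + xhosa * coef["languageXhosa:congruentTRUE"]
        + size * coef["conditionsize:congruentTRUE"]
        + xhosa * size * coef["languageXhosa:conditionsize:congruentTRUE"]
    )


for language, xhosa in (("Afrikaans", 0), ("Xhosa", 1)):
    for condition, size in (("height", 0), ("size", 1)):
        print(language, condition, round(congruence_effect(xhosa, size), 5))
```

The congruence effect is largest for Afrikaans speakers in the ‘height’ condition (0.031) and for Xhosa speakers in the ‘size’ condition (0.029), and close to zero in the other two cells.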
Xhosa speakers appear to have slightly slower responses overall, but this may be due to greater variability, since, as we saw in the ex-Gaussian analysis, they actually have smaller \(\mu\), but larger \(\tau\) values compared with the Afrikaans group.
Experiment 3: Two-alternative forced-choice task
Whereas experiment 2 was designed to test implicit associations between space and pitch, experiment 3 targets explicit associations: participants match high/low pitch with high vs. low position or with small vs. big size and, in a mixed condition, choose between the height and size mappings.
Methods
We recruited 30 native speakers of Xhosa and 30 native speakers of Afrikaans for experiment 3.
Participants performed a two-alternative forced-choice task targeting explicit space-pitch associations by pairing pitch with circles differing in vertical position (high/low, height condition) or size (small, big). In a third condition (or trial type), participants had to choose between mapping pitch to either a high/low or a small/big circle.
There were 40 trials per participant. The order of trial types was randomised.
We also recorded RTs in each trial.
We again use mixed-effects regression to analyse the data.
Hypotheses
- Xhosa speakers will be more likely to pair pitch with ‘size’ in the mixed condition compared with Afrikaans speakers.
- No group differences in stimulus pairing in the ‘height’ and ‘size’ conditions. E.g. high pitch will consistently be paired with high and small circles.
- RTs will be slower in the mixed condition for both groups.
Mixed condition
The plot and output below indicate that Xhosa speakers, when given the possibility of pairing either visual height or size with pitch, are less likely to select size than Afrikaans speakers.

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: choiceDimension ~ voice * language + (1 | participant)
Data: df_m
AIC BIC logLik deviance df.resid
587.4 609.4 -288.7 577.4 592
Scaled residuals:
Min 1Q Median 3Q Max
-5.0581 -0.5260 0.2646 0.4567 3.4811
Random effects:
Groups Name Variance Std.Dev.
participant (Intercept) 2.32 1.523
Number of obs: 597, groups: participant, 61
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.3472 0.3720 3.622 0.000293 ***
voicelow 1.0147 0.3516 2.886 0.003899 **
languageXhosa -2.1798 0.5143 -4.238 2.25e-05 ***
voicelow:languageXhosa 1.4295 0.4846 2.950 0.003180 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) voiclw lnggXh
voicelow -0.358
languageXhs -0.731 0.255
vclw:lnggXh 0.294 -0.709 -0.418
We find an interaction effect between language and voice frequency.
Contrary to our expectations, Afrikaans speakers were more likely to pair small circles with high-pitched voices rather than circles with a high position.
The opposite pattern was found for Xhosa speakers, who preferred the ‘height’ mapping.
For low-pitched voices, both groups consistently favoured the ‘size’ mapping.
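The pattern is easier to read on the probability scale. The sketch below converts the fixed effects of the mixed-condition model into predicted probabilities for a typical participant (random intercept of zero). The summary does not state which level of `choiceDimension` is coded as 1; the surrounding text suggests it is ‘size’, which is treated here as an assumption. The helper names are hypothetical.

```python
from math import exp

# Fixed-effect estimates copied from the mixed-condition model output above.
coef = {
    "(Intercept)": 1.3472,
    "voicelow": 1.0147,
    "languageXhosa": -2.1798,
    "voicelow:languageXhosa": 1.4295,
}


def inv_logit(x):
    """Convert log-odds to a probability."""
    return 1 / (1 + exp(-x))


def predicted_probability(low_voice, xhosa):
    """Predicted probability of the outcome coded as 1 (assumed to be 'size')."""
    log_odds = (
        coef["(Intercept)"]
        + low_voice * coef["voicelow"]
        + xhosa * coef["languageXhosa"]
        + low_voice * xhosa * coef["voicelow:languageXhosa"]
    )
    return inv_logit(log_odds)


for language, xhosa in (("Afrikaans", 0), ("Xhosa", 1)):
    for voice, low in (("high", 0), ("low", 1)):
        print(language, voice, round(predicted_probability(low, xhosa), 2))
```

Only the Xhosa group with high-pitched voices falls below .5, which matches the description above.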
Height condition

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: choiceName ~ voice * language + (1 + voice | participant)
Data: df_h
AIC BIC logLik deviance df.resid
735.6 766.4 -360.8 721.6 600
Scaled residuals:
Min 1Q Median 3Q Max
-2.2522 -0.6824 0.2859 0.5909 2.2376
Random effects:
Groups Name Variance Std.Dev. Corr
participant (Intercept) 4.828 2.197
voicelow 8.511 2.917 -0.94
Number of obs: 607, groups: participant, 61
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.8408 0.4730 -1.778 0.07547 .
voicelow 1.6184 0.6233 2.596 0.00942 **
languageXhosa 0.7860 0.6648 1.182 0.23710
voicelow:languageXhosa -1.3244 0.8702 -1.522 0.12804
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) voiclw lnggXh
voicelow -0.892
languageXhs -0.714 0.637
vclw:lnggXh 0.641 -0.717 -0.897
In the ‘height’ condition, we find a significant effect for voice frequency. The trend further suggests an interaction such that Afrikaans speakers are more consistent in mapping high/low voices to high/low circles.
Size condition

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: binomial ( logit )
Formula: choiceName ~ voice * language + (1 | participant)
Data: df_s
AIC BIC logLik deviance df.resid
577.1 599.1 -283.5 567.1 603
Scaled residuals:
Min 1Q Median 3Q Max
-3.0099 -0.4562 -0.2429 0.3899 3.8936
Random effects:
Groups Name Variance Std.Dev.
participant (Intercept) 0.9321 0.9654
Number of obs: 608, groups: participant, 61
Fixed effects:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.8822 0.2976 6.324 2.55e-10 ***
voicelow -4.2257 0.4014 -10.527 < 2e-16 ***
languageXhosa -2.1984 0.3913 -5.619 1.92e-08 ***
voicelow:languageXhosa 2.5268 0.4982 5.072 3.93e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) voiclw lnggXh
voicelow -0.546
languageXhs -0.774 0.436
vclw:lnggXh 0.417 -0.769 -0.482
In the ‘size’ condition, we see a clear interaction between language and voice frequency.
The two language groups showed high agreement in pairing low pitch with big, rather than small circles.
Interestingly, Xhosa speakers also showed a slight preference for pairing high pitch with big circles, whereas Afrikaans speakers more consistently paired high pitch with small circles.
RT analyses
Finally, as large RTs might be indicative of uncertainty, we’ll examine whether there are significant differences in RTs related to the independent variables.
The data has been trimmed to only include responses faster than 20 seconds.
Not shown here are QQ-plots and correlation coefficients indicating that the log-normal transformation provides the best fit for our data. We will also apply model criticism and only examine the final model.
As the condition variable has more than two levels for comparison, we’ll base our analysis on an anova table summarizing the regression output.

Type III Analysis of Variance Table with Satterthwaite's method
Sum Sq Mean Sq NumDF DenDF F value Pr(>F)
language 2.3358 2.3358 1 51.38 8.5757 0.005068 **
voice 1.9138 1.9138 1 2042.38 7.0264 0.008094 **
condition 20.0421 10.0211 2 2041.88 36.7923 < 2.2e-16 ***
language:voice 0.9849 0.9849 1 2042.38 3.6160 0.057366 .
language:condition 0.0970 0.0485 2 2041.88 0.1782 0.836826
voice:condition 0.6134 0.3067 2 2047.02 1.1261 0.324507
language:voice:condition 0.2553 0.1276 2 2047.02 0.4686 0.625950
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
The interaction plots and anova table reveal significant main effects of language, voice frequency and condition, with no significant interaction effects (the language by voice interaction was marginal, p = .057). Xhosa speakers generally took longer in selecting visual stimuli, which was to be expected on the basis of their lower consistency in all conditions.
Interestingly, our expectation that the mixed condition would give rise to the longest RTs is not supported by the results. Instead the ‘height’ condition consistently gave rise to slower decisions despite being the more dominant mapping in speech.
Summary
This experiment yielded some very interesting and surprising results.
Overall, Xhosa speakers were less consistent in how they mapped space to pitch, both when choosing visual stimuli of opposite spatial polarities, and when choosing between height and size.
Surprisingly, Afrikaans speakers consistently preferred the size mapping for both high and low-pitched voices, whereas Xhosa speakers showed a preference for ‘height’ in cases with high pitch, and ‘size’ in cases with low pitch.
Another striking finding was that, in the size condition, Xhosa speakers showed a preference for mapping big circles not only to low pitch but also, to a lesser extent, to high pitch.
There appears to be more individual variation for Xhosa speakers in this task, both in choosing between mappings, as well as choosing polarity correspondences within particular mappings.
In this experiment, linguistic metaphors proved to be poor predictors of non-linguistic choices with regards to spatial mappings.
Conclusion
In this series of experiments, we examined spatial metaphors for pitch from three angles: language production, performance in a nonverbal implicit association task and nonverbal judgements in an explicit association task.
Our language production findings show that ‘height’ is used in Afrikaans, whereas Xhosa speakers also used ‘size’. Interestingly, and in line with our findings from a previous study, vertical gestures frequently accompany not only ‘height’ but also ‘size’ metaphors. The speech material does not allow us to determine whether gestural indications of height might refer to the physical size/height of a person likely to produce the heard sounds.
In the RT task, we found a three-way interaction that followed our predictions. However, methods requiring data aggregation failed to detect this effect. I would suggest reporting the statistically more sophisticated mixed-effects regression model with maximal random structure, but perhaps also noting that an ex-Gaussian approach failed to detect any significant effects. As the last interaction plot indicates, the effect is far from spectacular (as opposed to previous findings in the literature).
The results from the third experiment seem more puzzling. We did not expect Afrikaans speakers to consistently map pitch to ‘size’. Nor did we expect Xhosa speakers to choose ‘height’/‘size’ depending on the pitch of the voice.
In the height and size conditions, Afrikaans speakers paired pitch with high/low/big/small as expected, whereas Xhosa speakers were much less consistent, with the exception that they had a clear preference for pairing “low” pitch with big circles as opposed to small circles. Further studies with other types of stimuli might shed light on the latter findings.
As I see it, our findings point in different directions and only support the very general idea that the conceptualisation of pitch is, to some extent, spatial, but flexible. The factorial designs demonstrated different effects of manipulating visual size and height for the two groups, but only in the case of mixed-effects regression. The role of language in shaping the conceptualisation of pitch is unclear, but far from deterministic. This is particularly evident from the contradictory findings in experiment 3.
References and readings
Abutalebi, J., Guidi, L., Borsa, V., Canini, M., Della Rosa, P. A., Parris, B. A., & Weekes, B. S. (2015). Bilingualism provides a neural reserve for aging populations. Neuropsychologia, 69, 201–210. https://doi.org/10.1016/j.neuropsychologia.2015.01.040
Baayen, H. R., & Milin, P. (2010). Analyzing reaction times. International Journal of Psychological Research, 3(2), 12. https://doi.org/10.21500/20112084.807
Calabria, M., Hernandez, M., Martin, C. D., & Costa, A. (2011). When the Tail Counts: The Advantage of Bilingualism Through the Ex-Gaussian Distribution Analysis. Frontiers in Psychology, 2. https://doi.org/10.3389/fpsyg.2011.00250
Henriquez-Henriquez, M. P., Billeke, P., Henriquez, H., Zamorano, F. J., Rothhammer, F., & Aboitiz, F. (2015). Intra-Individual Response Variability Assessed by Ex-Gaussian Analysis may be a New Endophenotype for Attention-Deficit/Hyperactivity Disorder. Frontiers in Psychiatry, 5. https://doi.org/10.3389/fpsyt.2014.00197
Lachaud, C. M., & Renaud, O. (2011). A tutorial for analyzing human reaction times: How to filter data, manage missing values, and choose a statistical model. Applied Psycholinguistics, 32(02), 389–416. https://doi.org/10.1017/S0142716410000457
Marsden, E., Thompson, S., & Plonsky, L. (2018). A methodological synthesis of self-paced reading in second language research. Applied Psycholinguistics, 39(05), 861–904. https://doi.org/10.1017/S0142716418000036
Ratcliff, R. (n.d.). Methods for Dealing With Reaction Time Outliers, 23.
Whelan, R. (2010). Effective analysis of reaction time data. The Psychological Record, 58(3). Retrieved from https://opensiuc.lib.siu.edu/tpr/vol58/iss3/9
