Introduction

This project is a replication of Study 1 in Luk and Surrain (2019). All relevant documentation can be found in this GitHub repository. The authors developed a scale measuring the perception of bilingualism, the 13-item Perception of Bilingualism scale (POB), and conducted psychometric analyses using both a classical test theory (CTT) and an item response theory (IRT) approach. The authors established the unidimensionality of the POB scale using exploratory factor analysis (EFA; 80 % of variance explained by a single factor), yet a single-factor confirmatory factor analysis (CFA) produced insufficient model fit (\(\chi^2\)(65) = 501.23) across multiple indices. In a graded response model, they flagged 3 of the 13 items as uninformative.

Further, using a multiple linear regression model, they showed that participants’ language background consistently predicted POB scores (on the final 10-item version), regardless of age, education, or sex. The direction of this effect confirmed their hypothesis that bilingualism is perceived more positively by individuals who themselves speak more than one language.

This replication is important (a) because it will solidify our knowledge about how bilingual individuals perceive their bilingual status and (b) because, to date, there is no reliable scale measuring how bilingualism is perceived by individuals. Obtaining this information can help inform policy and education and aid in assessing how successfully current research findings on multilingualism have been disseminated to the public.

Methods

Power Analysis

In the original study, multiple linear regression model \(R^2\)s ranged from .15 in the simplest model to .24 in a four-predictor model with an interaction term (Table 5 in the original report). Hence, for the replication of the regression analyses targeting a model with \(R^2\) = .24, sample sizes of N = 47, 58, and 69 were needed to achieve statistical power of 80 %, 90 %, and 95 %, respectively.
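
Since the original authors’ power calculation code was not available, the sketch below shows how such sample sizes can be approximated with the pwr package (my reconstruction under assumed conventions, not the original calculation; exact Ns may differ slightly across software and rounding choices):

# approximate reconstruction of the power analysis (an assumption, not the
# authors' code): Cohen's f2 = R2 / (1 - R2); u = terms in the final model
library(pwr)

f2 <- 0.24 / (1 - 0.24)  # effect size implied by R2 = .24
u  <- 5                  # four predictors plus the interaction term

for (pw in c(.80, .90, .95)) {
  res <- pwr.f2.test(u = u, f2 = f2, sig.level = .05, power = pw)
  # total N = numerator df (u) + denominator df (v) + 1
  cat(sprintf("power = %.2f -> N = %.0f\n", pw, ceiling(res$v) + u + 1))
}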

Planned Sample

The authors of the original paper recruited US-American participants via Qualtrics Panel and Amazon Mechanical Turk (N = 422). For the replication, all participants were recruited via Amazon Mechanical Turk. Using the same sampling frame increases the likelihood of obtaining similar sample characteristics. Given the power calculations in the previous section, the desired sample size for the regression analysis, after exclusions due to failed attention/comprehension checks, was N = 70. Replicating the IRT analysis, however, would have required a much larger sample, which was unfeasible under the current circumstances.

Materials

The replication made use of the same materials as the original study. The POB is available as part of the original paper. Though the original demographic questionnaire was not available, the current version of the survey nonetheless covered all demographic variables of interest. The materials are:

  • Perception of Bilingualism scale; Luk and Surrain (2019, pp. 12-13) described its development as follows: “An initial set of 13 items was developed based on our review of the literature, cognitive interviews, and input from members of the research team in our lab who have worked with linguistically diverse populations across the lifespan. The initial set of items […] covered perceptions of whether speaking multiple languages in the U.S. should be acknowledged, accommodated, rewarded and supported; whether speaking multiple languages in the U.S. is needed and valued; and whether speaking multiple languages incurs personal benefits and costs. Several items were adapted from Baker’s Attitude to Bilingualism Scale (21) and Byrnes and Kiger’s Language Attitudes of Teachers Scale (LATS; 33,34). We chose to use a 6-point Likert scale from 1 (strongly disagree) to 6 (strongly agree) with no midpoint option [to] elicit greater variability and discourage satisficing, or providing a response without expending the cognitive effort required to fully interpret and respond to each item”. In keeping with the authors’ explicit concerns, the PoB was available in both Spanish and English;
  • Demographic questionnaire;
  • Attention check item inserted amidst the PoB items.

Procedure

In one combined Qualtrics survey, participants gave informed consent, completed the POB (interspersed with an attention check item), and responded to a basic set of questions about their demographics, educational attainment, and language background. In the original paper, median survey completion time was 13 minutes. Given that the replication focused on Part 1 of the original paper, which requires participants to complete neither the Knowledge of Infant Development Inventory nor the PoB+, I expected the survey to be completed in under 10 minutes.

Analysis Plan

The analysis plan for the regression analysis mirrors that of the original paper: All data from participants who failed the attention check were excluded. I provided a descriptive overview showing the demographic characteristics of the sample. Given the small sample size, I did not conduct the EFA, CFA, and IRT analyses. Correlations between POB scores, age, sex, language background, and years of education were explored, and all predictors were entered into a multiple linear regression model, exactly as in the original paper. The key analysis of interest is the regression modelling Luk and Surrain presented in Table 5; I aimed to replicate their final model (model 5), which predicts PoB score from English-language status, years of education, age, sex, and a language x age interaction with \(R^2\) = .24, \(F\)(5, 416) = 26.23, \(p\) < .001 (\(p\) value calculated by me from the reported statistics).
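
The \(p\) value can be recovered from the \(F\) statistic and degrees of freedom reported in the original paper:

# recover the p value for the original model 5 from F(5, 416) = 26.23
pf(26.23, df1 = 5, df2 = 416, lower.tail = FALSE)  # far below .001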

Differences from Original Study

While the original study recruited participants using both Qualtrics Panel and Amazon Mechanical Turk, the replication recruited only from the latter. Further, it was not possible to systematically oversample parents to guarantee a sufficiently large representation of parents of children exposed to both Spanish and English, as was the case in the original study. Years of education (Education_y) is a self-reported item in the replication, whereas Luk and Surrain manually recoded self-reported highest level of education into the approximately equivalent number of years. Because Luk and Surrain did not provide results split by whether or not participants were parents, the effect of this sampling difference is impossible to predict. Overall, the replication remained very close to the original study; hence, it was reasonable to expect very similar results.

Methods Addendum (Post Data Collection)

Actual Sample

The final sample for this analysis comprised 66 individuals (24 women, 42 men), aged between 18 and 69 years (\(M\) = 37.00, \(SD\) = 12.30). The majority of participants were white (76 %), born in the US (87 %), and had at least some tertiary education (83 %). A table with a detailed sample description can be found in the confirmatory analysis section.

Differences from pre-data collection methods plan

No changes were made to the data collection process or the general analysis plan. However, I made some minor cosmetic changes to the presentation of the regression models and added some lines of code to convert the demographic variables into appropriate formats (e.g. recoding sex from a character label into a numeric variable). None of these changes influence the outcomes of the preregistered confirmatory analyses.

Results

Data preparation

Data preparation followed the analysis plan outlined above, using the following R packages.

library(tidyverse)
library(readr)
library(ltm)
library(psych)
library(qualtRics) # requires devtools install 
library(here) # requires devtools install 
library(dplyr)
library(table1)
library(corrr)
library(car)
library(kableExtra)
library(sjPlot)
library(htmltools)
library(stargazer)

First, I downloaded the Qualtrics survey results as a CSV file and stripped columns containing sensitive information, such as IP addresses and location data; I then imported the file into R and deleted other irrelevant survey metadata variables.

# import data from csv (Qualtrics output) in R-friendly format, i.e. without unnecessary header rows
here()  # prints the project root that the relative path below resolves against
d1 <- readSurvey("data/luk2019_replication_anonymous.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   StartDate = col_datetime(format = ""),
##   EndDate = col_datetime(format = ""),
##   Progress = col_double(),
##   `Duration (in seconds)` = col_double(),
##   Finished = col_logical(),
##   RecordedDate = col_datetime(format = ""),
##   RecipientLastName = col_logical(),
##   RecipientFirstName = col_logical(),
##   RecipientEmail = col_logical(),
##   ExternalReference = col_logical(),
##   PoB1_S = col_logical(),
##   PoB2_S = col_logical(),
##   PoB4_S = col_logical(),
##   PoB6_S = col_logical(),
##   PoB7_S = col_logical(),
##   AttentionCheck_S = col_logical(),
##   PoB8_S = col_logical(),
##   PoB9_S = col_logical(),
##   PoB10_S = col_logical(),
##   PoB12_S = col_logical()
##   # ... with 5 more columns
## )
## See spec(...) for full column specifications.
# drop irrelevant Qualtrics variables
d1 <- d1 %>%
  dplyr::select(-c(StartDate, EndDate, Status, RecordedDate, RecipientFirstName, RecipientLastName, RecipientEmail, ExternalReference, Finished, DistributionChannel))

Next, I evaluated the attention checks, excluded all participants who did not pass, and recoded the check variable as 1 (passed) for the retained participants.

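# note: the filter drops everyone who failed the check, so every participant
# retained below passed; the mutate then codes the check as 1 for these rows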
d2 <- d1 %>%
  filter(AttentionCheck == "Strongly disagree" | AttentionCheck_S == "Totalmente en desacuerdo") %>%
  mutate(AttentionCheck = ifelse(AttentionCheck == "Strongly disagree", 1, ifelse(AttentionCheck_S == "Totalmente en desacuerdo", 1,0))) %>%
  dplyr::select(-c(AttentionCheck_S))

d2 <- d2 %>% # move attention check variable to end of dataset
  dplyr::select(-AttentionCheck, everything())

Then, I created a dummy variable for monolingual English status and renamed the PoB score variable to Total_PoB13.

# create variable "EO" (L1 English, no L2)
d2 <- d2 %>%
  mutate(EO = if_else(L1 == "English", if_else(L2 == "I do not speak a second language.", 1, 0), 0))

# rename SC0 to Total_PoB13
d2 <- d2 %>%
  mutate(Total_PoB13 = SC0) %>%
  dplyr::select(-SC0)

Then, I recoded PoB responses from string variables (the Likert scale labels) to numerics (1-6). For this, I temporarily pivoted the data into long format and then spread it back into wide format. Further, given that participants completed the PoB in either English or Spanish, I unified the responses to the English and Spanish PoB scales into one variable per item. Lg_decision, the variable indicating the language in which the survey was taken, was retained.

d2 <- d2 %>%
  pivot_longer(cols = starts_with("PoB"), names_to = "PoB_Item", values_to = "response")

# converting character responses (Likert categories) to numerics;
# car::recode is called explicitly because dplyr exports a different recode()
d2$response <- car::recode(d2$response, "'Strongly agree' = 6; 'Totalmente de acuerdo' = 6; 'Agree' = 5; 'De acerdo' = 5; 'Somewhat agree' = 4; 'Parcialmente de acuerdo' = 4; 'Somewhat disagree' = 3; 'Parcialmente en desacuerdo' = 3; 'Disagree' = 2; 'En desacuerdo' = 2; 'Strongly disagree' = 1; 'Totalmente en desacuerdo' = 1")

d2 <- d2 %>%
  pivot_wider(names_from = "PoB_Item", values_from = "response")

# creating new unified variables for PoB responses (regardless of response language)
d2 <- d2 %>%
  mutate(PoB_1 = ifelse(Lg_decision == "Español", PoB1_S, PoB1_E),
         PoB_2 = ifelse(Lg_decision == "Español", PoB2_S, PoB2_E),
         PoB_4 = ifelse(Lg_decision == "Español", PoB4_S, PoB4_E),
         PoB_6 = ifelse(Lg_decision == "Español", PoB6_S, PoB6_E),
         PoB_7 = ifelse(Lg_decision == "Español", PoB7_S, PoB7_E),
         PoB_8 = ifelse(Lg_decision == "Español", PoB8_S, PoB8_E),
         PoB_9 = ifelse(Lg_decision == "Español", PoB9_S, PoB9_E),
         PoB_10 = ifelse(Lg_decision == "Español", PoB10_S, PoB10_E),
         PoB_12 = ifelse(Lg_decision == "Español", PoB12_S, PoB12_E),
         PoB_13 = ifelse(Lg_decision == "Español", PoB13_S, PoB13_E))

# removing old PoB response columns
d2 <- d2 %>%
  dplyr::select(-contains("_S")) %>%
  dplyr::select(-contains("_E"))

I computed a new PoB score variable (PoB10_Total), denoting the total score on a shortened version of the PoB, which Luk and Surrain determined to have improved psychometric properties and which they used for their subsequent analyses.

# compute new PoB10_Total
d2 <- d2 %>%
  mutate(PoB10_Total = PoB_1 + PoB_2 + PoB_4 + PoB_6 + PoB_7 + PoB_8 + PoB_9 + PoB_10 + PoB_12 + PoB_13)

After data cleaning, the resultant dataframe (d2) contains some session information, participants’ PoB scores, age, sex, language background, and years of education, as well as their responses to the linguistic profile questionnaire.

Confirmatory analysis

Sample descriptives were displayed using the table1 package.

label(d2[["Age"]]) <- "Age in years"
label(d2[["Sex"]]) <- "Sex (Female = 1)"
label(d2[["Birthplace"]]) <- "Birthplace"
label(d2[["Ethnicity"]]) <- "Ethnicity"
label(d2[["Education"]]) <- "Highest educational level"

# Sample descriptives 
table1::table1(~Age + Sex + Birthplace + Education + Ethnicity, data = d2)
                                   Overall (n = 66)
Age in years
  Mean (SD)                        37.0 (12.3)
  Median [Min, Max]                34.5 [18.0, 69.0]
Sex (Female = 1)
  Female                           24 (36.4%)
  Male                             42 (63.6%)
Birthplace
  In the US                        58 (87.9%)
  Outside the US, please specify:   7 (10.6%)
  Missing                           1 (1.5%)
Highest educational level
  College graduate                 25 (37.9%)
  Graduate school degree           12 (18.2%)
  High school graduate             11 (16.7%)
  Some college or ass. degree      18 (27.3%)
Ethnicity
  Asian or Pacific Islander         4 (6.1%)
  Black                             5 (7.6%)
  Hispanic                          4 (6.1%)
  Mixed race or other               3 (4.5%)
  White alone (not Hispanic)       50 (75.8%)

All confirmatory analyses are specified in the analysis plan (see section Analysis Plan). Following a correlations table, the main analysis is a multiple linear regression, carried out using the lm function. Luk and Surrain’s (2019) final model was replicated as shown below (building all models as in the original paper, with a focus on the last model, fit5). Prior to the model building, however, I converted the variable Sex into numeric format (female = 1) and centered Age at the mean (as Luk and Surrain did).

# creating new variable, centering age at the mean
d2 <- d2 %>%
  mutate(age_centered = Age - mean(Age))

# changing Sex to numeric with Female = 1 
d2 <- d2 %>%
  mutate(Sex = if_else(Sex == "Female", 1, 0))

A quick look at the correlations table shows a pattern very similar to that in the original study (see Table 4 in Luk and Surrain, 2019). Being older and being a monolingual English speaker are negatively related to PoB scores; higher levels of education and being female are positively related to PoB scores.

# preparation for correlation tables
# creating separate df for correlations_data
correlations_data <- d2 %>%
  dplyr::select(c(PoB10_Total, EO, Age, Sex, Education_y))

# creating & showing the correlations table
correlations <- corrr::correlate(correlations_data)
## 
## Correlation method: 'pearson'
## Missing treated using: 'pairwise.complete.obs'
correlations
## # A tibble: 5 x 6
##   rowname     PoB10_Total     EO    Age      Sex Education_y
##   <chr>             <dbl>  <dbl>  <dbl>    <dbl>       <dbl>
## 1 PoB10_Total    NA       -0.125 -0.247  0.0997      0.00129
## 2 EO             -0.125   NA      0.138  0.113      -0.271  
## 3 Age            -0.247    0.138 NA      0.206       0.215  
## 4 Sex             0.0997   0.113  0.206 NA           0.00904
## 5 Education_y     0.00129 -0.271  0.215  0.00904    NA

The regression models, however, do not replicate Luk and Surrain’s findings at all. The lack of explanatory power in all five models and their complete lack of resemblance to the original models seemed unlikely, prompting me to revisit Luk and Surrain’s codebook. There, I realised that the dependent variable in their regression models was each participant’s PoB theta, not the total score on the 10-item PoB scale. Though the authors had mentioned this in their analytic plan, it was not self-evident from the results description or variable labeling.

# Multiple linear regression predicting PoB score from the same demographic and linguistic variables as in the original paper

# fit1 predictors: EO
# fit2 predictors: EO, Education_y
# fit3 predictors: EO, Education_y, age
# fit4 predictors: EO, Education_y, age, sex
# fit5 predictors: EO, Education_y, age, sex, EO*age

fit1 <- lm(PoB10_Total ~ EO, data=d2)
fit2 <- lm(PoB10_Total ~ EO + Education_y, data=d2)
fit3 <- lm(PoB10_Total ~ EO + Education_y + age_centered, data=d2)
fit4 <- lm(PoB10_Total ~ EO + Education_y + age_centered + Sex, data=d2)
fit5 <- lm(PoB10_Total ~ EO + Education_y + age_centered + Sex + EO*age_centered, data=d2)
stargazer::stargazer(fit1, fit2, fit3, fit4, fit5, type = "html")
                                 Dependent variable: PoB10_Total
                      (1)          (2)          (3)          (4)          (5)
EO                  -2.038       -1.776       -0.755       -1.015       -0.689
                    (2.063)      (2.123)      (2.104)      (2.114)      (2.147)
Education_y                      -0.070        0.111        0.099        0.153
                                 (0.284)      (0.287)      (0.287)      (0.293)
age_centered                                  -0.179**     -0.191**     -0.308**
                                              (0.080)      (0.080)      (0.151)
Sex                                                         2.177        2.200
                                                           (1.998)      (2.001)
EO:age_centered                                                          0.158
                                                                        (0.172)
Constant            44.818***    46.041***    42.481***    42.073***    40.857***
                    (1.664)      (5.248)      (5.321)      (5.326)      (5.496)
Observations        63           62           62           62           62
R2                  0.016        0.012        0.090        0.109        0.122
Adjusted R2         -0.0004      -0.022       0.043        0.046        0.044
Residual Std. Error 7.805        7.618        7.372        7.360        7.370
                    (df = 61)    (df = 59)    (df = 58)    (df = 57)    (df = 56)
F Statistic         0.976        0.350        1.921        1.742        1.558
                    (df = 1; 61) (df = 2; 59) (df = 3; 58) (df = 4; 57) (df = 5; 56)
Note: *p<0.1; **p<0.05; ***p<0.01

Therefore, in the following section, I adapted my replication plan and repeated the regression modelling using predicted PoB theta scores.

Exploratory analyses

I attempted to compute PoB thetas in the same fashion as Luk and Surrain, using a graded response model.

# creating dataframe with PoB responses only
PoB <- d2 %>%
  dplyr::select(c(PoB_1, PoB_2, PoB_4, PoB_6, PoB_7, PoB_8, PoB_9, PoB_10, PoB_12, PoB_13))

# applying grm model to PoB results ('subscript out of bounds' error, hence not run)
# grm(PoB)

However, the data do not contain all possible response options for each item (i.e. some response options on some of the items were not selected by any participant). Therefore, I could not run the grm model, as it could not compute the threshold parameters for each response level; consequently, I could not compute theta values either. As a last resort, I collapsed the responses onto a 1-5 scale in order to obtain rough grm theta estimates and rerun the regression analyses.
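
To verify this, the observed response categories can be listed per item (a quick diagnostic of my own, not part of the original analysis plan):

# list observed response categories per item; items with unused categories
# appear to be what triggers grm's 'subscript out of bounds' error
sapply(PoB, function(item) sort(unique(item)))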

# creating a new set of adjusted PoB responses: recoding original responses to a 1-5 scale (collapsing original 1 and 2 responses into a single category, 1, and subtracting one from all higher categories)
d2 <- d2 %>%
  mutate(PoB_1a = ifelse(PoB_1 <= 2, 1, PoB_1 - 1),
         PoB_2a = ifelse(PoB_2 <= 2, 1, PoB_2 - 1),
         PoB_4a = ifelse(PoB_4 <= 2, 1, PoB_4 - 1),
         PoB_6a = ifelse(PoB_6 <= 2, 1, PoB_6 - 1),
         PoB_7a = ifelse(PoB_7 <= 2, 1, PoB_7 - 1),
         PoB_8a = ifelse(PoB_8 <= 2, 1, PoB_8 - 1),
         PoB_9a = ifelse(PoB_9 <= 2, 1, PoB_9 - 1),
         PoB_10a = ifelse(PoB_10 <= 2, 1, PoB_10 - 1),
         PoB_12a = ifelse(PoB_12 <= 2, 1, PoB_12 - 1),
         PoB_13a = ifelse(PoB_13 <= 2, 1, PoB_13 - 1))

# creating new df with only recoded PoB responses
PoBa <- d2 %>%
  dplyr::select(c(PoB_1a, PoB_2a, PoB_4a, PoB_6a, PoB_7a, PoB_8a, PoB_9a, PoB_10a, PoB_12a, PoB_13a))

# running grm model and saving model fit
PoB_grm <- grm(PoBa)

# theta estimation: computing theta estimates for each participant 
PoB_grm_scores <- ltm::factor.scores(PoB_grm, resp.patterns = PoBa)

# saving thetas as vector
PoB_theta <- PoB_grm_scores$score.dat$z1

# adding thetas vector as column to d2
d2 <- d2 %>%
  add_column(PoB_theta)

The same regression models, with PoB theta (as opposed to sum scores) as the dependent variable, are presented below. Despite the use of theta scores as the DV, the models do not reproduce Luk and Surrain’s results.

# repeating regression analyses from confirmatory analysis with PoB thetas instead of sumscores
fit1a <- lm(PoB_theta ~ EO, data=d2)
fit2a <- lm(PoB_theta ~ EO + Education_y, data=d2)
fit3a <- lm(PoB_theta ~ EO + Education_y + age_centered, data=d2)
fit4a <- lm(PoB_theta ~ EO + Education_y + age_centered + Sex, data=d2)
fit5a <- lm(PoB_theta ~ EO + Education_y + age_centered + Sex + EO*age_centered, data=d2)
stargazer::stargazer(fit1a, fit2a, fit3a, fit4a, fit5a, type = "html")
                                 Dependent variable: thetas
                      (1)          (2)          (3)          (4)          (5)
EO                  -0.260       -0.230       -0.128       -0.141       -0.100
                    (0.233)      (0.237)      (0.238)      (0.240)      (0.243)
Education_y                      -0.009        0.008        0.008        0.015
                                 (0.032)      (0.033)      (0.033)      (0.034)
age_centered                                  -0.018*      -0.019**     -0.035*
                                              (0.009)      (0.009)      (0.018)
Sex                                                         0.145        0.149
                                                           (0.230)      (0.230)
EO:age_centered                                                          0.022
                                                                        (0.020)
Constant             0.156        0.315       -0.021       -0.067       -0.229
                    (0.186)      (0.588)      (0.601)      (0.609)      (0.626)
Observations        66           65           65           65           65
R2                  0.019        0.015        0.072        0.078        0.097
Adjusted R2         0.004        -0.017       0.027        0.017        0.020
Residual Std. Error 0.910        0.888        0.869        0.873        0.872
                    (df = 64)    (df = 62)    (df = 61)    (df = 60)    (df = 59)
F Statistic         1.244        0.469        1.581        1.272        1.260
                    (df = 1; 64) (df = 2; 62) (df = 3; 61) (df = 4; 60) (df = 5; 59)
Note: *p<0.1; **p<0.05; ***p<0.01

Discussion

Summary of Replication Attempt

The key analysis of interest was the five-predictor regression model of PoB performance (theta, not sum score). While Luk and Surrain’s multiple linear model predicting PoB theta scores from English language status, years of education, age, sex, and English language status*age achieved an \(R^2\) of .24, the replication with the same model predictors only achieved an \(R^2\) of .09, with age being the only significant predictor. Therefore, the replication attempt must be considered a failure. Explanations for this failure to replicate are discussed in the commentary section below.

Commentary

Key differences between the original study and the replication are:

  • The original study’s sample was larger by a factor of roughly six.

  • The original study systematically oversampled parents of young children; the replication sample did not.

  • For the replication sample, unlike for the original one, there were no selection criteria for MTurk participants intended to recruit a sample representative of the US population. As a result, the replication sample was, on average, more educated, more white, and had a reversed sex ratio (roughly 2:1 men to women) compared to the original sample.

  • For the replication, I did not have access to the original demographic questionnaires and attention checks; the format of those elements in the replication might have altered response patterns.

  • The replication only used the improved 10-item PoB scale, while participants in the original study completed the initial 13-item version.

In light of these differences, the failure to replicate is most likely due to the differences between the samples and potentially to the differences in survey content. Further, rerunning Luk and Surrain’s analysis with their code on OSF reproduced the exact results reported in their paper, ruling out analytic error as yet another potential explanation for the diverging findings. Overall, the re-analysis of the OSF data using the code provided by the authors showed that the original paper’s methods are sound. Together with the original authors’ general reservations about the timeline and circumstances of the replication effort, this strongly suggests that the present replication must not be taken as a serious challenge to the original findings.