library(tidyverse)
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
-- Attaching packages ------------------------------------------------- tidyverse 1.3.2 --v ggplot2 3.3.5 v purrr 0.3.4
v tibble 3.1.8 v dplyr 1.0.9
v tidyr 1.1.4 v stringr 1.4.0
v readr 2.1.2 v forcats 0.5.1-- Conflicts ---------------------------------------------------- tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
library(lme4)
Loading required package: Matrix
Attaching package: ‘Matrix’
The following objects are masked from ‘package:tidyr’:
expand, pack, unpack
library(lmerTest)
Attaching package: ‘lmerTest’
The following object is masked from ‘package:lme4’:
lmer
The following object is masked from ‘package:stats’:
step
library(MuMIn)
d <- read_csv("isbell_crowther_genZ_inprep.csv")
Rows: 216 Columns: 271-- Column specification ------------------------------------------------------------------
Delimiter: ","
chr (20): ResponseId, wave, Generation, device, gender, ethnicity, ethnicity_8_TEXT, ...
dbl (250): USA_Correct, USA_Friendly, USA_Pleasant, USA_Familiar, China_Correct, China...
lgl (1): gender_4_TEXT
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
To run a mixed effects model, we’ll need to get the data in a suitable format.
You’ll notice that this dataset is not exactly tidy. There are, for example, columns that indicate multiple variables - country and one of four qualities (familiarity, correctness, friendliness, and pleasantness) related to the English spoken there.
It’ll take a couple of steps to get this into shape.
First, we’ll get all of the country ratings in a single column.
long <- d %>% pivot_longer(USA_Correct:Thailand_Familiar, names_to = "country_quality",
values_to = "rating")
Now we have over 50,000 observations. Wild!
But we need to break up the countries and qualities - those are
separate variables. We’ll use the separate() function and
some fancy regex (regular expressions) to split the column into two.
long <- long %>% separate(country_quality, into = c("Country", "Quality"), sep = "_(?=[[:alpha:]]*$)")
Now, since each quality is a different variable, we’ll go ahead and
put each in it’s own column using pivot_wider().
long <- long %>% pivot_wider(names_from = "Quality", values_from = "rating")
We should be set now - take a look at the dataset just to be sure.
Let’s have a look at some of the data. First, we’ll focus on countries.
long %>% ggplot(aes(y = Pleasant, x = Country))+
geom_jitter(height = .25, width = .1, alpha = .3)+
theme_bw()+
theme(axis.text.x = element_blank())
This isn’t the best way to represent the data, but you can clearly see that some countries have more observations than others, and at the same time some countries seem to have greater concentrations of pleasantness ratings at the higher or lower end.
We’ll look at listeners next - just a subset, though, to keep things manageable.
set.seed(15)
rand_listener <- sample(unique(long$ResponseId), 15)
long %>% filter(ResponseId %in% rand_listener) %>%
ggplot(aes(x = ResponseId, y = Pleasant)) +
geom_jitter(height = .05, width = .15)+
theme_bw()
Again, not the prettiest plot, but you can clearly see some differences in how listeners rate pleasantness - some assign generally lower ratings, and some assign generally higher ratings, while others use a broader range.
So it looks like there’d be a good case to include both listeners and countries as random effects.
Now we’ll try modeling how listeners judge the pleasantness of different global Englishes. We’ll start by taking a look at some random effects models.
ranef_mod <- lmer(Pleasant ~ 1 + (1|ResponseId) + (1|Country), data = long)
summary(ranef_mod)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Pleasant ~ 1 + (1 | ResponseId) + (1 | Country)
Data: long
REML criterion at convergence: 14298.6
Scaled residuals:
Min 1Q Median 3Q Max
-5.5604 -0.5954 0.0270 0.5732 3.7892
Random effects:
Groups Name Variance Std.Dev.
ResponseId (Intercept) 1.8914 1.3753
Country (Intercept) 0.6705 0.8188
Residual 2.0839 1.4436
Number of obs: 3787, groups: ResponseId, 216; Country, 58
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.3532 0.1454 146.8264 43.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Now we’ll add a fixed effect - let’s try Gender. Because there are only 4 non-binary folks, we’ll exclude them from analysis.
gender_mod <- lmer(Pleasant ~ gender + (1|ResponseId) + (1|Country), data = filter(long, gender != "Non-binary"))
summary(gender_mod)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Pleasant ~ gender + (1 | ResponseId) + (1 | Country)
Data: filter(long, gender != "Non-binary")
REML criterion at convergence: 13998.6
Scaled residuals:
Min 1Q Median 3Q Max
-3.8789 -0.5999 0.0343 0.5746 3.7864
Random effects:
Groups Name Variance Std.Dev.
ResponseId (Intercept) 1.8793 1.3709
Country (Intercept) 0.6838 0.8269
Residual 2.0659 1.4373
Number of obs: 3715, groups: ResponseId, 212; Country, 58
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.3782 0.1542 166.6562 41.368 <2e-16 ***
genderMale -0.2504 0.2463 209.6565 -1.016 0.311
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
genderMale -0.310
It doesn’t look like gender predicts average pleasantness ratings. But we might want to try treating Gender as a random slope. It can’t go with listener, because listeners don’t vary in their gender during the study. But country means could vary according to gender - maybe the gender slope is different from country to country.
gender_mod2 <- lmer(Pleasant ~ gender + (1|ResponseId) + (1+gender|Country), data = filter(long, gender != "Non-binary"))
boundary (singular) fit: see ?isSingular
summary(gender_mod2)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Pleasant ~ gender + (1 | ResponseId) + (1 + gender | Country)
Data: filter(long, gender != "Non-binary")
REML criterion at convergence: 13997.1
Scaled residuals:
Min 1Q Median 3Q Max
-3.8681 -0.6011 0.0328 0.5737 3.7561
Random effects:
Groups Name Variance Std.Dev. Corr
ResponseId (Intercept) 1.878815 1.37070
Country (Intercept) 0.661399 0.81326
genderMale 0.004256 0.06524 1.00
Residual 2.065054 1.43703
Number of obs: 3715, groups: ResponseId, 212; Country, 58
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.3812 0.1529 167.0384 41.74 <2e-16 ***
genderMale -0.2663 0.2465 210.2119 -1.08 0.281
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
genderMale -0.288
optimizer (nloptwrap) convergence code: 0 (OK)
boundary (singular) fit: see ?isSingular
We have a warning message here, which basically means that one of our random effects is nearly 0. In this case, it’s the random slope for gender - the variation is (relatively) microscopic. So we probably do not need that random slope.
Let’s play a bit with some other variables now. We’ll look at current
state of residence, state, and just include people located
in California and Hawaii.
state_mod <- lmer(Pleasant ~ state + (1|ResponseId) + (1+state|Country), data = filter(long, state %in% c("California", "Hawaii")))
summary(state_mod)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Pleasant ~ state + (1 | ResponseId) + (1 + state | Country)
Data: filter(long, state %in% c("California", "Hawaii"))
REML criterion at convergence: 13218.5
Scaled residuals:
Min 1Q Median 3Q Max
-5.6828 -0.5972 0.0076 0.5627 3.8316
Random effects:
Groups Name Variance Std.Dev. Corr
ResponseId (Intercept) 1.9516 1.3970
Country (Intercept) 0.6345 0.7965
stateHawaii 0.1225 0.3501 0.10
Residual 2.0182 1.4206
Number of obs: 3519, groups: ResponseId, 201; Country, 58
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.4365 0.1566 180.3935 41.10 <2e-16 ***
stateHawaii -0.2866 0.2493 209.6406 -1.15 0.252
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
stateHawaii -0.334
Here we can see that although the fixed effect of state was not significant, there was variation in the random slopes of state for each country.
Another model, this time related to race/ethnicity, which has been dummy-coded:
race_mod <- lmer(Pleasant ~ White + Asian + Hispanic_Latinx + (1|ResponseId) + (1|Country), data = long)
summary(race_mod)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Pleasant ~ White + Asian + Hispanic_Latinx + (1 | ResponseId) +
(1 | Country)
Data: long
REML criterion at convergence: 14299.5
Scaled residuals:
Min 1Q Median 3Q Max
-5.5557 -0.5950 0.0279 0.5725 3.8009
Random effects:
Groups Name Variance Std.Dev.
ResponseId (Intercept) 1.904 1.3800
Country (Intercept) 0.670 0.8185
Residual 2.084 1.4436
Number of obs: 3787, groups: ResponseId, 216; Country, 58
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.38743 0.27334 260.42896 23.368 <2e-16 ***
White 0.04782 0.25484 210.92265 0.188 0.851
Asian -0.13738 0.26512 210.94361 -0.518 0.605
Hispanic_Latinx -0.32399 0.31987 211.30121 -1.013 0.312
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) White Asian
White -0.812
Asian -0.602 0.527
Hispnc_Ltnx -0.274 0.184 0.116
Not much going on with race/ethnicity in terms of ratings assigned to countries for pleasantness.
Quick check: R2
r.squaredGLMM(ranef_mod)
Warning: 'r.squaredGLMM' now calculates a revised statistic. See the help page.
R2m R2c
[1,] 0 0.551444
r.squaredGLMM(gender_mod)
R2m R2c
[1,] 0.002114896 0.5546478
r.squaredGLMM(state_mod)
R2m R2c
[1,] 0.003092785 0.566779
Maybe someone’s status as an English speaker can predict how pleasant they tend to view global English varieties.
long <- long %>% mutate(ENG_status = case_when(grepl("^Bilingual", ENG_speaker_status) == T ~ "Multilingual",
grepl("^Dominant", ENG_speaker_status) == T ~ "EnglishPlus",
grepl("^Monolingual", ENG_speaker_status) == T ~ "Monolingual",
grepl("^Multilingual", ENG_speaker_status) == T ~ "Multilingual"))
and now a model…
eng_mod <- lmer(Pleasant ~ ENG_status + (1|ResponseId) + (1+ENG_status|Country), data = long)
summary(eng_mod)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Pleasant ~ ENG_status + (1 | ResponseId) + (1 + ENG_status | Country)
Data: long
REML criterion at convergence: 14189
Scaled residuals:
Min 1Q Median 3Q Max
-5.5938 -0.5924 0.0264 0.5730 3.8388
Random effects:
Groups Name Variance Std.Dev. Corr
ResponseId (Intercept) 1.87317 1.3686
Country (Intercept) 0.63955 0.7997
ENG_statusMonolingual 0.01209 0.1099 0.11
ENG_statusMultilingual 0.11198 0.3346 -0.02 0.99
Residual 2.05341 1.4330
Number of obs: 3769, groups: ResponseId, 215; Country, 58
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 6.3167 0.2097 240.7209 30.119 <2e-16 ***
ENG_statusMonolingual 0.1345 0.2198 210.3335 0.612 0.541
ENG_statusMultilingual -0.2940 0.3493 204.5057 -0.842 0.401
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr) ENG_sttsMn
ENG_sttsMnl -0.707
ENG_sttsMlt -0.451 0.438
And nothing!
Let’s try one more…
fam_mod <- lmer(Pleasant ~ Familiar + (1|ResponseId) + (1|Country), data = long)
summary(fam_mod)
Linear mixed model fit by REML. t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: Pleasant ~ Familiar + (1 | ResponseId) + (1 | Country)
Data: long
REML criterion at convergence: 13511.8
Scaled residuals:
Min 1Q Median 3Q Max
-5.6538 -0.5764 0.0278 0.5575 4.3914
Random effects:
Groups Name Variance Std.Dev.
ResponseId (Intercept) 1.5313 1.2375
Country (Intercept) 0.2599 0.5098
Residual 1.7245 1.3132
Number of obs: 3778, groups: ResponseId, 215; Country, 58
Fixed effects:
Estimate Std. Error df t value Pr(>|t|)
(Intercept) 4.598e+00 1.259e-01 3.208e+02 36.52 <2e-16 ***
Familiar 3.509e-01 1.199e-02 3.488e+03 29.27 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
(Intr)
Familiar -0.475
So familiarity and pleasantness seem to go together…
r.squaredGLMM(fam_mod)
R2m R2c
[1,] 0.2092887 0.6121362
The fixed effect (familiarity) explains about 21% of the variation in pleasantness ratings.
The random effects add another 40% (!) of variance explained. So we can see that a larger portion of the variation is attributable to country and listener level differences compared to the amount that can be explained by a listener’s familiarity with English from a country.