Load the following packages and data (download the data from the OSF
link provided via email). If you need to install any packages, do so
with the install.packages() function or the
Packages tab in the lower-right window in RStudio.
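If you're not sure which packages you already have, a quick check like the one below can help. This is just a minimal sketch: the object names (pkgs, installed, missing_pkgs) are my own, and it only prints a suggested install.packages() call rather than installing anything for you.

```r
# Packages loaded in this tutorial
pkgs <- c("tidyverse", "broom", "car", "sjPlot")

# requireNamespace() returns FALSE (instead of erroring) when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  # Printed rather than run, so nothing gets installed without your say-so
  message("Run: install.packages(c(",
          paste0('"', missing_pkgs, '"', collapse = ", "), "))")
} else {
  message("All required packages are installed.")
}
```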
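We also create a Korean (“ko”) locale with locale("ko"), as some Korean transcriptions are included in the data file. Note that this call only builds and prints a readr locale object, and it changes nothing outside the current R session; to apply it when reading a file, pass it to the locale argument of read_tsv().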
library(tidyverse)
library(broom)
library(car)
library(sjPlot)
locale("ko")
<locale>
Numbers: 123,456.78
Formats: %AD / %AT
Timezone: UTC
Encoding: UTF-8
<date_names>
Days: 일요일 (일), 월요일 (월), 화요일 (화), 수요일 (수), 목요일 (목), 금요일 (금), 토요일 (토)
Months: 1월, 2월, 3월, 4월, 5월, 6월, 7월, 8월, 9월, 10월, 11월, 12월
AM/PM: 오전/오후
# read compiled data
d <- read_tsv("comp_acc_correlate_data.tsv")
Rows: 198 Columns: 63
── Column specification ───────────────────────────────────────────────────────────────────────────────────────
Delimiter: "\t"
chr (5): Speaker, Gender, L1, transcript, trans_clean
dbl (58): EIT, Comprehensibility, Accentedness, SA_Comprehensibility, SA_Accentedness, Satisfaction, Value,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
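Since locale("ko") on its own only builds a locale object, one way to make sure it is actually used is to hand it to read_tsv() through its locale argument. The sketch below does this on a tiny temporary file; the file contents and the names tmp, ko, and demo are made up for illustration and are not part of the real data. Keep in mind that readr already assumes UTF-8 by default, so the Korean locale mainly matters for things like date names.

```r
library(readr)

# A throwaway two-row TSV with Korean text (invented for this example)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Speaker\ttranscript",
             "S01\t안녕하세요",
             "S02\t감사합니다"),
           tmp, useBytes = TRUE)

# Build the locale once, then pass it explicitly when reading
ko <- locale("ko")
demo <- read_tsv(tmp, locale = ko, show_col_types = FALSE)

demo
```

Setting show_col_types = FALSE also quiets the column specification message shown above.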
The variable names we use for analyses don’t always look the prettiest in plots or tables. This bit of code takes a manual approach: it creates a set of ‘clean’ variable labels that can be mapped onto our data. I recommend just copying and pasting this chunk of code.
# labels
speech_vars <- c("substitution_rate", "syllstr_error_rate", "pitch_CV",
                 "inton_error_rate",
                 "gram_error_rate", "eojeol_per_clause", "clauses_per_ASunit",
                 "lex_error_rate", "R", "speech_rate", "sp_rate",
                 "fp_rate", "repair_rate")
var_labels <- c("Substitution Rate", "Syllable Error Rate", "Pitch Variation (CV)",
                "Intonation Error Rate",
                "Morphosyntactic Error Rate", "Eojeol/clause", "Clauses/AS-unit",
                "Lexical Error Rate", "Guiraud's Index", "Speech Rate", "Silent Pause Rate",
                "Filled Pause Rate", "Repair Rate")
speech_var_tbl <- tibble(speech_vars, var_labels)
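Because the labels are matched to the variable names purely by position, it can be worth checking that the two vectors line up. Here is a minimal base R sketch of the same idea using a named vector as a lookup; demo_vars, demo_labels, and label_lookup are hypothetical names, and only three of the variables are included to keep it short.

```r
# A small subset of the variable names and labels, for illustration only
demo_vars   <- c("substitution_rate", "pitch_CV", "sp_rate")
demo_labels <- c("Substitution Rate", "Pitch Variation (CV)", "Silent Pause Rate")

# Guard against the two vectors drifting out of sync
stopifnot(length(demo_vars) == length(demo_labels),
          !anyDuplicated(demo_vars))

# Named vector: names are the raw variable names, values are the display labels
label_lookup <- setNames(demo_labels, demo_vars)

# Labels can then be looked up in any order
unname(label_lookup[c("sp_rate", "pitch_CV")])
```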
We’ll start with a very simple model: essentially a null, intercept-only model.
cm0 <- lm(scale(Comprehensibility) ~ 1, data = d)
We can look at the results using the command
summary(cm0). The plot(cm0) command will allow
us to view diagnostic plots; it’s also a good idea to look at a
histogram of the model residuals.
hist(resid(cm0))
These residuals look mostly normal, though somewhat left-skewed. This is probably not a big problem; to an extent, it just reflects the mean falling above the midpoint of the possible score range.
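To make the null model a bit more concrete: an intercept-only model simply estimates the mean of the outcome, so after scale() the intercept is essentially zero and the residual standard deviation is 1. Here is a minimal sketch using the built-in mtcars data as a stand-in (mpg plays the role of Comprehensibility; the object names m0_raw and m0_std are just for illustration). It also draws a normal Q-Q plot, which can be easier to read than a histogram when checking for skew.

```r
# mtcars ships with R and stands in for the tutorial data here
m0_raw <- lm(mpg ~ 1, data = mtcars)
m0_std <- lm(scale(mpg) ~ 1, data = mtcars)

coef(m0_raw)[["(Intercept)"]]  # same value as mean(mtcars$mpg)
coef(m0_std)[["(Intercept)"]]  # essentially zero after standardizing
sigma(m0_std)                  # residual SD on the standardized scale

# A normal Q-Q plot complements the histogram when judging skew
qqnorm(resid(m0_std))
qqline(resid(m0_std))
```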
For this project, my colleagues and I had selected a number of variables that could influence listener perceptions of comprehensibility and/or accentedness, according to the literature (mostly on other languages/L2s). After looking at bivariate correlations, we focus on a model that includes all of our speech stream variables.
cm1_r <- lm(Comprehensibility ~ substitution_rate + syllstr_error_rate +
              pitch_CV + inton_error_rate +
              gram_error_rate + eojeol_per_clause +
              clauses_per_ASunit + lex_error_rate + R +
              speech_rate + sp_rate + fp_rate + repair_rate, data = d)
In the code below, I’m using the scale() function to
standardize all variables. This produces standardized coefficients,
which are more directly comparable.
cm1 <- lm(scale(Comprehensibility) ~ scale(substitution_rate) + scale(syllstr_error_rate) +
            scale(pitch_CV) + scale(inton_error_rate) +
            scale(gram_error_rate) + scale(eojeol_per_clause) +
            scale(clauses_per_ASunit) + scale(lex_error_rate) + scale(R) +
            scale(speech_rate) + scale(sp_rate) + scale(fp_rate) + scale(repair_rate), data = d)
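If it helps to see what scale() is doing here, a standardized coefficient is just the raw coefficient multiplied by sd(x) / sd(y). The sketch below checks this on the built-in mtcars data, which stands in for our data; the variables mpg, wt, and hp and the object names m_raw, m_std, and by_hand are only for illustration.

```r
# Raw-scale and standardized versions of the same model
m_raw <- lm(mpg ~ wt + hp, data = mtcars)
m_std <- lm(scale(mpg) ~ scale(wt) + scale(hp), data = mtcars)

# Standardized beta = raw coefficient * sd(predictor) / sd(outcome)
by_hand <- coef(m_raw)[c("wt", "hp")] *
  c(sd(mtcars$wt), sd(mtcars$hp)) / sd(mtcars$mpg)

# Side by side: the two columns should agree
cbind(by_hand, from_scale = coef(m_std)[-1])
```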
Again, we can use summary() to quickly review results
and plot() to check diagnostics for cm1. We
can also take a look at the residuals:
hist(resid(cm1))
These residuals actually look a bit nicer!
We should also check the collinearity of predictor variables. We can do this by examining Variance Inflation Factors (VIF).
vif(cm1)
scale(substitution_rate) scale(syllstr_error_rate) scale(pitch_CV) scale(inton_error_rate)
1.573087 1.218167 1.083517 1.436281
scale(gram_error_rate) scale(eojeol_per_clause) scale(clauses_per_ASunit) scale(lex_error_rate)
1.165567 1.374712 1.196704 1.245655
scale(R) scale(speech_rate) scale(sp_rate) scale(fp_rate)
2.120915 2.526126 1.278925 1.111976
scale(repair_rate)
1.269782
A common rule of thumb treats VIF values under 4 as acceptable; all of ours fall below 3, so we’re in good shape.
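For intuition, the VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all of the others. The base R sketch below computes this by hand on the built-in mtcars data (a stand-in for our data; X, vif_by_hand, and vif_from_cor are hypothetical names). For models like ours, where every term is a single numeric predictor, I would expect this to match what vif() reports, but treat it as an illustration rather than a replacement.

```r
# Three correlated predictors from a built-in dataset
X <- mtcars[, c("wt", "hp", "disp")]

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest
vif_by_hand <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  r2 <- summary(lm(f, data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: the diagonal of the inverse correlation matrix
vif_from_cor <- diag(solve(cor(X)))

round(rbind(vif_by_hand, vif_from_cor), 3)
```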
The sjPlot package gives us some nice tools for
formatting tables:
tab_model(cm1)
Warning: Argument 'df_method' is deprecated. Please use 'ci_method' instead.
Let’s do some plotting of this model to allow some visual comparisons. First, getting nicer variable names:
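cm1_coefs <- tidy(cm1, conf.int = TRUE) %>%
  # keep only the scale(...) terms, which drops the intercept
  filter(grepl("^scale\\(", term)) %>%
  # strip the leading "scale(" and the trailing ")" to recover the variable name
  mutate(speech_var = str_sub(term, start = 7L, end = -2L)) %>%
  left_join(speech_var_tbl, by = c("speech_var" = "speech_vars"))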
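The name extraction above relies on character positions: "scale(" is six characters long, so the variable name starts at position 7 and ends one character before the closing parenthesis. Here is a small base R sketch that makes the indexing visible and compares it with a pattern-based alternative; term_names and the other object names are made up for illustration.

```r
# A few term names of the kind tidy() returns for a model with scale() terms
term_names <- c("(Intercept)", "scale(substitution_rate)",
                "scale(pitch_CV)", "scale(R)")

# Keep only terms that are literally wrapped in scale( ... )
scaled <- term_names[grepl("^scale\\(", term_names)]

# Position-based: drop the first 6 characters ("scale(") and the final ")"
by_position <- substr(scaled, 7, nchar(scaled) - 1)

# Pattern-based: capture whatever sits between "scale(" and the closing ")"
by_pattern <- sub("^scale\\((.*)\\)$", "\\1", scaled)

data.frame(term = scaled, by_position, by_pattern)
```

The pattern-based version does not depend on the length of the wrapper, so it would be easier to adapt if a different transformation were used.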
And now a plot:
cm1_coefs %>%
  ggplot(aes(x = reorder(var_labels, estimate), y = estimate)) +
  geom_point(size = 5) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .2) +
  geom_hline(yintercept = 0, linetype = 2) +
  scale_y_continuous(breaks = c(-.3, -.2, -.1, 0, .1, .2, .3, .4, .5)) +
  labs(x = NULL, y = 'Standardized Estimate') +
  coord_flip() +
  theme_bw() +
  theme(axis.text = element_text(size = 18))