Loading some packages and data.
library(tidyverse)
library(broom)
library(emmeans)
icon <- read_csv('perry_winter_2017_iconicity.csv')
sim <- read_csv('winter_matlock_2013_similarity.csv')
lonely <- read_csv('sidhu&pexman_2017_iconicity.csv')
vinson_yelp <- read_csv('vinson_dale_2014_yelp.csv')
We’ll be looking at how sensory experience ratings (SER) predict iconicity for different classes of words.
NV <- filter(icon, POS %in% c('Noun', 'Verb'), !is.na(SER))
table(NV$POS)
Noun Verb
1106 373
As always, plot first, analyze later:
NV %>% ggplot(aes(x = SER, y = Iconicity, color = POS))+
geom_point()+
geom_smooth(method = 'lm')+
theme_bw()
or, a bit more clearly:
NV %>% ggplot(aes(x = SER, y = Iconicity, color = POS))+
geom_point()+
geom_smooth(method = 'lm')+
theme_bw()+
facet_wrap(~POS)
What we see is a rather substantial difference in slopes.
What happens if we don’t account for that difference?
First, a standard linear model without interaction:
NV_mdl <- lm(Iconicity ~ SER + POS, data = NV)
summary(NV_mdl)
Call:
lm(formula = Iconicity ~ SER + POS, data = NV)
Residuals:
Min 1Q Median 3Q Max
-3.0956 -0.7210 -0.1233 0.6146 3.6902
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.11935 0.10040 -1.189 0.235
SER 0.23319 0.02789 8.362 <2e-16 ***
POSVerb 0.60159 0.06388 9.418 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.045 on 1476 degrees of freedom
Multiple R-squared: 0.08226, Adjusted R-squared: 0.08102
F-statistic: 66.15 on 2 and 1476 DF, p-value: < 2.2e-16
We see that both variables are significant predictors, and the model explains a small amount of the total variation (though perhaps not bad for linguistic data like this).
Graphically, we can plot the predicted values over the empirical values.
NV_mdl_pred <- cbind(NV, predict(NV_mdl, interval = "confidence"))
NV_mdl_pred %>% ggplot(aes(x = SER, y = Iconicity, shape = POS, color = POS))+
geom_point(alpha = .4)+
geom_line(aes(x = SER, y = fit, color = POS)) +
geom_ribbon(aes(ymin=lwr,ymax=upr, fill = POS), alpha=0.3)+
theme_bw()
Now we’ll model the interaction:
NV_int_mdl <- lm(Iconicity ~ SER + POS + SER:POS, data = NV)
summary(NV_int_mdl)
Call:
lm(formula = Iconicity ~ SER + POS + SER:POS, data = NV)
Residuals:
Min 1Q Median 3Q Max
-2.8448 -0.7043 -0.1257 0.5864 3.5845
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.27394 0.11065 2.476 0.01341 *
SER 0.11817 0.03108 3.801 0.00015 ***
POSVerb -0.95542 0.20971 -4.556 5.65e-06 ***
SER:POSVerb 0.50838 0.06535 7.780 1.36e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025 on 1475 degrees of freedom
Multiple R-squared: 0.1184, Adjusted R-squared: 0.1166
F-statistic: 66.05 on 3 and 1475 DF, p-value: < 2.2e-16
The interaction is significant, and overall the model explains a fair amount more variance - up to almost 12%, or around 140-150% of the model w/o interaction. What’s a bit counter-intuitive here is that now POSVerb has a lower intercept - but that doesn’t mean the overall average of POSVerb is lower like it did in the first model. It just means that the intercept, after accounting for the difference in slopes, is now lower than the intercept for POSNoun.
Let’s plot this one, too:
NV_int_mdl_pred <- cbind(NV, predict(NV_int_mdl, interval = "confidence"))
NV_int_mdl_pred %>% ggplot(aes(x = SER, y = Iconicity, shape = POS, color = POS))+
geom_point(alpha = .4)+
geom_line(aes(x = SER, y = fit, color = POS)) +
geom_ribbon(aes(ymin=lwr,ymax=upr, fill = POS), alpha=0.3)+
theme_bw()
Let’s center things to get a slightly different picture/interpretation, though the overall explanatory power of the model will not change.
NV <- mutate(NV, SER_c = SER - mean(SER, na.rm = T)) #center
NV_int_mdl_c <- lm(Iconicity ~ SER_c*POS, data = NV) #model
summary(NV_int_mdl_c)
Call:
lm(formula = Iconicity ~ SER_c * POS, data = NV)
Residuals:
Min 1Q Median 3Q Max
-2.8448 -0.7043 -0.1257 0.5864 3.5845
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.66423 0.03102 21.414 < 2e-16 ***
SER_c 0.11817 0.03108 3.801 0.00015 ***
POSVerb 0.72371 0.06456 11.209 < 2e-16 ***
SER_c:POSVerb 0.50838 0.06535 7.780 1.36e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.025 on 1475 degrees of freedom
Multiple R-squared: 0.1184, Adjusted R-squared: 0.1166
F-statistic: 66.05 on 3 and 1475 DF, p-value: < 2.2e-16
Now POSVerb has a positive value - with the centered predictor, this means that at the average level of SER, POSVerb tend to have higher Iconicity ratings, and this interpretation seems to better characterize a pattern in the data.
Let’s also plot this model now:
NV_int_mdl_c_pred <- cbind(NV, predict(NV_int_mdl_c, interval = "confidence"))
NV_int_mdl_c_pred %>% ggplot(aes(x = SER_c, y = Iconicity, shape = POS, color = POS))+
geom_point(alpha = .4)+
geom_line(aes(x = SER_c, y = fit, color = POS)) +
geom_ribbon(aes(ymin=lwr,ymax=upr, fill = POS), alpha=0.3)+
geom_vline(xintercept = 0, linetype = 2)+
theme_bw()
Examining the data
table(sim$Phon, sim$Sem)
Different Similar
Different 91 86
Similar 97 90
OR
sim %>% count(Phon, Sem)
Let’s do the basic model, no interactions:
sim_mdl <- lm(Distance ~ Phon + Sem, data = sim)
summary(sim_mdl)
Call:
lm(formula = Distance ~ Phon + Sem, data = sim)
Residuals:
Min 1Q Median 3Q Max
-79.350 -26.453 -4.167 19.536 138.833
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 79.556 3.631 21.912 <2e-16 ***
PhonSimilar 5.795 4.181 1.386 0.1667
SemSimilar -10.184 4.181 -2.435 0.0154 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39.81 on 360 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.02148, Adjusted R-squared: 0.01604
F-statistic: 3.951 on 2 and 360 DF, p-value: 0.02008
Here, phonological similarity had no effect, but there was an effect for semantic similarity. But let’s try an interaction now:
sim_mdl_int <- lm(Distance ~ Phon*Sem, data = sim)
summary(sim_mdl_int)
Call:
lm(formula = Distance ~ Phon * Sem, data = sim)
Residuals:
Min 1Q Median 3Q Max
-80.42 -26.42 -4.40 19.60 139.99
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 78.400 4.201 18.663 <2e-16 ***
PhonSimilar 8.023 5.833 1.375 0.170
SemSimilar -7.819 6.010 -1.301 0.194
PhonSimilar:SemSimilar -4.593 8.375 -0.548 0.584
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39.85 on 359 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.0223, Adjusted R-squared: 0.01413
F-statistic: 2.729 on 3 and 359 DF, p-value: 0.04386
Notably here, now there are no significant predictors and the overall variance explained by the model is 1%. Not a great model! But what’s interesting is that there might be an additional reduction in distance when phonology and semantic properties of the word pairs are similar.
What’s useful here is generating predictions for the various combinations of categories. You can feed some ‘new data’ to a regression model and get predicted Y values. This can be handy for summarizing or illustrating the model predictions.
Phon <- rep(c('Different', 'Similar'), each = 2)
Sem <- rep(c('Different', 'Similar'), times = 2)
newdata <- tibble(Phon, Sem)
newdata
Now predict…
newdata$fit <- predict(sim_mdl_int, newdata)
newdata
Compute averages…
newdata %>% group_by(Sem) %>% summarise(distM = mean(fit))
Let’s try sum coding to get main effects:
sim <- sim %>% mutate(Phon_sum = factor(Phon),
Sem_sum = factor(Sem))
contrasts(sim$Phon_sum) <- contr.sum(2)
contrasts(sim$Sem_sum) <- contr.sum(2)
And now a new model…
summary(sum_mdl)
Call:
lm(formula = Distance ~ Phon_sum * Sem_sum, data = sim)
Residuals:
Min 1Q Median 3Q Max
-80.42 -26.42 -4.40 19.60 139.99
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 77.354 2.094 36.946 <2e-16 ***
Phon_sum1 -2.863 2.094 -1.367 0.1723
Sem_sum1 5.058 2.094 2.416 0.0162 *
Phon_sum1:Sem_sum1 -1.148 2.094 -0.548 0.5837
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 39.85 on 359 degrees of freedom
(1 observation deleted due to missingness)
Multiple R-squared: 0.0223, Adjusted R-squared: 0.01413
F-statistic: 2.729 on 3 and 359 DF, p-value: 0.04386
So this model results in a significant main effect for semantic similarity - this is hard to reconcile, but as Winter noted the sum-coded main effect is essentially twice as large as the simple effect (i.e., going from -1 to 1 rather than 0 to 1), which impacts the calculation of “how likely is this difference if the true difference is 0”.
sim_anova <- aov(Distance ~ Phon*Sem, data = sim)
summary(sim_anova)
Df Sum Sq Mean Sq F value Pr(>F)
Phon 1 3124 3124 1.967 0.1617
Sem 1 9402 9402 5.920 0.0155 *
Phon:Sem 1 478 478 0.301 0.5837
Residuals 359 570193 1588
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1 observation deleted due to missingness
Notice that the statistical significance aligns (roughly) with the
centered linear model above. You can do follow up with
emmeans.
sim_anova_emm
Phon Sem emmean SE df lower.CL upper.CL
Different Different 78.4 4.20 359 70.1 86.7
Similar Different 86.4 4.05 359 78.5 94.4
Different Similar 70.6 4.30 359 62.1 79.0
Similar Similar 74.0 4.20 359 65.7 82.3
Confidence level used: 0.95
and then…
pairs(sim_anova_emm)
contrast estimate SE df t.ratio p.value
Different Different - Similar Different -8.02 5.83 359 -1.375 0.5156
Different Different - Different Similar 7.82 6.01 359 1.301 0.5629
Different Different - Similar Similar 4.39 5.94 359 0.739 0.8814
Similar Different - Different Similar 15.84 5.90 359 2.684 0.0380
Similar Different - Similar Similar 12.41 5.83 359 2.128 0.1462
Different Similar - Similar Similar -3.43 6.01 359 -0.571 0.9408
P value adjustment: tukey method for comparing a family of 4 estimates
lonely <- filter(lonely, Iconicity >= 0)
lonely_mdl <- lm(Iconicity ~ SER*ARC, data = lonely)
summary(lonely_mdl)
Call:
lm(formula = Iconicity ~ SER * ARC, data = lonely)
Residuals:
Min 1Q Median 3Q Max
-2.0178 -0.6258 -0.1407 0.4645 3.2562
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.3601 0.3822 3.559 0.000385 ***
SER 0.3612 0.1079 3.346 0.000841 ***
ARC -0.7929 0.6419 -1.235 0.216956
SER:ARC -0.5255 0.1848 -2.844 0.004519 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.842 on 1385 degrees of freedom
Multiple R-squared: 0.148, Adjusted R-squared: 0.1462
F-statistic: 80.22 on 3 and 1385 DF, p-value: < 2.2e-16
And now with standardized predictors (but not yet outcome variable…):
lonely <- mutate(lonely, SER_z = scale(SER), ARC_z = scale(ARC))
lonely_mdl_z <- lm(Iconicity ~ SER_z*ARC_z, data = lonely)
summary(lonely_mdl_z)
Call:
lm(formula = Iconicity ~ SER_z * ARC_z, data = lonely)
Residuals:
Min 1Q Median 3Q Max
-2.0178 -0.6258 -0.1407 0.4645 3.2562
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.15565 0.02308 50.069 < 2e-16 ***
SER_z 0.07115 0.02331 3.052 0.00231 **
ARC_z -0.32426 0.02308 -14.048 < 2e-16 ***
SER_z:ARC_z -0.06775 0.02382 -2.844 0.00452 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.842 on 1385 degrees of freedom
Multiple R-squared: 0.148, Adjusted R-squared: 0.1462
F-statistic: 80.22 on 3 and 1385 DF, p-value: < 2.2e-16
Again, we run into this somewhat strange phenomenon of a single-predictor effect changing/becoming significant after a transformation. But what’s most important here is the interaction.
Coming up with a really good visualization of continuous*continuous
interaction effects is hard. I don’t like the 3D plane plots that Winter
uses - they’re really cumbersome to try and wrap your brain around. A
more suitable approach is to pick a few values for one continuous
variable (e.g., the mean, and then 1 SD above and 1 SD below) and plot
the other continuous variable against the dependent variable to
illustrate how the slope changes. We’ll use the
interactions package to do this quickly.
interact_plot(lonely_mdl, pred = SER, modx = ARC,
modx.values = "mean-plus-minus",
plot.points = T, interval = T,
colors = "Blues")
In this example, there’s a pretty clear non-linear relationship between number of stars in the Yelp review and the information density in the text of the review.
vinson_yelp %>% ggplot(aes(x = stars, y = across_uni_info))+
geom_jitter(alpha = .3, height = 0, width = .25)+
stat_summary(fun.y = "mean", fun.ymin = "mean", fun.ymax= "mean", size= 0.3, geom = "crossbar",
color = "red")+
geom_smooth()+
theme_bw()
Or, as Winter did…
vinson_yelp %>% group_by(stars) %>%
summarise(AUI_mean = mean(across_uni_info)) %>%
ggplot(aes(x = stars, y = AUI_mean))+
geom_line(lintype = 2)+
geom_point(size = 3)+
theme_bw()
Centering or other transformations can be helpful when working with polynomials. For instance, 3 stars appears to be the lowest information, but 3^2 is greater than 2^2, so we’ll want 3 stars to be at the bottom of the valley. We can do that by centering.
vinson_yelp <- vinson_yelp %>% mutate(stars_c = stars - mean(stars),
stars_c2 = stars_c^2)
Now to model…
yelp_mdl <- lm(across_uni_info ~ stars_c + stars_c2, data = vinson_yelp)
summary(yelp_mdl)
Call:
lm(formula = across_uni_info ~ stars_c + stars_c2, data = vinson_yelp)
Residuals:
Min 1Q Median 3Q Max
-1.34147 -0.31699 -0.04251 0.27335 2.47805
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.691281 0.006463 1499.44 <2e-16 ***
stars_c 0.042908 0.004710 9.11 <2e-16 ***
stars_c2 0.037363 0.003014 12.39 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4624 on 9997 degrees of freedom
Multiple R-squared: 0.01558, Adjusted R-squared: 0.01538
F-statistic: 79.08 on 2 and 9997 DF, p-value: < 2.2e-16
And now for some predictions and plots…
yelp_preds <- tibble(stars_c = sort(unique(vinson_yelp$stars_c)))
yelp_preds <- mutate(yelp_preds, stars_c2 = stars_c^2)
yelp_preds$fit <- predict(yelp_mdl, yelp_preds)
yelp_preds
Plot:
yelp_preds %>% ggplot(aes(x = stars_c, y = fit))+
geom_point(size = 3)+
geom_line(linetype = 2)+
theme_bw()
While this illustrates the trend, we don’t have a great sense of how large the effect is or how strongly the ratings are characterized by this pattern. It’s generally a good idea for plots to include the whole range of possible values, or at least the whole range of observed values…
yelp_preds %>% ggplot(aes(x = stars_c, y = fit))+
geom_point(size = 3)+
geom_line(linetype = 2)+
scale_y_continuous(limits = c(min(vinson_yelp$across_uni_info), max(vinson_yelp$across_uni_info)))+
theme_bw()
So this looks a bit different! It also lines up a bit more with the very modest R2 value from the regression model.
Such a finding might be theoretically interesting or have some practical impact, but that requires some more thinking and understanding of the constructs and context.