#{r setup, include=FALSE} #knitr::opts_chunk$set(echo = TRUE) #
We load a couple of libraries: mgcv is the main library for fitting GAMs; itsadug is mainly for plotting (and has a number of other convenience functions). The tidyverse is for everything else!
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0 ✔ purrr 0.3.5
✔ tibble 3.1.8 ✔ dplyr 1.0.10
✔ tidyr 1.2.1 ✔ stringr 1.5.0
✔ readr 2.1.3 ✔ forcats 0.5.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
library(mgcv)
Loading required package: nlme
Attaching package: 'nlme'
The following object is masked from 'package:dplyr':
collapse
This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.
library(itsadug)
Loading required package: plotfunctions
Attaching package: 'plotfunctions'
The following object is masked from 'package:ggplot2':
alpha
Loaded package itsadug 2.4 (see 'help("itsadug")' ).
Data import.
price_bin <- readRDS("price_bin.rds")
price_bin_older <- price_bin %>%
filter(age=="older")
Here’s the GAM we looked at in the slides.
price_bin_older_gam <-
bam(f2 ~ s(measurement_no, bs="cr", k=11),
data=price_bin_older)
Here’s the summary that we went through:
summary(price_bin_older_gam)
Family: gaussian
Link function: identity
Formula:
f2 ~ s(measurement_no, bs = "cr", k = 11)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1538.289 1.338 1149 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(measurement_no) 8.053 9.22 588.4 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.161 Deviance explained = 16.1%
fREML = 1.9393e+05 Scale est. = 50787 n = 28365
And here’s a simple plot of the smooth.
plot_smooth(price_bin_older_gam, view="measurement_no")
Summary:
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
I’d like to invite you to explore this smooth: what happens to the smooth if you change the value of k – e.g. increase it or decrease it? Does the code still work? What does the resulting smooth look like when you plot it using plot_smooth()? An example is given below.
price_bin_older_gam_k6 <-
bam(f2 ~ s(measurement_no, bs="cr", k=6),
data=price_bin_older)
plot_smooth(price_bin_older_gam_k6, view="measurement_no")
Summary:
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
You can also try using different types of basis functions. One of them can be accessed using bs=“tp” (a thin-plate smooth); another one using bs=“ps”. Do these change the shape of the smooth? Again, an example is given below.
price_bin_older_gam_tp6 <-
bam(f2 ~ s(measurement_no, bs="tp", k=6),
data=price_bin_older)
plot_smooth(price_bin_older_gam_tp6, view="measurement_no")
Summary:
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
And now we look at the full data set, capturing the difference between the older vs. younger groups using a difference smooth. Here’s the model that we explored in the slides:
price_bin$age_o <- as.ordered(price_bin$age)
contrasts(price_bin$age_o) <- "contr.treatment"
price_bin_gam <-
bam(f2 ~ age_o +
s(measurement_no, bs="cr", k=11) +
s(measurement_no, bs="cr", k=11, by=age_o),
data=price_bin)
The model summary.
summary(price_bin_gam)
Family: gaussian
Link function: identity
Formula:
f2 ~ age_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no,
bs = "cr", k = 11, by = age_o)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1539.117 1.329 1158.04 <2e-16 ***
age_oyounger -119.471 2.096 -56.99 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(measurement_no) 8.518 9.457 584.32 <2e-16 ***
s(measurement_no):age_oyounger 5.895 7.130 73.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.253 Deviance explained = 25.3%
fREML = 3.2386e+05 Scale est. = 50069 n = 47419
And a plot!
plot_smooth(price_bin_gam, view="measurement_no", plot_all="age_o")
Summary:
* age_o : factor; set to the value(s): older, younger.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
It’s also possible to plot the difference smooth alone.
plot_diff(price_bin_gam, view="measurement_no",
comp=list(age_o=c("older","younger"))
)
Summary:
* measurement_no : numeric predictor; with 100 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
measurement_no window(s) of significant difference(s):
0.000000 - 100.000000
Your turn: the data set also contains a variable called following_voiceless, which captures the voicing of the following segment. We expect that this vowel will be realised differently when followed by a voiceless segment; but is that the case? Set up following_voiceless as an ordered factor with treatment coding (like we did for age_o above), and then fit a model with a difference smooth (again, analogous to the one above). Plot the results!
price_bin$following_voiceless <- as.ordered(price_bin$following_voiceless)
contrasts(price_bin$following_voiceless) <- "contr.treatment"
price_bin_gam <-
bam(f2 ~ following_voiceless +
s(measurement_no, bs="cr", k=11) +
s(measurement_no, bs="cr", k=11, by=following_voiceless),
data=price_bin)
plot_smooth(price_bin_gam, view="measurement_no", plot_all="following_voiceless")
Summary:
* following_voiceless : factor; set to the value(s): FALSE, TRUE.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
We can plot autocorrelation in the residuals as follows.
acf(resid_gam(price_bin_gam), lag.max=10)
Here’s how to include an AR1 error model in a GAM.
# first, we need to have an indicator variable
# that tells our gam where each trajectory
# starts; also, the data set has to be set up
# so that adjacent measurements are also
# adjacent within the data set (which is already
# the case here)
price_bin <- price_bin %>%
group_by(id) %>%
mutate(traj_start=measurement_no == min(measurement_no)) %>%
ungroup()
# we obtain the autocorrelation at lag 1 within
# our data set
rho_est <- start_value_rho(price_bin_gam)
# we run the same model, but with two extra parameters:
# - AR.start is the indicator variable that shows
# the start of each trajectory in the data set
# - rho is roughly the degree of autocorrelation we
# wish to remove
price_bin_gam_AR <-
bam(f2 ~ age_o +
s(measurement_no, bs="cr", k=11) +
s(measurement_no, bs="cr", k=11, by=age_o),
data=price_bin,
AR.start=price_bin$traj_start, rho=rho_est)
summary(price_bin_gam_AR)
Family: gaussian
Link function: identity
Formula:
f2 ~ age_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no,
bs = "cr", k = 11, by = age_o)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1540.080 2.451 628.41 <2e-16 ***
age_oyounger -120.532 3.868 -31.16 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(measurement_no) 9.536 9.912 338.42 <2e-16 ***
s(measurement_no):age_oyounger 7.483 8.882 61.82 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.253 Deviance explained = 25.3%
fREML = 3.0424e+05 Scale est. = 40453 n = 47419
Comparing the two models (without vs. with AR1) graphically.
plot_smooth(price_bin_gam, view="measurement_no", plot_all="age_o")
Warning in plot_smooth(price_bin_gam, view = "measurement_no", plot_all =
"age_o"): age_o (specified in plot_all) is not a model predictor. Will be
ignored.
Summary:
* following_voiceless : factor; set to the value(s): FALSE.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
plot_smooth(price_bin_gam_AR, view="measurement_no", plot_all="age_o")
Summary:
* age_o : factor; set to the value(s): older, younger.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* NOTE : No random effects in the model to cancel.
Now plotting the autocorrelation for this revised model. Note that resid_gam has to be used for models that include an AR1 error model.
acf(resid_gam(price_bin_gam_AR), lag.max=10)
Since we already know that random smooths are the best tool for the current case, we’ll focus on these here.
price_bin$speaker_f <- factor(price_bin$speaker)
price_bin_gam_rsmooth <- bam(f2 ~ age_o +
s(measurement_no, bs="cr", k=11) +
s(measurement_no, bs="cr", k=11, by=age_o) +
s(measurement_no, speaker_f, bs="fs", m=1, k=11),
data=price_bin,
AR.start=traj_start, rho=rho_est)
Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.
plot_smooth(price_bin_gam_rsmooth, view="measurement_no", plot_all="age_o", rm.ranef=T)
Summary:
* age_o : factor; set to the value(s): older, younger.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(measurement_no,speaker_f)
This model takes a while to fit! There is a more efficient implementation of the bam() function that you can enable by setting the value of the “discrete” parameter to TRUE:
price_bin_gam_rsmooth <- bam(f2 ~ age_o +
s(measurement_no, bs="cr", k=11) +
s(measurement_no, bs="cr", k=11, by=age_o) +
s(measurement_no, speaker_f, bs="fs", m=1, k=11),
data=price_bin,
AR.start=traj_start, rho=rho_est,
discrete=T)
Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.
plot_smooth(price_bin_gam_rsmooth, view="measurement_no", plot_all="age_o", rm.ranef=T)
Summary:
* age_o : factor; set to the value(s): older, younger.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(measurement_no,speaker_f)
A summary.
summary(price_bin_gam_rsmooth)
Family: gaussian
Link function: identity
Formula:
f2 ~ age_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no,
bs = "cr", k = 11, by = age_o) + s(measurement_no, speaker_f,
bs = "fs", m = 1, k = 11)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1584.29 25.41 62.356 < 2e-16 ***
age_oyounger -158.60 36.00 -4.406 1.06e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(measurement_no) 7.994 8.212 25.28 <2e-16 ***
s(measurement_no):age_oyounger 5.328 5.619 11.83 <2e-16 ***
s(measurement_no,speaker_f) 346.542 436.000 15.58 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.46 Deviance explained = 46.4%
fREML = 3.0194e+05 Scale est. = 36034 n = 47419
If you want to see the random smooths themselves… (select = 3 picks out the third smooth in the model – you can find the correct number by counting from the top of the smooth estimates part of the model summary; the random smooths are the third smooth for this specific model.)
inspect_random(price_bin_gam_rsmooth, select=3)
Summary:
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* speaker_f : factor with 40 values; set to the value(s): s-101, s-104, s-133, s-134, s-140, s-141, s-170, s-171, s-180, s-2, ...
There are two variables in the data set that code the previous and following phone. Here’s what they look like:
table(price_bin$previous)
_ b d f g h J k l m n none p P r
63 33 1056 630 2929 369 1035 968 776 8172 2870 4043 1639 708 10 7294
s S t T v w z
2068 71 5859 20 308 6458 40
table(price_bin$following)
_ { @ 5 b d D e f g h i
63 41 66 1932 31 145 4152 246 22 967 21 117 51
I k l m n none p Q r s t v w
500 5902 2756 4743 5342 3093 273 30 250 2115 10823 2458 10
z
1270
We typically use random effects for full words (or items / stimuli) rather than previous / following environment, but there are simply too many unique words in this data set, and specifying previous / following environments achieves essentially the same goal. So your task is to set up separate random smooths for both and add them to your model! Does this change your model output? (Remember: the grouping variable for random smooths must be a factor!)
price_bin$previous_f <- factor(price_bin$previous)
price_bin$following_f <- factor(price_bin$following)
price_bin_gam_rsmooth <- bam(f2 ~ age_o +
s(measurement_no, bs="cr", k=11) +
s(measurement_no, bs="cr", k=11, by=age_o) +
s(measurement_no, previous_f, bs="fs", m=1, k=11) +
s(measurement_no, following_f, bs="fs", m=1, k=11)+
s(measurement_no, speaker_f, bs="fs", m=1, k=11),
data=price_bin,
AR.start=traj_start, rho=rho_est,
discrete=T)
Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.
plot_smooth(price_bin_gam_rsmooth, view="measurement_no", plot_all="age_o", rm.ranef=T)
Summary:
* age_o : factor; set to the value(s): older, younger.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* previous_f : factor; set to the value(s): l. (Might be canceled as random effect, check below.)
* following_f : factor; set to the value(s): t. (Might be canceled as random effect, check below.)
* speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(measurement_no,previous_f),s(measurement_no,following_f),s(measurement_no,speaker_f)
Use inspect_smooth to look at the previous / following smooths. Do they look the way you’d expect them to?
#{r} #... #
If we want our p-values to be well-calibrated, we need a sort of a random slope equivalent for the within-speaker following voiceless effect, i.e. a random effect that allows the shape of the following voiceless effect to vary within speakers. Yikes!
In Sóskuthy (2021), I show that this should be done by simply mirroring the reference-difference smooth specification fixed effect smooths.
price_bin$foll_v_o <- as.ordered(price_bin$following_voiceless)
contrasts(price_bin$foll_v_o) <- "contr.treatment"
price_bin_gam_fv_rs <- bam(f2 ~ foll_v_o +
s(measurement_no, bs="cr", k=11) +
s(measurement_no, bs="cr", k=11, by=foll_v_o) +
s(measurement_no, speaker_f, bs="fs", m=1, k=11) +
s(measurement_no, speaker_f, by=foll_v_o, bs="fs", m=1, k=11),
data=price_bin,
AR.start=traj_start, rho=rho_est,
discrete=T)
Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.
Summary.
summary(price_bin_gam_fv_rs)
Family: gaussian
Link function: identity
Formula:
f2 ~ foll_v_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no,
bs = "cr", k = 11, by = foll_v_o) + s(measurement_no, speaker_f,
bs = "fs", m = 1, k = 11) + s(measurement_no, speaker_f,
by = foll_v_o, bs = "fs", m = 1, k = 11)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1492.882 22.188 67.284 < 2e-16 ***
foll_v_oTRUE 31.297 8.604 3.637 0.000276 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(measurement_no) 8.751 8.945 69.524 <2e-16 ***
s(measurement_no):foll_v_oTRUE 6.865 7.604 25.682 <2e-16 ***
s(measurement_no,speaker_f) 351.113 438.000 11.580 <2e-16 ***
s(measurement_no,speaker_f):foll_v_oTRUE 217.564 427.000 1.485 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.487 Deviance explained = 49.3%
fREML = 3.0151e+05 Scale est. = 35041 n = 47419
And a plot of the two smooths.
plot_smooth(price_bin_gam_fv_rs, view="measurement_no",
plot_all="foll_v_o", rm.ranef=T)
Summary:
* foll_v_o : factor; set to the value(s): FALSE, TRUE.
* measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000.
* speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(measurement_no,speaker_f),s(measurement_no,speaker_f):foll_v_oTRUE
You may notice something slightly odd about the plotted smooths: the confidence interval around the second one (following voiceless = TRUE) is a bit wider than the confidence interval around the first one. You are absolutely right! This is a known issue with plots of models with reference-difference smooths, and we’re working on sorting it out. Based on the simulations in Sóskuthy (2021), the model estimates should be OK – so this is likely an issue with the prediction function for GAMMs.