#{r setup, include=FALSE} #knitr::opts_chunk$set(echo = TRUE) #

Prep

We load a couple of libraries: mgcv is the main library for fitting GAMs; itsadug is mainly for plotting (and has a number of other convenience functions). The tidyverse is for everything else!

library(tidyverse)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.5.0 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(mgcv)

Loading required package: nlme

Attaching package: 'nlme'

The following object is masked from 'package:dplyr':

    collapse

This is mgcv 1.8-41. For overview type 'help("mgcv-package")'.

library(itsadug)

Loading required package: plotfunctions

Attaching package: 'plotfunctions'

The following object is masked from 'package:ggplot2':

    alpha

Loaded package itsadug 2.4 (see 'help("itsadug")' ).

Data import.

price_bin <- readRDS("price_bin.rds")
price_bin_older <- price_bin %>%
  filter(age=="older")

1. New Zealand diphthong data: a simple model

Here’s the GAM we looked at in the slides.

price_bin_older_gam <- 
  bam(f2 ~ s(measurement_no, bs="cr", k=11),
      data=price_bin_older)

Here’s the summary that we went through:

summary(price_bin_older_gam)


Family: gaussian 
Link function: identity 

Formula:
f2 ~ s(measurement_no, bs = "cr", k = 11)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 1538.289      1.338    1149   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                    edf Ref.df     F p-value    
s(measurement_no) 8.053   9.22 588.4  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.161   Deviance explained = 16.1%
fREML = 1.9393e+05  Scale est. = 50787     n = 28365

And here’s a simple plot of the smooth.

plot_smooth(price_bin_older_gam, view="measurement_no")

Summary:
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.

I’d like to invite you to explore this smooth: what happens to the smooth if you change the value of k – e.g. increase it or decrease it? Does the code still work? What does the resulting smooth look like when you plot it using plot_smooth()? An example is given below.

price_bin_older_gam_k6 <- 
  bam(f2 ~ s(measurement_no, bs="cr", k=6),
      data=price_bin_older)
plot_smooth(price_bin_older_gam_k6, view="measurement_no")

Summary:
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.

You can also try using different types of basis functions. One of them can be accessed using bs=“tp” (a thin-plate smooth); another one using bs=“ps”. Do these change the shape of the smooth? Again, an example is given below.

price_bin_older_gam_tp6 <- 
  bam(f2 ~ s(measurement_no, bs="tp", k=6),
      data=price_bin_older)
plot_smooth(price_bin_older_gam_tp6, view="measurement_no")

Summary:
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.

2. Capturing the difference between two groups

And now we look at the full data set, capturing the difference between the older vs. younger groups using a difference smooth. Here’s the model that we explored in the slides:

price_bin$age_o <- as.ordered(price_bin$age)
contrasts(price_bin$age_o) <- "contr.treatment"

price_bin_gam <- 
  bam(f2 ~ age_o +
           s(measurement_no, bs="cr", k=11) +
           s(measurement_no, bs="cr", k=11, by=age_o),
      data=price_bin)

The model summary.

summary(price_bin_gam)


Family: gaussian 
Link function: identity 

Formula:
f2 ~ age_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no, 
    bs = "cr", k = 11, by = age_o)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1539.117      1.329 1158.04   <2e-16 ***
age_oyounger -119.471      2.096  -56.99   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                 edf Ref.df      F p-value    
s(measurement_no)              8.518  9.457 584.32  <2e-16 ***
s(measurement_no):age_oyounger 5.895  7.130  73.17  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.253   Deviance explained = 25.3%
fREML = 3.2386e+05  Scale est. = 50069     n = 47419

And a plot!

plot_smooth(price_bin_gam, view="measurement_no", plot_all="age_o")

Summary:
    * age_o : factor; set to the value(s): older, younger. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.

It’s also possible to plot the difference smooth alone.

plot_diff(price_bin_gam, view="measurement_no",
          comp=list(age_o=c("older","younger"))
         )

Summary:
    * measurement_no : numeric predictor; with 100 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.


measurement_no window(s) of significant difference(s):
    0.000000 - 100.000000

Your turn: the data set also contains a variable called following_voiceless, which captures the voicing of the following segment. We expect that this vowel will be realised differently when followed by a voiceless segment; but is that the case? Set up following_voiceless as an ordered factor with treatment coding (like we did for age_o above), and then fit a model with a difference smooth (again, analogous to the one above). Plot the results!

price_bin$following_voiceless <- as.ordered(price_bin$following_voiceless)
contrasts(price_bin$following_voiceless) <- "contr.treatment"

price_bin_gam <- 
  bam(f2 ~ following_voiceless +
           s(measurement_no, bs="cr", k=11) +
           s(measurement_no, bs="cr", k=11, by=following_voiceless),
      data=price_bin)
plot_smooth(price_bin_gam, view="measurement_no", plot_all="following_voiceless")

Summary:
    * following_voiceless : factor; set to the value(s): FALSE, TRUE. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.

3. Autocorrelation in the New Zealand diphthong data

We can plot autocorrelation in the residuals as follows.

acf(resid_gam(price_bin_gam), lag.max=10)

Here’s how to include an AR1 error model in a GAM.

# first, we need to have an indicator variable
# that tells our gam where each trajectory
# starts; also, the data set has to be set up
# so that adjacent measurements are also
# adjacent within the data set (which is already
# the case here)
price_bin <- price_bin %>%
  group_by(id) %>%
  mutate(traj_start=measurement_no == min(measurement_no)) %>%
  ungroup()

# we obtain the autocorrelation at lag 1 within
# our data set
rho_est <- start_value_rho(price_bin_gam)

# we run the same model, but with two extra parameters:
# - AR.start is the indicator variable that shows
#   the start of each trajectory in the data set
# - rho is roughly the degree of autocorrelation we
#   wish to remove
price_bin_gam_AR <- 
  bam(f2 ~ age_o +
        s(measurement_no, bs="cr", k=11) +
        s(measurement_no, bs="cr", k=11, by=age_o),
      data=price_bin,
      AR.start=price_bin$traj_start, rho=rho_est)
summary(price_bin_gam_AR)


Family: gaussian 
Link function: identity 

Formula:
f2 ~ age_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no, 
    bs = "cr", k = 11, by = age_o)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1540.080      2.451  628.41   <2e-16 ***
age_oyounger -120.532      3.868  -31.16   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                 edf Ref.df      F p-value    
s(measurement_no)              9.536  9.912 338.42  <2e-16 ***
s(measurement_no):age_oyounger 7.483  8.882  61.82  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.253   Deviance explained = 25.3%
fREML = 3.0424e+05  Scale est. = 40453     n = 47419

Comparing the two models (without vs. with AR1) graphically.

plot_smooth(price_bin_gam, view="measurement_no", plot_all="age_o")

Warning in plot_smooth(price_bin_gam, view = "measurement_no", plot_all =
"age_o"): age_o (specified in plot_all) is not a model predictor. Will be
ignored.

Summary:
    * following_voiceless : factor; set to the value(s): FALSE. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.

plot_smooth(price_bin_gam_AR, view="measurement_no", plot_all="age_o")

Summary:
    * age_o : factor; set to the value(s): older, younger. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * NOTE : No random effects in the model to cancel.

Now plotting the autocorrelation for this revised model. Note that resid_gam has to be used for models that include an AR1 error model.

acf(resid_gam(price_bin_gam_AR), lag.max=10)

4. Random effects for the New Zealand diphthong data

Since we already know that random smooths are the best tool for the current case, we’ll focus on these here.

price_bin$speaker_f <- factor(price_bin$speaker)

price_bin_gam_rsmooth <- bam(f2 ~ age_o +
                       s(measurement_no, bs="cr", k=11) +
                       s(measurement_no, bs="cr", k=11, by=age_o) +
                       s(measurement_no, speaker_f, bs="fs", m=1, k=11),
                     data=price_bin,
                     AR.start=traj_start, rho=rho_est)

Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.

plot_smooth(price_bin_gam_rsmooth, view="measurement_no", plot_all="age_o", rm.ranef=T)

Summary:
    * age_o : factor; set to the value(s): older, younger. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(measurement_no,speaker_f)

This model takes a while to fit! There is a more efficient implementation of the bam() function that you can enable by setting the value of the “discrete” parameter to TRUE:

price_bin_gam_rsmooth <- bam(f2 ~ age_o +
                       s(measurement_no, bs="cr", k=11) +
                       s(measurement_no, bs="cr", k=11, by=age_o) +
                       s(measurement_no, speaker_f, bs="fs", m=1, k=11),
                     data=price_bin,
                     AR.start=traj_start, rho=rho_est,
                     discrete=T)

Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.

plot_smooth(price_bin_gam_rsmooth, view="measurement_no", plot_all="age_o", rm.ranef=T)

Summary:
    * age_o : factor; set to the value(s): older, younger. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(measurement_no,speaker_f)

A summary.

summary(price_bin_gam_rsmooth)


Family: gaussian 
Link function: identity 

Formula:
f2 ~ age_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no, 
    bs = "cr", k = 11, by = age_o) + s(measurement_no, speaker_f, 
    bs = "fs", m = 1, k = 11)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   1584.29      25.41  62.356  < 2e-16 ***
age_oyounger  -158.60      36.00  -4.406 1.06e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                   edf  Ref.df     F p-value    
s(measurement_no)                7.994   8.212 25.28  <2e-16 ***
s(measurement_no):age_oyounger   5.328   5.619 11.83  <2e-16 ***
s(measurement_no,speaker_f)    346.542 436.000 15.58  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =   0.46   Deviance explained = 46.4%
fREML = 3.0194e+05  Scale est. = 36034     n = 47419

If you want to see the random smooths themselves… (select = 3 picks out the third smooth in the model – you can find the correct number by counting from the top of the smooth estimates part of the model summary; the random smooths are the third smooth for this specific model.)

inspect_random(price_bin_gam_rsmooth, select=3)

Summary:
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * speaker_f : factor with 40 values; set to the value(s): s-101, s-104, s-133, s-134, s-140, s-141, s-170, s-171, s-180, s-2, ...

5. Exercise: adding random smooths for previous / following environment

There are two variables in the data set that code the previous and following phone. Here’s what they look like:

table(price_bin$previous)


        _    b    d    f    g    h    J    k    l    m    n none    p    P    r 
  63   33 1056  630 2929  369 1035  968  776 8172 2870 4043 1639  708   10 7294 
   s    S    t    T    v    w    z 
2068   71 5859   20  308 6458   40

table(price_bin$following)


          _     {     @     5     b     d     D     e     f     g     h     i 
   63    41    66  1932    31   145  4152   246    22   967    21   117    51 
    I     k     l     m     n  none     p     Q     r     s     t     v     w 
  500  5902  2756  4743  5342  3093   273    30   250  2115 10823  2458    10 
    z 
 1270

We typically use random effects for full words (or items / stimuli) rather than previous / following environment, but there are simply too many unique words in this data set, and specifying previous / following environments achieves essentially the same goal. So your task is to set up separate random smooths for both and add them to your model! Does this change your model output? (Remember: the grouping variable for random smooths must be a factor!)

price_bin$previous_f <- factor(price_bin$previous)
price_bin$following_f <- factor(price_bin$following)

price_bin_gam_rsmooth <- bam(f2 ~ age_o +
                      s(measurement_no, bs="cr", k=11) +
                      s(measurement_no, bs="cr", k=11, by=age_o) +
                      s(measurement_no, previous_f, bs="fs", m=1, k=11) +
                      s(measurement_no, following_f, bs="fs", m=1, k=11)+
                       s(measurement_no, speaker_f, bs="fs", m=1, k=11),
                     data=price_bin,
                     AR.start=traj_start, rho=rho_est,
                     discrete=T)

Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.

plot_smooth(price_bin_gam_rsmooth, view="measurement_no", plot_all="age_o", rm.ranef=T)

Summary:
    * age_o : factor; set to the value(s): older, younger. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * previous_f : factor; set to the value(s): l. (Might be canceled as random effect, check below.) 
    * following_f : factor; set to the value(s): t. (Might be canceled as random effect, check below.) 
    * speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(measurement_no,previous_f),s(measurement_no,following_f),s(measurement_no,speaker_f)

Use inspect_smooth to look at the previous / following smooths. Do they look the way you’d expect them to?

#{r} #... #

6. Random reference-difference smooths

If we want our p-values to be well-calibrated, we need a sort of a random slope equivalent for the within-speaker following voiceless effect, i.e. a random effect that allows the shape of the following voiceless effect to vary within speakers. Yikes!

In Sóskuthy (2021), I show that this should be done by simply mirroring the reference-difference smooth specification fixed effect smooths.

price_bin$foll_v_o <- as.ordered(price_bin$following_voiceless)
contrasts(price_bin$foll_v_o) <- "contr.treatment"

price_bin_gam_fv_rs <- bam(f2 ~ foll_v_o +
                       s(measurement_no, bs="cr", k=11) +
                       s(measurement_no, bs="cr", k=11, by=foll_v_o) +
                       s(measurement_no, speaker_f, bs="fs", m=1, k=11) +
                       s(measurement_no, speaker_f, by=foll_v_o, bs="fs", m=1, k=11),
                     data=price_bin,
                     AR.start=traj_start, rho=rho_est,
                     discrete=T)

Warning in gam.side(sm, X, tol = .Machine$double.eps^0.5): model has repeated
1-d smooths of same variable.

Summary.

summary(price_bin_gam_fv_rs)


Family: gaussian 
Link function: identity 

Formula:
f2 ~ foll_v_o + s(measurement_no, bs = "cr", k = 11) + s(measurement_no, 
    bs = "cr", k = 11, by = foll_v_o) + s(measurement_no, speaker_f, 
    bs = "fs", m = 1, k = 11) + s(measurement_no, speaker_f, 
    by = foll_v_o, bs = "fs", m = 1, k = 11)

Parametric coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1492.882     22.188  67.284  < 2e-16 ***
foll_v_oTRUE   31.297      8.604   3.637 0.000276 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                                             edf  Ref.df      F p-value    
s(measurement_no)                          8.751   8.945 69.524  <2e-16 ***
s(measurement_no):foll_v_oTRUE             6.865   7.604 25.682  <2e-16 ***
s(measurement_no,speaker_f)              351.113 438.000 11.580  <2e-16 ***
s(measurement_no,speaker_f):foll_v_oTRUE 217.564 427.000  1.485  <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) =  0.487   Deviance explained = 49.3%
fREML = 3.0151e+05  Scale est. = 35041     n = 47419

And a plot of the two smooths.

plot_smooth(price_bin_gam_fv_rs, view="measurement_no",
            plot_all="foll_v_o", rm.ranef=T)

Summary:
    * foll_v_o : factor; set to the value(s): FALSE, TRUE. 
    * measurement_no : numeric predictor; with 30 values ranging from 0.000000 to 100.000000. 
    * speaker_f : factor; set to the value(s): s-282. (Might be canceled as random effect, check below.) 
    * NOTE : The following random effects columns are canceled: s(measurement_no,speaker_f),s(measurement_no,speaker_f):foll_v_oTRUE

You may notice something slightly odd about the plotted smooths: the confidence interval around the second one (following voiceless = TRUE) is a bit wider than the confidence interval around the first one. You are absolutely right! This is a known issue with plots of models with reference-difference smooths, and we’re working on sorting it out. Based on the simulations in Sóskuthy (2021), the model estimates should be OK – so this is likely an issue with the prediction function for GAMMs.

SST GAM workshop exercises

Márton Sóskuthy