The Atlas of North American English was a telephone survey of large urban centers in the United States and Canada conducted in the mid-1990s. This analysis of /ay/ allophony involves the following packages.
library(lsa2017)
library(ggmap)
Loading required package: ggplot2
Google Maps API Terms of Service: http://developers.google.com/maps/terms.
Please cite ggmap if you use it: see citation('ggmap') for details.
library(tidyverse)
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
package ‘tibble’ was built under R version 3.4.1
package ‘purrr’ was built under R version 3.4.2
package ‘dplyr’ was built under R version 3.4.2
Conflicts with tidy packages ----------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
library(stringr)
head(anae)
I geocoded the data using the Data Science Toolkit API.
citstate_df <- anae %>%
  mutate(country = recode(dialect, Canada = "Canada", .default = "USA")) %>%
  group_by(city, state, country) %>%
  tally() %>%
  mutate(citstate = str_c(city, state, country, sep = ", "))
geos_dsk <- geocode(citstate_df$citstate, source = "dsk")
geocoded <- bind_cols(citstate_df, geos_dsk)
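Geocoding services occasionally fail to resolve a place name, so it's worth checking for missing coordinates before plotting. This quick check is my addition, not part of the original workflow:
geocoded %>%
  filter(is.na(lon) | is.na(lat))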
ggplot(geocoded, aes(lon, lat)) +
  geom_point()
summary(geocoded)
         city        state       country         n            citstate              lon               lat
 Springfield:  4   IL     : 15   USA   :240   Min.   : 141.0   Length:254         Min.   :-149.86   Min.   :25.89
 Charleston :  2   OH     : 14   Canada: 14   1st Qu.: 283.5   Class :character   1st Qu.: -96.81   1st Qu.:37.85
 Columbia   :  2   WI     : 12                Median : 411.5   Mode  :character   Median : -87.85   Median :40.79
 Greenville :  2   IN     : 10                Mean   : 509.0                      Mean   : -89.98   Mean   :40.30
 KansasCity :  2   MN     : 10                3rd Qu.: 636.8                      3rd Qu.: -80.86   3rd Qu.:43.01
 Norfolk    :  2   NJ     : 10                Max.   :2429.0                      Max.   : -68.85   Max.   :61.22
 (Other)    :240   (Other):183
Normalization is easy with the split-apply-combine workflow: group by speaker, then z-score F1 and F2 within each speaker.
anae <- anae %>%
  group_by(speakerID) %>%
  mutate(F1_n = (F1 - mean(F1)) / sd(F1),
         F2_n = (F2 - mean(F2)) / sd(F2)) %>%
  left_join(geocoded)
head(anae)
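As a quick sanity check (my addition, not in the original), the per-speaker means of the normalized formants should come out near 0 and the standard deviations near 1:
anae %>%
  group_by(speakerID) %>%
  summarise(mean_F1n = mean(F1_n), sd_F1n = sd(F1_n)) %>%
  head()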
I’m going to subset the data and recode the voicing contrast.
ays <- anae %>%
  ungroup() %>%
  filter(vclass %in% c("ay", "ay0"),
         fol_seg %in% c("B", "D", "DH", "F", "G", "K", "M", "N", "P", "S", "T", "V", "Z"),
         lon > -140) %>%
  mutate(voicing = recode(as.character(fol_seg),
                          B = "voiced",
                          D = "voiced",
                          DH = "voiced",
                          `F` = "voiceless",
                          G = "voiced",
                          K = "voiceless",
                          M = "voiced",
                          N = "voiced",
                          P = "voiceless",
                          S = "voiceless",
                          `T` = "voiceless",
                          V = "voiced",
                          Z = "voiced"),
         voicing = factor(voicing),
         citstate = factor(citstate),
         word = factor(word),
         speakerID = factor(speakerID))
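A quick cross-tabulation (my addition, not in the original) confirms that each following segment ended up in the intended voicing category:
ays %>%
  count(voicing, fol_seg)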
I fit a tensor-product smooth over longitude and latitude, including voicing as an effect. This'll let me get difference surfaces, which are what's really of interest. I also included random effects of word, speaker, and city. I'll come back to the city effects in a moment.
library(mgcv)
Loading required package: nlme
Attaching package: ‘nlme’
The following object is masked from ‘package:dplyr’:
collapse
This is mgcv 1.8-17. For overview type 'help("mgcv-package")'.
ays_model <- gam(F1_n ~ voicing + te(lon, lat, by = voicing, k = 10) +
                   s(speakerID, bs = 're', by = voicing) +
                   s(citstate, bs = 're', by = voicing) +
                   s(word, bs = 're'),
                 data = ays)
It took a long time to fit, so I'm saving the model object for safety.
saveRDS(ays_model, "ays_model.RDS")
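On a later run the saved object can be reloaded rather than refitting the model (a usage note, not part of the original post):
# reload the fitted model instead of refitting it
ays_model <- readRDS("ays_model.RDS")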
According to gam.check(), the basis dimensions of the smooths are pretty ok (k-index ≈ 1, no suspiciously low p-values).
gam.check(ays_model)
Method: GCV Optimizer: magic
Smoothing parameter selection converged after 96 iterations.
The RMS GCV score gradient at convergence was 2.124059e-05 .
The Hessian was not positive definite.
Model rank = 2034 / 2034
Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.
k' edf k-index p-value
te(lon,lat):voicingvoiced 99.0 8.2 1.01 0.78
te(lon,lat):voicingvoiceless 99.0 15.1 1.01 0.78
s(speakerID):voicingvoiced 426.0 186.0 NA NA
s(speakerID):voicingvoiceless 426.0 167.3 NA NA
s(citstate):voicingvoiced 253.0 20.4 NA NA
s(citstate):voicingvoiceless 253.0 13.6 NA NA
s(word) 476.0 134.5 NA NA
The summary is kind of useless on its own; the geographic pattern only becomes clear once the difference surface is plotted.
summary(ays_model)
Family: gaussian
Link function: identity
Formula:
F1_n ~ voicing + te(lon, lat, by = voicing, k = 10) + s(speakerID,
bs = "re", by = voicing) + s(citstate, bs = "re", by = voicing) +
s(word, bs = "re")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.05421 0.02207 47.770 < 2e-16 ***
voicingvoiceless -0.25415 0.03361 -7.562 4.71e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
te(lon,lat):voicingvoiced 8.205 9.098 5.710 5.87e-08 ***
te(lon,lat):voicingvoiceless 15.061 17.293 15.522 < 2e-16 ***
s(speakerID):voicingvoiced 185.972 421.000 1.185 < 2e-16 ***
s(speakerID):voicingvoiceless 167.318 399.000 1.185 < 2e-16 ***
s(citstate):voicingvoiced 20.421 249.000 0.150 0.01977 *
s(citstate):voicingvoiceless 13.645 240.000 0.103 0.00593 **
s(word) 134.533 474.000 1.140 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.335 Deviance explained = 40.1%
GCV = 0.24619 Scale est. = 0.22184 n = 5532
library(itsadug)
First, get the predicted difference between the voiced and voiceless contexts across a grid of longitudes and latitudes, excluding the random effects.
bbox <- c(min(ays$lon) - 5, min(ays$lat) - 2, max(ays$lon) + 5, max(ays$lat) + 2)
ays_diff <- get_difference(ays_model,
                           comp = list(voicing = c("voiced", "voiceless")),
                           cond = list(lon = seq(bbox[1], bbox[3], length = 100),
                                       lat = seq(bbox[2], bbox[4], length = 100)),
                           rm.ranef = TRUE)
Summary:
* lon : numeric predictor; with 100 values ranging from -128.119340 to -63.850405.
* lat : numeric predictor; with 100 values ranging from 23.893664 to 55.550140.
* speakerID : factor; set to the value(s): 345. (Might be canceled as random effect, check below.)
* citstate : factor; set to the value(s): Atlanta, GA, USA. (Might be canceled as random effect, check below.)
* word : factor; set to the value(s): five. (Might be canceled as random effect, check below.)
* NOTE : The following random effects columns are canceled: s(speakerID):voicingvoiced,s(speakerID):voicingvoiceless,s(citstate):voicingvoiced,s(citstate):voicingvoiceless,s(word)
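As an aside (my addition, not part of the original post), itsadug can also draw this kind of difference surface directly with plot_diff2(); something along these lines should work, assuming its default interface:
plot_diff2(ays_model,
           view = c("lon", "lat"),
           comp = list(voicing = c("voiced", "voiceless")),
           rm.ranef = TRUE)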
A practice plot, using geom_contour():
ays_diff %>%
  ggplot(aes(lon, lat)) +
  geom_contour(aes(z = difference, color = ..level..))
Next, get a basemap. I went with a black and white basemap so that it won’t compete with the colors of the plot, but that could be revisited.
bbox <- c(min(ays$lon)-5, min(ays$lat)-2, max(ays$lon)+5, max(ays$lat)+2)
na_map <- get_map(location = bbox, maptype = "toner")
ggmap(na_map)
Overlaying the contours on the map.
ggmap(na_map) +
  geom_contour(data = ays_diff, aes(z = difference, color = ..level..), size = 1) +
  scale_color_gradient2()
The one major downside of this is that the difference between regions might actually be quite a bit sharper than the GAMM surface shows. I want to highlight cities that are very different from the values predicted by their location alone, i.e. cities with big random effects. This involves messing around with matrix multiplication, sorry: the fitted value for each row is the linear predictor matrix times the coefficient vector, so zeroing out a term's columns in that matrix drops its contribution from the prediction.
pre_pred <- geocoded %>%
  ungroup() %>%
  select(citstate, lon, lat) %>%
  mutate(speakerID = 1,
         word = "five") %>%
  filter(lon > -140)
pred_df <- bind_rows(pre_pred %>% mutate(voicing = "voiced"),
                     pre_pred %>% mutate(voicing = "voiceless")) %>%
  mutate(voicing = factor(voicing))
pred_df_nocit <- pred_df
I'll get the model matrix for these prediction rows with the predict(..., type = 'lpmatrix') method.
pred_matrix <- predict(ays_model, newdata = pred_df, type = "lpmatrix")
# zero out the speaker and word random-effect columns
pred_matrix[, grepl("speakerID", colnames(pred_matrix))] <- 0
pred_matrix[, grepl("word", colnames(pred_matrix))] <- 0
# a second copy that also drops the city random effects
pred_matrix_nocit <- pred_matrix
pred_matrix_nocit[, grepl("citstate", colnames(pred_matrix))] <- 0
# fitted values = linear predictor matrix %*% coefficient vector
pred_df$fit <- (pred_matrix %*% coef(ays_model))[, 1]
pred_df_nocit$fit <- (pred_matrix_nocit %*% coef(ays_model))[, 1]
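For what it's worth (my aside, not the original author's approach), predict.gam() in recent versions of mgcv has an exclude argument that zeroes out named smooth terms, which should reproduce the "no city" fit without touching the lpmatrix by hand. A hedged sketch, reusing the term labels from the model summary:
# should match pred_df_nocit$fit, assuming the term labels are spelled exactly as in summary(ays_model)
pred_fit_nocit_alt <- predict(ays_model, newdata = pred_df,
                              exclude = c("s(speakerID):voicingvoiced",
                                          "s(speakerID):voicingvoiceless",
                                          "s(citstate):voicingvoiced",
                                          "s(citstate):voicingvoiceless",
                                          "s(word)"))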
pred_df <- pred_df %>%
  spread(voicing, fit) %>%
  mutate(citdiff = voiced - voiceless) %>%
  select(citstate, lon, lat, citdiff)
pred_df_nocit <- pred_df_nocit %>%
  spread(voicing, fit) %>%
  mutate(no_citdiff = voiced - voiceless) %>%
  select(citstate, lon, lat, no_citdiff)
all_pred <- left_join(pred_df, pred_df_nocit)
Joining, by = c("citstate", "lon", "lat")
⬆️ all_pred now contains the predicted difference between pre-voiced and pre-voiceless /ay/ for each city based on its location plus its random effect (citdiff) and based on its location alone (no_citdiff). Below I compute the gap between the two (extradiff) and plot just the cities with the biggest absolute extradiff (they're maximally different from what their location alone would predict).
big_ranef_cities <- all_pred %>%
  mutate(extradiff = citdiff - no_citdiff) %>%
  filter(abs(extradiff) > 0.02)
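A quick peek at which cities clear that threshold (my addition; the 0.02 cutoff is the author's):
big_ranef_cities %>%
  arrange(desc(abs(extradiff))) %>%
  select(citstate, citdiff, no_citdiff, extradiff)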
library(ggthemes)
pal <- ptol_pal()(3)
ggmap(na_map) +
  geom_contour(data = ays_diff, aes(z = difference, color = ..level..), size = 1) +
  geom_point(data = big_ranef_cities, aes(color = citdiff, size = abs(extradiff))) +
  scale_color_gradient2(low = pal[1], mid = pal[2], high = pal[3],
                        limits = c(min(all_pred$citdiff), max(all_pred$citdiff)))
I waffled on the right way to color the map. I could color each point according to how different it is from its surrounding area, but I think that, looking at a map like this, you'd expect each point to be colored according to its predicted value, so that's what I did. What gets lost is, for example, the fact that Philadelphia has a larger difference than predicted for its location while NYC has a much smaller one. The barplot below is an attempt to show this, but it loses the geographic information.
big_ranef_cities %>%
  mutate(citstate = factor(citstate),
         citstate = reorder(citstate, extradiff, mean)) %>%
  ggplot(aes(citstate, extradiff)) +
  geom_bar(stat = 'identity', aes(fill = no_citdiff)) +
  scale_fill_gradient2(low = pal[1], mid = pal[2], high = pal[3]) +
  coord_flip()
library(scales)
ggmap(na_map) +
  geom_point(data = all_pred, aes(color = citdiff - no_citdiff, size = abs(citdiff - no_citdiff))) +
  scale_color_gradient2(low = muted("blue"), high = muted("red"), mid = "grey")