1 Summary

This is an R Markdown document presenting the steps and the results of the ordinary kriging analysis conducted for the New England vowel formant crowdsourcing paper by Kim et al.

2 Packages

First, we load up all the R packages we’re using, which are mostly for geostatiscal analysis, cartography, and visualisation.

library(classInt)
library(rgeos)

## rgeos version: 0.3-23, (SVN revision 546)
##  GEOS runtime version: 3.6.1-CAPI-1.10.1 r0 
##  Linking to sp version: 1.2-4 
##  Polygon checking: TRUE

library(maps)
library(sp)
library(maptools)

## Checking rgeos availability: TRUE

library(geoR)

## --------------------------------------------------------------
##  Analysis of Geostatistical Data
##  For an Introduction to geoR go to http://www.leg.ufpr.br/geoR
##  geoR version 1.7-5.2 (built on 2016-05-02) is now loaded
## --------------------------------------------------------------

library(plotrix)
library(geosphere)

3 Mapping Preliminaries

Before mapping each of the vowel formant measures individually, we get some preliminary work out of the way.

3.1 Defining New England

First, we specify a projection and create a map object consisting the county-level information for the 6 New England state. We then convert that map object into sp object, which is the basis for the rest of the analysis.

projection <- "+proj=longlat +datum=WGS84"
neweng <- c("new hampshire", "maine", "massachusetts", "vermont", "rhode island", 
    "connecticut")
ne.map <- map(database = "county", region = neweng, plot = FALSE, fill = TRUE)
ne.sp <- map2SpatialPolygons(ne.map, IDs = ne.map$names, proj4string = CRS(projection))

Here is the region mapped.

par(mar = c(0, 0, 0, 0))
map(ne.map)

3.2 Defining Map Grids

Next, we create two grids of location that we’ll use throughout the analysis.

The first grid consists of approximately 1,000 evenly spaced locations across the six states. This will be used to bin the individual informant vowel formant observations as a precursor to interpolation. We do this because there are so many informants, the number of informants for the different variables is inconsistent, the values of the variables for nearby informants are often inconsistent, and the distribution of informants across the maps is very uneven (roughly following population density).

grid <- as.data.frame(spsample(ne.sp, 1000, type = "regular", offset = c(0.5, 
    0.5)))
grid$code <- row.names(grid)
nrow(grid)

## [1] 1006

And here it is mapped.

par(mar = c(0, 0, 0, 0))
map(database = "state", region = neweng, col = rgb(0.1, 0.1, 0.1), lwd = 1)
map(database = "county", region = neweng, col = rgb(0.3, 0.3, 0.3), lwd = 0.4, 
    add = TRUE)
points(grid$x1, grid$x2, pch = 23, cex = 1, bg = "gray95", lwd = 0.5)

The second grid consists of approximately 300,000 evenly spaced locations. This is the grid I’ll use to interpolate over so that we can get what looks like a smooth surface. I chose 300,000 locations because this is around the lowest number of locations needed to get a fully kriged map at a good resolution.

kgrid <- as.data.frame(spsample(ne.sp, 3e+05, type = "regular", offset = c(0.5, 
    0.5)))
nrow(kgrid)

## [1] 299980

3.3 Map Functions

Finally, we set up three New England map functions. Note that we’re taking the shortcut here of using object names directly in the functions and then consistently re-using the object names for each sub-analysis. We’re also defining the colour pallete when we call the fucntion. Those should be fixed so it’s in the function call. Right now the function calls just take label titles and then whether or not we want the scale to be set up so that higher values are associated with red or blue.

The first function is for maps by actual location.

map_loc <- function(maint, subt, legt, legb, lab, hicol) {
    
    par(mar = c(0, 0, 0, 0))
    
    map(database = "state", region = neweng, col = rgb(0.1, 0.1, 0.1), lwd = 0.8)
    map(database = "county", region = neweng, col = rgb(0.3, 0.3, 0.3), lwd = 0.4, 
        add = TRUE)
    
    points(loc$Group.1, loc$Group.2, pch = 21, cex = 0.8, bg = color, lwd = 0.5)
    
    text(-72.15, 47.2, maint, pos = 1, col = "black", cex = 1.3)
    text(-72.15, 46.3, subt, pos = 1, col = "black", cex = 1.1)
    
    if (hicol == "red") {
        rect(-68.7, 42.9, -68.4, 43.3, col = basepal[4])
        rect(-68.7, 42.5, -68.4, 42.9, col = basepal[3])
        rect(-68.7, 42.1, -68.4, 42.5, col = basepal[2])
        rect(-68.7, 41.7, -68.4, 42.1, col = basepal[1])
        text(-68.35, 42.9, paste(round(class$brks[4], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.5, paste(round(class$brks[3], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.1, paste(round(class$brks[2], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
    } else if (hicol == "blue") {
        rect(-68.7, 42.9, -68.4, 43.3, col = basepal[1])
        rect(-68.7, 42.5, -68.4, 42.9, col = basepal[2])
        rect(-68.7, 42.1, -68.4, 42.5, col = basepal[3])
        rect(-68.7, 41.7, -68.4, 42.1, col = basepal[4])
        text(-68.35, 42.9, paste(round(class$brks[2], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.5, paste(round(class$brks[3], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.1, paste(round(class$brks[4], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
    }
    
    text(-68.55, 43.62, legt, pos = 1, col = "black", cex = 0.9)
    text(-68.55, 41.62, legb, pos = 1, col = "black", cex = 0.9)
}

The second function is for maps using the 1,000 location average grid (note that a substantial number go unused since there is no nearby informant).

map_grid <- function(maint, subt, legt, legb, lab, hicol) {
    
    par(mar = c(0, 0, 0, 0))
    map(database = "state", region = neweng, col = rgb(0.1, 0.1, 0.1), lwd = 0.8)
    map(database = "county", region = neweng, col = rgb(0.3, 0.3, 0.3), lwd = 0.4, 
        add = TRUE)
    
    points(grid$x1[as.numeric(avg$Group.1)], grid$x2[as.numeric(avg$Group.1)], 
        pch = 23, cex = 1, bg = color, lwd = 0.5)
    text(-72.15, 47.2, maint, pos = 1, col = "black", cex = 1.3)
    text(-72.15, 46.3, subt, pos = 1, col = "black", cex = 1.1)
    
    if (hicol == "red") {
        rect(-68.7, 42.9, -68.4, 43.3, col = basepal[4])
        rect(-68.7, 42.5, -68.4, 42.9, col = basepal[3])
        rect(-68.7, 42.1, -68.4, 42.5, col = basepal[2])
        rect(-68.7, 41.7, -68.4, 42.1, col = basepal[1])
        text(-68.35, 42.9, paste(round(class$brks[4], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.5, paste(round(class$brks[3], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.1, paste(round(class$brks[2], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
    } else if (hicol == "blue") {
        rect(-68.7, 42.9, -68.4, 43.3, col = basepal[1])
        rect(-68.7, 42.5, -68.4, 42.9, col = basepal[2])
        rect(-68.7, 42.1, -68.4, 42.5, col = basepal[3])
        rect(-68.7, 41.7, -68.4, 42.1, col = basepal[4])
        text(-68.35, 42.9, paste(round(class$brks[2], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.5, paste(round(class$brks[3], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
        text(-68.35, 42.1, paste(round(class$brks[4], 0), lab, sep = " "), cex = 0.9, 
            pos = 4)
    }
    
    text(-68.55, 43.62, legt, pos = 1, col = "black", cex = 0.9)
    text(-68.55, 41.62, legb, pos = 1, col = "black", cex = 0.9)
}

The third function is for maps using the high density interpolation grid.

map_krig <- function(maint, subt, legt, legb) {
    
    pal <- colorRampPalette(c(rgb(0, 0, 1), "white", rgb(1, 0, 0)), bias = 1)
    basepal <- pal(100)
    class <- classIntervals(final$KDATA, 100, style = "equal")
    color <- findColours(class, basepal)
    
    par(mar = c(0, 0, 0, 0))
    
    map(database = "state", region = neweng, col = rgb(0.1, 0.1, 0.1), lwd = 1)
    
    points(kgrid$x1, kgrid$x2, pch = 16, col = color, cex = 0.2, lwd = 1)
    
    map(database = "county", region = neweng, col = rgb(0.3, 0.3, 0.3), lwd = 0.4, 
        add = TRUE)
    map(database = "state", region = neweng, col = rgb(0.1, 0.1, 0.1), lwd = 0.8, 
        add = TRUE)
    
    text(-72.15, 47.2, maint, pos = 1, col = "black", cex = 1.3)
    text(-72.15, 46.3, subt, pos = 1, col = "black", cex = 1.1)
    gradient.rect(-68.7, 41.7, -68.4, 43.3, col = basepal, nslices = 100, gradient = "y")
    text(-68.55, 43.62, legt, pos = 1, col = "black", cex = 0.9)
    text(-68.55, 41.62, legb, pos = 1, col = "black", cex = 0.9)
}

4 Main Interpolation Analysis

We now go through the interpolation analysis of the 5 acoustic measures. We’ll describe the first analysis of LOT/THOUGHT diustance in more detail; they all follow the same format, so we’ll only note differences after the first example.

4.1 LOT/THOUGHT Distance

The first analysis is for the distance between the LOT and THOUGHT vowels.

4.1.1 Data

First, we read in the dataset. We have each acoustic feature in its own .csv, since they have are measured across slightly different sets of informants. In each file, the data consists of an informant speaker code and then the Longitude and Latitude of the place where the informant grew up, as well as the acoustic feature, which in this case consists of the Euclidean distance betweeh the LOT and THOUGHT vowels measured in Hz.

lvar <- read.table("thought.lot.Grieve.csv", header = TRUE, sep = ",", quote = "\"")
names(lvar)[2] <- "var"
summary(lvar)

##     speaker           var              childLat       childLong     
##  Min.   :  2.0   Min.   :  0.6203   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:167.5   1st Qu.: 16.6680   1st Qu.:41.78   1st Qu.:-72.50  
##  Median :353.5   Median : 32.5397   Median :42.28   Median :-71.44  
##  Mean   :370.2   Mean   : 45.5346   Mean   :42.55   Mean   :-71.54  
##  3rd Qu.:542.8   3rd Qu.: 65.0148   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   :313.5288   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 622

4.1.2 Outlier Removal

We check for massive outliers, since they can really affect the rest of the analysis. We’re doing this pretty informally and just eyeballing it to try to get rid of really bad outliers. In this case, we see trimming at around 180 Hz gets rid of some pretty major outliers.

plot(lvar$var, pch = "*", col = "darkblue", ylab = "LOT/THOUGHT Distance")
lines(x = c(0, 620), y = c(180, 180), col = "red")

lvar <- lvar[lvar$var < 180, ]
summary(lvar)

##     speaker           var              childLat       childLong     
##  Min.   :  2.0   Min.   :  0.6203   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:169.0   1st Qu.: 16.5318   1st Qu.:41.79   1st Qu.:-72.50  
##  Median :355.0   Median : 32.1944   Median :42.29   Median :-71.44  
##  Mean   :370.3   Mean   : 43.8363   Mean   :42.56   Mean   :-71.53  
##  3rd Qu.:543.0   3rd Qu.: 63.8567   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   :161.8439   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 617

4.1.3 Location Averages

We first map by location just to get a look at the underlying data. We’re not jiggering here. When there are more than one informant associated with the same long/lat, we just take the average distance value. Note we’re just doing this here for mapping We aren’t working with these values below.

loc <- aggregate(lvar$var, by = list(lvar$childLong, lvar$childLat), FUN = mean)
summary(loc)

##     Group.1          Group.2            x          
##  Min.   :-73.57   Min.   :41.02   Min.   :  1.059  
##  1st Qu.:-72.52   1st Qu.:41.79   1st Qu.: 17.639  
##  Median :-71.44   Median :42.31   Median : 32.653  
##  Mean   :-71.52   Mean   :42.60   Mean   : 42.568  
##  3rd Qu.:-70.98   3rd Qu.:43.08   3rd Qu.: 61.008  
##  Max.   :-67.86   Max.   :47.32   Max.   :161.844

nrow(loc)

## [1] 435

We then map the data by locations, after first setting the color palette. We do this each time so that we can set different orders (by changing the order of the palette) and styles of intervals.

In this case, we’re dividing into four colour bands using quantiles (i.e. into quartiles), which means roughly the number of locations are associated with each colour band. So, for example, the split between blue and red is the median value of the variable across all locations.

pal <- colorRampPalette(c(rgb(1, 0, 0), "white", rgb(0, 0, 1)), bias = 1)
basepal <- pal(4)
class <- classIntervals(loc$x, 4, style = "quantile")
color <- findColours(class, basepal)


map_loc(maint = "LOT/THOUGHT\nDistance", subt = "Average\nF1/F2 Euclidean Distance\nby Location", 
    legt = "Closer", legb = "Farther", lab = "Hz", hicol = "blue")

4.1.4 On Grid Averages

Because the distribution of informants/locations is so uneven, with an especially large number of informants from around Boston, before kriging, we take the average measurement for all informants who closest to a point on the regular grid of around 1,000 locations, which we generatde earlier. Many grid points are not associated with any informants, but this step still evens out the distribution of points and helps show the underlying pattern.

We do this by first finding the distance between each informant and the grid centroid. We then find record the closest grid point for each informant and then compute an average measurement for all these informants.

dist <- distm(lvar[, c("childLong", "childLat")], grid[, c("x1", "x2")], fun = distVincentyEllipsoid)
lvar$grid <- grid$code[max.col(-dist)]
avg <- aggregate(lvar$var, by = list(lvar$grid), FUN = mean)

We can also map the result, again using the quantile approach to define colour bands, which really makes the pattern clear.

class <- classIntervals(avg$x, 4, style = "quantile")
color <- findColours(class, basepal)

map_grid(maint = "LOT/THOUGHT\nDistance", subt = "Average\nF1/F2 Euclidean Distance\non Grid", 
    legt = "Closer", legb = "Farther", lab = "Hz", hicol = "blue")

4.1.5 Variogram Analysis

Finally, we use these grid averages to run an ordinary kriging.

First, we need to conduct a variogram analysis…

xy <- as.matrix(data.frame(grid$x1[as.numeric(avg$Group.1)], grid$x2[as.numeric(avg$Group.1)]))
vg <- variog(coords = xy, data = avg$x, option = "bin", max.dist = 3)

## variog: computing omnidirectional variogram

vgf <- variofit(vg, cov.model = "spherical")

## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim

## Warning in variofit(vg, cov.model = "spherical"): initial values not
## provided - running the default search

## variofit: searching for best initial value ... selected values:
##               sigmasq  phi   tausq    kappa
## initial.value "717.19" "2.3" "239.06" "0.5"
## status        "est"    "est" "est"    "fix"
## loss value: 109745641.066694

plot(vg, ylab = "Semivariance", xlab = "Physical Distance", pch = 21, cex = 1.3, 
    col = "darkblue", main = "THOUGHT/LOT Distance")
lines(vgf, col = "red", lwd = 1.1)

4.1.6 Ordinary Kriging

… We multiply the z-score by negative -1 because we want higher values in red.

krigdata <- krige.conv(coords = xy, data = avg$x, locations = kgrid, krige = krige.control(obj.m = vgf))

## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood

final <- as.data.frame(krigdata$predict) * -1
colnames(final) <- c("KDATA")

summary(final)

##      KDATA       
##  Min.   :-84.55  
##  1st Qu.:-33.80  
##  Median :-26.19  
##  Mean   :-30.79  
##  3rd Qu.:-21.75  
##  Max.   :-15.28

Note that in this case we’re using equal intervals, rather than some type of quantile like above. The effect is that each colour band can be associated with very different number of locations, allowing us to generate interpolated maps that do not always have approximately 50% of the map coloured in blue and 50% in red, which would mispreresent a number of the variables and produce interpolations that do not appear to correspond very well to the raw maps upon which they are based. The reason we use a quantile approach for those is that every location is associated with one or more informants, so it makes sense to set the colour bands to include roughly equal number of locations, but when we are interpolating over the 300,000 location grid, most location aren’t near to any informant.

map_krig(maint = "LOT/THOUGHT\nDistance", subt = "Interpolated\nF1/F2 Euclidean Distance\n(Ordinary Kriging)", 
    legt = "Closer", legb = "Farther")

4.1.7 Save the data

k_thought <- scale(final)

4.2 Rhoticity

4.2.1 Data

Flip it here so we get percentage of r-lessness

lvar <- read.table("rhoticity_Grieve.csv", header = TRUE, sep = ",", quote = "\"")
names(lvar)[2] <- "var"
lvar$var <- (1 - lvar$var) * 100
summary(lvar)

##     speaker           var            childLat       childLong     
##  Min.   :  2.0   Min.   :  0.00   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:167.0   1st Qu.:  0.00   1st Qu.:41.77   1st Qu.:-72.50  
##  Median :352.0   Median :  0.00   Median :42.28   Median :-71.42  
##  Mean   :369.6   Mean   : 21.12   Mean   :42.55   Mean   :-71.53  
##  3rd Qu.:542.0   3rd Qu.:  0.00   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   :100.00   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 625

4.2.2 Outlier Removal

Doesn’t make sense in this context.

plot(lvar$var, pch = "*", col = "darkblue", ylab = "Rhoticity Percent")

4.2.3 Location Averages

loc <- aggregate(lvar$var, by = list(lvar$childLong, lvar$childLat), FUN = mean)
summary(loc)

##     Group.1          Group.2            x         
##  Min.   :-73.57   Min.   :41.02   Min.   :  0.00  
##  1st Qu.:-72.52   1st Qu.:41.79   1st Qu.:  0.00  
##  Median :-71.44   Median :42.31   Median :  0.00  
##  Mean   :-71.52   Mean   :42.60   Mean   : 19.85  
##  3rd Qu.:-70.98   3rd Qu.:43.08   3rd Qu.: 33.33  
##  Max.   :-67.86   Max.   :47.32   Max.   :100.00

nrow(loc)

## [1] 437

pal <- colorRampPalette(c(rgb(0, 0, 1), "white", rgb(1, 0, 0)), bias = 1)
basepal <- pal(4)
class <- classIntervals(loc$x, 4, style = "fixed", fixedBreaks = c(0, 1e-06, 
    33, 100, 101))
color <- findColours(class, basepal)


map_loc(maint = "Rhoticity", subt = "Average\nPercentage of r-lessness\nby Location", 
    legt = "r-less", legb = "", lab = "%", hicol = "red")

4.2.4 On Grid Averages

dist <- distm(lvar[, c("childLong", "childLat")], grid[, c("x1", "x2")], fun = distVincentyEllipsoid)
lvar$grid <- grid$code[max.col(-dist)]
avg <- aggregate(lvar$var, by = list(lvar$grid), FUN = mean)

And here it is mapped.

class <- classIntervals(avg$x, 4, style = "fixed", fixedBreaks = c(0, 1e-06, 
    33, 100, 101))
color <- findColours(class, basepal)

map_grid(maint = "Rhoticity", subt = "Average\nPercentage of r-lessness\non Grid", 
    legt = "r-less", legb = "", lab = "%", hicol = "red")

4.2.5 Variogram Analysis

xy <- as.matrix(data.frame(grid$x1[as.numeric(avg$Group.1)], grid$x2[as.numeric(avg$Group.1)]))
vg <- variog(coords = xy, data = avg$x, option = "bin", max.dist = 3)

## variog: computing omnidirectional variogram

vgf <- variofit(vg, cov.model = "spherical")

## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim

## Warning in variofit(vg, cov.model = "spherical"): initial values not
## provided - running the default search

## variofit: searching for best initial value ... selected values:
##               sigmasq  phi    tausq    kappa
## initial.value "455.58" "1.38" "455.58" "0.5"
## status        "est"    "est"  "est"    "fix"
## loss value: 112524306.429328

plot(vg, ylab = "Semivariance", xlab = "Physical Distance", pch = 21, cex = 1.3, 
    col = "darkblue", main = "Rhoticity Percentage")
lines(vgf, col = "red", lwd = 1.1)

4.2.6 Ordinary Kriging

krigdata <- krige.conv(coords = xy, data = avg$x, locations = kgrid, krige = krige.control(obj.m = vgf))

## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood

final <- as.data.frame(krigdata$predict)
colnames(final) <- c("KDATA")
summary(final)

##      KDATA       
##  Min.   : 3.298  
##  1st Qu.:12.700  
##  Median :15.995  
##  Mean   :16.454  
##  3rd Qu.:19.754  
##  Max.   :36.487

map_krig(maint = "Rhoticity", subt = "Interpolated\nPercentage of r-lessness\n(Ordinary Kriging)", 
    legt = "r-less", legb = "r-ful")

4.2.7 Save the data

k_rhot <- scale(final)

4.3 START Fronting

4.3.1 Data

lvar <- read.table("startminuslot_Grieve.csv", header = TRUE, sep = ",", quote = "\"")
names(lvar)[2] <- "var"
summary(lvar)

##     speaker           var              childLat       childLong     
##  Min.   :  2.0   Min.   :-288.945   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:170.0   1st Qu.: -31.293   1st Qu.:41.78   1st Qu.:-72.50  
##  Median :355.0   Median :  -5.968   Median :42.29   Median :-71.42  
##  Mean   :371.1   Mean   :  -4.790   Mean   :42.56   Mean   :-71.53  
##  3rd Qu.:543.5   3rd Qu.:  16.837   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   : 219.940   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 619

4.3.2 Outlier Removal

High and low outliers removed

plot(lvar$var, pch = "*", col = "darkblue", ylab = "START Fronting")
lines(x = c(0, 620), y = c(175, 175), col = "red")
lines(x = c(0, 620), y = c(-175, -175), col = "orange")

lvar <- lvar[lvar$var < 175, ]
lvar <- lvar[lvar$var > -175, ]
summary(lvar)

##     speaker           var              childLat       childLong     
##  Min.   :  2.0   Min.   :-153.224   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:174.5   1st Qu.: -31.293   1st Qu.:41.78   1st Qu.:-72.50  
##  Median :358.0   Median :  -5.984   Median :42.29   Median :-71.42  
##  Mean   :372.8   Mean   :  -5.328   Mean   :42.56   Mean   :-71.53  
##  3rd Qu.:544.5   3rd Qu.:  16.686   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   : 156.613   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 615

4.3.3 Location Averages

loc <- aggregate(lvar$var, by = list(lvar$childLong, lvar$childLat), FUN = mean)
summary(loc)

##     Group.1          Group.2            x           
##  Min.   :-73.57   Min.   :41.02   Min.   :-118.535  
##  1st Qu.:-72.52   1st Qu.:41.79   1st Qu.: -29.769  
##  Median :-71.44   Median :42.31   Median :  -5.624  
##  Mean   :-71.52   Mean   :42.60   Mean   :  -4.026  
##  3rd Qu.:-70.98   3rd Qu.:43.08   3rd Qu.:  16.167  
##  Max.   :-67.86   Max.   :47.32   Max.   : 156.613

nrow(loc)

## [1] 436

pal <- colorRampPalette(c(rgb(0, 0, 1), "white", rgb(1, 0, 0)), bias = 1)
basepal <- pal(4)
class <- classIntervals(loc$x, 4, style = "quantile")
color <- findColours(class, basepal)


map_loc(maint = "START\nFronting", subt = "Average\nSTART F2 - LOT F2\nby Location", 
    legt = "More Fronted", legb = "", lab = "Hz", hicol = "red")

4.3.4 On Grid Averages

dist <- distm(lvar[, c("childLong", "childLat")], grid[, c("x1", "x2")], fun = distVincentyEllipsoid)
lvar$grid <- grid$code[max.col(-dist)]
avg <- aggregate(lvar$var, by = list(lvar$grid), FUN = mean)

And here it is mapped.

class <- classIntervals(avg$x, 4, style = "quantile")
color <- findColours(class, basepal)

map_grid(maint = "START\nFronting", subt = "Average\nSTART F2 - LOT F2\non Grid", 
    legt = "More Fronted", legb = "", lab = "Hz", hicol = "red")

4.3.5 Variogram Analysis

xy <- as.matrix(data.frame(grid$x1[as.numeric(avg$Group.1)], grid$x2[as.numeric(avg$Group.1)]))
vg <- variog(coords = xy, data = avg$x, option = "bin", max.dist = 3)

## variog: computing omnidirectional variogram

vgf <- variofit(vg, cov.model = "spherical")

## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim

## Warning in variofit(vg, cov.model = "spherical"): initial values not
## provided - running the default search

## variofit: searching for best initial value ... selected values:
##               sigmasq   phi   tausq    kappa
## initial.value "1217.53" "2.3" "405.84" "0.5"
## status        "est"     "est" "est"    "fix"
## loss value: 353901028.505612

plot(vg, ylab = "Semivariance", xlab = "Physical Distance", pch = 21, cex = 1.3, 
    col = "darkblue", main = "START Fronting")
lines(vgf, col = "red", lwd = 1.1)

4.3.6 Ordinary Kriging

krigdata <- krige.conv(coords = xy, data = avg$x, locations = kgrid, krige = krige.control(obj.m = vgf))

## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood

final <- as.data.frame(krigdata$predict)
colnames(final) <- c("KDATA")
summary(final)

##      KDATA        
##  Min.   :-39.710  
##  1st Qu.:  2.607  
##  Median : 14.638  
##  Mean   : 12.487  
##  3rd Qu.: 25.680  
##  Max.   : 51.288

map_krig(maint = "START\nFronting", subt = "Interpolated\nSTART F2 - LOT F2\n(Ordinary Kriging)", 
    legt = "More Fronted", legb = "Less Fronted")

4.3.7 Save the data

k_start <- scale(final)

4.4 PALM Fronting

4.4.1 Data

lvar <- read.table("palmminuslot_Grieve.csv", header = TRUE, sep = ",", quote = "\"")
names(lvar)[2] <- "var"
summary(lvar)

##     speaker           var              childLat       childLong     
##  Min.   :  2.0   Min.   :-159.515   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:171.8   1st Qu.: -44.265   1st Qu.:41.78   1st Qu.:-72.50  
##  Median :359.5   Median : -27.253   Median :42.29   Median :-71.42  
##  Mean   :373.0   Mean   : -24.593   Mean   :42.56   Mean   :-71.53  
##  3rd Qu.:544.2   3rd Qu.:  -4.521   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   : 139.003   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 616

4.4.2 Outlier Removal

No major outliers here, so we leave as is.

plot(lvar$var, pch = "*", col = "darkblue", ylab = "Palm")

4.4.3 Location Averages

loc <- aggregate(lvar$var, by = list(lvar$childLong, lvar$childLat), FUN = mean)
summary(loc)

##     Group.1          Group.2            x           
##  Min.   :-73.57   Min.   :41.02   Min.   :-159.515  
##  1st Qu.:-72.51   1st Qu.:41.79   1st Qu.: -43.505  
##  Median :-71.44   Median :42.32   Median : -26.743  
##  Mean   :-71.52   Mean   :42.61   Mean   : -24.769  
##  3rd Qu.:-70.98   3rd Qu.:43.10   3rd Qu.:  -6.817  
##  Max.   :-67.86   Max.   :47.32   Max.   : 139.003

nrow(loc)

## [1] 436

pal <- colorRampPalette(c(rgb(0, 0, 1), "white", rgb(1, 0, 0)), bias = 1)
basepal <- pal(4)
class <- classIntervals(loc$x, 4, style = "quantile")
color <- findColours(class, basepal)


map_loc(maint = "PALM\nFronting", subt = "Average\nPALM F2 - LOT F2\nby Location", 
    legt = "More Fronted", legb = "", lab = "Hz", hicol = "red")

4.4.4 On Grid Averages

dist <- distm(lvar[, c("childLong", "childLat")], grid[, c("x1", "x2")], fun = distVincentyEllipsoid)
lvar$grid <- grid$code[max.col(-dist)]
avg <- aggregate(lvar$var, by = list(lvar$grid), FUN = mean)

And here it is mapped.

class <- classIntervals(avg$x, 4, style = "quantile")
color <- findColours(class, basepal)

map_grid(maint = "PALM\nFronting", subt = "Average\nPALM F2 - LOT F2\non Grid", 
    legt = "More Fronted", legb = "", lab = "Hz", hicol = "red")

4.4.5 Variogram Analysis

xy <- as.matrix(data.frame(grid$x1[as.numeric(avg$Group.1)], grid$x2[as.numeric(avg$Group.1)]))
vg <- variog(coords = xy, data = avg$x, option = "bin", max.dist = 3)

## variog: computing omnidirectional variogram

vgf <- variofit(vg, cov.model = "spherical")

## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim

## Warning in variofit(vg, cov.model = "spherical"): initial values not
## provided - running the default search

## variofit: searching for best initial value ... selected values:
##               sigmasq  phi    tausq    kappa
## initial.value "502.28" "1.38" "502.28" "0.5"
## status        "est"    "est"  "est"    "fix"
## loss value: 117120372.086316

plot(vg, ylab = "Semivariance", xlab = "Physical Distance", pch = 21, cex = 1.3, 
    col = "darkblue", main = "PALM Fronting")
lines(vgf, col = "red", lwd = 1.1)

4.4.6 Ordinary Kriging

krigdata <- krige.conv(coords = xy, data = avg$x, locations = kgrid, krige = krige.control(obj.m = vgf))

## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood

final <- as.data.frame(krigdata$predict)
colnames(final) <- c("KDATA")
summary(final)

##      KDATA          
##  Min.   :-60.77603  
##  1st Qu.:-29.47791  
##  Median :-24.19291  
##  Mean   :-24.27324  
##  3rd Qu.:-17.95466  
##  Max.   :  0.07402

map_krig(maint = "PALM\nFronting", subt = "Interpolated\nPALM F2 - LOT F2\n(Ordinary Kriging)", 
    legt = "More Fronted", legb = "Less Fronted")

4.4.7 Save the data

k_palm <- scale(final)

4.5 MARY/MARRY/MERRY Distance

4.5.1 Data

lvar <- read.table("mmm_Grieve.csv", header = TRUE, sep = ",", quote = "\"")
names(lvar)[2] <- "var"
summary(lvar)

##     speaker           var             childLat       childLong     
##  Min.   :  2.0   Min.   :  44.12   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:171.2   1st Qu.: 251.91   1st Qu.:41.78   1st Qu.:-72.52  
##  Median :356.0   Median : 382.86   Median :42.28   Median :-71.44  
##  Mean   :371.9   Mean   : 470.80   Mean   :42.55   Mean   :-71.55  
##  3rd Qu.:544.8   3rd Qu.: 583.85   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   :2594.87   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 602

4.5.2 Outlier Removal

plot(lvar$var, pch = "*", col = "darkblue", ylab = "MARY/MARRY/MERRY Distance")
lines(x = c(0, 620), y = c(1600, 1600), col = "red")

lvar <- lvar[lvar$var < 1600, ]
summary(lvar)

##     speaker           var             childLat       childLong     
##  Min.   :  2.0   Min.   :  44.12   Min.   :41.02   Min.   :-73.57  
##  1st Qu.:174.5   1st Qu.: 249.54   1st Qu.:41.78   1st Qu.:-72.53  
##  Median :359.0   Median : 380.40   Median :42.28   Median :-71.44  
##  Mean   :373.7   Mean   : 450.48   Mean   :42.55   Mean   :-71.55  
##  3rd Qu.:545.5   3rd Qu.: 569.29   3rd Qu.:42.97   3rd Qu.:-71.00  
##  Max.   :848.0   Max.   :1560.62   Max.   :47.32   Max.   :-67.86

nrow(lvar)

## [1] 595

4.5.3 Location Averages

loc <- aggregate(lvar$var, by = list(lvar$childLong, lvar$childLat), FUN = mean)
summary(loc)

##     Group.1          Group.2            x          
##  Min.   :-73.57   Min.   :41.02   Min.   :  44.12  
##  1st Qu.:-72.55   1st Qu.:41.79   1st Qu.: 262.27  
##  Median :-71.45   Median :42.32   Median : 380.70  
##  Mean   :-71.54   Mean   :42.61   Mean   : 440.40  
##  3rd Qu.:-71.00   3rd Qu.:43.16   3rd Qu.: 563.44  
##  Max.   :-67.86   Max.   :47.32   Max.   :1440.68

nrow(loc)

## [1] 418

pal <- colorRampPalette(c(rgb(0, 0, 1), "white", rgb(1, 0, 0)), bias = 1)
basepal <- pal(4)
class <- classIntervals(loc$x, 4, style = "quantile")
color <- findColours(class, basepal)


map_loc(maint = "MARY/MARRY/MERRY\nDistance", subt = "Average\nF1/F2 Euclidean Distance\nby Location", 
    legt = "Farther", legb = "Closer", lab = "Hz", hicol = "red")

4.5.4 On Grid Averages

dist <- distm(lvar[, c("childLong", "childLat")], grid[, c("x1", "x2")], fun = distVincentyEllipsoid)
lvar$grid <- grid$code[max.col(-dist)]
avg <- aggregate(lvar$var, by = list(lvar$grid), FUN = mean)

And here it is mapped.

class <- classIntervals(avg$x, 4, style = "quantile")
color <- findColours(class, basepal)


map_grid(maint = "MARY/MARRY/MERRY\nDistance", subt = "Average\nF1/F2 Euclidean Distance\non Grid", 
    legt = "Farther", legb = "Closer", lab = "Hz", hicol = "red")

4.5.5 Variogram Analysis

xy <- as.matrix(data.frame(grid$x1[as.numeric(avg$Group.1)], grid$x2[as.numeric(avg$Group.1)]))
vg <- variog(coords = xy, data = avg$x, option = "bin", max.dist = 3)

## variog: computing omnidirectional variogram

vgf <- variofit(vg, cov.model = "spherical")

## variofit: covariance model used is spherical 
## variofit: weights used: npairs 
## variofit: minimisation function used: optim

## Warning in variofit(vg, cov.model = "spherical"): initial values not
## provided - running the default search

## variofit: searching for best initial value ... selected values:
##               sigmasq    phi   tausq      kappa
## initial.value "30604.65" "2.3" "30604.65" "0.5"
## status        "est"      "est" "est"      "fix"
## loss value: 597395704441.898

plot(vg, ylab = "Semivariance", xlab = "Physical Distance", pch = 21, cex = 1.3, 
    col = "darkblue", main = "MARY/MARRY/MERRY Distance")
lines(vgf, col = "red", lwd = 1.1)

4.5.6 Ordinary Kriging

krigdata <- krige.conv(coords = xy, data = avg$x, locations = kgrid, krige = krige.control(obj.m = vgf))

## krige.conv: model with constant mean
## krige.conv: Kriging performed using global neighbourhood

final <- as.data.frame(krigdata$predict)
colnames(final) <- c("KDATA")
summary(final)

##      KDATA      
##  Min.   :233.4  
##  1st Qu.:335.5  
##  Median :374.2  
##  Mean   :382.9  
##  3rd Qu.:418.3  
##  Max.   :596.8

map_krig(maint = "MARY/MARRY/MERRY\nDistance", subt = "Interpolated\nF1/F2 Euclidean Distance\n(Ordinary Kriging)", 
    legt = "Farther", legb = "Closer")

4.5.7 Save the data

k_mary <- scale(final)

5 Dialect Regions

In the final stage of the analysis we produce a single overall map of New England dialect regions simply by taking a scaled average of the values of the 5 kriged measures across the ~ 300,000 locations. The process is similar to looking for a bundle of isoglosses: each of the individual interpolated maps is an isglosses, and by taking their average we can map how they bundle.

final <- as.data.frame((k_rhot + k_thought + k_mary + k_start + k_palm)/5)

pal <- colorRampPalette(c(rgb(0, 0, 1), "white", rgb(1, 0, 0)), bias = 1)
basepal <- pal(100)
class <- classIntervals(final$KDATA, 100, style = "equal")
color <- findColours(class, basepal)



map_krig(maint = "New England\nDialect Regions", subt = "Average of 5\nSociophonetic Feature\n Maps", 
    legt = "Eastern", legb = "Non-Eastern")

The is a relatively basic approach to aggregation. In dialectometry, methods like MDS and FA are commonly used. We have taken this approach because we have a small number of variables, all of which clearly show a similar east-west divide, especially in the lower half of the maps, and because we have made our measurements in such a way that in all cases the eastern portions of the map, which always includes Boston, are assigned positive values.

Ordinary Kriging Analysis of New England Vowel Formant Data

for Kim, Reddy, Stanford, Wyschogrod & Grieve (2018). Bring on the crowd! Using online audio crowdsourcing for large-scale New England dialectology and acoustic sociophonetics (in revision for American Speech).

Jack Grieve

2018/02/25

1 Summary

2 Packages

3 Mapping Preliminaries

3.1 Defining New England

3.2 Defining Map Grids

3.3 Map Functions

4 Main Interpolation Analysis

4.1 LOT/THOUGHT Distance

4.1.1 Data

4.1.2 Outlier Removal

4.1.3 Location Averages

4.1.4 On Grid Averages

4.1.5 Variogram Analysis

4.1.6 Ordinary Kriging

4.1.7 Save the data

4.2 Rhoticity

4.2.1 Data

4.2.2 Outlier Removal

4.2.3 Location Averages

4.2.4 On Grid Averages

4.2.5 Variogram Analysis

4.2.6 Ordinary Kriging

4.2.7 Save the data

4.3 START Fronting

4.3.1 Data

4.3.2 Outlier Removal

4.3.3 Location Averages

4.3.4 On Grid Averages

4.3.5 Variogram Analysis

4.3.6 Ordinary Kriging

4.3.7 Save the data

4.4 PALM Fronting

4.4.1 Data

4.4.2 Outlier Removal

4.4.3 Location Averages

4.4.4 On Grid Averages

4.4.5 Variogram Analysis

4.4.6 Ordinary Kriging

4.4.7 Save the data

4.5 MARY/MARRY/MERRY Distance

4.5.1 Data

4.5.2 Outlier Removal

4.5.3 Location Averages

4.5.4 On Grid Averages

4.5.5 Variogram Analysis

4.5.6 Ordinary Kriging

4.5.7 Save the data

5 Dialect Regions