1 Summary

This document presents the preliminary R code for a comparison of the maps from the BBC Voices project with corresponding Twitter maps, based on a 2 billion word/180 million post corpus of UK geocoded Twitter data from 2014. Overall, we find substantial similarity, indicating that the two approaches generally allow the same underlying patterns of regional variation to be observed.

See also Jack Grieve, Chris Montgomery, Andrea Nini & Diansheng Guo. 2017. Assessing the use of social media for mapping lexical variation in British English. Presented at ICLaVE 9, Malaga, Spain, 6 June 2017. (https://www.dropbox.com/s/97x5hdt4h3hjm4l/MALAGA_UKTWITTER.pdf?dl=0)

2 Packages

library(Matrix)
library(sp)
library(spdep)
library(maps)
library(maptools)
library(mapproj)
library(ggplot2)
library(plotly)
library(plotrix)
library(rgdal)
library(rgeos)
library(classInt)
library(vioplot)
library(grDevices)

3 Data

3.1 Reading in UK Shape File

First, we read the UK postal code ESRI shapefile into a spatial vector object. The shapefile consists of several component files (Areas.dbf, Areas.fix, Areas.prj, Areas.shp, Areas.shx). (Chris, source of shapefile?). We use postal code areas because that is how the BBC Voices dataset is organised.

ukmap <- readOGR(".", "Areas")
## OGR data source with driver: ESRI Shapefile 
## Source: ".", layer: "Areas"
## with 124 features
## It has 1 fields

We then transform this object to a standard longitude/latitude (WGS84) projection and order it by postal code area name.

projection <- "+proj=longlat +datum=WGS84"
ukmap <- spTransform(ukmap, CRS(projection))
ukmap <- ukmap[order(ukmap$name), ]

It contains 124 postal code areas.

nrow(ukmap)
## [1] 124

Here they are mapped.

par(mar = c(0, 0, 0, 0))
plot(ukmap, xlim = c(-6, -1), ylim = c(50, 59.4), col = "gray99", lwd = 0.4)

Note that the postal code areas are much denser in and around London.

par(mar = c(4, 4, 4, 4))
plot(ukmap, xlim = c(-0.5, 0.242), ylim = c(51.28, 51.7), col = "gray99", lwd = 0.4)

3.2 Reading in Twitter Data

We read in the Twitter dataset and order it. This dataset contains a series of lexical alternation variables selected from the BBC Voices data. Each alternation consists of two or more variants (e.g. sofa/couch/settee). In each case, we measured the frequency of each variant in each postal code area in the Twitter corpus and expressed it as a percentage of the combined frequency of all the variants of that alternation; the precomputed percentages are read in below.
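
As a minimal sketch of this normalisation (using made-up counts purely for illustration, since the raw count files are not included here), each variant’s frequency is divided by the combined frequency of all the variants of that alternation in that area:

# toy example: converting raw variant counts per area into percentages
counts <- data.frame(sofa = c(120, 45), couch = c(30, 80), settee = c(10, 25))
round(100 * counts / rowSums(counts), 2)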

twitdata <- read.table("TWE_NORM_ALT2.txt", header = TRUE, sep = ",")
twitdata <- twitdata[order(twitdata$REGION), ]
row.names(twitdata) <- row.names(ukmap)

twitdata

3.3 BBC Voices Data

We read in the BBC Voices data, which has the same format as the Twitter data, except that the values are the proportion of informants who gave that variant as a response in each postal code area. We then order the postal code areas to match the UK map, filter the variants to those present in the Twitter dataset (which can contain fewer forms), and convert the proportions to percentages.

bbcdata <- read.table("BBCVoices.csv", header = TRUE, sep = ",")
bbcdata <- bbcdata[order(bbcdata$REGION), ]
row.names(bbcdata) <- row.names(ukmap)
bbcdata <- bbcdata[, names(twitdata)]
bbcdata <- cbind(bbcdata[1], round(100 * bbcdata[, 2:ncol(bbcdata)], 2))

bbcdata

Unfortunately, there are some NAs here, but only ever one per variant and just a few overall, so we assign the variant’s mean percentage to those locations.

sum(is.na(bbcdata))
## [1] 23
# replace each NA with the mean percentage for that variant
for (i in 2:ncol(bbcdata)) {
    bbcdata[is.na(bbcdata[, i]), i] <- mean(bbcdata[, i], na.rm = TRUE)
}

sum(is.na(bbcdata))
## [1] 0

3.4 Spatializing the Datasets

Finally, we turn both the Twitter and BBC Voices datasets into spatial objects by binding them to the UK map object.

twit <- spCbind(ukmap, twitdata[, c(2:ncol(twitdata))])
bbc <- spCbind(ukmap, bbcdata[, c(2:ncol(bbcdata))])
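
spCbind expects the row names of the spatial object and the data frame to match, which is why the row names were copied from ukmap when the tables were read in. As a quick sanity check, we can confirm the alignment:

# the attribute tables should share the row names of the map object
identical(row.names(ukmap), row.names(twitdata))
identical(row.names(ukmap), row.names(bbcdata))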

4 Mapping the Alternations

We can produce choropleth maps for each of the variant forms for each alternation in the dataset to visualize the regional patterns.

First, we create a mapping function, since we will be making a bunch of similar maps.

mapfunc <- function(varlab, varname, colend, sourcelab) {
    # classify the variant percentages into 10 quantile classes and shade
    # them from white to the specified end color
    breaks <- 10
    colfunc <- colorRampPalette(c("white", colend), bias = 1)
    colpal <- colfunc(breaks)
    class <- classIntervals(varlab, breaks, style = "quantile")
    colors <- findColours(class, colpal)
    # draw the map, the variant name, a legend gradient with break labels,
    # and the data source label
    par(mar = c(0, 0, 0, 0))
    plot(ukmap, xlim = c(-6, -1), ylim = c(50, 59.4), col = colors, lwd = 0.4)
    text(0.5, 58.7, varname, cex = 1.1)
    text(0.5, 58.2, "percent", cex = 0.8)
    gradient.rect(0.2, 55.8, 0.8, 57.8, col = colpal, gradient = "y", border = "black")
    text(0.87, 55.84, round(min(class$brks), 0), cex = 0.8, pos = 4)
    text(0.87, 56.8, round(class$brks[breaks/2], 0), cex = 0.8, pos = 4)
    text(0.87, 57.72, round(max(class$brks), 0), cex = 0.8, pos = 4)
    text(0.5, 50.2, sourcelab, cex = 0.8, col = "gray75")
}

Then we can map an alternation. For example, here are the maps for sofa/couch/settee in the Twitter data.

mapfunc(twit$sofa, "Sofa", "firebrick3", "Twitter 2014")

mapfunc(twit$couch, "Couch", "dodgerblue3", "Twitter 2014")

mapfunc(twit$settee, "Settee", "limegreen", "Twitter 2014")

And here are the maps for the same forms based on the BBC Voices data, which we can see are pretty similar.

mapfunc(bbc$sofa, "Sofa", "firebrick3", "BBC Voices")

mapfunc(bbc$couch, "Couch", "dodgerblue3", "BBC Voices")

mapfunc(bbc$settee, "Settee", "limegreen", "BBC Voices")

5 Spatial Analysis

5.1 Spatial Weights Matrix

To run spatial analyses we need to define a spatial weights matrix, which specifies the spatial relationship between every pair of locations. In this case we restrict the comparison to each location and its nearest neighbors (for example, the location plus its 4 nearest neighbors), building spatial weights matrices for several neighborhood sizes.

First, we find the centroid of each postal code area.

centroids <- gCentroid(ukmap, byid = TRUE)
xy <- as.matrix(data.frame(centroids$x, centroids$y))

And we can map those.

par(mar = c(0, 0, 0, 0))
plot(ukmap, xlim = c(-6, -1), ylim = c(50, 59.4), col = "gray99", lwd = 0.4)
points(xy, pch = "*", col = "darkorange1")

For each location, we find its 2, 5, 10, and 20 nearest neighbors (including the location itself, which is why k is set to 1, 4, 9, and 19 in the calls below) and then build 4 corresponding spatial weights matrices.

neighbors <- knn2nb(knearneigh(xy, k = 1, longlat = TRUE))
neighbors <- include.self(neighbors)
neighbors
## Neighbour list object:
## Number of regions: 124 
## Number of nonzero links: 248 
## Percentage nonzero weights: 1.612903 
## Average number of links: 2 
## Non-symmetric neighbours list
listws2 <- nb2listw(neighbors)
listws2
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 124 
## Number of nonzero links: 248 
## Percentage nonzero weights: 1.612903 
## Average number of links: 2 
## Non-symmetric neighbours list
## 
## Weights style: W 
## Weights constants summary:
##     n    nn  S0    S1  S2
## W 124 15376 124 105.5 517
neighbors <- knn2nb(knearneigh(xy, k = 4, longlat = TRUE))
neighbors <- include.self(neighbors)
listws5 <- nb2listw(neighbors)

neighbors <- knn2nb(knearneigh(xy, k = 9, longlat = TRUE))
neighbors <- include.self(neighbors)
listws10 <- nb2listw(neighbors)

neighbors <- knn2nb(knearneigh(xy, k = 19, longlat = TRUE))
neighbors <- include.self(neighbors)
listws20 <- nb2listw(neighbors)
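
To see what these weights look like, we can inspect the neighbors and weights stored for a single postal code area (the first one, chosen arbitrarily) in the 5 nearest neighbor weights list.

# indices of the neighbours of the first postal code area (self included)
listws5$neighbours[[1]]
# the corresponding row-standardised weights (style "W"), which sum to 1
listws5$weights[[1]]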

5.2 Global Spatial Autocorrelation Analysis

We run a Moran’s I global spatial autocorrelation analysis, which gives us a measure of the overall amount of spatial clustering in each of our maps.
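
For illustration, here is what a single test looks like before looping over all the variants (sofa and the 5 nearest neighbor weights are an arbitrary choice):

# global Moran's I for one Twitter variant, with randomisation-based inference
moran.test(twit$sofa, listws5, randomisation = TRUE)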

First, we do this for the Twitter dataset for each spatial weights matrix.

morantwit <- data.frame()
for (j in c(2:ncol(twit))) {
    
    morantwit[j - 1, "NAME"] <- names(twit[j])
    
    mtemp <- moran.test(twit[[j]], listws2, rand = TRUE)
    morantwit[j - 1, "I_2nn"] <- round(mtemp$estimate[1], 3)
    morantwit[j - 1, "p_2nn"] <- mtemp$p.value
    
    mtemp <- moran.test(twit[[j]], listws5, rand = TRUE)
    morantwit[j - 1, "I_5nn"] <- round(mtemp$estimate[1], 3)
    morantwit[j - 1, "p_5nn"] <- mtemp$p.value
    
    mtemp <- moran.test(twit[[j]], listws10, rand = TRUE)
    morantwit[j - 1, "I_10nn"] <- round(mtemp$estimate[1], 3)
    morantwit[j - 1, "p_10nn"] <- mtemp$p.value
    
    mtemp <- moran.test(twit[[j]], listws20, rand = TRUE)
    morantwit[j - 1, "I_20nn"] <- round(mtemp$estimate[1], 3)
    morantwit[j - 1, "p_20nn"] <- mtemp$p.value
}

morantwit

Then we do the same for the BBC Voices dataset for each spatial weights matrix.

moranbbc <- data.frame()
for (j in c(2:ncol(bbc))) {
    
    moranbbc[j - 1, "NAME"] <- names(bbc[j])
    
    mtemp <- moran.test(bbc[[j]], listws2, rand = TRUE)
    moranbbc[j - 1, "I_2nn"] <- round(mtemp$estimate[1], 3)
    moranbbc[j - 1, "p_2nn"] <- mtemp$p.value
    
    mtemp <- moran.test(bbc[[j]], listws5, rand = TRUE)
    moranbbc[j - 1, "I_5nn"] <- round(mtemp$estimate[1], 3)
    moranbbc[j - 1, "p_5nn"] <- mtemp$p.value
    
    mtemp <- moran.test(bbc[[j]], listws10, rand = TRUE)
    moranbbc[j - 1, "I_10nn"] <- round(mtemp$estimate[1], 3)
    moranbbc[j - 1, "p_10nn"] <- mtemp$p.value
    
    mtemp <- moran.test(bbc[[j]], listws20, rand = TRUE)
    moranbbc[j - 1, "I_20nn"] <- round(mtemp$estimate[1], 3)
    moranbbc[j - 1, "p_20nn"] <- mtemp$p.value
}

moranbbc

It is also interesting to correlate these two sets of Moran’s I scores to see whether variants show similar levels of spatial clustering across the two datasets, regardless of the actual patterns of spatial clustering. Somewhat surprisingly, the correlations are weak, even though the Moran’s I scores tend to be strong and significant across both datasets. This comparison does not seem to tell us much about the alignment of the two maps, though.

cor(morantwit$I_2nn, moranbbc$I_2nn)
## [1] 0.1409185
plot(morantwit$I_2nn, moranbbc$I_2nn, pch = "*", col = "dodgerblue3", cex = 1.6)

cor(morantwit$I_5nn, moranbbc$I_5nn)
## [1] 0.1615067
plot(morantwit$I_5nn, moranbbc$I_5nn, pch = "*", col = "dodgerblue3", cex = 1.6)

cor(morantwit$I_10nn, moranbbc$I_10nn)
## [1] 0.1555817
plot(morantwit$I_10nn, moranbbc$I_10nn, pch = "*", col = "dodgerblue3", cex = 1.6)

cor(morantwit$I_20nn, moranbbc$I_20nn)
## [1] 0.1248003
plot(morantwit$I_20nn, moranbbc$I_20nn, pch = "*", col = "dodgerblue3", cex = 1.6)

5.3 Local Spatial Autocorrelation Analysis

We also conduct a Getis-Ord Gi* local spatial autocorrelation analysis to detect hotspots in the map for each variant, i.e. clusters of locations where the form is particularly common. Again, we can run these analyses for each variant in both datasets using any of the spatial weights matrices.
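
For a single variant, localG returns a vector of Gi* z-scores, one per postal code area, where large positive values indicate local clustering of high percentages (sofa and the 5 nearest neighbor weights are again an arbitrary choice):

# Gi* z-scores for one Twitter variant; positive values mark hotspots
head(localG(twit$sofa, listws5))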

First, we generate Getis-Ord Gi* z-scores for each location, using the 5 nearest neighbor spatial weights matrix for both datasets.

# Gi* z-scores for each Twitter variant
gitwit <- twitdata
for (i in c(2:ncol(twitdata))) {
    gitwit[[i]] <- localG(twitdata[[i]], listws5)
}
gitwit

# Gi* z-scores for each BBC Voices variant
gibbc <- bbcdata
for (j in c(2:ncol(bbcdata))) {
    gibbc[[j]] <- localG(bbcdata[[j]], listws5)
}
gibbc

Then we can map the results, after setting up another mapping function, since we use different cutoffs here. Basically, these maps identify the underlying regional patterns in the data, effectively smoothing the raw percentage maps.
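
The fixed breaks used in the function below (1.28, 1.645, 1.96, 2.33, 2.575) are the standard one-tailed critical values of the normal distribution for the 90%, 95%, 97.5%, 99% and 99.5% confidence levels, which can be verified directly:

# one-tailed normal critical values corresponding to the Gi* cutoffs
round(qnorm(c(0.90, 0.95, 0.975, 0.99, 0.995)), 3)
# approximately 1.282 1.645 1.960 2.326 2.576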

mapgifunc <- function(varlab, varname, colend, sourcelab) {
    # classify the Gi* z-scores using fixed breaks at standard one-tailed
    # normal critical values and shade from white to the specified end color
    breaks <- 7
    colfunc <- colorRampPalette(c("white", colend), bias = 1)
    colpal <- colfunc(breaks)
    class <- classIntervals(varlab, breaks, style = "fixed", fixedBreaks = c(min(varlab) - 
        1, 0, 1.28, 1.645, 1.96, 2.33, 2.575))
    colors <- findColours(class, colpal)
    # draw the map, the variant name, a legend gradient labelled with the
    # z-score cutoffs, and the data source label
    par(mar = c(0, 0, 0, 0))
    plot(ukmap, xlim = c(-6, -1), ylim = c(50, 59.4), col = colors, lwd = 0.4)
    text(0.5, 58.9, varname, cex = 1.1)
    git <- expression(paste("Getis-Ord ", G[i], "* z-score", sep = ""))
    text(0.5, 58.4, git, cex = 0.8)
    gradient.rect(0.2, 55.8, 0.8, 57.8, col = colpal, gradient = "y", border = "black")
    text(0.87, 57.52, 2.575, cex = 0.5, pos = 4)
    text(0.87, 57.23, 2.33, cex = 0.5, pos = 4)
    text(0.87, 56.94, 1.96, cex = 0.5, pos = 4)
    text(0.87, 56.65, 1.645, cex = 0.5, pos = 4)
    text(0.87, 56.36, 1.28, cex = 0.5, pos = 4)
    text(0.87, 56.07, 0, cex = 0.5, pos = 4)
    text(0.5, 50.2, sourcelab, cex = 0.8, col = "gray75")
}

mapgifunc(gitwit$sofa, "Sofa", "firebrick3", "Twitter 2014")

mapgifunc(gitwit$couch, "Couch", "dodgerblue3", "Twitter 2014")
## Warning in classIntervals(varlab, breaks, style = "fixed", fixedBreaks =
## c(min(varlab) - : variable range greater than fixedBreaks

mapgifunc(gitwit$settee, "Settee", "limegreen", "Twitter 2014")
## Warning in classIntervals(varlab, breaks, style = "fixed", fixedBreaks =
## c(min(varlab) - : variable range greater than fixedBreaks

mapgifunc(gibbc$sofa, "Sofa", "firebrick3", "BBC Voices")
## Warning in classIntervals(varlab, breaks, style = "fixed", fixedBreaks =
## c(min(varlab) - : variable range greater than fixedBreaks

mapgifunc(gibbc$couch, "Couch", "dodgerblue3", "BBC Voices")
## Warning in classIntervals(varlab, breaks, style = "fixed", fixedBreaks =
## c(min(varlab) - : variable range greater than fixedBreaks

mapgifunc(gibbc$settee, "Settee", "limegreen", "BBC Voices")
## Warning in classIntervals(varlab, breaks, style = "fixed", fixedBreaks =
## c(min(varlab) - : variable range greater than fixedBreaks

5.4 Map Comparisons

We compare the BBC Voices and Twitter maps for each variant using Lee’s L statistic (https://link.springer.com/article/10.1007/s101090100064), a bivariate spatial association measure for correlating choropleth maps. It is based on Pearson’s r, but rather than only comparing the values of a variable across pairs of matched locations, it also takes into consideration the values at nearby locations, as defined by a spatial weights matrix, much like Moran’s I. This is the first time the measure has been used in linguistics.
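
A single comparison can be run directly, mirroring the calls used in the loop below (a minimal sketch, using sofa and the 5 nearest neighbor weights as an example):

# Lee's L between the BBC Voices and Twitter maps of a single variant
measure <- lee(bbc$sofa, twit$sofa, listws5, length(bbc))
measure$L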

We compute Lee’s L for each form using all 4 spatial weights matrices. For comparison, we also compute Pearson’s and Spearman’s correlation coefficients, both on the raw percentages and on the smoothed Gi* maps.

comp <- data.frame()

for (j in c(2:ncol(twit))) {
    
    comp[j - 1, "NAME"] <- names(twit[j])
    
    measure <- lee(bbc[[j]], twit[[j]], listws2, length(bbc))
    comp[j - 1, "L02"] <- round(measure$L, 3)
    
    measure <- lee(bbc[[j]], twit[[j]], listws5, length(bbc))
    comp[j - 1, "L05"] <- round(measure$L, 3)
    
    measure <- lee(bbc[[j]], twit[[j]], listws10, length(bbc))
    comp[j - 1, "L10"] <- round(measure$L, 3)
    
    measure <- lee(bbc[[j]], twit[[j]], listws20, length(bbc))
    comp[j - 1, "L20"] <- round(measure$L, 3)
    
    measure <- cor(bbc[[j]], twit[[j]], use = "complete.obs", method = "pearson")
    comp[j - 1, "r"] <- round(measure, 3)
    
    measure <- cor(bbc[[j]], twit[[j]], use = "complete.obs", method = "spearman")
    comp[j - 1, "rho"] <- round(measure, 3)
    
    measure <- cor(gibbc[[j]], gitwit[[j]], use = "complete.obs", method = "pearson")
    comp[j - 1, "rgi"] <- round(measure, 3)
    
    measure <- cor(gibbc[[j]], gitwit[[j]], use = "complete.obs", method = "spearman")
    comp[j - 1, "rhogi"] <- round(measure, 3)
    
}

comp

Overall, the correlation coefficients are fairly similar across methods, with the strength decreasing as the spatial weights matrix is based on more nearest neighbors, and with the correlations between the smoothed Gi* maps showing the strongest correspondence.

par(mar = c(3, 3, 3, 3))
vioplot(comp$L02, comp$L05, comp$L10, comp$L20, comp$r, comp$rho, comp$rgi, 
    comp$rhogi, col = "slategray1", names = c("L 2nn", "L 5nn", "L 10nn", "L 20nn", 
        "r", "rho", "r Gi*", "rho Gi*"))

5.5 Comparing Similarity and Strength of Maps

It is also interesting to compare the strength of the correlation between the BBC Voices and Twitter maps for each variant against the strength of the regional patterns themselves, as measured by Moran’s I. For example, when we compare the values of the two measures calculated using the 5 nearest neighbor spatial weights matrices, we find substantial correlations, indicating that the maps tend to match better when the variants show clear regional patterns in the underlying datasets.

cor(morantwit$I_5nn, comp$L05)
## [1] 0.5274918
plot(morantwit$I_5nn, comp$L05, pch = "*", col = "dodgerblue3", cex = 1.6)

cor(moranbbc$I_5nn, comp$L05)
## [1] 0.588585
plot(moranbbc$I_5nn, comp$L05, pch = "*", col = "dodgerblue3", cex = 1.6)