This document contains the main R code for the final multivariate spatial analysis reported in the revised version (2017) of the paper “Mapping lexical innovation on American social media” by Jack Grieve, Andrea Nini and Diansheng Guo.
The paper identifies common regional patterns of lexical innovation based on maps plotting the change over time in the usage of 54 newly emerging words on American Twitter in 2013-2014.
library(psych)
library(Matrix)
library(sp)
library(spdep)
library(GPArotation)
library(maps)
library(mapproj)
The dataset for this study records, for each of more than 3,000 counties in the contiguous United States, the number of days before the end of the corpus (22/11/2014) at which the relative frequency of each of the 54 emerging words reached its overall third quartile. For each county, the dataset contains a FIPS code, a state and county name, and a centroid longitude and latitude, followed by one column per word giving the number of days since that word reached its third quartile threshold in that county.
table <- read.table("LI_DAYSINCE_3Q.txt", header = TRUE, sep = ",")
summary(table[, c(1:10)])
##       FIPS            STATE             NAME            LONG
##  Min.   : 1001   texas   : 254   washington:  32   Min.   :-124.14
##  1st Qu.:19030   georgia : 159   jefferson :  26   1st Qu.: -98.09
##  Median :29179   kentucky: 120   franklin  :  25   Median : -90.40
##  Mean   :30461   missouri: 115   jackson   :  24   Mean   : -91.80
##  3rd Qu.:45050   kansas  : 105   lincoln   :  24   3rd Qu.: -83.57
##  Max.   :56045   illinois: 102   madison   :  20   Max.   : -67.65
##                  (Other) :2220   (Other)   :2924
##       LAT           amirite           baeless           baeritto
##  Min.   :25.53   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000
##  1st Qu.:34.63   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000
##  Median :38.37   Median :  0.00   Median :  0.00   Median :  0.000
##  Mean   :38.29   Mean   : 48.65   Mean   : 17.95   Mean   :  4.697
##  3rd Qu.:41.73   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000
##  Max.   :48.82   Max.   :399.00   Max.   :399.00   Max.   :178.000
##
##     balayage           boolin
##  Min.   :  0.000   Min.   :  0.00
##  1st Qu.:  0.000   1st Qu.:  0.00
##  Median :  0.000   Median :  0.00
##  Mean   :  7.756   Mean   : 29.82
##  3rd Qu.:  0.000   3rd Qu.:  0.00
##  Max.   :399.000   Max.   :399.00
##
nrow(table)
## [1] 3075
The data can be mapped for each individual word, showing the counties where that word was used relatively frequently at a relatively early point in time. For example, below is the lexical emergence map for “baeless”, showing the number of days since the relative frequency of this word reached the third quartile across counties.
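As a rough illustration of how such a map could be shaded, the sketch below converts a handful of made-up day counts into gray levels, with darker shades for larger values. The vector `days` and the linear gray scale are assumptions for illustration only, not the color scheme used for the published maps.

```r
# Hypothetical day counts for five counties (not taken from the dataset)
days <- c(0, 0, 120, 399, 35)

# Rescale to the interval [0, 1] relative to the largest value
scaled <- days / max(days)

# gray() takes a level between 0 (black) and 1 (white),
# so invert the scale to make larger day counts darker
shade <- gray(1 - scaled)
shade
```

A vector like `shade`, ordered to match the county polygons, could then be supplied as the fill colors of a county map drawn with a mapping package such as `maps`.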
To extract the underlying regional pattern from each of these 54 emerging word maps, each map was subjected to a Getis-Ord Gi* local spatial autocorrelation analysis, using a 25 nearest neighbor spatial weights matrix (each county together with its 24 nearest neighbors).
First, we build the spatial weights matrix, based on the centroid longitude and latitude of each county.
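# Matrix of county centroids: longitude first, then latitude
xy <- as.matrix(data.frame(table$LONG, table$LAT))
# 24 nearest neighbors by great-circle distance (longlat = TRUE)
neighbors <- knn2nb(knearneigh(xy, k = 24, longlat = TRUE))
# Add each county to its own neighbor set (25 in total), as Gi* requires
neighbors <- include.self(neighbors)
# Convert to a weights list (row-standardized by default)
listws <- nb2listw(neighbors)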
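To make the construction of the weights concrete, here is a minimal base R sketch of the same idea on six made-up points: find the k nearest neighbors by great-circle distance, add each point to its own neighbor set, and row-standardize. The coordinates, the helper `haversine_km`, and the choice of k = 2 are all assumptions for illustration; this is not how `spdep` computes its neighbor lists internally.

```r
# Hypothetical centroids (longitude, latitude) for six locations
lon <- c(-97.7, -96.8, -95.4, -84.4, -83.0, -81.7)
lat <- c( 30.3,  32.8,  29.8,  33.7,  40.0,  41.5)
k <- 2  # nearest neighbors per location, not counting the location itself

# Great-circle distance in km via the haversine formula (illustrative helper)
haversine_km <- function(lon1, lat1, lon2, lat2, radius = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

n <- length(lon)
d <- outer(seq_len(n), seq_len(n),
           function(i, j) haversine_km(lon[i], lat[i], lon[j], lat[j]))

# Binary weights: the location itself (distance 0) plus its k nearest neighbors
w <- matrix(0, n, n)
for (i in seq_len(n)) {
  w[i, order(d[i, ])[seq_len(k + 1)]] <- 1
}

# Row-standardize so that every row sums to 1
w_std <- w / rowSums(w)
round(w_std, 3)
```

In this sketch every nonzero weight equals 1 / (k + 1). By the same logic, k = 24 plus the county itself gives 25 equal weights of 1/25 per county.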
Second, we conduct the local spatial autocorrelation analysis, which generates a z-score for each word-county pair, indicating whether that county is part of a cluster of counties where that word was used relatively frequently at a relatively early point in time.
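# Copy the data, then overwrite the columns with Gi* z-scores.
# Note: the loop starts at column 5 (LAT), so LAT is overwritten too;
# the word columns begin at column 6, which is where the factor analysis starts.
gitable <- table
for (i in 5:ncol(table)) {
    gitable[, i] <- round(localG(table[, i], listws), 3)
}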
summary(gitable[, c(1:10)])
##       FIPS            STATE             NAME            LONG
##  Min.   : 1001   texas   : 254   washington:  32   Min.   :-124.14
##  1st Qu.:19030   georgia : 159   jefferson :  26   1st Qu.: -98.09
##  Median :29179   kentucky: 120   franklin  :  25   Median : -90.40
##  Mean   :30461   missouri: 115   jackson   :  24   Mean   : -91.80
##  3rd Qu.:45050   kansas  : 105   lincoln   :  24   3rd Qu.: -83.57
##  Max.   :56045   illinois: 102   madison   :  20   Max.   : -67.65
##                  (Other) :2220   (Other)   :2924
##       LAT                amirite             baeless
##  Min.   :-11.448000   Min.   :-2.10500   Min.   :-1.494000
##  1st Qu.: -3.687000   1st Qu.:-1.20300   1st Qu.:-1.247500
##  Median :  0.111000   Median :-0.18700   Median :-0.792000
##  Mean   :  0.006457   Mean   : 0.02821   Mean   : 0.007105
##  3rd Qu.:  3.556000   3rd Qu.: 0.96700   3rd Qu.: 0.791000
##  Max.   : 10.163000   Max.   : 5.50700   Max.   : 9.223000
##
##     baeritto           balayage             boolin
##  Min.   :-1.0120   Min.   :-0.86700   Min.   :-1.58400
##  1st Qu.:-1.0120   1st Qu.:-0.86700   1st Qu.:-1.58400
##  Median :-0.0640   Median :-0.59800   Median :-0.74100
##  Mean   : 0.0117   Mean   :-0.02538   Mean   : 0.04184
##  3rd Qu.: 0.7210   3rd Qu.: 0.38000   3rd Qu.: 0.46000
##  Max.   : 4.1520   Max.   : 7.10600   Max.   :14.43100
##
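For readers who want to see what lies behind these z-scores, the following base R sketch computes the Getis-Ord Gi* statistic in its standard textbook form on a small toy example. The data, the Euclidean distances, and k = 2 are assumptions for illustration; the analysis above relies on `localG` from `spdep`, and this sketch is not claimed to reproduce its output digit for digit.

```r
# Toy data: 10 locations on a line, with high values clustered at the left end
coords <- cbind(x = 1:10, y = rep(0, 10))
values <- c(9, 8, 9, 7, 1, 0, 1, 0, 1, 0)
k <- 2  # nearest neighbors per location, not counting the location itself

n <- length(values)
d <- as.matrix(dist(coords))  # pairwise Euclidean distances

# Binary weights: 1 for the location itself and for its k nearest neighbors
w <- matrix(0, n, n)
for (i in seq_len(n)) {
  w[i, order(d[i, ])[seq_len(k + 1)]] <- 1
}

xbar <- mean(values)
s <- sqrt(sum(values^2) / n - xbar^2)  # population standard deviation

w_sum  <- rowSums(w)    # sum of weights for each location
w_sum2 <- rowSums(w^2)  # sum of squared weights for each location

# Gi*: weighted local sum minus its expectation, divided by its standard deviation
gi_star <- (as.vector(w %*% values) - xbar * w_sum) /
  (s * sqrt((n * w_sum2 - w_sum^2) / (n - 1)))
round(gi_star, 3)
```

Locations at the left end of the line, where high values cluster, receive positive scores, and those at the right end receive negative scores, which is the pattern the z-scores above are read for.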
We can also map these Getis-Ord Gi* z-scores in order to visualize these underlying patterns of regional variation. For example, the map for “baeless” is presented below.
Finally, we conduct an exploratory factor analysis of the 54 emerging word local spatial autocorrelation maps to extract the 5 most important common regional patterns of lexical innovation in our dataset.
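# Exploratory factor analysis of the 54 Gi* maps (columns 6 onward):
# 5 factors, minimum residual extraction, oblique (oblimin) rotation
fa <- fa(gitable[, 6:ncol(gitable)], nfactors = 5, rotate = "oblimin", scores = "regression",
    fm = "minres")
# Regression-based factor scores, one row per county
fascores <- fa$scores
fas <- as.data.frame(fa$scores)
fa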
## Factor Analysis using method = minres
## Call: fa(r = gitable[, 6:ncol(gitable)], nfactors = 5, rotate = "oblimin",
## scores = "regression", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
## MR1 MR2 MR4 MR5 MR3 h2 u2 com
## amirite 0.61 -0.14 0.08 0.01 -0.05 0.433 0.57 1.2
## baeless -0.05 0.65 -0.12 0.11 0.24 0.637 0.36 1.4
## baeritto 0.35 0.00 -0.16 -0.12 0.13 0.123 0.88 2.0
## balayage 0.37 0.12 0.44 -0.12 -0.03 0.443 0.56 2.3
## boolin -0.02 0.88 -0.01 -0.14 0.02 0.707 0.29 1.1
## brazy 0.03 0.83 0.17 -0.02 -0.14 0.686 0.31 1.1
## bruuh 0.25 0.44 -0.09 0.07 0.40 0.590 0.41 2.7
## candids 0.53 0.28 0.25 0.04 -0.03 0.594 0.41 2.0
## celfie 0.10 0.31 -0.05 -0.09 0.44 0.365 0.63 2.0
## cosplay 0.78 -0.05 0.20 0.04 -0.01 0.779 0.22 1.1
## dwk 0.03 0.00 -0.14 -0.08 -0.09 0.039 0.96 2.6
## fallback -0.08 0.59 0.14 0.12 0.31 0.674 0.33 1.8
## famo -0.02 0.77 -0.07 0.08 -0.19 0.574 0.43 1.2
## faved 0.59 0.06 -0.10 0.04 0.13 0.364 0.64 1.2
## fhritp 0.29 0.14 -0.13 -0.06 -0.02 0.082 0.92 2.0
## figgity 0.39 0.02 0.14 -0.10 0.24 0.272 0.73 2.1
## fleek -0.21 0.60 0.02 0.37 0.09 0.723 0.28 2.0
## fuckboys 0.27 -0.02 0.43 0.41 0.14 0.704 0.30 2.9
## gainz 0.66 -0.08 -0.07 -0.04 0.03 0.402 0.60 1.1
## gmfu 0.26 -0.09 -0.09 -0.14 0.76 0.627 0.37 1.4
## goalz 0.29 0.23 0.22 -0.12 -0.12 0.218 0.78 3.6
## idgt -0.20 -0.03 0.10 -0.01 0.83 0.671 0.33 1.2
## lfie 0.38 0.29 0.20 -0.10 0.14 0.370 0.63 3.0
## lifestyleeee 0.06 0.62 -0.09 -0.07 0.06 0.388 0.61 1.1
## litt 0.14 -0.04 0.74 0.22 0.05 0.797 0.20 1.3
## litty 0.02 0.03 0.85 -0.04 -0.02 0.728 0.27 1.0
## lituation -0.04 -0.01 0.95 -0.09 -0.02 0.846 0.15 1.0
## lordt -0.10 0.38 -0.04 0.12 0.52 0.590 0.41 2.1
## lw 0.67 -0.01 0.01 0.02 0.14 0.500 0.50 1.1
## mce -0.06 0.38 -0.02 0.43 0.24 0.585 0.41 2.6
## mmmmmmmuah 0.46 0.05 -0.19 -0.01 0.04 0.181 0.82 1.4
## mutuals 0.62 0.00 0.07 0.14 -0.14 0.485 0.51 1.2
## nahfr 0.01 0.69 -0.02 -0.12 -0.14 0.382 0.62 1.1
## notifs 0.29 -0.09 0.29 0.23 0.19 0.394 0.61 3.8
## pcd 0.64 0.06 0.01 0.01 -0.11 0.427 0.57 1.1
## pullout 0.26 0.18 0.11 0.19 0.16 0.310 0.69 3.8
## rekt 0.54 0.10 -0.03 -0.20 0.08 0.288 0.71 1.4
## rq 0.58 0.10 0.04 0.10 -0.03 0.428 0.57 1.1
## scute 0.51 -0.17 0.09 0.29 -0.02 0.443 0.56 1.9
## senpai 0.82 -0.04 -0.08 0.00 0.01 0.623 0.38 1.0
## shordy 0.01 -0.09 0.01 0.83 -0.08 0.641 0.36 1.0
## slayin 0.46 0.10 0.00 -0.01 -0.29 0.264 0.74 1.8
## sqaud 0.28 0.16 -0.09 0.69 -0.11 0.702 0.30 1.5
## tbfh 0.68 0.01 0.06 0.04 0.00 0.515 0.48 1.0
## tfw 0.56 -0.06 0.06 0.16 -0.06 0.410 0.59 1.2
## thotful 0.00 0.25 0.06 0.12 0.20 0.188 0.81 2.5
## thottin 0.01 0.22 0.15 0.57 -0.04 0.537 0.46 1.5
## tookah -0.03 0.57 0.21 0.23 0.02 0.568 0.43 1.6
## traphouse 0.02 0.75 -0.04 0.20 0.10 0.781 0.22 1.2
## unbae 0.13 0.04 -0.19 0.08 0.13 0.067 0.93 3.3
## waifu 0.60 -0.07 0.25 0.07 0.03 0.575 0.43 1.4
## wce -0.02 0.38 0.01 0.41 0.20 0.558 0.44 2.5
## xans 0.25 0.52 -0.02 0.03 0.22 0.503 0.50 1.8
## yaas 0.21 0.35 0.18 0.41 0.18 0.706 0.29 3.3
##
## MR1 MR2 MR4 MR5 MR3
## SS loadings 8.42 7.40 3.86 3.67 3.13
## Proportion Var 0.16 0.14 0.07 0.07 0.06
## Cumulative Var 0.16 0.29 0.36 0.43 0.49
## Proportion Explained 0.32 0.28 0.15 0.14 0.12
## Cumulative Proportion 0.32 0.60 0.74 0.88 1.00
##
## With factor correlations of
## MR1 MR2 MR4 MR5 MR3
## MR1 1.00 0.08 0.41 0.25 0.10
## MR2 0.08 1.00 0.11 0.40 0.32
## MR4 0.41 0.11 1.00 0.26 0.02
## MR5 0.25 0.40 0.26 1.00 0.13
## MR3 0.10 0.32 0.02 0.13 1.00
##
## Mean item complexity = 1.8
## Test of the hypothesis that 5 factors are sufficient.
##
## The degrees of freedom for the null model are 1431 and the objective function was 39.52 with Chi Square of 120738.2
## The degrees of freedom for the model are 1171 and the objective function was 10.73
##
## The root mean square of the residuals (RMSR) is 0.05
## The df corrected root mean square of the residuals is 0.05
##
## The harmonic number of observations is 3075 with the empirical chi square 20118.67 with prob < 0
## The total number of observations was 3075 with Likelihood Chi Square = 32748.15 with prob < 0
##
## Tucker Lewis Index of factoring reliability = 0.676
## RMSEA index = 0.094 and the 90 % confidence intervals are 0.093 0.095
## BIC = 23343.78
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy
## MR1 MR2 MR4 MR5 MR3
## Correlation of scores with factors 0.97 0.97 0.97 0.94 0.93
## Multiple R square of scores with factors 0.94 0.95 0.94 0.89 0.87
## Minimum correlation of possible factor scores 0.89 0.89 0.87 0.78 0.74
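One simple way to read the pattern matrix above is to assign each word to the factor on which it has its largest absolute loading. The sketch below does this for five rows copied from the output; the selection of rows and the "largest absolute loading" rule are assumptions for illustration, not a procedure described in the paper.

```r
# Five rows of the pattern matrix, copied from the output above
loadings <- matrix(c(
   0.61, -0.14,  0.08,  0.01, -0.05,   # amirite
  -0.02,  0.88, -0.01, -0.14,  0.02,   # boolin
  -0.04, -0.01,  0.95, -0.09, -0.02,   # lituation
   0.01, -0.09,  0.01,  0.83, -0.08,   # shordy
  -0.20, -0.03,  0.10, -0.01,  0.83),  # idgt
  ncol = 5, byrow = TRUE,
  dimnames = list(c("amirite", "boolin", "lituation", "shordy", "idgt"),
                  c("MR1", "MR2", "MR4", "MR5", "MR3")))

# Factor with the largest absolute loading for each word
primary <- colnames(loadings)[max.col(abs(loadings), ties.method = "first")]

# Group the words by their primary factor
split(rownames(loadings), primary)
```

Words with high complexity values (the `com` column), such as "notifs" or "pullout", load on several factors at once, so a single primary factor describes them less well.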
We then map each set of factor scores to visualize these 5 common patterns of American lexical innovation. Darker shades indicate regions where the words associated with that factor tended to occur relatively frequently at a relatively early date (hubs of lexical innovation), and lighter shades indicate regions to which those words tended to spread (areas of diffusion).
We can also plot all five factors together by coloring each county according to the factor with which it is most strongly associated, creating a single map that shows the main regions of lexical innovation in this variety of language.
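The county-to-factor assignment behind such a combined map can be sketched as follows, assuming that "most strongly associated" means the factor with the highest score for that county. The score matrix, the county labels, and the colors are made up for illustration and use three factors rather than five to keep the example short.

```r
# Hypothetical factor scores: one row per county, one column per factor
scores <- matrix(c( 1.2, -0.3,  0.1,
                   -0.5,  0.9,  0.2,
                    0.0,  0.1,  1.5,
                    0.7,  0.6, -1.0),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(paste0("county", 1:4),
                                 c("MR1", "MR2", "MR3")))

# Factor with the highest score in each row
dominant <- colnames(scores)[max.col(scores, ties.method = "first")]

# One (arbitrary) color per factor, looked up by factor name
factor_colors <- c(MR1 = "red", MR2 = "blue", MR3 = "darkgreen")
county_color <- unname(factor_colors[dominant])

data.frame(county = rownames(scores), dominant, county_color)
```

Applied to the real scores, the same lookup would run on the five columns of the factor score matrix, giving one color per county.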