Mapping Lexical Innovation on American Social Media

0.1 Summary

This document contains the main R code for the final multivariate spatial analysis reported in the revised version (2017) of the paper “Mapping lexical innovation on American social media” by Jack Grieve, Andrea Nini and Diansheng Guo.

The paper identifies common regional patterns of lexical innovation based on maps plotting the change over time in the usage of 54 newly emerging words on American Twitter in 2013-2014.

0.2 Libraries

library(psych)
library(Matrix)
library(sp)
library(spdep)
library(GPArotation)
library(maps)
library(mapproj)

0.3 Data

The dataset for this study consists of the number of days since the end of the corpus (22/11/2014) that the relative frequency of each of the 54 emerging words reached its overall third quartile across more than 3,000 counties in the contiguous United States. For each county, the dataset contains a FIPS code, a state and name, and a centroid longitude and latitude, as well as the number of days since each of the words reached its third quartile threshold in that county.

table <- read.table("LI_DAYSINCE_3Q.txt", header = TRUE, sep = ",")

summary(table[, c(1:10)])

##       FIPS            STATE              NAME           LONG        
##  Min.   : 1001   texas   : 254   washington:  32   Min.   :-124.14  
##  1st Qu.:19030   georgia : 159   jefferson :  26   1st Qu.: -98.09  
##  Median :29179   kentucky: 120   franklin  :  25   Median : -90.40  
##  Mean   :30461   missouri: 115   jackson   :  24   Mean   : -91.80  
##  3rd Qu.:45050   kansas  : 105   lincoln   :  24   3rd Qu.: -83.57  
##  Max.   :56045   illinois: 102   madison   :  20   Max.   : -67.65  
##                  (Other) :2220   (Other)   :2924                    
##       LAT           amirite          baeless          baeritto      
##  Min.   :25.53   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:34.63   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :38.37   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   :38.29   Mean   : 48.65   Mean   : 17.95   Mean   :  4.697  
##  3rd Qu.:41.73   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :48.82   Max.   :399.00   Max.   :399.00   Max.   :178.000  
##                                                                     
##     balayage           boolin      
##  Min.   :  0.000   Min.   :  0.00  
##  1st Qu.:  0.000   1st Qu.:  0.00  
##  Median :  0.000   Median :  0.00  
##  Mean   :  7.756   Mean   : 29.82  
##  3rd Qu.:  0.000   3rd Qu.:  0.00  
##  Max.   :399.000   Max.   :399.00  
##

nrow(table)

## [1] 3075

0.4 Mapping the Raw Data

The data can be mapped for each individual word, showing the counties where that word was used relatively frequently at a relatively early point in time. For example, below is the lexical emergence map for “baeless”, showing the number of days since the relative frequency of this word reached the third quartile across counties.
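
The mapping code itself is not included in this document. As a rough sketch only, a county-level map of this kind could be drawn with the `maps` package loaded above, matching counties by FIPS code through the `county.fips` lookup table that ships with `maps`. The three-row `toy` table and its values are made up for illustration, and the gray scale is an assumption rather than the scheme used in the paper.

```r
library(maps)

# Toy stand-in for the real table: three counties with made-up values.
# FIPS 1001 = Autauga (AL), 13121 = Fulton (GA), 48201 = Harris (TX).
toy <- data.frame(FIPS = c(1001, 13121, 48201),
                  baeless = c(0, 399, 120))

# Polygon names in the "county" database and the FIPS code of each polygon
poly_names <- map("county", plot = FALSE, namesonly = TRUE)
data(county.fips)
poly_fips <- county.fips$fips[match(poly_names, county.fips$polyname)]

# Look up each polygon's value; counties absent from the toy table get NA
values <- toy$baeless[match(poly_fips, toy$FIPS)]

# Gray level per polygon: larger values are darker, missing counties are white
level <- ifelse(is.na(values), 1, 0.9 - 0.9 * values / max(toy$baeless))
shade <- gray(level)

# Filled county polygons with state borders drawn on top
map("county", fill = TRUE, col = shade, lty = 0)
map("state", add = TRUE)
```

With the real data, `toy` would be replaced by the full table and `baeless` by any of the 54 word columns.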

0.5 Getis-Ord Gi* Analysis

To extract the underlying regional pattern from each of these 54 emerging word maps, each map was subjected to Getis-Ord Gi* local spatial autocorrelation analysis, using a 25 nearest neighbor spatial weights matrix.

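First, we build the spatial weights matrix, based on the longitude and latitude centroid for each county. Each county is linked to its 24 nearest neighbors and to itself, giving the 25 locations used by the Gi* statistic.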

xy <- as.matrix(data.frame(table$LONG, table$LAT))
neighbors <- knn2nb(knearneigh(xy, k = 24, longlat = TRUE))
neighbors <- include.self(neighbors)
listws <- nb2listw(neighbors)
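
To see what these four lines produce, the same calls can be run on a small set of made-up points. This is only an illustrative check, assuming 100 random centroids in place of the real counties; `toy_xy`, `toy_nb` and `toy_lw` are hypothetical names.

```r
library(spdep)

set.seed(42)
# 100 made-up centroids (longitude, latitude) standing in for county centroids
toy_xy <- cbind(runif(100, -100, -90), runif(100, 30, 40))

# 24 nearest neighbors per point, using great-circle distances
toy_nb <- knn2nb(knearneigh(toy_xy, k = 24, longlat = TRUE))
table(card(toy_nb))   # every point has 24 neighbors

# Add each point to its own neighbor set, as required for Gi*
toy_nb <- include.self(toy_nb)
table(card(toy_nb))   # now 25: the 24 neighbors plus the point itself

# Default style "W" gives row-standardized weights
toy_lw <- nb2listw(toy_nb)
range(sapply(toy_lw$weights, sum))   # each set of weights sums to 1
```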

Second, we conduct the local spatial autocorrelation analysis, which generates a z-score for each word-county pair, indicating whether that county is part of a cluster of counties where that word was used relatively frequently at a relatively early point in time.

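gitable <- table
# Note: column 5 is LAT and the word columns start at column 6, so a loop
# starting at 5 also replaces LAT with z-scores (visible in the summary below).
# The factor analysis uses columns 6 onward; start at 6 to keep LAT unchanged.
for (i in 5:ncol(table)) {
    gitable[, i] <- round(localG(table[, i], listws), 3)
}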
summary(gitable[, c(1:10)])

##       FIPS            STATE              NAME           LONG        
##  Min.   : 1001   texas   : 254   washington:  32   Min.   :-124.14  
##  1st Qu.:19030   georgia : 159   jefferson :  26   1st Qu.: -98.09  
##  Median :29179   kentucky: 120   franklin  :  25   Median : -90.40  
##  Mean   :30461   missouri: 115   jackson   :  24   Mean   : -91.80  
##  3rd Qu.:45050   kansas  : 105   lincoln   :  24   3rd Qu.: -83.57  
##  Max.   :56045   illinois: 102   madison   :  20   Max.   : -67.65  
##                  (Other) :2220   (Other)   :2924                    
##       LAT                amirite            baeless         
##  Min.   :-11.448000   Min.   :-2.10500   Min.   :-1.494000  
##  1st Qu.: -3.687000   1st Qu.:-1.20300   1st Qu.:-1.247500  
##  Median :  0.111000   Median :-0.18700   Median :-0.792000  
##  Mean   :  0.006457   Mean   : 0.02821   Mean   : 0.007105  
##  3rd Qu.:  3.556000   3rd Qu.: 0.96700   3rd Qu.: 0.791000  
##  Max.   : 10.163000   Max.   : 5.50700   Max.   : 9.223000  
##                                                             
##     baeritto          balayage            boolin        
##  Min.   :-1.0120   Min.   :-0.86700   Min.   :-1.58400  
##  1st Qu.:-1.0120   1st Qu.:-0.86700   1st Qu.:-1.58400  
##  Median :-0.0640   Median :-0.59800   Median :-0.74100  
##  Mean   : 0.0117   Mean   :-0.02538   Mean   : 0.04184  
##  3rd Qu.: 0.7210   3rd Qu.: 0.38000   3rd Qu.: 0.46000  
##  Max.   : 4.1520   Max.   : 7.10600   Max.   :14.43100  
##
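
For readers unfamiliar with the statistic, the following base R sketch computes Gi* by hand on ten made-up values arranged along a line. It follows the commonly cited form of the Getis-Ord Gi* z-score and is meant as an illustration only; the scores above come from `localG`, and this sketch is not guaranteed to reproduce that function digit for digit. All object names are hypothetical.

```r
# Ten locations on a line; the values form a small cluster at positions 6 to 8
x <- c(0, 0, 0, 0, 0, 10, 10, 10, 0, 0)
n <- length(x)

# Neighbors: each location plus the one before and the one after it
# (self included, as in Gi*), with row-standardized weights
W <- matrix(0, n, n)
for (i in 1:n) {
    nb <- intersect((i - 1):(i + 1), 1:n)
    W[i, nb] <- 1 / length(nb)
}

xbar <- mean(x)
s <- sqrt(mean(x^2) - xbar^2)   # standard deviation with divisor n
wi <- rowSums(W)                # sum of weights for each location
s1i <- rowSums(W^2)             # sum of squared weights for each location

# Weighted local sum minus its expectation, divided by its standard deviation
gi_star <- (as.vector(W %*% x) - xbar * wi) /
    (s * sqrt((n * s1i - wi^2) / (n - 1)))

round(gi_star, 3)
which.max(gi_star)   # 7, the middle of the cluster
```

Large positive values mark locations surrounded by high values, which is how the z-scores in the table above identify clusters of early, frequent use.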

We can also map these Getis-Ord Gi* z-scores in order to visualize these underlying patterns of regional variation. For example, the map for “baeless” is presented below.

0.6 Factor Analysis

Finally, we conduct an exploratory factor analysis of the 54 emerging word local spatial autocorrelation maps to extract the 5 most important common regional patterns of lexical innovation in our dataset.

fa <- fa(gitable[, 6:ncol(gitable)], nfactors = 5, rotate = "oblimin", scores = "regression", 
    fm = "minres")
fascores <- fa$scores
fas <- as.data.frame(fa$scores)

fa

## Factor Analysis using method =  minres
## Call: fa(r = gitable[, 6:ncol(gitable)], nfactors = 5, rotate = "oblimin", 
##     scores = "regression", fm = "minres")
## Standardized loadings (pattern matrix) based upon correlation matrix
##                MR1   MR2   MR4   MR5   MR3    h2   u2 com
## amirite       0.61 -0.14  0.08  0.01 -0.05 0.433 0.57 1.2
## baeless      -0.05  0.65 -0.12  0.11  0.24 0.637 0.36 1.4
## baeritto      0.35  0.00 -0.16 -0.12  0.13 0.123 0.88 2.0
## balayage      0.37  0.12  0.44 -0.12 -0.03 0.443 0.56 2.3
## boolin       -0.02  0.88 -0.01 -0.14  0.02 0.707 0.29 1.1
## brazy         0.03  0.83  0.17 -0.02 -0.14 0.686 0.31 1.1
## bruuh         0.25  0.44 -0.09  0.07  0.40 0.590 0.41 2.7
## candids       0.53  0.28  0.25  0.04 -0.03 0.594 0.41 2.0
## celfie        0.10  0.31 -0.05 -0.09  0.44 0.365 0.63 2.0
## cosplay       0.78 -0.05  0.20  0.04 -0.01 0.779 0.22 1.1
## dwk           0.03  0.00 -0.14 -0.08 -0.09 0.039 0.96 2.6
## fallback     -0.08  0.59  0.14  0.12  0.31 0.674 0.33 1.8
## famo         -0.02  0.77 -0.07  0.08 -0.19 0.574 0.43 1.2
## faved         0.59  0.06 -0.10  0.04  0.13 0.364 0.64 1.2
## fhritp        0.29  0.14 -0.13 -0.06 -0.02 0.082 0.92 2.0
## figgity       0.39  0.02  0.14 -0.10  0.24 0.272 0.73 2.1
## fleek        -0.21  0.60  0.02  0.37  0.09 0.723 0.28 2.0
## fuckboys      0.27 -0.02  0.43  0.41  0.14 0.704 0.30 2.9
## gainz         0.66 -0.08 -0.07 -0.04  0.03 0.402 0.60 1.1
## gmfu          0.26 -0.09 -0.09 -0.14  0.76 0.627 0.37 1.4
## goalz         0.29  0.23  0.22 -0.12 -0.12 0.218 0.78 3.6
## idgt         -0.20 -0.03  0.10 -0.01  0.83 0.671 0.33 1.2
## lfie          0.38  0.29  0.20 -0.10  0.14 0.370 0.63 3.0
## lifestyleeee  0.06  0.62 -0.09 -0.07  0.06 0.388 0.61 1.1
## litt          0.14 -0.04  0.74  0.22  0.05 0.797 0.20 1.3
## litty         0.02  0.03  0.85 -0.04 -0.02 0.728 0.27 1.0
## lituation    -0.04 -0.01  0.95 -0.09 -0.02 0.846 0.15 1.0
## lordt        -0.10  0.38 -0.04  0.12  0.52 0.590 0.41 2.1
## lw            0.67 -0.01  0.01  0.02  0.14 0.500 0.50 1.1
## mce          -0.06  0.38 -0.02  0.43  0.24 0.585 0.41 2.6
## mmmmmmmuah    0.46  0.05 -0.19 -0.01  0.04 0.181 0.82 1.4
## mutuals       0.62  0.00  0.07  0.14 -0.14 0.485 0.51 1.2
## nahfr         0.01  0.69 -0.02 -0.12 -0.14 0.382 0.62 1.1
## notifs        0.29 -0.09  0.29  0.23  0.19 0.394 0.61 3.8
## pcd           0.64  0.06  0.01  0.01 -0.11 0.427 0.57 1.1
## pullout       0.26  0.18  0.11  0.19  0.16 0.310 0.69 3.8
## rekt          0.54  0.10 -0.03 -0.20  0.08 0.288 0.71 1.4
## rq            0.58  0.10  0.04  0.10 -0.03 0.428 0.57 1.1
## scute         0.51 -0.17  0.09  0.29 -0.02 0.443 0.56 1.9
## senpai        0.82 -0.04 -0.08  0.00  0.01 0.623 0.38 1.0
## shordy        0.01 -0.09  0.01  0.83 -0.08 0.641 0.36 1.0
## slayin        0.46  0.10  0.00 -0.01 -0.29 0.264 0.74 1.8
## sqaud         0.28  0.16 -0.09  0.69 -0.11 0.702 0.30 1.5
## tbfh          0.68  0.01  0.06  0.04  0.00 0.515 0.48 1.0
## tfw           0.56 -0.06  0.06  0.16 -0.06 0.410 0.59 1.2
## thotful       0.00  0.25  0.06  0.12  0.20 0.188 0.81 2.5
## thottin       0.01  0.22  0.15  0.57 -0.04 0.537 0.46 1.5
## tookah       -0.03  0.57  0.21  0.23  0.02 0.568 0.43 1.6
## traphouse     0.02  0.75 -0.04  0.20  0.10 0.781 0.22 1.2
## unbae         0.13  0.04 -0.19  0.08  0.13 0.067 0.93 3.3
## waifu         0.60 -0.07  0.25  0.07  0.03 0.575 0.43 1.4
## wce          -0.02  0.38  0.01  0.41  0.20 0.558 0.44 2.5
## xans          0.25  0.52 -0.02  0.03  0.22 0.503 0.50 1.8
## yaas          0.21  0.35  0.18  0.41  0.18 0.706 0.29 3.3
## 
##                        MR1  MR2  MR4  MR5  MR3
## SS loadings           8.42 7.40 3.86 3.67 3.13
## Proportion Var        0.16 0.14 0.07 0.07 0.06
## Cumulative Var        0.16 0.29 0.36 0.43 0.49
## Proportion Explained  0.32 0.28 0.15 0.14 0.12
## Cumulative Proportion 0.32 0.60 0.74 0.88 1.00
## 
##  With factor correlations of 
##      MR1  MR2  MR4  MR5  MR3
## MR1 1.00 0.08 0.41 0.25 0.10
## MR2 0.08 1.00 0.11 0.40 0.32
## MR4 0.41 0.11 1.00 0.26 0.02
## MR5 0.25 0.40 0.26 1.00 0.13
## MR3 0.10 0.32 0.02 0.13 1.00
## 
## Mean item complexity =  1.8
## Test of the hypothesis that 5 factors are sufficient.
## 
## The degrees of freedom for the null model are  1431  and the objective function was  39.52 with Chi Square of  120738.2
## The degrees of freedom for the model are 1171  and the objective function was  10.73 
## 
## The root mean square of the residuals (RMSR) is  0.05 
## The df corrected root mean square of the residuals is  0.05 
## 
## The harmonic number of observations is  3075 with the empirical chi square  20118.67  with prob <  0 
## The total number of observations was  3075  with Likelihood Chi Square =  32748.15  with prob <  0 
## 
## Tucker Lewis Index of factoring reliability =  0.676
## RMSEA index =  0.094  and the 90 % confidence intervals are  0.093 0.095
## BIC =  23343.78
## Fit based upon off diagonal values = 0.97
## Measures of factor score adequacy             
##                                                 MR1  MR2  MR4  MR5  MR3
## Correlation of scores with factors             0.97 0.97 0.97 0.94 0.93
## Multiple R square of scores with factors       0.94 0.95 0.94 0.89 0.87
## Minimum correlation of possible factor scores  0.89 0.89 0.87 0.78 0.74
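
One way to read a loading table like the one above is to assign each word to the factor on which it loads most strongly. The sketch below does this for simulated data with two known factors, using the same `fa()` settings as above. The data, the variable names and the choice of the largest absolute loading as the criterion are all assumptions made for illustration.

```r
library(psych)        # fa()
library(GPArotation)  # used by fa() for the oblimin rotation

set.seed(1)
n <- 500
f1 <- rnorm(n)   # first simulated latent pattern
f2 <- rnorm(n)   # second simulated latent pattern

# Six made-up variables: a1-a3 follow f1, b1-b3 follow f2
toy <- data.frame(
    a1 = f1 + rnorm(n, sd = 0.5),
    a2 = f1 + rnorm(n, sd = 0.5),
    a3 = f1 + rnorm(n, sd = 0.5),
    b1 = f2 + rnorm(n, sd = 0.5),
    b2 = f2 + rnorm(n, sd = 0.5),
    b3 = f2 + rnorm(n, sd = 0.5)
)

toy_fa <- fa(toy, nfactors = 2, rotate = "oblimin", scores = "regression",
    fm = "minres")

# Factor with the largest absolute loading for each variable
load_mat <- unclass(toy_fa$loadings)
dominant <- colnames(load_mat)[apply(abs(load_mat), 1, which.max)]
data.frame(variable = rownames(load_mat), dominant_factor = dominant)
```

Applied to the real analysis, `load_mat` would be taken from the fitted `fa` object above; words with several moderate loadings (high `com` values) are not well described by a single factor.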

We then map each set of factor scores to visualize these 5 common patterns of American lexical innovation, where darker shades indicate regions where the words associated with that factor tended to occur relatively frequently at a relatively early date (hubs of lexical innovation) and where lighter shades indicate regions where those words tended to spread (areas of diffusion).

We can also plot all five factors together by coloring each county according to the factor with which it is most strongly associated, creating a single map showing the main regions of lexical innovation in this variety of language.
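
The assignment step for such a combined map can be sketched in base R. This assumes that "most strongly associated" means the factor with the highest score in each county, which is an interpretation rather than something stated above. The score matrix, county labels and colors are made up.

```r
# Made-up factor scores for four counties on three factors
toy_scores <- matrix(c( 1.2, -0.3,  0.1,
                       -0.5,  0.9,  0.2,
                        0.0,  0.1,  2.0,
                        0.4,  0.3, -1.0),
    ncol = 3, byrow = TRUE,
    dimnames = list(c("county_a", "county_b", "county_c", "county_d"),
        c("MR1", "MR2", "MR3")))

# Index of the highest-scoring factor in each row (first one wins on ties)
dominant <- max.col(toy_scores, ties.method = "first")

# One color per factor, then one color per county
factor_cols <- c("firebrick", "steelblue", "darkgreen")
county_col <- factor_cols[dominant]

data.frame(dominant = colnames(toy_scores)[dominant], color = county_col,
    row.names = rownames(toy_scores))
```

With the real scores, `toy_scores` would be replaced by `fascores`, and `county_col` could be passed as the fill colors of a county map.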