1 Summary

This document describes an analysis of how various factors predict the success of a set of emerging words on American Twitter. This is a preliminary analysis of 54 emerging words for the Evolang XII conference proceedings. I’m currently expanding the set of emerging words and refining the analysis in other ways for a longer paper (see the notes in the text). The complete dataset used for this analysis is available here.

2 Emerging Words

This analysis looks at 54 emerging words that we found in a previous study, based on a 9-billion-word corpus of geo-coded American Tweets posted between October 2013 and November 2014 (see Grieve et al. 2017, 2018).

table <- read.table("EVOL_DATA_FINAL.csv", header = TRUE, sep = ",")
table$WORD
##  [1] amirite      baeless      baeritto     balayage     boolin      
##  [6] brazy        bruuh        candids      celfie       cosplay     
## [11] dwk          fallback     famo         faved        fhritp      
## [16] figgity      fleek        fuckboys     gainz        gmfu        
## [21] goalz        idgt         lfie         lifestyleeee litt        
## [26] litty        lituation    lordt        lw           mce         
## [31] mmmmmmmuah   mutuals      nahfr        notifs       pcd         
## [36] pullout      rekt         rq           scute        senpai      
## [41] shordy       slayin       sqaud        tbfh         tfw         
## [46] thotful      thottin      tookah       traphouse    unbae       
## [51] waifu        wce          xans         yaas        
## 54 Levels: amirite baeless baeritto balayage boolin brazy ... yaas

This is the complete set of word forms in the corpus that meet the following criteria:

    1. Not listed in a standard dictionary (Merriam-Webster, 2015)
    2. Not a proper noun
    3. Occur at least 500 times in the complete corpus
    4. Occur at a rate of less than once per million words at the end of 2013
    5. Rate of use increased monotonically over the course of 2014 (rho > .7; see the sketch below)

In addition, only the most common word form was retained for each lemma (set of inflected forms and spelling variants).
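
As a rough illustration of the monotonicity criterion, here is a minimal sketch of the check for a single word, assuming rho is Spearman’s rank correlation between month index and rate of use, computed over made-up monthly relative frequencies for 2014 (the monthly_rf values are hypothetical, purely for illustration):

# Hypothetical monthly relative frequencies (per million words) for 2014
monthly_rf <- c(0.2, 0.3, 0.3, 0.5, 0.6, 0.9, 1.1, 1.4, 1.6, 2.1, 2.4, 3.0)
# Rank correlation between month index (Jan-Dec) and rate of use
rho <- cor(1:12, monthly_rf, method = "spearman")
rho > 0.7  # does the word pass the monotonic-increase criterion?
## [1] TRUE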

Right now, I’m extracting more word forms for the final analysis by relaxing the last two criteria, likely looking at words that occur at a rate of less than 10 occurrences per million words and that show a monotonic increase in use of at least rho > .6.

3 Outcome Variable

The goal of the analysis is to investigate the degree to which certain variables predict the success of emerging words, but it’s not entirely clear how to measure word success.

For now I’m using the log of the factor by which the relative frequency of the word changed from 2014 to 2016 (i.e. log(2016_RF/2014_RF)).

Specifically, I’m using the relative frequencies in November 2014, the final month in the original Twitter corpus, and a comparable November sample from 2016.
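
For reference, here is how these columns could be computed, assuming hypothetical per-million relative frequency columns named RF_2014 and RF_2016 (the released dataset already contains the precomputed FACTOR and LOG_FACTOR columns used below):

# RF_2014 and RF_2016 are hypothetical column names for the per-million
# relative frequencies in November 2014 and November 2016
table$FACTOR <- table$RF_2016 / table$RF_2014  # raw factor change
table$LOG_FACTOR <- log10(table$FACTOR)        # logged (base-10) factor change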

Most of the words fell in usage, which isn’t all that surprising, but some fell far more than others, and some words rose in usage. We can see this in the histogram below, where the split between falling and rising words is marked with a red line and the median is marked with an orange line.

hist(table$FACTOR, col = "dodgerblue", main = "Factor Change")
abline(v = 1, col = "red", lwd = 3)
abline(v = median(table$FACTOR), col = "orange", lwd = 2)

I took the log because the raw factor is asymmetric: it can only range from 0 to 1 for words whose usage falls, but from 1 to a potentially far larger number for words whose usage rises. For example, a word that doubles in usage has a factor of 2.0 (a logged factor of +0.301), while a word that halves in usage has a factor of only 0.5 (a logged factor of -0.301), so logging makes rises and falls of the same magnitude symmetric around zero.
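
We can verify this symmetry directly, since the logged factor is just the base-10 logarithm of the raw factor:

log10(2)    # a word that doubles in usage
## [1] 0.30103
log10(0.5)  # a word that halves in usage
## [1] -0.30103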

summary(table$LOG_FACTOR)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.0239 -0.6056 -0.1369 -0.1991  0.2125  1.4576
table[, c(1:5)]

We can see the effect of logging these factors by comparing the histogram for the logged factors below to the histogram for the un-logged factors above. It really decompresses the negative end of the metric.

hist(table$LOG_FACTOR, col = "dodgerblue", main = "Log of Factor Change")
abline(v = 0, col = "red", lwd = 3)
abline(v = median(table$LOG_FACTOR), col = "orange", lwd = 2)

Right now, in addition to extracting additional words, I am also computing relative frequencies for just November 2016, for the sake of consistency with the November 2014 frequencies.
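
For reference, a relative frequency here is just a word’s count normalised by the size of the sample, e.g. (with made-up counts):

word_count  <- 1250   # hypothetical occurrences of a word in the Nov 2016 sample
corpus_size <- 750e6  # hypothetical total word count of the Nov 2016 sample
word_count / corpus_size * 1e6  # relative frequency per million words
## [1] 1.666667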

There are lots of other possibilities here too. I think one fairly different, and therefore interesting, approach would be to ignore the 2014 frequencies entirely and just code a binary response for whether or not the word is still being used at some minimum relative frequency threshold in 2016, like once per million words (see the sketch below).
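
A minimal sketch of that alternative coding, assuming a hypothetical RF_2016 column holding the 2016 relative frequency in words per million:

# RF_2016 is a hypothetical column name: 2016 relative frequency per million
table$SURVIVED <- as.numeric(table$RF_2016 >= 1)  # 1 = still in use, 0 = not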

4 Predictors

I’ve considered a range of explanatory variables, which can be divided into two basic types: form and function. It would definitely be nice to have more cases here to analyse.

4.1 Form

I looked at three explanatory variables related to the structure/grammar of the words: the length of the word (in characters), part-of-speech (Nouns, Verbs, Adjectives, Other), and word formation process (Acronyms, Creative Spellings, Other). I’ve also broken down this “Other” category into various standard word formation processes (e.g. Compounding, Truncation, Derivation), but that breakdown is probably a bit too fine-grained for the final analysis.

hist(table$LENGTH, col = "dodgerblue")
abline(v = median(table$LENGTH), col = "orange", lwd = 2)

counts <- table(table$POS)
barplot(counts, col = c("dodgerblue", "darkviolet", "forestgreen", "darkorange"))

counts <- table(table$FORM)
barplot(counts, col = c("dodgerblue", "darkviolet", "forestgreen"))

counts <- table(table$FORM_LONG)
barplot(counts, col = c("dodgerblue", "darkviolet", "forestgreen", "darkorange", 
    "gold", "brown1", "cyan2", "maroon", "palegreen"), cex.names = 0.75)

None of these predictors appears to have a major effect on the success of an emerging word, as illustrated in the scatterplot and boxplots below. Longer words, verbs, and acronyms all seem to be a bit less successful.

plot(table$LOG_FACTOR, table$LENGTH, pch = "*", col = "dodgerblue", cex = 2)

boxplot(table$LOG_FACTOR ~ table$POS, col = c("dodgerblue", "darkviolet", "forestgreen", 
    "darkorange"))

boxplot(table$LOG_FACTOR ~ table$FORM, col = c("dodgerblue", "darkviolet", "forestgreen"))

When defined more specifically, the word formation processes appear to be a bit more interesting, with blends and borrowings doing especially well, but there isn’t a lot of data here to draw generalisations from.

boxplot(table$LOG_FACTOR ~ table$FORM_LONG, col = c("dodgerblue", "darkviolet", 
    "forestgreen", "darkorange", "gold", "brown1", "cyan2", "maroon", "palegreen"), 
    cex.axis = 0.5)

4.2 Function

The meaning of a word seems like it should be an especially important predictor of its success. It is, however, tricky to code for meaning in a rigorous and meaningful way.

I came up with a system consisting of a basic two-way distinction between words that mark new meanings and words that are synonymous with other words in a standard dictionary. I don’t worry about whether other slang terms exist because that’s really impossible to judge.

So you have a word like “balayage”, which appears to mark a new meaning, as it refers to a specific hairstyle that is not defined in Merriam-Webster, the American Heritage Dictionary, WordNet, or the Microsoft Word dictionary, for example. Alternatively, you have a word like “baeless”, which means the same thing as being “single”, and is therefore just a synonym for an existing standard word, and so does not mark a new meaning.

Creative spellings are somewhat difficult to classify in this system, since they are by definition variant spellings of existing words, but there does seem to be a pretty clear difference between spellings that mark new meanings of existing words and spellings that are just for emphasis or to represent a pronunciation, for instance.

So you have a word like “gainz”, which appears to mark a new meaning of the word “gains”, specifically “weight gains from working out”. Alternatively, you have a word like “yaas”, which appears to just represent a specific pronunciation of “yes”.

These two types of words are fairly evenly distributed in the dataset.

counts <- table(table$MEAN)
barplot(counts, col = c("dodgerblue", "darkviolet"))

This variable appears to do a good job of predicting the success of these emerging words, with words that mark new meanings tending to do much better over time, which I think makes a lot of sense.

boxplot(table$LOG_FACTOR ~ table$MEAN, col = c("dodgerblue", "darkviolet"))

Given that creative spellings and, to a lesser extent, acronyms are somewhat different from the other new words, it is also interesting to look at success across both form and meaning.

boxplot(table$LOG_FACTOR ~ table$MEAN + table$FORM, col = c("dodgerblue", "darkviolet"), 
    cex.axis = 0.5)

5 Linear Model

So here’s the complete dataset.

table

And here is the linear model using the four predictors.

lm.1 <- lm(LOG_FACTOR ~ LENGTH + POS + FORM + MEAN, data = table)
summary(lm.1)
## 
## Call:
## lm(formula = LOG_FACTOR ~ LENGTH + POS + FORM + MEAN, data = table)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17243 -0.25901 -0.03617  0.33718  0.98960 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.31330    0.26103   1.200  0.23619    
## LENGTH      -0.09647    0.04528  -2.130  0.03851 *  
## POSN        -0.18298    0.19333  -0.946  0.34885    
## POSO         0.18655    0.26314   0.709  0.48194    
## POSV        -0.19468    0.23854  -0.816  0.41861    
## FORMOTHER    0.68889    0.23919   2.880  0.00602 ** 
## FORMSPELL    0.33034    0.23927   1.381  0.17406    
## MEANOLD     -0.72091    0.15953  -4.519 4.33e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5177 on 46 degrees of freedom
## Multiple R-squared:  0.4271, Adjusted R-squared:  0.3399 
## F-statistic: 4.898 on 7 and 46 DF,  p-value: 0.0003435
library(car)
## Warning: package 'car' was built under R version 3.4.3
vif(lm.1)
##            GVIF Df GVIF^(1/(2*Df))
## LENGTH 1.628050  1        1.275950
## POS    1.448690  3        1.063725
## FORM   1.840847  2        1.164808
## MEAN   1.274626  1        1.128993

Overall, the results are very interesting, and the model does a fairly good job of predicting the success of new words.

The model is significant at the p < .001 level, with an adjusted r-squared of .34.

Specifically, whether a word marks a new meaning was found to be an especially important predictor of its success, at least for this dataset, with words that mark a new meaning being more likely to rise in usage over time. This makes sense, given that these words are essentially filling semantic gaps in the lexicon, whereas the other words are merely referential synonyms for existing words.

In addition, the length of the word was also found to be significant, with shorter words tending to be more successful over time, perhaps reflecting the specific communicative constraints of the Twitter register.

The word formation process responsible for the word also has some effect, with words created through traditional word formation processes tending to be more successful than creative spellings and acronyms, both of which are generally more restricted to this particular register, or at least to CMC/written language.

Finally, part-of-speech was found to have very little effect on the success of emerging words over time.

In terms of residuals, everything looks okay, although it is also interesting to see which words’ success was especially badly predicted by this model.

plot(residuals(lm.1), pch = "*", col = "dodgerblue", cex = 2)

hist(residuals(lm.1), col = "dodgerblue")

qqnorm(residuals(lm.1), pch = "*", col = "dodgerblue", cex = 2)

table$residuals <- residuals(lm.1)

table[, c(1, 11)]
##            WORD    residuals
## 1       amirite  0.081587443
## 2       baeless -0.591246692
## 3      baeritto -0.061199745
## 4      balayage  0.485478871
## 5        boolin  0.688364148
## 6         brazy  0.989603563
## 7         bruuh -0.270501483
## 8       candids -0.114582110
## 9        celfie -0.069772101
## 10      cosplay  0.145387212
## 11          dwk -1.033316124
## 12     fallback -0.728829190
## 13         famo -0.464170784
## 14        faved -0.036265049
## 15       fhritp  0.173309554
## 16      figgity  0.106121733
## 17        fleek -0.284242497
## 18     fuckboys  0.474141908
## 19        gainz  0.089405698
## 20         gmfu  0.465609813
## 21        goalz  0.220289463
## 22         idgt -0.036069652
## 23         lfie  0.421593720
## 24 lifestyleeee  0.291644022
## 25         litt  0.002036307
## 26        litty  0.937797559
## 27    lituation  0.948523827
## 28        lordt  0.730696733
## 29           lw -0.080371427
## 30          mce -0.286031093
## 31   mmmmmmmuah -1.168484137
## 32      mutuals  0.172676588
## 33        nahfr -0.043623969
## 34       notifs  0.063987261
## 35          pcd -0.190257501
## 36      pullout -0.127686813
## 37         rekt -0.051494079
## 38           rq  0.352360858
## 39        scute  0.104942019
## 40       senpai -0.197479777
## 41       shordy  0.126068946
## 42       slayin  0.422012158
## 43        sqaud -0.611859447
## 44         tbfh  0.532315329
## 45          tfw  0.387302979
## 46      thotful -0.560872982
## 47      thottin  0.411133011
## 48       tookah -0.477441097
## 49    traphouse -0.440187482
## 50        unbae -1.172434954
## 51        waifu -0.084930061
## 52          wce -0.284852734
## 53         xans -0.224551942
## 54         yaas -0.131635799

On the negative side, most of the words that fell a lot more than expected were still predicted to fall. There appear to be several different explanations for these near-extinctions. For example, the extreme fall of “unbae” is probably due in part to a general fall in the usage of “bae”-related forms. Alternatively, the extreme fall of “dwk” and “thotful” is probably because both originated as proper nouns (in the titles of popular songs). They were only included in the analysis because they were being used in the 2014 corpus in more generic ways, but it appears that their proper-noun origins have ultimately caught up with them. Some creative spellings also did especially badly (“mmmmmmmuah”, “sqaud”), which I guess isn’t especially surprising, given that they’re a bit random and pointless.

On the positive side, there is less to talk about. The massive success of “lit”-related forms, especially “litty” and “lituation”, which were still relatively uncommon in 2014, is notable. I’m not exactly sure what is happening here: these words were expected to rise, just not to this degree. The bigger surprise is “brazy”, which was expected to fall, since it doesn’t, strictly speaking, mark a new meaning: it basically just means “crazy”; the “c” is replaced with a “b” to mark the word as being associated with the Bloods as opposed to the Crips street gang. Arguably, it does mark a new social meaning/connotation, but the same could be said for creative spellings that reflect accent, so I think I coded it correctly (and conservatively) as is. It is also notable that the related form “boolin” (Bloods + coolin) fell. I’m not entirely sure what’s happening here either.

6 Conclusions

Overall the analysis was fairly successful, finding some clear and intuitive predictors for the success and failure of emerging words on Twitter, most notably whether or not the word marks a new meaning.

In terms of natural selection in the modern English lexicon, these results show that, at least in this variety of language, the communicative utility of a new word is a strong predictor of its success.

Whether or not these results hold for Twitter more generally, for other varieties of English, for the English language in general, and for other languages is an open question. Furthermore, the number of emerging words analysed is relatively small, and there are certainly other predictors that could be included in the analysis, including, given the analysis of residuals, whether or not the word originated as a proper noun. It would also be neat to look at the social/regional origin of the words and the characteristics of their relative frequency series. There are also other ways to conceive of the outcome variable, as noted above.