ANLY540 - Analysis of Human Language - Assignment 4: Distinctive Collexeme Analysis

Create the Data

Is there a difference in Donald Trump's Twitter word usage during his run for president versus while he is president?

Data from: http://trumptwitterarchive.com/
Not Prez data is Jan 2015 to Jan 2017
Prez data is from Jan 2017 to Jan 2019
Install the tokenizers package if you do not have it.

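# Read the tweets and drop rows with missing values
master = read.csv("trump_tweets.csv", stringsAsFactors = FALSE)
master = na.omit(master)

# The first column labels each tweet as "prez" or "not prez"
colnames(master)[1] = "time"

prez = subset(master, time == "prez")
notprez = subset(master, time == "not prez")

# Tokenize the Prez tweets into lowercase bigrams and count each bigram
ptemp = tokenizers::tokenize_ngrams(as.character(prez$text),
                            lowercase = TRUE,
                            n = 2)
pdata = as.data.frame(table(unlist(ptemp)))

# Same for the Not Prez tweets
nptemp = tokenizers::tokenize_ngrams(as.character(notprez$text),
                            lowercase = TRUE,
                            n = 2)
npdata = as.data.frame(table(unlist(nptemp)))

# Keep every bigram from either period; counts missing in one period become NA
final_data = merge(pdata, npdata, by = "Var1", all = TRUE)

colnames(final_data) = c("Bigram", "Prez", "NotPrez")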

Load the Libraries + Functions

Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.

library(Rling)

Deal with NA values

Remember that the NA values just represent options that didn’t occur in one time point versus the other. Fill those NA values in with zero.

final_data[is.na(final_data)] = 0
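To see why the merge produces NA values and what the zero fill does, here is a minimal sketch on made-up tweets. The texts and the helper `toy_bigrams` are hypothetical and use base R only; the helper is a simplified stand-in for the tokenizer used above, not a reimplementation of it.

```r
# Made-up example texts (not taken from the tweet archive)
prez_text    <- c("make america safe", "make a deal")
notprez_text <- c("make america great", "make america great again")

# Hypothetical helper: lowercase, split on whitespace, pair adjacent words
toy_bigrams <- function(x) {
  unlist(lapply(strsplit(tolower(x), "\\s+"), function(w) {
    if (length(w) < 2) return(character(0))
    paste(w[-length(w)], w[-1])
  }))
}

p_tab  <- as.data.frame(table(toy_bigrams(prez_text)), stringsAsFactors = FALSE)
np_tab <- as.data.frame(table(toy_bigrams(notprez_text)), stringsAsFactors = FALSE)

# all = TRUE keeps bigrams seen in only one period; the other count is NA
toy <- merge(p_tab, np_tab, by = "Var1", all = TRUE)
colnames(toy) <- c("Bigram", "Prez", "NotPrez")
print(toy)

# An NA here means "never observed in that period", so zero is the right count
toy[is.na(toy)] <- 0
print(toy)
```

In this toy data "make a" appears only in the Prez texts, so its NotPrez count is NA after the merge and 0 after the fill.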

Pick your construction

Pick a construction to consider, remembering that the data has been put into bigrams (although you can change this to larger numbers above if you want).
You can look for a particular word in a construction using regular expressions and the grep function.
- the ^ indicates that it should be the start of the bigram.
- the $ indicates that it should be the last word of the bigram.
- I recommend putting a space so you only get that particular word.
- You might consider using the word make - so the firstword would be "^make " or the lastword would be " make$".
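The effect of the anchors and the space can be checked on a small vector. This is a sketch on invented bigrams, not the tweet data; the variable names are illustrative.

```r
# Invented bigrams for illustration
bigrams <- c("make america", "remake america", "make a",
             "makes sense", "to make", "we remake")

# "^make " : 'make' as the complete first word
first_hits <- grep("^make ", bigrams, value = TRUE)
print(first_hits)   # "make america" "make a"

# " make$" : 'make' as the complete last word
last_hits <- grep(" make$", bigrams, value = TRUE)
print(last_hits)    # "to make"

# Without the trailing space the pattern also catches longer words
no_space <- grep("^make", bigrams, value = TRUE)
print(no_space)     # "make america" "make a" "makes sense"
```

Dropping the space lets "makes sense" through, which is why the space is recommended.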

Here we are considering all the bigrams with the first word as ‘make’.

firstword = "^make "
#lastword = " fake$"

#change to either first or last word here 
reduced_data = final_data[grep(firstword, final_data$Bigram), ]

Summarize the data

Using Prez as a and Not Prez as b, create the vectors for a, b, c, and d.

Below we look at the a, b, c, and d values for some of the bigrams with ‘make’ as the first word:

a = reduced_data$Prez
b = reduced_data$NotPrez
c = sum(reduced_data$Prez) - reduced_data$Prez
d = sum(reduced_data$NotPrez) - reduced_data$NotPrez
head(cbind(as.character(reduced_data$Bigram), a, b, c, d))

##                     a     b     c     d    
## [1,] "make 3"       "1"   "0"   "259" "604"
## [2,] "make a"       "44"  "32"  "216" "572"
## [3,] "make almost"  "1"   "0"   "259" "604"
## [4,] "make america" "103" "378" "157" "226"
## [5,] "make an"      "2"   "2"   "258" "602"
## [6,] "make and"     "1"   "1"   "259" "603"
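Each row above is a flattened 2x2 contingency table. As a sketch, the "make america" row can be laid out explicitly using the counts printed above; the object name `tab` and the dimension labels are illustrative.

```r
# Counts for "make america" taken from the printed row: a, b, c, d
tab <- matrix(c(103, 378,    # a, b: "make america" in Prez / Not Prez
                157, 226),   # c, d: all other 'make' bigrams in Prez / Not Prez
              nrow = 2,
              dimnames = list(period = c("Prez", "NotPrez"),
                              bigram = c("make america", "other make bigrams")))
print(addmargins(tab))

# Row totals are the number of 'make' bigrams in each period
print(rowSums(tab))   # Prez 260, NotPrez 604
```

The row totals are the same for every bigram, which is why c and d are computed as the period total minus a or b.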

Calculate aExp

Calculate the expected value of a for each bigram.

Below we look at the a, b, c, and d values for some of the bigrams with ‘make’ as the first word, along with their aExp values. aExp is the frequency we would expect for the bigram in the Prez period if its occurrences were split between the two periods in proportion to each period's share of all ‘make’ bigrams.

aExp = (a + b) * (a + c) / (a + b + c + d)
head(cbind(as.character(reduced_data$Bigram), a, aExp, b, c, d))

##                     a     aExp                b     c     d    
## [1,] "make 3"       "1"   "0.300925925925926" "0"   "259" "604"
## [2,] "make a"       "44"  "22.8703703703704"  "32"  "216" "572"
## [3,] "make almost"  "1"   "0.300925925925926" "0"   "259" "604"
## [4,] "make america" "103" "144.74537037037"   "378" "157" "226"
## [5,] "make an"      "2"   "1.2037037037037"   "2"   "258" "602"
## [6,] "make and"     "1"   "0.601851851851852" "1"   "259" "603"

make 3 occurs more often when president than expected
make a occurs more often when president than expected
make almost occurs more often when president than expected
make america occurs less often when president than expected
make an occurs more often when president than expected
make and occurs more often when president than expected
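As a worked check of the formula, the expected value for "make a" can be recomputed by hand from the printed counts. This is a sketch; `c_rest` and `d_rest` are illustrative names standing in for c and d.

```r
# Counts for "make a" from the table above
a <- 44; b <- 32; c_rest <- 216; d_rest <- 572

# Share of all 'make' bigrams that fall in the Prez period
prez_share <- (a + c_rest) / (a + b + c_rest + d_rest)   # 260 / 864

# Expected Prez count: total occurrences of the bigram times the Prez share
a_exp <- (a + b) * prez_share
print(round(a_exp, 2))   # 22.87

# Observed 44 is above the expectation, so "make a" is over-represented
print(a > a_exp)         # TRUE
```

About 30% of all ‘make’ bigrams come from the Prez period, so a bigram with 76 occurrences in total would be expected about 23 times there.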

logPF

Calculate the log p-values from the Fisher test.

Below we look at the a, b, c, and d values for some of the bigrams with ‘make’ as the first word, along with their aExp and logpvF values. logpvF is positive when a is greater than expected and negative when a is less than expected, as with ‘make america’ when president. The larger its absolute value, the stronger the evidence that the bigram is distinctive for one period.

pvF = pv.Fisher.collostr(a, b, c, d)
# Convert to effect size measure
logpvF = ifelse(a < aExp, log10(pvF), -log10(pvF))
head(cbind(as.character(reduced_data$Bigram), a, aExp, b, c, d, logpvF))

##                     a     aExp                b     c     d     logpvF
## [1,] "make 3"       "1"   "0.300925925925926" "0"   "259" "604" "0.521540394508075"
## [2,] "make a"       "44"  "22.8703703703704"  "32"  "216" "572" "6.83505620083176"
## [3,] "make almost"  "1"   "0.300925925925926" "0"   "259" "604" "0.521540394508075"
## [4,] "make america" "103" "144.74537037037"   "378" "157" "226" "-9.14445123335334"
## [5,] "make an"      "2"   "1.2037037037037"   "2"   "258" "602" "0.230659049802421"
## [6,] "make and"     "1"   "0.601851851851852" "1"   "259" "603" "0.291121076380002"
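For a single bigram, the signed score can be sketched with base R's `fisher.test`. This assumes a two-sided test on the 2x2 table, which is consistent with the values printed above but is not taken from the Rling source. The helper `signed_log_p` is hypothetical.

```r
# Hypothetical helper: signed log10 p-value for one bigram (two-sided test assumed)
signed_log_p <- function(a, b, c_rest, d_rest) {
  a_exp <- (a + b) * (a + c_rest) / (a + b + c_rest + d_rest)
  p <- fisher.test(matrix(c(a, b, c_rest, d_rest), nrow = 2))$p.value
  # Negative when the bigram is under-represented in the Prez period
  if (a < a_exp) log10(p) else -log10(p)
}

# "make and": slightly more frequent than expected, weak evidence
print(signed_log_p(1, 1, 259, 603))

# "make america": far less frequent than expected when president
print(signed_log_p(103, 378, 157, 226))
```

The sign carries the direction of the difference and the magnitude carries the strength of the evidence, so a single column can be sorted to find the most distinctive bigrams for either period.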

Calculate the top scores

Create the top 10 bigrams for Prez (positive scores) and for Not Prez (negative scores).

The top bigrams when president are below:

reduced_data$logp = logpvF
reduced_data = reduced_data[order(-reduced_data$logp),]
top_prez = reduced_data$Bigram[1:10]
head(reduced_data,10)

as.character(top_prez)

##  [1] "make a"          "make up"         "make california"
##  [4] "make case"       "make freedom"    "make much"      
##  [7] "make or"         "make safety"     "make us"        
## [10] "make his"

The top bigrams when not president are below:

reduced_data = reduced_data[order(reduced_data$logp),]
top_notprez = reduced_data$Bigram[1:10]
head(reduced_data,10)

as.character(top_notprez)

##  [1] "make america"  "make this"     "make donald"   "make them"    
##  [5] "make it"       "make americaâ" "make anything" "make your"    
##  [9] "make our"      "make any"

Interpreting the numbers

What can you gather from the different top scores for Not Prez and Prez? Did he appear to change his style once inaugurated?

The results suggest that Trump’s use of ‘make’ in his tweets changed noticeably once he became president. Before taking office, his tweets featured many promise-oriented make bigrams, such as “make america”, “make this”, “make donald”, “make them”, “make it”, “make anything”, “make your”, and “make our”. Each of these occurred more often in the Not Prez period than expected given the share of all make bigrams that falls in each period. After he took office, the distinctive make bigrams no longer point to any particular theme (he uses ‘make’ in its ordinary sense): bigrams such as “make a”, “make up”, and “make california” occur more often in the Prez period than expected.