Is there a difference in Donald Trump's word usage on Twitter during his run for president versus while he is president?
Install the tokenizers package if you do not have it.
master = read.csv("trump_tweets.csv", stringsAsFactors = F)
master = na.omit(master)
colnames(master)[1] = "time"
prez = subset(master, time == "prez")
notprez = subset(master, time == "not prez")
ptemp = tokenizers::tokenize_ngrams(as.character(prez$text),
lowercase = T,
n = 2)
pdata = as.data.frame(table(unlist(ptemp)))
nptemp = tokenizers::tokenize_ngrams(as.character(notprez$text),
lowercase = T,
n = 2)
npdata = as.data.frame(table(unlist(nptemp)))
final_data = merge(pdata, npdata, by = "Var1", all = TRUE)
colnames(final_data) = c("Bigram", "Prez", "NotPrez")
Load all the libraries or functions that you will use for the rest of the assignment. It is helpful to define your libraries and functions at the top of a report, so that others know what they need for the report to compile correctly.
Remember that the NA values just represent options that didn’t occur in one time point versus the other. Fill those NA values in with zero.
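The report does not show the code for this step. A minimal sketch, using a small made-up data frame in place of the merged final_data (the counts are invented for illustration), might look like this:

```r
# Toy stand-in for the merged table; these counts are invented for illustration.
final_data = data.frame(
  Bigram = c("make a", "make america", "make up"),
  Prez = c(44, 103, NA),
  NotPrez = c(32, NA, 5),
  stringsAsFactors = FALSE
)

# A bigram that is missing from one time period was used zero times in it.
final_data$Prez[is.na(final_data$Prez)] = 0
final_data$NotPrez[is.na(final_data$NotPrez)] = 0
final_data
```

Filling the two count columns separately leaves the Bigram column untouched.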
Use the grep function to select the bigrams to examine.
^ indicates that it should be the start of the bigram.
$ indicates that it should be the last word of the bigram.
For example, with the word make, the first-word pattern would be “^make” and the last-word pattern would be “ make$”.
Here we are considering all the bigrams with the first word as ‘make’.
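The selection code that produces reduced_data is not shown in the report. As a sketch of how the ^ and $ anchors behave with grep, the example below filters a toy data frame (invented bigrams and counts). The trailing space in "^make " is an assumption added here so that words such as "makes" are excluded; it is not taken from the report.

```r
# Toy stand-in for final_data; the bigrams and counts are invented.
final_data = data.frame(
  Bigram = c("make a", "makes sense", "to make", "make america", "we make"),
  Prez = c(44, 3, 10, 103, 2),
  NotPrez = c(32, 1, 12, 378, 0),
  stringsAsFactors = FALSE
)

# "^make " keeps bigrams whose first word is exactly "make";
# the trailing space excludes longer words such as "makes".
first_word = final_data[grep("^make ", final_data$Bigram), ]

# " make$" keeps bigrams whose last word is exactly "make".
last_word = final_data[grep(" make$", final_data$Bigram), ]

first_word$Bigram
last_word$Bigram
```

With this toy data, first_word keeps "make a" and "make america", while last_word keeps "to make" and "we make".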
Using Not Prez as a and Prez as b, create the vectors for a, b, c, and d.
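Below are the a, b, c, and d values for some of the bigrams with ‘make’ as the first word. Note that the code assigns the Prez counts to a and the Not Prez counts to b, the reverse of the labeling in the prompt, so positive scores later on point to Prez and negative scores to Not Prez.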
a = reduced_data$Prez
b = reduced_data$NotPrez
c = sum(reduced_data$Prez) - reduced_data$Prez
d = sum(reduced_data$NotPrez) - reduced_data$NotPrez
head(cbind(as.character(reduced_data$Bigram), a, b, c, d))
##                     a     b     c     d
## [1,] "make 3"       "1"   "0"   "259" "604"
## [2,] "make a"       "44"  "32"  "216" "572"
## [3,] "make almost"  "1"   "0"   "259" "604"
## [4,] "make america" "103" "378" "157" "226"
## [5,] "make an"      "2"   "2"   "258" "602"
## [6,] "make and"     "1"   "1"   "259" "603"
Calculate the expected value of a for each bigram.
Below are the a, b, c, and d values for some of the bigrams with ‘make’ as the first word, along with their aExp values. aExp is the expected frequency of the bigram while president, given the bigram's overall frequency and the total number of ‘make’ bigrams while president and while not president.
aExp = (a + b) * (a + c) / (a + b + c + d)
head(cbind(as.character(reduced_data$Bigram), a, aExp, b, c, d))
##                     a     aExp                b     c     d
## [1,] "make 3"       "1"   "0.300925925925926" "0"   "259" "604"
## [2,] "make a"       "44"  "22.8703703703704"  "32"  "216" "572"
## [3,] "make almost"  "1"   "0.300925925925926" "0"   "259" "604"
## [4,] "make america" "103" "144.74537037037"   "378" "157" "226"
## [5,] "make an"      "2"   "1.2037037037037"   "2"   "258" "602"
## [6,] "make and"     "1"   "0.601851851851852" "1"   "259" "603"
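As a worked check of the aExp formula, the sketch below recomputes the expected value for a single bigram, "make a", using the counts printed in the table above.

```r
# Counts for "make a", taken from the table above.
a = 44   # "make a" while president
b = 32   # "make a" while not president
c = 216  # all other "make" bigrams while president
d = 572  # all other "make" bigrams while not president

# Expected a = (total for this bigram) * (total while president) / (grand total)
aExp = (a + b) * (a + c) / (a + b + c + d)
aExp  # about 22.87, matching the aExp column for "make a"
```

Here 76 uses of "make a" and 260 ‘make’ bigrams while president, out of 864 in total, give an expected count of about 22.87, so the observed 44 is well above expectation.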
Calculate the log p-values from the Fisher test.
Below are the a, b, c, and d values for some of the bigrams with ‘make’ as the first word, along with their aExp and logpvF values. logpvF is positive when a is greater than expected and negative when a is less than expected, as with ‘make america’ while president (a = 103 against an expected value of about 144.7).
pvF = pv.Fisher.collostr(a, b, c, d)
# Convert to effect size measure
logpvF = ifelse(a < aExp, log10(pvF), -log10(pvF))
head(cbind(as.character(reduced_data$Bigram), a, aExp, b, c, d, logpvF))
##                     a     aExp                b     c     d
## [1,] "make 3"       "1"   "0.300925925925926" "0"   "259" "604"
## [2,] "make a"       "44"  "22.8703703703704"  "32"  "216" "572"
## [3,] "make almost"  "1"   "0.300925925925926" "0"   "259" "604"
## [4,] "make america" "103" "144.74537037037"   "378" "157" "226"
## [5,] "make an"      "2"   "1.2037037037037"   "2"   "258" "602"
## [6,] "make and"     "1"   "0.601851851851852" "1"   "259" "603"
##      logpvF
## [1,] "0.521540394508075"
## [2,] "6.83505620083176"
## [3,] "0.521540394508075"
## [4,] "-9.14445123335334"
## [5,] "0.230659049802421"
## [6,] "0.291121076380002"
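The function pv.Fisher.collostr is not defined in the report, so it presumably comes from a package loaded earlier. As a rough sketch of the same idea, assuming a two-sided Fisher exact test on the 2x2 table of a, b, c, and d, base R's fisher.test can be applied to one bigram at a time. The helper name fisher_p is hypothetical, and the counts are the ones printed above for "make 3" and "make america".

```r
# Hypothetical helper: two-sided Fisher exact p-value for one 2x2 table.
# Assumption: this mirrors what pv.Fisher.collostr computes.
fisher_p = function(n_a, n_b, n_c, n_d) {
  tab = matrix(c(n_a, n_b, n_c, n_d), nrow = 2, byrow = TRUE)
  fisher.test(tab, alternative = "two.sided")$p.value
}

# Counts for "make 3" and "make america", taken from the table above.
counts = data.frame(
  a = c(1, 103), b = c(0, 378), c = c(259, 157), d = c(604, 226),
  row.names = c("make 3", "make america")
)

# One p-value per bigram, plus the expected value of a for each.
p = mapply(fisher_p, counts$a, counts$b, counts$c, counts$d)
aExp = (counts$a + counts$b) * (counts$a + counts$c) / rowSums(counts)

# Sign the log p-value: negative when a falls below its expected value.
logp = ifelse(counts$a < aExp, log10(p), -log10(p))
names(logp) = rownames(counts)
logp
```

For "make 3" this gives about 0.52, in line with the logpvF column, and for "make america" the score is negative because a is below aExp. Whether pv.Fisher.collostr uses exactly these test settings is an assumption.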
Create the top 10 bigrams for Not Prez (positive scores) and for Prez (negative scores).
The top bigrams when president are below:
reduced_data$logp = logpvF
reduced_data = reduced_data[order(-reduced_data$logp),]
top_prez = reduced_data$Bigram[1:10]
head(reduced_data,10)
##  [1] "make a"          "make up"         "make california"
##  [4] "make case"       "make freedom"    "make much"
##  [7] "make or"         "make safety"     "make us"
## [10] "make his"
The top bigrams when not president are below:
reduced_data = reduced_data[order(reduced_data$logp),]
top_notprez = reduced_data$Bigram[1:10]
head(reduced_data,10)
## [1] "make america"  "make this"     "make donald"   "make them"
## [5] "make it"       "make americaâ" "make anything" "make your"
## [9] "make our"      "make any"
What can you gather from the different top scores for Not Prez and Prez? Did he appear to change his style once inaugurated?
Based on the results above, Trump’s use of ‘make’ in his tweets changed noticeably after he became president. Before taking office, his ‘make’ bigrams were focused on promises, such as “make america”, “make this”, “make donald”, “make them”, “make it”, “make anything”, “make your”, and “make our”, each of which occurred more often in that period than expected given the overall totals of ‘make’ bigrams while president and while not president. After he took office, the bigrams that occur more often than expected, such as “make a”, “make up”, and “make california”, do not point to any particular theme; he is using ‘make’ in its ordinary sense.