Introduction

On this page I report on some of the exploratory analysis I have done so far for my capstone project. The project's ultimate goal is a text predictor: a user types a few words, and my Shiny app predicts what will come next. This report focuses on Twitter data, but the app may eventually draw on blog and news data as well.

Downloading the Data and Necessary Packages

I start by downloading the data and loading several packages (the code and output for installing the packages are not shown here).

library(R.utils) ## provides countLines()
library(LaF)     ## provides sample_lines()

set.seed(989)
noLinesTwitter <- countLines("~/Desktop/final/en_US/en_US.twitter.txt") ## total tweets in the raw file
newTwitter <- sample_lines("en_US.twitter.txt", noLinesTwitter/20)      ## random 5% sample of tweets
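
If R.utils or LaF were unavailable, the same 5% sample could be drawn with base R alone; this is a sketch that assumes the file fits comfortably in memory:

allTweets <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
newTwitter <- sample(allTweets, floor(length(allTweets)/20)) ## 5% of the lines, at random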

Preprocessing Data

Next, I clean up the data to make it easier to work with. I end up with newTwitter, a character vector with one element per tweet, and tokTwitter, a list of character vectors in which each tweet is its own vector and each word within the tweet is its own element. These can be used for different purposes.

library(tm)         ## removePunctuation(), stripWhitespace(), removeNumbers(), stemDocument()
library(tokenizers) ## tokenize_tweets()

newTwitter <- removePunctuation(newTwitter) ## strip punctuation
newTwitter <- stripWhitespace(newTwitter)   ## collapse repeated whitespace
newTwitter <- removeNumbers(newTwitter)     ## drop digits
newTwitter <- tolower(newTwitter)           ## lowercase everything
newTwitter <- stemDocument(newTwitter)      ## stem each word
tokTwitter <- tokenize_tweets(newTwitter)   ## one character vector of tokens per tweet
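
To show what each cleaning step does, here is the pipeline applied to a single made-up tweet (an illustration only; this tweet is not from the data):

ex <- "RT @user: Loving these 2 movies!!!  #film"
ex <- removePunctuation(ex) ## drops "@", ":", "!", and "#"
ex <- stripWhitespace(ex)   ## collapses the repeated spaces
ex <- removeNumbers(ex)     ## drops the "2" (leaving a stray space behind)
ex <- tolower(ex)
stemDocument(ex)            ## stems each word, e.g. "loving" becomes "love"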

Stopwords

I want the words from all these tweets without profanity and stopwords, so I do a few things here. First, I define a character vector containing the profane words and stopwords I want to remove. Then I make an extremely long character vector called "LongRWtokTwitter" that contains all of the original words, with my stop words blanked out. I don't use it again here, but I will keep it in case it is useful later. I eventually added some stopwords to my list based on common words that showed up in the results but should not have.

myStopWords <- c(profanity_arr_bad, stopwords_en) ## profanity list plus English stopwords
LongRWtokTwitter <- removeWords(unlist(tokTwitter), myStopWords) ## all tokens, with stopwords blanked out

Creating a Useful List

What I want next is a list version of newTwitter with the stop words removed, which I obtain using the magicfor package from CRAN.

library(magicfor) ## provides magic_for() and magic_result()

magic_for(print, silent = TRUE)
for(i in seq_along(newTwitter)){
  this <- removeWords(newTwitter[i], myStopWords) ## strip stopwords/profanity from tweet i
  print(this)                                     ## captured by magicfor rather than printed
}
magicTwitter <- magic_result() ## list of everything passed to print()
magicTwitter <- magicTwitter$this
head(magicTwitter)
## [[1]]
## [1] "isnt  funni     kidyou want   grown  now   grownyou wish    kid"
## 
## [[2]]
## [1] "  just   appl store   divid  hm  hot dog   stick"
## 
## [[3]]
## [1] "ah  local station  play faster  within temptat  like   lot"
## 
## [[4]]
## [1] "  want  wonder weekend  end tantrum throw commenc"
## 
## [[5]]
## [1] "great piecenic    boss didnt expect respons       understand"
## 
## [[6]]
## [1] "   new year plan  resolut"

Exploratory Plotting

Next, I want to explore the data to see which words are most frequent and how much of the corpus as a whole they account for. I create a document-feature matrix to help make my plots.

library(quanteda)
dfmTwitter <- dfm(as.character(magicTwitter)) ## document-feature matrix (note: newer quanteda releases require tokens() before dfm())
dfmTwitter <- dfmTwitter[, -c(1:3, 5, 11)]    ## drop stray single-character features
dfmTotals <- sort(colSums(dfmTwitter), decreasing = TRUE) ## number of times each word appears
plot(dfmTotals[1:10]) ## counts of the 10 most frequent words

plot(dfmTotals[1:100]) ## top 100

plot(dfmTotals[1:10]/sum(dfmTotals)) ## same, as proportions of the whole corpus

plot(dfmTotals[1:100]/sum(dfmTotals))
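
quanteda's topfeatures() reports the same ranked counts directly; a quick sketch:

topfeatures(dfmTwitter, 10) ## the ten most frequent words and their counts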

Cumulative Graphs

I next want to look at the cumulative effect. What percentage of all words do the 10 most common words account for, for example? NOTE: I could have done this with or without stopwords included; I only show one version here.

magic_for(print, silent = TRUE)
num <- list()
for(i in 1:1000){
  ## proportion of the corpus covered by the i most frequent words
  index <- as.numeric(sum(dfmTotals[1:i])/sum(dfmTotals))
  num <- append(num, c(index))
  print(num) ## at step i, num holds the first i cumulative proportions
}
magicPlot <- magic_result()
magicPlot <- magicPlot$num
plot(unlist(magicPlot[[10]])) ## cumulative coverage of the top 10 words

plot(unlist(magicPlot[[100]])) ## top 100

plot(unlist(magicPlot[[1000]])) ## top 1000
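
The same cumulative curves can be had in one line with cumsum(), avoiding the list that grows inside the loop; a sketch:

cumProp <- cumsum(dfmTotals)/sum(dfmTotals) ## coverage of the top n words, for every n
plot(cumProp[1:100], type = "l", xlab = "n most frequent words", ylab = "cumulative proportion")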

Analyzing 2-Grams and 3-Grams

Things get a little more complicated here. I want to find all 2-grams and 3-grams in my data and analyze them much as I analyzed the 1-grams. I organize the tweets by length, sift out the ones that are too short for a given n, find the most common n-grams, and analyze their frequencies.

library(ngram) ## provides ngram() and get.phrasetable()

magic_for(print, silent = TRUE)
for(i in seq_along(tokTwitter)){
  l <- length(tokTwitter[[i]]) ## number of tokens in tweet i
  print(l)
}
magicLength <- magic_result_as_dataframe()
magicLength <- magicLength[, 2]        ## numeric vector of tweet lengths
forThreeGram <- which(magicLength > 2) ## tweets long enough to hold a 3-gram
ThreeGram <- ngram(newTwitter[forThreeGram], n = 3) ## 960588 3-grams
forTwoGram <- which(magicLength > 1)   ## tweets long enough to hold a 2-gram
TwoGram <- ngram(newTwitter[forTwoGram], n = 2) ## 536181 2-grams
print(ThreeGram, output = "truncated")
## abbey not near | 1 
## as {1} | 
## 
## night of hip | 1 
## hop {1} | 
## 
## api cant wait | 1 
## to {1} | 
## 
## you gonna repli | 1 
## with {1} | 
## 
## phone is the | 1 
## most {1} | 
## 
## [[ ... results truncated ... ]]
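
As an aside, base R's lengths() yields the same per-tweet token counts without a loop; an equivalent sketch:

magicLength <- lengths(tokTwitter) ## token count for each tweet
forThreeGram <- which(magicLength > 2)
forTwoGram <- which(magicLength > 1)
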
pt3 <- get.phrasetable(ThreeGram) ## data frame of 3-grams with freq and prop columns
head(pt3, 20)
##              ngrams freq         prop
## 1    thank for the  1228 0.0009960975
## 2  look forward to   597 0.0004842591
## 3     cant wait to   430 0.0003487964
## 4   for the follow   416 0.0003374402
## 5       i love you   411 0.0003333844
## 6        i want to   402 0.0003260840
## 7         go to be   396 0.0003212171
## 8    thank you for   383 0.0003106721
## 9        i need to   329 0.0002668698
## 10    have a great   327 0.0002652475
## 11         to be a   323 0.0002620029
## 12      to see you   307 0.0002490244
## 13        i have a   299 0.0002425351
## 14        im go to   299 0.0002425351
## 15        a lot of   295 0.0002392905
## 16       i have to   286 0.0002319901
## 17      one of the   261 0.0002117113
## 18     i dont know   251 0.0002035997
## 19      you have a   246 0.0001995440
## 20        is go to   240 0.0001946770
pt33 <- as.numeric(pt3[, 2]) ## ranked 3-gram frequencies
print(TwoGram, output = "truncated")
## insult word | 1 
## choic {1} | 
## 
## use chrome | 1 
## and {1} | 
## 
## one huge | 1 
## display {1} | 
## 
## the picnic | 1 
## knoll {1} | 
## 
## feat task | 1 
## to {1} | 
## 
## [[ ... results truncated ... ]]
pt2 <- get.phrasetable(TwoGram) ## data frame of 2-grams with freq and prop columns
head(pt2, 20)
##        ngrams freq        prop
## 1     in the  4069 0.003012412
## 2    for the  3704 0.002742190
## 3     of the  2874 0.002127715
## 4      go to  2436 0.001803449
## 5      to be  2408 0.001782720
## 6     on the  2400 0.001776797
## 7  thank for  2188 0.001619847
## 8     to the  2169 0.001605781
## 9     have a  1981 0.001466598
## 10    i love  1912 0.001415515
## 11    at the  1832 0.001356289
## 12   want to  1701 0.001259305
## 13    if you  1654 0.001224509
## 14    i have  1644 0.001217106
## 15 thank you  1598 0.001183051
## 16     for a  1475 0.001091990
## 17      i am  1467 0.001086067
## 18    i dont  1418 0.001049791
## 19    to get  1399 0.001035725
## 20    to see  1391 0.001029802
pt22 <- as.numeric(pt2[, 2]) ## ranked 2-gram frequencies
plot(pt22[1:10]) ## frequencies of the 10 most common 2-grams

plot(pt22[1:100]) ## top 100

plot(pt22[1:1000]) ## top 1000

magic_for(print, silent = TRUE)
for(i in 1:1000){
  ## proportion of all 2-grams covered by the i most common 2-grams
  index <- as.numeric(sum(pt2[1:i, 3])/sum(pt2[, 3]))
  print(index)
}
magicPlot2 <- magic_result()
magicPlot2 <- magicPlot2$index
plot(c(1:10), magicPlot2[1:10]) ## cumulative coverage of the top 10 2-grams

plot(c(1:100), magicPlot2[1:100]) ## top 100

plot(c(1:1000), magicPlot2[1:1000]) ## top 1000

plot(pt33[1:10]) ## frequencies of the 10 most common 3-grams

plot(pt33[1:100]) ## top 100

plot(pt33[1:1000]) ## top 1000

magic_for(print, silent = TRUE)
for(i in 1:1000){
  ## proportion of all 3-grams covered by the i most common 3-grams
  index <- as.numeric(sum(pt3[1:i, 3])/sum(pt3[, 3]))
  print(index)
}
magicPlot3 <- magic_result()
magicPlot3 <- magicPlot3$index
plot(c(1:10), magicPlot3[1:10]) ## cumulative coverage of the top 10 3-grams

plot(c(1:100), magicPlot3[1:100]) ## top 100

plot(c(1:1000), magicPlot3[1:1000]) ## top 1000

Foreign Language Words

I have so far not made much progress in determining what percentage of the words in these tweets are non-English, but here is what I have done so far: I combine Spanish and French stopwords into one vector (removing the entries that might also appear in English tweets), and I combine a standard English dictionary with a list of English internet slang.

forWords <- c(stopwords_es, stopwords_fr) ## Spanish and French stopwords
## drop entries that could also show up in English tweets
forWords <- removeWords(forWords, c("la","era","sin","seas","pour","est","hay", "y", "son", "las", "he", "ha", "con", "ante", "antes", "ton","ya","c","e","d","yo","s","n","sean","eras","sea","o","t","l","j","m"))
forWords <- unique(LongRWtokTwitter[which(LongRWtokTwitter %in% forWords)]) ## foreign stopwords actually seen in the tweets

ga <- lexicon::grady_augmented ## standard English word list
ga <- c(ga, lexicon::hash_internet_slang$x) ## add internet slang terms (the x column holds the abbreviations)
ga <- c(ga, "rt", "ah", "") ## a few extras, plus the empty string left behind by removeWords()
NonEngWords <- which(!(LongRWtokTwitter %in% ga)) ## indices of tokens not recognized as English
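
From here, a rough estimate of the non-English share of tokens (really an upper bound, since any word missing from the dictionary gets flagged) would be:

length(NonEngWords)/length(LongRWtokTwitter) ## fraction of tokens not recognized as English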

Making My First Prediction

Now I can predict a word based on the previous two words (I don't yet know how to combine 2-grams, 3-grams, etc. into one prediction model).

## Find all tweets in which "in the" (the most common 2-gram) is followed by another
## word, and predict the next word from what has already followed "in the" in the
## corpus. Note: "in the" could be replaced by whatever the user types into the Shiny app.
inThe <- grep("in the +[^ ]+", newTwitter) ## indices of tweets matching "in the <word>"
inThe <- newTwitter[inThe]
inThe3Gram <- get.phrasetable(ngram(inThe, n = 3)) ## all 3-grams from those tweets
inTheToConsider2 <- inThe3Gram[grep("^in the ", inThe3Gram[, 1]), 1:2] ## keep 3-grams starting "in the"
column1 <- removeWords(inTheToConsider2[, 1], "in the ") ## strip the leading bigram
inTheToConsider2[, 1] <- column1
inTheToConsider2 <- inTheToConsider2[order(-inTheToConsider2[, 2]), ] ## sort by frequency
pickThis <- inTheToConsider2[1, 1] ## the most frequent completion
print(paste(c("in the ", pickThis), collapse = ""))
## [1] "in the world "
plot(as.factor(inTheToConsider2[1:10, 1]), inTheToConsider2[1:10, 2]) ## top 10 completions, raw counts

inTheToConsider2$prop <- inTheToConsider2$freq/sum(inTheToConsider2$freq)
## Plot the same top 10 completions as proportions
plot(as.factor(inTheToConsider2[1:10, 1]), inTheToConsider2[1:10, 3])
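
To generalize this lookup to arbitrary user input, the steps above could be wrapped in a function. The sketch below is my own illustration (the name predictNext and its internals are not part of the app yet), and it assumes the typed bigram actually occurs in the corpus:

predictNext <- function(bigram, corpus) {
  hits <- grep(paste0(bigram, " +[^ ]+"), corpus, value = TRUE) ## tweets containing "bigram <word>"
  if (length(hits) == 0) return(NA_character_)                  ## bigram never seen
  pt <- get.phrasetable(ngram(hits, n = 3))                     ## 3-gram frequency table
  pt <- pt[grep(paste0("^", bigram, " "), pt$ngrams), ]         ## 3-grams starting with the bigram
  completion <- sub(paste0("^", bigram, " +"), "", pt$ngrams[which.max(pt$freq)])
  paste(bigram, trimws(completion))                             ## append the likeliest next word
}

predictNext("in the", newTwitter) ## with this corpus, should again give "in the world"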