On this page I report on some of the exploratory analysis I have done so far for my capstone project. The project's ultimate goal is a text predictor: a person types in words and my Shiny app predicts what will come next. This report focuses on Twitter data, but eventually the app may draw on data from blogs and news sources as well.
I start by downloading the data and loading several packages (code and output for the packages are not shown here).
set.seed(989)
## countLines() is from the R.utils package; sample_lines() is from LaF
noLinesTwitter <- countLines("en_US.twitter.txt")
## keep a 5% sample of the tweets to make the data manageable
newTwitter <- sample_lines("en_US.twitter.txt", noLinesTwitter/20)
Next, I clean up the data to make it easier to work with. I end up with two objects: "newTwitter", a character vector with one element per tweet, and "tokTwitter", a list of character vectors in which each tweet is its own vector and each word within the tweet is its own element. These can be used for different purposes.
## removePunctuation(), stripWhitespace(), removeNumbers(), and stemDocument()
## are from the tm package; tokenize_tweets() is from tokenizers
newTwitter <- removePunctuation(newTwitter)
newTwitter <- stripWhitespace(newTwitter)
newTwitter <- removeNumbers(newTwitter)
newTwitter <- tolower(newTwitter)
newTwitter <- stemDocument(newTwitter)  ## reduce words to their stems
tokTwitter <- tokenize_tweets(newTwitter)  ## split each tweet into word tokens
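To see the two representations side by side, I can look at the first tweet in each:
newTwitter[1]    ## one cleaned tweet as a single string
tokTwitter[[1]]  ## the same tweet split into a vector of word tokens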
I want the words from all these tweets without profanity and stopwords, so I do two things here. First, I define a character vector containing the profane words and stopwords I want to remove. Then I make an extremely long character vector called "LongRWtokTwitter" that contains all of the original words minus those stopwords. I don't use it again here, but will keep it in case it's useful later. (I eventually added some stopwords to my list, based on common words that showed up in the results but shouldn't have.)
## profanity_arr_bad is from the lexicon package; stopwords_en is assumed to be
## a character vector of English stopwords loaded earlier (package not shown)
myStopWords <- c(profanity_arr_bad, stopwords_en)
## removeWords() is from tm
LongRWtokTwitter <- removeWords(unlist(tokTwitter), myStopWords)
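One wrinkle: removeWords() replaces each matched word with an empty string rather than dropping it, so LongRWtokTwitter still contains blanks. If I ever need a blank-free vector, it is one filter away (nonBlankTokens is just a placeholder name, not used elsewhere):
## drop the empty strings that removeWords() leaves behind
nonBlankTokens <- LongRWtokTwitter[LongRWtokTwitter != ""]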
What I want next is a list version of newTwitter with the stopwords removed, which I obtain using the "magicfor" package from CRAN.
magic_for(print, silent = TRUE)  ## magic_for() is from the magicfor package
for(i in seq_along(newTwitter)){  ## 118007 tweets in the sample
    this <- removeWords(newTwitter[i], myStopWords)
    print(this)  ## magicfor captures each printed value
}
magicTwitter <- magic_result()  ## list of everything captured above
magicTwitter <- magicTwitter$this
head(magicTwitter)
## [[1]]
## [1] "isnt funni kidyou want grown now grownyou wish kid"
##
## [[2]]
## [1] " just appl store divid hm hot dog stick"
##
## [[3]]
## [1] "ah local station play faster within temptat like lot"
##
## [[4]]
## [1] " want wonder weekend end tantrum throw commenc"
##
## [[5]]
## [1] "great piecenic boss didnt expect respons understand"
##
## [[6]]
## [1] " new year plan resolut"
Next I want to explore my data to see which words are most frequent and how their frequencies compare to the corpus as a whole. I create a document-feature matrix to help make my plots.
library(quanteda)
dfmTwitter <- dfm(as.character(magicTwitter))  ## newer quanteda versions want dfm(tokens(...))
dfmTwitter <- dfmTwitter[, -c(1:3, 5, 11)]  ## drop stray single-character features (indices found by inspection)
dfmTotals <- sort(colSums(dfmTwitter), decreasing = TRUE)  ## total count for each word
plot(dfmTotals[1:10])  ## counts of the 10 most frequent words
plot(dfmTotals[1:100])  ## counts of the 100 most frequent words
plot(dfmTotals[1:10]/sum(dfmTotals))  ## same, as proportions of all tokens
plot(dfmTotals[1:100]/sum(dfmTotals))
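quanteda's topfeatures() reports the same counts directly, which makes a handy cross-check on these plots:
topfeatures(dfmTwitter, 10)  ## the 10 most frequent words and their counts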
I next want to look at the cumulative effect: what percentage of the total words is made up of the 10 most common words, for example? NOTE: I could have done this with or without stopwords included; I do not show both here.
magic_for(print, silent = TRUE)
num <- list()
for(i in 1:1000){
    ## cumulative proportion of all tokens covered by the top i words
    index <- as.numeric(sum(dfmTotals[1:i])/sum(dfmTotals))
    num <- append(num, c(index))
    print(num)  ## capture a snapshot of the growing list at every step
}
magicPlot <- magic_result()
magicPlot <- magicPlot$num
plot(unlist(magicPlot[[10]]))  ## coverage by the top 10 words
plot(unlist(magicPlot[[100]]))  ## ... by the top 100
plot(unlist(magicPlot[[1000]]))  ## ... by the top 1000
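The same cumulative curve can also be computed in one vectorized line with cumsum(), a useful sanity check on the loop above:
## share of all tokens covered by the top 1, 2, ..., 1000 words
plot(cumsum(dfmTotals[1:1000]) / sum(dfmTotals))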
Things get a little more complicated here. I want to find all 2-grams and 3-grams in my data and analyze them the way I analyzed the 1-grams. I record each tweet's length, sift out the tweets that are too short, find the most common n-grams, and analyze their frequencies.
magic_for(print, silent = TRUE)
for(i in seq_along(tokTwitter)){
    l <- length(tokTwitter[[i]])  ## number of words in tweet i
    print(l)  ## magicfor only captures printed values
}
magicLength <- magic_result_as_dataframe()
magicLength <- magicLength[,2]  ## numeric vector of tweet lengths
## ngram() is from the ngram package; tweets with fewer than n words must be dropped
forThreeGram <- which(magicLength > 2)
ThreeGram <- ngram(newTwitter[forThreeGram], n=3)  ## 960588 3-grams
forTwoGram <- which(magicLength > 1)
TwoGram <- ngram(newTwitter[forTwoGram], n=2)  ## 536181 2-grams
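As an aside, base R's lengths() would give the same per-tweet word counts without a loop:
## one-line equivalent of the magicfor length loop above
magicLength <- lengths(tokTwitter)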
print(ThreeGram,output="truncated")
## abbey not near | 1
## as {1} |
##
## night of hip | 1
## hop {1} |
##
## api cant wait | 1
## to {1} |
##
## you gonna repli | 1
## with {1} |
##
## phone is the | 1
## most {1} |
##
## [[ ... results truncated ... ]]
pt3 <- get.phrasetable(ThreeGram)  ## data frame of 3-grams, sorted by frequency
head(pt3, 20)
## ngrams freq prop
## 1 thank for the 1228 0.0009960975
## 2 look forward to 597 0.0004842591
## 3 cant wait to 430 0.0003487964
## 4 for the follow 416 0.0003374402
## 5 i love you 411 0.0003333844
## 6 i want to 402 0.0003260840
## 7 go to be 396 0.0003212171
## 8 thank you for 383 0.0003106721
## 9 i need to 329 0.0002668698
## 10 have a great 327 0.0002652475
## 11 to be a 323 0.0002620029
## 12 to see you 307 0.0002490244
## 13 i have a 299 0.0002425351
## 14 im go to 299 0.0002425351
## 15 a lot of 295 0.0002392905
## 16 i have to 286 0.0002319901
## 17 one of the 261 0.0002117113
## 18 i dont know 251 0.0002035997
## 19 you have a 246 0.0001995440
## 20 is go to 240 0.0001946770
pt33 <- as.numeric(pt3[,2])  ## just the frequency column
print(TwoGram,output="truncated")
## insult word | 1
## choic {1} |
##
## use chrome | 1
## and {1} |
##
## one huge | 1
## display {1} |
##
## the picnic | 1
## knoll {1} |
##
## feat task | 1
## to {1} |
##
## [[ ... results truncated ... ]]
pt2 <- get.phrasetable(TwoGram)  ## data frame of 2-grams, sorted by frequency
head(pt2, 20)
## ngrams freq prop
## 1 in the 4069 0.003012412
## 2 for the 3704 0.002742190
## 3 of the 2874 0.002127715
## 4 go to 2436 0.001803449
## 5 to be 2408 0.001782720
## 6 on the 2400 0.001776797
## 7 thank for 2188 0.001619847
## 8 to the 2169 0.001605781
## 9 have a 1981 0.001466598
## 10 i love 1912 0.001415515
## 11 at the 1832 0.001356289
## 12 want to 1701 0.001259305
## 13 if you 1654 0.001224509
## 14 i have 1644 0.001217106
## 15 thank you 1598 0.001183051
## 16 for a 1475 0.001091990
## 17 i am 1467 0.001086067
## 18 i dont 1418 0.001049791
## 19 to get 1399 0.001035725
## 20 to see 1391 0.001029802
pt22 <- as.numeric(pt2[,2])  ## just the frequency column
## frequency plots for the most common 2-grams
plot(pt22[1:10])
plot(pt22[1:100])
plot(pt22[1:1000])
magic_for(print, silent = TRUE)
for(i in 1:1000){
    ## cumulative share of all 2-grams covered by the top i 2-grams
    index <- as.numeric(sum(pt2[1:i,3])/sum(pt2[,3]))
    print(index)
}
magicPlot2 <- magic_result()
magicPlot2 <- magicPlot2$index
plot(c(1:10), magicPlot2[1:10])
plot(c(1:100), magicPlot2[1:100])
plot(c(1:1000), magicPlot2[1:1000])
## same frequency and coverage plots for the 3-grams
plot(pt33[1:10])
plot(pt33[1:100])
plot(pt33[1:1000])
magic_for(print, silent = TRUE)
for(i in 1:1000){
    index <- as.numeric(sum(pt3[1:i,3])/sum(pt3[,3]))
    print(index)
}
magicPlot3 <- magic_result()
magicPlot3 <- magicPlot3$index
plot(c(1:10), magicPlot3[1:10])
plot(c(1:100), magicPlot3[1:100])
plot(c(1:1000), magicPlot3[1:1000])
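As with the 1-grams, cumsum() over the prop column of each phrasetable gives these coverage curves directly:
plot(cumsum(pt2$prop[1:1000]))  ## coverage by the top 1000 2-grams
plot(cumsum(pt3$prop[1:1000]))  ## coverage by the top 1000 3-grams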
I have not yet made much progress in finding what percentage of words in these tweets are non-English, but here is what I have done so far. I combine Spanish and French stopwords into one vector (removing the words that might also be found in English tweets), and I combine a standard English dictionary with a list of English internet slang.
## stopwords_es and stopwords_fr are assumed to be character vectors of Spanish
## and French stopwords loaded earlier (package not shown)
forWords <- c(stopwords_es, stopwords_fr)
## drop words that could also appear in English tweets
forWords <- removeWords(forWords, c("la","era","sin","seas","pour","est","hay", "y", "son", "las", "he", "ha", "con", "ante", "antes", "ton","ya","c","e","d","yo","s","n","sean","eras","sea","o","t","l","j","m"))
forWords <- unique(LongRWtokTwitter[which(LongRWtokTwitter %in% forWords)])
ga <- lexicon::grady_augmented  ## standard English word list
ga <- c(ga, lexicon::hash_internet_slang$x)  ## slang terms live in the x column
ga <- c(ga, "rt", "ah", "")  ## treat these, and the blanks left by removeWords(), as English
NonEngWords <- which(!(LongRWtokTwitter %in% ga))
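From here, a rough upper bound on the non-English share is one division away (rough because every token missing from ga, including misspellings and stemmed forms, gets counted as non-English):
## crude estimate of the fraction of tokens that are not recognized English
length(NonEngWords) / length(LongRWtokTwitter)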
Now I can make a prediction of a word based on the previous two words (I don't yet know how to combine 2-grams, 3-grams, etc. into one prediction model).
# I find all instances of "in the", the most common 2-gram, and predict what
# comes next based on what has already followed "in the " in my corpus.
# Note: "in the" could be replaced by whatever the user types in (the user
# input) when using the Shiny app.
inThe <- grep("in the +[^ ]+", newTwitter)  ## tweets where "in the" is followed by a word
inThe <- newTwitter[inThe]
inThe3Gram <- get.phrasetable(ngram(inThe, 3))
inTheToConsider2 <- inThe3Gram[grep("^in the ", inThe3Gram[,1]), 1:2]  ## 3-grams starting with "in the"
column1 <- removeWords(inTheToConsider2[,1], "in the ")  ## keep only the word that follows
inTheToConsider2[,1] = column1
inTheToConsider2 <- inTheToConsider2[order(-inTheToConsider2[,2]),]  ## order by frequency, most common first
pickThis <- inTheToConsider2[1,1]  ## the most frequent continuation
print(paste(c("in the ", pickThis), collapse=""))
## [1] "in the world "
plot(as.factor(inTheToConsider2[1:10,1]), inTheToConsider2[1:10,2])  ## top 10 continuations by count
inTheToConsider2$prop = inTheToConsider2$freq/sum(inTheToConsider2$freq)
## plot proportions instead of raw counts
plot(as.factor(inTheToConsider2[1:10,1]), inTheToConsider2[1:10,3])
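Since "in the" could be swapped for whatever the user types, the same steps can be wrapped into a reusable function for two-word inputs. Here is a rough, untested sketch (predictNext is just a placeholder name, not code I have run):
predictNext <- function(bigram, corpus = newTwitter) {
  ## tweets where the bigram is followed by at least one more word
  hits <- corpus[grep(paste0(bigram, " +[^ ]+"), corpus)]
  if (length(hits) == 0) return(NA_character_)
  pt <- get.phrasetable(ngram(hits, n = 3))
  ## keep only the 3-grams that start with the bigram
  pt <- pt[grep(paste0("^", bigram, " "), pt$ngrams), ]
  if (nrow(pt) == 0) return(NA_character_)
  ## phrasetables come back sorted by frequency, so take the top row
  ## and strip the leading bigram, leaving just the predicted word
  trimws(sub(paste0("^", bigram, " "), "", pt$ngrams[1]))
}
predictNext("in the")  ## should reproduce the "world" prediction above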