I began by examining the three files:
twit<-readLines('/Users/jamesharris/Documents/capstone/final/en_US/en_US.twitter.txt')
news<-readLines('/Users/jamesharris/Documents/capstone/final/en_US/en_US.news.txt')
blogs<-readLines('/Users/jamesharris/Documents/capstone/final/en_US/en_US.blogs.txt')
print(sprintf('length of Twitter file: %d',length(twit)))
## [1] "length of Twitter file: 2360148"
print(sprintf('length of News file: %d',length(news)))
## [1] "length of News file: 1010242"
print(sprintf('length of Blogs file: %d',length(blogs)))
## [1] "length of Blogs file: 899288"
print(length(blogs))
## [1] 899288
To get rough word counts, I used a simple regular expression to extract every continuous run of alphabetic characters (after converting to lowercase) and treated each run as a word.
library(stringr)
library(magrittr)  # makes the %>% pipe available explicitly
examineWords<-function(x, title){
  # Lowercase the text, then extract every run of letters as a "word"
  words<-str_extract_all(tolower(x), '[a-z]+') %>% unlist
  print(sprintf('Total words found: %d',length(words)))
  print(sprintf('Total unique words found: %d',length(unique(words))))
  # Tabulate word frequencies and sort from most to least common
  wtable<-table(words)
  wtable<-wtable[order(wtable, decreasing = TRUE)]
  print(paste('20 Most comon words in',title))
  print(head(wtable,20))
  barplot(head(wtable,20), main = title)
}
examineWords(twit, title='Twitter')
## [1] "Total words found: 30557097"
## [1] "Total unique words found: 302653"
## [1] "20 Most comon words in Twitter"
## words
##    the      i     to      a    you    and    for     it     in     of     is
## 937970 918857 788951 617518 601328 438744 385492 383744 380804 359757 358950
##      s     my     on   that      t     me     be     at   with
## 316954 292160 278287 271121 221789 203521 188044 186869 173528
examineWords(news, title='News')
## [1] "Total words found: 34616527"
## [1] "Total unique words found: 212227"
## [1] "20 Most comon words in News"
## words
##     the      to       a     and      of      in       s    that     for      it
## 1975163  906198  894899  889612  774525  679242  457198  371864  353967  286646
##      is      on    with      he    said     was      at       i      as     his
##  284268  269987  254857  254498  250435  228974  214225  195520  187645  157686
examineWords(blogs, title='Blogs')
## [1] "Total words found: 37880273"
## [1] "Total unique words found: 253041"
## [1] "20 Most comon words in Blogs"
## words
##     the     and      to       i       a      of      in      it    that      is
## 1860686 1094859 1069565  906917  904594  876848  598782  485385  484198  432768
##     for     you       s    with     was      on      my    this      as    have
##  363937  327904  326046  286771  278361  276621  270974  259189  224051  218953
Several parts of the data set still need cleaning, however. All three files appear to contain embedded URLs, as well as “@” used for mentions or email addresses. The news file seems to contain many special characters, either as escapes such as u0093 or hex-encoded, which need to be converted to plain-text characters. This is especially common where the single quote or apostrophe has been replaced with a “curly” apostrophe (’).
The simple extraction above also fails to handle words containing an apostrophe, such as “I’m”, which is why the fragments “s” and “t” rank among the most common words.
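The effect is easy to reproduce on a made-up sentence. The following is a minimal base-R sketch, not part of the original analysis: the sample text and the variable names are assumptions, and the second pattern is only one possible way to keep contractions together.

```r
# Toy input (hypothetical); the first pattern is the same '[a-z]+' used above
sample_text <- tolower("I'm sure it's fine, don't worry")

# regmatches() + gregexpr() is a base-R way to extract all matches
tokens <- unlist(regmatches(sample_text, gregexpr("[a-z]+", sample_text)))
print(tokens)

# Allowing one internal apostrophe keeps contractions in one piece
tokens_apos <- unlist(regmatches(sample_text,
                                 gregexpr("[a-z]+('[a-z]+)?", sample_text)))
print(tokens_apos)
```

The first call splits “I'm” into “i” and “m”, while the second returns it as a single token.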
I tried using tm and other packages to clean the text automatically, but I kept running into vector memory exhaustion errors. After some time on Stack Overflow and with the RStudio regex cheat sheet (https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf), I settled on a set of regular expressions, some of them Perl-compatible:
# Cleaning up the text files
cleanUp<-function(x){
  # remove URLs (http or https)
  x<-gsub("https?\\S*", "", x)
  # get rid of mentions / emails
  x<-gsub("@\\S*", "", x)
  # Replace curly apostrophes (’) and similar quote characters with an ASCII '
  x <- gsub("\xe2\x80\x99", "'", x, perl=TRUE)
  x <- gsub("\u0027|\u0060|\u0091|\u0092|\u0093|\u0094|\u2019", "'", x, perl=TRUE)
  # remove apostrophes that are not between two word characters
  x <- gsub("(?<!\\w)'|'(?!\\w)" , " ", x, perl=TRUE)
  x <- gsub("[^[:alpha:][:space:]']", " ", x) # remove all other punctuation marks
  # Collapse runs of whitespace into a single space
  x<-gsub("[[:space:]]{2,}",' ',x)
  return(x)
}
cleantwit<-cleanUp(twit)
cleannews<-cleanUp(news)
cleanblogs<-cleanUp(blogs)
#make a word list with all of the words
words<-str_extract_all(cleantwit, "[^[:space:]]+") %>% unlist
words<-c(words,str_extract_all(cleannews, "[^[:space:]]+") %>% unlist)
words<-c(words,str_extract_all(cleanblogs, "[^[:space:]]+") %>% unlist)
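A quick way to sanity-check apostrophe handling of the kind cleanUp performs is to run a lookaround pattern on a few toy strings. The sketch below is an illustration under my own assumptions rather than the exact cleaning code: the test strings are made up, and the pattern pairs one lookbehind with one lookahead so that only apostrophes with a letter on both sides survive.

```r
# Hypothetical test strings, not taken from the corpus
x <- c("don't stop", "'quoted' text", "rock 'n' roll")

# Replace an apostrophe with a space unless it has a letter on both sides:
#   (?<![[:alpha:]])'  matches an apostrophe NOT preceded by a letter
#   '(?![[:alpha:]])   matches an apostrophe NOT followed by a letter
no_stray <- gsub("(?<![[:alpha:]])'|'(?![[:alpha:]])", " ", x, perl = TRUE)

# Collapse the runs of spaces left behind and trim both ends
tidy <- trimws(gsub(" {2,}", " ", no_stray))
print(tidy)
```

The contraction keeps its apostrophe, while the quoting apostrophes around “quoted” and “n” are dropped.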
We will look at the new single-word frequencies:
library(data.table)
# Make a data table, count each word, and remove duplicates, once counted
dt<-data.table(words=words)
dt<-dt[,N:=.N, by=words]%>% unique()
dt<-dt[order(N, decreasing = TRUE)]
head(dt)
##    words       N
## 1:   the 4232485
## 2:    to 2724302
## 3:   and 2299062
## 4:     a 2287703
## 5:    of 1991547
## 6:    in 1549745
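The same kind of count can also be produced by aggregating directly, which avoids adding a column to every row and then de-duplicating. This is a minimal sketch on a made-up word vector (toy_words and freq are illustrative names, not objects from the analysis above):

```r
library(data.table)

# Made-up word vector standing in for the full corpus
toy_words <- c("the", "cat", "the", "dog", "the", "cat")

# .N counts the rows in each group, so this yields one row per distinct word,
# which is then sorted from most to least frequent
freq <- data.table(words = toy_words)[, .N, by = words][order(-N)]
print(freq)
```

On a corpus of this size the intermediate table stays at one row per distinct word, which may help with the memory pressure mentioned earlier.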
Now look at two-word pairs (bigrams):
dt<-data.table(word1=words[1:(length(words)-1)], word2=words[2:length(words)])
dt<-dt[,N:=.N, by=.(word1,word2)] %>% unique
dt<-dt[order(N, decreasing = TRUE)]
head(dt)
##    word1 word2      N
## 1:    of   the 423787
## 2:    in   the 384522
## 3:    to   the 208847
## 4:   for   the 191695
## 5:    on   the 185861
## 6:    to    be 159502
The same approach extends to three-word sequences (trigrams):
dt<-data.table(word1=words[1:(length(words)-2)], word2=words[2:(length(words)-1)], word3=words[3:length(words)])
dt<-dt[,N:=.N, by=.(word1,word2, word3)] %>% unique
dt<-dt[order(N, decreasing = TRUE)]
head(dt)
##     word1 word2 word3     N
## 1:    one    of   the 29027
## 2:      a   lot    of 27242
## 3:     to    be     a 17938
## 4:  going    to    be 16941
## 5: Thanks   for   the 15746
## 6:    the   end    of 14556
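One thing to keep in mind is that building n-grams from one long concatenated word vector also pairs the last word of one line with the first word of the next. A possible alternative, sketched here in base R with a hypothetical helper count_ngrams and made-up input lines, builds the n-grams line by line instead:

```r
# Hypothetical helper: count n-grams within each line, so that no n-gram
# spans the boundary between two lines
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, "[[:space:]]+"), function(tok) {
    tok <- tok[nzchar(tok)]                      # drop empty tokens
    if (length(tok) < n) return(character(0))    # line too short for an n-gram
    starts <- seq_len(length(tok) - n + 1)
    # Join each window of n consecutive tokens into one string
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Made-up example lines
toy_lines <- c("one of the best", "one of the few", "the end")
bigrams <- count_ngrams(toy_lines, 2)
print(bigrams)
```

Here “best one” never appears as a bigram, even though “best” ends the first line and “one” starts the second.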
Now that we have a set of frequency look-up tables, we can prune them to save memory and improve speed. Some issues remain, such as normalizing capitalization (note “Thanks” in the trigram table), removing profanity, and filtering out non-English words, but the beginnings of an efficient model are becoming apparent.
In the coming weeks, I will work on further cleaning the words and on trimming the models to the most probable pairs or triples of words. I may also use phoneme conversion as an added predictor.
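As a rough idea of what that trimming could look like, the sketch below keeps only the most frequent continuations of each first word and then looks up the best one. It is only an illustration under stated assumptions: the counts are made up, and prune_top_k, predict_next and the top_k cutoff are hypothetical names and choices, not part of the analysis above.

```r
# Toy bigram counts (made-up numbers, for illustration only)
bigrams <- data.frame(
  word1 = c("of", "of", "of", "to", "to"),
  word2 = c("the", "a", "course", "be", "the"),
  N     = c(50L, 20L, 5L, 30L, 25L),
  stringsAsFactors = FALSE
)

# Hypothetical pruning rule: keep only the top_k continuations of each word1
prune_top_k <- function(tbl, top_k) {
  tbl <- tbl[order(tbl$word1, -tbl$N), ]                    # best first per word1
  rank_in_group <- ave(tbl$N, tbl$word1, FUN = seq_along)   # 1, 2, ... per word1
  tbl[rank_in_group <= top_k, ]
}

# Return the most frequent continuation of a word, or NA if the word is unseen
predict_next <- function(tbl, word) {
  hits <- tbl[tbl$word1 == word, ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$word2[which.max(hits$N)]
}

pruned <- prune_top_k(bigrams, top_k = 2)
print(pruned)
print(predict_next(pruned, "of"))
```

A smaller top_k shrinks the table further at the cost of fewer candidate predictions per word.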