Initial Data Analysis

I began by reading in the three files (Twitter, news, and blogs) and counting the lines in each.

twit<-readLines('/Users/jamesharris/Documents/capstone/final/en_US/en_US.twitter.txt')
news<-readLines('/Users/jamesharris/Documents/capstone/final/en_US/en_US.news.txt')
blogs<-readLines('/Users/jamesharris/Documents/capstone/final/en_US/en_US.blogs.txt')
print(sprintf('length of Twitter file: %d',length(twit)))
## [1] "length of Twitter file: 2360148"
print(sprintf('length of News file: %d',length(news)))
## [1] "length of News file: 1010242"
print(sprintf('length of Blogs file: %d',length(blogs)))
## [1] "length of Blogs file: 899288"
print(length(blogs))
## [1] 899288
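
The loading step could be wrapped in a small helper so that each file is read the same way. This is only a sketch under assumptions: `readCorpus` is a hypothetical name that does not appear in the analysis above, and the binary-mode connection, `encoding = "UTF-8"`, and `skipNul = TRUE` are guesses about what these files need rather than settings the original run used. The demo reads a temporary file so that it runs without the capstone data.

```r
# Hypothetical helper (not part of the original analysis): read one corpus
# file and report how many lines it holds.
readCorpus <- function(path, label) {
  con <- file(path, open = "rb")   # binary mode, so odd control bytes do not end the read early
  on.exit(close(con))
  # encoding marks the strings as UTF-8; skipNul drops embedded NUL bytes
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  message(sprintf("length of %s file: %d", label, length(lines)))
  lines
}

# Demo on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo <- readCorpus(tmp, "demo")
```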

Basic summaries

To get some relatively crude word counts, I converted the text to lowercase and used a simple regular expression to extract every continuous run of alphabetical characters, treating each run as a word.

library(stringr) # also provides the %>% pipe used below
examineWords<-function(x, title){
  # lowercase once, then pull out every run of letters as a "word"
  x<-tolower(x)
  words<-str_extract_all(x, '[a-z]+') %>% unlist
  print(sprintf('Total words found: %d',length(words)))
  print(sprintf('Total unique words found: %d',length(unique(words))))
  wtable<-table(words)
  wtable<-wtable[order(wtable, decreasing = TRUE)]
  print(paste('20 Most comon words in',title))
  print(head(wtable,20))
  barplot(head(wtable,20), main = title)
}
examineWords(twit, title='Twitter')
## [1] "Total words found: 30557097"
## [1] "Total unique words found: 302653"
## [1] "20 Most comon words in Twitter"
## words
##    the      i     to      a    you    and    for     it     in     of     is 
## 937970 918857 788951 617518 601328 438744 385492 383744 380804 359757 358950 
##      s     my     on   that      t     me     be     at   with 
## 316954 292160 278287 271121 221789 203521 188044 186869 173528

examineWords(news, title='News')
## [1] "Total words found: 34616527"
## [1] "Total unique words found: 212227"
## [1] "20 Most comon words in News"
## words
##     the      to       a     and      of      in       s    that     for      it 
## 1975163  906198  894899  889612  774525  679242  457198  371864  353967  286646 
##      is      on    with      he    said     was      at       i      as     his 
##  284268  269987  254857  254498  250435  228974  214225  195520  187645  157686

examineWords(blogs, title='Blogs')
## [1] "Total words found: 37880273"
## [1] "Total unique words found: 253041"
## [1] "20 Most comon words in Blogs"
## words
##     the     and      to       i       a      of      in      it    that      is 
## 1860686 1094859 1069565  906917  904594  876848  598782  485385  484198  432768 
##     for     you       s    with     was      on      my    this      as    have 
##  363937  327904  326046  286771  278361  276621  270974  259189  224051  218953

Cleaning data further

The data sets need further cleaning, however. For example, all three appear to contain embedded URLs, as well as “@” used for mentions or email addresses. News seems to contain many special characters, either as escape codes such as u0093 or as hex-encoded bytes, which need to be converted to plain-text characters. This is especially common where the single quote or apostrophe has been replaced with a “curly” apostrophe (’).

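The simple extraction above also does not account for words that contain an apostrophe, such as “I’m”. It splits them at the apostrophe, which is why fragments such as “s” and “t” appear among the most common words above.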
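
The effect is easy to reproduce on a single sentence. The sketch below uses base R only; the sample sentence and the second pattern, which allows an apostrophe between letters, are illustrative assumptions and not part of the original analysis.

```r
# Made-up sample sentence with two contractions
txt <- tolower("I'm sure it doesn't matter")

# Letters-only pattern, as in examineWords(): contractions are split in two
tokens <- regmatches(txt, gregexpr("[a-z]+", txt))[[1]]
print(tokens)
# [1] "i"      "m"      "sure"   "it"     "doesn"  "t"      "matter"

# Allowing an apostrophe between runs of letters keeps them whole
tokens2 <- regmatches(txt, gregexpr("[a-z]+('[a-z]+)*", txt))[[1]]
print(tokens2)
# [1] "i'm"     "sure"    "it"      "doesn't" "matter"
```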

I tried using tm and other packages to clean these automatically, but I kept running into vector memory exhaustion errors. After some time on Stack Overflow and with the RStudio RegEx cheat sheet (https://rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf), I found some better regular expressions, including Perl-style ones:

# Cleaning up the text files

cleanUp<-function(x){
  # remove URLs (as written, this only catches links that start with https)
  x<-gsub("https\\S*", "", x)
  # get rid of mentions / emails (everything from the @ to the next space)
  x<-gsub("@\\S*", "", x)
  # Get instances of ’ (as raw UTF-8 bytes, then as code points) and replace with an ASCII '
  x <- gsub("\xe2\x80\x99", "'", x, perl=TRUE)
  x <- gsub("\u0027|\u0060|\u0091|\u0092|\u0093|\u0094|\u2019", "'", x, perl=TRUE)
  # remove apostrophes that do not follow a "word character", such as opening quotes
  x <- gsub("(?<!\\w)'" , " ", x, perl=TRUE)
  x <- gsub("[^[:alpha:][:space:]']", " ", x) # remove digits and all other punctuation marks
  # Collapse runs of spaces and stray apostrophes into a single space
  x<-gsub("[ ']{2,}",' ',x)
  return(x)
}
cleantwit<-cleanUp(twit)
cleannews<-cleanUp(news)
cleanblogs<-cleanUp(blogs)

#make a word list with all of the words
words<-str_extract_all(cleantwit, "[^[:space:]]+") %>% unlist
words<-c(words,str_extract_all(cleannews, "[^[:space:]]+") %>% unlist)
words<-c(words,str_extract_all(cleanblogs, "[^[:space:]]+") %>% unlist)
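
The cleaning function above could be tightened in a few places. The variant below is a sketch, not the code that produced the counts in this report: `cleanUpStrict` is a hypothetical name, and its patterns are assumptions about what a stricter pass might look like. It differs from `cleanUp` in matching `http://` and `www.` links as well as `https`, in removing a whole email address instead of only the part from the “@” onward, and in keeping an apostrophe only when it has a letter on both sides.

```r
# Hypothetical stricter variant of cleanUp(); an illustration only
cleanUpStrict <- function(x) {
  # remove http, https, and www links
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)
  # remove mentions and whole email addresses
  x <- gsub("\\S*@\\S*", " ", x, perl = TRUE)
  # curly single quotes and backticks become an ASCII apostrophe
  x <- gsub("[\u2018\u2019\u0060]", "'", x, perl = TRUE)
  # keep an apostrophe only when it sits between two letters
  x <- gsub("(?<![[:alpha:]])'|'(?![[:alpha:]])", " ", x, perl = TRUE)
  # drop digits and all remaining punctuation
  x <- gsub("[^[:alpha:][:space:]']", " ", x)
  # collapse runs of whitespace and trim the ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleanUpStrict("Check https://t.co/abc @user 'quote' it's 100% done!!")
# [1] "Check quote it's done"
```

Because it changes which tokens survive, running it on the full data would give counts that differ slightly from the tables in this report.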

Clean word frequencies (with capitalization)

We will look at the new single-word frequencies:

library(data.table)
# Make a data table, count each word, and remove duplicates, once counted
dt<-data.table(words=words)
dt<-dt[,N:=.N, by=words]%>% unique()
dt<-dt[order(N, decreasing = TRUE)]
head(dt)
##    words       N
## 1:   the 4232485
## 2:    to 2724302
## 3:   and 2299062
## 4:     a 2287703
## 5:    of 1991547
## 6:    in 1549745

Now we look at two-word pairs (bigrams):

dt<-data.table(word1=words[1:(length(words)-1)], word2=words[2:length(words)])
dt<-dt[,N:=.N, by=.(word1,word2)] %>% unique
dt<-dt[order(N, decreasing = TRUE)]
head(dt)
##    word1 word2      N
## 1:    of   the 423787
## 2:    in   the 384522
## 3:    to   the 208847
## 4:   for   the 191695
## 5:    on   the 185861
## 6:    to    be 159502

We can do the same for three-word sequences (trigrams):

dt<-data.table(word1=words[1:(length(words)-2)], word2=words[2:(length(words)-1)], word3=words[3:length(words)])
dt<-dt[,N:=.N, by=.(word1,word2, word3)] %>% unique
dt<-dt[order(N, decreasing = TRUE)]
head(dt)
##     word1 word2 word3     N
## 1:    one    of   the 29027
## 2:      a   lot    of 27242
## 3:     to    be     a 17938
## 4:  going    to    be 16941
## 5: Thanks   for   the 15746
## 6:    the   end    of 14556
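
The pair and triple tables above follow the same pattern of shifting the word vector by one position per column, so the idea generalizes to any n. The helper below is a sketch in base R, chosen so it runs without extra packages; `countNgrams` and the toy word vector are hypothetical and not part of the original analysis. Like the code above, it treats its input as one long word sequence, so on the combined `words` vector some n-grams would span line boundaries. On the full corpus the data.table grouping used above is likely to be faster.

```r
# Hypothetical helper: count n-grams of any order from a vector of words
countNgrams <- function(words, n) {
  stopifnot(n >= 1, length(words) >= n)
  k <- length(words) - n + 1                      # number of n-grams
  # Element j holds the j-th word of every n-gram (the vector shifted by j - 1)
  cols <- lapply(seq_len(n), function(j) words[j:(j + k - 1)])
  grams <- do.call(paste, cols)                   # join the words with spaces
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), N = as.integer(counts),
             stringsAsFactors = FALSE)
}

# Toy input (made up for illustration)
toy <- c("one", "of", "the", "best", "one", "of", "the", "rest")
tri <- countNgrams(toy, 3)
head(tri, 1)   # "one of the" occurs twice; every other trigram occurs once
```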

Next steps

Now that we have a set of frequency look-up tables, we can cull them for memory and speed. Some issues remain, such as removing profanity and cleaning out non-English words, but the beginnings of a way to build efficient models are becoming apparent.

In the coming weeks, I will work on further cleaning the words and on trimming the models to the most probable pairs or triples of words. I may also use phoneme conversion as an added predictor.
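
One possible shape for the trimming step is sketched below. Everything in it is an assumption made for illustration: the toy counts are made up, `pruneBigrams` and `predictNext` are hypothetical names, and the `minCount` and `keep` thresholds are placeholders, not tuned values. The idea is to drop rare pairs, keep only the few most frequent continuations of each first word, and then look up a word in the smaller table.

```r
# Toy bigram table in the same shape as the one built earlier (made-up counts)
bigrams <- data.frame(
  word1 = c("of", "of", "of", "to", "to", "to"),
  word2 = c("the", "a", "my", "be", "the", "go"),
  N     = c(50L, 20L, 2L, 40L, 30L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical pruning rule: drop rare pairs, then keep the top `keep`
# continuations for each first word
pruneBigrams <- function(tab, minCount = 2L, keep = 2L) {
  tab <- tab[tab$N >= minCount, ]
  tab <- tab[order(tab$word1, -tab$N), ]
  rank <- ave(tab$N, tab$word1, FUN = seq_along)  # position within each word1 group
  tab[rank <= keep, ]
}

# Return the continuations of `word`, most frequent first
predictNext <- function(tab, word) {
  hits <- tab[tab$word1 == word, ]
  hits$word2[order(-hits$N)]
}

pruned <- pruneBigrams(bigrams)
predictNext(pruned, "of")
# [1] "the" "a"
```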