I begin by loading the packages that I will use for natural language processing and reading in the data with readLines.
library(tm)
library(NLP)
##
## Attaching package: 'NLP'
##
## The following objects are masked from 'package:tm':
##
## meta, meta<-
library(openNLP)
twitter <- readLines('en_US.twitter.txt')
blogs <- readLines('en_US.blogs.txt')
news <- readLines('en_US.news.txt')
First, I want to know the length of each dataset, in number of lines:
length(twitter)
## [1] 2360148
length(blogs)
## [1] 899288
length(news)
## [1] 1010242
Next, what's the average length of a line, in characters?
mean(nchar(twitter))
## [1] 68.68
mean(nchar(blogs))
## [1] 230
mean(nchar(news))
## [1] 201.2
So while the Twitter dataset has the most lines, it also has much shorter lines.
I can also plot the distribution of line lengths in each dataset.
We see a gradual decrease in the number of tweets at greater lengths, but with upticks at 120 and 140 (the maximum tweet length).
hist(nchar(twitter))
The news and blogs datasets are both heavily skewed right, indicating that there are some very long news and blog entries that are far above the normal range of lengths.
hist(nchar(news))
hist(nchar(blogs))
Zooming in, we get a better sense of their distributions.
hist(nchar(blogs)[nchar(blogs) < 500])
hist(nchar(news)[nchar(news) < 500])
Interestingly, although the blogs dataset had a higher mean line length than news, the blogs seem to be more clustered towards the low end of the distribution. This suggests that relatively few very long blog posts are bringing up the mean. The news dataset is also skewed right, but it is closer to normally distributed than the blogs dataset.
These datasets are very large, so analyzing all of them will be too time consuming.
Instead, I'll take a random 0.1% sample of each dataset to explore: about 2,400 tweets, 900 blog posts, and 1,000 news articles.
set.seed(1024)
twitter_sample <- twitter[sample(1:length(twitter), round(length(twitter)/1000), replace=F)]
news_sample <- news[sample(1:length(news), round(length(news)/1000), replace=F)]
blog_sample <- blogs[sample(1:length(blogs), round(length(blogs)/1000), replace=F)]
I've written nested for loops to extract words, 2-grams, 3-grams, and 4-grams from each of the text files. This runs very slowly, so when I am building my model I will need to figure out how to make this run more quickly.
First, I create blank vectors.
twitter_sentences <- c()
twitter_words <- c()
twitter_2grams <- c()
twitter_3grams <- c()
twitter_4grams <- c()
Next, using a Maxent sentence token annotator from openNLP, I split individual tweets into sentences. Then, in nested for-loops, I standardize each sentence by removing all punctuation and converting it to lowercase, split it into individual words, and append the resulting words, 2-grams, 3-grams, and 4-grams to the blank vectors.
sentence_token_annotator <- Maxent_Sent_Token_Annotator(language = 'en')
for(tweet in twitter_sample){
  tweet <- as.String(tweet)
  sentence_boundaries <- annotate(tweet, sentence_token_annotator)
  tweet_sentences <- tweet[sentence_boundaries]
  for(sentence in tweet_sentences){
    # remove capitalization and punctuation
    sentence_without_punc <- gsub("[[:punct:]]", "", as.character(sentence))
    sentence_clean <- tolower(sentence_without_punc)
    twitter_sentences <- c(twitter_sentences, sentence_clean)
    sentence_words <- strsplit(sentence_clean, split=" ")[[1]]
    for(i in seq_along(sentence_words)){
      twitter_words <- c(twitter_words, sentence_words[i])
      if (i > 1){
        twitter_2grams <- c(twitter_2grams, paste(sentence_words[i-1], sentence_words[i]))
      }
      if (i > 2){
        twitter_3grams <- c(twitter_3grams, paste(sentence_words[i-2], sentence_words[i-1], sentence_words[i]))
      }
      if (i > 3){
        twitter_4grams <- c(twitter_4grams, paste(sentence_words[i-3], sentence_words[i-2], sentence_words[i-1], sentence_words[i]))
      }
    }
  }
}
twitter_wordfreq <- sort(table(twitter_words), decreasing = TRUE)
twitter_2gramfreq <- sort(table(twitter_2grams), decreasing = TRUE)
twitter_3gramfreq <- sort(table(twitter_3grams), decreasing = TRUE)
twitter_4gramfreq <- sort(table(twitter_4grams), decreasing = TRUE)
I can then repeat the process with blogs and news.
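Rather than copying and pasting the loop above for each corpus, the same logic could be wrapped in a small helper function and applied to each sample. The sketch below is one possible way to do that; extract_ngrams is a name I am introducing here, not a function from tm, NLP, or openNLP.
# Sketch of a reusable version of the loop above (extract_ngrams is a
# hypothetical helper, not part of any package used here).
extract_ngrams <- function(lines, annotator) {
  words <- c(); grams2 <- c(); grams3 <- c(); grams4 <- c()
  for (line in lines) {
    line <- as.String(line)
    boundaries <- annotate(line, annotator)
    for (sentence in line[boundaries]) {
      # standardize: strip punctuation, lowercase, split on spaces
      clean <- tolower(gsub("[[:punct:]]", "", as.character(sentence)))
      w <- strsplit(clean, split = " ")[[1]]
      words <- c(words, w)
      for (i in seq_along(w)) {
        if (i > 1) grams2 <- c(grams2, paste(w[i-1], w[i]))
        if (i > 2) grams3 <- c(grams3, paste(w[i-2], w[i-1], w[i]))
        if (i > 3) grams4 <- c(grams4, paste(w[i-3], w[i-2], w[i-1], w[i]))
      }
    }
  }
  list(words = words, grams2 = grams2, grams3 = grams3, grams4 = grams4)
}
blog_tokens <- extract_ngrams(blog_sample, sentence_token_annotator)
news_tokens <- extract_ngrams(news_sample, sentence_token_annotator)
blog_words <- blog_tokens$words
news_words <- news_tokens$words
The corresponding frequency tables (blog_wordfreq, blog_2gramfreq, and so on) are then built with table() and sort(), exactly as for the Twitter sample.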
Finally, I have a list of the most common words and n-grams in each body of text. I'll look at the 50 most common words and the 20 most common n-grams of each order to show how the text files differ.
Since I took a 0.1% sample of each corpus, I can multiply the resulting word counts by 1000 to get an estimate of each corpus's word count.
print(length(twitter_words)*1000)
## [1] 29553000
print(length(blog_words)*1000)
## [1] 30015000
print(length(news_words)*1000)
## [1] 3.4e+07
print(twitter_wordfreq[1:50])
## twitter_words
## the to i a you and in for of
## 900 759 680 609 505 442 421 394 387 374
## is it my on that me at be have your
## 345 306 299 274 215 187 182 180 178 170
## so are with im just this we like get not
## 168 167 162 155 148 140 140 135 126 121
## out but its was up rt all do what good
## 121 119 117 116 113 112 110 109 108 98
## if thanks when u about from love can will dont
## 94 94 93 88 85 85 85 83 82 81
print(blog_wordfreq[1:50])
## blog_words
## the to a and of in for that is on with he
## 1763 810 782 748 641 604 304 304 254 236 223 222
## said at was it as his but be from have i
## 214 211 211 200 173 162 158 153 148 134 133 128
## an has by its who or this are about they will were
## 114 106 105 103 102 98 98 95 94 90 89 86
## we not one when more out would you had she their been
## 82 80 80 78 77 70 69 69 66 65 65 64
## what up
## 63 61
print(news_wordfreq[1:50])
## news_words
## the to and a in of for that on is with he
## 1860 887 882 870 728 664 338 333 295 272 236 226
## said it was at from as his have be i but
## 225 215 212 192 191 175 173 166 165 156 147 140
## are its an by not this has you who they more will
## 136 132 131 127 123 123 113 110 107 98 96 96
## or when about her we had out new up she what than
## 95 87 86 86 82 80 77 75 75 73 73 71
## were would
## 71 71
As you can see, the three corpora have different word frequencies. The words “I” and “you” are much more common on Twitter, where people are mostly expressing themselves or interacting with others. We also see Twitter-specific words such as “rt” and informal non-words such as “u”. The blogs and news datasets have more standard word frequency distributions.
An interesting question is: how many words are needed to cover 50% of the words in each body of text? The following code can find the answer.
for (i in seq_along(twitter_wordfreq)){
  if(sum(twitter_wordfreq[1:i]) > length(twitter_words)/2.0){
    print(i)
    break
  }
}
## [1] 118
for (i in seq_along(blog_wordfreq)){
  if(sum(blog_wordfreq[1:i]) > length(blog_words)/2.0){
    print(i)
    break
  }
}
## [1] 194
for (i in seq_along(news_wordfreq)){
  if(sum(news_wordfreq[1:i]) > length(news_words)/2.0){
    print(i)
    break
  }
}
## [1] 208
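The same thresholds can also be found without an explicit loop by taking a cumulative sum over each sorted frequency table. This is a minimal equivalent sketch, assuming the word vectors and frequency tables built above.
# Equivalent vectorized calculation: the first rank at which the cumulative
# word count exceeds half of all word tokens in the sample.
which(cumsum(twitter_wordfreq) > length(twitter_words) / 2)[1]
which(cumsum(blog_wordfreq) > length(blog_words) / 2)[1]
which(cumsum(news_wordfreq) > length(news_words) / 2)[1]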
This analysis also shows that Twitter, despite its creative spellings, draws on a smaller working vocabulary: 50% of all word tokens on Twitter are covered by just the 118 most common words, while 194 and 208 words are needed to cover 50% of all blog and news words, respectively. Below are the most common 2-grams, 3-grams, and 4-grams in each sample.
## twitter_2grams
## in the rt of the for the on the thanks for
## 74 72 70 67 51 46
## to be thank you to the at the going to have a
## 46 42 40 34 34 34
## i just i have if you to see to get i am
## 33 32 31 31 29 28
## i love want to
## 28 28
## twitter_3grams
## thanks for the cant wait to for the follow
## 26 15 12
## i need to to see you looking forward to
## 12 11 10
## check it out im going to of the day
## 8 8 8
## one of the you so much going to be
## 8 8 7
## have a great thank you for would like to
## 7 7 7
## rt a lot of how do you
## 6 6 6
## i have to i just saw
## 6 6
## twitter_4grams
## thanks for the follow add boston add boston thank you for the
## 10 5 5
## boston add boston add hope to see you thank you so much
## 4 4 4
## cant wait to see dont even know what even know what to
## 3 3 3
## going to be a i am going to i dont even know
## 3 3 3
## i will be there if you want to just trying to get
## 3 3 3
## love you so much on the other side thank you thank you
## 3 3 3
## the end of the to see you there
## 3 3
## blog_2grams
## of the in the to the at the for the on the in a and the
## 161 157 87 66 59 55 47 41
## to be with the with a for a he was from the he said as the
## 40 36 35 34 34 31 30 29
## that the one of as a by the
## 29 28 27 27
## blog_3grams
## one of the according to the the end of a lot of
## 15 7 7 6
## part of the as part of dont want to in the third
## 6 5 5 5
## out of the president of the said he was said it would
## 5 5 5 5
## some of the the way the a little bit according to a
## 5 5 4 4
## at the same in the fourth is one of it comes to
## 4 4 4 4
## blog_4grams
## at the same time when it comes to as part of a
## 4 4 3
## by the end of in the united states the blazers are
## 3 3 2
## 60 percent of the a large number of a lot of things
## 2 2 2
## a member of the and not paying for as a member of
## 2 2 2
## at the beginning of at the end of at the start of
## 2 2 2
## avenue on a charge be approved by a come out and play
## 2 2 2
## for the los angeles for the rest of
## 2 2
## news_2grams
## in the of the to the on the for the in a and the
## 172 170 79 75 51 50 45
## to be at the from the with the more than for a will be
## 43 42 42 39 36 34 34
## as a that the with a is a he said by the
## 33 33 33 31 30 27
## news_3grams
## because of the of the season as well as i dont think
## 7 7 6 6
## im going to in new york more than a one of the
## 6 6 6 6
## out of the the united states based on the part of the
## 6 6 5 5
## percent of the this is a to be a a lot of
## 5 5 5 4
## according to the any of the around the world be able to
## 4 4 4 4
## news_4grams
## in the united states from around the world 10 am to 4
## 4 3 2
## a scene in the a spokeswoman for the about equality of result
## 2 2 2
## allowed two runs on am to 4 pm and in the end
## 2 2 2
## and natural gas in are more likely to as well as a
## 2 2 2
## at a news conference at the community hall at the end of
## 2 2 2
## be found at the be required to disclose by tonight at midnight
## 2 2 2
## can be found at can do only so
## 2 2
Analyzing the n-grams gives us more insight into the content of each of these corpora. The phrases in the Twitter corpus mostly involve people talking about themselves or using social phrases such as “thanks for the”, “hope to see you”, and “love you so much”. In blogs and news, we start to see common news phrases such as “president of the” and “in the united states”.
This is only a basic analysis. Before I create my predictive text model, I will probably want to add some additional preprocessing steps, such as extracting part-of-speech tags, word roots, and preceding punctuation as features.
Storing all of these features will also require me to construct a faster method of preprocessing all of the lines.
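One direction I plan to explore, sketched below under the assumption that simple whitespace tokenization is good enough for a first pass (it skips the sentence annotation step used above), is to clean all lines at once with vectorized string functions and build each line's n-grams with lapply instead of growing vectors inside nested loops.
# Sketch of a faster preprocessing pass: vectorized cleaning, then per-line
# n-gram construction; make_ngrams is a hypothetical helper introduced here.
clean_lines <- tolower(gsub("[[:punct:]]", "", twitter_sample))
word_lists <- strsplit(clean_lines, split = " ")
make_ngrams <- function(w, n) {
  if (length(w) < n) return(character(0))
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}
twitter_words_fast <- unlist(word_lists)
twitter_2grams_fast <- unlist(lapply(word_lists, make_ngrams, n = 2))
twitter_3grams_fast <- unlist(lapply(word_lists, make_ngrams, n = 3))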
Once I have a variety of features for each word, such as the previous word, previous n-grams, preceding punctuation, parts of speech, and roots of previous words, I can use them to build a variety of machine learning models whose accuracy I will test using cross-validation. Tree-based models such as random forests come to mind immediately as potentially useful, but I will also explore other algorithms and research which ones are commonly used for this purpose.
There will be many instances where the user is typing something novel, i.e., where the previous word or previous n-grams cannot be found in the corpus. In these cases, root-word and part-of-speech features could be particularly useful, since the model needs to return a suggestion and should have as many features available to it as possible, even if the input does not match any existing pattern exactly. However, the model should still assign some degree of confidence to its predictions; if a person is typing nonsense, or typing in another language, the model should not return any suggestions.
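Even with just the frequency tables built in this report, a very rough version of that behaviour can be sketched: look up the user's last few words in the longer n-gram tables first, back off to shorter ones, and return nothing when no match is found. predict_next below is a hypothetical illustration, not the model I intend to build.
# Hypothetical back-off lookup over the sorted Twitter n-gram tables.
predict_next <- function(last_words) {
  for (n in c(3, 2, 1)) {
    if (length(last_words) < n) next
    prefix <- paste(tail(last_words, n), collapse = " ")
    freqs <- switch(as.character(n),
                    "3" = twitter_4gramfreq,
                    "2" = twitter_3gramfreq,
                    "1" = twitter_2gramfreq)
    # the tables are already sorted by frequency, so the first match is the best
    hits <- grep(paste0("^", prefix, " "), names(freqs), value = TRUE)
    if (length(hits) > 0) return(sub(paste0("^", prefix, " "), "", hits[1]))
  }
  NULL  # unseen or nonsensical input: no suggestion
}
predict_next(c("thanks", "for", "the"))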
Ideally the text prediction model should be able to learn from the user's distinctive language patterns, so a later, more advanced goal of the Shiny app could be to take the user's sentences as input and add them to the training set for the next time it generates the model.