Introduction

This project aims to create an app that predicts user text. This document explores sample data from three internet sources and provides statistical summaries and visualizations of the related text. The frequencies of words, phrases, and symbols are examined to determine how often particular items appear. Technical details (e.g. R programming code), additional statistical summaries, and other data visualizations have been excluded for brevity, but are available upon request.

For the remainder of this document, any individual word, symbol, or combination of characters (e.g. #DubNation) is referred to as a token. Tokens represent individual units of text. The analysis also references types, which are unique tokens.

The Modeling Plan

The prediction model will use the final 1-2 tokens entered by a user to determine which token is most likely to follow. The ensuing exploratory analysis identifies how often users enter a single token (unigram), a pair of tokens (bigram), or a group of three tokens (trigram). These groupings are known generically as n-grams, where n is the number of tokens in the group. N-gram statistics will help construct a model in which tokens are connected in a network based on their approximate probability of appearing together.

This document focuses on exploring the data and does not describe the model in further detail. However, some modeling considerations are addressed, and a brief sketch of the intended lookup scheme appears below.
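
As a rough illustration only, the R sketch below shows the kind of backoff lookup the model is expected to perform: try the final two input tokens against a trigram table, fall back to the final token against a bigram table, and default to the most frequent unigram. The frequency tables here are hypothetical toy values, not tables built from the project data.

    # Toy n-gram frequency tables (hypothetical values, for illustration only).
    unigrams <- data.frame(w1 = c("just", "like", "one"),
                           n  = c(9932, 8700, 8100))
    bigrams  <- data.frame(w1 = c("right", "last"),
                           w2 = c("now", "night"),
                           n  = c(310, 290))
    trigrams <- data.frame(w1 = "happy", w2 = "new", w3 = "year", n = 120)

    predict_next <- function(tokens) {
      k <- length(tokens)
      # 1. Try the final two tokens against the trigram table.
      if (k >= 2) {
        hit <- trigrams[trigrams$w1 == tokens[k - 1] & trigrams$w2 == tokens[k], ]
        if (nrow(hit) > 0) return(hit$w3[which.max(hit$n)])
      }
      # 2. Back off to the bigram table using only the final token.
      hit <- bigrams[bigrams$w1 == tokens[k], ]
      if (nrow(hit) > 0) return(hit$w2[which.max(hit$n)])
      # 3. Otherwise return the single most frequent unigram.
      unigrams$w1[which.max(unigrams$n)]
    }

    predict_next(c("happy", "new"))   # "year"
    predict_next("right")             # "now"
    predict_next("zzz")               # "just"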

The Data

The data is in plain text format and was collected from blog entries, news sites, and Twitter posts by a web crawler. The blog data consisted of 898,384 lines, the news data consisted of 77,258 lines, and the Twitter data consisted of 2,307,307 lines.

With more than 3.2 million lines of data, memory is a concern. To address it, the data is randomly sampled, with the understanding that a sufficiently large random sample will reliably represent the full data set and provide useful information for modeling.

From each file, 5% of the lines were randomly sampled. Then 80% of each sample was set aside for exploratory analysis (and later model construction), with the remaining 20% reserved for testing the model. A few lines from each sample are included below.

## [1] "news text..."
## [1] "Kipnis is hitting .257 (19-for-74), with one double, two triples, three homers and 12 RBI. In his past seven games, he's batting .423 (11-for-26)."                                                                                                     
## [2] "As of Monday, and thanks to a couple of steel bars, the structure that will replace the twin towers holds the distinction of being the highest structure in the five boroughs of New York City. Eventually, it will be the tallest building in the U.S."
## [1] "blogs text..."
## [1] "You may also like…"                                                                                                                                                                                                          
## [2] "Then I ordered this king prawn dish that I always have, but it was just … outta this world and to finish I had brownies, ice cream, whipped cream, blackberries and this raspberry coulis! Again … AWESOMEEEEEEEEEEEEEEEEE!!"
## [1] "twitter text..."
## [1] "I'm jealous! Huge Museum Lab fan here! RT Great afternoon with friends from Louvre DNP Museum Lab; they do such fascinating..."
## [2] "Twitter randomly unfollowed you. Hope all is well, Elizabeth, and Freddie."

Analyzing Tokens

Each file was split into unigrams with punctuation, profanity, and stopwords removed. Stopwords are common words (e.g. “I”, “the”, “a”) that appear frequently in almost any text. The three files were then combined into a single text file, after which unigrams, bigrams, and trigrams were created (again with punctuation, profanity, and stopwords removed). Finally, the combined file was split into unigrams once more, this time with stopwords included.
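
One way to produce these tokens is with the quanteda package, sketched below. This is not necessarily the exact code used for the analysis; the profanity word list is a placeholder, and the train objects come from the sampling sketch earlier in this document.

    library(quanteda)

    profanity <- readLines("profanity.txt")   # assumed list of terms, one per line
    combined  <- c(blogs$train, news$train, twitter$train)

    toks <- tokens(combined, remove_punct = TRUE) |>
      tokens_tolower() |>
      tokens_remove(profanity)

    toks_nostop <- tokens_remove(toks, stopwords("en"))

    # Top 30 features for each token set.
    top_unigrams  <- topfeatures(dfm(toks_nostop), 30)
    top_bigrams   <- topfeatures(dfm(tokens_ngrams(toks_nostop, n = 2, concatenator = " ")), 30)
    top_trigrams  <- topfeatures(dfm(tokens_ngrams(toks_nostop, n = 3, concatenator = " ")), 30)
    top_uni_stopw <- topfeatures(dfm(toks), 30)   # unigrams with stopwords retained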

Below are the top 30 tokens from each data set. The news, blogs, and twitter columns refer to the individual data sets, while the unigram, bigram, trigram, and unigram_stopw columns refer to the combined data set. Note that the token “just” is the top-ranked unigram in the combined data set when stopwords are excluded, but only the 30th-ranked unigram when stopwords are included.

##       news  blogs twitter unigram          bigram                trigram unigram_stopw
## 1     said    one    just    just             < 3                  = = =           the
## 2        $    can    like    like       right now                  > > >            to
## 3        =   just     get     one      last night                  < < <           and
## 4      one   like    love     can       feel like            let us know             a
## 5      new   time    good     get             > >         happy new year             i
## 6      two    get  thanks    time looking forward      happy mothers day            of
## 7      can   know      rt    good  happy birthday               love < 3            in
## 8     year    now     can    love        new york     happy mother's day           you
## 9     last people     day     now      looks like          new york city            is
## 10   first    new     now    know         can get                  $ $ $           for
## 11    also   also     one     day        let know          cinco de mayo          that
## 12    time   even    know     new      first time         new york times            it
## 13    just   make       u      go    good morning         love love love            my
## 14    like  first   great     see        just got            come see us            on
## 15    city    day      go  people       make sure                1 2 cup          with
## 16  people really    time    back       years ago looking forward seeing          this
## 17   state    see   today   think       last week  good morning everyone           was
## 18   years   back     lol   great             = =          just got back            be
## 19  school   well     new    make       last year         blah blah blah          have
## 20     get   good     see   going       good luck                1 4 cup            at
## 21    says     us     got  thanks             < <        please let know           are
## 22   three  think    back  really         can see         caps caps caps            me
## 23     now   much   think      us     even though          two years ago           but
## 24    good    way   going   today   thanks follow         cake cake cake            so
## 25 percent little  people    much             1 2                  < 3 <            we
## 26    back   love    need     got     follow back                  3 < 3            as
## 27   going     go       3    well         one day          two weeks ago           not
## 28    many   many   happy    want     high school really looking forward           all
## 29  police    two  follow      rt       look like               gt gt gt          your
## 30       1  going    want   first       next week            pop pop pop          just

The table below shows summary statistics of the n-gram frequencies in each set of tokens (with the exception of the Lines column, which records the number of sampled lines of text rather than a frequency statistic). The analysis focuses on n-grams within the combined text files.

##                      File Min Q1 Median      Mean Q3    Max   Total  Lines
## 1                    news   1  1      1  3.745821  3    780   63870   3091
## 2                   blogs   1  1      1 11.818666  4   4906  776510  35936
## 3                 twitter   1  1      1 12.117573  4   5824  642607  92093
## 4       combined unigrams   1  1      1 15.440000  4   9932 1482987 131120
## 5        combined bigrams   1  1      1  1.340000  1   1090 1352635 131120
## 6       combined trigrams   1  1      1  1.020000  1    258 1227177 131120
## 7 unigrams with stopwords   1  1      1 28.250000  4 115481 2719276 131120
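
For reference, summary statistics of this kind can be computed from a named vector of n-gram counts, for example colSums() of a quanteda document-feature matrix from the tokenization sketch above. The helper below is a sketch of that calculation, not the code that produced the table.

    # Summarise a vector of n-gram counts, e.g. colSums() of a document-feature matrix.
    ngram_summary <- function(freq) {
      data.frame(Min    = min(freq),
                 Q1     = unname(quantile(freq, 0.25)),
                 Median = median(freq),
                 Mean   = mean(freq),
                 Q3     = unname(quantile(freq, 0.75)),
                 Max    = max(freq),
                 Total  = sum(freq))
    }

    ngram_summary(colSums(dfm(toks_nostop)))   # combined unigrams, stopwords excluded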

For each data set we see that many n-grams appear only once, which tells us that most tokens appear infrequently whether considered individually or in groups. However, the Max column indicates that the maximum frequency of an n-gram can be extremely large. In particular, note the large disparity between the maximum frequencies for unigrams with and without stopwords: the most common stopword appeared roughly 12 times as often as the most common non-stopword (115,481 vs. 9,932 occurrences). Stopwords are excluded in some forms of natural language processing, but it would be inappropriate to do so when predicting text, given their relatively high probability of occurrence.

Also note that it is not uncommon for unigrams to appear multiple times, whereas bigrams and trigrams typically appear only once in the text.

Four plots are provided below to visualize these relationships further.

  1. Histogram of Unigram Frequency (stopwords excluded)
  2. Histogram of Bigram Frequency (stopwords excluded)
  3. Scatter plot of Types vs Tokens for Trigrams (stopwords excluded)
  4. Scatter plot of Types vs Tokens for Unigrams (stopwords included)

The histograms reinforce the summary statistics above, showing that n-grams most commonly appear only once. The scatter plots of types versus tokens give insight into how many unique unigrams or trigrams the model must store to account for the text it will see; recall that types represent unique tokens.

The final plot, of individual tokens with stopwords included, rises extraordinarily quickly, with the most common 10% of unique unigrams accounting for over 90% of all unigram occurrences. This supports the inclusion of stopwords in the model, as they allow the model to account for an extremely high percentage of tokens.

However, the third plot of trigrams shows a much more gradual increase. This indicates that very few trigrams repeat, which corroborates the summary statistics.
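
The coverage behind these scatter plots can also be computed directly by sorting a frequency table and accumulating counts. The sketch below, which reuses objects from the tokenization sketch above, returns the fraction of unique types needed to cover a given share of all token occurrences.

    # Fraction of unique types (most frequent first) needed to cover a share of all tokens.
    coverage_fraction <- function(freq, share = 0.90) {
      freq    <- sort(freq, decreasing = TRUE)
      covered <- cumsum(freq) / sum(freq)
      which(covered >= share)[1] / length(freq)
    }

    coverage_fraction(colSums(dfm(toks)), 0.90)   # unigrams with stopwords included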

Final Thoughts

Analyzing the sampled text demonstrates that, despite the expansive list of existing English words, we can plausibly represent the set of individual words a user is likely to choose with a relatively small set of words and symbols. At the same time, the analysis indicates that pairs and triples of words are fairly unique. As such, accurately predicting text based on previous words may require storing these word groupings explicitly.

Practical limitations do not allow us to include every available English word. Therefore the model will need a method for handling tokens that are missing from its dictionary (e.g. misspellings, foreign words, etc.).
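
One common way to handle such out-of-dictionary tokens (an assumption here, not a decision the project has already made) is to map rare or unseen words to a generic placeholder such as <UNK> before building the n-gram tables, so the model always has an entry to fall back on.

    # Map tokens outside the model's dictionary to an <UNK> placeholder (sketch).
    # The 50,000-word dictionary size is an assumption.
    dictionary <- head(names(sort(colSums(dfm(toks)), decreasing = TRUE)), 50000)

    handle_unknowns <- function(words, dict) {
      ifelse(words %in% dict, words, "<UNK>")
    }

    handle_unknowns(c("the", "qwertyuiop"), dictionary)   # "the" "<UNK>"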

Note: Though profanity was excluded from the exploratory analysis, the model will include profanity when predicting text. Profanity is quite common, so excluding it could bias prediction results. Instead of removing it from the data, the app will likely filter the output.