This project aims to create an app that predicts user text. This document explores sample data from three internet sources and provides statistical summaries and visualizations of the related text. The variation in the frequency of words, phrases, and symbols is examined to determine how often different items appear. Technical details (e.g. R programming code), additional statistical summaries, and other data visualizations have been excluded for brevity, but are available upon request.
For the remainder of this document, any individual word, symbol, or combination of characters (e.g. #DubNation) is referred to as a token. Tokens represent individual units of text. The analysis will also reference types, which are unique tokens.
The prediction model will use the final 1-2 tokens input by a user to determine which token is most likely to follow. The ensuing exploratory analysis identifies how often users input a single token (unigram), a pair of tokens (bigram), or a group of three tokens (trigram). These groupings are known generically as n-grams, where n is the number of tokens in the group. N-gram statistics will help construct a model that creates a network where tokens are connected based on their approximate probability of appearing together.
This document focuses on exploring the data and does not describe the model in further detail. However, some modeling considerations will be addressed.
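To make the idea concrete, the sketch below shows, purely as an illustration and not as the eventual model, how trigram counts could be used to rank candidate next words from the final two tokens a user types. The object `trigram_counts` and its columns (first, second, third, count) are assumptions made for this example.

```r
# Illustrative sketch only: rank candidate next words by how often they
# follow the user's final two tokens in a table of trigram counts.
library(dplyr)

predict_next <- function(trigram_counts, w1, w2) {
  trigram_counts %>%
    filter(first == w1, second == w2) %>%
    mutate(prob = count / sum(count)) %>%  # relative frequency as a rough probability
    arrange(desc(prob)) %>%
    slice_head(n = 3)                      # top three candidate next words
}
```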
The data is in plain text format and was collected from blog entries, news sites, and Twitter posts by a web crawler. The blog data consisted of 898,384 lines, the news data consisted of 77,258 lines, and the Twitter data consisted of 2,307,307 lines.
With more than 3.2 million lines of data, memory is a concern. The data is randomly sampled to address this concern with the understanding that a sufficiently large random sample will reliably represent the data and provide useful information for modeling.
In each file, 5% of the lines were randomly sampled. Then 80% of each sample was set aside for exploratory analysis (and later model construction), with the remaining 20% reserved for testing the model. A few samples from each text file are included below, followed by a rough sketch of the sampling step.
## [1] "news text..."
## [1] "Kipnis is hitting .257 (19-for-74), with one double, two triples, three homers and 12 RBI. In his past seven games, he's batting .423 (11-for-26)."
## [2] "As of Monday, and thanks to a couple of steel bars, the structure that will replace the twin towers holds the distinction of being the highest structure in the five boroughs of New York City. Eventually, it will be the tallest building in the U.S."
## [1] "blogs text..."
## [1] "You may also like…"
## [2] "Then I ordered this king prawn dish that I always have, but it was just … outta this world and to finish I had brownies, ice cream, whipped cream, blackberries and this raspberry coulis! Again … AWESOMEEEEEEEEEEEEEEEEE!!"
## [1] "twitter text..."
## [1] "I'm jealous! Huge Museum Lab fan here! RT Great afternoon with friends from Louvre DNP Museum Lab; they do such fascinating..."
## [2] "Twitter randomly unfollowed you. Hope all is well, Elizabeth, and Freddie."
Each file was split into unigrams with punctuation, profanity, and stopwords removed. Stopwords are common words (e.g. “I”, “the”, “a”, etc.) that appear frequently in text. The three files were then combined into a single text file, after which unigrams, bigrams, and trigrams were created (again with punctuation, profanity, and stopwords removed). Finally, the combined file was again split into unigrams with stopwords included.
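A rough sketch of this tokenization step is shown below. The tidytext package is used here only for illustration; the package choice and the input objects (blogs_train, news_train, twitter_train, assumed to be the sampled training lines) are assumptions, and profanity filtering is omitted from the sketch.

```r
# Illustrative tokenization sketch; the report's actual implementation is omitted.
library(dplyr)
library(tidytext)

combined <- tibble(text = c(blogs_train, news_train, twitter_train))

# Unigrams with stopwords removed; unnest_tokens() lowercases the text and
# strips punctuation. A profanity list would be filtered out in the same way.
unigrams <- combined %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# Bigrams and trigrams from the combined text
bigrams  <- combined %>% unnest_tokens(ngram, text, token = "ngrams", n = 2)
trigrams <- combined %>% unnest_tokens(ngram, text, token = "ngrams", n = 3)

# Unigrams with stopwords retained
unigrams_sw <- combined %>% unnest_tokens(word, text)
```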
Below are the top 30 tokens from each data set. The first three columns refer to the individual data sets, while the remaining columns refer to the combined data set. Note that the token just is the top-ranked unigram in the combined data set when stopwords are excluded, but only the 30th-ranked unigram when stopwords are included.
## news blogs twitter unigram bigram trigram
## 1 said one just just < 3 = = =
## 2 $ can like like right now > > >
## 3 = just get one last night < < <
## 4 one like love can feel like let us know
## 5 new time good get > > happy new year
## 6 two get thanks time looking forward happy mothers day
## 7 can know rt good happy birthday love < 3
## 8 year now can love new york happy mother's day
## 9 last people day now looks like new york city
## 10 first new now know can get $ $ $
## 11 also also one day let know cinco de mayo
## 12 time even know new first time new york times
## 13 just make u go good morning love love love
## 14 like first great see just got come see us
## 15 city day go people make sure 1 2 cup
## 16 people really time back years ago looking forward seeing
## 17 state see today think last week good morning everyone
## 18 years back lol great = = just got back
## 19 school well new make last year blah blah blah
## 20 get good see going good luck 1 4 cup
## 21 says us got thanks < < please let know
## 22 three think back really can see caps caps caps
## 23 now much think us even though two years ago
## 24 good way going today thanks follow cake cake cake
## 25 percent little people much 1 2 < 3 <
## 26 back love need got follow back 3 < 3
## 27 going go 3 well one day two weeks ago
## 28 many many happy want high school really looking forward
## 29 police two follow rt look like gt gt gt
## 30 1 going want first next week pop pop pop
## unigram_stopw
## 1 the
## 2 to
## 3 and
## 4 a
## 5 i
## 6 of
## 7 in
## 8 you
## 9 is
## 10 for
## 11 that
## 12 it
## 13 my
## 14 on
## 15 with
## 16 this
## 17 was
## 18 be
## 19 have
## 20 at
## 21 are
## 22 me
## 23 but
## 24 so
## 25 we
## 26 as
## 27 not
## 28 all
## 29 your
## 30 just
The table below shows summary statistics of the n-gram frequencies in each set of tokens (with the exception of the Lines column). The analysis focuses on n-grams within the combined text files.
## File Min Q1 Median Mean Q3 Max Total Lines
## 1 news 1 1 1 3.745821 3 780 63870 3091
## 2 blogs 1 1 1 11.818666 4 4906 776510 35936
## 3 twitter 1 1 1 12.117573 4 5824 642607 92093
## 4 combined unigrams 1 1 1 15.440000 4 9932 1482987 131120
## 5 combined bigrams 1 1 1 1.340000 1 1090 1352635 131120
## 6 combined trigrams 1 1 1 1.020000 1 258 1227177 131120
## 7 unigrams with stopwords 1 1 1 28.250000 4 115481 2719276 131120
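As a rough illustration, summaries like those above can be produced from the type frequency counts; the sketch below reuses the `unigrams` object assumed in the earlier tokenization sketch.

```r
# Sketch of how the frequency summaries could be computed from type counts
unigram_counts <- table(unigrams$word)   # frequency of each unigram type

freq_summary <- function(counts) {
  c(summary(as.numeric(counts)), Total = sum(counts))
}

freq_summary(unigram_counts)   # Min, Q1, Median, Mean, Q3, Max, Total
```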
For each data set we see that many n-grams appear only once, which tells us that most tokens appeared infrequently whether considered individually or in groups. However, the Max column indicates that the maximum frequency of an n-gram can be extremely large. In particular, note the large disparity between the maximum frequencies for unigrams with and without stopwords: the most common stopword (“the”, 115,481 occurrences) appeared roughly 12 times as often as the most common non-stopword (“just”, 9,932 occurrences). Stopwords are excluded in some forms of natural language processing, but it would be inappropriate to do so when predicting text given their relatively high probability of occurrence.
Also note that it is fairly common for unigrams to appear multiple times; bigrams and trigrams, however, typically appear only once in the text.
Four plots are provided below to visualize these relationships further. The histograms reinforce the summary statistics, showing that n-grams most commonly appear only once. The scatter plots give insight into how many unigram or trigram types the model would need in order to account for each possible n-gram. Recall that types represent unique tokens.
The final plot of individual tokens rises extraordinarily quickly, with the most common 10% of unique unigrams accounting for over 90% of all unigram occurrences. This supports the inclusion of stopwords in the model, as they will allow the model to account for an extremely high percentage of tokens.
However, the third plot of trigrams shows a much more gradual increase. This indicates that very few trigrams repeat, which corroborates the summary statistics.
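The coverage figure quoted for the final plot can be checked with a short calculation; the sketch below assumes the `unigrams_sw` object (unigrams with stopwords retained) from the earlier tokenization sketch.

```r
# Rough check of the cumulative-coverage claim: what share of all unigram
# occurrences is covered by the most frequent 10% of unique unigrams (types)?
unigram_sw_counts <- sort(table(unigrams_sw$word), decreasing = TRUE)
coverage <- cumsum(unigram_sw_counts) / sum(unigram_sw_counts)

coverage[ceiling(0.10 * length(coverage))]   # coverage of the top 10% of types
```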
Analyzing the sampled text demonstrates that, despite the expansive list of existing English words, we can plausibly attempt to represent the set of individual words a user is likely to choose with a relatively small set of words and symbols. At the same time, the analysis indicates that pairs and triples of words are fairly unique. As such, accurately predicting text based on previous words may require explicit references to these word groupings.
Practical limitations do not allow us to include every available English word. Therefore the model will need a method for handling tokens that are missing from its dictionary (e.g. misspellings, foreign words, etc.).
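One possible approach, shown here as an illustration rather than a design decision, is to map unseen tokens to a generic placeholder before prediction:

```r
# Illustrative sketch: map tokens missing from the model's dictionary to "<unk>"
normalize_tokens <- function(tokens, dictionary) {
  ifelse(tokens %in% dictionary, tokens, "<unk>")
}

normalize_tokens(c("hello", "helo", "bonjour"), dictionary = c("hello", "world"))
## [1] "hello" "<unk>" "<unk>"
```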
Note: Though profanity was excluded from the exploratory analysis, the model will include profanity when predicting text. Profanity is quite common, so excluding it could bias prediction results. Instead of removing it from the data, the app will likely filter the output.