The goal of the final project for this course is to build an app that predicts the next word as a user inputs text, much in the same way that SwiftKey and similar software predict text on mobile devices. For example, if the user inputs *on the other*, the software may predict *hand* and *side* as likely candidates for the next word.
These types of software use analyses of a large body of exemplar text, called a corpus, to make their predictions. For this course we were given access to a large corpus of text from heliohost.org to develop our apps. The corpus comprises the following text files, each containing data from a different source:
| File | Size (MB) | No. of entries | No. of words | Source |
|---|---|---|---|---|
| en_US.blogs.txt | 210.2 | 899,288 | 37,334,690 | blog posts |
| en_US.news.txt | 205.8 | 1,010,242 | 34,372,720 | news reports |
| en_US.twitter.txt | 167.1 | 2,360,148 | 30,374,206 | tweets |
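As an aside, here is a minimal sketch of how the entry and word counts above could be reproduced with base R and stringr, assuming the three files sit in the working directory:

```r
library(stringr)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
for (f in files) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  # one entry per line; count words as runs of non-whitespace characters
  cat(f, ":", length(lines), "entries,",
      sum(str_count(lines, "\\S+")), "words\n")
}
```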
The purpose of this document is to give a brief description of the corpus, to state the features I would like my final app to have, and to describe the basic procedure I plan to follow to implement them.
One of the most obvious differences between the sources in the corpus is entry length. The following boxplots summarise samples of 250 entries taken from each source. (This sample size was small enough to avoid over-plotting but sufficient to show the general features.)
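A minimal sketch of how such boxplots could be produced, assuming `blogs`, `news` and `tweets` are character vectors of entries (e.g. loaded with `readLines()` as above; the variable names are my own placeholders):

```r
set.seed(42)  # make the sampling reproducible
sample_lengths <- function(x, n = 250) nchar(sample(x, n))
boxplot(list(blogs   = sample_lengths(blogs),
             news    = sample_lengths(news),
             twitter = sample_lengths(tweets)),
        ylab = "Entry length (characters)")
```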
Blog entries and news articles in the corpus tend to be of similar length overall, but blog entries vary in length much more. Tweets, as expected, are much shorter and much more uniform in length.
Another interesting difference between the sources is word usage. The following wordclouds (created using the wordcloud package) were generated after ignoring letter case and removing 119 “stop-words”, that is, words that either occur frequently in English simply as a construct of the language or don’t convey much information about the content of the text, like a, the and however. (The list of stop-words I used is taken from textfixer.com.) The size and opacity of each word indicate its frequency in the source.
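A minimal sketch of how a cloud like these could be built, assuming `entries` is a character vector of text entries and `stop_words` a character vector holding the 119 stop-words (both names are placeholders):

```r
library(stringr)
library(wordcloud)

# lower-case the text, pull out word tokens, drop the stop-words
words <- unlist(str_extract_all(str_to_lower(entries), "[a-z']+"))
words <- words[!words %in% stop_words]

# frequency table, most frequent first, then plot the cloud
freq <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), max.words = 100)
```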
Text from the Twitter source (right) again stands out from the other two. It frequently contains informal communication tics (oh, wow), contractions (lol, u), casual incorrect usage (im, haha), signs of self-centeredness (i'm, thanks, love, know) and a smaller vocabulary (quite a few words have high relative frequencies). This is expected, since tweets are brief, informal and quickly put together, and usually have their authors, or something close to their authors, as a subject.
Text from the news source (centre) appears to sit at the other end of the spectrum, and text from the blogs (left) appears to fall somewhere in between, though closer to news.
The differences in word coverage (that is, the proportion of the text that can be reproduced with a given number of n-grams, or word sequences) between the sources are also interesting. The following graph illustrates the differences for the 10,000 most frequent words or word sequences, based on large samples (of 100,000 entries) from the corpus.
The graphs show, for example, that the sampled Twitter text requires a smaller set of word sequences to reach a given coverage level. This again illustrates the relatively smaller vocabulary and simpler language structures represented in the tweets, and the distinction between the Twitter data and the other two sources.
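A sketch of how a coverage curve like this could be computed from a sorted frequency table, reusing the hypothetical `freq` table from the wordcloud sketch (the same idea extends to 2-, 3- and 4-gram tables):

```r
# cumulative proportion of the text covered by the top k words
coverage <- cumsum(as.numeric(freq)) / sum(freq)

# e.g. how many distinct words are needed to cover half of the text?
which(coverage >= 0.5)[1]

plot(head(coverage, 10000), type = "l",
     xlab = "Number of most frequent words",
     ylab = "Proportion of text covered")
```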
My goal for this course is to produce a Shiny application that predicts the next word as the user inputs text, as described above.
Since a lot of this material is completely new to me, I’d like to create my app using only base R and the stringr package.
My ideas at the moment include:

- To base the predictions on tables of frequent n-grams (word sequences of length n) extracted from the corpus, for example:
| Frequent 2-grams | Frequent 3-grams | Frequent 4-grams |
|---|---|---|
| of the | one of the | the end of the |
| in the | a lot of | at the end of |
| to the | some of the | the rest of the |
| on the | the end of | at the same time |
| to be | out of the | to be able to |
- To use a “back-off” procedure (that is, a procedure that tries to predict text based on higher-order n-grams and resorts to lower-order n-grams in case of failure) to implement the predictions. A minimal sketch of this idea follows the list.
- To remove higher-order n-grams that don’t improve on the lower-order n-grams’ predictions. For example, the is a very frequent word, so the bigram of the, which suggests that of should be followed by the, may be somewhat redundant to keep. I will have to be careful to remove only those n-grams for which a back-off procedure produces the same or similar predictions as a database that retained them.
- To remove from the prediction algorithm the very few terms that may only be used offensively, as well as spelling mistakes, grammar mistakes and contractions. Such things occur frequently, especially in the Twitter source. I may also give more weight to n-grams extracted from the blogs and news sources than to those from the Twitter source, to make trustworthy predictions more likely.
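As promised above, here is a minimal sketch of the back-off idea in base R and stringr. The data frame `ngrams`, with columns `prefix` (the first n-1 words of each n-gram), `next_word` and `count`, is a hypothetical structure of my own, not a finished implementation:

```r
library(stringr)

predict_next <- function(input, ngrams, max_order = 4) {
  words <- str_split(str_to_lower(str_trim(input)), "\\s+")[[1]]
  for (n in seq(max_order - 1, 1)) {  # longest prefix first: 3, 2, 1 words
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- ngrams[ngrams$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # return up to three most frequent continuations
      return(head(hits$next_word[order(hits$count, decreasing = TRUE)], 3))
    }
  }
  c("the", "to", "and")  # last resort: frequent unigrams from the corpus
}

# e.g. predict_next("on the other", ngrams) first looks for 4-grams
# beginning "on the other", then 3-grams beginning "the other", and so on.
```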
Thanks for your time!