The goal of the final project for this course is to build an app that predicts the next word as a user inputs text, much in the same way that SwiftKey and similar software predict text on mobile devices. For example, if the user inputs *on the other*, the software may predict *hand* and *side* as likely candidates for the next word.
These types of software use analyses of a large body of exemplar text, called a corpus, to make their predictions. For this course we were given access to a large corpus of text from heliohost.org to develop our apps. The corpus comprises the following text files, each containing data from a different source:
| File | Size (MB) | No. of entries | No. of words | Source |
|---|---|---|---|---|
| en_US.blogs.txt | 210.2 | 899,288 | 37,334,690 | blog posts |
| en_US.news.txt | 205.8 | 1,010,242 | 34,372,720 | news reports |
| en_US.twitter.txt | 167.1 | 2,360,148 | 30,374,206 | tweets |
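As an aside, here is a minimal sketch of how the entry and word counts above could be reproduced with base R and stringr, assuming the three files sit in the working directory:

```r
library(stringr)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
for (f in files) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  # one entry per line; count words as runs of non-whitespace characters
  cat(f, ":", length(lines), "entries,",
      sum(str_count(lines, "\\S+")), "words\n")
}
```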
The purpose of this document is to give a brief description of the corpus, to state the features I would like my final app to have, and to describe the basic procedure I plan to follow to implement them.
One of the most obvious differences between the sources in the corpus is entry length. The following boxplots summarise samples of 250 entries taken from each source. (This sample size was small enough to avoid over-plotting but sufficient to show the general features.)
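A minimal sketch of how such boxplots could be produced, assuming `blogs`, `news` and `tweets` are character vectors of entries (e.g. loaded with `readLines()` as above; the variable names are my own placeholders):

```r
set.seed(42)  # make the sampling reproducible
sample_lengths <- function(x, n = 250) nchar(sample(x, n))
boxplot(list(blogs   = sample_lengths(blogs),
             news    = sample_lengths(news),
             twitter = sample_lengths(tweets)),
        ylab = "Entry length (characters)")
```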
Blog entries and news articles in the corpus tend to be of similar length overall, but blog entries vary in length much more. Tweets, as expected, are much shorter and much more uniform in length.
Another interesting difference between the sources is word usage. The following wordclouds (created using the wordcloud package) were generated after ignoring letter case and removing 119 “stop-words”, that is, words that either occur frequently in English simply as a construct of the language or don’t convey much information about the content of the text, like a, the and however. (The list of stop-words I used is taken from textfixer.com.) The size and opacity of each word indicate its frequency in the source.
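A minimal sketch of how a cloud like these could be built, assuming `entries` is a character vector of text entries and `stop_words` a character vector holding the 119 stop-words (both names are placeholders):

```r
library(stringr)
library(wordcloud)

# lower-case the text, pull out word tokens, drop the stop-words
words <- unlist(str_extract_all(str_to_lower(entries), "[a-z']+"))
words <- words[!words %in% stop_words]

# frequency table, most frequent first, then plot the cloud
freq <- sort(table(words), decreasing = TRUE)
wordcloud(names(freq), as.numeric(freq), max.words = 100)
```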
Text from the Twitter source (right) again stands out from the other two. It frequently contains informal communication tics (oh, wow), contractions (lol, u), casual incorrect usage (im, haha), signs of self-centeredness (i'm, thanks, love, know) and a smaller vocabulary (quite a few words have high relative frequencies). This is expected, since tweets are brief, informal and quickly put together, and usually have their authors, or something close to their authors, as a subject.
Text from the news source (centre) appears to sit at the other end of the spectrum, and text from the blogs (left) appears to fall somewhere in between, though closer to news.
The differences in word coverage (that is, the proportion of the text that can be reproduced with a given number of n-grams, or word sequences) between the sources are also interesting. The following graph illustrates the differences for the 10,000 most frequent words or word sequences, based on large samples (of 100,000 entries) from the corpus.
The graphs show, for example, that the sampled Twitter text requires a smaller set of word sequences to reach a given coverage level. This again illustrates the relatively smaller vocabulary and simpler language structures represented in the tweets, and the distinction between the Twitter data and the other two sources.
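A sketch of how a coverage curve like this could be computed from a sorted frequency table, reusing the hypothetical `freq` table from the wordcloud sketch (the same idea extends to 2-, 3- and 4-gram tables):

```r
# cumulative proportion of the text covered by the top k words
coverage <- cumsum(as.numeric(freq)) / sum(freq)

# e.g. how many distinct words are needed to cover half of the text?
which(coverage >= 0.5)[1]

plot(head(coverage, 10000), type = "l",
     xlab = "Number of most frequent words",
     ylab = "Proportion of text covered")
```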
My goal for this course is to produce a Shiny application that predicts the next word as the user inputs text, as described above.
Since a lot of this material is completely new to me, I’d like to create my app using only base R and the stringr package.
My ideas at the moment include:

- To base the predictions on tables of frequent n-grams (word sequences of length n) extracted from the corpus, for example:
| Frequent 2-grams | Frequent 3-grams | Frequent 4-grams |
|---|---|---|
| of the | one of the | the end of the |
| in the | a lot of | at the end of |
| to the | some of the | the rest of the |
| on the | the end of | at the same time |
| to be | out of the | to be able to |
- To use a “back-off” procedure (that is, a procedure that tries to predict text based on higher-order n-grams and resorts to lower-order n-grams in case of failure) to implement the predictions. A minimal sketch of this idea follows the list.
- To remove higher-order n-grams that don’t improve on the lower-order n-grams’ predictions. For example, the is a very frequent word, so the bigram of the, which suggests that of should be followed by the, may be somewhat redundant to keep. I will have to be careful to remove only those n-grams for which a back-off procedure produces the same or similar predictions as a database that retained them.
- To remove from the prediction algorithm the very few terms that may only be used offensively, as well as spelling mistakes, grammar mistakes and contractions. Such things occur frequently, especially in the Twitter source. I may also give more weight to n-grams extracted from the blogs and news sources than to those from the Twitter source, to make trustworthy predictions more likely.
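As promised above, here is a minimal sketch of the back-off idea in base R and stringr. The data frame `ngrams`, with columns `prefix` (the first n-1 words of each n-gram), `next_word` and `count`, is a hypothetical structure of my own, not a finished implementation:

```r
library(stringr)

predict_next <- function(input, ngrams, max_order = 4) {
  words <- str_split(str_to_lower(str_trim(input)), "\\s+")[[1]]
  for (n in seq(max_order - 1, 1)) {  # longest prefix first: 3, 2, 1 words
    if (length(words) < n) next
    prefix <- paste(tail(words, n), collapse = " ")
    hits <- ngrams[ngrams$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # return up to three most frequent continuations
      return(head(hits$next_word[order(hits$count, decreasing = TRUE)], 3))
    }
  }
  c("the", "to", "and")  # last resort: frequent unigrams from the corpus
}

# e.g. predict_next("on the other", ngrams) first looks for 4-grams
# beginning "on the other", then 3-grams beginning "the other", and so on.
```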
Thanks for your time!