Introduction

The goal of the final project for this course is to build an app that predicts the next word as a user inputs text, much in the same way that SwiftKey and similar software predict text on mobile devices. For example, if the user inputs "on the other", the software might predict "hand" and "side" as likely candidates for the next word.

These types of software use analyses of a large body of exemplar text, called a corpus, to make their predictions. For this course we were given access to a large corpus of text from heliohost.org to develop our apps. The corpus includes the following files of English text from three sources:

File                Size (MB)   No. of entries   No. of words   Source
en_US.blogs.txt         210.2          899,288     37,334,690   blog posts
en_US.news.txt          205.8        1,010,242     34,372,720   news reports
en_US.twitter.txt       167.1        2,360,148     30,374,206   Twitter
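The counts above can be reproduced with base R alone. The following is a minimal sketch, assuming the three files sit in a local final/en_US directory (a hypothetical path) and treating any run of whitespace as a word separator:

    count_file <- function(path) {
      # read the raw entries; skipNul avoids problems with embedded nulls
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      # approximate word count: split each entry on whitespace and sum the pieces
      words <- sum(lengths(strsplit(lines, "\\s+")))
      c(size_mb = round(file.size(path) / 1e6, 1),
        entries = length(lines),
        words   = words)
    }

    files <- file.path("final/en_US",
                       c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
    sapply(files, count_file)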


The purpose of this document is to give a brief description of the corpus, to state my goals for the features of the final app, and to outline the basic procedure I plan to follow to implement those features.

Comparing the text data by source

Entry lengths

One of the most obvious differences between the sources in the corpus is entry length. The following boxplots summarise samples of 250 entries taken from each of the sources. (This small sample size was sufficient to show the general features and small enough to avoid over-plotting.)
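A sketch of the sampling and plotting step is below. It assumes the three sources have already been read into character vectors named blogs, news and twitter (hypothetical names), and uses only base R graphics:

    set.seed(123)   # arbitrary seed, just for reproducibility of the sample
    sample_lengths <- function(x, n = 250) {
      entries <- sample(x, n)
      lengths(strsplit(entries, "\\s+"))   # words per sampled entry
    }

    len <- list(blogs   = sample_lengths(blogs),
                news    = sample_lengths(news),
                twitter = sample_lengths(twitter))

    boxplot(len,
            ylab = "Words per entry",
            main = "Entry lengths by source (250 sampled entries each)")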

Blog entries and news articles in the corpus tend to be of similar length overall, but blog entry lengths vary much more. Tweets, as expected, are much shorter and much more uniform in length.

Word frequencies with stop-words removed

Another interesting difference between the sources is word usage. The following wordclouds (created using the wordcloud package) were produced after ignoring letter case and removing 119 "stop-words", that is, words that either occur frequently in English simply as constructs of the language or convey little information about the content of the text, such as a, the and however. (The list of stop-words I used is taken from textfixer.com.) The size and opacity of each word indicate its frequency in the source.
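A sketch of how one such wordcloud can be built is below, using the Twitter source as an example. It assumes twitter holds the Twitter entries and stop_words is the character vector of 119 stop-words (both hypothetical names):

    library(wordcloud)

    # lower-case, split on anything that is not a letter or apostrophe
    words <- unlist(strsplit(tolower(twitter), "[^a-z']+"))
    # drop empty strings and stop-words, then tabulate frequencies
    words <- words[nchar(words) > 0 & !(words %in% stop_words)]
    freq  <- sort(table(words), decreasing = TRUE)

    wordcloud(names(freq), as.numeric(freq),
              max.words = 100, random.order = FALSE)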

Text from the Twitter source (right) again stands out from the other two. It frequently contains informal interjections (oh, wow), contractions and abbreviations (lol, u), casual misspellings (im, haha), self-referential words (i'm, thanks, love, know) and a smaller vocabulary (quite a few words have high relative frequencies). This is expected, since tweets are usually brief, informal, quickly composed and centred on the authors or their immediate surroundings.

Text from the news source (centre) sits at the other end of this spectrum, and text from the blogs (left) falls somewhere in between, though closer to the news.

Word coverage

The differences in word coverage (that is, the proportion of the text that can be reproduced with a given number of words or n-grams) between the sources are also interesting. The following graph illustrates these differences for the 10,000 most frequent words or word sequences, based on large samples (100,000 entries) from each source.
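The coverage calculation itself is simple. The sketch below shows the single-word case for one source, assuming twitter_sample is a sample of 100,000 entries from the Twitter file (a hypothetical name):

    # tokenise and tabulate word frequencies, most frequent first
    words <- unlist(strsplit(tolower(twitter_sample), "[^a-z']+"))
    freq  <- sort(table(words[nchar(words) > 0]), decreasing = TRUE)

    # cumulative proportion of the sample covered by the top n words
    coverage <- cumsum(freq) / sum(freq)
    plot(head(coverage, 10000), type = "l",
         xlab = "Number of most frequent words",
         ylab = "Proportion of sample covered")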

The graph shows, for example, that the sampled Twitter text requires a smaller set of words or word sequences for a given level of coverage. This again illustrates the relatively small vocabulary and simpler language structures represented in the tweets, and the distinction between the Twitter data and the other two sources.

Goals and plan

My goal for this course is to produce a Shiny application that predicts the next word as the user inputs text, in the manner described in the introduction.

Since a lot of this material is completely new to me, I would like to create the app using only base R and the stringr package.

My ideas at the moment centre on building n-gram frequency tables from the corpus, such as the ones summarised below (a sketch of how these tables can be built follows the lists):

Frequent 2-grams:
    of the
    in the
    to the
    on the
    to be

Frequent 3-grams:
    one of the
    a lot of
    some of the
    the end of
    out of the

Frequent 4-grams:
    the end of the
    at the end of
    the rest of the
    at the same time
    to be able to
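The following is a minimal sketch of how such frequency tables can be tabulated with base R and stringr. It assumes text is a character vector of corpus entries (a hypothetical name):

    library(stringr)

    ngram_freq <- function(text, n, top = 5) {
      # lower-case and split each entry into words
      words <- str_split(str_to_lower(text), "\\s+")
      # slide a window of length n over each entry and paste the pieces
      grams <- unlist(lapply(words, function(w) {
        if (length(w) < n) return(character(0))
        sapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "))
      }))
      head(sort(table(grams), decreasing = TRUE), top)
    }

    ngram_freq(text, 2)   # e.g. "of the", "in the", ...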

Thanks for your time!