Introduction

Mobile devices have become indispensable everyday companions at home and at work, used to socialize, play and do business. But because they lack a full-size keyboard, text entry on touch screen devices in particular can be cumbersome. Automated text prediction aims to alleviate this by using the text entered so far to predict the next word.

This report presents an exploratory analysis of a large body of text, carried out to design a text prediction system that could run efficiently on a mobile device.

R, RStudio and the R package tm for text mining were used to perform the analysis. All analysis code is included in the report's R markdown source document.

Understanding the problem

Natural Language Processing (NLP) is the field of computational linguistics concerned with the interactions between computers and human (natural) language (Wikipedia, https://en.wikipedia.org/wiki/Natural_language_processing). It provides approaches and tools for a range of language processing tasks, among which are identifying the structure and meaning of texts and, the focus of this report, generating natural language.

Natural Language Processing involves a number of activities (see Mining the Social Web, Second Edition, by Matthew A. Russell):

  1. End of sentence detection. The text is broken up into a collection of meaningful sentences.

  2. Tokenization. Individual sentences are split into tokens, typically words and markers such as start-of-sentence and end-of-sentence.

  3. Part-of-speech tagging. Tokens are assigned part-of-speech information: noun, verb, etc.

  4. Chunking. Logical concepts are derived from the tagged tokens within a sentence.

  5. Extraction. Named entities (people, locations, events, etc) are extracted from each chunk.

These steps should not be applied rigidly, however. Some steps may be less relevant to achieving the desired NLP objective. In particular, the derivation of meaning or concepts is not a prerequisite for constructing a predictive text model.

It may be expected that the accuracy of a predictive text model primarily depends on the number of unique words available in the original body of text. Complementary data sources may include profanity lists (assuming that the prediction of such words is to be avoided), stop word lists, named entities (sports, cities, states, presidents), synonyms (the WordNet database) and jargon dictionaries.

Common issues in the analysis of text data are the use of colloquial language and punctuation, as well as misspellings and wrong-word usages (bye vs by). Especially on social media, people often use non-words and acronyms (lol) that form a language of their own.

Data acquisition and cleaning

Predictive text modeling using NLP generally follows the same approach to data as taught in the Data Science Specialization: the data is obtained, cleaned and explored before moving to the predictive modeling stage, which uses a training, validation and test set, and finally to text prediction itself.

The source data in the project at hand consists of a set of files taken from the set of corpora provided by HC Corpora (http://www.corpora.heliohost.org). The readme file at http://www.corpora.heliohost.org/aboutcorpus.html provides details on the available corpora. Besides text, the files include metadata for Main Website, Date, Entry Type and Subject.

The Coursera dataset analyzed for this report was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset files contain text but lack the additional metadata mentioned above. Although the files in the dataset have been filtered for language, they may still contain some foreign language words.
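The snippet below sketches how the dataset could be obtained and unpacked; the local paths are illustrative.

```r
# Download and unpack the Coursera-SwiftKey dataset (local paths are illustrative)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("data/Coursera-SwiftKey.zip")) {
  dir.create("data", showWarnings = FALSE)
  download.file(url, destfile = "data/Coursera-SwiftKey.zip", mode = "wb")
  unzip("data/Coursera-SwiftKey.zip", exdir = "data")
}
```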

Dataset summary

The analysis is performed on the files containing American English texts. The table below gives an overview of file size, in-memory object size, line count, maximum line length and word count.

| File | File Size (MB) | Object Size (MB) | Lines | Max length | Words |
|------|---------------:|-----------------:|------:|-----------:|------:|
| ./data/final/en_US/en_US.blogs.txt | 200.40 | 248.50 | 899288 | 40835 | 37334114 |
| ./data/final/en_US/en_US.news.txt | 196.30 | 19.20 | 1010242 | 5760 | 34365936 |
| ./data/final/en_US/en_US.twitter.txt | 159.40 | 301.40 | 2360148 | 213 | 30359852 |


All files are quite large (File Size) and appear to require significant memory when loaded (Object Size). As this is before any processing, it may be necessary to sample the files rather than use them in full, to avoid hitting memory and processing constraints.

Whereas the number of lines varies (Lines), the number of words is approximately the same for each file (Words). Taking equal samples of each file would ensure that the word frequency distribution of the combined corpus is not skewed.
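As an illustration, a fixed-percentage sample of lines could be drawn as follows; the 5 percent rate, seed and file path are just examples.

```r
# Read a file and keep a fixed percentage of its lines (a sketch)
sample_file <- function(path, rate = 0.05, seed = 1234) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  set.seed(seed)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

blogs_sample <- sample_file("./data/final/en_US/en_US.blogs.txt")
```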

File content

A few sample lines from each file are shown below.

./data/final/en_US/en_US.blogs.txt

[1] “The exact same spot we see in the video from 13 hours ago — near Nashville Tennessee — only now, instead of a FLASH on the RADAR — we’re seeing real life possible tornadoes develop.”
[2] “This was intended to be wearable, but really it is just a muslin. I don’t usually make muslins. Unless I intend them to be wearable. I did a few alterations. Same old square shoulder, lower back neckline that I always do. The bodice fits well. Darts are in the right place. I like the subtle sweetheart neckline.”
[3] “Thanks for stopping bye, come back soon as I unleash my new style for my Creative biz, Orangalicious Creations.”
[4] “Flowers, butterflies and Ladybirds are all Craft for occasion embellishments”
[5] “This simple scenario sets the Fellowship against four (yeah, 4!) little goblins. The thing is to teach you how to move and shoot with your figures. The Fellowship has to cross the table and escape through the door. In my games the guys mostly just killed the little bastards and ran to safety, covering the hobbits. Not a chance to win for them really.”


./data/final/en_US/en_US.news.txt

[1] “Wrestling begins with semifinals at 11 this morning, with third-place matches at 1 p.m. The finals are slated for 6:30 p.m.”
[2] “LeGrand became an inspiration to the Scarlet Knights, eventually being able to stand upright with the help of a metal frame. He resumed his studies via video conferences for the 2011 spring semester, and on Oct. 29, 2011, led the team onto the field before a game. He also has done some broadcast work for the school.”
[3] “President Serge Sarkisian’s party has won a majority of seats in a parliamentary election seen as a test of support ahead of next year’s presidential vote.”
[4] “DNA is a critical component in the death of 6-year-old JonBenét at her home in Boulder a decade ago. Investigators analyzed two spots found in JonBenét’s underwear, but DNA tests done in 1997 and 1999 only narrowed the genetic material believed to be in saliva in the blood spots to a male outside the Ramsey family.”
[5] “It didn’t take Hendry long to find a new one. He was courted by numerous organizations, but the Yankees gig was too perfect to pass up. Call it the job that had the most variety. That included a bit more family time than the Cubs GM seat allowed.”


./data/final/en_US/en_US.twitter.txt

[1] “Nice. Great card.” “yeah, we did it years ago. haven’t implemented much which prolly speaks volumes to our culture.”
[3] “"Do Something. If it works, do more of it. If it doesn’t, do something else." – Franklin D. Roosevelt” “Great pics…that’s the same smile from the Met Life commercial”
[5] “They’re only showing it at arthouse theatREs, and you have to have a beard to buy a ticket. Sorry.”


All files contain lines that consist of multiple sentences. The character sets may include accented letters as used in non-English languages. Sentences may include punctuation such as single or double quotes surrounding words, and words may contain contractions (“don’t” as a shorthand for “do not”). End-of-sentence punctuation may also appear mid-line, as seen in the News and Blogs files. Cleaning the body of text is therefore necessary.

Tokenization and filtering

Tokenization is the process of cutting the text into pieces: from a body of text into sentences and words. We will remove punctuation characters such as commas, parentheses and the like, under the assumption that they have little impact on word order and n-gram composition. Numeric characters, which may appear in numbers and dates, are also removed. The combination of certain numbers and words can have predictive value, but their frequency is likely too low to be useful.

Before building the word frequency tables (explained in the next section), all text is converted to lowercase. As a consequence the prediction model will predict lowercase words only. For English text this is less of an issue than for a language like German, which capitalizes all nouns. In such a case an additional tokenization step is needed that categorizes word classes (noun, verb, etc.), allowing the character case to be restored selectively.
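A minimal sketch of these cleaning steps using the tm package is shown below; the corpus is built from the sampled lines of the earlier sketch, and the object names are illustrative.

```r
library(tm)

# Build a corpus from the sampled lines and apply the basic cleaning steps
corpus <- VCorpus(VectorSource(blogs_sample))
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase all text
corpus <- tm_map(corpus, removePunctuation)             # drop commas, parentheses, quotes
corpus <- tm_map(corpus, removeNumbers)                 # drop digits in numbers and dates
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated whitespace
```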

Initially we will not treat typos, garbage words and foreign-language words any differently from “proper” words, under the assumption that they represent a minor portion of the text. Profanity will be removed. For this we use the lists of terms that could be found offensive available from http://www.bannedwordlist.com/lists/swearWords.txt (around 75 terms) and http://www.cs.cmu.edu/~biglou/resources/bad-words.txt (around 1,300 terms).
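Continuing the cleaning sketch above, profanity filtering could be applied with tm's removeWords and one of the published lists; the choice of list here is only an example.

```r
# Read one of the profanity lists mentioned above and remove those words from the corpus
profanity <- readLines("http://www.cs.cmu.edu/~biglou/resources/bad-words.txt",
                       warn = FALSE, skipNul = TRUE)
corpus <- tm_map(corpus, removeWords, profanity)
```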

The planned model will use word sequences of up to four words to predict the next word.

Exploratory analysis

Every language has certain word sequences that occur often. In the field of computational linguistics they are known as n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech (Wikipedia, https://en.wikipedia.org/wiki/N-gram). Text prediction systems use n-grams to predict the next word based on the probability of its occurrence in the language’s n-grams.
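As an illustration, n-grams can be generated from a vector of tokens with a small helper function; this simple base R version is a sketch, not necessarily the exact tokenizer used for the analysis.

```r
# Build n-grams from a character vector of words (simple base R sketch)
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

make_ngrams(c("thanks", "for", "the", "follow"), 2)
# "thanks for" "for the"   "the follow"
```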

In Natural Language Processing, language coverage refers to the fraction of all word (or n-gram) occurrences in a text that is accounted for by a given set of unique words (or n-grams). If more unique words are included, coverage increases. Words that occur often have a higher impact on coverage than words that occur rarely.
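As a concrete illustration, the number of unique words needed to reach a given coverage can be computed from a word frequency table; the 90 percent target below is just an example.

```r
# Number of unique words needed to reach a target coverage (a sketch)
# freq: named vector of word counts
words_for_coverage <- function(freq, target = 0.9) {
  freq <- sort(freq, decreasing = TRUE)     # most frequent words first
  cum_cover <- cumsum(freq) / sum(freq)     # cumulative share of all word instances
  which(cum_cover >= target)[1]             # unique words needed for the target coverage
}
```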

To determine how many unique words and n-grams are required for a certain percentage of coverage, several scenarios are analyzed. The scenario parameter analyzed for its impact on coverage is the dataset sample size, which is varied between 1%, 5% and 10% of the original dataset size.

N-grams and frequency

The three sample size scenarios are analyzed for n-grams with n between 1 and 4.

The table below shows the number of unique n-grams in each training scenario (unique.ngrams), the associated total number of n-gram occurrences (total.instances), and the ratio of total to unique. A low ratio means that a large share of the n-grams occur only once in the sample. From the perspective of prediction accuracy we would likely want to include as many n-grams as possible, whereas model efficiency drives us to drop the single-occurrence n-grams and keep only the n-grams that occur more than once. The increase in the ratio from one sample size to the next shows whether enlarging the sample helps to identify additional frequent n-grams; a stable or dropping ratio means that mostly new single-occurrence n-grams were added.

| scenario | ngram | unique.ngrams | total.instances | ratio |
|----------|------:|--------------:|----------------:|------:|
| Train 0.01 | 1 | 27003 | 166392 | 6.20 |
| Train 0.01 | 2 | 154488 | 289119 | 1.90 |
| Train 0.01 | 3 | 234202 | 265732 | 1.10 |
| Train 0.01 | 4 | 238523 | 243519 | 1.00 |
| Train 0.05 | 1 | 67866 | 865776 | 12.80 |
| Train 0.05 | 2 | 579866 | 1498790 | 2.60 |
| Train 0.05 | 3 | 1072093 | 1374479 | 1.30 |
| Train 0.05 | 4 | 1192041 | 1256737 | 1.10 |
| Train 0.10 | 1 | 100014 | 1731966 | 17.30 |
| Train 0.10 | 2 | 994992 | 2997573 | 3.00 |
| Train 0.10 | 3 | 2005139 | 2748604 | 1.40 |
| Train 0.10 | 4 | 2331943 | 2512488 | 1.10 |


The charts below show the twelve most frequently occurring n-grams in the scenario Train 0.01 (sample size = 1 percent). There are many stop words among the single words. Bigrams and trigrams also contain a good proportion of stop word sequences. N-grams that include stop words will be kept in the corpus as they can enlarge the model’s predictive capability.

Language coverage

The charts below show coverage (as a percentage of the scenario dataset size) against the number of unique n-grams in the sample, for each of the three sample scenarios. On the x-axis the n-grams are sorted in descending order of occurrence.

As expected, the high frequency n-grams have a large impact on coverage, so the coverage curve rises steeply at first. As soon as the x-axis reaches the single-occurrence n-grams the curve becomes linear.

For larger training sets it takes more words to reach a given coverage, but this difference quickly disappears for the higher-order n-grams. For n = 4 and higher the curves are almost entirely linear from the start, which suggests that increasing the sample size has diminishing returns.
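The coverage curves above could be produced along these lines; cover_df is an assumed data frame holding, per scenario and n-gram order, the n-gram rank and the cumulative coverage.

```r
library(ggplot2)

# cover_df is assumed to have columns: scenario, ngram, rank, coverage
ggplot(cover_df, aes(x = rank, y = coverage, colour = scenario)) +
  geom_line() +
  facet_wrap(~ ngram, scales = "free_x") +
  labs(x = "Unique n-grams (sorted by descending frequency)",
       y = "Coverage (% of scenario dataset)")
```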

The n-gram frequency distributions are long-tailed. The table below shows the split between n-grams that occur once (singles) and more than once (multiples). In the 1 percent sample scenario there are some 12,000 unique words that appear multiple times and some 15,000 unique words that appear just once, i.e. 55 percent of the unique words are singles. For 2-grams, 81 percent appear just once. Increasing the sample size to 10 percent offers only a marginal improvement, and only for the higher-order n-grams. A minimal predictive model that includes all higher-order n-grams appearing more than once would contain approximately 530,000 n-grams.

| scenario | ngram | unique.ngrams | multiples | singles | singles.pct |
|----------|------:|--------------:|----------:|--------:|------------:|
| Train 0.01 | 1 | 27003 | 12067 | 14936 | 55 |
| Train 0.01 | 2 | 154488 | 28319 | 126169 | 81 |
| Train 0.01 | 3 | 234202 | 14536 | 219666 | 93 |
| Train 0.01 | 4 | 238523 | 3331 | 235192 | 98 |
| Train 0.05 | 1 | 67866 | 31266 | 36600 | 53 |
| Train 0.05 | 2 | 579866 | 127984 | 451882 | 77 |
| Train 0.05 | 3 | 1072093 | 99106 | 972987 | 90 |
| Train 0.05 | 4 | 1192041 | 33778 | 1158263 | 97 |
| Train 0.10 | 1 | 100014 | 45168 | 54846 | 54 |
| Train 0.10 | 2 | 994992 | 233709 | 761283 | 76 |
| Train 0.10 | 3 | 2005139 | 211203 | 1793936 | 89 |
| Train 0.10 | 4 | 2331943 | 84371 | 2247572 | 96 |
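A sketch of how the singles/multiples split in this table can be derived from an n-gram frequency table; the column names are illustrative.

```r
# ngram_freq: data frame with columns ngram and count
split_singles <- function(ngram_freq) {
  singles   <- sum(ngram_freq$count == 1)
  multiples <- sum(ngram_freq$count > 1)
  data.frame(unique.ngrams = nrow(ngram_freq),
             multiples     = multiples,
             singles       = singles,
             singles.pct   = round(100 * singles / nrow(ngram_freq)))
}
```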

Modeling and prediction

Building the n-gram tables

The basic n-gram model will use n-grams of one to four words to predict the next word.

The first task consists of generating the n-grams and frequencies from the sampled “training” dataset.

The larger the sample dataset, the more time and memory space it takes to generate the n-grams, especially for n > 2. Instead of processing the entire sample at once, the n-gram generation algorithm will process the files in pieces of 1,000 lines, build n-gram frequencies, and then combine the individual n-gram frequency tables into a single table, summarize and order the n-gram table by decreasing frequency. The process is repeated for each n. In this way a laptop with 8 GB of memory processes some 2,000 lines per minute, generating a 10 percent sample training set in about 3 hours. Unfortunately even this approach hits the memory limit when attempting to generate a larger 20 percent sample training set.
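A simplified sketch of this chunked approach is shown below; it reuses the make_ngrams helper sketched earlier and a basic regular-expression tokenizer, so it illustrates the mechanism rather than the exact implementation.

```r
# Process a large file in chunks of 1,000 lines and accumulate n-gram frequencies
ngram_frequencies <- function(path, n, chunk_size = 1000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  total <- integer(0)
  repeat {
    lines <- readLines(con, n = chunk_size, skipNul = TRUE)
    if (length(lines) == 0) break
    # Tokenize each line separately so n-grams do not span lines
    grams <- unlist(lapply(strsplit(tolower(lines), "[^a-z']+"),
                           function(w) make_ngrams(w[w != ""], n)))
    if (length(grams) == 0) next
    freq  <- table(grams)
    total <- c(total, freq)                    # append this chunk's counts
    total <- tapply(total, names(total), sum)  # merge duplicate n-grams across chunks
  }
  sort(total, decreasing = TRUE)
}
```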

The n-gram tables are then processed and split into the leading (n-1)-gram and the final word, corresponding to the predictor and the predicted word, respectively. Next the n-gram conditional probabilities are computed using the formula for maximum likelihood: \[ Pr(w_{i}|w_{i-1}) = \frac {count(w_{i-1},w_{i})} {count(w_{i-1})} \] where \(w_{i}\) is the last word and \(w_{i-1}\) denotes the preceding (n-1)-word prefix.
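A sketch of this step is given below; it takes a named vector of n-gram counts (such as the output of the chunked generation sketch) and produces the prefix, predicted word and maximum likelihood estimate. Column names are illustrative.

```r
# Split n-grams into prefix and last word, and compute maximum likelihood estimates
ngram_table <- function(freq) {
  ngrams <- names(freq)
  counts <- as.integer(freq)
  prefix <- sub(" [^ ]+$", "", ngrams)   # all but the last word: the predictor
  word   <- sub("^.* ", "", ngrams)      # the last word: the predicted word
  data.frame(prefix = prefix, word = word, count = counts,
             prob = counts / ave(counts, prefix, FUN = sum),  # Pr(word | prefix)
             stringsAsFactors = TRUE)    # character strings stored as factors
}
```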

The n-grams are stored in an R data frame, with character strings coded as factors. This should enable an easy and fast match by the prediction algorithm.

Prediction

The prediction algorithm takes the entered text, cleans it, and extracts the preceding one to n-1 words. It then uses a simple backoff approach combined with weighting to build a list of probable next words, as sketched after the list below:

  • For each n, select the top 6 matches based on n-gram frequency;
  • Combine the results;
  • Sort by maximum likelihood (high to low);
  • The top items form the prediction.
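A minimal sketch of this backoff lookup is shown below, assuming the per-n tables produced by ngram_table are stored in a list ngram_tables indexed by n; the function and object names are illustrative.

```r
# Predict the next word from the last few words of the cleaned input (a sketch)
predict_next <- function(words, ngram_tables, top = 6) {
  candidates <- list()
  for (n in seq(length(ngram_tables), 2)) {           # back off from the highest n down to 2
    prefix <- paste(tail(words, n - 1), collapse = " ")
    tbl    <- ngram_tables[[n]]
    hits   <- tbl[as.character(tbl$prefix) == prefix, ]
    candidates[[n]] <- head(hits[order(-hits$count), ], top)  # top matches by frequency
  }
  combined <- do.call(rbind, candidates)
  combined <- combined[order(-combined$prob), ]        # sort by maximum likelihood
  unique(as.character(combined$word))
}
```

The weighting mentioned above could be added by multiplying prob with a per-n weight that favours longer matches before sorting.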

Accuracy

The lines of text in the dataset that were not included in the training set are used to test the model. The test selects lines at random and splits each sentence into n-grams. The next-word prediction is made for each (n-1)-gram, and the predicted word is compared with word n. The accuracy is the ratio of correct predictions to total predictions.
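A sketch of this test loop follows; it reuses the earlier prediction sketch and assumes a test counts as correct when the top-ranked prediction equals the actual next word.

```r
# Estimate the next-word test error on held-out lines (a sketch)
test_accuracy <- function(test_lines, ngram_tables, n = 4) {
  correct <- 0; total <- 0
  for (line in test_lines) {
    words <- unlist(strsplit(tolower(line), "[^a-z']+"))
    words <- words[words != ""]
    if (length(words) < n) next
    for (i in seq(n, length(words))) {
      prediction <- predict_next(words[(i - n + 1):(i - 1)], ngram_tables)
      correct <- correct + as.integer(length(prediction) > 0 && prediction[1] == words[i])
      total   <- total + 1
    }
  }
  c(nr.tests = total, test.error = round(1 - correct / total, 2))
}
```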

The test is done for the different training set sample sizes.

The table below shows the validation result for ten and twenty lines sampled from the test set.

| trainSample | lines | nr.tests | test.error |
|------------:|------:|---------:|-----------:|
| 0.01 | 10 | 133 | 0.88 |
| 0.01 | 20 | 353 | 0.88 |
| 0.05 | 10 | 96 | 0.82 |
| 0.05 | 20 | 154 | 0.82 |
| 0.10 | 10 | 217 | 0.92 |
| 0.10 | 20 | 247 | 0.85 |

The model accuracy varies between 8 percent and 18 percent. Varying the training set sample size between 1 and 10 percent does not appear to have a significant effect on prediction accuracy. It remains to be investigated whether a larger sample size would increase accuracy.

Next steps

To complete the project the following steps are planned: