Mobile devices have become indispensible everyday companions at home and work, to socialize, play and do business. But lacking a full-size keyboard, text entry on touch screen devices in particular can be cumbersome. Automated text prediction aims to solve this by using entered text to predict the next word.
This report does an exploratory analysis of a large body of text to design a text prediction system that could be run efficiently on a mobile device.
R, RStudio and the R package tm for text mining were used to perform the analysis. All analysis code is included in the report R markdown document.
Natural Language Processing (NLP) is the field of computational linguistics that is concerned with the interactions between computers and humans (Wikipedia https://en.wikipedia.org/wiki/Natural_language_processing). It provides approaches and tools that may be used in a range of language processing tasks among which are identifying structure and meaning of texts, and – the focus of this report – generating natural language.
Natural Language Processing involves a number of activities (ref Mining the Social Web, Second Edition, by Matthew A. Russell).
End of sentence detection. Text is broken up in a collection of meaningful sentences.
Tokenization. Individual sentences are split into tokens, typically words and delineators such as start-of-sentence and end-of-sentence.
Part-of-speech tagging. Tokens are assigned part-of-speech information: noun, verb, etc.
Chunking. Deriving logical concepts from the tagged tokens within a sentence.
Extraction. Named entities (people, locations, events, etc) are extracted from each chunk.
These steps should not be applied rigidly however. Some steps may be less relevant to achive the desired NLP objective. The derivation of meaning or concepts is not prerequisite to construct a predictive text model.
It may be expected that the accuracy of a predictive text model primarily depends on the number of unique words that are available in the original body of text. Complementary data sources may include dictionaries of profanity words (assuming that the prediction of such words is to be avoided), stop words, named entities (sports, cities, states, presidents), synonyms (WordNet database) and jargon dictionaries.
Common issues in the analysis of text data are the use of colloquial language and punctuation as well as occurrences of misspellings (bye vs by). Especially on social media people often use non-words and acronyms (lol) that are a language on their own.
Predictive text modeling using NLP follows generally the same approach to data as we learned in the Data Science Specialization. The data is obtained, cleaned and explored, before moving to the predictive modeling stage using a training, validation and test set, and finally text prediction itself.
The source data in the project at hand consists of a set of files taken from the set of corpora provided by HC Corpora (http://www.corpora.heliohost.org). The readme file at http://www.corpora.heliohost.org/aboutcorpus.html provides details on the available corpora. Besides text the files include meta data for Main Website, Date, Entry Type and Subject.
The Coursera dataset analyzed for this report is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. The dataset files contain text but they lack the additional meta data mentioned above. Although the files in the dataset have been filtered for language they may still contain some foreign language words.
The analysis is performed on the files containing American English texts. The table below gives an overview of file size, line and word counts.
File | File Size (MB) | Object Size (MB) | Lines | Max length | Words |
---|---|---|---|---|---|
./data/final/en_US/en_US.blogs.txt | 200.40 | 248.50 | 899288 | 40835 | 37334114 |
./data/final/en_US/en_US.news.txt | 196.30 | 19.20 | 1010242 | 5760 | 34365936 |
./data/final/en_US/en_US.twitter.txt | 159.40 | 301.40 | 2360148 | 213 | 30359852 |
All files are quite large (File Size) and appear to require significant memory (Object Size). As this is before any processing it may be necessary to sample the files rather than use the entire file to avoid hitting memory and processing constraints.
Whereas the number of lines varies (Lines), the number of words is approximately the same for each file (Words). Taking equal samples of each file would ensure that the word frequency distribution of the combined corpus is not skewed.
A few sample lines from each file are shown below.
./data/final/en_US/en_US.blogs.txt
./data/final/en_US/en_US.news.txt
./data/final/en_US/en_US.twitter.txt
All files contain lines that consist of multiple sentences. The character sets may include accented letters as used in non-English languages. Sentences may include punctuation such as single or double quotes surrounding words, and words may contain contractions (“don’t”" as a shorthand for “do not”). End of sentence punctuations may appear mid-sentence, as seen in the News and Blogs files. Cleaning the body of texts is therefore necessary.
Tokenization is the process of cutting the text into pieces: from body of text into sentences and words. We will remove punctuation characters such as commas, parentheses and the like, under the assumption that they have little impact on word order and n-gram composition. Numeric characters that may appear in numbers and dates are also removed. The combination of certain numbers and words can have predictive value, but their frequency might be too low to be useful.
Before building the word frequency tables (explained in the next section) all text is converted to lowercase. As a consequence the prediction model will predict lowercase words only. For English text this is less of an issue than for a language like German that uses capitalized nouns. In such case an additional tokenization step is needed that categorizes word classes (noun, verb, etc), allowing the character case to be reset selectively.
Initially we will not treat typos, garbage words and wrong language any differently from “proper” words, under the assumption that they represent a minor portion of the text. Profanity words will be removed. For this we use the lists of terms that could be found offensive that are available from http://www.bannedwordlist.com/lists/swearWords.txt (around 75 terms) and http://www.cs.cmu.edu/~biglou/resources/bad-words.txt (1,300 terms).
The planned model will use word sequences of up to four words to predict the next word.
Every language has certain word sequences that occur often. In the field of computational linguistics they are know as n-grams. An n-gram is a contiguous sequence of n items from a given sequence of text or speech (Wikipedia, https://en.wikipedia.org/wiki/N-gram). Text prediction systems use n-grams to predict the next word based on the probability of its occurrence in the language’s n-grams.
In Natural Language Processing the ratio of unique words (or n-grams) and total number of occurrences of these words (or n-grams) is known as language coverage. If more words are included the language coverage is expected to increase. Words that occur often have a higher impact on coverage than words that occur rarely.
To determine how many unique words and n-grams are required for a certain percentage of coverage several scenarios are analyzed. The scenario parameter analyzed for its impact on coverage is dataset sample size, which we set to vary between 1%, 5% and 10% of the original dataset size.
The three sample size scenarios are analyzed for n-grams with n between 1 and 4.
The table below shows the number of unique n-grams in each training scenario (unique.ngrams), the associated total number of n-gram occurrences (total.instances), and the ratio total to unique. A low ratio means a high number of n-grams that occur only once in the sample. From the perspective of prediction accuracy we would likely be looking to include as many n-grams as possible, whereas model efficiency drives us to drop the single n-gram ocurrences and include the n-grams that occur more than once. The ratio increase from one sample size to the next shows whether this helps to identify additional frequent n-grams. A stable or dropping ratio means that mostly new single occurrence n-grams were added.
scenario | ngram | unique.ngrams | total.instances | ratio |
---|---|---|---|---|
Train 0.01 | 1 | 27003 | 166392 | 6.20 |
Train 0.01 | 2 | 154488 | 289119 | 1.90 |
Train 0.01 | 3 | 234202 | 265732 | 1.10 |
Train 0.01 | 4 | 238523 | 243519 | 1.00 |
Train 0.05 | 1 | 67866 | 865776 | 12.80 |
Train 0.05 | 2 | 579866 | 1498790 | 2.60 |
Train 0.05 | 3 | 1072093 | 1374479 | 1.30 |
Train 0.05 | 4 | 1192041 | 1256737 | 1.10 |
Train 0.10 | 1 | 100014 | 1731966 | 17.30 |
Train 0.10 | 2 | 994992 | 2997573 | 3.00 |
Train 0.10 | 3 | 2005139 | 2748604 | 1.40 |
Train 0.10 | 4 | 2331943 | 2512488 | 1.10 |
The charts below show the top-12 most often occurring n-grams in the scenario Train 0.01 (sample size = 1 percent). There are many stop words among the single words. Bigrams and trigrams also contain a good proportion of stopword sequences. N-grams that include stopwords will be kept in the corpus as they can enlarge the model’s predictive capability.
The charts below shows coverage (as a percentage of the scenario dataset size) against the number of unique n-grams in the sample, for each of the three sample scenarios. On the x-axis n-grams are sorted in descending order of occurrence.
As expected the high frequency n-grams have a large impact on coverage, and the coverage ratio increases exponentially. As soon as the x-axis reaches the single occurrence n-grams the curve changes to linear.
For larger training sets it takes more words to reach a certain coverage, but this difference quickly disappears for the higher-order n-grams. For n=4 and higher the curves are mostly entirely linear from the start, which suggests that the increase of sample size has diminishing returns.
The n-gram frequency distributions are long-tailed. The table below shows the split between single and multiple n-gram occurrences. In the 1 percent sample scenario there are some 12,000 unique words that appear multiple times and some 15,000 unique words that appear just once, or 55 percent. 81 percent of 2-grams appear just once. Increasing the sample size to 10 percent offers a marginal improvement, and only for the higher order n-grams. A minimal predictive model that includes all higher order n-grams that appear more than once would contain approximately 530,000 n-grams.
scenario | ngram | unique.ngrams | multiples | singles | singles.pct |
---|---|---|---|---|---|
Train 0.01 | 1 | 27003 | 12067 | 14936 | 55 |
Train 0.01 | 2 | 154488 | 28319 | 126169 | 81 |
Train 0.01 | 3 | 234202 | 14536 | 219666 | 93 |
Train 0.01 | 4 | 238523 | 3331 | 235192 | 98 |
Train 0.05 | 1 | 67866 | 31266 | 36600 | 53 |
Train 0.05 | 2 | 579866 | 127984 | 451882 | 77 |
Train 0.05 | 3 | 1072093 | 99106 | 972987 | 90 |
Train 0.05 | 4 | 1192041 | 33778 | 1158263 | 97 |
Train 0.10 | 1 | 100014 | 45168 | 54846 | 54 |
Train 0.10 | 2 | 994992 | 233709 | 761283 | 76 |
Train 0.10 | 3 | 2005139 | 211203 | 1793936 | 89 |
Train 0.10 | 4 | 2331943 | 84371 | 2247572 | 96 |
The basic n-gram model will take the n-grams of one to four words to predict the next word.
The first task consists of generating the n-grams and frequencies from the sampled “training” dataset.
The larger the sample dataset, the more time and memory space it takes to generate the n-grams, especially for n > 2. Instead of processing the entire sample at once, the n-gram generation algorithm will process the files in pieces of 1,000 lines, build n-gram frequencies, and then combine the individual n-gram frequency tables into a single table, summarize and order the n-gram table by decreasing frequency. The process is repeated for each n. In this way a laptop with 8 GB of memory processes some 2,000 lines per minute, generating a 10 percent sample training set in about 3 hours. Unfortunately even this approach hits the memory limit when attempting to generate a larger 20 percent sample training set.
The n-gram tables are then processed and split in (n-1)-grams and the final word, corresponding to the predictor and the predicted word, respectively. Next the n-gram conditional probabilities are computed using the formula for maximum likelihood: \[ Pr(w_{i}|w_{i-1}) = \frac {count(w_{i-1},w_{i})} {count(w_{i-1})} \] where \(w_{i}\) is the last word, and \(w_{i-1}\) the preceding words.
The n-grams are stored in an R data frame, with character strings coded as factors. This should enable an easy and fast match by the prediction algorithm.
The prediction algorithm takes the entered text, cleans and extracts the preceding 1 to n-1 words. It then uses a simple backoff approach in combination with weighting to build a list of probable next words.
The lines of text in the dataset that were not included in the training set are used to test the model. The test selects lines at random, and splits each sentence in n-grams. The next word prediction accuracy is tested on each (n-1)-gram. The predicted word is then compared wih word n. The accuracy is the ratio of correct predictions and total predictions.
The test is done for the different training set sample sizes.
The table below shows the validation result for ten and twenty lines sampled from the test set.
trainSample | lines | nr.tests | test.error |
---|---|---|---|
0.01 | 10 | 133 | 0.88 |
0.01 | 20 | 353 | 0.88 |
0.05 | 10 | 96 | 0.82 |
0.05 | 20 | 154 | 0.82 |
0.10 | 10 | 217 | 0.92 |
0.10 | 20 | 247 | 0.85 |
The model accuracy varies between 8 percent and 18 percent. Varying the training set sample size between 1 and 10 percent does not seem to have a significant effect on prediction accuracy. It is to be investigated whether a larger sample size will increase accuracy.
To complete the project the following steps are projected: