Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types “I went to the”, the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant. In this capstone, we will work on understanding and building predictive text models like those used by SwiftKey.
This report aims to show that I am comfortable working with the data and on track to create my prediction algorithm. It explains my exploratory analysis and my goals for the eventual app and algorithm, focusing on the major features of the data that I have identified.
Using exploratory data analysis, we will see that while the data sources are far too big to be used in their entirety, it is possible to keep only a small fraction of the words used while retaining almost the same accuracy. We will also see that even though a lot of cleaning has been performed on the raw data, a look at the quadrigrams shows that more cleaning is still necessary.
All of the code used in the writing of this report is available in the Appendix section.
Some R packages have been created and are maintained specifically for text mining. We use two of them, namely tm and quanteda. The slam package provides easy manipulation of large matrices. The downloader package provides easy downloading of zip files. The ggplot2 package provides nice visualization tools.
The .zip file contains 4 folders, corresponding to data in 4 languages: German, English, Finnish and Russian. In each of these folders, there are 3 .txt files containing sentences taken from blogs, news and Twitter respectively. The English text files are really big (at least 900,000 lines of text); the files for the other languages are smaller but still contain more than 100,000 lines each. Supposedly, there are no duplicates, and each entry is on an individual line. We may still find lines in entirely different languages in the corpus, either because of similarities between languages or because a foreign language is embedded in another. The corpora were collected from publicly available sources by a web crawler. There is no metadata attached, just plain text; we only know whether the source is a tweet, a blog post or a news article. They come from this website:
For the purpose of this report, we’ll only consider the English data.
Some information about the 3 .txt files:
##                File     Bytes   Lines    Words
## 1   en_US.blogs.txt 210160014  899288 37334131
## 2    en_US.news.txt 205811889 1010242 34372530
## 3 en_US.twitter.txt 167105338 2360148 30373583
Since the files are so big, for the exploratory data analysis we’re going to work on a sample of the full corpus. For the remainder of this analysis, the sample size will be 5% of the full corpus.
Also, the end goal is to have one unique dictionary on which to train our model. As such, it doesn’t make sense to keep working on 3 different data sets, so from now on we’ll merge the data from the 3 files and work on a single subset of this merged data.
## [1] "subSet contains 213483 lines, which is 5 % of the whole corpus"
A big part of preparing a dictionary for a predictive text model is the cleaning of the data. Raw data is always very, very messy, and taking it “as is” will undoubtedly result in a poor model.
Phase 1 is the removal of weird, non-ASCII characters.
Phase 2 is the removal of punctuation, with the notable exception of apostrophes. Apostrophes are used in a lot of contractions, like it’s in place of it is, which will be handled correctly in a later phase. I chose not to use the tm package’s removePunctuation method and to go with a custom regexp instead, in order to keep the apostrophes. I might need to improve this phase so that periods and question/exclamation marks create new lines instead of white space, as the current solution creates unwanted n-grams that span sentence boundaries.
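The custom regexp in question (reproduced from the Appendix) simply replaces every character that is not a letter, a digit, whitespace or an apostrophe with a space:
# Keep letters, digits, whitespace and apostrophes; everything else becomes a space
subSet <- gsub("[^[:alnum:][:space:]']", " ", subSet)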
Phase 3 is the removal of numbers. We do not want our program to suggest numbers to the user; that doesn’t seem to make much sense. There is a drawback though: words like 6th (instead of sixth) get transformed into th, which does not make any sense as a suggestion either. I might later revisit this phase to solve this issue.
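One possible refinement, not applied in this report, would be to drop the whole token when it starts with a digit rather than just the digits themselves, along these lines:
# Possible refinement (not used here): remove digit-led tokens like "6th" entirely
subSet <- gsub("\\b[0-9]+[[:alpha:]]*\\b", " ", subSet)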
Phase 4 is the conversion to lower case. Our program wants to suggest words that come after other words, not the first word of a sentence, so the fact that a letter is upper case does not bring much information. Lower-casing also strengthens our dictionary, since different capitalizations now count as instances of the same word, and it will make filtering easier in later phases.
Phase 5 is the removal of URLs. The purpose of our program is not to advertise URLs that happen to be present in our original data sources (and there are a lot of them). One issue is that simply removing these URLs creates sentences which don’t make much sense. This is another phase that I may have to revisit at a later date to improve my dictionary.
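One possible improvement, again not applied here, would be to remove complete URLs before the punctuation phase shreds them into separate tokens, for example:
# Possible refinement (not used here): drop whole URLs while they are still intact
subSet <- gsub("(https?://|www\\.)[^[:space:]]+", " ", subSet, ignore.case = TRUE)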
Phase 6 is the removal of offensive language. The idea is to make this feature optional in the final program, but for the moment it is mandatory. There are various possible implementations here. One would be to use one of the profanity lists available on the web (like this one) and filter against it, but you will never catch all of the variations of profanities that way. That was my first implementation, but I later opted for direct regexp hunting and got much better results. You only need to handle the most common offensive words; the more “original” ones won’t be frequent enough in the dictionary to have an impact anyway.
It seems important not to simply discard these words, but to replace them with something so that the sentences still make some sense. There are still implementation choices to be made here. Do we want to keep the profanities in the dictionary but never suggest them to the end user? Or do we want to replace them with a placeholder like “XXXX” and suggest that placeholder?
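As an illustration, the regexp approach boils down to something like the following ("badword" is a placeholder stem, not an actual entry; the real patterns are in the Appendix):
deletedString <- " XXXX "
# "badword" stands in for a profanity stem; any token containing it gets replaced
subSet <- gsub("\\w*badword\\w*", deletedString, subSet)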
Phase 7 is the unifying of common word contractions. There are a lot of contractions in the English language. For example, that is and that’s mean exactly the same thing, and you don’t want to offer 2 suggestions that are in fact the same one. Since people often misspell, you can add thats to the list as well. Unifying all of these kinds of contractions/misspellings will greatly strengthen our dictionary.
It’s a very tricky process, since you don’t want to modify other words unintentionally.
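The patterns used in the Appendix rely on surrounding spaces as crude word boundaries to limit that risk, for example:
subSet <- gsub(" im ", " i am ", subSet)       # the spaces keep "im" inside words like "him" or "time" untouched
subSet <- gsub(" thats ", " that is ", subSet) # same idea for the common misspelling of "that's"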
Phase 8 is the stripping of excessive white space. All of the previous cleaning has created a lot of extra white space, which needs to be cleaned up before tokenizing the data.
Now that the text has been cleaned enough, and before proceeding with the actual exploration of the data, we need to create a useful tool called a Document-Term Matrix (DTM). These matrices let us count the number of times each word appears in the whole corpus. There are different implementation choices possible here. The tm package implementation is really slow, even on a recent computer. In stark contrast, the less well-known quanteda package offers a much faster implementation. I compared the resulting dictionaries for both implementations, and while there are small differences in word counts, they are small enough that using the quanteda package was a no-brainer.
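As a rough check (exact timings depend on package versions and hardware), both implementations can be timed on the same cleaned sample subSet:
# Rough comparison of the two DTM implementations on the same sample
system.time(dtmTm <- DocumentTermMatrix(Corpus(VectorSource(subSet))))
system.time(dtmQu <- dfm(tokenize(subSet, what = "word")))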
We will also create dictionaries for n-grams, which are contiguous sequences of n words. These n-grams will later be used by our model to predict the next word. For the purpose of this study, we will limit ourselves to bigrams, trigrams and quadrigrams.
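For example, tokenizing a short sentence into bigrams with quanteda (which joins the words of each n-gram with an underscore) gives roughly:
tokenize("i went to the gym", ngrams = 2)
# roughly: "i_went" "went_to" "to_the" "the_gym"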
## [1] "Creating the DTM for words"
##
## ... indexing documents: 213,483 documents
## ... indexing features: 116,990 feature types
## ... created a 213483 x 116990 sparse dfm
## ... complete.
## Elapsed time: 5.87 seconds.
## [1] "Creating the DTM for bigrams"
##
## ... indexing documents: 213,483 documents
## ... indexing features: 1,533,204 feature types
## ... created a 213483 x 1533204 sparse dfm
## ... complete.
## Elapsed time: 7.59 seconds.
## [1] "Creating the DTM for trigrams"
##
## ... indexing documents: 213,483 documents
## ... indexing features: 3,349,663 feature types
## ... created a 213483 x 3349663 sparse dfm
## ... complete.
## Elapsed time: 7.24 seconds.
## [1] "Creating the DTM for quadrigrams"
##
## ... indexing documents: 213,483 documents
## ... indexing features: 4,121,470 feature types
## ... created a 213483 x 4121470 sparse dfm
## ... complete.
## Elapsed time: 7.75 seconds.
Here comes the fun part. First, the basic stats about the dictionaries:
## [1] "Unique words in sample : 116990"
## [1] "Total word instances in sample : 5105752"
## [1] "Unique bigrams in sample : 1533204"
## [1] "Total bigram instances in sample : 4892355"
## [1] "Unique trigrams in sample : 3349663"
## [1] "Total trigram instances in sample : 4680173"
## [1] "Unique quadrigrams in sample : 4121470"
## [1] "Total quadrigram instances in sample : 4473638"
Now we can take a graphical look at the most frequent n-grams in the sample.
No surprise here: all of these words are commonly referred to as “stop words” in the NLP literature, and they are often removed in NLP applications. For a predictive text model, however, we obviously must keep them, as we want to be able to suggest them to the end user.
For bigrams, the top 20 consists of various combinations of stop words, which again is expected.
For trigrams, we begin to see some interesting biases due to the nature of our data sources. “Thanks for the” is really common on Twitter, used to thank someone either for following you or for retweeting one of your own tweets. With just the news and blogs data sources, I’m pretty sure it wouldn’t appear in the top 20, much less the top 5.
For quadrigrams, some weird ones appear, like “vested interests vested interests”. Further data analysis will be needed to find out what this is all about, and we’ll probably need to add a cleaning phase to get rid of these quadrigrams (advertisements? spam?). This is also a reminder that we’re working on a sample here, not on the full corpus: someone else with a different sample might not have found these quadrigrams at all, and another sample may reveal other “original” findings that need cleaning. To be sure of having a clean dictionary, we may have no choice but to clean the full corpus and subset it afterwards.
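A quick way to start investigating these odd quadrigrams would be to pull a few of the offending lines straight from the sample, for example:
# Look at a few of the lines that produce the odd quadrigram
head(grep("vested interests vested interests", subSet, value = TRUE))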
Given that the application will have to be deployed on a Shiny server, we need to keep in mind the overall performance of our algorithm, so that it runs in a reasonable amount of time without taking up too many resources, memory in particular. As such, we probably won’t be able to ship the full dictionary. That’s why it is interesting to study data coverage: how many unique words do you need from a frequency-sorted dictionary to cover X% of all word instances in the language?
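The Appendix answers this question with an explicit loop over the cumulative frequencies; with freq1 the frequency-sorted word counts, an equivalent and more compact formulation would be:
# Cumulative share of all word instances covered by the top-ranked words
coverage <- cumsum(freq1) / sum(freq1) * 100
# Number of unique words needed to reach 90% coverage
which(coverage > 90)[1]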
## [1] "50% coverage is reached with 132 unique words, or 0.11 % of the total number of unique words"
## [1] "70% coverage is reached with 863 unique words, or 0.74 % of the total number of unique words"
## [1] "90% coverage is reached with 6987 unique words, or 5.97 % of the total number of unique words"
## [1] "95% coverage is reached with 16231 unique words, or 13.87 % of the total number of unique words"
## [1] "99% coverage is reached with 65933 unique words, or 56.36 % of the total number of unique words"
As shown above, we can get 90% coverage with about the top 6% most frequent words, and 95% coverage with about the top 14% most frequent words. So using less than 15% of the total dictionary, we could still field a really good model, almost as good as a full dictionary model.
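In practice, this could mean exporting only the top of the frequency-sorted dictionary. For example (hypothetical, not done in this report), with freq1 the frequency-sorted word counts and ninetyfiveCovIndx the 95% threshold index computed in the Appendix:
# Hypothetical pruning step: keep only the words needed for 95% coverage
prunedWords <- names(freq1)[1:ninetyfiveCovIndx]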
Now we can try the same work with bigrams.
## [1] "50% coverage is reached with 34218 unique bigrams, or 2.23 % of the total number of unique bigrams"
## [1] "70% coverage is reached with 221172 unique bigrams, or 14.43 % of the total number of unique bigrams"
## [1] "90% coverage is reached with 1043969 unique bigrams, or 68.09 % of the total number of unique bigrams"
## [1] "95% coverage is reached with 1288587 unique bigrams, or 84.05 % of the total number of unique bigrams"
## [1] "99% coverage is reached with 1484281 unique bigrams, or 96.81 % of the total number of unique bigrams"
With bigrams, we see that to reach 90% coverage, we need about the top 70% most frequent bigrams. That is a much bigger proportion than with words.
Since I expect the proportion of trigrams and quadrigrams needed to reach 90% coverage to be even bigger, I won’t include a coverage study of those n-grams in this report.
We’re only at the beginning of this project. Here are some of the steps I have in mind for the next few weeks:
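# Install and load the required packages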
if (!require(downloader)) {install.packages("downloader")}
if (!require(slam)) {install.packages("slam")}
if (!require(ggplot2)) {install.packages("ggplot2")}
if (!require(tm)) {install.packages("tm")}
if (!require(quanteda)) {install.packages("quanteda")}
library(downloader)
library(slam)
library(ggplot2)
library(tm)
library(quanteda)
# Loading documents
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("data/raw_data/dataset.zip"))
{
download(url, dest="data/raw_data/dataset.zip", mode="wb")
unzip ("data/raw_data/dataset.zip", exdir = "data/raw_data")
}
con <- file("data/raw_data/final/en_US/en_US.news.txt", "rb")
en_news <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file("data/raw_data/final/en_US/en_US.blogs.txt", "rb")
en_blogs <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
con <- file("data/raw_data/final/en_US/en_US.twitter.txt", "rb")
en_twitter <- readLines(con, skipNul = TRUE, encoding = "UTF-8")
close(con)
# Get number of words in each file
nbWordsBlogs <- 0
words <- strsplit(en_blogs, " ")
for (i in 1:length(en_blogs)) {nbWordsBlogs <- nbWordsBlogs + length(words[[i]])}
nbWordsNews <- 0
words <- strsplit(en_news, " ")
for (i in 1:length(en_news)) {nbWordsNews <- nbWordsNews + length(words[[i]])}
nbWordsTwitter <- 0
words <- strsplit(en_twitter, " ")
for (i in 1:length(en_twitter)) {nbWordsTwitter <- nbWordsTwitter + length(words[[i]])}
# Print info about each file
fileInfo <- data.frame(File = c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"),
Bytes = c(file.info("data/raw_data/final/en_US/en_US.blogs.txt")$size,
file.info("data/raw_data/final/en_US/en_US.news.txt")$size,
file.info("data/raw_data/final/en_US/en_US.twitter.txt")$size),
Lines = c(length(en_blogs), length(en_news), length(en_twitter)),
Words = c(nbWordsBlogs, nbWordsNews, nbWordsTwitter))
print(fileInfo)
# Subsetting
set.seed(156)
sampleSize <- 0.05
newsSubSet <- sample(en_news, length(en_news) * sampleSize)
blogsSubSet <- sample(en_blogs, length(en_blogs) * sampleSize)
twitterSubSet <- sample(en_twitter, length(en_twitter) * sampleSize)
subSet <- sample(c(newsSubSet, blogsSubSet, twitterSubSet))
print(paste("subSet contains", length(subSet), "lines, which is",
sampleSize * 100, "% of the whole corpus"))
# Remove non-ASCII characters
subSet <- iconv(subSet, to = "ASCII", sub = " ")
# Remove punctuation (but keep apostrophes)
subSet <- gsub("[^[:alnum:][:space:]']", " ", subSet)
# Remove numbers
subSet <- removeNumbers(subSet)
# Convert everything to lower case
subSet <- tolower(subSet)
# Remove URLs
subSet <- gsub(" *http* ", " ", subSet)
subSet <- gsub(" *wwww* ", " ", subSet)
# Remove offensive language
# CAREFUL: NSFW
deletedString <- " XXXX "
subSet <- gsub(" *fuck* ", deletedString, subSet)
subSet <- gsub(" *fuk* ", deletedString, subSet)
subSet <- gsub(" *shit* ", deletedString, subSet)
subSet <- gsub(" *poop* ", deletedString, subSet)
subSet <- gsub(" *bitch* ", deletedString, subSet)
subSet <- gsub(" *cunt* ", deletedString, subSet)
subSet <- gsub(" *whore* ", deletedString, subSet)
subSet <- gsub(" *slut* ", deletedString, subSet)
subSet <- gsub(" *skank* ", deletedString, subSet)
subSet <- gsub(" *nigg* ", deletedString, subSet)
subSet <- gsub(" *fag* ", deletedString, subSet)
subSet <- gsub(" *blowjob* ", deletedString, subSet)
subSet <- gsub(" *handjob* ", deletedString, subSet)
subSet <- gsub(" puss* ", deletedString, subSet)
subSet <- gsub(" *vagina* ", deletedString, subSet)
subSet <- gsub(" *tits* ", deletedString, subSet)
subSet <- gsub(" ass ", deletedString, subSet)
subSet <- gsub(" dick* ", deletedString, subSet)
subSet <- gsub(" *penis* ", deletedString, subSet)
subSet <- gsub(" *prick* ", deletedString, subSet)
subSet <- gsub(" *erection* ", deletedString, subSet)
subSet <- gsub(" *fellat* ", deletedString, subSet)
subSet <- gsub(" *dammit* ", deletedString, subSet)
subSet <- gsub(" *damnit* ", deletedString, subSet)
subSet <- gsub(" *bastard* ", deletedString, subSet)
subSet <- gsub(" *retard* ", deletedString, subSet)
subSet <- gsub(" moron* ", deletedString, subSet)
subSet <- gsub(" *scum* ", deletedString, subSet)
subSet <- gsub(" *wank* ", deletedString, subSet)
# Unify common word contractions
subSet <- gsub("i'm ", "i am ", subSet)
subSet <- gsub(" im ", " i am ", subSet)
subSet <- gsub(" i m ", " i am ", subSet)
subSet <- gsub("you're ", "you are ", subSet)
subSet <- gsub("you re ", "you are ", subSet)
#subSet <- gsub("he's ", "he is ", subSet) can't differentiate between "he has" and "he is"
#subSet <- gsub("she's ", "she is ", subSet) can't differentiate between "she has" and "she is"
#subSet <- gsub("he s ", "he is ", subSet) can't differentiate between "he has" and "he is"
#subSet <- gsub("she s ", "she is ", subSet) can't differentiate between "she has" and "she is"
subSet <- gsub("we're ", "we are ", subSet)
subSet <- gsub("we re ", "we are ", subSet)
subSet <- gsub("they're ", "they are ", subSet)
subSet <- gsub(" theyre ", " they are ", subSet)
subSet <- gsub(" they re ", " they are ", subSet)
subSet <- gsub("i'll ", "i will ", subSet)
subSet <- gsub("i ll ", "i will ", subSet)
subSet <- gsub("you'll ", "you will ", subSet)
subSet <- gsub(" youll ", " you will ", subSet)
subSet <- gsub(" you ll ", " you will ", subSet)
subSet <- gsub("he'll ", "he will ", subSet)
subSet <- gsub("he ll ", "he will ", subSet)
subSet <- gsub("she'll ", "she will ", subSet)
subSet <- gsub("she ll ", "she will ", subSet)
subSet <- gsub("we'll ", "we will ", subSet)
subSet <- gsub("we ll ", "we will ", subSet)
subSet <- gsub("they'll ", "they will ", subSet)
subSet <- gsub(" theyll ", " they will ", subSet)
subSet <- gsub(" they ll ", " they will ", subSet)
subSet <- gsub("i've ", "i have ", subSet)
subSet <- gsub(" ive ", " i have ", subSet)
subSet <- gsub(" i ve ", " i have ", subSet)
subSet <- gsub("you've ", "you have ", subSet)
subSet <- gsub(" youve ", " you have ", subSet)
subSet <- gsub(" you ve ", " you have ", subSet)
subSet <- gsub("we've ", "we have ", subSet)
subSet <- gsub(" weve ", " we have ", subSet)
subSet <- gsub(" we ve ", " we have ", subSet)
subSet <- gsub("they've ", "they have ", subSet)
subSet <- gsub(" theyve ", " they have ", subSet)
subSet <- gsub(" they ve ", " they have ", subSet)
subSet <- gsub("i'd ", "i would ", subSet)
subSet <- gsub("you'd ", "you would ", subSet)
subSet <- gsub("you d ", "you would ", subSet)
subSet <- gsub("he'd ", "he would ", subSet)
subSet <- gsub("he d ", "he would ", subSet)
subSet <- gsub("she'd ", "she would ", subSet)
subSet <- gsub("she d ", "she would ", subSet)
subSet <- gsub("we'd ", "we would ", subSet)
subSet <- gsub("we d ", "we would ", subSet)
subSet <- gsub("they'd ", "they would ", subSet)
subSet <- gsub("they d ", "they would ", subSet)
subSet <- gsub("it's ", "it is ", subSet)
subSet <- gsub("it s ", "it is ", subSet)
subSet <- gsub("that's ", "that is ", subSet)
subSet <- gsub("that s ", "that is ", subSet)
subSet <- gsub(" thats ", " that is ", subSet)
subSet <- gsub("there's ", "there is ", subSet)
subSet <- gsub("there s ", "there is ", subSet)
subSet <- gsub(" theres ", " there is ", subSet)
subSet <- gsub("what's ", "what is ", subSet)
subSet <- gsub("what s ", "what is ", subSet)
subSet <- gsub(" whats ", " what is ", subSet)
subSet <- gsub("let s ", "let's ", subSet)
subSet <- gsub(" lets ", " let's ", subSet)
subSet <- gsub(" a m ", " am ", subSet)
subSet <- gsub(" p m ", " pm ", subSet)
subSet <- gsub(" w o ", " without ", subSet)
subSet <- gsub(" ur ", " your ", subSet)
subSet <- gsub(" ya ", " you ", subSet)
subSet <- gsub(" tryin ", " trying ", subSet)
subSet <- gsub(" playin ", " playing ", subSet)
subSet <- gsub(" cryin ", " crying ", subSet)
subSet <- gsub("could not ", "couldn't ", subSet)
subSet <- gsub("couldn t ", "couldn't ", subSet)
subSet <- gsub("couldnt ", "couldn't ", subSet)
subSet <- gsub("would not ", "wouldn't ", subSet)
subSet <- gsub("wouldn t ", "wouldn't ", subSet)
subSet <- gsub("wouldnt ", "wouldn't ", subSet)
subSet <- gsub("have not ", "haven't ", subSet)
subSet <- gsub("haven t ", "haven't ", subSet)
subSet <- gsub(" havent ", " haven't ", subSet)
subSet <- gsub("is not ", "isn't ", subSet)
subSet <- gsub("isn t ", "isn't ", subSet)
subSet <- gsub(" isnt ", " isn't ", subSet)
subSet <- gsub("are not ", "aren't ", subSet)
subSet <- gsub("aren t ", "aren't ", subSet)
subSet <- gsub(" arent ", " aren't ", subSet)
subSet <- gsub("was not ", "wasn't ", subSet)
subSet <- gsub("wasn t ", "wasn't ", subSet)
subSet <- gsub(" wasnt ", " wasn't ", subSet)
subSet <- gsub("were not ", "weren't ", subSet)
subSet <- gsub("weren t ", "weren't ", subSet)
subSet <- gsub(" werent ", " weren't ", subSet)
subSet <- gsub("has not ", "hasn't ", subSet)
subSet <- gsub("hasn t ", "hasn't ", subSet)
subSet <- gsub(" hasnt ", " hasn't ", subSet)
subSet <- gsub("had not ", "hadn't ", subSet)
subSet <- gsub("hadn t ", "hadn't ", subSet)
subSet <- gsub(" hadnt ", " hadn't ", subSet)
subSet <- gsub("do not ", "don't ", subSet)
subSet <- gsub("don t ", "don't ", subSet)
subSet <- gsub(" dont ", " don't ", subSet)
subSet <- gsub("does not ", "doesn't ", subSet)
subSet <- gsub("doesn t ", "doesn't ", subSet)
subSet <- gsub(" doesnt ", " doesn't ", subSet)
subSet <- gsub("did not ", "didn't ", subSet)
subSet <- gsub("didn t ", "didn't ", subSet)
subSet <- gsub(" didnt ", " didn't ", subSet)
subSet <- gsub("will not ", "won't ", subSet)
subSet <- gsub("won t ", "won't ", subSet)
subSet <- gsub(" wont ", " won't ", subSet)
subSet <- gsub("cannot ", "can't ", subSet)
subSet <- gsub("can t ", "can't ", subSet)
subSet <- gsub(" cant ", " can't ", subSet)
# Strip excessive white space
subSet <- stripWhitespace(subSet)
# Tokenizing for "count" Grams
tokens1 <- tokenize(subSet, what = "word", ngrams = 1)
tokens2 <- tokenize(subSet, what = "word", ngrams = 2)
tokens3 <- tokenize(subSet, what = "word", ngrams = 3)
tokens4 <- tokenize(subSet, what = "word", ngrams = 4)
# Creating the DTM for "count" Grams
print(paste("Creating the DTM for words"))
dtm1 <- dfm(tokens1)
print(paste("Creating the DTM for bigrams"))
dtm2 <- dfm(tokens2)
print(paste("Creating the DTM for trigrams"))
dtm3 <- dfm(tokens3)
print(paste("Creating the DTM for quadrigrams"))
dtm4 <- dfm(tokens4)
# Basic stats about n-grams
uniqueWords <- dim(dtm1)[2]
uniqueBigrams <- dim(dtm2)[2]
uniqueTrigrams <- dim(dtm3)[2]
uniqueQuadrigrams <- dim(dtm4)[2]
freq1 <- sort(col_sums(dtm1), decreasing = TRUE)
freq2 <- sort(col_sums(dtm2), decreasing = TRUE)
freq3 <- sort(col_sums(dtm3), decreasing = TRUE)
freq4 <- sort(col_sums(dtm4), decreasing = TRUE)
print(paste("Unique words in sample :", uniqueWords))
print(paste("Total word instances in sample :", sum(freq1)))
print(paste("Unique bigrams in sample :", uniqueBigrams))
print(paste("Total bigram instances in sample :", sum(freq2)))
print(paste("Unique trigrams in sample :", uniqueTrigrams))
print(paste("Total trigram instances in sample :", sum(freq3)))
print(paste("Unique quadrigrams in sample :", uniqueQuadrigrams))
print(paste("Total quadrigram instances in sample :", sum(freq4)))
# Plot the most frequent words
wordFreq <- data.frame(gram = names(freq1), freq = freq1)
wordPlot <- ggplot(wordFreq[1:20, ], aes(x = reorder(gram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "dodgerblue") +
ggtitle("Most frequent words in sample") +
labs(x = "Words",
y = "Occurences") +
coord_flip()
print(wordPlot)
# Plot the most frequent bigrams
bigramFreq <- data.frame(gram = names(freq2), freq = freq2)
bigramPlot <- ggplot(bigramFreq[1:20, ], aes(x = reorder(gram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "royalblue") +
ggtitle("Most frequent bigrams in sample") +
labs(x = "Bigrams",
y = "Occurences") +
coord_flip()
print(bigramPlot)
# Plot the most frequent trigrams
trigramFreq <- data.frame(gram = names(freq3), freq = freq3)
trigramPlot <- ggplot(trigramFreq[1:20, ], aes(x = reorder(gram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "mediumblue") +
ggtitle("Most frequent trigrams in sample") +
labs(x = "Trigrams",
y = "Occurences") +
coord_flip()
print(trigramPlot)
# Plot the most frequent quadrigrams
quadrigramFreq <- data.frame(gram = names(freq4), freq = freq4)
quadrigramPlot <- ggplot(quadrigramFreq[1:20, ], aes(x = reorder(gram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
ggtitle("Most frequent quadrigrams in sample") +
labs(x = "Quadrigrams",
y = "Occurences") +
coord_flip()
print(quadrigramPlot)
# Computing coverage data for words
freq1Pct <- freq1 / sum(freq1) * 100
coverage <- cumsum(freq1Pct)
fiftyCovIndx <- 0
seventyCovIndx <- 0
ninetyCovIndx <- 0
ninetyfiveCovIndx <- 0
ninetynineCovIndx <- 0
for (i in 1:uniqueWords)
{
if ((fiftyCovIndx == 0) & (coverage[i] > 50)) {fiftyCovIndx <- i}
if ((seventyCovIndx == 0) & (coverage[i] > 70)) {seventyCovIndx <- i}
if ((ninetyCovIndx == 0) & (coverage[i] > 90)) {ninetyCovIndx <- i}
if ((ninetyfiveCovIndx == 0) & (coverage[i] > 95)) {ninetyfiveCovIndx <- i}
if ((ninetynineCovIndx == 0) & (coverage[i] > 99))
{
ninetynineCovIndx <- i
break
}
}
print(paste("50% coverage is reached with", fiftyCovIndx, "unique words, or", round(fiftyCovIndx / uniqueWords * 100, 2), "% of the total number of unique words"))
print(paste("70% coverage is reached with", seventyCovIndx, "unique words, or", round(seventyCovIndx / uniqueWords * 100, 2), "% of the total number of unique words"))
print(paste("90% coverage is reached with", ninetyCovIndx, "unique words, or", round(ninetyCovIndx / uniqueWords * 100, 2), "% of the total number of unique words"))
print(paste("95% coverage is reached with", ninetyfiveCovIndx, "unique words, or", round(ninetyfiveCovIndx / uniqueWords * 100, 2), "% of the total number of unique words"))
print(paste("99% coverage is reached with", ninetynineCovIndx, "unique words, or", round(ninetynineCovIndx / uniqueWords * 100, 2), "% of the total number of unique words"))
# Plot coverage data for words
wordCov <- data.frame(freq = coverage)
wordCov$index <- seq.int(nrow(wordCov))
wordCovPlot <- ggplot(wordCov, aes(reorder(index, freq), freq)) +
geom_point(color = "dodgerblue", size = 1) +
ggtitle("Coverage of language with unique words") +
labs(x = "Number of words",
y = "Coverage of corpus (in %)") +
scale_x_discrete(breaks = seq(0, uniqueWords, 5000)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_segment(aes(x = ninetyCovIndx, y = 0, xend = ninetyCovIndx, yend = 90)) +
geom_segment(aes(x = 0, y = 90, xend = ninetyCovIndx, yend = 90)) +
geom_segment(aes(x = ninetyfiveCovIndx, y = 0, xend = ninetyfiveCovIndx, yend = 95)) +
geom_segment(aes(x = 0, y = 95, xend = ninetyfiveCovIndx, yend = 95)) +
geom_segment(aes(x = ninetynineCovIndx, y = 0, xend = ninetynineCovIndx, yend = 99)) +
geom_segment(aes(x = 0, y = 99, xend = ninetynineCovIndx, yend = 99)) +
geom_text(x = 2500, y = 90,label = "90", vjust = 1, hjust = 0.5) +
geom_text(x = 2500, y = 95,label = "95", vjust = 1, hjust = 0.5) +
geom_text(x = 2500, y = 99,label = "99", vjust = 1, hjust = 0.5)
print(wordCovPlot)
# Computing coverage data for bigrams
freq2Pct <- freq2 / sum(freq2) * 100
coverage2 <- cumsum(freq2Pct)
fiftyCovIndx <- 0
seventyCovIndx <- 0
ninetyCovIndx <- 0
ninetyfiveCovIndx <- 0
ninetynineCovIndx <- 0
for (i in 1:uniqueBigrams)
{
if ((fiftyCovIndx == 0) & (coverage2[i] > 50)) {fiftyCovIndx <- i}
if ((seventyCovIndx == 0) & (coverage2[i] > 70)) {seventyCovIndx <- i}
if ((ninetyCovIndx == 0) & (coverage2[i] > 90)) {ninetyCovIndx <- i}
if ((ninetyfiveCovIndx == 0) & (coverage2[i] > 95)) {ninetyfiveCovIndx <- i}
if ((ninetynineCovIndx == 0) & (coverage2[i] > 99))
{
ninetynineCovIndx <- i
break
}
}
print(paste("50% coverage is reached with", fiftyCovIndx, "unique bigrams, or", round(fiftyCovIndx / uniqueBigrams * 100, 2), "% of the total number of unique bigrams"))
print(paste("70% coverage is reached with", seventyCovIndx, "unique bigrams, or", round(seventyCovIndx / uniqueBigrams * 100, 2), "% of the total number of unique bigrams"))
print(paste("90% coverage is reached with", ninetyCovIndx, "unique bigrams, or", round(ninetyCovIndx / uniqueBigrams * 100, 2), "% of the total number of unique bigrams"))
print(paste("95% coverage is reached with", ninetyfiveCovIndx, "unique bigrams, or", round(ninetyfiveCovIndx / uniqueBigrams * 100, 2), "% of the total number of unique bigrams"))
print(paste("99% coverage is reached with", ninetynineCovIndx, "unique bigrams, or", round(ninetynineCovIndx / uniqueBigrams * 100, 2), "% of the total number of unique bigrams"))
# Plot coverage data for bigrams
bigramCov <- data.frame(freq = coverage2)
bigramCov$index <- seq.int(nrow(bigramCov))
bigramCovPlot <- ggplot(bigramCov, aes(reorder(index, freq), freq)) +
geom_point(color = "royalblue", size = 1) +
ggtitle("Coverage of language with unique bigrams") +
labs(x = "Number of bigrams",
y = "Coverage of corpus (in %)") +
scale_x_discrete(breaks = seq(0, uniqueBigrams, 50000)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
geom_segment(aes(x = ninetyCovIndx, y = 0, xend = ninetyCovIndx, yend = 90)) +
geom_segment(aes(x = 0, y = 90, xend = ninetyCovIndx, yend = 90)) +
geom_segment(aes(x = ninetyfiveCovIndx, y = 0, xend = ninetyfiveCovIndx, yend = 95)) +
geom_segment(aes(x = 0, y = 95, xend = ninetyfiveCovIndx, yend = 95)) +
geom_segment(aes(x = ninetynineCovIndx, y = 0, xend = ninetynineCovIndx, yend = 99)) +
geom_segment(aes(x = 0, y = 99, xend = ninetynineCovIndx, yend = 99)) +
geom_text(x = 25000, y = 90,label = "90", vjust = 1, hjust = 0.5) +
geom_text(x = 25000, y = 95,label = "95", vjust = 1, hjust = 0.5) +
geom_text(x = 25000, y = 99,label = "99", vjust = 1, hjust = 0.5)
print(bigramCovPlot)