This report presents an initial exploratory data analysis (EDA) supporting the predictive text model to be developed. The data for this project come from the Coursera SwiftKey dataset, which consists of internet-derived text in English, Finnish, German, and Russian. For this EDA I use the English data: three files of roughly 200 Mb each, collected from news sites, blogs, and Twitter.
| File | Size (Mb) | Date | Lines |
|---|---|---|---|
| final/en_US/en_US.twitter.txt | 159.3641 | 2014-07-22 10:12:00 | 2,360,148 |
| final/en_US/en_US.news.txt | 196.2775 | 2014-07-22 10:13:00 | 1,010,242 |
| final/en_US/en_US.blogs.txt | 200.4242 | 2014-07-22 10:13:00 | 899,288 |

The extracted data has 4,269,678 lines in total and occupies 799.3494 Mb of memory.
The full R code supporting this report can be found on GitHub and will not be reproduced here, though brief illustrative sketches of the main steps appear below.
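As an illustration of the first step, reading the three files into a single character vector might look like the following (a sketch only; the exact options used in the actual code may differ):

```r
# Read the three English files into one character vector.
# skipNul = TRUE guards against embedded NUL bytes in the raw files.
files <- c("final/en_US/en_US.twitter.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.blogs.txt")
corpus_lines <- unlist(lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE))

length(corpus_lines)                              # total number of lines
format(object.size(corpus_lines), units = "Mb")   # memory occupied
```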
Once the data were read into memory, the lines were shuffled and split at random, with 98% taken for training and 2% for testing. With such a large volume of data, 2% is more than sufficient for testing.
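A minimal sketch of the shuffle-and-split step (the seed value here is arbitrary; the 98/2 ratio is as described above):

```r
set.seed(1234)                                       # arbitrary seed for reproducibility
corpus_lines <- sample(corpus_lines)                 # shuffle the lines
n_train      <- floor(0.98 * length(corpus_lines))   # 98% for training
train_lines  <- corpus_lines[seq_len(n_train)]
test_lines   <- corpus_lines[-seq_len(n_train)]
```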
The R package quanteda was used to transform the raw data into tables of n-grams of size 1 to 5, together with their observed counts in the training data. This process was very memory-intensive and required splitting the data into chunks of 100,000 lines, which were processed one at a time.
Converting n-gram tokens into document-feature matrices was the most taxing part of the process.
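In outline, the per-chunk processing looks roughly like this (a sketch; `count_ngrams()` is a hypothetical helper name, and the cleaning options are shown separately below):

```r
library(quanteda)

# Count the n-grams of a single 100,000-line chunk.
count_ngrams <- function(chunk_lines, n) {
  toks   <- tokens(chunk_lines)                       # cleaning options omitted here
  ngrams <- tokens_ngrams(toks, n = n, concatenator = " ")
  d      <- dfm(ngrams)                               # the memory-hungry step
  counts <- colSums(d)                                # observed count of each n-gram
  data.frame(ngram = names(counts), count = as.integer(counts),
             stringsAsFactors = FALSE)
}

# The per-chunk tables are then merged by summing counts on `ngram`,
# e.g. with aggregate() or data.table.
```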
On my 8-year-old Dell Precision T3500, with a 3.33 GHz six-core hyperthreaded Xeon processor, 24 GB of RAM, and a WDC WD2002FAEX-007BA0 hard drive, running Kubuntu 16.04.3 Linux, the runtimes were:
The cleaning performed on the data was fairly light: replacing abbreviations, ordinals, and symbols; converting to lowercase; and removing numbers, punctuation, and extra whitespace. As I build and test the model, I may add or remove cleaning steps.
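A sketch of what that cleaning might look like at the tokenisation stage (the abbreviation and ordinal patterns here are illustrative, not the exact ones used):

```r
clean_tokens <- function(chunk_lines) {
  txt <- gsub("\\b(Mr|Mrs|Dr|St)\\.", "\\1", chunk_lines)   # example abbreviation handling
  txt <- gsub("(\\d+)(st|nd|rd|th)\\b", "\\1", txt)          # strip ordinal suffixes
  toks <- tokens(txt,
                 remove_punct   = TRUE,
                 remove_numbers = TRUE,
                 remove_symbols = TRUE)
  tokens_tolower(toks)                                       # convert to lowercase
}
```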
The analysis itself was fairly light, since the goal was not to analyse the text for meaning but simply to determine which sequences of words occur most frequently. To that end, I generated barplots of the most frequent n-grams of each size. Also included is the theoretical Zipf frequency; as you can see, it holds best for 1-grams:
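For reference, such a barplot with the theoretical Zipf curve overlaid can be produced along these lines (a sketch, assuming `unigram_counts` is the 1-gram count table built earlier; under Zipf's law the word of rank r is expected to occur about count[1]/r times):

```r
unigram_counts <- unigram_counts[order(-unigram_counts$count), ]
top30 <- head(unigram_counts, 30)

# Theoretical Zipf frequency: most frequent count divided by rank.
zipf <- top30$count[1] / seq_len(nrow(top30))

mids <- barplot(top30$count, names.arg = top30$ngram, las = 2,
                main = "Top 30 unigrams", ylab = "Count")
lines(mids, zipf, col = "red", lwd = 2)
```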
Given that these data were collected from the internet, I was quite surprised that the word “cat” does not appear in the top 30 unigrams. So, how far down the list does it occur?
It turns out to be the 1685th most frequent unigram.
Just for fun, let’s plot a word cloud of common bigrams:
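One way to draw it, using the wordcloud package (a sketch; `bigram_counts` is assumed to be the 2-gram count table from the earlier processing):

```r
library(wordcloud)

top_bigrams <- head(bigram_counts[order(-bigram_counts$count), ], 100)
wordcloud(words = top_bigrams$ngram, freq = top_bigrams$count,
          scale = c(3, 0.5), random.order = FALSE)
```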
Since we’re not performing sentiment analysis or trying to glean any deeper meaning from the words beyond the order in which they tend to appear, I don’t believe much more analysis at this point would be particularly fruitful.
A simple predictive text model can be constructed by splitting each n-gram (n = 2 to 5) into its first n - 1 words (the input, X) and its last word (the prediction, y). User input can then be matched against X, and a list of predictions y returned in descending order of observed count in the training data. If no match is found in any of the n-gram tables (n = 2 to 5), the 1-grams will at least return a list of the most common words.
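A minimal sketch of that lookup scheme, here using data.table (my choice for illustration; the count tables are assumed to have columns `ngram` and `count`, and the function names are hypothetical):

```r
library(data.table)

# Split each n-gram into the input context (x) and the predicted word (y).
make_lookup <- function(counts) {
  dt <- as.data.table(counts)
  dt[, x := sub("\\s+\\S+$", "", ngram)]   # all but the last word
  dt[, y := sub("^.*\\s", "", ngram)]      # the last word only
  setorder(dt, x, -count)
  dt[, .(x, y, count)]
}

# lookups[[n]] holds the table built from (n+1)-grams, i.e. an n-word context.
predict_next <- function(input, lookups, unigrams, k = 3) {
  words <- tolower(strsplit(trimws(input), "\\s+")[[1]])
  for (n in rev(seq_along(lookups))) {            # longest context first
    context <- paste(tail(words, n), collapse = " ")
    hits <- lookups[[n]][x == context]
    if (nrow(hits) > 0) return(head(hits$y, k))
  }
  head(unigrams$ngram[order(-unigrams$count)], k)  # fall back to common words
}
```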
The test dataset will be split into X and y in the same way, so that the model's predictions can be compared against the true last words.
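Evaluation could then be as simple as checking whether each held-out last word appears among the top-k predictions (again a sketch, reusing the hypothetical helpers above):

```r
evaluate <- function(test_counts, lookups, unigrams, k = 3) {
  x <- sub("\\s+\\S+$", "", test_counts$ngram)     # test inputs
  y <- sub("^.*\\s", "", test_counts$ngram)        # true last words
  hit <- mapply(function(xi, yi) yi %in% predict_next(xi, lookups, unigrams, k),
                x, y)
  mean(hit)                                        # top-k accuracy
}
```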
When complete, this will be packaged into a Shiny app for evaluation.