The goal of this report is to explain the exploratory analysis carried out on the SwiftKey data and to present the roadmap for developing the text prediction algorithm and the Shiny app that will expose it. The exploratory analysis is the first step towards developing the algorithm: it allows us to understand the data better and, with that knowledge, make better decisions during the modeling step.
The data used for this analysis is the English-language corpus (the en_US folder) of the SwiftKey data (downloaded as Coursera-SwiftKey.zip). Additionally, a list of profanity words (profanity.txt) has been downloaded from the Internet for the profanity filtering step.
The following table shows basic statistics for the raw data: the number of lines (documents, in tm corpus lingo) in each file, together with the mean and standard deviation of the number of characters and words per line. The blogs file has the lowest number of lines but the highest average number of characters and words per line, whereas the Twitter file is the opposite, with the highest number of lines and the lowest averages. This is consistent with Twitter's per-tweet character limit.
| Data | Number_of_lines | Avg_num_chars | SD_num_chars | Avg_num_words | SD_num_words |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 229.98695 | 258.66081 | 41.75107 | 41.75107 |
| en_US.news.txt | 1010242 | 201.16285 | 133.21714 | 34.40997 | 34.40997 |
| en_US.twitter.txt | 2360148 | 68.68045 | 37.22725 | 12.75063 | 12.75063 |
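As an illustration, statistics of this kind can be computed with base R plus the stringi package for word counts; the file paths and the choice of stringi are assumptions here, not necessarily how the numbers in the table were produced.

```r
library(stringi)  # assumed here for word counting

files <- c("en_US/en_US.blogs.txt", "en_US/en_US.news.txt", "en_US/en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(Data            = basename(f),
             Number_of_lines = length(lines),
             Avg_num_chars   = mean(nchar(lines)),
             SD_num_chars    = sd(nchar(lines)),
             Avg_num_words   = mean(stri_count_words(lines)),
             SD_num_words    = sd(stri_count_words(lines)))
})
do.call(rbind, stats)
```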
During the data cleaning and preparation step, we must bear in mind that the ultimate goal is to develop a text prediction model for predicting the next word as users type. Users will be typing natural language, including stopwords, so these are not removed. The cleaning and preparation steps performed are the following: strip whitespace, transform all characters to lower case, remove numbers, remove punctuation, remove all non-ASCII characters, and, finally, remove profanities (we do not want the model to suggest them). The following table shows basic statistics for the clean data. It is surprising that the biggest relative change in the average number of words is for the news data; I would have expected the Twitter data to contain more profanities than the news data.
| Data | Number_of_lines | Avg_num_chars | SD_num_chars | Avg_num_words | SD_num_words |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 220.89202 | 250.24536 | 40.90713 | 40.90713 |
| en_US.news.txt | 1010242 | 191.05155 | 127.79426 | 33.11583 | 33.11583 |
| en_US.twitter.txt | 2360148 | 64.28772 | 35.38398 | 12.39259 | 12.39259 |
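A minimal sketch of the cleaning pipeline described above, using the tm package; the character vector `docs` holding the lines of one file is an assumption (the same pipeline is applied to each file).

```r
library(tm)

# docs: character vector with the lines of one of the files (assumed)
corpus <- VCorpus(VectorSource(docs))
profanities <- readLines("profanity.txt", encoding = "UTF-8")

corpus <- tm_map(corpus, stripWhitespace)                     # strip whitespace
corpus <- tm_map(corpus, content_transformer(tolower))        # lower case
corpus <- tm_map(corpus, removeNumbers)                       # remove numbers
corpus <- tm_map(corpus, removePunctuation)                   # remove punctuation
corpus <- tm_map(corpus, content_transformer(function(x)
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")))          # drop non-ASCII characters
corpus <- tm_map(corpus, removeWords, profanities)            # remove profanities
```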
The objective of the exploratory analysis is to understand the data well enough to make informed decisions when developing the text prediction model. Here we look at the distribution of individual words and at the relationships between words in the corpora.
The first figure shows the highest frequency words for each file.
This next figure shows the highest frequency words for each file after the stopwords have been removed.
The previous two figures show that, when stopwords are kept, many of the highest-frequency terms are common to the three files. However, once stopwords are removed, the three files share far fewer high-frequency words.
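For reference, the per-file word frequencies behind these figures can be obtained along these lines; this is a sketch assuming a cleaned tm corpus named `corpus` for one file.

```r
library(tm)

# Highest-frequency words in one cleaned corpus
tdm   <- TermDocumentMatrix(corpus)
freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freqs, 20)

# The same counts after removing English stopwords
corpus_ns <- tm_map(corpus, removeWords, stopwords("english"))
head(sort(slam::row_sums(TermDocumentMatrix(corpus_ns)), decreasing = TRUE), 20)
```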
Analysing the relationship between the words in the corpora is already the first step towards building the text prediction model. The model will have to work in a general context, whether the user is a Twitter user, a blogger or a news writer. Therefore, from this point forward the data will be analysed as a single corpus, regardless of the file it came from. Additionally, stopwords will not be removed from the data so that they too can be predicted and offered to the user.
The following plot shows the frequency count of the most frequent bi-grams in the data.
The following plot shows the frequency count of the most frequent tri-grams in the data.
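One possible way to count these n-grams is tm together with the RWeka tokenizer; the tokenizer choice is an assumption (it requires Java), and any n-gram tokenizer would do.

```r
library(tm)
library(RWeka)  # assumed choice of n-gram tokenizer; requires Java

# corpus: cleaned tm corpus built from the three files combined (assumed)
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm_bi  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

head(sort(slam::row_sums(tdm_bi),  decreasing = TRUE), 20)   # most frequent bi-grams
head(sort(slam::row_sums(tdm_tri), decreasing = TRUE), 20)   # most frequent tri-grams
```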
As mentioned earlier, to develop the next-word prediction model the data from the three files will be combined into a single corpus, and stopwords will not be removed: they are part of the language used when typing on a mobile keyboard, and keeping them will help the users.
Mobile use is increasing for activities such as email, social networking, banking and others. Some of these activities require typing, but typing on mobile devices can be bothersome. Smart keyboards make typing easier by using predictive text models: when the user starts typing a sentence, the model presents three options for what the next word might be. Next-word prediction is based on n-gram language models, and the kgrams package can be used to implement such models in R. Lately, deep learning appears to be the most popular technology for word prediction models; in particular, long short-term memory (LSTM) networks, a type of recurrent neural network (RNN), are the most frequently mentioned technique in a Google search. For this project, I will train both types of models and compare them to select the best performing one.
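A minimal sketch of the n-gram approach with the kgrams package; the order N, the Kneser-Ney smoother, the discount value and the variable names are illustrative choices, not final modeling decisions.

```r
library(kgrams)

# clean_text: character vector with the cleaned, combined corpus (assumed)
freqs <- kgram_freqs(clean_text, N = 3)                     # count 1- to 3-grams
model <- language_model(freqs, smoother = "kn", D = 0.75)   # Kneser-Ney smoothing

# Probability of a candidate next word given the typed context
probability("day" %|% "have a nice", model)

# Perplexity on held-out text (assumed object) to compare candidate models
perplexity(validation_text, model)
```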
When the next-word prediction algorithm is trained and performs well, a Shiny app will be developed in which the user can type an input phrase (multiple words), click submit, and see the model's prediction for the next word.
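A minimal sketch of such an app; predict_next_word() is a hypothetical placeholder for the trained model.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  actionButton("submit", "Submit"),
  textOutput("prediction")
)

server <- function(input, output) {
  # Re-run the prediction only when the submit button is clicked
  result <- eventReactive(input$submit, {
    predict_next_word(input$phrase)  # hypothetical wrapper around the trained model
  })
  output$prediction <- renderText(result())
}

shinyApp(ui, server)
```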