The goal of this report is to explain the exploratory analysis carried out on the SwiftKey data and to present the roadmap for developing the text prediction algorithm and the Shiny app that will expose it. The exploratory analysis is the first step towards developing the algorithm: it allows us to understand the data better and, with that knowledge, make better decisions during the modeling step.
The data used for this analysis is the English-language corpus (the en_US folder) of the SwiftKey data (downloaded as Coursera-SwiftKey.zip). Additionally, a list of profanity words (profanity.txt) has been downloaded from the Internet for the profanity filtering step.
The following table shows basic statistics for the raw data: the number of lines (documents, in tm corpus lingo) in each file, together with the mean and standard deviation of the number of characters and words per line. The blogs file has the lowest number of lines but the highest average number of characters and words per line, whereas the Twitter file is the opposite, with the highest number of lines and the lowest averages. This is consistent with Twitter's per-tweet character limit.
| Data | Number_of_lines | Avg_num_chars | SD_num_chars | Avg_num_words | SD_num_words |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 229.98695 | 258.66081 | 41.75107 | 41.75107 |
| en_US.news.txt | 1010242 | 201.16285 | 133.21714 | 34.40997 | 34.40997 |
| en_US.twitter.txt | 2360148 | 68.68045 | 37.22725 | 12.75063 | 12.75063 |
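As an illustration, statistics of this kind can be computed with base R plus the stringi package for word counts; the file paths and the choice of stringi are assumptions here, not necessarily how the numbers in the table were produced.

```r
library(stringi)  # assumed here for word counting

files <- c("en_US/en_US.blogs.txt", "en_US/en_US.news.txt", "en_US/en_US.twitter.txt")

stats <- lapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  data.frame(Data            = basename(f),
             Number_of_lines = length(lines),
             Avg_num_chars   = mean(nchar(lines)),
             SD_num_chars    = sd(nchar(lines)),
             Avg_num_words   = mean(stri_count_words(lines)),
             SD_num_words    = sd(stri_count_words(lines)))
})
do.call(rbind, stats)
```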
During the data cleaning and preparation step, we must bear in mind that the ultimate goal is to develop a text prediction model for predicting the next word as users type. Users will be typing natural language, including stopwords, so these are not removed. The cleaning and preparation steps performed are the following: strip whitespace, transform all characters to lower case, remove numbers, remove punctuation, remove all non-ASCII characters, and, finally, remove profanities (we do not want the model to suggest them). The following table shows basic statistics for the clean data. It is surprising that the biggest relative change in the average number of words is for the news data; I would have expected the Twitter data to contain more profanities than the news data.
| Data | Number_of_lines | Avg_num_chars | SD_num_chars | Avg_num_words | SD_num_words |
|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 220.89202 | 250.24536 | 40.90713 | 40.90713 |
| en_US.news.txt | 1010242 | 191.05155 | 127.79426 | 33.11583 | 33.11583 |
| en_US.twitter.txt | 2360148 | 64.28772 | 35.38398 | 12.39259 | 12.39259 |
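A minimal sketch of the cleaning pipeline described above, using the tm package; the character vector `docs` holding the lines of one file is an assumption (the same pipeline is applied to each file).

```r
library(tm)

# docs: character vector with the lines of one of the files (assumed)
corpus <- VCorpus(VectorSource(docs))
profanities <- readLines("profanity.txt", encoding = "UTF-8")

corpus <- tm_map(corpus, stripWhitespace)                     # strip whitespace
corpus <- tm_map(corpus, content_transformer(tolower))        # lower case
corpus <- tm_map(corpus, removeNumbers)                       # remove numbers
corpus <- tm_map(corpus, removePunctuation)                   # remove punctuation
corpus <- tm_map(corpus, content_transformer(function(x)
  iconv(x, from = "UTF-8", to = "ASCII", sub = "")))          # drop non-ASCII characters
corpus <- tm_map(corpus, removeWords, profanities)            # remove profanities
```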
The objective of the exploratory analysis is to understand the data well enough to make informed decisions when developing the text prediction model. Here we look at the distribution of individual words and at the relationships between words in the corpora.
The first figure shows the highest frequency words for each file.
This next figure shows the highest frequency words for each file after the stopwords have been removed.
The previous two figures show that, when stopwords are kept, many of the highest-frequency terms are common to the three files. However, once stopwords are removed, the three files share far fewer high-frequency words.
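For reference, the per-file word frequencies behind these figures can be obtained along these lines; this is a sketch assuming a cleaned tm corpus named `corpus` for one file.

```r
library(tm)

# Highest-frequency words in one cleaned corpus
tdm   <- TermDocumentMatrix(corpus)
freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(freqs, 20)

# The same counts after removing English stopwords
corpus_ns <- tm_map(corpus, removeWords, stopwords("english"))
head(sort(slam::row_sums(TermDocumentMatrix(corpus_ns)), decreasing = TRUE), 20)
```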
Analysing the relationship between the words in the corpora is already the first step towards building the text prediction model. The model will have to work in a general context, whether the user is a Twitter user, a blogger or a news writer. Therefore, from this point forward the data will be analysed as a single corpus, regardless of the file it came from. Additionally, stopwords will not be removed from the data so that they too can be predicted and offered to the user.
The following plot shows the frequency count of the most frequent bi-grams in the data.
The following plot shows the frequency count of the most frequent tri-grams in the data.
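One possible way to count these n-grams is tm together with the RWeka tokenizer; the tokenizer choice is an assumption (it requires Java), and any n-gram tokenizer would do.

```r
library(tm)
library(RWeka)  # assumed choice of n-gram tokenizer; requires Java

# corpus: cleaned tm corpus built from the three files combined (assumed)
BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

tdm_bi  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
tdm_tri <- TermDocumentMatrix(corpus, control = list(tokenize = TrigramTokenizer))

head(sort(slam::row_sums(tdm_bi),  decreasing = TRUE), 20)   # most frequent bi-grams
head(sort(slam::row_sums(tdm_tri), decreasing = TRUE), 20)   # most frequent tri-grams
```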
As mentioned earlier, to develop the next-word prediction model the data from the three files will be combined into a single corpus, and stopwords will not be removed: they are part of the language used when typing on a mobile keyboard, and keeping them will help the users.
Mobile use is increasing for activities such as email, social networking, banking and others. Some of these activities require typing, but typing on mobile devices can be bothersome. Smart keyboards make typing easier by using predictive text models: when the user starts typing a sentence, the model presents three options for what the next word might be. Next-word prediction is based on n-gram language models, and the kgrams package can be used to implement such models in R. Lately, deep learning appears to be the most popular technology for word prediction models; in particular, long short-term memory (LSTM) networks, a type of recurrent neural network (RNN), are the most frequently mentioned technique in a Google search. For this project, I will train both types of models and compare them to select the best performing one.
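A minimal sketch of the n-gram approach with the kgrams package; the order N, the Kneser-Ney smoother, the discount value and the variable names are illustrative choices, not final modeling decisions.

```r
library(kgrams)

# clean_text: character vector with the cleaned, combined corpus (assumed)
freqs <- kgram_freqs(clean_text, N = 3)                     # count 1- to 3-grams
model <- language_model(freqs, smoother = "kn", D = 0.75)   # Kneser-Ney smoothing

# Probability of a candidate next word given the typed context
probability("day" %|% "have a nice", model)

# Perplexity on held-out text (assumed object) to compare candidate models
perplexity(validation_text, model)
```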
When the next-word prediction algorithm is trained and performs well, a Shiny app will be developed in which the user can type an input phrase (multiple words), click submit, and see the model's prediction for the next word.
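A minimal sketch of such an app; predict_next_word() is a hypothetical placeholder for the trained model.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  actionButton("submit", "Submit"),
  textOutput("prediction")
)

server <- function(input, output) {
  # Re-run the prediction only when the submit button is clicked
  result <- eventReactive(input$submit, {
    predict_next_word(input$phrase)  # hypothetical wrapper around the trained model
  })
  output$prediction <- renderText(result())
}

shinyApp(ui, server)
```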