Overview

The aim of this project is to build a predictive text model that makes typing easier by suggesting options for the next word based on the words already typed. This report outlines the exploratory analysis that was done and summarises how the model is being built.

The corpora used for this analysis were obtained from different types of websites and sorted into three files: news, blogs and Twitter. A sample of the data was cleaned and analysed, then used to create the model.

Reading and Sampling the Data

The data was read into three separate objects named twitter, blogs and news.

Some basic summary information for the datasets is provided below, including the size of each file in MB, the number of lines and the number of words in each file.

        # str_count() below comes from the stringr package
        library(stringr)

        # Get size of files in MB
        twitterSize <- file.info("../data/en_US.twitter.txt")$size / 1000000
        blogsSize <- file.info("../data/en_US.blogs.txt")$size / 1000000
        newsSize <- file.info("../data/en_US.news.txt")$size / 1000000

        # Get the number of words in each file
        numWords_twitter <- sum(str_count(twitter, '\\w+'))
        numWords_blogs <- sum(str_count(blogs, '\\w+'))
        numWords_news <- sum(str_count(news, '\\w+'))

        # Combine the statistics into a summary table
        summaryStats <- data.frame("File_Name" = c("Twitter", "Blogs", "News"),
                                   "File_Size_MB" = c(twitterSize, blogsSize, newsSize),
                                   "Number_Of_Lines" = c(length(twitter), length(blogs), length(news)),
                                   "Number_Of_Words" = c(numWords_twitter, numWords_blogs,
                                                         numWords_news))
        summaryStats
##   File_Name File_Size_MB Number_Of_Lines Number_Of_Words
## 1   Twitter     167.1053         2360148        31130580
## 2     Blogs     210.1600          899288        38601176
## 3      News     205.8119         1010242        35806831

Since the files are very large, ten percent of each dataset was sampled and the samples were concatenated into one file for use in the rest of the analysis. The sample should give a good approximation of the results that would be obtained if all the data were used.
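The sampling code is not shown above; a minimal sketch of one way to do it, assuming the twitter, blogs and news vectors from the previous step (the seed, the sampleLines helper and the output file name are illustrative, not taken from this report), is:

        # Illustrative sketch: take a 10% random sample of each dataset and
        # concatenate the samples into a single file.
        set.seed(1234)  # for reproducibility
        sampleLines <- function(x, fraction = 0.1) {
                x[sample(length(x), size = round(length(x) * fraction))]
        }
        sampleText <- c(sampleLines(twitter), sampleLines(blogs), sampleLines(news))
        writeLines(sampleText, "../data/sample.txt")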

Create Corpus and Clean the Data

A corpus was created from the sample file to be used for further analysis.
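The corpus construction step itself is not shown; a minimal sketch using the tm package (sampleText is an assumed name for the sampled lines) might look like:

        # Illustrative sketch: build a volatile corpus from the sampled text.
        library(tm)
        sampleCorpus <- VCorpus(VectorSource(sampleText))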

The corpus was transformed by converting all text to lowercase, removing numbers, removing punctuation (while preserving intra-word dashes and contractions), removing profanity and stripping extra whitespace.

Note that a decision was made not to remove stopwords, such as "a", "the" and "and", so that these can also be predicted by the model.

        # The %>% pipe below comes from the magrittr package
        library(magrittr)

        cleanCorpus <- tm_map(sampleCorpus, content_transformer(tolower)) %>%  # convert to lowercase
                        tm_map(removeNumbers) %>%                              # remove numbers
                        tm_map(removePunctuation, preserve_intra_word_dashes = TRUE,
                               preserve_intra_word_contractions = TRUE) %>%    # remove punctuation
                        tm_map(removeWords, profanity) %>%                     # profanity: a character vector of words to remove, defined earlier
                        tm_map(stripWhitespace)                                # collapse extra whitespace

Exploratory Analysis

A term-document matrix was created from the corpus. This is simply a table where the rows represent the words (terms) found in the corpus and the columns represent the documents; each cell records the number of times a term appears in a document. This format makes it easier to manipulate the data and obtain frequency tables.
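The code for this step is not shown; a minimal sketch using tm (the object names tdm and wordFreq are illustrative) might look like:

        # Illustrative sketch: build a term-document matrix and a word
        # frequency table sorted from most to least frequent.
        tdm <- TermDocumentMatrix(cleanCorpus)
        wordFreq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
        head(wordFreq, 10)  # ten most frequent words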

Most Frequent n-grams

An n-gram is a sequence of n words that occur together in the data. The most frequent unigrams (1 word), bigrams (2 words) and trigrams (3 words) were obtained and plotted below. Note that most n-grams include stopwords as these were left in the data to be used in the prediction model.
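The tokenizer used to build the n-gram tables is not shown in this section; one common approach with tm is RWeka's NGramTokenizer, sketched below for bigrams (the exact tokenizer used in this project may differ):

        # Illustrative sketch: build a bigram term-document matrix using
        # RWeka's NGramTokenizer together with tm.
        library(RWeka)
        BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
        tdm_bigram <- TermDocumentMatrix(cleanCorpus,
                                         control = list(tokenize = BigramTokenizer))
        bigramFreq <- sort(rowSums(as.matrix(tdm_bigram)), decreasing = TRUE)
        head(bigramFreq, 10)  # ten most frequent bigrams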

Next Steps

The frequency n-gram tables obtained from this analysis will be used as the training data to create the prediction model. Note that the highest-order n-grams used will be trigrams. The unigram table contains over 244,000 unique words, but only 500 words are required to cover 55% of the words in the data, and 13,000 words cover about 91%. After removing words with a frequency of less than 2 (about 231,000 words), the remaining words will still cover over 90% of the words in the data while making the model smaller and hopefully more efficient. To cater for words that do not appear in the table, a generic token, UNK (for "unknown"), will be included with a frequency of 1. If a word is not found in the vocabulary, the model defaults to this token. The model will be trained using this data and will treat UNK like a regular word.
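As an illustration of this pruning step, a minimal sketch (assuming the wordFreq table from the exploratory analysis; the unigramFreq name is illustrative) might look like:

        # Illustrative sketch: drop words that appear only once and add a
        # generic UNK token to stand in for out-of-vocabulary words.
        unigramFreq <- wordFreq[wordFreq >= 2]
        unigramFreq["UNK"] <- 1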

Using the frequencies, a probability for the occurrence of each n-gram will be calculated. Where a word appears in a context not seen in the training data, a smoothing method will be applied to prevent the model from assigning zero probability to these unseen n-grams. Smoothing takes some probability mass from the n-grams that occur more frequently and assigns it to n-grams never seen before. The Kneser-Ney smoothing method will be used in this project.
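As a rough illustration of the idea, a minimal sketch of interpolated Kneser-Ney for bigrams (the bigramFreq data frame, with one row per distinct bigram and columns word1, word2 and count, is an assumed structure, not an object from this report) might look like:

        # Illustrative sketch: interpolated Kneser-Ney probability of w2
        # following w1, with a fixed discount d.
        kneserNeyBigram <- function(w1, w2, bigramFreq, d = 0.75) {
                # Counts of the specific bigram and of its history word
                c_w1w2 <- sum(bigramFreq$count[bigramFreq$word1 == w1 &
                                               bigramFreq$word2 == w2])
                c_w1 <- sum(bigramFreq$count[bigramFreq$word1 == w1])
                if (c_w1 == 0) return(NA)  # unseen history: back off to unigrams

                # Discounted maximum-likelihood term
                discounted <- max(c_w1w2 - d, 0) / c_w1

                # lambda: discount mass spread over the distinct continuations of w1
                lambda <- (d / c_w1) * sum(bigramFreq$word1 == w1)

                # Continuation probability: share of distinct bigram types ending in w2
                pContinuation <- sum(bigramFreq$word2 == w2) / nrow(bigramFreq)

                discounted + lambda * pContinuation
        }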

The probabilities that are calculated can be thought of as transition probabilities, allowing the model to predict (or transition to) the next word given the current word. This is similar to how Markov chains work: predicting the next state given the current state using probabilities. Markov chains therefore provide a convenient way to store and query n-gram probabilities, implemented with a transition matrix, which is simply a matrix of these probabilities.

The Shiny app to be developed will take the last one or two words entered by the user as input and predict the next word the user might enter. Three options will be provided to the user. The plan is to predict and provide options after each word typed by the user, similar to the smart keyboards used on mobile devices.
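As a rough illustration of the lookup the app might perform (all names below are assumptions, not code from this project):

        # Illustrative sketch: return the three most probable next words for a
        # given preceding word, using a bigram probability table with columns
        # word1, word2 and prob (an assumed structure).
        predictNext <- function(w1, bigramProbs, n = 3) {
                candidates <- bigramProbs[bigramProbs$word1 == w1, ]
                candidates <- candidates[order(candidates$prob, decreasing = TRUE), ]
                head(candidates$word2, n)
        }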