The aim of this project is to build a predictive text model that makes typing easier by suggesting options for the next word based on the words already typed. This report outlines the exploratory analysis that was carried out and summarises how the model is being built.
The corpora used for this analysis were obtained from different types of websites and sorted into three files: news, blogs and twitter. A sample of the data was cleaned, analysed and then used to create the model.
The data was read into three separate objects named twitter, blogs and news.
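A minimal sketch of this step is shown below, assuming the files sit under ../data/ (as in the code that follows) and are read with readLines():
# Read each file into a character vector, one element per line
twitter <- readLines("../data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("../data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("../data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)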
Some basic summary information for the datasets is provided below, including the size of each file in MB as well as the number of lines and the number of words in each file.
library(stringr)  # str_count() is used to count words

# Get the size of each file in MB
twitterSize <- file.info("../data/en_US.twitter.txt")$size / 1000000
blogsSize <- file.info("../data/en_US.blogs.txt")$size / 1000000
newsSize <- file.info("../data/en_US.news.txt")$size / 1000000

# Get the number of words in each file
numWords_twitter <- sum(str_count(twitter, '\\w+'))
numWords_blogs <- sum(str_count(blogs, '\\w+'))
numWords_news <- sum(str_count(news, '\\w+'))

summaryStats <- data.frame("File_Name" = c("Twitter", "Blogs", "News"),
                           "File_Size_MB" = c(twitterSize, blogsSize, newsSize),
                           "Number_Of_Lines" = c(length(twitter), length(blogs), length(news)),
                           "Number_Of_Words" = c(numWords_twitter, numWords_blogs, numWords_news))
summaryStats
## File_Name File_Size_MB Number_Of_Lines Number_Of_Words
## 1 Twitter 167.1053 2360148 31130580
## 2 Blogs 210.1600 899288 38601176
## 3 News 205.8119 1010242 35806831
Since the files are very large, ten percent of each dataset was sampled and concatenated into one file for use in the rest of the analysis. This sample should give a reasonable approximation of the results that would be obtained if all the data were used.
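One way to take this sample is sketched below; the seed, the use of round() and the output file name are assumptions rather than the exact code used:
# Sample 10% of the lines from each dataset and write them to a single file
set.seed(1234)
sampleText <- c(sample(twitter, round(length(twitter) * 0.1)),
                sample(blogs, round(length(blogs) * 0.1)),
                sample(news, round(length(news) * 0.1)))
writeLines(sampleText, "../data/en_US.sample.txt")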
A corpus was created from the sample file to be used for further analysis.
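A sketch of this step using the tm package (the sample file name is an assumption carried over from the sketch above):
library(tm)

# Build a volatile corpus from the sampled text, one document per line
sampleCorpus <- VCorpus(VectorSource(readLines("../data/en_US.sample.txt")))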
The corpus was transformed by converting all text to lowercase, removing numbers, removing punctuation, removing profanity and stripping extra whitespace, as shown in the code below.
Note that a decision was made not to remove stopwords, such as "a", "the" and "and", so that these can also be predicted by the model.
library(magrittr)  # provides the %>% pipe

# Clean the corpus; `profanity` is assumed to be a character vector of words to remove
cleanCorpus <- tm_map(sampleCorpus, content_transformer(tolower)) %>%
    tm_map(removeNumbers) %>%
    tm_map(removePunctuation, preserve_intra_word_dashes = TRUE,
           preserve_intra_word_contractions = TRUE) %>%
    tm_map(removeWords, profanity) %>%
    tm_map(stripWhitespace)
A term document matrix was created from the corpus. This is simply a table where the rows represent the words (terms) found in the corpus and the columns represent the documents; each cell holds the number of times a word appears in a given document. This format makes it easier to manipulate the data and obtain frequency tables.
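A sketch of how the matrix and a unigram frequency table can be obtained (the object names are assumptions):
# Term document matrix: rows are terms, columns are documents
tdm <- TermDocumentMatrix(cleanCorpus)

# Unigram frequency table: total count of each word across all documents
unigramFreq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(unigramFreq)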
An n-gram is a sequence of n words that occur together in the data. The most frequent unigrams (1 word), bigrams (2 words) and trigrams (3 words) were obtained and plotted below. Note that most n-grams include stopwords as these were left in the data to be used in the prediction model.
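The bigram and trigram tables can be built the same way with an n-gram tokenizer; the sketch below uses RWeka, which is one common option (quanteda or tidytext would work equally well), and assumes the object names introduced above:
library(RWeka)

# Tokenizers that split the text into 2-word and 3-word sequences
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigramTDM <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = BigramTokenizer))
trigramTDM <- TermDocumentMatrix(cleanCorpus, control = list(tokenize = TrigramTokenizer))

bigramFreq <- sort(slam::row_sums(bigramTDM), decreasing = TRUE)
trigramFreq <- sort(slam::row_sums(trigramTDM), decreasing = TRUE)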
The n-gram frequency tables obtained from this analysis will be used as the training data for the prediction model. Note that the highest order used will be trigrams. In the unigram table there are over 244,000 unique words, but only 500 words are required to cover 55% of the words in the data, and 13,000 words are required to cover about 91%. After removing words with a frequency of less than 2 (about 231,000 words), the remaining words will still cover over 90% of the words in the data, while also making the model smaller and hopefully more efficient. To cater for words that do not appear in the table, a generic token, UNK (UNKNOWN), will be included with a frequency of 1. If a word is not found in the vocabulary, the model defaults to this token. The model will be trained using this data and will treat UNK like a regular word.
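The coverage figures and the pruning step described above can be computed roughly as follows (a sketch building on the unigramFreq table; names are assumptions):
# Cumulative coverage: proportion of all word occurrences covered by the top-n words
coverage <- cumsum(unigramFreq) / sum(unigramFreq)
which(coverage >= 0.55)[1]   # number of words needed to cover 55% of the data
which(coverage >= 0.91)[1]   # number of words needed to cover about 91%

# Drop words seen only once and add a generic UNK token for unseen words
vocab <- unigramFreq[unigramFreq >= 2]
vocab <- c(vocab, UNK = 1)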
Using the frequencies, a probability for the occurrence of each n-gram will be calculated. Where a word appears in a context not seen in the training data, a smoothing method will be applied to prevent the model from assigning zero probability to these unseen n-grams. Smoothing takes some probability mass from the n-grams that occur more frequently and assigns it to n-grams never seen before. The Kneser-Ney smoothing method will be used in this project.
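For reference, a standard form of the interpolated Kneser-Ney estimate for a bigram is:

$$
P_{KN}(w_i \mid w_{i-1}) = \frac{\max\big(c(w_{i-1} w_i) - d,\; 0\big)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)
$$

$$
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\,\big|\{w : c(w_{i-1} w) > 0\}\big|, \qquad
P_{cont}(w_i) = \frac{\big|\{w' : c(w' w_i) > 0\}\big|}{\big|\{(w', w) : c(w' w) > 0\}\big|}
$$

Here d is a fixed discount (often around 0.75), λ redistributes the discounted mass over the words that can follow the context, and the continuation probability is based on the number of distinct contexts a word appears in rather than its raw count.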
The probabilities that are calculated can be thought of as transition probabilities, allowing the model to predict (or transition to) the next word given the current word. This is similar to how Markov chains work: predicting the next state from the current state using probabilities. Markov chains therefore provide a convenient way to store and query n-gram probabilities, implemented with a transition matrix, which is simply a matrix of transition probabilities.
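As an illustration, the sketch below turns the trigram table into conditional (transition) probabilities and looks up the most likely next words for a two-word prefix. data.table is used here as one convenient option; the function and column names are assumptions, not the project's actual implementation:
library(data.table)

# Split each trigram into its three words and compute P(w3 | w1, w2) from the counts
trigramDT <- data.table(ngram = names(trigramFreq), count = as.numeric(trigramFreq))
trigramDT[, c("w1", "w2", "w3") := tstrsplit(ngram, " ", fixed = TRUE)]
trigramDT[, prob := count / sum(count), by = .(w1, w2)]

# Return the top n candidate next words for a two-word prefix
predictNext <- function(word1, word2, n = 3) {
  trigramDT[w1 == word1 & w2 == word2][order(-prob)][1:n, w3]
}
predictNext("one", "of")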
The Shiny app to be developed will take the last one or two words entered by the user as input and predict the next word the user might enter. Three options will be provided to the user. The plan is to predict and provide options after each word the user types, similar to the smart keyboards used on mobile devices.