The objective of this project is to build a Natural Language Processing next-word prediction algorithm using a standard n-gram model. The algorithm is trained on the dataset provided on the Coursera website. This report presents an initial exploratory analysis of the dataset and a first prediction algorithm.
The data corpus consists of three files (blogs, news, and Twitter), each containing lines of text. The full dataset is too large to work with directly when building a next-word prediction algorithm, so I use random sampling to construct a smaller but representative dataset: approximately 10% of the lines in each file are randomly sampled and used as the corpus for training the next-word prediction algorithm.
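A minimal sketch of this sampling step is shown below. The file names and the 10% fraction follow the description above; the helper `sample_file`, the seed, and the use of `rbinom` are illustrative assumptions rather than the exact code used.

```r
set.seed(1234)                       # assumed seed, for reproducibility only
sample_file <- function(path, frac = 0.10) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  # keep each line independently with probability ~frac
  lines[rbinom(length(lines), 1, frac) == 1]
}
blogs   <- sample_file("en_US.blogs.txt")
news    <- sample_file("en_US.news.txt")
twitter <- sample_file("en_US.twitter.txt")
```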
The next step in data cleaning is to split lines containing multiple sentences into separate lines, one sentence per line. This matters because the prediction algorithm is based on the n-gram model, for reasons elaborated in the n-gram model section below. Once multi-sentence lines are split into individual sentences, the number of lines in the dataset increases, as the table below shows (a sketch of the splitting step follows the table).
| File              | Lines (original) | Lines (10% sample) | Lines (after sentence splitting) |
|-------------------|-----------------:|-------------------:|---------------------------------:|
| en_US.blogs.txt   |          899,288 |             90,303 |                           215,893 |
| en_US.news.txt    |           77,259 |              7,614 |                            14,660 |
| en_US.twitter.txt |        2,360,148 |            236,352 |                           282,520 |
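A sketch of the sentence-splitting step summarised in the table. The regular expression used to mark sentence ends is an assumption; a more robust sentence tokenizer could be substituted.

```r
# Split each line into sentences at ., ! or ? followed by whitespace
# (a simple assumed rule, not the only possible one)
split_sentences <- function(lines) {
  out <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  out[nchar(trimws(out)) > 0]        # drop empty fragments
}
sampled   <- c(blogs, news, twitter) # objects from the sampling sketch above
sentences <- split_sentences(sampled)
```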
The final step in the cleaning procedure is to remove punctuation marks and leading and trailing whitespace. Non-alphanumeric tokens such as emoticons are removed, and all words are converted to lowercase. After cleaning, the three datasets are combined into one large corpus, which serves as the final dataset.
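A sketch of these cleaning operations using base-R string functions; the exact regular expressions and the helper name `clean_lines` are illustrative assumptions.

```r
clean_lines <- function(x) {
  x <- tolower(x)                       # lowercase everything
  x <- gsub("[^a-z0-9' ]", " ", x)      # drop punctuation, emoticons, other symbols
  x <- gsub("\\s+", " ", x)             # collapse repeated whitespace
  trimws(x)                             # strip leading/trailing whitespace
}
corpus <- clean_lines(sentences)        # 'sentences' from the previous sketch
```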
## [1] "Number of lines in final dataset is 513073"
It is worth noting that the cleaning procedure does not yet include more advanced, but likely important, filtering steps such as removing profanity and non-dictionary words. These are left for a later stage, after an initial prediction algorithm has been built.
## [1] "Total number of words in the corpus is 5996932"
The most frequently occurring words in the corpus are shown in the following plot. As one would anticipate, the most frequent words are common function words such as “the” and “to”.
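A sketch of how the word frequencies behind that plot can be tabulated; the object name `word_freq` is an illustrative assumption reused in later sketches.

```r
# Count word frequencies across the cleaned corpus
tokens    <- unlist(strsplit(corpus, " ", fixed = TRUE))
tokens    <- tokens[tokens != ""]
word_freq <- sort(table(tokens), decreasing = TRUE)
head(word_freq, 10)                    # ten most frequent words
barplot(head(word_freq, 20), las = 2,
        main = "Most frequent words in the corpus")
```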
An n-gram is simply a contiguous sequence of n items; in a next-word prediction setting, an n-gram is a sequence of n words. The model predicts the next word by looking at the most frequently occurring n-grams. For instance, to predict the fourth word given three words, we examine all 4-grams whose first three words match the query and identify the most common fourth word among them. For my initial algorithm, I construct n-grams of size at most 4 from the corpus. Note that only words within the same sentence can be contiguous, which is why lines containing multiple sentences had to be split into separate sentences during data cleaning.
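As a toy illustration of this lookup, suppose we already have counts of the fourth word across all 4-grams starting with some three-word query; the counts below are made up purely for illustration.

```r
# Hypothetical counts of the 4th word in 4-grams sharing one three-word prefix
completions <- c(the = 52, a = 17, this = 11, that = 3)
names(which.max(completions))   # the most frequent completion is the prediction
```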
Here, I construct all bigrams (n = 2) and trigrams (n = 3) from the data corpus using the R package NLP. The most frequently occurring bigrams and trigrams are shown in the bar plots below.
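A sketch of the n-gram construction with `NLP::ngrams`; building the n-grams per line (so they never cross sentence boundaries) and tabulating them with `table` are my own assumed choices here.

```r
library(NLP)

make_ngrams <- function(lines, n) {
  unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    w <- w[w != ""]
    if (length(w) < n) return(character(0))
    # collapse each n-gram (a character vector of length n) into one string
    vapply(ngrams(w, n), paste, character(1), collapse = " ")
  }))
}
bigram_freq  <- sort(table(make_ngrams(corpus, 2L)), decreasing = TRUE)
trigram_freq <- sort(table(make_ngrams(corpus, 3L)), decreasing = TRUE)
head(bigram_freq, 5)    # most frequent bigrams
```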
A key issue with a dataset of this size is memory and time constraints, so it is important to find efficient ways of storing and manipulating the n-grams. One approach is to drop words that occur very infrequently in the corpus, together with every n-gram in which they appear. As a first look, I compute the number of unique words needed to cover a given fraction of all word occurrences in the corpus:
## [1] "Number of unique words that cover 50% of the words in the corpus is 120"
## [1] "Number of unique words that cover 90% of the words in the corpus is 6998"
As we can see, the corpus contains a long tail of rare, infrequently occurring words. Memory requirements could be reduced significantly by pruning n-grams that contain these rare words.
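One way to act on this observation, sketched under the assumption that n-grams are stored as space-separated strings in frequency tables such as `bigram_freq` from the earlier sketch.

```r
# Keep only the words covering 90% of the corpus; drop n-grams containing others
vocab <- names(word_freq)[seq_len(coverage(word_freq, 0.9))]
prune_ngrams <- function(ngram_freq, vocab) {
  keep <- vapply(strsplit(names(ngram_freq), " ", fixed = TRUE),
                 function(w) all(w %in% vocab), logical(1))
  ngram_freq[keep]
}
bigram_freq_small <- prune_ngrams(bigram_freq, vocab)
```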
I constructed an initial word-prediction algorithm based on 4-grams: when presented with a sequence of three words, I predict the fourth word by looking at all 4-grams whose first three words match the query and identifying the most common fourth word among them. The algorithm performs moderately well (5 correct out of 10) on the quiz questions. Future work involves extending it to a backoff model for new, unseen queries. Another aspect requiring refinement is filtering out foreign and non-dictionary words, which I intend to approach starting from the most infrequent words in the corpus. The final product will be an R Shiny app that functions as an efficient next-word prediction tool.
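A sketch of this 4-gram lookup, assuming a frequency table of space-separated 4-grams built as in the n-gram sketch above; the function name `predict_next` is illustrative, and the backoff step mentioned above is not yet included.

```r
fourgram_freq <- sort(table(make_ngrams(corpus, 4L)), decreasing = TRUE)

# Predict the 4th word given a three-word query
predict_next <- function(query, freq = fourgram_freq) {
  query <- tolower(trimws(query))
  hits  <- freq[startsWith(names(freq), paste0(query, " "))]
  if (length(hits) == 0) return(NA_character_)   # backoff to be added later
  parts <- strsplit(names(hits)[1], " ", fixed = TRUE)[[1]]
  parts[length(parts)]                           # 4th word of the top 4-gram
}
predict_next("one of the")
```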