The goal of this capstone project is to build a predictive model for English text using data from a corpus called HC Corpora.
A simple model of 1-grams, 2-grams and 3-grams (n-grams) was constructed from a subset of the provided corpus. An n-gram model is a probabilistic language model that predicts the next item in a sequence using an (n − 1)-order Markov assumption.
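As an illustration (this formula is a standard statement of the model, not drawn from the report itself), a 3-gram model makes a second-order Markov assumption: the probability of the next word depends only on the two preceding words and can be estimated from counts,

$$
P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1}) = \frac{\operatorname{count}(w_{i-2}, w_{i-1}, w_i)}{\operatorname{count}(w_{i-2}, w_{i-1})}.
$$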
A very basic prediction routine is able to predict a sensible-looking next word using the n-gram model. Further work is required to refine the model and to consider an efficient way of including words from earlier parts of a given phrase for improved accuracy.
My goal for this capstone project is to gamify the use of the next word prediction algorithm. I intend to present the app as a game to see whether the user can lead the algorithm to predict a given word (user specified or randomly generated). I am seeking feedback on the idea of gamifying the user experience when assessing the accuracy of my next word prediction algorithm.
The data is from a corpus called HC Corpora. The corpora are collected from publicly available sources by a web crawler. The files have been language filtered but may still contain some foreign text.
| Filename | File size (MB) | Word count | Line count |
|---|---:|---:|---:|
| en_US.blogs.txt | 200.4242 | 37,334,690 | 899,288 |
| en_US.news.txt | 196.2775 | 34,372,720 | 1,010,242 |
| en_US.twitter.txt | 159.3641 | 30,374,206 | 2,360,148 |
The following data cleansing activities were performed (a minimal sketch of these steps is shown after the list):

- remove punctuation
- remove numbers (digits)
- change to lower case
- remove profanity (given the intention to gamify the user interaction for a younger audience)

The list of profane words is drawn from Shutterstock's GitHub-hosted list of bad words.
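As an illustration only (not the actual code used in this project), these cleansing steps can be sketched in base R; `profanity` is assumed to be a character vector loaded from the bad-words list:

```r
# Illustrative cleansing sketch in base R; 'profanity' is assumed to be a
# character vector of profane words loaded separately.
clean_text <- function(lines, profanity) {
  lines <- tolower(lines)                    # change to lower case
  lines <- gsub("[[:punct:]]", " ", lines)   # remove punctuation
  lines <- gsub("[[:digit:]]", " ", lines)   # remove numbers (digits)
  # remove profane words, matched as whole words
  pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
  lines <- gsub(pattern, " ", lines)
  gsub("\\s+", " ", trimws(lines))           # collapse leftover whitespace
}
```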
The following summary is drawn from a sample of 30,000 lines of text from each of the data sources. The sample contains a total of 2,627,091 words, of which 83,427 are unique. The table below shows how many of the most frequent unique words are needed to cover a given percentage of all word instances in the sample (a sketch of this calculation follows the table).
| Number of unique words | Coverage (%) |
|---:|---:|
| 141 | 50 |
| 376 | 60 |
| 960 | 70 |
| 2,414 | 80 |
| 7,343 | 90 |
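The coverage figures above can be reproduced from a sorted word-frequency table. A sketch of that calculation (illustrative only; `words` is assumed to be a character vector of all tokens in the sample):

```r
# Number of most-frequent unique words needed to cover a given share of all
# word instances; 'words' is assumed to be a character vector of tokens.
words_for_coverage <- function(words, coverage = 0.9) {
  freq <- sort(table(words), decreasing = TRUE)   # word frequencies, descending
  cum_share <- cumsum(freq) / sum(freq)           # cumulative share of instances
  unname(which(cum_share >= coverage)[1])         # first rank reaching the target
}
```

For example, `words_for_coverage(words, 0.5)` would give the number of unique words needed for 50% coverage.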
The sample is tokenised into 1-grams, 2-grams and 3-grams.
At the time of writing this report, I have created simple 1-gram, 2-gram and 3-gram tables with a frequency count against each of the n-grams from the dataset. Both the 2-gram and 3-gram tables are split into a 'lookup words' column and a 'next word' column, as shown in the sample from the 3-gram table below (a sketch of how such a table can be built follows it).
| corpus.3grams | freq | lookupwords | nextword | lookupwords.freq |
|---|---:|---|---|---:|
| of the year | 217 | of the | year | 4745 |
| of the most | 174 | of the | most | 4745 |
| of the day | 151 | of the | day | 4745 |
| of the best | 144 | of the | best | 4745 |
| of the world | 115 | of the | world | 4745 |
| of the season | 101 | of the | season | 4745 |
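As an illustration of how such a table can be constructed (the column names mirror the table above, but this is not the exact code used), a data.table sketch that builds 3-gram counts from a vector of tokens and splits each 3-gram into lookup words and next word:

```r
library(data.table)

# Illustrative sketch: build a 3-gram frequency table from a character vector
# of tokens and split each 3-gram into 'lookup words' (first two words) and
# 'next word' (last word). Column names mirror the table shown above.
build_3grams <- function(tokens) {
  n <- length(tokens)
  grams <- paste(tokens[1:(n - 2)], tokens[2:(n - 1)], tokens[3:n])
  dt <- data.table(corpus.3grams = grams)[, .(freq = .N), by = corpus.3grams]
  dt[, lookupwords := sub(" [^ ]+$", "", corpus.3grams)]  # first two words
  dt[, nextword := sub("^.* ", "", corpus.3grams)]        # last word
  dt[, lookupwords.freq := sum(freq), by = lookupwords]   # total count of the lookup words
  setorder(dt, -freq)[]
}
```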
A simple prediction routine is used to predict the next word for a provided phrase. The algorithm currently works as follows (a simplified sketch of this lookup logic is shown after the steps):
1. Turn the provided phrase into lower case, remove punctuation and tokenise it into words.
2. If the phrase contains two or more words, take the last two words of the phrase and look them up in the 3-gram table using the 'lookup words' column.
3. Find all entries that contain those lookup words, sort them by frequency count and return the three with the highest counts.
4. If fewer than three entries are found in the 3-gram table, or if the provided phrase contains only one word, the same routine is applied to the 2-gram table.
Step 4 above is a step towards understanding how the ‘Stupid Backoff’ strategy can be implemented in the final algorithm.
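A simplified sketch of this lookup logic, assuming data.table objects `tbl3` and `tbl2` shaped like the 3-gram table shown earlier (illustrative only, not the actual `predictnextword` implementation):

```r
library(data.table)

# Illustrative lookup sketch; 'tbl3' and 'tbl2' are assumed to be data.tables
# with 'lookupwords', 'nextword' and 'freq' columns, as in the table above.
predict_next <- function(phrase, tbl3, tbl2, top_n = 3) {
  words <- strsplit(gsub("[[:punct:]]", " ", tolower(phrase)), "\\s+")[[1]]
  words <- words[words != ""]

  hits <- character(0)
  if (length(words) >= 2) {                          # step 2: try the 3-gram table
    key <- paste(tail(words, 2), collapse = " ")
    hits <- tbl3[lookupwords == key][order(-freq), nextword]
  }
  if (length(hits) < top_n && length(words) >= 1) {  # step 4: fall back to 2-grams
    key <- tail(words, 1)
    hits <- c(hits, tbl2[lookupwords == key][order(-freq), nextword])
  }
  head(unique(hits), top_n)
}
```

A full 'Stupid Backoff' implementation would additionally discount candidates drawn from the lower-order table by a fixed factor (commonly 0.4) rather than simply appending them.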
Trained on 90,000 lines of text taken equally from the blogs, news and twitter sources, this simple routine currently suggests words that appear sensible.
predictnextword("I am going to buy a")
## [1] "new" "car" "house"
However, there are obvious flaws in both the model and the prediction routines:
predictnextword("I am going to the store to buy bacon, eggs and")
## [1] "i" "my" "a"
predictnextword("I am of the ")
## [1] "year" "most" "day"
The next step of exploratory work I have planned is to assess whether there is a simple way to address the flaws in my current work, taking into consideration the time (personal) and resource (machine) constraints of this project. Specifically, my intended exploratory questions are:
- Can the 1-gram, 2-gram and 3-gram tables be collapsed into a simpler model with precalculated probabilities of a suggested word for easier inference at runtime?
- Should stop words be removed when constructing the n-gram tables? If they are, how can the prediction algorithm still provide grammatically correct predictions?
- Can cluster analysis of words be used to inform the probability of a suggested word? The intention of this analysis is to assess whether there is a faster way of taking into consideration words that are not used in the lookup of the 2-gram and 3-gram tables. If it can, when and how should this method be injected into the final prediction routine?
- Given machine constraints, what is the optimal size of the final model?
- If no suggested words are found from the n-gram lookups and the cluster analysis, can the suggested word be randomly drawn from the 1-gram table?
It is my goal to add a fun element to the final presentation of the next word prediction routine by gamifying the interaction between the user and the routine.
The intention is to provide a word (either randomly selected or user supplied) and allow the user to type in phrases. The game is won if the routine suggests the selected word. I am seeking feedback on this concept.