I was tasked with creating a text prediction application that “guesses” the next word a user is likely to type. For example, if the user typed “Are you”, the next word might be “hungry”, “busy”, or “coming”. Naturally, there are many possibilities to sort through. To help put these possibilities in perspective, I was provided with three large datasets from three different sources (blogs, news excerpts, and tweets). I used free statistical software to extract information that I could use in my prediction application.
One of the challenges of this sort of project stems from the size of the datasets: each contains over 30,000,000 words, as the line and word counts below show. Processing this much text can require a lot of computing power, so I have sometimes found it helpful to work with a sample of the data to speed things up.
##         line.count word.count
## blogs       899288   37570839
## news       1010242   34494539
## twitter    2360148   30451170
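As a rough illustration of how counts like these can be produced, and how a sample can be drawn to speed things up, here is a minimal Python sketch. It is not the code used for this report: the function names, the whitespace-based definition of a "word", and the tiny `corpus` list are all assumptions made for the example.

```python
import random

def count_lines_and_words(lines):
    """Return (line_count, word_count), treating whitespace-separated tokens as words."""
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        word_count += len(line.split())
    return line_count, word_count

def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction` (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for one of the datasets (an assumption, not real data).
corpus = ["Are you hungry", "Are you coming to the party", "I am busy"]

print(count_lines_and_words(corpus))  # → (3, 12)
```

For a real file, the same functions could be fed an open file handle line by line, so the whole dataset never has to sit in memory at once.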
When analyzing the data, one concept of particular importance is the “n-gram”. The term may sound a bit confusing, but an n-gram is simply a sequence of “n” consecutive words. For example, a 1-gram is just a single word such as “are”, a 2-gram is a two-word pairing such as “are you”, and a 3-gram is a three-word combination such as “are you hungry”.
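To make the definition concrete, here is a small Python sketch that pulls n-grams out of a phrase. The `tokenize` and `ngrams` helpers are hypothetical names, and lowercasing plus splitting on whitespace is a simplifying assumption; real text would need more careful cleaning (punctuation, numbers, and so on).

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined back into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("Are you hungry")

print(ngrams(tokens, 1))  # → ['are', 'you', 'hungry']
print(ngrams(tokens, 2))  # → ['are you', 'you hungry']
print(ngrams(tokens, 3))  # → ['are you hungry']
```

A phrase shorter than n words simply yields no n-grams.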
One interesting feature of our data is that relatively few 1-grams account for most of the words in our dataset (e.g., “the” alone accounts for about 5% of all words). I found it even more interesting that this pattern is less pronounced for 2-grams and 3-grams: occurrences of multi-word combinations appear to be more evenly distributed than occurrences of individual words. This makes sense because individual words such as “and”, “to”, or “the” can be used in almost any scenario, whereas a specific word combination may only be used in certain contexts.
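One way to quantify this pattern is to ask how many of the most frequent n-grams are needed to cover a given share of all occurrences. The Python sketch below does that on a toy sentence. The function name `terms_needed_for_coverage`, the 50% target, and the example text are assumptions for illustration, not figures from the actual datasets.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined back into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def terms_needed_for_coverage(counts, target):
    """Smallest number of most-frequent terms whose counts reach `target` (0-1) of the total."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target * total:
            return rank
    return len(counts)

tokens = "the cat and the dog and the bird".split()
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

# 2 of the 5 distinct words cover half of all word occurrences...
print(terms_needed_for_coverage(unigram_counts, 0.5), len(unigram_counts))  # → 2 5
# ...but 3 of the 6 distinct word pairs are needed to cover half of all pairs.
print(terms_needed_for_coverage(bigram_counts, 0.5), len(bigram_counts))    # → 3 6
```

Even in this tiny example, a larger share of the distinct 2-grams is needed to reach the same coverage, which is the flatter distribution described above.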
Because multi-word combinations are spread so evenly, I may need to store a large number of them to predict well, which could pose challenges when building my application. Going forward, I plan to analyze the relationships between single words and two-word pairings, and between two-word pairings and three-word combinations, to see which word is most likely to come next in a given context. Once I have documented these relationships, I’ll package them into a user-friendly app that suggests the next word based on what the user has already typed.
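The final prediction method has not been settled yet, but one possible shape for it is sketched below in Python: count which word follows each one-word and two-word context, then suggest the most frequent follower, falling back from the two-word context to the one-word context when the longer one has never been seen. The names `build_model` and `predict_next`, the fallback rule, and the training sentences are all assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_model(sentences, max_context=2):
    """Map each context (a tuple of 1..max_context words) to a Counter of following words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i in range(1, len(tokens)):
            for k in range(1, max_context + 1):
                if i - k >= 0:
                    model[tuple(tokens[i - k:i])][tokens[i]] += 1
    return model

def predict_next(model, text, max_context=2):
    """Suggest the most frequent follower of the longest known context, or None."""
    tokens = text.lower().split()
    for k in range(max_context, 0, -1):  # try the longest context first
        if len(tokens) < k:
            continue
        context = tuple(tokens[-k:])
        if context in model:
            return model[context].most_common(1)[0][0]
    return None

# Toy training data (an assumption, not taken from the real datasets).
sentences = ["are you hungry", "are you hungry now", "are you busy", "thank you very much"]
model = build_model(sentences)

print(predict_next(model, "Are you"))    # → hungry
print(predict_next(model, "thank you"))  # → very
print(predict_next(model, "who are"))    # → you  (falls back to the one-word context)
```

Storing every observed context this way is exactly where the memory concern comes from, so a real app would likely need to prune rare combinations.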