Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain.
Predictive text is an input technology that facilitates typing on a mobile device by suggesting words the end user may wish to insert in a text field. Predictions are based on the context of the other words in the message and the first letters typed. Because the end user simply taps on a word instead of typing it out on a soft keyboard, predictive text can significantly speed up the input process.
The goal of this project is to build a predictive text application, which takes a phrase of one or more words as input and predicts the next word as output. For example, if the user types “I went to the”, the application should output the 3 most likely candidates for the next word. In this case, we could expect the output to be “gym”, “store”, “restaurant”.
The data is from a corpus called HC Corpora (www.corpora.heliohost.org). The corpora are collected from publicly available sources by a web crawler. More information can be found at http://www.corpora.heliohost.org/aboutcorpus.html.
The data has been collected from three sources (Twitter, blogs and news), each with its own data file.
A brief summary of the data is presented below.
The Blogs document contains 899,288 lines, 37,334,131 words, and 208,361,438 characters.
The length of the longest line in the Blogs document is 40,835.
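For reference, a minimal sketch of how such summary figures can be computed in R. The file path reflects the standard HC Corpora layout and is an assumption, as is the whitespace-based word count.

```r
# Read one corpus file and compute basic summary statistics
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

length(blogs)                                  # number of lines
sum(sapply(strsplit(blogs, "\\s+"), length))   # rough word count (whitespace-split)
sum(nchar(blogs))                              # total number of characters
max(nchar(blogs))                              # length of the longest line
```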
A good way to understand the data better is to visualise it. Word clouds give greater prominence to words that appear more frequently in the source text, which makes them a useful first look at the frequency and variety of words in each dataset. Below are word clouds for each dataset.
Blogs Word Cloud:
News Word Cloud:
Tweets Word Cloud:
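A sketch of how these word clouds could be produced with the wordcloud package. The input object `blogs_sample` (a character vector of sampled lines), the tokenization rule and the colour palette are assumptions.

```r
library(wordcloud)
library(RColorBrewer)

# Split the sampled text into lower-case words and count their frequencies
words <- unlist(strsplit(tolower(blogs_sample), "[^a-z']+"))
words <- words[nchar(words) > 0]
freq  <- sort(table(words), decreasing = TRUE)

# Draw the word cloud, showing at most the 100 most frequent words
wordcloud(names(freq), as.numeric(freq),
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```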
Considering the size of the 3 data files, 5,000 lines were randomly sampled from each of them. These 3 samples were then combined into a single “training” file containing 15,000 lines.
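The sampling step might look like the following sketch; the file names and the random seed are assumptions.

```r
set.seed(1234)  # assumed seed, for reproducibility

# Draw a random sample of lines from one corpus file
sample_lines <- function(path, n = 5000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, n)
}

# Combine the three samples into one training set and write it out
training <- c(sample_lines("final/en_US/en_US.blogs.txt"),
              sample_lines("final/en_US/en_US.news.txt"),
              sample_lines("final/en_US/en_US.twitter.txt"))

writeLines(training, "training.txt")
```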
Following this, a function was created to clean up the data set. The cleaned data was then tokenized into one-gram, two-gram and three-gram tokens, which were sorted by frequency; a sketch of these steps appears below.
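A minimal sketch of the cleaning and tokenization steps. The exact cleaning rules used in the project are not listed here, so the ones below (lower-casing, stripping everything except letters and apostrophes, collapsing whitespace) are assumptions, and `training` is the sampled vector from the previous step.

```r
# Clean the sampled text (assumed steps: lower-case, strip punctuation and numbers, trim spaces)
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

# Build a sorted n-gram frequency table from a character vector of cleaned lines
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " "), function(w) {
    w <- w[nchar(w) > 0]
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

training_clean <- clean_text(training)
unigrams <- ngram_freq(training_clean, 1)
bigrams  <- ngram_freq(training_clean, 2)
trigrams <- ngram_freq(training_clean, 3)

head(unigrams, 20)  # top 20 one-gram tokens; analogous calls give the bigram and trigram tables
```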
Top 20 One Gram Tokens:
Top 20 Two Gram Tokens:
Top 20 Three Gram Tokens:
A function was created to analyze the minimum number of unique words needed to cover a certain percentage of all word occurrences in the language. Following is the graph depicting the results:
Therefore, 3 unique words are needed to cover 10% of all word occurrences in the language, 9 unique words to cover 20%, and so on. In general, the minimum number of unique words needed grows by a factor of roughly 2 to 3 for each additional 10 percentage points of coverage.
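These coverage figures can be derived directly from the unigram frequency table, as in the following sketch (it assumes the sorted `unigrams` table built above).

```r
# Minimum number of unique words needed to cover a given share of all word occurrences
coverage <- function(freq, target) {
  cum_share <- cumsum(as.numeric(freq)) / sum(freq)
  which(cum_share >= target)[1]
}

# Coverage requirements at 10%, 20%, ..., 90%
sapply(seq(0.1, 0.9, by = 0.1), function(p) coverage(unigrams, p))
```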
I plan to do modelling using the following 3 combination techniques:
Following this, prediction will be performed to evaluate the accuracy of the resulting model.
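As an illustration of the kind of prediction step involved (not necessarily the final modelling approach), a simple frequency-based backoff from trigrams to bigrams might look like this; `clean_text`, `trigrams` and `bigrams` are the assumed objects sketched earlier.

```r
# Hypothetical next-word prediction: look up the last two words in the trigram table,
# fall back to the last word in the bigram table, and return the top k candidates.
# Because the tables are sorted by frequency, earlier matches are more likely words.
predict_next <- function(phrase, trigrams, bigrams, k = 3) {
  w <- strsplit(clean_text(phrase), " ")[[1]]
  candidates <- character(0)

  if (length(w) >= 2) {
    prefix <- paste(tail(w, 2), collapse = " ")
    hits <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
    candidates <- sub(".* ", "", names(hits))   # keep only the predicted last word
  }
  if (length(candidates) < k && length(w) >= 1) {
    prefix <- tail(w, 1)
    hits <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
    candidates <- c(candidates, sub(".* ", "", names(hits)))
  }
  head(unique(candidates), k)
}

predict_next("I went to the", trigrams, bigrams)
```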
Finally, a Shiny app will be created that takes a phrase (one or more words) as input in a text box and outputs a prediction of the next word.
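A minimal skeleton of such an app is sketched below; `predict_next`, `trigrams` and `bigrams` are the hypothetical objects from the earlier sketches, not the final implementation.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  # Show the top candidate next words for the phrase typed by the user
  output$prediction <- renderText({
    paste(predict_next(input$phrase, trigrams, bigrams), collapse = ", ")
  })
}

shinyApp(ui = ui, server = server)
```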