Greg Sutcliffe
2019-01-15
The Data Science Specialization Capstone tasks us with building a text prediction application, based on a corpus of text drawn from blogs, news articles, and Twitter tweets.
The premise is to build a prediction model that accepts a string of text and outputs the next word in the sentence. This could potentially be used in a mobile application, so memory usage and speed of execution are considerations.
In this presentation, we'll go over the dataset generation process, the model selection, the prototype application itself, and some potential next steps.
The corpus of text is large, comprising 4,269,678 lines in total (557Mb on disk). This requires a significant amount of memory and CPU time to process. The data processing follows this flow:
The tidytext package is used to extract the n-grams, from 4-word-grams down to n=1. This is time-consuming: 20% of the corpus took approximately 4 hours to parse on my laptop. However, the chosen model depends strongly on the quantity of text available to it, so I chose to spend a large amount of time on this processing step in order to make execution of the model faster later.
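As an illustration of this step, the sketch below shows how the 4-grams might be extracted and counted with tidytext. The `corpus_lines` vector and the `w1`–`w4` column names are assumptions for the example, not the exact code used in the project.

```r
library(dplyr)
library(tidytext)
library(tidyr)

# Assumes the corpus has already been read into a character vector,
# one line per element (the name `corpus_lines` is illustrative)
fourgrams <- tibble(text = corpus_lines) %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 4) %>%   # split into 4-word grams
  count(ngram, sort = TRUE) %>%                             # frequency of each 4-gram
  separate(ngram, into = c("w1", "w2", "w3", "w4"), sep = " ")
```

The same call with `n = 3`, `n = 2`, and `n = 1` produces the lower-order tables the backoff model needs.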
The model selected for this application is a Stupid Backoff model which is defined as:
\[ S(w_i|w_{i-k+1}^{i-1}) = \begin{cases} \frac{count(w_{i-k+1}^{i})}{count(w_{i-k+1}^{i-1})}, & \mbox{if } count(w_{i-k+1}^{i}) > 0 \\ \alpha \, S(w_i|w_{i-k+2}^{i-1}), & \mbox{otherwise} \end{cases} \]
With \( \alpha \) set to the recommended value of 0.4. This model was chosen because it is very simple (i.e. fast), but it requires a large table of n-grams to function well, hence the focus on dataset generation.
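To make the definition concrete, here is a rough sketch of a Stupid Backoff lookup. It assumes pre-computed frequency tables `ngrams[[k]]` (k = 1..4), each a data frame with a space-separated `prefix` column, the following `word`, and its count `n`; these structures and names are illustrative, not the app's exact code.

```r
stupid_backoff <- function(prefix_words, ngrams, alpha = 0.4) {
  top_k <- length(prefix_words) + 1                # order of n-gram to try first
  k <- top_k
  while (k >= 2) {
    prefix <- paste(tail(prefix_words, k - 1), collapse = " ")
    hits   <- ngrams[[k]][ngrams[[k]]$prefix == prefix, ]
    if (nrow(hits) > 0) {
      # count(prefix) is the sum of the counts of its observed continuations
      return(data.frame(word  = hits$word,
                        score = alpha^(top_k - k) * hits$n / sum(hits$n)))
    }
    k <- k - 1                                     # back off to a shorter prefix
  }
  # No prefix matched at any order: fall back to raw unigram frequencies
  data.frame(word  = ngrams[[1]]$word,
             score = alpha^(top_k - 1) * ngrams[[1]]$n / sum(ngrams[[1]]$n))
}
```

Sorting the returned data frame by `score` and taking the top row gives the predicted next word; each step down to a shorter prefix multiplies the score by \( \alpha \).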
The final frequency table weighs in at 292Mb, which should fit in the memory of most mobile devices, and it returns predictions in well under a second, which feels responsive to the user.
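A quick way to sanity-check those two numbers is sketched below; the object name `ngrams`, the file name, and the reuse of the `stupid_backoff()` sketch above are assumptions for illustration.

```r
# Ship a compressed table alongside the app, then check its in-memory size
saveRDS(ngrams, "ngram_freq.rds", compress = "xz")
ngrams <- readRDS("ngram_freq.rds")
format(object.size(ngrams), units = "Mb")

# Lookup latency for a sample prefix should be well under a second
system.time(stupid_backoff(c("thanks", "for", "the"), ngrams))
```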
The prototype application can be found here - please be patient as the initial data load can take a few seconds if the app is sleeping. Below is a screenshot.
To use the app:
In order to deliver this prototype, many possible improvements have not yet been evaluated. Below are some areas for further work, in rough order of priority: