The current project is to build a Shiny app for text prediction: given some text input by the app user, the app outputs prediction(s) of the next word. This page summarizes the dataset and the data cleaning process, and briefly outlines the model and its building process.
We refer interested readers to the app's GitHub repository for the formal pdf version of this page as well as the other documentation of the app.
The raw dataset is provided by Swiftkey and the app author gratefully acknowledges its generous contribution.
The raw data consists of three raw text files containing (mainly) English texts extracted from three sources - blogs, news and Twitter - in the U.S. during the late 2000s to early 2010s.
Table 1.1 summarizes the word count and line count of each of the three files. As shown, all three files contain a similar number of words. Note that the line counts convey little information, since a line in the raw data does not necessarily correspond to one sentence and may also end with an incomplete sentence.
Table 1.1: Word and line counts of the three raw data files.

| | Blogs | News | Twitter |
|---|---|---|---|
| Text file size (MB) | 200 | 196 | 159 |
| Word count | 38050950 | 35628125 | 31073243 |
| Line count | 899288 | 1010242 | 2360148 |
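The counts in Table 1.1 can be reproduced along the following lines. This is a minimal sketch, not the app's actual code; the file names below are assumptions.

``` r
library(tokenizers)

# Assumed file names for the three raw data sources
files <- c(blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt",
           twitter = "en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  c(size_mb    = round(file.size(path) / 1024^2),
    word_count = sum(count_words(lines)),
    line_count = length(lines))
}

sapply(files, summarise_file)
```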
Figure 1.1 plots the log-count of the \(n\)-th most frequently appearing word (upper and lower case not differentiated) in the blogs data for \(n \in \{1,2,\ldots,10000\}\). Note that only a small fraction of the words appear often, while most words have only a small count. The same pattern is observed for the other two data sources.
Figure 1.1: Log-count of the 10000 most frequent words in the blogs data.
Figure 1.2 is a second plot on the blogs data, displaying the cumulative fraction of the total word count accounted for by the given number of most frequently appearing words. It again suggests that a small number of the most common words accounts for a large proportion of the total word count: for example, around 2000 of the more than 300000 unique words already account for 80% of the total word count of the data.
Figure 1.2: Cumulative fraction of the total word count constituted by the corresponding number of most frequent words in the blogs data.
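Figures 1.1 and 1.2 could be produced roughly as sketched below; this is an illustration rather than the report's actual code, and the object name `blog_lines` is an assumption.

``` r
library(tokenizers)

# `blog_lines` is assumed to hold the raw blogs data as a character vector
words <- unlist(tokenize_words(blog_lines, lowercase = TRUE))
freq  <- sort(table(words), decreasing = TRUE)

# Figure 1.1: log-count of the n-th most frequent word, n = 1, ..., 10000
plot(log(as.numeric(freq[1:10000])), type = "l",
     xlab = "n", ylab = "Log-count")

# Figure 1.2: cumulative fraction of the total word count covered by the
# most frequent words
coverage <- cumsum(as.numeric(freq)) / sum(freq)
plot(coverage, type = "l",
     xlab = "Number of most frequent words", ylab = "Cumulative fraction")
```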
The raw data is cleaned in three aspects, described below.
For these purposes, the \(\mathsf{R}\) library \(\mathsf{tokenizers}\) is chosen to assist the process.[^1] The raw data is passed through a pipe of the library's functions and \(\mathsf{R}\) base functions to accomplish the above tasks. Refer to the functions documentation for details on the function used for tokenization.
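A minimal sketch of this kind of pipeline is given below, assuming the raw text of one source is held in a character vector `raw_lines` (an assumed name); the exact arguments used in the app may differ.

``` r
library(tokenizers)

# Split the raw lines into sentences, then each sentence into tokens
sentences <- unlist(tokenize_sentences(raw_lines))
tokens    <- tokenize_words(sentences, lowercase = TRUE, strip_punct = FALSE)
# `tokens` is a list with one character vector of tokens per sentence;
# with strip_punct = FALSE the marks "," and "." are kept as separate tokens
```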
A sentence-begin tag, the <\(\mathsf{s}\)> token[^2], is added to the beginning of each tuple of tokens from the same sentence. With this, as soon as the app user inputs one word, the app outputs prediction(s) by searching through the 3-grams (with the first word being <\(\mathsf{s}\)>, the second word being the one input by the user, and the last word being the prediction) instead of the 2-grams. Similarly, the 4-grams can already be searched through when the app user inputs two words. Prediction accuracy is improved through the extra information of whether or not the text is at the beginning of a sentence.
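In code, this tagging step can be as simple as the sketch below (continuing from the assumed `tokens` list above).

``` r
# Prepend the sentence-begin tag to each sentence's token vector
tokens_tagged <- lapply(tokens, function(tok) c("<s>", tok))
# e.g. the tokens of "Nice to meet you." become
#      c("<s>", "nice", "to", "meet", "you", ".")
```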
The unknown token <\(\mathsf{unk}\)> is introduced to replace words which occur infrequently in the dataset. This is done for three reasons:
Recall that a relatively small number of words constitutes the majority of the total word count (refer to Figures 1.1 and 1.2 and the discussion therein). A percentage \(\alpha\) is chosen as the threshold: the most commonly appearing words which together constitute just over \(\alpha\) of the total word count are retained, and the remaining words are replaced by <\(\mathsf{unk}\)>. In the current model \(\alpha\) is set to 90%.
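The vocabulary selection just described can be sketched as follows; the variable names are assumptions, and the sketch operates on the tagged token list from the previous step.

``` r
alpha <- 0.90

# Token frequencies over the whole (tagged) data
freq     <- sort(table(unlist(tokens_tagged)), decreasing = TRUE)
coverage <- cumsum(as.numeric(freq)) / sum(freq)

# Vocabulary: the most frequent tokens covering just over alpha of all tokens
vocab <- names(freq)[seq_len(which(coverage >= alpha)[1])]

# Replace all out-of-vocabulary tokens by the unknown token
tokens_final <- lapply(tokens_tagged,
                       function(tok) ifelse(tok %in% vocab, tok, "<unk>"))
```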
The three sources of data are processed separately in the same way as specified in Section 2. The processed data has the attributes displayed in Table 3.1. Note that the token counts in Table 3.1 are higher than the word counts in Table 1.1, since the punctuation marks "," and "." also form single tokens and the sentence-begin tokens are added.
Table 3.1: Attributes of the processed data.

| | Blogs | News | Twitter | Total |
|---|---|---|---|---|
| Token count | 42859210 | 39094206 | 36938806 | 118892222 |
| Sentence count | 2375718 | 2024588 | 3770155 | 8170461 |
| Number of unique tokens | 6085 | 7273 | 4957 | 9015 |
| Minimum no. of appearances for a word to be in the vocabulary | 429 | 367 | 395 | - |
| Number of unique words replaced by <\(\mathsf{unk}\)> | 237656 | 195323 | 283546 | - |
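As a rough illustration, the per-source attributes in Table 3.1 could be computed from the processed token lists along the following lines (a sketch under the assumption that `tokens_final` holds the per-sentence token vectors of one source).

``` r
# Per-source attributes of Table 3.1, computed for one processed source
all_tokens <- unlist(tokens_final)
c(token_count     = length(all_tokens),
  sentence_count  = length(tokens_final),
  n_unique_tokens = length(unique(all_tokens)))
```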
The Shiny app uses a 4-gram model to predict the next word given some text. (See the model-construction documentation in the GitHub repo for a brief summary of \(n\)-gram models, or e.g. Jurafsky and Martin (2009) for a detailed explanation.)
A number of self-defined functions are used to run a linear scan of the processed data, to write all 2-grams, 3-grams and 4-grams into a set of nested dictionaries, and to output prediction(s) of the next word of some text based on the frequencies of the \(n\)-grams. Refer to the functions documentation as well as the model-construction documentation in the repo for further information in this regard.
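The app itself stores the \(n\)-grams in nested dictionaries; as a simplified stand-in that illustrates frequency-based prediction (not the app's actual implementation), the sketch below counts 3-grams in a flat table and returns the most frequent continuations of a two-word prefix. All object names are assumptions.

``` r
# Count all 3-grams in the processed data (`tokens_final` from Section 2)
trigrams <- unlist(lapply(tokens_final, function(tok) {
  if (length(tok) < 3) return(character(0))
  i <- seq_len(length(tok) - 2)
  paste(tok[i], tok[i + 1], tok[i + 2])
}))
trigram_counts <- sort(table(trigrams), decreasing = TRUE)

# Return up to `n` most frequent next words following the prefix "w1 w2"
predict_next <- function(w1, w2, n = 3) {
  prefix <- paste(w1, w2, "")   # "w1 w2 " with a trailing space
  hits   <- trigram_counts[startsWith(names(trigram_counts), prefix)]
  if (length(hits) == 0) return(character(0))
  substring(names(hits), nchar(prefix) + 1)[seq_len(min(n, length(hits)))]
}

predict_next("<s>", "i")   # e.g. predictions right after the user types "I"
```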
[1] Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., Upper Saddle River, NJ, USA.
[^1]: A variety of \(\mathsf{R}\) libraries for text mining/natural language processing, including \(\mathsf{quanteda, tm, RWeka}\) and \(\mathsf{text2vec}\), have been considered. The \(\mathsf{tokenizers}\) library is found to best fit the needs of this project due to its ability to flexibly split text chunks into sentences/words, to distinguish some (although not all) English short forms ending with a full stop from a sentence end, and to effectively remove extra punctuation/symbols in tokens.
[^2]: Here \(\mathsf{s}\) stands for "sentence"; it is unrelated to the HTML strike-through tag.