People spend more and more time on their mobile devices, and many of these activities involve typing text. However, typing on a mobile device is not always easy. This is why tech companies such as SwiftKey create applications that make typing easier. Suggesting further text to the application user is achieved with predictive text modeling. For example, if we have the sentence:
Mary had a little …
Google's search text prediction suggests that the most likely next word is "lamb". The purpose of this project is to create an application that predicts the following word in a similar way.
For this project, we will use three files containing texts from Twitter messages, blogs and internet news in US English, provided by SwiftKey and downloaded from Coursera. They will be processed and explored below.
The data processing workflow will look like this:
Firstly, we will load all three files to investigate their basic properties.
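For illustration, a minimal sketch of this loading step is given below. It assumes base R only, and the file paths (e.g. `en_US.twitter.txt`) are placeholders rather than values taken from this report.

```r
# Sketch: load the three text files and compute the basic counts
# reported in the table further below. File paths are assumptions.
files <- c(twitter = "en_US.twitter.txt",
           blogs   = "en_US.blogs.txt",
           news    = "en_US.news.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Lines   = length(lines),
    Words   = sum(lengths(strsplit(lines, "\\s+"))),
    Size_MB = round(file.size(path) / 1024^2, 2),
    Symbols = sum(nchar(lines))
  )
}

basic_stats <- do.call(rbind, lapply(files, summarise_file))
basic_stats
```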
Here are a few sample lines from the Twitter texts:
## [1] "Things going so well! We've written 2 songs, finished another, and arranged a fourth. So excited for all the progress!"
## [2] "yea. Plus they run at Sports Academy in the mornings."
## [3] "Desk put together, room all set up. Oh boy, oh boy"
## [4] "Save the Date: AdFed Miami \"Lemonade\" Documentary Screening with Erik Proulx. Nov 17th 630pm @ Books and Books (Coral Gables)"
## [5] "Way overdue for a #date! #patience is a virtue"
In a similar way, we can extract sample lines for blogs:
## [1] "Now, speaking in tongues may manifest not only in personal prayer but it may also be the media whereby the Lord wishes to express a word of prophecy to the community. In such cases it is necessary that there be someone in the community to interpret so that it can be of benefit to all. One is also free to ask the Holy Spirit to interpret their prayer language when praying in private."
## [2] "Summary of clinic: Four and a half hours. 15 minutes of face time. A surgery I don't want, but which should have happened already, no bracing for standing/walking, no good advice for an issue we have, no urology answers, more appointments."
## [3] "My cake schedule usually looks like this."
## [4] "Firpo's Balloon Cocktail (mildly adapted)"
## [5] "Step 6. Create different mindmaps and then translate to word document, pdfs, powerpoints, websites based on a particular purpose."
…and news:
## [1] "We took a look at Unit 7, a third-floor 1,780-square-foot two-bedroom plus study that features red oak flooring throughout, a custom kitchen with Bosch appliances, a living/dining space with a gas fireplace and a Carrara marble master-suite bathroom. The unit is on the market for $699,000."
## [2] "Florissant Police Chief William Karabas urged the council to publish a notice of intent to enter into a PLA. Karabas is chairman of the Emergency Communications Commission."
## [3] "The driving event features dressage and cones on Oct. 8 and the marathon Oct. 9. There is no admission or parking fee."
## [4] "Bad: Robinson loves Barton's defense and Nelson's offense, but the performance of those third guards (0 for 7 in 34 minutes) raises the question of whether the Beavers have enough guard depth right now to pull off the strategy."
## [5] "Published: April 15, 2000"
Basic characteristics of these data sets - number of lines, words and symbols, and size in megabytes - are provided in the table below:
| File | Lines | Words | Size (MB) | Symbols |
|---|---|---|---|---|
| Twitter | 2,360,148 | 30,373,543 | 318.99 | 162,096,031 |
| Blogs | 899,288 | 37,334,131 | 255.35 | 206,824,505 |
| News | 1,010,242 | 34,372,530 | 257.34 | 203,223,159 |
The selected training sample will be tokenized (i.e., split) into sentences and words. English stopwords and profanities will also be removed, to get rid of inappropriate words and of function words that contribute little to the model.
In addition to individual words, we will analyze \(n\)-grams - contiguous sequences of \(n\) words from the selected text subsamples. In this report, \(n\) will be 2, 3 and 4, and the corresponding sequences are also called bigrams, trigrams and quadrigrams, respectively.
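A minimal sketch of these tokenization and \(n\)-gram steps is shown below, assuming the quanteda package; the tiny `sample_lines` vector and the `profanity_list` object are placeholders, not the actual training sample or profanity list.

```r
# Sketch of tokenization, stopword/profanity removal and n-gram creation.
library(quanteda)

sample_lines <- c("Mary had a little lamb.",
                  "So excited for all the progress!")

corp <- corpus(sample_lines)
# split into sentences first, so n-grams do not cross sentence boundaries
corp_sent <- corpus_reshape(corp, to = "sentences")

toks <- tokens(corp_sent,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE)
toks <- tokens_tolower(toks)

profanity_list <- c("badword1", "badword2")   # placeholder profanity list
toks <- tokens_remove(toks, pattern = c(stopwords("en"), profanity_list))

# contiguous word sequences of length 2, 3 and 4
bigrams   <- tokens_ngrams(toks, n = 2, concatenator = " ")
trigrams  <- tokens_ngrams(toks, n = 3, concatenator = " ")
quadgrams <- tokens_ngrams(toks, n = 4, concatenator = " ")
```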
For the analysis, a document-feature matrix will be created from the selected corpus subsets. It describes the frequency of terms (features) in a collection of documents. For this matrix, we will report summary data and calculate sparsity, which is the proportion of cells with zero counts.
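Continuing the sketch above (still assuming quanteda), the matrix and its summary values could be obtained roughly as follows:

```r
# Sketch: build document-feature matrices and compute summary values.
dfm_words   <- dfm(toks)        # single words
dfm_bigrams <- dfm(bigrams)     # 2-grams

sparsity(dfm_words)             # proportion of cells that are zero
topfeatures(dfm_words, 20)      # 20 most frequent features
nfeat(dfm_words)                # number of unique tokens (features)
sum(colSums(dfm_words))         # total token count
```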
For each type of source and each \(n\) value, the following items will be reported:

- the number of unique tokens,
- the total token count,
- the percentage of unique tokens needed to cover 50% and 90% of all token instances,
- the percentage of non-English tokens.
This summary is given below. As we can see, the number of unique single words is much smaller than for \(n\)-grams with \(n>1\), while the total number of tokens decreases as \(n\) increases. Less than 1% of single words are needed to cover 50% of all single-word instances, and 3-7% are needed to cover 90% of them. 7-13% of bigrams are enough to cover 50% of bigram instances, and about 80% are necessary to cover 90% of them. For higher \(n\) values, the percentage of features needed to reach a given coverage threshold is almost equal to that threshold (a sketch of this coverage computation is given after the table). Non-English texts make up about 1% of the sample:
| File | n-gram length | Unique tokens | Total tokens | % unique for 50% coverage | % unique for 90% coverage | Non-English (%) |
|---|---|---|---|---|---|---|
| twitter | 1 | 334461 | 13383952 | 0.16 | 3.70 | 1.12 |
| twitter | 2 | 4800804 | 10459482 | 7.73 | 78.20 | 1.39 |
| twitter | 3 | 6979666 | 7921334 | 43.25 | 88.65 | 1.37 |
| twitter | 4 | 5674575 | 5885767 | 48.13 | 89.62 | 0.97 |
| blogs | 1 | 268116 | 15143342 | 0.36 | 5.53 | 1.15 |
| blogs | 2 | 6977089 | 13286612 | 12.81 | 80.95 | 1.43 |
| blogs | 3 | 10836618 | 11544175 | 46.73 | 89.34 | 1.22 |
| blogs | 4 | 9852073 | 9958155 | 49.46 | 89.89 | 0.83 |
| news | 1 | 225148 | 15651113 | 0.51 | 6.51 | 1.34 |
| news | 2 | 6929962 | 14074563 | 11.08 | 79.68 | 1.71 |
| news | 3 | 11170260 | 12551002 | 43.81 | 88.76 | 1.35 |
| news | 4 | 10689294 | 11115454 | 48.00 | 89.60 | 0.82 |
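The coverage columns in the table above can be derived from feature frequencies. Below is a sketch of one way to do this, assuming a quanteda document-feature matrix such as the hypothetical `dfm_words` object from the earlier sketch:

```r
# Sketch: percentage of unique features needed to cover a given share
# of all token instances.
coverage_threshold <- function(dfm_obj, share = 0.5) {
  freqs <- sort(colSums(dfm_obj), decreasing = TRUE)  # feature counts, descending
  covered <- cumsum(freqs) / sum(freqs)               # cumulative token coverage
  n_needed <- which(covered >= share)[1]              # features needed for the share
  100 * n_needed / length(freqs)                      # as a percentage of unique features
}

coverage_threshold(dfm_words, 0.5)   # % of unique words covering 50% of instances
coverage_threshold(dfm_words, 0.9)   # % of unique words covering 90% of instances
```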
The top 20 words/\(n\)-grams for each text source and \(n\) value are presented below. For all sources, the larger \(n\), the lower the feature counts. As we can see, the most frequent entries are related to famous people (“president barack obama”, “martin luther king jr”, “hunter matt hunter matt”), celebrations (“happy birthday”, “happy mother’s day”, “happy cinco de mayo”), food (“cake cake cake cake”, “g fat g saturated”), people in authority positions (“county prosecutor’s office”, “superior court judge”, “chief executive officer”), and time (“last year”, “last week”, “two years ago”):
Frequencies of unique \(n\)-grams/words by text source and \(n\) are given in lollipop plots below. As we can see, most distinct \(n\)-grams have a frequency of 1, and this proportion increases with \(n\):
Total frequencies of \(n\)-grams/words by text source and \(n\) are presented below. Similar to the counts for unique \(n\)-grams, the proportion of \(n\)-grams with a frequency of 1 increases with \(n\). However, for lower \(n\) values, high-frequency \(n\)-grams account for a much larger share of total tokens (compared to the distinct counts above), as might be expected (a sketch of this computation is given after the sparsity note below):
For all data sources and \(n\) values, document-feature matrix sparsity exceeds 99.99%.
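A sketch of how these proportions and the sparsity figure can be computed is given below, again assuming a quanteda document-feature matrix such as the hypothetical `dfm_bigrams` object from the earlier sketch:

```r
# Sketch: share of distinct n-grams that occur only once, and the share
# of all token instances they account for (illustrated on the bigram dfm).
freqs <- colSums(dfm_bigrams)

mean(freqs == 1)                     # proportion of distinct bigrams seen once
sum(freqs[freqs == 1]) / sum(freqs)  # proportion of total bigram instances they cover

sparsity(dfm_bigrams)                # proportion of zero cells in the matrix
```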
Since the frequencies of words and \(n\)-grams can be expressed as proportions, it is natural to estimate the probability of the next word, given the beginning of a phrase, as the ratio between the count of that particular continuation and the total count for that beginning. However, we want to simplify this assessment by using only one or a few preceding words instead of the whole sentence. This is why \(n\)-grams will be used in the word prediction model.
The \(n\)-grams generated above will be split into an \((n-1)\)-gram containing the first \(n-1\) words and the final word. Our task is to predict the next word for a given phrase, so the last \(n-1\) words of the phrase will be processed as the input. For this \((n-1)\)-gram, we will find the words that occur most frequently in the corresponding \(n\)-grams.
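A rough sketch of this split and the resulting conditional frequencies is shown below; it reuses the hypothetical `dfm_bigrams` object and base R string operations, and is not the final implementation:

```r
# Sketch: split observed n-grams into an (n-1)-word prefix and a final
# word, then estimate P(word | prefix) as a ratio of counts.
ngram_counts <- colSums(dfm_bigrams)   # named counts, e.g. "little lamb" = 1

ngram_table <- data.frame(
  prefix = sub("\\s+\\S+$", "", names(ngram_counts)),  # all but the last word
  word   = sub("^.*\\s+",  "", names(ngram_counts)),   # the last word
  count  = as.numeric(ngram_counts)
)

# conditional probability of each final word given its prefix
ngram_table$prob <- ave(ngram_table$count, ngram_table$prefix,
                        FUN = function(x) x / sum(x))

# most likely continuations of a hypothetical prefix
head(ngram_table[ngram_table$prefix == "little", c("word", "prob")])
```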
In a bigram model (the last observed word plus the predicted word), this count-based estimate corresponds to the assumption that the next word depends only on the immediately preceding one:
\[ P(X_{n}=x \mid X_{n-1}=x_{n-1}, X_{n-2}=x_{n-2}, \ldots , X_1=x_1) = P(X_{n}=x \mid X_{n-1}=x_{n-1}) \]
For higher-order \(n\)-grams, however, the predicted word depends on the last several words. In this case, a Markov chain with memory (of order \(m\)) can be applied:
\[ P(X_{n}=x \mid X_{n-1}=x_{n-1}, X_{n-2}=x_{n-2}, \ldots , X_1=x_1) = P(X_{n}=x \mid X_{n-1}=x_{n-1}, X_{n-2} = x_{n-2}, \ldots , X_{n-m}=x_{n-m}), \quad n>m \]
Of course, it may happen that a given \((n-1)\)-gram is not among those created from the training set. In that case, we will use back-off models to allocate some probability mass to unseen word combinations. For example, we could use the last two words to predict the next word if we have examples of the particular trigram. If we don’t, we back off to the corresponding bigram. If there are no counts to compute the next-word probability using the last word, we estimate unconditional word probabilities. In other words, we only “back off” to a lower-order \(n\)-gram if we have no examples of the higher-order \(n\)-gram.
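The back-off lookup described above could look roughly like the sketch below (without discounting yet); `trigram_table`, `bigram_table` and `unigram_probs` are assumed objects following the prefix/word/probability layout from the earlier sketch:

```r
# Sketch: try the trigram table first, then the bigram table, then
# unconditional word frequencies.
predict_next <- function(phrase, trigram_table, bigram_table, unigram_probs) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)  # last two words

  hits <- trigram_table[trigram_table$prefix == paste(words, collapse = " "), ]
  if (nrow(hits) == 0) {
    hits <- bigram_table[bigram_table$prefix == tail(words, 1), ]
  }
  if (nrow(hits) == 0) {
    # no matching n-gram: fall back to the most frequent word overall
    return(names(sort(unigram_probs, decreasing = TRUE))[1])
  }
  hits$word[which.max(hits$prob)]
}
```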
In order for a back-off model to keep a proper probability distribution (so that the total probability does not exceed 1), we have to decrease the probabilities of the higher-order \(n\)-grams to reserve some probability mass for the lower-order \(n\)-grams. This is called smoothing or discounting. Often, a function (denoted e.g. \(\alpha\)) is used to distribute this probability mass.
Katz back-off is a back-off method combined with discounting. If the \(n\)-gram has a non-zero count, the next-word probability is a discounted conditional probability given the preceding words. Otherwise, we back off to a lower-order \(n\)-gram:
\[ P_{BO}(w_n \mid w_{n-N+1}^{n-1})= \begin{cases} P^*(w_n \mid w_{n-N+1}^{n-1}), & \text{if $C(w^n_{n-N+1})>0$}.\\ \alpha(w^{n-1}_{n-N+1})P_{BO}(w_n \mid w^{n-1}_{n-N+2}), & \text{otherwise}. \end{cases} \] Katz back-off is often combined with Good-Turing smoothing. However, empirical examples show that absolute discounting can also give good estimates of \(n\)-gram probabilities, so the exact way to implement the algorithm is still to be decided.
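Since the exact smoothing method is still to be decided, the sketch below only illustrates absolute discounting for a single prefix, showing how the discount frees the probability mass (the \(\alpha\) weight) passed to the lower-order model; all counts and the discount value are made up for illustration:

```r
# Sketch of absolute discounting for one prefix: subtract a fixed
# discount d from each observed count and collect the freed mass as the
# back-off weight alpha for that prefix.
counts <- c(lamb = 5, dog = 2, house = 1)   # counts of words following a prefix
d <- 0.75                                   # discount, to be tuned

p_discounted <- pmax(counts - d, 0) / sum(counts)  # P*(w | prefix)
alpha <- 1 - sum(p_discounted)                     # mass left for the lower-order model

p_discounted
alpha
```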
The final algorithm will be implemented as a Shiny app that takes the entered phrase as input and returns the suggested word as output. The following challenges can be expected: