The objective of this project is to create a Shiny app that predicts the next word. To do so, a text-prediction algorithm will be developed, using the provided corpora of blogs, news articles and tweets as the training set.
The objective of this milestone report is to present a summary of the exploratory data analysis, the insights gained from it, and the next steps.
The corpora used here come from a link provided on the course syllabus page on Coursera. The data was made available by SwiftKey and was collected from publicly available sources by a web crawler. The data was downloaded from that link on Nov 29, 2021.
The provided data contains corpora in four languages: German, English (US), Finnish and Russian. Each set has 3 files corresponding to text scraped from blogs, news sites and Twitter. We will focus only on the English corpora for this project.
Some basic characteristics of the corpora, such as file size, encoding, and the number of lines and words, are listed below in Table 1.
| File Name | File Size | File Encodings | Total Characters | Total Words | Total Lines | Total Sentences |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 200.42MB | UTF-8 | 208361438 | 38154238 | 899288 | 2354963 |
| en_US.news.txt | 196.28MB | UTF-8 | 15683765 | 2693898 | 77259 | 154141 |
| en_US.twitter.txt | 159.36MB | UTF-8 | 162384825 | 30218125 | 2360148 | 3764817 |
Looking at Table 1, some interesting differences among the three datasets become apparent. The text files corresponding to the blog and news data are almost the same size. However, the blog text contains more than 10 times as many characters, words, lines and sentences as the news text.
The Twitter text file is the smallest of the three. However, it has the largest number of lines as well as sentences. This is not surprising since tweets are restricted to 140 characters, so most lines in that text are expected to be 140 characters long or less. Shorter sentences and words are also expected in that data set, as people try to fit all they need to say into those 140 characters.
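Summary figures like those in Table 1 can be gathered with a few lines of base R. The sketch below is a minimal illustration rather than the code used for this report: `corpus_stats` is a hypothetical helper, words are approximated by splitting on whitespace, and sentence counts are left out because they need a proper sentence tokenizer.

```r
# Hypothetical helper: summarise one text file using base R only.
corpus_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # Approximate word count: whitespace-separated tokens per line.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file       = basename(path),
    size_mb    = round(file.size(path) / 1024^2, 2),
    characters = sum(nchar(lines, type = "chars")),
    words      = sum(words_per_line),
    lines      = length(lines)
  )
}

# Small self-contained demonstration on a temporary file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hello world.", "A second line here."), tmp)
demo_stats <- corpus_stats(tmp)
print(demo_stats)
```

Applied to each of the three corpus files, the rows can be stacked with `rbind` to form a table of the same shape as Table 1.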
The provided corpora are a few hundred megabytes each. Processing files of this size with the limited memory and computing power of a personal computer is difficult, so all further analysis was done using random samples representing 5% of the data from each of the three corpora.
The data was sampled by randomly selecting 5% of the lines from each of the three text sets.
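```r
set.seed(2355)  # fix the seed so the sample is reproducible
# 'data' is assumed to be a list of character vectors, one per corpus
sample_data <- lapply(data, function(dataset) {
  # draw 5% of the lines without replacement
  sample(dataset, size = length(dataset) %/% 20, replace = FALSE)
})
```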
| Dataset | Characters | Words | Lines | Sentences |
|---|---|---|---|---|
| Blogs | 10465983 | 1915251 | 44964 | 118183 |
| News | 789060 | 135560 | 3862 | 7804 |
| Twitter | 8123325 | 1510904 | 118007 | 188081 |
Table 2: Text samples extracted from each of the three files. This table shows the number of characters, words, lines and sentences in the samples extracted from the original corpora. The same statistics for the original corpora can be found in Table 1.
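Next I explored the words in the three sample corpora. The data was processed in successive steps, each corresponding to a column in Table 3: foreign words were removed first, then numerics, and finally stopwords.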
| Dataset | Total | English | % Foreign | Without Numbers | % Numeric | Without Stopwords | % Stopwords | Unique | Most Frequent | Second Frequent | Third Frequent |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Blogs | 1915251 | 1872110 | 2.2524985 | 1850374 | 1.161043 | 684486 | 63.43773 | 61345 | time | people | day |
| News | 135560 | 134244 | 0.9707878 | 130525 | 2.770329 | 59143 | 55.94365 | 16794 | time | people | home |
| Twitter | 1510904 | 1502749 | 0.5397431 | 1476434 | 1.751124 | 595950 | 60.34268 | 60820 | love | rt | day |
Table 3: Analysis of words from the three text data sets. This table shows the results of the word analysis of the three texts. The data was processed as described above under Analysing Words. The column ‘Without Stopwords’ contains the number of words left after all processing is done. The column ‘Unique’ gives the number of unique non-stopword English words in each set.
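The successive filtering behind Table 3 can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the processing actually used for the report: a token is treated as foreign if it contains any non-ASCII character, a token is treated as numeric if it contains a digit, and `stopwords_demo` is a deliberately tiny, made-up stopword list. The names `has_non_ascii` and `filter_words` are hypothetical.

```r
# Illustrative stopword list (an assumption; a real list is far longer).
stopwords_demo <- c("the", "a", "is", "of", "and", "in", "to")

# Assumed rule: a token counts as foreign if it has a non-ASCII code point.
has_non_ascii <- function(x) {
  vapply(enc2utf8(x), function(w) any(utf8ToInt(w) > 127L), logical(1),
         USE.NAMES = FALSE)
}

filter_words <- function(text) {
  # Lowercase, split on whitespace, strip leading/trailing punctuation.
  tokens <- unlist(strsplit(tolower(text), "\\s+"))
  tokens <- gsub("^[[:punct:]]+|[[:punct:]]+$", "", tokens)
  tokens <- tokens[nzchar(tokens)]

  english      <- tokens[!has_non_ascii(tokens)]          # drop foreign words
  no_numbers   <- english[!grepl("[0-9]", english)]       # drop numerics
  no_stopwords <- no_numbers[!no_numbers %in% stopwords_demo]

  list(total = length(tokens),
       english = length(english),
       without_numbers = length(no_numbers),
       without_stopwords = length(no_stopwords),
       unique = length(unique(no_stopwords)),
       words = no_stopwords)
}

res <- filter_words("The caf\u00e9 is open 24 hours, and the time is now.")
print(res$words)
```

Each count in the returned list corresponds to one of the count columns of Table 3, so the percentage columns follow from the differences between consecutive counts.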
The above analysis shows some remarkable differences between the three sample corpora. The News corpus has a vocabulary roughly one fourth the size of that of the other corpora (refer to column ‘Unique’ above). The three corpora also differ notably in the proportion of foreign words, stopwords and numerics.
Understanding the frequency distribution of words in a corpus is useful in determining the size of the frequency table a text-prediction algorithm will need for acceptable coverage. The following plots show the frequency distribution of words in the three corpora.
Figure 1: Frequency distribution of words. This figure shows the frequency distribution of the 250 most frequent words in the three datasets. The word set used in this plot was derived by removing foreign words and numerics, and includes stopwords.
Figure 1 shows that the word frequencies drop rapidly at first and then more gradually, suggesting that the number of unique words required to cover a corpus may increase exponentially with the desired coverage. The plots also show that the frequency distribution curves are quite different for the three corpora, especially for the News corpus, where the frequencies drop more gradually than in the other corpora.
The differences in the three corpora are further emphasized by the most frequent words in these corpora shown in Figure 2 below.
Figure 2: Most frequent words. This figure shows the 20 most frequent words in the three corpora. This plot uses a word set that excludes foreign words, numerics and stopwords.
As stated before, the frequency distribution of words in a corpus helps us determine the size of the frequency tables needed for modeling text-prediction algorithms. Based on the word frequencies in each data set, the number of words required to cover a specified percentage of that dataset was calculated.
| dataset | 25% | 50% | 75% | 90% | 95% |
|---|---|---|---|---|---|
| Blogs | 363 | 1594 | 5999 | 17496 | 30030 |
| News | 371 | 1502 | 4825 | 10881 | 13838 |
| Twitter | 175 | 985 | 4415 | 15476 | 31024 |
Table 4: Number of words required for indicated coverage. This table shows how many words are required to cover the given percentage of text for each of the three data sets. This table uses the English word sets including stopwords but excluding numerics.
Figure 3: Increase in the number of words required for increased coverage. This figure depicts the exponential increase in the number of unique words required for increasing coverage of the corpora.
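The coverage calculation summarised in Table 4 amounts to sorting words by frequency and finding where the cumulative share of tokens first reaches each target. The following is a minimal sketch of that idea; `words_for_coverage` is a hypothetical name and the toy vocabulary is made up for illustration.

```r
# Number of most-frequent words needed to cover a given share of all tokens.
words_for_coverage <- function(words,
                               coverage = c(0.25, 0.50, 0.75, 0.90, 0.95)) {
  freq <- sort(table(words), decreasing = TRUE)
  # Cumulative share of tokens covered by the top-k words, for k = 1, 2, ...
  cum_share <- cumsum(as.numeric(freq)) / length(words)
  # First k at which the cumulative share reaches each target.
  vapply(coverage, function(p) which(cum_share >= p)[1], integer(1))
}

# Toy example: 10 tokens, 4 distinct words.
toy <- c(rep("the", 5), rep("time", 3), "people", "day")
needed <- words_for_coverage(toy, coverage = c(0.5, 0.85, 1))
print(needed)
```

In the toy example a single word covers half of the tokens, while full coverage needs all four words, which mirrors on a small scale the steep growth seen at high coverage.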
The text-prediction model I am looking to build will use frequency tables of n-grams to predict the next word. Therefore, the frequencies and distribution of n-grams were explored. The following table shows the total, unique and repeated bigrams, trigrams, quadrugrams and quintugrams in the three datasets. The tokenizers package was used to generate the n-grams.
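| Dataset | Bigrams: Total | Bigrams: Unique | Bigrams: Repeated | Trigrams: Total | Trigrams: Unique | Trigrams: Repeated | Quadrugrams: Total | Quadrugrams: Unique | Quadrugrams: Repeated | Quintugrams: Total | Quintugrams: Unique | Quintugrams: Repeated |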
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Blogs | 1870836 | 703420 | 161169 | 1827733 | 1413455 | 133107 | 1786168 | 1689407 | 48202 | 1746202 | 1722430 | 12473 |
| News | 131705 | 88856 | 13034 | 127889 | 120697 | 4260 | 124131 | 123082 | 749 | 120437 | 120127 | 129 |
| Twitter | 1392927 | 555420 | 117687 | 1278165 | 997847 | 88810 | 1168586 | 1093513 | 33174 | 1065049 | 1033702 | 10887 |
Figure 4: Comparison of bi-, tri- and quadrugrams. This figure shows how the number of unique (left) and repeated (right) n-grams changes with increasing \(n\). While the number of unique n-grams increases rapidly with increasing \(n\), the number of repeated n-grams falls rapidly.
It can be seen in Figure 4 that the number of unique n-grams increases sharply between bigrams and trigrams. However, there is only a moderate increase between unique trigrams and quadrugrams, and none thereafter. In contrast, the number of repeated n-grams drops sharply as \(n\) increases.
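The total, unique and repeated counts discussed above can be reproduced in miniature with the sketch below. The report generated its n-grams with the tokenizers package; here a base-R stand-in (`make_ngrams`, a hypothetical helper) is used so the example is self-contained. It is also an assumption that "repeated" means the number of distinct n-grams occurring more than once.

```r
# Base-R stand-in for an n-gram tokenizer (not the tokenizers package).
make_ngrams <- function(line, n) {
  w <- unlist(strsplit(tolower(trimws(line)), "\\s+"))
  w <- w[nzchar(w)]
  if (length(w) < n) return(character(0))
  # Slide a window of n words along the line.
  vapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngram_summary <- function(lines, n) {
  grams  <- unlist(lapply(lines, make_ngrams, n = n))
  counts <- table(grams)
  c(total    = length(grams),
    unique   = length(counts),
    repeated = sum(counts > 1))  # assumed meaning: distinct n-grams seen > once
}

toy_lines <- c("i love you", "i love you too", "love you so much")
bi  <- ngram_summary(toy_lines, 2)
tri <- ngram_summary(toy_lines, 3)
print(rbind(bigrams = bi, trigrams = tri))
```

Even in this toy input the share of repeated n-grams falls when moving from bigrams to trigrams, which is the same pattern the table shows at scale.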
The frequency distribution of n-grams may be useful in determining how many n-grams should be used for modeling. Figure 5 below shows the frequency distribution of bi, tri and quadrugrams in the blogs data set while Figure 6 shows the frequency distribution of trigrams in all three datasets.
Let us now look at the 20 most frequent trigrams in the three datasets.
Some important insights were gained from this exploratory analysis.
We clearly see that the three datasets are quite different in terms of vocabulary and word distribution as well as the number and distribution of unique n-grams. This suggests that depending on the final application, the data used for modeling should be selected carefully.
The second important insight from this analysis comes from how the numbers of unique and repeated n-grams change with increasing \(n\). Since the number of unique n-grams largely levels off after \(n\) = 3, and the number of repeated n-grams falls drastically from trigrams to quadrugrams, bigrams may be best suited as input to the predictive algorithm for finding the next word.
This analysis also gives us some idea of the size of the word and n-gram frequency tables required for the model. Since the required table size grows roughly exponentially with coverage, at coverage values above 90% large increases in size are required for relatively small gains in coverage. Thus, beyond a certain threshold, small gains in accuracy require large increases in the size of the app, which will make it sluggish. Therefore, the size of these tables needs to be adjusted carefully to balance the accuracy and the speed of the app.
Based on the insights above, a text-prediction model will be built that is reasonably accurate and fast. The News corpus seems to have a relatively small and specialized vocabulary, and may reduce the accuracy of a general-purpose text-prediction algorithm. I will therefore derive my training dataset solely from the Blogs and Twitter corpora.
A frequency table of unique and repeated n-grams (\(n\) from 2 to 5) will be created. Depending on the available input, a unigram or a longer n-gram will be fed to the algorithm, which finds the 3 most frequent n-grams in the frequency table that are one word longer than the input. When no matches are found, the first word of the input will be removed and the resulting shorter n-gram will be fed to the algorithm. This process can be repeated until a single word is left to be fed to the algorithm.
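The lookup-and-shorten procedure described above can be sketched as follows. This is an illustrative outline under assumptions, not the final implementation: the frequency table is represented as a named numeric vector, `predict_next` is a hypothetical function name, and the toy frequencies are invented for the example.

```r
# Toy n-gram frequency table (illustrative values, not from the corpora).
ngram_freq <- c("i love you" = 5, "i love it" = 3, "love you too" = 2,
                "love you" = 9, "love it" = 4, "love the" = 3,
                "love my" = 1, "you too" = 6)

predict_next <- function(input, freq, top = 3, max_n = 5) {
  words <- unlist(strsplit(tolower(trimws(input)), "\\s+"))
  words <- tail(words, max_n - 1)  # longest context a max_n-gram can use
  n_words <- lengths(strsplit(names(freq), " "))
  while (length(words) > 0) {
    prefix <- paste0(paste(words, collapse = " "), " ")
    # Candidates: n-grams starting with the input and exactly one word longer.
    cand <- freq[startsWith(names(freq), prefix) &
                   n_words == length(words) + 1]
    if (length(cand) > 0) {
      best <- names(sort(cand, decreasing = TRUE))
      best <- best[seq_len(min(top, length(best)))]
      return(sub(".* ", "", best))  # keep only the predicted last word
    }
    words <- words[-1]  # no match: drop the first word and retry
  }
  character(0)  # nothing found even for a single word
}

p1 <- predict_next("i love", ngram_freq)   # direct trigram match
p2 <- predict_next("we love", ngram_freq)  # falls back to the bigram table
p3 <- predict_next("xyz", ngram_freq)      # no match at any level
print(list(p1, p2, p3))
```

A named vector keeps the sketch short; for the real tables, a structure with faster prefix lookup would probably be needed to keep the app responsive.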
Since the app needs to hold the frequency tables in memory, care will be taken to balance accuracy, app size and speed of execution.