Next-word predictor: Milestone Report

Objective

The objective of the project under development is to create a Shiny app with next-word prediction functionality. To do so, a text-prediction algorithm will be developed using the provided corpora of blogs, news articles and tweets as the training set.

The objective of this milestone report is to present a summary of the exploratory data analysis, the insights gained from it, and the future steps.

The Corpora

The corpora used here come from a link provided on the course syllabus page on Coursera. The data was made available by SwiftKey and was collected from publicly available sources by a web crawler. The data was downloaded from that link on Nov 29, 2021.

The provided data contains corpora in four languages: German, English (US), Finnish and Russian. Each set has three files corresponding to text scraped from blogs, news sites and Twitter. We will focus only on the English corpora for this project.

Basic features

Some basic characteristics of the corpora, such as file size, encoding, and the number of lines and words, are listed in Table 1 below.

Table 1: Basic details of the Coursera-SwiftKey English dataset
File Name	File Size	File Encodings	Total Characters	Total Words	Total Lines	Total Sentences
en_US.blogs.txt	200.42MB	UTF-8	208361438	38154238	899288	2354963
en_US.news.txt	196.28MB	UTF-8	15683765	2693898	77259	154141
en_US.twitter.txt	159.36MB	UTF-8	162384825	30218125	2360148	3764817

Table 1 reveals some interesting differences among the three datasets. The text files for the blog and news data are almost the same size, yet the blog text contains more than 10 times as many characters, words, lines and sentences as the news text. A gap this large between files of nearly equal size is unexpected and may indicate that the news file was only partially read, so the news counts should be treated with caution.

The Twitter text file is the smallest of the three, yet it has the most lines and sentences. This is not surprising: Twitter limited tweets to 140 characters when this data was collected, so most lines in that text are expected to be at most 140 characters long. Shorter sentences and words are also expected in that dataset, as people try to fit everything they need to say into those 140 characters.

The provided files are a few hundred megabytes each. Processing files of this size with the limited memory and computing power of a personal computer is difficult, so all further analysis was done on random samples representing 5% of the data from each of the three corpora.
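As a rough illustration of how counts like those in Table 1 might be obtained, here is a minimal base-R sketch. The helper name `corpus_stats` and the whitespace-based definition of a word are assumptions made for this example, not the method used for the report, and sentence counts are left out.

```r
# Hypothetical helper (not from the report): basic statistics for one text file.
corpus_stats <- function(path) {
  # Binary mode reads the file byte for byte, so reading does not stop early
  # at embedded control characters on platforms where text mode would.
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

  # Assumption: a "word" is any run of non-whitespace characters.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))

  data.frame(
    size_mb    = file.size(path) / 1024^2,
    characters = sum(nchar(lines, type = "chars", allowNA = TRUE), na.rm = TRUE),
    words      = sum(words_per_line),
    lines      = length(lines)
  )
}

# Tiny self-contained demo on a temporary file.
demo_path <- tempfile(fileext = ".txt")
writeLines(c("hello world", "", "one two three"), demo_path)
demo_stats <- corpus_stats(demo_path)
print(demo_stats)
```

Comparing the character count against the file size is a quick sanity check: for mostly ASCII English text the two should be of the same order of magnitude.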

Sampling the data

The data was sampled by randomly selecting 5% of the lines from each of the three text sets.

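set.seed(2355)  # fix the seed so the sample is reproducible
# `data` is assumed to be a named list with one character vector of lines per corpus
sample_data <- lapply(data, function(dataset) {
  # draw 5% of the lines (integer division by 20) without replacement
  sample(dataset, size = length(dataset) %/% 20, replace = FALSE)
})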

Table 2: Text samples extracted from each of the three files
Dataset	Characters	Words	Lines	Sentences
Blogs	10465983	1915251	44964	118183
News	789060	135560	3862	7804
Twitter	8123325	1510904	118007	188081

Table 2 shows the number of characters, words, lines and sentences in the samples extracted from the original corpora. The same statistics for the original corpora can be found in Table 1.

Analyzing Words

Next, I explored the words in the three sample corpora. The data was processed in the following steps:

1. Tokenize the data into words (using tidytext). This step also converts the words to lowercase and removes all punctuation and whitespace.
2. Remove non-English words by filtering out words with non-ASCII characters (package stringi).
3. Remove words without any alphabetic characters. Most of these are numbers. However, this step does not remove alphanumeric words such as ‘4ever’.
4. Remove common English words, called stopwords, as defined in package stopwords. Stopwords are words that occur with high frequency, such as ‘a’, ‘the’, ‘she’, ‘just’, ‘into’ and ‘therefore’, without contributing distinct character to a text body.
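The steps above can be sketched in base R as follows. This is a simplified stand-in for the tidytext, stringi and stopwords packages named in the report, not the author's code: the function name `clean_words`, the whitespace tokenizer and the tiny stopword list are all assumptions made for illustration.

```r
# Hypothetical base-R approximation of the four processing steps.
clean_words <- function(lines, stopword_list) {
  # Step 1: lowercase, split on whitespace, strip punctuation, drop empties.
  words <- unlist(strsplit(tolower(lines), "\\s+"))
  words <- gsub("[[:punct:]]", "", words)
  words <- words[nzchar(words)]

  # Step 2: keep only words whose code points are all ASCII (< 128).
  is_ascii <- vapply(
    words,
    function(w) isTRUE(all(utf8ToInt(w) < 128L)),
    logical(1),
    USE.NAMES = FALSE
  )
  words <- words[is_ascii]

  # Step 3: keep words with at least one letter ('4ever' stays, '2021' goes).
  words <- words[grepl("[a-z]", words)]

  # Step 4: drop stopwords.
  words[!(words %in% stopword_list)]
}

# Illustrative stopword list; the report used the stopwords package instead.
toy_stopwords <- c("the", "she", "just", "in", "it")
toy_lines <- c("The caf\u00e9 opens in 2021, 4ever!", "She just LOVES it.")
cleaned <- clean_words(toy_lines, toy_stopwords)
print(cleaned)
```

Unlike this sketch, a dedicated tokenizer may treat apostrophes and hyphens inside words differently, so counts from the two approaches will not match exactly.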

Table 3: Analysis of words from the three text data sets
Dataset	Total	English	% Foreign	Without Numbers	% Numeric	Without Stopwords	% Stopwords	Unique	Most Frequent	Second Most Frequent	Third Most Frequent
Blogs	1915251	1872110	2.25	1850374	1.16	684486	63.44	61345	time	people	day
News	135560	134244	0.97	130525	2.77	59143	55.94	16794	time	people	home
Twitter	1510904	1502749	0.54	1476434	1.75	595950	60.34	60820	love	rt	day

Table 3 shows the results of the word analysis of the three texts. The data was processed as described above under Analyzing Words. ‘% Foreign’ is relative to the total word count, while ‘% Numeric’ and ‘% Stopwords’ are relative to the English word count; note that ‘% Stopwords’ compares the ‘English’ and ‘Without Stopwords’ columns and therefore also includes the numeric tokens removed in the previous step. The column ‘Without Stopwords’ contains the words remaining after all processing. The column ‘Unique’ gives the number of unique non-stopword English words in each set.

The above analysis shows some notable differences among the three sample corpora. The News sample has roughly one fourth the vocabulary of the other corpora (see column ‘Unique’ above), though this partly reflects its much smaller sample (see Table 2). The three corpora also differ in the proportion of foreign words, stopwords and numerics.

Frequency distribution

Understanding the frequency distribution of words in a corpus helps determine how large a frequency table a text-prediction algorithm needs for acceptable coverage. The following plots show the frequency distribution of words in the three corpora.

Fig. 1: Frequency distribution of words. This figure shows the frequency distribution of the 250 most frequent words in the three datasets. The word set obtained after removing foreign words and numerics, but including stopwords, is used in this plot.

Fig. 1 shows that word frequencies drop rapidly at first and then more gradually, suggesting that the number of unique words required to cover a corpus may increase exponentially with the desired coverage. The plots also show that the frequency distribution curves differ among the three corpora, especially for the News corpus, where the frequencies drop more gradually than in the other corpora.

The differences in the three corpora are further emphasized by the most frequent words in these corpora shown in Figure 2 below.

Fig. 2: Most Frequent words

Figure 2 shows the 20 most frequent words in the three corpora. This plot uses a word set that excludes foreign words, numerics and stopwords.

As stated before, the frequency distribution of words in a corpus helps us determine the size of the frequency tables needed for modeling text-prediction algorithms. Based on the word frequencies in each dataset, the number of words required to cover a specified percentage of that dataset was calculated.

Table 4: Number of words required for indicated coverage
dataset	25%	50%	75%	90%	95%
Blogs	363	1594	5999	17496	30030
News	371	1502	4825	10881	13838
Twitter	175	985	4415	15476	31024

Table 4 shows how many words are required to cover the given percentage of text in each of the three datasets. It uses the English word sets, including stopwords but excluding numerics.

Figure 3: Increase in the number of words required for increased coverage. This figure depicts the exponential increase in the number of unique words required as coverage of the corpora increases.
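A coverage table like Table 4 can be derived from a sorted frequency table and its cumulative share. The sketch below shows one way to do this in base R; the function name `words_for_coverage` and the toy input are illustrative assumptions, not the report's code.

```r
# Hypothetical helper: number of most-frequent words needed to reach each
# coverage level of a word vector.
words_for_coverage <- function(words, coverage = c(0.25, 0.5, 0.75, 0.9, 0.95)) {
  # Word counts, most frequent first.
  freq <- sort(table(words), decreasing = TRUE)
  # Share of all word occurrences covered by the top k words, for each k.
  cum_share <- cumsum(as.numeric(freq)) / sum(freq)
  # Smallest k whose cumulative share reaches the requested coverage.
  needed <- vapply(coverage, function(p) which(cum_share >= p)[1], integer(1))
  names(needed) <- paste0(coverage * 100, "%")
  needed
}

# Toy corpus: "a" occurs 5 times, "b" 3 times, "c" twice.
toy_words <- rep(c("a", "b", "c"), times = c(5, 3, 2))
toy_coverage <- words_for_coverage(toy_words, c(0.5, 0.75, 0.9))
print(toy_coverage)
```

Because the words are sorted by frequency, each additional word contributes less coverage than the one before it, which is why the required vocabulary grows so quickly at high coverage levels.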

Analyzing n-grams

The predictive text model I plan to build will use frequency tables of n-grams to predict the next word. Therefore, the frequencies and distribution of n-grams were explored. The following table shows the total, unique and repeated bigrams, trigrams, quadrugrams and quintugrams in the three datasets. The tokenizers package was used to generate the n-grams.

Table 5: Analysis of n-grams from the three corpora
	Bigrams			Trigrams			Quadrugrams			Quintugrams
Dataset	Total	Unique	Repeated	Total	Unique	Repeated	Total	Unique	Repeated	Total	Unique	Repeated
Blogs	1870836	703420	161169	1827733	1413455	133107	1786168	1689407	48202	1746202	1722430	12473
News	131705	88856	13034	127889	120697	4260	124131	123082	749	120437	120127	129
Twitter	1392927	555420	117687	1278165	997847	88810	1168586	1093513	33174	1065049	1033702	10887

Figure 4: Comparison of bi-, tri- and quadrugrams. This figure shows how the number of unique (left) and repeated (right) n-grams changes with increasing \(n\). While the number of unique n-grams rises rapidly with increasing \(n\), the number of repeated n-grams falls rapidly.

Figure 4 shows that the number of unique n-grams increases sharply between bigrams and trigrams. However, there is only a moderate increase between unique trigrams and quadrugrams, and little or none thereafter (see Table 5). In contrast, repeated n-grams drop sharply as \(n\) increases.
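To illustrate how the total, unique and repeated counts in Table 5 relate to one another, here is a small base-R sketch. The report used the tokenizers package; this example builds n-grams by hand to stay self-contained. It assumes that "unique" means distinct n-grams and "repeated" means distinct n-grams seen more than once, which is an interpretation rather than something the report states, and the function names are hypothetical.

```r
# Hypothetical helper: all n-grams of a word vector, as space-joined strings.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1L)
  vapply(
    starts,
    function(i) paste(words[i:(i + n - 1L)], collapse = " "),
    character(1)
  )
}

# Hypothetical helper: total, distinct, and repeated (count > 1) n-grams.
ngram_summary <- function(words, n) {
  counts <- table(ngrams(words, n))
  c(
    total    = sum(counts),
    unique   = length(counts),
    repeated = sum(counts > 1L)
  )
}

toy_tokens <- c("a", "b", "a", "b", "c")
bigram_summary <- ngram_summary(toy_tokens, 2)
print(bigram_summary)
```

In practice n-grams would be built within each line or sentence rather than across one long token vector, so that no n-gram spans two unrelated lines.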

Frequency distribution

The frequency distribution of n-grams may be useful in determining how many n-grams should be used for modeling. Figure 5 below shows the frequency distribution of bi-, tri- and quadrugrams in the Blogs dataset, while Figure 6 shows the frequency distribution of trigrams in all three datasets.

Let us now look at the 20 most frequent trigrams in the three datasets.

Fig. 7: Top 20 most frequent trigrams

Insights gained from the exploratory data analysis

Some important insights were gained from this exploratory analysis.

We clearly see that the three datasets differ considerably in vocabulary and word distribution, as well as in the number and distribution of unique n-grams. This suggests that the data used for modeling should be selected carefully, depending on the final application.

The second important insight comes from the changes in unique and repeated n-gram counts with increasing \(n\). Since the number of unique n-grams largely levels off beyond \(n\) = 3 and the number of repeated n-grams falls drastically from tri- to quadrugrams, bigrams, which have the most repeated entries, could be best suited as input to the predictive algorithm for finding the next word.

This analysis also gives us some idea of the size of the word and n-gram frequency tables required for the model. Since the required table size grows exponentially with coverage, at coverage values above 90% large increases in size are required for relatively small gains in coverage. Thus, beyond a certain threshold, small gains in accuracy require huge increases in the size of the app, which would make it sluggish. Therefore, the size of these tables needs to be chosen carefully to balance the accuracy and the speed of the app.
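To make the diminishing returns concrete, the short calculation below uses the Blogs row of Table 4 to estimate how many additional words are needed per extra percentage point of coverage. The variable names are illustrative.

```r
# Blogs row of Table 4: words needed at each coverage level.
coverage_pct <- c(25, 50, 75, 90, 95)
blogs_words  <- c(363, 1594, 5999, 17496, 30030)

# Additional words required per extra percentage point of coverage,
# for each step between consecutive coverage levels.
words_per_point <- diff(blogs_words) / diff(coverage_pct)
names(words_per_point) <- paste0(head(coverage_pct, -1), "-", tail(coverage_pct, -1), "%")
print(round(words_per_point, 1))
```

Each step up in coverage costs more words per percentage point than the step before it, with the step from 90% to 95% being the most expensive by a wide margin.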

Future Steps

Based on the insights above, a text-prediction model will be built that is reasonably accurate and fast. The News corpus seems to have a relatively small and specialized vocabulary and may reduce the accuracy of a general-purpose text-prediction algorithm. I will therefore derive my training dataset solely from the Blogs and Twitter corpora.

A frequency table of unique and repeated n-grams (\(n\) from 2 to 5) will be created. Depending on the available input, a unigram or a longer n-gram will be fed to the algorithm, which will look up in the frequency table the three most frequent n-grams that are one word longer than the input. When no match is found, the first word of the input will be removed and the resulting shorter n-gram will be fed to the algorithm. This process can be repeated until a single word is left to be fed to the algorithm.
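The back-off lookup described above might look roughly like the following sketch. It is only one possible reading of the plan: the function name `predict_next`, the table layout (columns `prefix`, `next_word` and `count`) and the toy data are all assumptions made for this example.

```r
# Hypothetical back-off lookup. freq_table is assumed to be a data frame with
# one row per n-gram, split into its leading words (prefix), its final word
# (next_word) and its frequency (count).
predict_next <- function(input_words, freq_table, top_n = 3L, max_prefix = 4L) {
  # Use at most the last max_prefix words (n-grams up to n = 5).
  words <- tail(tolower(input_words), max_prefix)
  while (length(words) > 0L) {
    key  <- paste(words, collapse = " ")
    hits <- freq_table[freq_table$prefix == key, , drop = FALSE]
    if (nrow(hits) > 0L) {
      # Return the most frequent continuations of the matched prefix.
      hits <- hits[order(-hits$count), , drop = FALSE]
      return(head(hits$next_word, top_n))
    }
    # No match: drop the first word and retry with the shorter prefix.
    words <- words[-1L]
  }
  character(0)  # nothing matched, even for the last single word
}

# Toy frequency table (made-up counts, for illustration only).
toy_table <- data.frame(
  prefix    = c("of", "of", "of", "of", "the end of", "a"),
  next_word = c("the", "a", "course", "my", "the", "lot"),
  count     = c(50, 20, 10, 5, 3, 7),
  stringsAsFactors = FALSE
)
print(predict_next(c("at", "the", "end", "of"), toy_table))
print(predict_next(c("kind", "of"), toy_table))
```

A linear scan over a data frame is used here only for clarity; a keyed lookup structure would likely be needed to keep the app responsive with full-size tables.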

Since the app needs to hold the frequency tables in memory, care will be taken to balance accuracy, app size and speed of execution.

Data Science Specialization Capstone Project

Sangeeta Shah

December 3, 2021
