The purpose of this document is to outline the exploratory data analysis performed on three corpora, in service of planning a simple model for a predictive text algorithm. The algorithm will predict the next word given a unigram, bigram or trigram input (i.e. it will output the final token of a bigram, trigram or quadgram, respectively).
There were three corpora to use as training data for the model: a collection of tweets, a collection of news articles, and a collection of blog posts.
These files were downloaded from the course website and read in from disk for the purpose of this exercise.
As a first step, each file was loaded in its entirety to summarise the data held within, pulling out the file size, the number of lines, and the total number of words.
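A minimal sketch of this summary step in R is shown below; the file path is hypothetical and the exact loading code used in the analysis may differ.

```r
# Hypothetical path to one corpus file; the real file names are not shown here.
path  <- "tweets.txt"
lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)

size_kb <- file.size(path) / 1024                  # corpus size in KB
n_lines <- length(lines)                           # number of lines
n_words <- sum(lengths(strsplit(lines, "\\s+")))   # total word count
summary(nchar(lines))                              # per-line length statistics
```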
| Corpus | Size (KB) | Number of Lines | Number of Words |
|---|---|---|---|
| Tweets | 326,645.2 | 2,360,148 | 30,093,372 |
| Statistic | Line Length (characters) | Words per Line |
|---|---|---|
| Min | 2.00 | 1.00 |
| 1st Qu. | 37.00 | 7.00 |
| Median | 64.00 | 12.00 |
| Mean | 68.68 | 12.75 |
| 3rd Qu. | 100.00 | 18.00 |
| Max | 140.00 | 47.00 |
| Corpus | Size (KB) | Number of Lines | Number of Words |
|---|---|---|---|
| News | 5,235.1 | 20,000 | 691,184 |
| Statistic | Line Length (characters) | Words per Line |
|---|---|---|
| Min | 2.0 | 1.00 |
| 1st Qu. | 109.0 | 19.00 |
| Median | 185.0 | 32.00 |
| Mean | 202.1 | 34.56 |
| 3rd Qu. | 269.0 | 46.00 |
| Max | 2,900.0 | 532.00 |
| Corpus | Size (KB) | Number of Lines | Number of Words |
|---|---|---|---|
| Blogs | 5,802.8 | 20,000 | 833,312 |
| Statistic | Line Length (characters) | Words per Line |
|---|---|---|
| Min | 2.0 | 1.00 |
| 1st Qu. | 47.0 | 9.00 |
| Median | 156.0 | 28.00 |
| Mean | 229.4 | 41.67 |
| 3rd Qu. | 330.0 | 60.00 |
| Max | 3,644.0 | 654.00 |
With 2,360,148 tweets, 20,000 news articles and 20,000 blogs, there are 2,400,148 documents in total to potentially hold in memory and analyse. This is clearly too much for most mobile phones and would lead to an unacceptably long runtime for the model. It was therefore necessary to subset the data in subsequent exploration. Additionally, large objects with no further use were removed from memory as the code executed, to free up RAM.
Exploration and calculations were initially performed on the tweet dataset, and then replicated for the news and blog datasets.
A subset of the data was loaded to avoid excessive RAM usage and runtime. Entries were sampled by looping over the lines in the dataset and using rbinom to decide whether or not to include each line. This randomisation of line selection helped to mitigate bias in the sampling process.
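A minimal sketch of this sampling step is shown below, assuming `lines` is the character vector loaded earlier; the 1% inclusion probability and the seed are illustrative assumptions rather than the values actually used.

```r
set.seed(1234)   # illustrative seed for reproducible sampling

# rbinom() decides inclusion for every line at once (a vectorised
# equivalent of looping over the lines); prob = 0.01 is an assumed rate.
keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1
sample_lines <- lines[keep]
```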
With the sampled data held as both a character vector and a corpus, a spell-checking function implemented via the hunspell package was used to check for non-English or misspelled words.
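The check could look something like the sketch below, assuming `sample_lines` is the sampled character vector from the previous step.

```r
library(hunspell)

bad_words <- hunspell(sample_lines)      # unrecognised words, per line
bad_words <- unique(unlist(bad_words))   # flatten to a unique character vector
length(bad_words)                        # count of misspelled or non-English words
```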
A total of 7,901 misspelled or non-English words were detected in the tweet dataset. They were added to a list of additional stopwords, together with a profanity list (http://www.bannedwordlist.com/lists/swearWords.txt), to be filtered out so that they would not corrupt the model. A potential extension would be to apply the spell checker's suggested corrections to the misspelled words, thereby retrieving some information at the cost of potentially confounding the dataset with mistaken suggestions. This was not implemented because of its impact on runtime and the additional complexity of validating the suggested corrections.
Next, the data were tokenised. The list of additional stopwords was passed for removal from the list of tokens, so that modelling did not take them into account. All further processing was performed on the tokenised list, so the corpus was removed from memory.
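A sketch of the tokenisation is given below, assuming `sample_lines` and a combined `extra_stopwords` vector (the misspelled words plus the profanity list); the punctuation and number removal options are assumptions about the cleaning applied.

```r
library(quanteda)

corp <- corpus(sample_lines)
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, extra_stopwords)   # drop the additional stopwords
rm(corp)   # the corpus object is no longer needed once tokenised
```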
Stemming reduces words to their most basic root (e.g. running -> run), thereby reducing the number of unique words required in the dataset and model. In a real-world application, this would necessitate an additional layer of checking to suggest a valid word to the user in the correct tense or form (e.g. run -> ran/run/running, or simpl -> simple/simplify/simplest); this was deemed out of the scope of this exercise. Another advantage of stemming is that it increases the coverage of real-world n-grams, including unseen n-grams, with fewer words in the model, since each stem can serve multiple surface forms.
To perform stemming and frequency counting, the tokens were stemmed and converted to a document feature matrix, which was passed to the textstat_frequency function; this returns a data frame of each unique word and the frequency with which it appears in the corpus. As a measure of coverage, the cumulative sum of word frequency was added to the data frame, expressed as both a raw count and a percentage of all word instances.
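A sketch of this step is shown below, assuming the `toks` object from the tokenisation above; tokens_wordstem() is used here as one way to apply the stemming, and in recent quanteda releases textstat_frequency() lives in the companion quanteda.textstats package.

```r
library(quanteda)
library(quanteda.textstats)

toks_stem <- tokens_wordstem(toks)             # reduce each token to its stem
freqs <- textstat_frequency(dfm(toks_stem))    # one row per unique stem, sorted by frequency
freqs$cum_freq <- cumsum(freqs$frequency)
freqs$cum_pct  <- 100 * freqs$cum_freq / sum(freqs$frequency)

nrow(freqs)                        # unique words after tokenisation and stemming
min(which(freqs$cum_pct >= 50))    # words required for 50% coverage
min(which(freqs$cum_pct >= 90))    # words required for 90% coverage
```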
A total of 17,697 unique words were identified after tokenisation. To achieve 50% coverage of all word instances, only 363 were required, and to achieve 90% coverage only 5,408 were required.
With frequencies calculated, it was possible to visualise the distribution of word frequencies. A scatter plot shows the frequency of the most common 500 words, and a bar plot shows the 20 most common words with their frequencies.
[Figure: frequency of the 500 most common words in the tweet sample (scatter plot)]
[Figure: the 20 most common words in the tweet sample with their frequencies (bar plot)]
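The plots could be reproduced along the following lines, assuming the `freqs` data frame from the coverage calculation above.

```r
library(ggplot2)

ggplot(head(freqs, 500), aes(x = rank, y = frequency)) +
  geom_point() +
  labs(title = "Frequency of the 500 most common words")

ggplot(head(freqs, 20), aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "The 20 most common words")
```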
The top 500 words show a shape reminiscent of a power law, with frequency dropping rapidly after the top handful of words before settling into a slowly decaying tail.
Finally, the bigrams, trigrams and quadgrams were generated from the tokenised and stemmed dataset. These will form the basis of the model's recommendations.
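A sketch of the n-gram generation, assuming the stemmed tokens object `toks_stem` from above; tokens_ngrams() joins adjacent tokens with an underscore by default.

```r
library(quanteda)

bigrams   <- tokens_ngrams(toks_stem, n = 2)
trigrams  <- tokens_ngrams(toks_stem, n = 3)
quadgrams <- tokens_ngrams(toks_stem, n = 4)
```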
A total of 107,020 bigrams, 103,861 trigrams and 86,950 quadgrams were found.
The same processes were then performed for the news and blogs datasets.
[Figure: frequency of the 500 most common words in the news sample (scatter plot)]
[Figure: the 20 most common words in the news sample with their frequencies (bar plot)]
A total of 28,343 unique words were identified after tokenisation. To achieve 50% coverage of all word instances, only 656 were required, and to achieve 90% coverage only 6,455 were required.
A total of 315,741 bigrams, 346,512 trigrams and 331,305 quadgrams were found.
[Figure: frequency of the 500 most common words in the blog sample (scatter plot)]
[Figure: the 20 most common words in the blog sample with their frequencies (bar plot)]
A total of 29,721 unique words were identified after tokenisation. To achieve 50% coverage of all word instances, only 565 were required, and to achieve 90% coverage only 6,184 were required.
A total of 347,827 bigrams, 385,325 trigrams and 369,988 quadgrams were found.
| Document Type | Unique Words | Words for 50% Coverage | Words for 90% Coverage | Bigrams | Trigrams | Quadgrams |
|---|---|---|---|---|---|---|
| Tweets | 17,697 | 363 | 5,408 | 107,020 | 103,861 | 86,950 |
| News | 28,343 | 656 | 6,455 | 315,741 | 346,512 | 331,305 |
| Blogs | 29,721 | 565 | 6,184 | 347,827 | 385,325 | 369,988 |
The summary implies that a much broader vocabulary is in use in the blog and news datasets than in the tweet dataset, after cleaning and tokenisation. This may be a misleading result: slang, abbreviations and other informal language are far more likely in the character-limited tweet format than in the other two document types, so more of the tweet dataset's unique words will have been filtered out as misspelled or non-English.
With the bigrams, trigrams and quadgrams generated, the raw data required to build a model were ready. The next step was to transform them into an appropriate format.
First, the individual lists of n-grams were converted into term frequency data frames using the dfm() and textstat_frequency() functions from the quanteda package. These term frequency lists were then combined on an n-gram-by-n-gram basis, i.e. a single dataset for bigrams, a single dataset for trigrams and a single dataset for quadgrams. This combination process used the sum() aggregation function to avoid duplication without losing information on total frequency.
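For the bigram case, the combination could look like the sketch below, where `bigrams_tweets`, `bigrams_news` and `bigrams_blogs` are hypothetical names for the per-corpus tokens_ngrams() objects.

```r
library(quanteda)
library(quanteda.textstats)

freq_tweets <- textstat_frequency(dfm(bigrams_tweets))
freq_news   <- textstat_frequency(dfm(bigrams_news))
freq_blogs  <- textstat_frequency(dfm(bigrams_blogs))

# Stack the per-corpus tables and sum the frequencies of duplicate bigrams
combined    <- rbind(freq_tweets, freq_news, freq_blogs)
bigram_freq <- aggregate(frequency ~ feature, data = combined, FUN = sum)
```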
Next, the n-grams were split into their constituent tokens, which were stored as separate features so that models could access the nth term in each one.
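Continuing the bigram example, the split relies on the underscore separator produced by tokens_ngrams().

```r
parts <- strsplit(bigram_freq$feature, "_", fixed = TRUE)
bigram_freq$word1 <- vapply(parts, `[`, character(1), 1)   # first token
bigram_freq$word2 <- vapply(parts, `[`, character(1), 2)   # token to be predicted
```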
Finally, another round of data cleaning removed any records containing invalid tokens in the separated strings (numerical and punctuation-based tokens which were not caught during the tokenisation process). Following these transformations, the term frequency data frames act as lookup tables for seen n-grams and their relative frequencies. This provides a measure of the conditional probability of term n given terms 1 to n-1, and allows predictions of term n to be made.
The philosophy behind the final model is that returning no suggestion is preferable to returning a poor one: by returning a statement that there is no suggestion, the model is prevented from making arbitrarily bad predictions in pursuit of returning something.
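As an illustration of the lookup-table approach and the no-suggestion behaviour, a minimal sketch for the bigram case is given below; `predict_next()` is a hypothetical helper, not the final model.

```r
predict_next <- function(word) {
  matches <- bigram_freq[bigram_freq$word1 == word, ]
  if (nrow(matches) == 0) {
    return("No suggestion")   # refuse to guess for unseen input
  }
  matches$word2[which.max(matches$frequency)]   # most frequent observed continuation
}

predict_next("happi")   # e.g. a stemmed input word
```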