The purpose of this document is to outline the exploratory data analysis performed on three corpora, in service of planning a simple model for a predictive text algorithm. The algorithm will predict the next word given a unigram, bigram or trigram input (i.e. it will output the final token of a bigram, trigram or quadgram, respectively).
There were three corpora to use as training data for the model: a collection of tweets, a collection of news articles, and a collection of blog posts.
These files were downloaded from the course website and read in from disk for the purpose of this exercise.
As a first step, each file was loaded in its entirety to summarise the data held within, pulling out the file size, the number of lines, and the total number of words.
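A minimal sketch of this summary step in R is shown below; the file path is hypothetical and the exact loading code used in the analysis may differ.

```r
# Hypothetical path to one corpus file; the real file names are not shown here.
path  <- "tweets.txt"
lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)

size_kb <- file.size(path) / 1024                  # corpus size in KB
n_lines <- length(lines)                           # number of lines
n_words <- sum(lengths(strsplit(lines, "\\s+")))   # total word count
summary(nchar(lines))                              # per-line length statistics
```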
| Corpus | Size (KB) | Number of Lines | Number of Words |
|---|---|---|---|
| Tweets | 326,645.2 | 2,360,148 | 30,093,372 |
| Statistic | Line Length (characters) | Words per Line |
|---|---|---|
| Min | 2.00 | 1.00 |
| 1st Qu. | 37.00 | 7.00 |
| Median | 64.00 | 12.00 |
| Mean | 68.68 | 12.75 |
| 3rd Qu. | 100.00 | 18.00 |
| Max | 140.00 | 47.00 |
| Corpus | Size (KB) | Number of Lines | Number of Words |
|---|---|---|---|
| News | 5,235.1 | 20,000 | 691,184 |
| Statistic | Line Length (characters) | Words per Line |
|---|---|---|
| Min | 2.0 | 1.00 |
| 1st Qu. | 109.0 | 19.00 |
| Median | 185.0 | 32.00 |
| Mean | 202.1 | 34.56 |
| 3rd Qu. | 269.0 | 46.00 |
| Max | 2,900.0 | 532.00 |
| Corpus | Size (KB) | Number of Lines | Number of Words |
|---|---|---|---|
| Blogs | 5,802.8 | 20,000 | 833,312 |
| Statistic | Line Length (characters) | Words per Line |
|---|---|---|
| Min | 2.0 | 1.00 |
| 1st Qu. | 47.0 | 9.00 |
| Median | 156.0 | 28.00 |
| Mean | 229.4 | 41.67 |
| 3rd Qu. | 330.0 | 60.00 |
| Max | 3,644.0 | 654.00 |
With 2,360,148 tweets, 20,000 news articles and 20,000 blogs, there are 2,400,148 documents in total to potentially hold in memory and analyse. This is clearly too much for most mobile phones and would lead to an unacceptably long runtime for the model. It was therefore necessary to subset the data in subsequent exploration. Additionally, large objects with no further use were removed from memory as the code executed, to free up RAM.
Exploration and calculations were initially performed on the tweet dataset, and then replicated for the news and blog datasets.
A subset of the data was loaded to avoid excessive RAM usage and runtime. Entries were sampled by looping over the lines in the dataset and using rbinom to decide whether or not to include each line. This randomisation of line selection helped to mitigate bias in the sampling process.
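A minimal sketch of this sampling step is shown below, assuming `lines` is the character vector loaded earlier; the 1% inclusion probability and the seed are illustrative assumptions rather than the values actually used.

```r
set.seed(1234)   # illustrative seed for reproducible sampling

# rbinom() decides inclusion for every line at once (a vectorised
# equivalent of looping over the lines); prob = 0.01 is an assumed rate.
keep <- rbinom(length(lines), size = 1, prob = 0.01) == 1
sample_lines <- lines[keep]
```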
With the sampled data held as both a character vector and a corpus, a spell-checking function implemented via the hunspell package was used to check for non-English or misspelled words.
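The check could look something like the sketch below, assuming `sample_lines` is the sampled character vector from the previous step.

```r
library(hunspell)

bad_words <- hunspell(sample_lines)      # unrecognised words, per line
bad_words <- unique(unlist(bad_words))   # flatten to a unique character vector
length(bad_words)                        # count of misspelled or non-English words
```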
A total of 7,901 misspelled or non-English words were detected in the tweet dataset. They were added to a list of additional stopwords, together with a profanity list (http://www.bannedwordlist.com/lists/swearWords.txt), to be filtered out so that they would not corrupt the model. A potential extension would be to apply the spell checker's suggested corrections to the misspelled words, thereby retrieving some information at the cost of potentially confounding the dataset with mistaken suggestions. This was not implemented because of its impact on runtime and the additional complexity of validating the suggested corrections.
Next, the data were tokenised. The list of additional stopwords was passed for removal from the list of tokens, so that modelling did not take them into account. All further processing was performed on the tokenised list, so the corpus was removed from memory.
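A sketch of the tokenisation is given below, assuming `sample_lines` and a combined `extra_stopwords` vector (the misspelled words plus the profanity list); the punctuation and number removal options are assumptions about the cleaning applied.

```r
library(quanteda)

corp <- corpus(sample_lines)
toks <- tokens(corp, remove_punct = TRUE, remove_numbers = TRUE)
toks <- tokens_remove(toks, extra_stopwords)   # drop the additional stopwords
rm(corp)   # the corpus object is no longer needed once tokenised
```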
Stemming reduces words to their most basic root (e.g. running -> run), thereby reducing the number of unique words required in the dataset and model. In a real-world application, this would necessitate an additional layer of checking to suggest a valid word to the user in the correct tense or form (e.g. run -> ran/run/running, or simpl -> simple/simplify/simplest); this was deemed out of the scope of this exercise. Another advantage of stemming is that it increases the coverage of real-world n-grams, including unseen n-grams, with fewer words in the model, since each stem can serve multiple surface forms.
To perform stemming and frequency counting, the tokens were stemmed and converted to a document feature matrix, which was passed to the textstat_frequency function; this returns a data frame of each unique word and the frequency with which it appears in the corpus. As a measure of coverage, the cumulative sum of word frequency was added to the data frame, expressed as both a raw count and a percentage of all word instances.
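A sketch of this step is shown below, assuming the `toks` object from the tokenisation above; tokens_wordstem() is used here as one way to apply the stemming, and in recent quanteda releases textstat_frequency() lives in the companion quanteda.textstats package.

```r
library(quanteda)
library(quanteda.textstats)

toks_stem <- tokens_wordstem(toks)             # reduce each token to its stem
freqs <- textstat_frequency(dfm(toks_stem))    # one row per unique stem, sorted by frequency
freqs$cum_freq <- cumsum(freqs$frequency)
freqs$cum_pct  <- 100 * freqs$cum_freq / sum(freqs$frequency)

nrow(freqs)                        # unique words after tokenisation and stemming
min(which(freqs$cum_pct >= 50))    # words required for 50% coverage
min(which(freqs$cum_pct >= 90))    # words required for 90% coverage
```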
A total of 17,697 unique words were identified after tokenisation. To achieve 50% coverage of all word instances, only 363 were required, and to achieve 90% coverage only 5,408 were required.
With frequencies calculated, it was possible to visualise the distribution of word frequencies. A scatter plot shows the frequency of the most common 500 words, and a bar plot shows the 20 most common words with their frequencies.
[Figure: frequency of the 500 most common words in the tweet sample (scatter plot)]
[Figure: the 20 most common words in the tweet sample with their frequencies (bar plot)]
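The plots could be reproduced along the following lines, assuming the `freqs` data frame from the coverage calculation above.

```r
library(ggplot2)

ggplot(head(freqs, 500), aes(x = rank, y = frequency)) +
  geom_point() +
  labs(title = "Frequency of the 500 most common words")

ggplot(head(freqs, 20), aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency", title = "The 20 most common words")
```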
The top 500 words show a shape reminiscent of a power law, with frequency dropping rapidly after the top handful of words before settling into a slowly decaying tail.
Finally, the bigrams, trigrams and quadgrams were generated from the tokenised and stemmed dataset. These will form the basis of the model's recommendations.
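A sketch of the n-gram generation, assuming the stemmed tokens object `toks_stem` from above; tokens_ngrams() joins adjacent tokens with an underscore by default.

```r
library(quanteda)

bigrams   <- tokens_ngrams(toks_stem, n = 2)
trigrams  <- tokens_ngrams(toks_stem, n = 3)
quadgrams <- tokens_ngrams(toks_stem, n = 4)
```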
A total of 107,020 bigrams, 103,861 trigrams and 86,950 quadgrams were found.
The same processes were then performed for the news and blogs datasets.
[Figure: frequency of the 500 most common words in the news sample (scatter plot)]
[Figure: the 20 most common words in the news sample with their frequencies (bar plot)]
A total of 28,343 unique words were identified after tokenisation. To achieve 50% coverage of all word instances, only 656 were required, and to achieve 90% coverage only 6,455 were required.
A total of 315,741 bigrams, 346,512 trigrams and 331,305 quadgrams were found.
[Figure: frequency of the 500 most common words in the blog sample (scatter plot)]
[Figure: the 20 most common words in the blog sample with their frequencies (bar plot)]
A total of 29,721 unique words were identified after tokenisation. To achieve 50% coverage of all word instances, only 565 were required, and to achieve 90% coverage only 6,184 were required.
A total of 347,827 bigrams, 385,325 trigrams and 369,988 quadgrams were found.
| Document Type | Unique Words | Words for 50% Coverage | Words for 90% Coverage | Bigrams | Trigrams | Quadgrams |
|---|---|---|---|---|---|---|
| Tweets | 17,697 | 363 | 5,408 | 107,020 | 103,861 | 86,950 |
| News | 28,343 | 656 | 6,455 | 315,741 | 346,512 | 331,305 |
| Blogs | 29,721 | 565 | 6,184 | 347,827 | 385,325 | 369,988 |
The summary implies that a much broader vocabulary is in use in the blog and news datasets than in the tweet dataset, after cleaning and tokenisation. This may be a misleading result: slang, abbreviations and other informal language are far more likely in the character-limited tweet format than in the other two document types, so more of the tweet dataset's unique words will have been filtered out as misspelled or non-English.
With the bigrams, trigrams and quadgrams generated, the raw data required to build a model were ready. The next step was to transform them into an appropriate format.
First, the individual lists of n-grams were converted into term frequency data frames using the dfm() and textstat_frequency() functions from the quanteda package. These term frequency lists were then combined on an n-gram-by-n-gram basis, i.e. a single dataset for bigrams, a single dataset for trigrams and a single dataset for quadgrams. This combination process used the sum() aggregation function to avoid duplication without losing information on total frequency.
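For the bigram case, the combination could look like the sketch below, where `bigrams_tweets`, `bigrams_news` and `bigrams_blogs` are hypothetical names for the per-corpus tokens_ngrams() objects.

```r
library(quanteda)
library(quanteda.textstats)

freq_tweets <- textstat_frequency(dfm(bigrams_tweets))
freq_news   <- textstat_frequency(dfm(bigrams_news))
freq_blogs  <- textstat_frequency(dfm(bigrams_blogs))

# Stack the per-corpus tables and sum the frequencies of duplicate bigrams
combined    <- rbind(freq_tweets, freq_news, freq_blogs)
bigram_freq <- aggregate(frequency ~ feature, data = combined, FUN = sum)
```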
Next, the n-grams were split into their constituent tokens, which were stored as separate features so that models could access the nth term in each one.
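Continuing the bigram example, the split relies on the underscore separator produced by tokens_ngrams().

```r
parts <- strsplit(bigram_freq$feature, "_", fixed = TRUE)
bigram_freq$word1 <- vapply(parts, `[`, character(1), 1)   # first token
bigram_freq$word2 <- vapply(parts, `[`, character(1), 2)   # token to be predicted
```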
Finally, another round of data cleaning removed any records containing invalid tokens in the separated strings (numerical and punctuation-based tokens which were not caught during the tokenisation process). Following these transformations, the term frequency data frames act as lookup tables for seen n-grams and their relative frequencies. This provides a measure of the conditional probability of term n given terms 1 to n-1, and allows predictions of term n to be made.
The philosophy behind the final model is that returning no suggestion is preferable to returning a poor one: by returning a statement that there is no suggestion, the model is prevented from making arbitrarily bad predictions in pursuit of returning something.
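As an illustration of the lookup-table approach and the no-suggestion behaviour, a minimal sketch for the bigram case is given below; `predict_next()` is a hypothetical helper, not the final model.

```r
predict_next <- function(word) {
  matches <- bigram_freq[bigram_freq$word1 == word, ]
  if (nrow(matches) == 0) {
    return("No suggestion")   # refuse to guess for unseen input
  }
  matches$word2[which.max(matches$frequency)]   # most frequent observed continuation
}

predict_next("happi")   # e.g. a stemmed input word
```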