Milestone

Context

The objective of this analysis is to clean and prepare text data extracted from three sources:
- News articles
- Blog posts
- Twitter posts

These datasets are part of the SwiftKey Dataset and contain raw, unstructured text. Before proceeding to model building or other analyses, it is essential to clean this data to ensure quality and consistency.

Cleaning Process

The cleaning steps applied to the datasets include:
1. Removing empty lines: Lines with no content were discarded.
2. Removing duplicate lines: Lines were made lowercase, and duplicates were removed.
3. Removing special characters: Non-alphanumeric characters, except for spaces, were removed.
4. Removing short lines: Lines with fewer than 20 characters were removed.
5. Removing vulgar words: Lines containing words from a predefined list of profanities were removed.
6. Removing stopwords (optional): Frequently used words that carry little meaning (e.g., “and”, “the”) were removed, if specified.
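The steps above can be sketched in Python as follows. This is a minimal illustration and not the code used to produce this report: the function name `clean_lines`, its parameters, and the collapsing of repeated spaces after step 3 are assumptions made for the example.

```python
import re

def clean_lines(lines, profanities, stopwords=None, min_chars=20):
    """Apply the six cleaning steps in order and return the surviving lines.

    `profanities` and `stopwords` are sets of lowercase words
    (hypothetical inputs supplied by the caller).
    """
    seen = set()
    cleaned = []
    for line in lines:
        line = line.strip()
        if not line:                              # 1. drop empty lines
            continue
        line = line.lower()                       # 2. lowercase, then drop duplicates
        if line in seen:
            continue
        seen.add(line)
        line = re.sub(r"[^a-z0-9 ]", "", line)    # 3. keep alphanumerics and spaces
        line = re.sub(r" +", " ", line).strip()   #    (assumption: collapse extra spaces)
        if len(line) < min_chars:                 # 4. drop short lines
            continue
        words = line.split()
        if any(w in profanities for w in words):  # 5. drop lines containing profanity
            continue
        if stopwords:                             # 6. optional stopword removal
            words = [w for w in words if w not in stopwords]
        cleaned.append(" ".join(words))
    return cleaned
```

Note that the order of the steps matters: deduplicating before removing special characters, as listed, treats two lines that differ only in punctuation as distinct.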

Original dataset

Below is a table summarizing the number of lines, the number of words, the file size, and the character count of each dataset before cleaning.

General characteristics of the datasets
Dataset	Lines_Before	Words_Before	File_Size_KB	Characters
en_US.news.txt	1010242	34372530	200988.2	203223159
en_US.blogs.txt	899288	37334131	205234.4	206824505
en_US.twitter.txt	2360148	30373543	163188.8	162096031

Tokenization and Analysis

After cleaning the datasets, we proceed to tokenization. Given the large size of the combined datasets, we opted to randomly sample 10% of the cleaned lines for this analysis. Tokenization involves splitting the sampled text into smaller units such as words, bigrams (two-word phrases), and trigrams (three-word phrases). Sampling keeps the computation tractable, and a random sample of this size should remain broadly representative of the full text. Below, we analyze the most frequent tokens across all three datasets combined.
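As an illustration of the sampling and tokenization described above, here is a minimal Python sketch. The function names, the fixed seed, and whitespace splitting as the word tokenizer are assumptions made for the example; the report's own tooling may differ.

```python
import random
from collections import Counter

def sample_lines(lines, fraction=0.10, seed=42):
    """Draw a random sample of the lines without replacement."""
    rng = random.Random(seed)          # fixed seed so the sample is reproducible
    k = int(len(lines) * fraction)
    return rng.sample(lines, k)

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(lines, n):
    """Count n-grams line by line, so no n-gram spans two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts
```

With these helpers, `ngram_counts(sample, 1).most_common(20)` would list the 20 most frequent words, and passing `n=2` or `n=3` gives the corresponding bigram and trigram rankings.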

Frequency Coverage Analysis

This section examines how many unique words are required to cover 50% and 90% of all word instances in the combined dataset. A cumulative frequency plot is also included for visualization.
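The coverage figure can be computed by sorting words from most to least frequent and accumulating their counts until the target share of all word instances is reached. The sketch below shows one way to do this; the function name `words_needed_for_coverage` is hypothetical.

```python
from collections import Counter

def words_needed_for_coverage(counts, target):
    """Return how many of the most frequent words are needed so that their
    combined frequency reaches `target` (e.g. 0.5 or 0.9) of all instances."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= target * total:
            return rank
    return len(counts)
```

Because word frequencies are heavily skewed, the number of words needed for 90% coverage is typically many times larger than the number needed for 50%.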

Foreign Words Analysis

To evaluate foreign words in the text, we used a spell-checking approach. Words not recognized by the English dictionary are flagged as potential foreign words. Below is a pie chart showing the proportion of foreign words in the combined dataset. However, most of the flagged words are spelling errors rather than genuinely foreign words; this will be taken into consideration with a spell check afterwards.
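One possible way to separate spelling errors from genuinely foreign words is to check whether an unrecognized word is close to some dictionary word. The sketch below is an assumption-laden illustration, not the method used in this report: `dictionary` stands for any set of known English words, the function name and the 0.8 similarity cutoff are arbitrary choices, and the fuzzy matching comes from Python's standard `difflib` module.

```python
from collections import Counter
from difflib import get_close_matches

def classify_unrecognized(counts, dictionary, cutoff=0.8):
    """Split word instances into recognized words, likely misspellings
    (close to a dictionary word), and possibly foreign words (no close match)."""
    vocab = sorted(dictionary)
    result = {"recognized": 0, "likely_misspelling": 0, "possibly_foreign": 0}
    for word, freq in counts.items():
        if word in dictionary:
            result["recognized"] += freq
        elif get_close_matches(word, vocab, n=1, cutoff=cutoff):
            result["likely_misspelling"] += freq
        else:
            result["possibly_foreign"] += freq
    return result
```

Comparing every unrecognized word against a full dictionary in this way is slow, so in practice it would be applied only to the most frequent unrecognized words or replaced by a dedicated spell checker.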

Increasing Coverage (Theoretical Discussion)

To increase coverage, several strategies can be employed:
- Identifying missing words: Incorporate external dictionaries, including domain-specific terms or slang dictionaries, to identify and include words not present in the current corpus.
- Stemming or Lemmatization: Reduce words to their root forms to group similar words together (e.g., “running” and “runs” become “run”).
- Subword Models: Use subword tokenization to break words into smaller units, such as character n-grams, to handle unseen words or rare terms more effectively.
- Phrase Modeling: Focus on frequent phrases instead of individual words, which can help to convey more context while reducing vocabulary size.
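To make the subword idea concrete, the sketch below splits a word into character trigrams. The function names and the `<` and `>` word-boundary markers are assumptions made for the example. A word that never appeared in the corpus still shares some of its character n-grams with known words, which is what lets a subword model say something about it.

```python
def char_ngrams(word, n=3):
    """Return the character n-grams of a word, with boundary markers added
    so that prefixes and suffixes are distinguishable from inner fragments."""
    padded = f"<{word}>"
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def shared_ngrams(a, b, n=3):
    """Return the character n-grams that two words have in common."""
    return set(char_ngrams(a, n)) & set(char_ngrams(b, n))
```

For example, `char_ngrams("run")` returns `['<ru', 'run', 'un>']`, and `shared_ngrams("run", "running")` contains `'<ru'` and `'run'`.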