The objective of this analysis is to clean and prepare text data extracted from three sources:

- News articles
- Blog posts
- Twitter posts

These datasets are part of the SwiftKey Dataset and contain raw, unstructured text. Before proceeding to model building or other analyses, it is essential to clean this data to ensure quality and consistency.
The cleaning steps applied to the datasets include:

1. Removing empty lines: Lines with no content were discarded.
2. Removing duplicate lines: Lines were made lowercase, and duplicates were removed.
3. Removing special characters: Non-alphanumeric characters, except for spaces, were removed.
4. Removing short lines: Lines with fewer than 20 characters were removed.
5. Removing vulgar words: Lines containing words from a predefined list of profanities were removed.
6. Removing stopwords (optional): Frequently used words that carry little meaning (e.g., “and”, “the”) were removed, if specified.

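
The steps above can be sketched in Python as a single pass over the lines. This is a minimal illustration, not the code used for this report: the function name `clean_lines`, the tiny `PROFANITY` and `STOPWORDS` sets, and the choice to keep only ASCII letters and digits are all assumptions made for the example.

```python
import re

# Hypothetical placeholder list; the report's actual profanity list is not shown.
PROFANITY = {"darn", "heck"}
# Small illustrative stopword set, not the one used in the report.
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def clean_lines(lines, min_chars=20, profanity=PROFANITY, stopwords=None):
    """Apply cleaning steps 1-6 in order and return the surviving lines."""
    seen = set()
    cleaned = []
    for line in lines:
        text = line.strip().lower()   # lowercase before de-duplication
        if not text:                  # step 1: drop empty lines
            continue
        if text in seen:              # step 2: drop duplicate lines
            continue
        seen.add(text)
        # Step 3: keep letters, digits and spaces only.
        # Assumption: ASCII only, so accented letters are dropped as well.
        text = re.sub(r"[^a-z0-9 ]", "", text)
        text = re.sub(r" +", " ", text).strip()
        if len(text) < min_chars:     # step 4: drop short lines
            continue
        words = text.split()
        if any(w in profanity for w in words):  # step 5: drop profane lines
            continue
        if stopwords:                 # step 6 (optional): remove stopwords
            words = [w for w in words if w not in stopwords]
        cleaned.append(" ".join(words))
    return cleaned
```

Called as `clean_lines(raw_lines)`, stopwords are kept; passing `stopwords=STOPWORDS` enables the optional sixth step.
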
Below is a table summarizing the number of lines, words, and characters in each dataset before cleaning, together with the file size.
| Dataset | Lines (before cleaning) | Words (before cleaning) | File size (KB) | Characters |
|---|---|---|---|---|
| en_US.news.txt | 1010242 | 34372530 | 200988.2 | 203223159 |
| en_US.blogs.txt | 899288 | 37334131 | 205234.4 | 206824505 |
| en_US.twitter.txt | 2360148 | 30373543 | 163188.8 | 162096031 |
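
Summary statistics of this kind can be gathered in one streaming pass, which avoids loading a 200 MB file into memory. The sketch below is an assumption about how such counts could be produced, not the report's actual method: `summarize_lines` is a hypothetical helper, words are split on whitespace, and the size is taken as the UTF-8 byte count divided by 1024.

```python
def summarize_lines(lines):
    """Count lines, whitespace-separated words, characters and size in KB."""
    n_lines = n_words = n_chars = n_bytes = 0
    for line in lines:
        n_bytes += len(line.encode("utf-8"))  # bytes, newline included
        text = line.rstrip("\n")
        n_lines += 1
        n_words += len(text.split())
        n_chars += len(text)                  # characters, newline excluded
    return {
        "lines": n_lines,
        "words": n_words,
        "size_kb": round(n_bytes / 1024, 1),
        "characters": n_chars,
    }
```

Because a file object iterates line by line, it can be passed in directly, for example `with open("en_US.news.txt", encoding="utf-8") as f: stats = summarize_lines(f)`.
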
After cleaning the datasets, we proceed to tokenization. Given the large size of the combined datasets, we randomly sampled 10% of the cleaned lines for this analysis. Tokenization splits the sampled text into smaller units: words, bigrams (two-word phrases), and trigrams (three-word phrases). Random sampling keeps the computation manageable, and a uniform 10% sample of a corpus this large should remain broadly representative of the full text. Below, we analyze the most frequent tokens across all three datasets combined.
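
As a rough illustration of the sampling and tokenization described above, the following sketch draws a 10% random sample and counts n-grams with the standard library. The helper names, the fixed `seed`, and whitespace tokenization are assumptions for the example; n-grams are built within each line and never span line boundaries.

```python
import random
from collections import Counter


def sample_lines(lines, fraction=0.10, seed=42):
    """Return a random sample containing roughly `fraction` of the lines."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    k = min(len(lines), max(1, round(len(lines) * fraction)))
    return rng.sample(lines, k)


def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(lines, n, k=10):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts.most_common(k)
```

For instance, `top_ngrams(sample_lines(cleaned), 2)` would list the ten most frequent bigrams in the sample, where `cleaned` stands for the list of cleaned lines.
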
This section examines how many unique words are required to cover 50% and 90% of all word instances in the combined dataset. A cumulative frequency plot is also included for visualization.
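
The coverage figure can be computed by sorting words by frequency and accumulating counts until the target share of all word instances is reached. The sketch below assumes the word counts are already available in a `Counter`; `words_for_coverage` is a hypothetical name chosen for this example.

```python
from collections import Counter


def words_for_coverage(counts, target):
    """Return how many of the most frequent words cover `target` of all instances."""
    total = sum(counts.values())
    running = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target * total:
            return rank
    return len(counts)
```

With toy counts such as `Counter({"the": 50, "cat": 30, "sat": 15, "mat": 5})`, one word covers 50% of instances and three words are needed for 90%.
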
To estimate the share of foreign words in the text, we used a spell-checking approach: words not recognized by an English dictionary are flagged as potential foreign words. Below is a pie chart showing the proportion of flagged words in the combined dataset. Most of the flagged words, however, are spelling errors rather than genuinely foreign words; this will be taken into consideration with a spell check afterwards.
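
The dictionary lookup behind this estimate can be illustrated as follows. This is a sketch under stated assumptions, not the spell checker used for the report: `flag_unknown_words` is a hypothetical helper, and the dictionary is any collection of known English words supplied by the caller.

```python
def flag_unknown_words(tokens, dictionary):
    """Return the tokens missing from the dictionary and their share of all tokens."""
    known = {word.lower() for word in dictionary}
    unknown = [t for t in tokens if t.lower() not in known]
    share = len(unknown) / len(tokens) if tokens else 0.0
    return unknown, share
```

With the tokens `["the", "cat", "bonjour", "teh"]` and a dictionary containing only "the" and "cat", both "bonjour" and the misspelling "teh" are flagged, which is why a plain lookup overestimates the foreign-word share.
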
To increase coverage, several strategies can be employed:

- Identifying missing words: Incorporate external dictionaries, including domain-specific terms or slang dictionaries, to identify and include words not present in the current corpus.
- Stemming or Lemmatization: Reduce words to their root forms to group similar words together (e.g., “running” and “runs” become “run”).
- Subword Models: Use subword tokenization to break words into smaller units, such as character n-grams, to handle unseen words or rare terms more effectively.
- Phrase Modeling: Focus on frequent phrases instead of individual words, which can help to convey more context while reducing vocabulary size.

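
To make the subword idea concrete, the sketch below splits a word into character n-grams. The `<` and `>` boundary markers and the default of `n=3` are assumptions chosen for illustration; they are not taken from this analysis.

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams with boundary markers."""
    padded = f"<{word}>"   # mark the start and end of the word
    if len(padded) <= n:   # word too short to split: return it whole
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]
```

Here "running" and "runner" share the trigrams `<ru`, `run`, and `unn`, so an unseen word can still be related to known words through the pieces they have in common.
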
The summary table reports the size of each dataset before cleaning. The cleaning process reduces the size of the datasets by removing unnecessary or undesired content. Tokenization enables the analysis of the most frequently used words, bigrams, and trigrams across all datasets, while the foreign word analysis highlights words not recognized as English, many of which turn out to be spelling errors. Frequency coverage analysis shows the number of unique words required to cover 50% and 90% of all word instances, providing insight into the vocabulary distribution. These steps lay a solid foundation for subsequent analyses such as topic modeling or sentiment analysis.