This report presents an exploratory data analysis of a text corpus consisting of blog posts, news articles, and Twitter messages. The goal is to understand the structure and patterns in the data as a foundation for building a predictive text model (similar to SwiftKey’s keyboard). Key findings are summarized in the sections that follow.
Mobile typing is challenging, and predictive text models help users type faster and more accurately. This project aims to build a predictive text model using natural language processing (NLP) techniques, specifically n-gram modeling.
The dataset consists of text from three sources:

- `en_US.blogs.txt`: blog posts
- `en_US.news.txt`: news articles
- `en_US.twitter.txt`: Twitter messages
These sources represent different writing styles and contexts, providing a diverse corpus for training the predictive model.
Due to the large size of the original files, I created a representative sample using random sampling:
# Example sampling code: keep roughly `rate` of each file's lines at random
sample_file <- function(path, rate) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = rate) == 1]
}

set.seed(123)        # for reproducibility
sample_rate <- 0.05  # 5% sample

# Sample from each source file
blogs_sample   <- sample_file("en_US.blogs.txt", sample_rate)
news_sample    <- sample_file("en_US.news.txt", sample_rate)
twitter_sample <- sample_file("en_US.twitter.txt", sample_rate)
Justification: A 5% random sample provides sufficient data for exploratory analysis while maintaining computational efficiency. Random sampling ensures the sample is representative of the full dataset.
Several text preprocessing steps were applied to clean and normalize the sampled text before analysis.
Important: Stopwords (common words like “the”, “is”, “at”) were NOT removed because they are essential for natural language prediction.
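To make the cleaning step concrete, the sketch below shows a typical base-R pipeline. The specific steps (lowercasing; removal of URLs, numbers, and punctuation; whitespace normalization) are assumptions and may differ from the exact steps used here, and `clean_text` and `corpus_sample` are illustrative names. Stopwords are deliberately left in place.

# A minimal cleaning sketch (assumed steps, not necessarily the exact pipeline):
# lowercase, drop URLs, strip numbers and punctuation, collapse whitespace.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("https?://\\S+|www\\.\\S+", " ", x)  # remove URLs
  x <- gsub("[0-9]+", " ", x)                    # remove numbers
  x <- gsub("[^a-z' ]", " ", x)                  # keep letters and apostrophes
  x <- gsub("\\s+", " ", x)                      # collapse repeated whitespace
  trimws(x)
}

corpus_sample <- clean_text(c(blogs_sample, news_sample, twitter_sample))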
| Source  | Lines  | Words     |
|---------|--------|-----------|
| Blogs   | XX,XXX | XXX,XXX   |
| News    | XX,XXX | XXX,XXX   |
| Twitter | XX,XXX | XXX,XXX   |
| Total   | XX,XXX | X,XXX,XXX |
Unigrams are individual words. Analyzing their frequency helps us understand which words are most common in the corpus.
Top 20 Most Frequent Unigrams
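One way to obtain these counts, sketched below under the assumption that the cleaned text lives in the character vector `corpus_sample` from the cleaning sketch, is to split on whitespace and tabulate with base R:

# Tabulate unigram frequencies from the cleaned sample
words <- unlist(strsplit(corpus_sample, "\\s+"))
words <- words[words != ""]
unigram_freq <- sort(table(words), decreasing = TRUE)
head(unigram_freq, 20)  # the 20 most frequent words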
Bigrams capture common two-word sequences, revealing patterns in how words are combined.
Top 20 Most Frequent Bigrams
Trigrams capture longer phrases and more specific contexts.
Top 20 Most Frequent Trigrams
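Bigram and trigram counts can be produced the same way. The sketch below (again assuming the cleaned `corpus_sample` vector) builds n-grams line by line so that word sequences never span two documents:

# Count n-grams of order n from a character vector of cleaned lines
count_ngrams <- function(lines, n) {
  ngrams <- unlist(lapply(strsplit(lines, "\\s+"), function(tokens) {
    tokens <- tokens[tokens != ""]
    if (length(tokens) < n) return(character(0))
    sapply(seq_len(length(tokens) - n + 1),
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)
}

bigram_freq  <- count_ngrams(corpus_sample, 2)
trigram_freq <- count_ngrams(corpus_sample, 3)
head(bigram_freq, 20)   # top 20 bigrams
head(trigram_freq, 20)  # top 20 trigrams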
A word cloud provides a visual summary of the most frequent terms, with size indicating frequency.
Word Cloud of Most Frequent Terms
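The cloud can be drawn with the `wordcloud` package; a sketch, assuming the `unigram_freq` table from the unigram step:

library(wordcloud)

# Word cloud of the most frequent unigrams, sized by frequency
top_words <- head(unigram_freq, 200)
wordcloud(words = names(top_words), freq = as.numeric(top_words),
          random.order = FALSE,
          colors = RColorBrewer::brewer.pal(8, "Dark2"))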
A key question for model building: How many unique words do we need to cover most text?
This is important for:

- Memory efficiency
- Model size
- Prediction accuracy vs. computational cost
Cumulative Word Coverage Curve
| Coverage | Words Needed | % of Vocabulary |
|----------|--------------|-----------------|
| 50%      | [INSERT]     | [INSERT]%       |
| 90%      | [INSERT]     | [INSERT]%       |
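These coverage figures follow directly from the sorted unigram frequencies; a sketch, assuming `unigram_freq` from the earlier step:

# Number of top-ranked words needed to cover a given share of all word
# instances in the sample (unigram_freq is sorted in decreasing order)
words_for_coverage <- function(freq_table, coverage) {
  cum_share <- cumsum(as.numeric(freq_table)) / sum(freq_table)
  which(cum_share >= coverage)[1]
}

n50 <- words_for_coverage(unigram_freq, 0.50)
n90 <- words_for_coverage(unigram_freq, 0.90)
c(words_50 = n50, pct_vocab_50 = 100 * n50 / length(unigram_freq),
  words_90 = n90, pct_vocab_90 = 100 * n90 / length(unigram_freq))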
Based on the exploratory analysis, several challenges were identified that inform the modeling approach described below.
The predictive text model will use an n-gram language model with the following components:
Consider implementing:

- Katz Backoff: discount probability mass from seen n-grams and redistribute it to unseen ones
- Stupid Backoff: a simplified approach suitable for large datasets (sketched below)
- Good-Turing Smoothing: adjust frequencies based on the frequency of frequencies
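To illustrate the backoff idea, the sketch below implements Stupid Backoff on top of the count tables built earlier. `unigram_freq`, `bigram_freq`, and `trigram_freq` are assumed to exist, `predict_next` is an illustrative helper name, and the fixed backoff penalty of 0.4 is the value commonly quoted for Stupid Backoff:

# Stupid Backoff sketch: score candidates with the highest-order n-gram
# table that matches the context, multiplying by 0.4 per level backed off.
predict_next <- function(context, k = 3, lambda = 0.4) {
  tokens <- unlist(strsplit(tolower(context), "\\s+"))
  tokens <- tokens[tokens != ""]

  # Relative frequency of each continuation of `prefix` in one n-gram table
  score_from <- function(freq_table, prefix, weight) {
    hits <- freq_table[startsWith(names(freq_table), paste0(prefix, " "))]
    if (length(hits) == 0) return(numeric(0))
    cand <- sub(".* ", "", names(hits))  # last word of each matching n-gram
    setNames(weight * as.numeric(hits) / sum(hits), cand)
  }

  scores <- numeric(0)
  if (length(tokens) >= 2) {                         # try trigrams first
    scores <- score_from(trigram_freq, paste(tail(tokens, 2), collapse = " "), 1)
  }
  if (length(scores) == 0 && length(tokens) >= 1) {  # back off to bigrams
    scores <- score_from(bigram_freq, tail(tokens, 1), lambda)
  }
  if (length(scores) == 0) {                         # back off to unigrams
    scores <- setNames(lambda^2 * as.numeric(head(unigram_freq, k)) / sum(unigram_freq),
                       names(head(unigram_freq, k)))
  }
  head(sort(scores, decreasing = TRUE), k)
}

predict_next("thanks for the")  # top-3 candidate next words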
To make the model efficient:

- Pruning: remove very low-frequency n-grams (see the sketch below)
- Hashing: use hash tables for fast lookup
- Compression: store only necessary information
- Top-K prediction: return only the top 3-5 predictions
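As a simple example of pruning (the cutoff of 2 is an assumed value to be tuned against accuracy), singleton n-grams can be dropped before the tables are stored:

# Drop n-grams observed fewer than min_count times to shrink the tables
prune <- function(freq_table, min_count = 2) {
  freq_table[freq_table >= min_count]
}

bigram_freq  <- prune(bigram_freq)
trigram_freq <- prune(trigram_freq)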
Model performance will be evaluated using:

- Perplexity: how well the model predicts held-out test data (illustrated below)
- Accuracy: percentage of correct top-1 and top-3 predictions
- Speed: response time for predictions
- Coverage: percentage of test queries the model can handle
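For reference, perplexity is the exponentiated average negative log-probability that the model assigns to a held-out sequence; a minimal sketch, given a vector of per-word model probabilities:

# Perplexity of a held-out sequence, given the model probability of each
# observed word: exp(-mean(log p))
perplexity <- function(word_probs) {
  exp(-mean(log(word_probs)))
}

perplexity(c(0.20, 0.05, 0.10))  # toy example: exactly 10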
The final application will:

1. Accept text input (multiple words)
2. Predict next word(s) using the trained model
3. Display top 3 predictions
4. Provide fast, real-time response
5. Handle edge cases gracefully
This exploratory analysis has provided valuable insights into the structure of the text corpus.
The next phase will focus on building an efficient n-gram model with appropriate smoothing and backoff strategies to create a functional predictive text application.
All code used in this analysis is available in the accompanying R scripts:
- `task1_data_sampling.R`: Data loading and sampling
- `task2_exploratory_analysis.R`: N-gram analysis and visualization

## R version 4.5.1 (2025-06-13 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26100)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_South Africa.utf8 LC_CTYPE=English_South Africa.utf8
## [3] LC_MONETARY=English_South Africa.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_South Africa.utf8
##
## time zone: Africa/Johannesburg
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.6.1 fastmap_1.2.0 xfun_0.52
## [5] cachem_1.1.0 knitr_1.50 htmltools_0.5.8.1 png_0.1-8
## [9] rmarkdown_2.30 lifecycle_1.0.4 cli_3.6.5 sass_0.4.10
## [13] jquerylib_0.1.4 compiler_4.5.1 rstudioapi_0.17.1 tools_4.5.1
## [17] evaluate_1.0.4 bslib_0.9.0 yaml_2.3.10 rlang_1.1.6
## [21] jsonlite_2.0.0
Note: This report was created as part of the Coursera Data Science Capstone project.