I’ve started by producing a quick summary of the three data files.
# Define file paths
blogs_path <- "/home/rstudio/en_US.blogs.txt"
news_path <- "/home/rstudio/en_US.news.txt"
twitter_path <- "/home/rstudio/en_US.twitter.txt"
# Read files
blogs <- readLines(blogs_path, warn = FALSE, encoding = "UTF-8")
news <- readLines(news_path, warn = FALSE, encoding = "UTF-8")
twitter <- readLines(twitter_path, warn = FALSE, encoding = "UTF-8")
# Function to calculate word count per dataset
word_count <- function(text_data) {
sum(sapply(strsplit(text_data, "\\s+"), length))
}
# Compute summary statistics
data_summary <- data.frame(
Dataset = c("Blogs", "News", "Twitter"),
Lines = c(length(blogs), length(news), length(twitter)),
Words = c(word_count(blogs), word_count(news), word_count(twitter)),
Max_Line_Length = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))) # Longest line in each dataset
)
# Print summary
print(data_summary)
## Dataset Lines Words Max_Line_Length
## 1 Blogs 899288 37334131 40833
## 2 News 1010242 34372530 11384
## 3 Twitter 2360148 30373543 140
To prepare text data for an n-gram model, we need to remove:
Stopwords: Common words like “the,” “is,” and “and” that don’t add meaning.
Punctuation: To ensure clean tokenization.
Numbers: Unless they contribute meaning.
Whitespace: Extra spaces or line breaks.
Special Characters & Non-ASCII Text: Emojis, symbols, and foreign characters.
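The last two items, stopwords and non-ASCII text, are not handled by the stringi cleaning chunk further below, so here is a quick illustrative walk-through of each removal step using base R and tm (the example sentence is made up):
library(tm)
# Illustrative walk-through of the removal steps listed above
x <- "Great game!!! 100% worth it \u263A and the weather is nice"
x <- iconv(x, from = "UTF-8", to = "ASCII", sub = " ") # drop emojis / non-ASCII characters
x <- tolower(x)
x <- removePunctuation(x) # punctuation
x <- removeNumbers(x) # numbers
x <- removeWords(x, stopwords("en")) # stopwords
x <- gsub("\\s+", " ", trimws(x)) # extra whitespace
x
## roughly: "great game worth weather nice"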
Working with large text files is computationally expensive. Key strategies:
Sampling: Only use a subset of data (readLines(n = 50000)).
Parallel Processing: Use parallel::mclapply() to speed up n-gram extraction.
Sparse Matrices: Use slam or Matrix package instead of dense matrices.
Efficient Data Storage: Save intermediate results as .rds files.
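A rough sketch of how these strategies combine (the chunk count, core count, and .rds filename are illustrative choices, not requirements):
library(parallel)
library(tokenizers)
# Sampling: read only the first 50,000 lines instead of the full file
sample_lines <- readLines("/home/rstudio/en_US.twitter.txt", n = 50000, warn = FALSE)
# Parallel processing: tokenize bigrams chunk-by-chunk on multiple cores
# (mclapply forks, so the speed-up applies on Linux/macOS, not Windows)
chunks <- split(sample_lines, cut(seq_along(sample_lines), 4, labels = FALSE))
bigram_chunks <- mclapply(chunks, tokenize_ngrams, n = 2, mc.cores = 2)
# Efficient storage: cache the intermediate result and reload it later
saveRDS(bigram_chunks, "bigram_chunks.rds")
# bigram_chunks <- readRDS("bigram_chunks.rds")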
library(ggplot2)
library(data.table)
library(stringi)
library(tokenizers)
# Read the first 50,000 lines to keep memory use manageable
set.seed(123) # seed for the random sampling used later
file_path <- "/home/rstudio/en_US.twitter.txt"
lines <- readLines(file_path, warn = FALSE, n = 50000, encoding = "UTF-8")
# Clean text efficiently: lower-case, then keep only letters and spaces
cleaned <- tolower(lines)
cleaned <- stri_replace_all_regex(cleaned, "[^a-z\\s]", " ") # drop punctuation, digits, symbols
cleaned <- stri_replace_all_regex(cleaned, "\\s+", " ") # collapse extra spaces
cleaned <- trimws(cleaned)
Histograms
Using the ‘twitter’ text file, I have created a histogram to examine the frequency of common words.
# Load necessary libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
library(Matrix) # For sparse matrices
# Read dataset (Example: Twitter)
file_path <- "/home/rstudio/en_US.twitter.txt"
twitter <- readLines(file_path, warn = FALSE)
# Sample a smaller subset (e.g., 10,000 tweets) to reduce memory consumption
twitter_sample <- sample(twitter, 10000)
# Convert to Corpus
twitter_corpus <- Corpus(VectorSource(twitter_sample))
# Clean text
twitter_corpus <- tm_map(twitter_corpus, content_transformer(tolower)) # Convert to lowercase
## Warning in tm_map.SimpleCorpus(twitter_corpus, content_transformer(tolower)):
## transformation drops documents
twitter_corpus <- tm_map(twitter_corpus, removePunctuation) # Remove punctuation
## Warning in tm_map.SimpleCorpus(twitter_corpus, removePunctuation):
## transformation drops documents
twitter_corpus <- tm_map(twitter_corpus, removeNumbers) # Remove numbers
## Warning in tm_map.SimpleCorpus(twitter_corpus, removeNumbers): transformation
## drops documents
twitter_corpus <- tm_map(twitter_corpus, removeWords, stopwords("en")) # Remove stopwords
## Warning in tm_map.SimpleCorpus(twitter_corpus, removeWords, stopwords("en")):
## transformation drops documents
# Drop zero-length documents; note that documents reduced to only whitespace
# still pass this check, which is what triggers the "empty document(s)" warning below
twitter_corpus <- twitter_corpus[sapply(twitter_corpus, function(x) nchar(x) > 0)]
# Build a Term-Document Matrix (stored internally as a sparse simple_triplet_matrix);
# the tf-idf weighting is ignored for a SimpleCorpus, as the warning below notes
dtm <- TermDocumentMatrix(twitter_corpus, control = list(weighting = weightTfIdf))
## Warning in TermDocumentMatrix.SimpleCorpus(twitter_corpus, control =
## list(weighting = weightTfIdf)): custom functions are ignored
## Warning in weighting(x): empty document(s): 397 457 853 1220 1509 2001 2051 2150
## 2442 2737 2767 2841 3111 3834 4340 4501 4525 4882 4939 4953 5011 5117 5412 5834
## 5911 5939 7383 7694 7791 7926 8232 8262 8554 8624 8917 9402 9687 9691 9848 9942
# Convert to a dense matrix to sum term frequencies
# (for larger samples, slam::row_sums(dtm) would avoid densifying)
m <- as.matrix(dtm)
word_freq <- sort(rowSums(m), decreasing = TRUE)
# Convert to Data Frame for plotting
word_freq_df <- data.frame(word = names(word_freq), freq = word_freq)
# Plot histogram of top 30 word frequencies
ggplot(word_freq_df[1:30, ], aes(x = reorder(word, -freq), y = freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
theme_minimal() +
labs(title = "Top 30 Most Frequent Words in Twitter Dataset", x = "Word", y = "Frequency") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Bigram & Trigram Analysis
To expand on this analysis, again using the ‘Twitter’ data set, I have examined bigram and trigram frequencies.
# Sample 10,000 lines for n-gram extraction
set.seed(456)
sampled <- sample(cleaned, size = min(10000, length(cleaned)))
# Tokenize bigrams and trigrams using `tokenizers`
bigrams <- unlist(tokenize_ngrams(sampled, n = 2))
trigrams <- unlist(tokenize_ngrams(sampled, n = 3))
# Frequency tables using data.table
bigram_dt <- data.table(table(bigrams))[order(-N)][1:20]
trigram_dt <- data.table(table(trigrams))[order(-N)][1:20]
# Plot bigrams
ggplot(bigram_dt, aes(x = reorder(bigrams, N), y = N)) +
geom_bar(stat = "identity", fill = "tomato") +
coord_flip() +
theme_minimal() +
labs(title = "Top 20 Bigrams", x = "Bigram", y = "Frequency")
# Plot trigrams
ggplot(trigram_dt, aes(x = reorder(trigrams, N), y = N)) +
geom_bar(stat = "identity", fill = "seagreen") +
coord_flip() +
theme_minimal() +
labs(title = "Top 20 Trigrams", x = "Trigram", y = "Frequency")
Predicting n-grams that never appear in the training data is a challenge. We can use smoothing techniques:
Laplace Smoothing: Adds a small count to every n-gram so unseen n-grams receive a non-zero probability.
Kneser-Ney Smoothing: Discounts observed counts and redistributes the probability mass using lower-order n-grams.
Simple Laplace Smoothing Implementation
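Below is a minimal sketch of add-one (Laplace) smoothing for bigram probabilities, reusing sampled and bigrams from the chunks above; the laplace_prob helper is illustrative, not taken from any package. With add-one smoothing, P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V), where V is the vocabulary size.
# Add-one (Laplace) smoothed bigram probability, reusing `sampled` and `bigrams`
unigram_counts <- table(unlist(tokenize_words(sampled)))
bigram_counts <- table(bigrams)
V <- length(unigram_counts) # vocabulary size
laplace_prob <- function(w1, w2) {
  bg <- paste(w1, w2)
  bg_count <- if (bg %in% names(bigram_counts)) bigram_counts[[bg]] else 0
  w1_count <- if (w1 %in% names(unigram_counts)) unigram_counts[[w1]] else 0
  (bg_count + 1) / (w1_count + V)
}
laplace_prob("thanks", "for") # seen bigram
laplace_prob("thanks", "zzzz") # unseen bigram still gets a small non-zero probability
Even bigrams never observed in the sample now receive a small probability, at the cost of slightly flattening the estimates for frequent bigrams.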
Model Selection
For word prediction, common models include:
Markov Chains: Simple probabilistic transition between words.
Backoff Models: If a higher-order n-gram is missing, fall back to lower n-grams.
Interpolation: Combines different n-gram probabilities.
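As a rough illustration of the backoff idea, the bigram and trigram vectors built earlier can be reused: look for the most frequent trigram matching the last two words of the input, and fall back to bigrams when none is found (predict_next is a hypothetical helper, not the final model).
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
trigram_counts <- sort(table(trigrams), decreasing = TRUE)
predict_next <- function(phrase) {
  words <- tail(strsplit(tolower(trimws(phrase)), "\\s+")[[1]], 2)
  # Try trigrams whose first two words match the end of the phrase
  if (length(words) == 2) {
    hits <- trigram_counts[startsWith(names(trigram_counts), paste0(paste(words, collapse = " "), " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  }
  # Back off to bigrams whose first word matches the last word of the phrase
  hits <- bigram_counts[startsWith(names(bigram_counts), paste0(tail(words, 1), " "))]
  if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  NA_character_ # no match: would ultimately fall back to the most frequent unigram
}
predict_next("thanks for the")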
The Shiny App will:
Take user input
Predict next words using an n-gram model
Display results interactively
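A minimal skeleton of what that app might look like (the layout and the call to the predict_next sketch above are assumptions about the next stage, not the finished app):
library(shiny)
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    predict_next(input$phrase) # n-gram backoff sketch from above
  })
}
# shinyApp(ui, server) # left commented out so the report still knits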
I look forward to presenting the next stage of the assignment, where I will dive deeper into modelling.