Introduction

This R Markdown document, “Data Insights and Predictive Text Plans,” explores key features of the text data using 10,000-line samples (en_US.twitter_sample.txt, en_US.blogs_sample.txt, en_US.news_sample.txt) drawn from the much larger source files. We computed basic summaries (line counts, word counts, and averages per line) for all three samples. Using en_US.twitter_sample.txt, we analyzed word frequencies (e.g., “the” at 3,976 occurrences, 18.3% coverage), 2-gram and 3-gram frequencies (e.g., “in the” at 334), and dictionary coverage (132 words for 50% of word instances, 4,635 for 90%), each visualized with histograms. We also outline a prediction algorithm based on the most frequent words and phrases (targeting 70-80% accuracy) and a Shiny app for real-time next-word suggestions, both tailored for educational use.

1. Basic summaries of the three en_US sample datasets

We generated line counts, word counts, and a summary table for 10,000-line samples of en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt to keep processing manageable given the large file sizes. The sampled files were saved as en_US.twitter_sample.txt, en_US.blogs_sample.txt, and en_US.news_sample.txt.

Concept

Line Counts: Number of lines in each sample (10,000).

Word Counts: Total and unique words (lowercase for consistency).

Data Table: Summarize statistics for comparison.

Method: Use 10,000-line samples to reduce processing time while still capturing key characteristics (the original files are too large to process in full); a sketch of how such a sample can be drawn is shown below.
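The sampling itself was done once, before this analysis. The sketch below shows one way such samples could be drawn, assuming the full en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt files sit in the working directory; the seed value is illustrative.

# Sketch: draw a reproducible 10,000-line sample from each full file
# (assumes the original en_US.*.txt files are in the working directory)
set.seed(123)
full_files <- c(
  twitter = "en_US.twitter.txt",
  blogs = "en_US.blogs.txt",
  news = "en_US.news.txt"
)
for (name in names(full_files)) {
  lines <- readLines(full_files[[name]], encoding = "UTF-8", skipNul = TRUE)
  sampled <- sample(lines, size = 10000)
  writeLines(sampled, sprintf("en_US.%s_sample.txt", name))
}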

library(tokenizers)
library(data.table)
base_dir <- "/Users/yueminqin/Desktop/Capstone Project/module2/en_US"
setwd(base_dir)

# Define sample file paths
sample_files <- c(
  twitter = "en_US.twitter_sample.txt",
  blogs = "en_US.blogs_sample.txt",
  news = "en_US.news_sample.txt"
)

# Summarize sample files
get_sample_summary <- function(file) {
  cat("Processing", basename(file), "\n")
  start_time <- Sys.time()
  data <- tryCatch(
    readLines(file, encoding = "UTF-8"),
    error = function(e) stop(paste("Cannot read file:", file, "-", e$message))
  )
  line_count <- length(data)
  # Tokenize once and reuse the result for both total and unique word counts
  token_list <- tokenize_words(data, lowercase = TRUE, simplify = TRUE)
  total_words <- sum(lengths(token_list))
  unique_words <- length(unique(unlist(token_list)))
  avg_words_per_line <- if (line_count > 0) total_words / line_count else 0
  cat("Finished", basename(file), "- Lines:", line_count, "in", round(difftime(Sys.time(), start_time, units = "secs"), 2), "seconds\n")
  data.table(
    File = basename(file),
    Lines = line_count,
    Total_Words = total_words,
    Unique_Words = unique_words,
    Avg_Words_Per_Line = round(avg_words_per_line, 2)
  )
}

sample_summaries <- lapply(sample_files, get_sample_summary)
## Processing en_US.twitter_sample.txt 
## Finished en_US.twitter_sample.txt - Lines: 10000 in 0.1 seconds
## Processing en_US.blogs_sample.txt 
## Finished en_US.blogs_sample.txt - Lines: 10000 in 0.23 seconds
## Processing en_US.news_sample.txt 
## Finished en_US.news_sample.txt - Lines: 10000 in 0.24 seconds
summary_table <- do.call(rbind, sample_summaries)
print(summary_table)
##                        File Lines Total_Words Unique_Words Avg_Words_Per_Line
##                      <char> <int>       <int>        <int>              <num>
## 1: en_US.twitter_sample.txt 10000      127369        15636              12.74
## 2:   en_US.blogs_sample.txt 10000      342182        30754              34.22
## 3:    en_US.news_sample.txt 10000      342365        30973              34.24

2. Major Features of the Twitter Sample (en_US.twitter_sample.txt)

Feature 1. Some words are more frequent than others - what is the distribution of word frequencies?

Some words (e.g., “the”, “to”) appear more often than others. We want to see how often each word shows up and how these frequencies are spread out across the sample.

# Load required packages
library(tokenizers)
library(data.table)

# Set working directory
base_dir <- "/Users/yueminqin/Desktop/Capstone Project/module2/en_US"
setwd(base_dir)

# Process Twitter sample for word frequencies
file <- "en_US.twitter_sample.txt"
cat("Processing", basename(file), "\n")
## Processing en_US.twitter_sample.txt
start_time <- Sys.time()
data <- readLines(file, encoding = "UTF-8")
tokens <- unlist(tokenize_words(data, lowercase = TRUE, simplify = TRUE))
cat("Finished tokenizing", basename(file), "- Words:", length(tokens), "in", round(difftime(Sys.time(), start_time, units = "secs"), 2), "seconds\n")
## Finished tokenizing en_US.twitter_sample.txt - Words: 127369 in 0.06 seconds
# Count word frequencies
word_freq <- as.numeric(table(tokens))

# Create histogram of word frequencies
hist(word_freq, breaks = 50, main = "Distribution of Word Frequencies in Twitter Sample", xlab = "Word Frequency", ylab = "Number of Words", col = "lightgreen", xlim = c(0, max(word_freq)))

This histogram shows how many times each word appears: most words occur only a few times (e.g., 1-10), while a small number of words (e.g., “the”) appear very often. The right-skewed, “long-tail” shape means a handful of common words dominate the text, which supports focusing prediction on frequent terms.
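To see which words sit at the head of that long tail, the most frequent tokens can be listed directly. A quick check along these lines (reusing the tokens vector from the chunk above):

# Top 10 most frequent words in the Twitter sample (reuses `tokens` from above)
top_words <- sort(table(tokens), decreasing = TRUE)
head(top_words, 10)

# Share of all word instances accounted for by the single most frequent word
round(100 * as.numeric(top_words[1]) / length(tokens), 1)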

Feature 2. What are the frequencies of 2-grams and 3-grams in the dataset?

A 2-gram is a pair of words (e.g., “social media”), and a 3-gram is a triplet (e.g., “breaking news today”). We want to count how often these word combinations appear in the Twitter sample to see which phrases are common.

# Generate 2-grams and 3-grams
bigrams <- unlist(tokenize_ngrams(data, n = 2, lowercase = TRUE, simplify = TRUE))
trigrams <- unlist(tokenize_ngrams(data, n = 3, lowercase = TRUE, simplify = TRUE))
cat("Finished generating n-grams for", basename(file), "- 2-grams:", length(bigrams), "3-grams:", length(trigrams), "in", round(difftime(Sys.time(), start_time, units = "secs"), 2), "seconds\n")
## Finished generating n-grams for en_US.twitter_sample.txt - 2-grams: 117373 3-grams: 107677 in 0.45 seconds
# Count frequencies
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
trigram_freq <- sort(table(trigrams), decreasing = TRUE)

# Create histograms
par(mfrow = c(1, 2))  # Side-by-side plots
hist(bigram_freq, breaks = 50, main = "2-gram Frequencies", xlab = "Frequency", ylab = "Number of 2-grams", col = "yellow", xlim = c(0, max(bigram_freq)))
hist(trigram_freq, breaks = 50, main = "3-gram Frequencies", xlab = "Frequency", ylab = "Number of 3-grams", col = "lightcoral", xlim = c(0, max(trigram_freq)))

par(mfrow = c(1, 1))  # Reset layout

These side-by-side histograms show 2-gram (e.g., “in the”) and 3-gram (e.g., “thanks for the”) frequencies. Most combinations appear only rarely, while a few occur on the order of 100-300 times. These recurring phrase patterns are what a next-word predictor for tweets can exploit.
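The most common phrases themselves can be listed by printing the head of the sorted frequency tables; a short check (reusing bigram_freq and trigram_freq from the chunk above):

# Most frequent 2-grams and 3-grams (reuses the sorted tables from above)
head(bigram_freq, 10)
head(trigram_freq, 10)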

Feature 3. How many unique words do we need in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?

A frequency-sorted dictionary lists words from most to least frequent. We want to know how many of the top words cover 50% and 90% of all word uses in the Twitter sample. This helps decide how many words to include in a prediction model to capture most of the text.

# Read the Twitter sample file
file <- "en_US.twitter_sample.txt"
cat("Processing", basename(file), "\n")
## Processing en_US.twitter_sample.txt
start_time <- Sys.time()
data <- readLines(file, encoding = "UTF-8")
tokens <- unlist(tokenize_words(data, lowercase = TRUE, simplify = TRUE))
cat("Finished tokenizing", basename(file), "- Words:", length(tokens), "in", round(difftime(Sys.time(), start_time, units = "secs"), 2), "seconds\n")
## Finished tokenizing en_US.twitter_sample.txt - Words: 127369 in 0.05 seconds
# Count word frequencies
word_freq <- sort(table(tokens), decreasing = TRUE)

# Calculate cumulative coverage
total_words <- length(tokens)
cumulative_coverage <- cumsum(word_freq) / total_words * 100  # word_freq is already sorted
coverage_freq <- seq_along(cumulative_coverage)

# Create histogram of coverage
h <- hist(coverage_freq[cumulative_coverage <= 90], breaks = 50,
          main = "Words Needed for Coverage", xlab = "Number of Words",
          ylab = "Frequency", col = "lightpink", xlim = c(0, 5000))
abline(v = 132, col = "black", lty = 2)  # 50% coverage
abline(v = 4635, col = "blue", lty = 2)  # 90% coverage
text(132, max(h$counts), "50%", pos = 4, col = "black")
text(4635, max(h$counts), "90%", pos = 4, col = "blue")

This histogram shows how the number of words required grows as coverage increases, with dashed lines at 132 words (50% coverage) and 4,635 words (90% coverage). Most coverage comes from a small set of very frequent words, which helps determine how large the prediction dictionary needs to be.
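Rather than hard-coding the 132 and 4,635 cutoffs, the thresholds can be computed directly from the cumulative coverage vector; a minimal sketch (reusing cumulative_coverage from the chunk above):

# Number of top-ranked words needed to reach 50% and 90% coverage
# (reuses `cumulative_coverage` from the chunk above)
words_for_50 <- which(cumulative_coverage >= 50)[1]
words_for_90 <- which(cumulative_coverage >= 90)[1]
cat("Words for 50% coverage:", words_for_50, "\n")
cat("Words for 90% coverage:", words_for_90, "\n")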