This milestone report summarizes my exploratory data analysis (EDA) on the SwiftKey text data and outlines the plan for developing the final prediction algorithm and Shiny application.
The goals of this report are to summarize basic statistics of the three data files, present key findings from the exploratory n-gram analysis, and outline the plan for the prediction algorithm and Shiny app.
The data come from three sources: blogs, news, and Twitter, all in English (US).
# Paths
twitter <- "/Users/lynnettong/Desktop/Coursera/final/en_US/en_US.twitter.txt"
blogs <- "/Users/lynnettong/Desktop/Coursera/final/en_US/en_US.blogs.txt"
news <- "/Users/lynnettong/Desktop/Coursera/final/en_US/en_US.news.txt"
# Count lines
twitter_lines <- length(readLines(twitter, warn = FALSE))
blogs_lines <- length(readLines(blogs, warn = FALSE))
news_lines <- length(readLines(news, warn = FALSE))
## Source Lines
## 1 Blogs 899288
## 2 News 1010242
## 3 Twitter 2360148
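The summary table above was printed from a small data frame; the exact code that produced it is not shown in this report, but a minimal sketch along these lines would reproduce it:
# Sketch: combine the line counts into the summary table shown above
line_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(blogs_lines, news_lines, twitter_lines)
)
line_summary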
As the data sets were large, I sampled 1% of the data for my exploratory analysis:
set.seed(123)
sample_pct <- 0.01
# Draw roughly 1% of the lines from each source (floor() keeps the sample sizes whole numbers)
sample_data <- c(
  sample(readLines(blogs, warn = FALSE), floor(blogs_lines * sample_pct)),
  sample(readLines(news, warn = FALSE), floor(news_lines * sample_pct)),
  sample(readLines(twitter, warn = FALSE), floor(twitter_lines * sample_pct))
)
# If sample_data is huge, take a smaller subsample
sample_size <- 15000 # or 10,000
sample_small <- sample(sample_data, size = sample_size)
length(sample_small)
## [1] 15000
To prepare the data for analysis, I worked with the randomly sampled subset of the original corpus to reduce memory usage and speed up processing. Using the quanteda package, the following steps were applied: tokenization into words; removal of punctuation, numbers, and symbols; conversion to lowercase; removal of English stopwords; and filtering to purely alphabetic tokens.
library(quanteda)
## Package version: 4.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
# sample_small: the sampled vector of text lines (about 15,000 lines)
corp <- corpus(sample_small)
toks <- tokens(
  corp,
  what = "word",
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE
)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
# Remove empty tokens
toks <- tokens_remove(toks, "")
# Keep only purely alphabetic tokens
toks <- tokens_select(toks, pattern = "^[a-z]+$", valuetype = "regex")
toks[[1]][1:20] # preview tokens (NAs appear because this document has fewer than 20 tokens)
## [1] "screaming" "freshman" "looked" "pretty" "good" "red"
## [7] "white" "game" NA NA NA NA
## [13] NA NA NA NA NA NA
## [19] NA NA
The cleaned tokens were used to explore the structure of the language present in the dataset. I examined:
word frequencies (unigrams)
common 2-word combinations (bigrams)
common 3-word combinations (trigrams)
These provide insight into how people naturally write, and help guide the design of the future text prediction algorithm.
dfm_uni <- dfm(toks)
dfm_uni <- dfm_trim(dfm_uni, min_termfreq = 10)
top_uni <- topfeatures(dfm_uni, 20)
top_uni
## said just one like can get time new good now love
## 1079 1039 1032 928 866 839 771 677 629 609 589
## day know people see back go also first make
## 582 565 545 541 519 491 480 479 460
library(ggplot2)
uni_df <- data.frame(
word = names(top_uni),
freq = as.numeric(top_uni)
)
ggplot(uni_df, aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 20 Most Frequent Words",
x = "Word", y = "Frequency")
toks_bi <- tokens_ngrams(toks, n = 2)
dfm_bi <- dfm(toks_bi)
dfm_bi <- dfm_trim(dfm_bi, min_termfreq = 5)
top_bi <- topfeatures(dfm_bi, 20)
top_bi
## right_now new_york last_year years_ago last_night
## 98 66 57 56 55
## first_time looking_forward make_sure high_school can_get
## 51 47 44 44 38
## just_like feel_like good_morning even_though just_got
## 36 36 33 32 32
## looks_like let_know last_week new_jersey two_years
## 32 32 32 30 30
bi_df <- data.frame(
ngram = names(top_bi),
freq = as.numeric(top_bi)
)
ggplot(bi_df, aes(x = reorder(ngram, freq), y = freq)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 20 Bigrams",
x = "Bigram", y = "Frequency")
toks_tri <- tokens_ngrams(toks, n = 3)
dfm_tri <- dfm(toks_tri)
dfm_tri <- dfm_trim(dfm_tri, min_termfreq = 3)
top_tri <- topfeatures(dfm_tri, 20)
top_tri
## paintball_marker_upgrades w_sunset_blvd
## 11 10
## let_us_know kentucky_kentucky_kentucky
## 8 8
## just_got_back two_years_ago
## 7 7
## new_york_city new_york_times
## 7 6
## happy_mothers_day told_associated_press
## 5 5
## luther_king_jr los_angeles_times
## 4 4
## four_years_ago one_three_girls
## 4 4
## happy_new_year love_love_love
## 4 4
## three_years_ago time_last_year
## 4 4
## follow_back_please really_really_really
## 4 4
tri_df <- data.frame(
ngram = names(top_tri),
freq = as.numeric(top_tri)
)
ggplot(tri_df, aes(x = reorder(ngram, freq), y = freq)) +
geom_bar(stat = "identity") +
coord_flip() +
labs(title = "Top 20 Trigrams",
x = "Trigram", y = "Frequency")
The exploratory analysis of the sampled corpus reveals several consistent language patterns in how people naturally write across the blog, news, and Twitter sources:
Frequent bigrams such as “right_now”, “just_like”, “feel_like”, and trigrams like “really_really_really” and “love_love_love” show that users rely heavily on informal, expressive phrasing and repetition to convey emotion, emphasis, and personality. This aligns with the casual, fast-paced nature of online communication.
Time-related expressions such as “last_year”, “years_ago”, “last_night”, “two_years”, and trigrams like “three_years_ago” and “four_years_ago” indicate that users frequently discuss past events, stories, and personal experiences. Temporal references are a core component of online conversations.
Bigrams and trigrams such as “new_york”, “new_jersey”, “new_york_city”, and “los_angeles_times” suggest users often mention locations—either in relation to news, travel, or personal updates. This reflects the diverse and regionally distributed nature of U.S. social media users.
Phrases like “good_morning”, “looking_forward”, and holiday expressions such as “happy_new_year” and “happy_mothers_day” highlight routine social greetings, celebrations, and well-wishes—common behaviors in digital communication.
The presence of trigrams such as “told_associated_press” and “luther_king_jr” demonstrates that the corpus includes references to news reporting and public figures, indicating a blend of personal expression and real-world topics typical of a large, public social media dataset.
Overall, the combined n-gram analysis shows a rich mixture of informal conversation, temporal storytelling, geographic context, social interaction, and news-related content. These patterns provide a strong foundation for constructing an n-gram-based next-word prediction model.
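To bridge from this exploration to the prediction model, the counts stored in the exploratory dfm objects can be reshaped into simple lookup tables (prefix, candidate next word, count). The snippet below is a minimal sketch of that step for the bigram dfm built earlier; the object and column names (bi_tab, prefix, next_word) are illustrative placeholders rather than the final implementation.
# Sketch: reshape the bigram counts into a prefix -> next-word lookup table
bi_counts <- colSums(dfm_bi)                        # total count per bigram feature
parts <- strsplit(names(bi_counts), "_", fixed = TRUE)
bi_tab <- data.frame(
  prefix = vapply(parts, `[`, character(1), 1),     # first word of the bigram
  next_word = vapply(parts, `[`, character(1), 2),  # second word of the bigram
  count = as.numeric(bi_counts)
)
head(bi_tab[order(-bi_tab$count), ])
The same reshaping applies to the trigram dfm, with the first two words joined to form the prefix.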
The final predictive model will be based on n-gram probabilities that estimate the most likely next word given the previous one or two words.
Planned Approach:
Build n-gram tables (1-, 2-, 3-grams) from the cleaned corpus
Implement a Stupid Backoff algorithm:
Prefer trigrams
Fall back to bigrams if no match
Fall back to unigrams if needed
Apply smoothing to handle rare or unseen events
Optimize for fast lookup to ensure a responsive Shiny app
This method is widely used, simple to implement, and efficient for mobile text input prediction.
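As a rough illustration of the planned lookup, the sketch below implements the backoff order described above on a few toy rows. The table names (uni_tab, bi_tab, tri_tab), the 0.4 backoff factor, and the scoring by relative frequency within each table are assumptions made for this sketch, not the final design.
# Sketch of a Stupid Backoff next-word lookup on toy tables.
# Real tables would be built from the full corpus; lambda = 0.4 is the
# conventional backoff penalty, used here purely for illustration.
uni_tab <- data.frame(word = c("now", "time"), count = c(609, 771))
bi_tab  <- data.frame(prefix = "right", next_word = "now", count = 98)
tri_tab <- data.frame(prefix = "happy new", next_word = "year", count = 4)

predict_next <- function(phrase, lambda = 0.4, top_n = 3) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

  # 1. Prefer the trigram table, keyed on the last two words
  if (length(words) == 2) {
    hits <- tri_tab[tri_tab$prefix == paste(words, collapse = " "), ]
    if (nrow(hits) > 0) {
      hits$score <- hits$count / sum(hits$count)
      return(head(hits[order(-hits$score), c("next_word", "score")], top_n))
    }
  }

  # 2. Back off to the bigram table, keyed on the last word, with a penalty
  hits <- bi_tab[bi_tab$prefix == tail(words, 1), ]
  if (nrow(hits) > 0) {
    hits$score <- lambda * hits$count / sum(hits$count)
    return(head(hits[order(-hits$score), c("next_word", "score")], top_n))
  }

  # 3. Fall back to the most frequent unigrams, doubly penalized
  out <- head(uni_tab[order(-uni_tab$count), ], top_n)
  data.frame(next_word = out$word,
             score = lambda^2 * out$count / sum(uni_tab$count))
}

predict_next("a happy new")  # the toy trigram table suggests "year"
Precomputing these tables once and indexing them (for example with data.table keys) should keep each lookup fast enough for a responsive Shiny app.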
The Shiny app will demonstrate the prediction algorithm and provide an interactive interface for users.
Planned Features:
A text input box for users to type a word or phrase
Real-time predicted next word based on n-gram lookup
A ranked list of the top 3–5 suggested next words
A clean, minimal interface suitable for mobile or web use
The app will be lightweight, fast, and easy for non-technical users to use.
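A minimal sketch of how the app could be wired up is shown below; predict_next() is assumed to be a lookup function like the one sketched earlier, and the layout and control names are placeholders rather than the final interface.
library(shiny)

# Sketch of the planned app: a text box plus a small table of ranked suggestions
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a word or phrase:", value = ""),
  tableOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderTable({
    req(nzchar(input$phrase))              # wait until the user has typed something
    predict_next(input$phrase, top_n = 5)  # top 5 ranked next-word suggestions
  })
}

shinyApp(ui, server)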