Introduction

This milestone report summarizes my exploratory data analysis (EDA) on the SwiftKey text data and outlines the plan for developing the final prediction algorithm and Shiny application.

The goals of this report are to:

  1. Ensure successful download and loading of the dataset
  2. Present summary statistics of the data
  3. Highlight interesting findings from the exploratory analyses
  4. Briefly describe planned next steps for the prediction model and app

The data come from three sources: blogs, news, and Twitter, all in English (US).

Loading the data
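The zipped dataset was downloaded and unpacked before loading. A minimal sketch of that step, assuming the standard Coursera-SwiftKey download URL (replace with the link from the course page if it differs):

# Download the zip once and unpack it into the working directory
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
}
if (!dir.exists("final")) {
  unzip("Coursera-SwiftKey.zip")   # extracts to final/en_US/, matching the paths below
}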

# Paths
twitter <- "/Users/lynnettong/Desktop/Coursera/final/en_US/en_US.twitter.txt"
blogs   <- "/Users/lynnettong/Desktop/Coursera/final/en_US/en_US.blogs.txt"
news    <- "/Users/lynnettong/Desktop/Coursera/final/en_US/en_US.news.txt"

# Count lines
twitter_lines <- length(readLines(twitter, warn = FALSE))
blogs_lines   <- length(readLines(blogs, warn = FALSE))
news_lines    <- length(readLines(news, warn = FALSE))

Summary of line counts
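The counts can be collected into a small data frame for display, for example:

line_counts <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(blogs_lines, news_lines, twitter_lines)
)
line_counts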

##    Source   Lines
## 1   Blogs  899288
## 2    News 1010242
## 3 Twitter 2360148

Sampling of data

As the data sets were large, I sampled 1% of the data for my exploratory analysis:

set.seed(123)
sample_pct <- 0.01

sample_data <- c(
  sample(readLines(blogs, warn = FALSE), blogs_lines * sample_pct),
  sample(readLines(news,  warn = FALSE), news_lines  * sample_pct),
  sample(readLines(twitter, warn = FALSE), twitter_lines * sample_pct)
)

# If sample_data is huge, take a smaller subsample
sample_size <- 15000     # or 10,000
sample_small <- sample(sample_data, size = sample_size)
length(sample_small)
## [1] 15000

Simple data cleaning

To prepare the data for analysis, I worked with a randomly sampled subset of the original corpus to reduce memory usage and ensure faster processing.

The following steps were applied:

  1. Created a corpus from the sampled text
  2. Tokenised the text into individual words
  3. Removed punctuation, numbers, and symbols
  4. Converted all text to lowercase
  5. Removed English stopwords (e.g., “the”, “and”, “at”)
  6. (Optional) Removed profanity terms (a sketch of this step appears after the token preview below)
  7. Prepared cleaned tokens for n-gram analysis
library(quanteda)
## Package version: 4.3.1
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
# sample_small = your sampled vector of text lines (e.g. 10k–20k lines)

corp <- corpus(sample_small)

toks <- tokens(
  corp,
  what = "word",
  remove_punct   = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE
)

toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))

# Remove any empty tokens, then keep only purely alphabetic tokens
toks <- tokens_remove(toks, "")
toks <- tokens_select(toks, pattern = "^[a-z]+$", valuetype = "regex")

toks[[1]][1:20]   # preview the first 20 tokens (NAs indicate this document has fewer than 20 tokens)
##  [1] "screaming" "freshman"  "looked"    "pretty"    "good"      "red"      
##  [7] "white"     "game"      NA          NA          NA          NA         
## [13] NA          NA          NA          NA          NA          NA         
## [19] NA          NA
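Step 6 (profanity filtering) was listed as optional and is not shown above. A minimal sketch of how it could be added, assuming a plain-text word list is available (the file name profanity.txt is illustrative):

# Remove profanity terms using an external word list (hypothetical file)
bad_words <- readLines("profanity.txt", warn = FALSE)
toks <- tokens_remove(toks, bad_words)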

Exploratory Analysis

The cleaned tokens were used to explore the structure of the language present in the dataset. I examined:

  1. Unigram (single-word) frequencies
  2. Bigram (two-word) frequencies
  3. Trigram (three-word) frequencies

These frequencies provide insight into how people naturally write and help guide the design of the planned text prediction algorithm.

Unigram Analysis (Single word)

dfm_uni <- dfm(toks)
dfm_uni <- dfm_trim(dfm_uni, min_termfreq = 10)

top_uni <- topfeatures(dfm_uni, 20)
top_uni
##   said   just    one   like    can    get   time    new   good    now   love 
##   1079   1039   1032    928    866    839    771    677    629    609    589 
##    day   know people    see   back     go   also  first   make 
##    582    565    545    541    519    491    480    479    460
library(ggplot2)

uni_df <- data.frame(
  word = names(top_uni),
  freq = as.numeric(top_uni)
)

ggplot(uni_df, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words",
       x = "Word", y = "Frequency")

Bigram Analysis (Two-word pair)

toks_bi <- tokens_ngrams(toks, n = 2)
dfm_bi  <- dfm(toks_bi)
dfm_bi  <- dfm_trim(dfm_bi, min_termfreq = 5)

top_bi <- topfeatures(dfm_bi, 20)
top_bi
##       right_now        new_york       last_year       years_ago      last_night 
##              98              66              57              56              55 
##      first_time looking_forward       make_sure     high_school         can_get 
##              51              47              44              44              38 
##       just_like       feel_like    good_morning     even_though        just_got 
##              36              36              33              32              32 
##      looks_like        let_know       last_week      new_jersey       two_years 
##              32              32              32              30              30
bi_df <- data.frame(
  ngram = names(top_bi),
  freq  = as.numeric(top_bi)
)

ggplot(bi_df, aes(x = reorder(ngram, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 Bigrams",
       x = "Bigram", y = "Frequency")

Trigram Analysis (Three-word sequence)

toks_tri <- tokens_ngrams(toks, n = 3)
dfm_tri  <- dfm(toks_tri)
dfm_tri  <- dfm_trim(dfm_tri, min_termfreq = 3)

top_tri <- topfeatures(dfm_tri, 20)
top_tri
##  paintball_marker_upgrades              w_sunset_blvd 
##                         11                         10 
##                let_us_know kentucky_kentucky_kentucky 
##                          8                          8 
##              just_got_back              two_years_ago 
##                          7                          7 
##              new_york_city             new_york_times 
##                          7                          6 
##          happy_mothers_day      told_associated_press 
##                          5                          5 
##             luther_king_jr          los_angeles_times 
##                          4                          4 
##             four_years_ago            one_three_girls 
##                          4                          4 
##             happy_new_year             love_love_love 
##                          4                          4 
##            three_years_ago             time_last_year 
##                          4                          4 
##         follow_back_please       really_really_really 
##                          4                          4
tri_df <- data.frame(
  ngram = names(top_tri),
  freq  = as.numeric(top_tri)
)

ggplot(tri_df, aes(x = reorder(ngram, freq), y = freq)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  labs(title = "Top 20 Trigrams",
       x = "Trigram", y = "Frequency")

Key Findings

The exploratory analysis of the sampled text corpus reveals several consistent language patterns that reflect how people naturally write across blogs, news articles, and Twitter posts:

  1. Conversational and expressive language dominates

Frequent bigrams such as “right_now”, “just_like”, “feel_like”, and trigrams like “really_really_really” and “love_love_love” show that users rely heavily on informal, expressive phrasing and repetition to convey emotion, emphasis, and personality. This aligns with the casual, fast-paced nature of online communication.

  2. Strong temporal themes appear in all n-gram levels

Time-related expressions such as “last_year”, “years_ago”, “last_night”, “two_years”, and trigrams like “three_years_ago” and “four_years_ago” indicate that users frequently discuss past events, stories, and personal experiences. Temporal references are a core component of online conversations.

  3. Geographic references are common in user posts

Bigrams and trigrams such as “new_york”, “new_jersey”, “new_york_city”, and “los_angeles_times” suggest users often mention locations—either in relation to news, travel, or personal updates. This reflects the diverse and regionally distributed nature of U.S. social media users.

  4. Greetings and positive social interactions appear frequently

Phrases like “good_morning”, “looking_forward”, and holiday expressions such as “happy_new_year” and “happy_mothers_day” highlight routine social greetings, celebrations, and well-wishes—common behaviors in digital communication.

  5. News, events, and public figures also appear in the data

The presence of trigrams such as “told_associated_press” and “luther_king_jr” demonstrates that the corpus includes references to news reporting and public figures, indicating a blend of personal expression and real-world topics typical of a large, public social media dataset.

Overall, the combined n-gram analysis shows a rich mixture of informal conversation, temporal storytelling, geographic context, social interaction, and news-related content. These patterns provide a strong foundation for constructing an n-gram-based next-word prediction model.

Plan for the Prediction Algorithm

The final predictive model will be based on n-gram probabilities that estimate the most likely next word given the previous one or two words.

Planned Approach:

  1. Build unigram, bigram, and trigram frequency tables from a larger sample of the corpus
  2. Given the last one or two words typed, look up the most frequent n-gram that starts with them
  3. Back off from trigrams to bigrams to unigrams when no matching n-gram is found

This method is widely used, simple to implement, and efficient for mobile text input prediction.
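To make the plan concrete, the sketch below shows one way the lookup could work, built on n-gram frequency tables like those from the exploratory analysis. The names (freq_table, predict_next, uni, bi, tri) are illustrative rather than final code, and for the real model stopwords would typically be kept, since function words are often exactly what users want predicted; they were removed above only to make the exploratory counts more informative.

# Build n-gram frequency tables from the cleaned tokens (names are illustrative)
library(quanteda)

freq_table <- function(toks, n) {
  d <- dfm(tokens_ngrams(toks, n = n))
  f <- topfeatures(d, n = nfeat(d))        # named vector: ngram -> count, sorted by frequency
  data.frame(ngram = names(f), freq = unname(f), stringsAsFactors = FALSE)
}

uni <- freq_table(toks, 1)
bi  <- freq_table(toks, 2)
tri <- freq_table(toks, 3)

# Predict the next word from the last one or two words typed,
# backing off from trigrams to bigrams to the overall most frequent word
predict_next <- function(input, tri, bi, uni) {
  words <- tolower(unlist(strsplit(trimws(input), "\\s+")))
  k <- length(words)

  if (k >= 2) {
    prefix <- paste(words[k - 1], words[k], sep = "_")
    hits <- tri[startsWith(tri$ngram, paste0(prefix, "_")), ]
    if (nrow(hits) > 0) return(sub(".*_", "", hits$ngram[1]))
  }
  if (k >= 1) {
    hits <- bi[startsWith(bi$ngram, paste0(words[k], "_")), ]
    if (nrow(hits) > 0) return(sub(".*_", "", hits$ngram[1]))
  }
  uni$ngram[1]                             # fall back to the most frequent unigram
}

predict_next("looking", tri, bi, uni)      # expected to return something like "forward"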

Shiny App Design

The Shiny app will demonstrate the prediction algorithm and provide an interactive interface for users.

Planned Features:

  1. A text box where the user types a phrase
  2. Display of the predicted next word, updated as the user types
  3. Optionally, a short list of alternative suggestions

The app will be lightweight, fast, and easy for non-technical users to use.
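A minimal sketch of what the interface could look like, assuming the predict_next() helper and the n-gram tables sketched in the previous section are available (all names are illustrative):

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(nzchar(input$phrase))                  # wait until the user has typed something
    predict_next(input$phrase, tri, bi, uni)   # hypothetical helper from the plan above
  })
}

shinyApp(ui, server)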