Milestone Report: Exploratory Analysis of SwiftKey Text Data

knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)

Introduction

The goal of this project is to build a predictive text algorithm using the English SwiftKey dataset, which contains samples from blogs, news, and Twitter.

This report summarizes the exploratory analysis of the data and outlines plans for building a next-word prediction model and a Shiny app for end users.

# Set working directory
setwd("C:/Users/salya/Desktop/Coursera-SwiftKey/final/en_US")

# Load datasets (use skipNul=TRUE to avoid errors with Twitter file)
blogs   <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

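# Approximate words per line: number of runs of non-word characters, plus one.
# Caveat: gregexpr() returns -1 (length 1) when nothing matches, so a
# one-word or empty line is counted as 2.
word_count <- function(x) sapply(gregexpr("\\W+", x), length) + 1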

# Apply function
blogs_words   <- word_count(blogs)
news_words    <- word_count(news)
twitter_words <- word_count(twitter)

# Summary table
summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  AvgWordsPerLine = c(mean(blogs_words), mean(news_words), mean(twitter_words)),
  MaxWordsPerLine = c(max(blogs_words), max(news_words), max(twitter_words)),
  MinWordsPerLine = c(min(blogs_words), min(news_words), min(twitter_words))
)

summary_table

##      File   Lines AvgWordsPerLine MaxWordsPerLine MinWordsPerLine
## 1   Blogs  899288        43.50290            6852               2
## 2    News 1010206        36.34854            1929               2
## 3 Twitter 2360148        13.89476              47               2
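
The minimum of 2 in every row looks like an artifact of the counting helper rather than a property of the data: `gregexpr()` returns a single `-1` when a line has no non-word characters, and trailing punctuation counts as an extra separator. The sketch below compares the original helper with a whitespace-based alternative. `count_words` and `sample_lines` are hypothetical names introduced here for illustration, and the figures in the table above were produced with the original helper, not with this one.

```r
# Original helper from the report, repeated here for comparison.
word_count <- function(x) sapply(gregexpr("\\W+", x), length) + 1

# Alternative (an assumption, not the report's method): count
# whitespace-separated tokens, so empty or blank lines count as 0.
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))

sample_lines <- c("hello", "hello world.", "")
word_count(sample_lines)   # 2 3 2
count_words(sample_lines)  # 1 2 0
```

Splitting on whitespace treats contractions such as "don't" as one word, whereas the `\\W+` pattern splits them at the apostrophe.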

library(ggplot2)

df_plot <- data.frame(
  Words = c(blogs_words, news_words, twitter_words),
  Source = rep(c("Blogs","News","Twitter"), 
               times = c(length(blogs_words), length(news_words), length(twitter_words)))
)

ggplot(df_plot, aes(x = Words, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.6, position="identity") +
  xlim(0,100) +
  labs(title="Distribution of Words per Line", x="Words per line", y="Count") +
  theme_minimal()

library(dplyr)
library(tidytext)

blogs_df <- data.frame(text = blogs, stringsAsFactors = FALSE)

blogs_tokens <- blogs_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# slice_max() replaces the superseded top_n(); ties are kept in both.
top10 <- blogs_tokens %>% slice_max(n, n = 10) %>% arrange(n)

ggplot(top10, aes(x = reorder(word, n), y = n)) +
  geom_col(fill="steelblue") +
  coord_flip() +
  labs(title="Top 10 Words in Blogs", x="Word", y="Frequency") +
  theme_minimal()

Observations

Twitter posts are short, averaging about 14 words per line with a maximum of 47.

Blogs have the longest lines: the maximum is 6,852 words by the count above, and some lines exceed 40,000 characters.

Common words like “the”, “and”, and “you” dominate the blog word counts, so the model needs an explicit policy for handling stopwords.

Some profanity and hashtags appear in the Twitter data and may need filtering.

Plans for Prediction Algorithm

Build a next-word prediction model using n-grams (unigrams, bigrams, and trigrams).

For word sequences not seen in training, back off to shorter n-grams and apply smoothing so that they still receive small, nonzero probabilities.

Optimize the model for speed and memory so that it is suitable for deployment in a Shiny app.

The final app will allow users to type text and see predicted next words in real time.
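
The plan above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the final model: the toy `corpus` stands in for a cleaned sample of the SwiftKey data, the helpers `tokenize`, `ngram_counts`, and `predict_next` are hypothetical names, and the back-off is the simplest possible kind (take the most frequent continuation from the longest matching context, with no discounting or smoothing).

```r
# Toy corpus standing in for a cleaned data sample (assumption).
corpus <- c(
  "i want to go home",
  "i want to eat now",
  "we want to go out",
  "they need to go home"
)

# Lowercase a line and split it into word tokens.
tokenize <- function(line) {
  tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]
  tokens[tokens != ""]
}

# Count all n-grams of order n; returns a named vector sorted by count.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tab <- table(grams)
  sort(setNames(as.vector(tab), names(tab)), decreasing = TRUE)
}

# counts[[n]] holds the n-gram counts for n = 1, 2, 3.
counts <- lapply(1:3, function(n) ngram_counts(corpus, n))

# Try the trigram context first, then the bigram context,
# then fall back to the most frequent unigram.
predict_next <- function(text, counts) {
  tokens <- tokenize(text)
  for (n in 3:2) {
    k <- n - 1
    if (length(tokens) < k) next
    prefix <- paste0(paste(tail(tokens, k), collapse = " "), " ")
    hits <- counts[[n]][startsWith(names(counts[[n]]), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(prefix, "", best, fixed = TRUE))
    }
  }
  names(counts[[1]])[1]
}

predict_next("i want to", counts)  # "go"
predict_next("they", counts)       # "need"
```

A production version would likely store the counts in a keyed table rather than scanning names with `startsWith()`, which matters for the speed and memory goals listed above.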

Sara AlYafei

2025-09-21