SwiftKey Capstone Milestone Report

Introduction

This report analyzes the SwiftKey dataset for the Coursera Data Science Capstone. The goal is to build a word prediction application.

1. Data Loading (Using Sampling to Prevent Crashes)

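We read only the first 10,000 lines of each file to avoid memory issues. Strictly, this takes the head of each file rather than a random sample, so the figures below describe only the start of each corpus.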

# Set file paths (adjust if your files are in a different location)
blogs_path <- "final/en_US/en_US.blogs.txt"
news_path <- "final/en_US/en_US.news.txt"
twitter_path <- "final/en_US/en_US.twitter.txt"

# Read the first n lines of a file, stopping early with a clear error if the file is missing
read_sample <- function(path, n = 10000) {
  if (!file.exists(path)) {
    stop(paste("File not found:", path))
  }
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con))
  lines <- readLines(con, n = n, warn = FALSE, skipNul = TRUE)
  return(lines)
}

# Read 10,000 lines from each file
blogs <- read_sample(blogs_path, 10000)
news <- read_sample(news_path, 10000)
twitter <- read_sample(twitter_path, 10000)

cat("Successfully loaded", length(blogs), "blogs,", length(news), "news articles, and", length(twitter), "tweets")

## Successfully loaded 10000 blogs, 10000 news articles, and 10000 tweets
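
Because readLines(con, n = n) returns the first n lines, the subset above reflects only the beginning of each file. A random sample could be drawn instead by reading the file in chunks and keeping each line with a fixed probability. The following is a minimal sketch, not the code used for this report; the function name read_random_sample and its parameters prob, chunk_size, and seed are hypothetical.

```r
# Hypothetical helper: keep each line with probability `prob`,
# reading the file in chunks so it never has to fit in memory at once.
read_random_sample <- function(path, prob = 0.01, chunk_size = 50000, seed = 1234) {
  stopifnot(file.exists(path), prob > 0, prob <= 1)
  set.seed(seed)                      # make the sample reproducible
  con <- file(path, "rb")             # binary mode avoids early stops on control characters
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE,
                       skipNul = TRUE, encoding = "UTF-8")
    if (length(chunk) == 0) break     # end of file
    keep <- rbinom(length(chunk), size = 1, prob = prob) == 1
    kept <- c(kept, chunk[keep])
  }
  kept
}

# Example call (path as defined earlier in the report):
# blogs_random <- read_random_sample("final/en_US/en_US.blogs.txt", prob = 0.01)
```

The number of lines returned is itself random, roughly prob times the number of lines in the file, so prob would need tuning to match the memory available.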

2. Basic Summary Statistics

library(stringi)

# Calculate file sizes in MB
file_size <- function(path) {
  round(file.info(path)$size / 1024^2, 2)
}

# Create summary table
summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = c(file_size(blogs_path), file_size(news_path), file_size(twitter_path)),
  Lines_Sampled = c(length(blogs), length(news), length(twitter)),
  Words_Sampled = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

summary_table

##      File Size_MB Lines_Sampled Words_Sampled
## 1   Blogs  200.42         10000        413215
## 2    News  196.28         10000        349062
## 3 Twitter  159.36         10000        126736

3. Word Frequency Analysis (Simplified - No DTM)

library(ggplot2)

# Combine sampled data
all_text <- c(blogs, news, twitter)

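# Split into words on whitespace and punctuation
# (note: this also splits contractions, e.g. "don't" becomes "don" and "t")
all_words <- unlist(strsplit(tolower(all_text), "[[:space:][:punct:]]+"))

# Remove empty strings and purely numeric tokens (e.g. "7", "2012")
all_words <- all_words[all_words != "" & !grepl("^[0-9]+$", all_words)]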

# Get frequency table
word_freq <- sort(table(all_words), decreasing = TRUE)

# Take top 20
top_words <- data.frame(
  Word = names(word_freq[1:20]), 
  Count = as.numeric(word_freq[1:20])
)

# Plot
ggplot(top_words, aes(x = reorder(Word, -Count), y = Count)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency")

4. Key Findings

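The Twitter file has the most lines but the smallest file size (159.36 MB) because tweets are capped in length. In the sample, tweets average about 13 words per line, compared with about 41 for blogs and 35 for news.
Common stop words (“the”, “and”, “for”, “you”) dominate the frequency counts.
Profanity and slang appear frequently in the Twitter sample, so a profanity filter will be needed before building the model.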

5. Plan for Prediction Algorithm and Shiny App

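Algorithm: I will build an n-gram model (sequences of two to three words) with back-off. When a user types a phrase, the app will look up the most frequent word that follows the last two words in the n-gram tables. If that two-word context has never been seen, it will back off to the last word alone, and finally to the most frequent single word.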
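
The lookup described above can be sketched in base R as follows. This is an illustration on a four-sentence toy corpus, not the final model; the helper names make_ngrams, count_ngrams, best_continuation, and predict_next are hypothetical, and the real tables would be built from the sampled text.

```r
# Toy corpus standing in for the sampled blogs/news/twitter text (assumption)
corpus <- c("i want to go home", "i want to eat",
            "you want to go out", "go home now")

# Tokenize: lowercase, keep letters and apostrophes, drop empty tokens
tokens_list <- lapply(strsplit(tolower(corpus), "[^a-z']+"),
                      function(x) x[x != ""])

# All n-grams of one tokenized sentence, each as a space-separated string
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of all n-grams of order n in the corpus
count_ngrams <- function(n) table(unlist(lapply(tokens_list, make_ngrams, n = n)))
unigrams <- count_ngrams(1)
bigrams  <- count_ngrams(2)
trigrams <- count_ngrams(3)

# Last word of the most frequent n-gram that starts with `prefix`, or NA
best_continuation <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^.* ", "", names(hits)[which.max(hits)])
}

# Back-off: try the last two words, then the last word, then the top unigram
predict_next <- function(phrase) {
  words <- strsplit(tolower(phrase), "[^a-z']+")[[1]]
  words <- words[words != ""]
  n <- length(words)
  if (n >= 2) {
    guess <- best_continuation(trigrams, paste(words[(n - 1):n], collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {
    guess <- best_continuation(bigrams, words[n])
    if (!is.na(guess)) return(guess)
  }
  names(unigrams)[which.max(unigrams)]
}

predict_next("I want to")   # trigram context "want to" is found
predict_next("you")         # no two-word context, so the bigram table is used
```

Scanning every n-gram name with startsWith is fine for a toy table but would be slow at full scale; a keyed lookup table would be a more realistic choice for the app.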

Shiny App: The app will have a simple text input box. As the user types, the predicted next word will appear below. The interface is displayed in the browser, while the prediction itself runs in R on the Shiny server, so the lookup must be fast enough to feel real-time.
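
A minimal sketch of such an interface is shown here, assuming the shiny package is installed. The predictor is a stand-in: predict_next_stub and its two-entry lookup are hypothetical placeholders for the n-gram model, and the input and output IDs are illustrative.

```r
library(shiny)

# Placeholder predictor (hypothetical): a tiny lookup on the last word typed.
# The real app would call the n-gram model here instead.
predict_next_stub <- function(phrase) {
  lookup <- c(thank = "you", of = "the")
  words <- strsplit(trimws(tolower(phrase)), "[[:space:]]+")[[1]]
  if (length(words) == 0) return("")
  last_word <- words[length(words)]
  if (last_word %in% names(lookup)) lookup[[last_word]] else "the"
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  strong("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  # Re-runs whenever input$phrase changes; shows nothing for blank input
  output$prediction <- renderText({
    req(nzchar(trimws(input$phrase)))
    predict_next_stub(input$phrase)
  })
}

app <- shinyApp(ui, server)
# To launch interactively: runApp(app)
```

Because renderText is reactive, no explicit "predict" button is needed; the output updates as the text box changes.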

6. Next Steps

Build n-gram tables from a larger random sample of the full dataset
Implement Stupid Backoff scoring to rank candidate words
Create the Shiny app interface
Deploy to shinyapps.io
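
As a rough illustration of the Stupid Backoff step, the score of a candidate word is its relative frequency at the highest n-gram order where it has been seen, multiplied by a fixed factor alpha for each level of back-off (0.4 is the value commonly used with this method). The scores are not true probabilities. The counts below are made up for the example, and the names count_of and stupid_backoff are hypothetical.

```r
# Hypothetical toy counts; real values would come from the n-gram tables
trigram_counts <- c("want to go" = 2, "want to eat" = 1)
bigram_counts  <- c("want to" = 3, "to go" = 2, "to eat" = 1, "to sleep" = 1)
unigram_counts <- c(want = 3, to = 4, go = 2, eat = 1, sleep = 1)

# Count lookup that returns 0 for unseen keys
count_of <- function(counts, key) {
  if (key %in% names(counts)) counts[[key]] else 0
}

# Score candidate w3 given the context (w1, w2).
# Assumes the tables are consistent: a seen trigram implies a seen bigram context.
stupid_backoff <- function(w1, w2, w3, alpha = 0.4) {
  tri <- count_of(trigram_counts, paste(w1, w2, w3))
  if (tri > 0) return(tri / count_of(bigram_counts, paste(w1, w2)))

  bi <- count_of(bigram_counts, paste(w2, w3))
  if (bi > 0) return(alpha * bi / count_of(unigram_counts, w2))

  # Last resort: unigram relative frequency, penalized twice
  alpha^2 * count_of(unigram_counts, w3) / sum(unigram_counts)
}

stupid_backoff("want", "to", "go")     # seen trigram: 2 / 3
stupid_backoff("want", "to", "sleep")  # backs off to the bigram: 0.4 * 1 / 4
```

To predict, the app would score every candidate word for the current context and return the highest-scoring ones.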