This report analyzes the SwiftKey dataset for the Coursera Data Science Capstone. The goal is to build a word prediction application.
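To avoid memory issues, we read only the first 10,000 lines of each file rather than the full corpus. These lines are not drawn at random, so the figures below describe the beginning of each file.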
# Set file paths (adjust if your files are in a different location)
blogs_path <- "final/en_US/en_US.blogs.txt"
news_path <- "final/en_US/en_US.news.txt"
twitter_path <- "final/en_US/en_US.twitter.txt"
# Function to safely read a sample of lines
read_sample <- function(path, n = 10000) {
  if (!file.exists(path)) {
    stop(paste("File not found:", path))
  }
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con))
  lines <- readLines(con, n = n, warn = FALSE, skipNul = TRUE)
  return(lines)
}
# Read 10,000 lines from each file
blogs <- read_sample(blogs_path, 10000)
news <- read_sample(news_path, 10000)
twitter <- read_sample(twitter_path, 10000)
cat("Successfully loaded", length(blogs), "blogs,", length(news), "news articles, and", length(twitter), "tweets")
## Successfully loaded 10000 blogs, 10000 news articles, and 10000 tweets
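The loader above takes the first `n` lines of each file. If a random sample is preferred, one possible approach is to stream the file in chunks and keep each line with a fixed probability. The sketch below is only an illustration: the function name `sample_lines` and its parameters (`prob`, `chunk_size`, `seed`) are assumptions, not part of the original analysis.

```r
# Hypothetical helper: keep each line with probability `prob`,
# reading the file in chunks so it is never held in memory all at once.
sample_lines <- function(path, prob = 0.01, chunk_size = 50000, seed = 1234) {
  if (!file.exists(path)) {
    stop(paste("File not found:", path))
  }
  set.seed(seed)  # fixed seed so the sample is reproducible
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con))
  kept <- character(0)
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0) break  # end of file
    # One coin flip per line in this chunk
    keep <- rbinom(length(chunk), 1, prob) == 1
    kept <- c(kept, chunk[keep])
  }
  kept
}

# Possible usage (path variable as defined earlier in the report):
# blogs <- sample_lines(blogs_path, prob = 0.01)
```

Unlike a fixed line count, this returns a sample whose size varies around `prob` times the number of lines in the file.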
library(stringi)
# Calculate file sizes in MB
file_size <- function(path) {
  round(file.info(path)$size / 1024^2, 2)
}
# Create summary table
summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = c(file_size(blogs_path), file_size(news_path), file_size(twitter_path)),
  Lines_Sampled = c(length(blogs), length(news), length(twitter)),
  Words_Sampled = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
summary_table
##      File Size_MB Lines_Sampled Words_Sampled
## 1   Blogs  200.42         10000        413215
## 2    News  196.28         10000        349062
## 3 Twitter  159.36         10000        126736
library(ggplot2)
# Combine sampled data
all_text <- c(blogs, news, twitter)
# Split into words
all_words <- unlist(strsplit(tolower(all_text), "[[:space:][:punct:]]+"))
# Remove empty strings and purely numeric tokens (any number of digits)
all_words <- all_words[all_words != "" & !grepl("^[0-9]+$", all_words)]
# Get frequency table
word_freq <- sort(table(all_words), decreasing = TRUE)
# Take top 20
top_words <- data.frame(
  Word = names(word_freq[1:20]),
  Count = as.numeric(word_freq[1:20])
)
# Plot
ggplot(top_words, aes(x = reorder(Word, -Count), y = Count)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency")
## 4. Key Findings
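Algorithm: I will build an n-gram model over sequences of 2 to 3 words (bigrams and trigrams) with back-off. When a user types a phrase, the app will look up the word that most frequently follows the last 2 words in our n-gram tables. If those 2 words never appear together in the data, it will back off to the last word alone.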
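The back-off idea can be sketched in base R as follows. This is a minimal illustration on a toy corpus, not the final model: the helper names (`tokenize`, `count_ngrams`, `predict_next`) are hypothetical, and the final fallback to the most frequent single word is an assumption added so the function always returns something.

```r
# Split text into lowercase word tokens (same splitting rule as the analysis above)
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[[:space:][:punct:]]+"))
  words[words != ""]
}

# Count n-grams line by line, so no n-gram spans two lines
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    idx <- seq_len(length(w) - n + 1)
    vapply(idx, function(i) paste(w[i:(i + n - 1)], collapse = " "), character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Try the last 2 words against trigrams, then the last word against bigrams,
# then fall back to the most frequent unigram (assumed fallback).
predict_next <- function(phrase, unigrams, bigrams, trigrams) {
  w <- tokenize(phrase)
  for (k in c(2, 1)) {
    tbl <- if (k == 2) trigrams else bigrams
    if (length(w) < k || length(tbl) == 0) next
    context <- paste(tail(w, k), collapse = " ")
    hits <- tbl[startsWith(names(tbl), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  names(unigrams)[which.max(unigrams)]
}

# Toy corpus for illustration only
toy <- c("i love data science", "i love data analysis",
         "i love data science too", "we love cats")
uni <- count_ngrams(toy, 1)
bi  <- count_ngrams(toy, 2)
tri <- count_ngrams(toy, 3)

predict_next("I love data", uni, bi, tri)  # trigram match on "love data"
predict_next("they love", uni, bi, tri)    # backs off to bigrams on "love"
```

Scanning table names with `startsWith` is simple but linear in the table size. A real app would likely pre-split each n-gram into a context and a next word and index by context.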
Shiny App: The app will have a simple text input box. As the user types, the predicted next word will appear below it. The interface is displayed in the browser, while predictions are computed by the R process behind the app, so lookups must be fast enough to respond in real time.
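A possible skeleton for such an app is shown below, assuming the `shiny` package is installed. The input and output IDs (`phrase`, `prediction`) are illustrative, and `predict_next` here is only a placeholder stub standing in for the real n-gram lookup.

```r
library(shiny)

# Placeholder predictor (hypothetical): a real app would query the n-gram model here
predict_next <- function(phrase) {
  if (!nzchar(trimws(phrase))) "" else "the"
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  strong("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output, session) {
  # Shiny re-runs this block whenever input$phrase changes
  output$prediction <- renderText({
    predict_next(input$phrase)
  })
}

app <- shinyApp(ui = ui, server = server)
# Launch from an interactive session with: runApp(app)
```

Because `renderText` depends on `input$phrase`, the prediction updates as the user types without any explicit submit button.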