The goal of this project is to build a predictive text model capable of suggesting the next word based on previously entered words. This milestone report focuses on exploratory analysis of the training data to understand its structure, size, and basic characteristics.
The dataset is provided by Coursera and contains text from blogs, news articles, and Twitter; this report uses the English (en_US) portion.
library(stringi)
library(ggplot2)
library(dplyr)
# Download and extract the corpus if it is not already present
# (mode = "wb" ensures the zip is written as binary on Windows)
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(
    "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = "Coursera-SwiftKey.zip",
    mode = "wb"
  )
  unzip("Coursera-SwiftKey.zip")
}
# Read the three English corpora; skipNul = TRUE drops embedded nul characters
blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Basic size statistics for each file
summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  Characters = c(
    sum(nchar(blogs)),
    sum(nchar(news)),
    sum(nchar(twitter))
  )
)
summary_table
##      File   Lines    Words Characters
## 1   Blogs  899288 37546806  206824505
## 2    News 1010206 34761151  203214543
## 3 Twitter 2360148 30096690  162096241
# Sample 1% of each source for faster exploratory analysis;
# round() keeps the sample size a whole number
set.seed(123)
sample_blogs   <- sample(blogs, round(length(blogs) * 0.01))
sample_news    <- sample(news, round(length(news) * 0.01))
sample_twitter <- sample(twitter, round(length(twitter) * 0.01))
sample_data    <- c(sample_blogs, sample_news, sample_twitter)
# Tokenize: lowercase, split on whitespace, strip punctuation so that
# "word" and "word," are counted as the same token
words <- unlist(strsplit(tolower(sample_data), "\\s+"))
words <- gsub("[[:punct:]]", "", words)
words <- words[words != ""]
word_freq <- sort(table(words), decreasing = TRUE)
top_words <- data.frame(
  word = names(word_freq)[1:20],
  freq = as.numeric(word_freq[1:20])
)
ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )
The final predictive model will be based on n-gram language models: the cleaned text will be tokenized into n-grams, and the most frequent continuation of the words the user has typed will be offered as the prediction, as sketched below.
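As an illustration of this approach, the sketch below builds a simple bigram table from the words vector computed earlier and looks up the most frequent follower of a given word. The predict_next_word() helper is a hypothetical name used only for illustration; the final model will likely use higher-order n-grams with a backoff strategy for unseen contexts.

# A minimal bigram table: pair each token with its successor.
# For simplicity this ignores sentence and document boundaries.
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical helper: given one word, return its most frequent follower.
predict_next_word <- function(word, freq = bigram_freq) {
  matches <- freq[startsWith(names(freq), paste0(tolower(word), " "))]
  if (length(matches) == 0) return(NA_character_)
  strsplit(names(matches)[1], " ")[[1]][2]
}

predict_next_word("of")  # most likely continuation of "of" in the sample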
A Shiny web application will be developed to provide real-time predictions in a simple and user-friendly interface.
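The skeleton below shows one possible shape for such an app, reusing the hypothetical predict_next_word() helper from the previous sketch; the widget IDs, layout, and labels are illustrative assumptions, not the final design.

library(shiny)

# Minimal Shiny skeleton: a text box and a reactive prediction
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    # Take the last word of the input and look up its likely follower
    tokens <- strsplit(trimws(tolower(input$phrase)), "\\s+")[[1]]
    if (length(tokens) == 0 || tokens[1] == "") return("")
    predict_next_word(tail(tokens, 1))
  })
}

shinyApp(ui, server)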