The purpose of this milestone report is to show that the SwiftKey dataset has been successfully downloaded and loaded, to present key features of the data, and to outline a high-level plan for building a predictive text algorithm and a Shiny app. The report is written to be understandable by non-technical stakeholders.
We are using the English-language corpora provided by SwiftKey, which consist of three text files: blogs, news articles, and tweets. Each file is loaded with readLines(), preserving UTF-8 encoding and skipping embedded nul characters:
blogs <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Max_Characters = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))),
  Avg_Characters = c(mean(nchar(blogs)), mean(nchar(news)), mean(nchar(twitter)))
)
data_summary
##    Source   Lines Max_Characters Avg_Characters
## 1   Blogs   899288          40833      229.98695
## 2    News  1010206          11384      201.16149
## 3 Twitter  2360148            140       68.68054
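Given these sizes (over four million lines across the three sources), the later cleaning and modeling steps will likely run on a random sample rather than the full corpora. Below is a minimal sketch of that sampling step, where the 1% fraction and the seed are illustrative assumptions rather than final choices:

```r
set.seed(2024)       # fix the RNG so the sample is reproducible
sample_frac <- 0.01  # assumed: 1% of each source is enough for prototyping

# Draw a uniform random subset of lines from a character vector
sample_lines <- function(x) sample(x, ceiling(length(x) * sample_frac))

blogs_sample   <- sample_lines(blogs)
news_sample    <- sample_lines(news)
twitter_sample <- sample_lines(twitter)
```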
library(stringr)
library(ggplot2)

# Count whitespace-delimited words on each blog line
blog_word_counts <- str_count(blogs, "\\S+")

# qplot() is deprecated in recent ggplot2, so build the histogram with ggplot()
ggplot(data.frame(words = blog_word_counts), aes(x = words)) +
  geom_histogram(bins = 50) +
  labs(title = "Word Count Distribution in Blogs",
       x = "Words per Line", y = "Frequency")
To develop the prediction model and Shiny app, the next steps are:

- Clean and tokenize the corpora, then build n-gram frequency tables (unigrams, bigrams, trigrams).
- Train a next-word prediction model on those tables, keeping the lookup structures small enough for fast interactive use.
- Build a Shiny app that suggests the next word based on user input, leveraging the trained n-gram model and fast lookup (a rough sketch of that lookup follows).
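To make the "n-gram model and fast lookup" idea concrete, here is a minimal bigram-based sketch in base R. The sample size, the simple whitespace tokenization, and the `predict_next` helper are illustrative assumptions, not the final design:

```r
set.seed(42)
text  <- tolower(sample(blogs, 10000))  # small assumed sample for speed
words <- unlist(strsplit(text, "\\s+"))
words <- words[words != ""]

# Frequency table of adjacent word pairs, keyed as "w1 w2"
bigram_freq <- sort(table(paste(head(words, -1), tail(words, -1))),
                    decreasing = TRUE)

# Hypothetical helper: the n words most often seen after `last_word`
predict_next <- function(last_word, freq = bigram_freq, n = 3) {
  hits <- freq[startsWith(names(freq), paste0(tolower(last_word), " "))]
  if (length(hits) == 0) return(character(0))
  sub("^\\S+\\s+", "", names(head(hits, n)))
}

predict_next("in")  # returns up to 3 candidate next words
```

A production version would extend this to higher-order n-grams with a backoff strategy for unseen contexts and precomputed lookup tables, so the Shiny app can respond to user input without noticeable delay.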