📌 Project Goal

The purpose of this milestone report is to demonstrate that I have successfully downloaded and loaded the SwiftKey dataset, explored its basic structure, and am ready to build a predictive text algorithm and a Shiny app.

This report presents key features of the data and outlines a high-level plan, written to be understandable by non-technical stakeholders.


📂 Data Loading

We are using the three English-language corpora provided by SwiftKey: blogs, news, and Twitter. Each file is read with UTF-8 encoding, skipping embedded null characters:

blogs <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

🧾 Summary Statistics

data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Max_Characters = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))),
  Avg_Characters = c(mean(nchar(blogs)), mean(nchar(news)), mean(nchar(twitter)))
)

data_summary
##    Source   Lines Max_Characters Avg_Characters
## 1   Blogs  899288          40833      229.98695
## 2    News 1010206          11384      201.16149
## 3 Twitter 2360148            140       68.68054
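
For completeness, total word counts per source can be computed with the same whitespace-token count used for the plot below. This is a small illustrative addition rather than part of the original summary table; the object name word_totals is only a placeholder.

library(stringr)

# Total whitespace-separated tokens (words) per source
word_totals <- c(
  Blogs   = sum(str_count(blogs, "\\S+")),
  News    = sum(str_count(news, "\\S+")),
  Twitter = sum(str_count(twitter, "\\S+"))
)
word_totals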

📊 Word Count Distribution (Blogs)

library(stringr)
library(ggplot2)

# Count whitespace-separated tokens (words) on each blog line
blog_word_counts <- str_count(blogs, "\\S+")

# Histogram of words per line (ggplot() used instead of the deprecated qplot())
ggplot(data.frame(words = blog_word_counts), aes(x = words)) +
  geom_histogram(bins = 50) +
  labs(title = "Word Count Distribution in Blogs",
       x = "Words per Line", y = "Frequency")


🔍 Initial Findings

  - The corpora are large: roughly 0.9 million blog lines, 1.0 million news lines, and 2.4 million tweets.
  - Twitter lines are the shortest, averaging about 69 characters and never exceeding 140 (the platform's historical limit), while blog and news lines average roughly 230 and 201 characters.
  - Blogs contain the longest individual lines (over 40,000 characters), so working with a random sample will likely be necessary to keep memory use and processing time manageable.

🎯 Next Steps

To develop the prediction model and Shiny app (a short R sketch of steps 1-3 follows the list):

  1. Clean the data (remove punctuation, numbers, stopwords, etc.)
  2. Tokenize into n-grams (unigrams, bigrams, trigrams)
  3. Build a predictive model using n-gram frequencies and smoothing
  4. Deploy using a Shiny web application
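
A minimal sketch of steps 1-3, assuming we work on a small random sample of the blogs corpus; the object names (sample_text, bigram_freq) are illustrative, and stop-word removal and smoothing are omitted for brevity.

set.seed(123)
sample_text <- sample(blogs, 10000)

# Step 1: basic cleaning -- lower-case, keep only letters, apostrophes, spaces
clean <- tolower(sample_text)
clean <- gsub("[^a-z' ]", " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))

# Step 2: tokenize each line into words, then pair adjacent words into bigrams
tokens <- strsplit(clean, " ", fixed = TRUE)
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Step 3: frequency table of bigrams -- the basis of a simple
# "most frequent continuation" predictor
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq)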

🚀 Final Goal

A Shiny app that suggests the next word based on user input, leveraging a trained n-gram model and fast lookup.
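
As an illustration of the intended lookup, here is a toy next-word function built on the bigram_freq table from the sketch above. The function name and defaults are hypothetical; the final app would use precomputed, trimmed lookup tables for speed.

# Suggest the most frequent continuations of a single word
predict_next <- function(word, freq = bigram_freq, n = 3) {
  # keep bigrams whose first word matches the input
  hits <- freq[startsWith(names(freq), paste0(tolower(word), " "))]
  # drop the first word of each matching bigram to get the suggestions
  sub("^\\S+ ", "", names(head(hits, n)))
}

predict_next("thank")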