📌 Project Goal

The purpose of this milestone report is to demonstrate that I have successfully downloaded and loaded the SwiftKey dataset, explored its basic structure, and am ready to build a predictive text algorithm and a Shiny app.

This report presents key features of the data and outlines a high-level plan, written to be understandable by non-technical stakeholders.


📂 Data Loading

We are using the three English-language corpora provided by SwiftKey: blogs, news, and Twitter. Each file is read with UTF-8 encoding, skipping embedded null characters:

blogs <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("C:/Users/oll31/Downloads/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

🧾 Summary Statistics

data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Max_Characters = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter))),
  Avg_Characters = c(mean(nchar(blogs)), mean(nchar(news)), mean(nchar(twitter)))
)

data_summary
##    Source   Lines Max_Characters Avg_Characters
## 1   Blogs  899288          40833      229.98695
## 2    News 1010206          11384      201.16149
## 3 Twitter 2360148            140       68.68054
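
For completeness, total word counts per source can be computed with the same whitespace-token count used for the plot below. This is a small illustrative addition rather than part of the original summary table; the object name word_totals is only a placeholder.

library(stringr)

# Total whitespace-separated tokens (words) per source
word_totals <- c(
  Blogs   = sum(str_count(blogs, "\\S+")),
  News    = sum(str_count(news, "\\S+")),
  Twitter = sum(str_count(twitter, "\\S+"))
)
word_totals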

📊 Word Count Distribution (Blogs)

library(stringr)
library(ggplot2)

# Count whitespace-separated tokens (words) on each blog line
blog_word_counts <- str_count(blogs, "\\S+")

# Histogram of words per line (ggplot() used instead of the deprecated qplot())
ggplot(data.frame(words = blog_word_counts), aes(x = words)) +
  geom_histogram(bins = 50) +
  labs(title = "Word Count Distribution in Blogs",
       x = "Words per Line", y = "Frequency")


🔍 Initial Findings

  - The corpora are large: roughly 0.9 million blog lines, 1.0 million news lines, and 2.4 million tweets.
  - Twitter lines are the shortest, averaging about 69 characters and never exceeding 140 (the platform's historical limit), while blog and news lines average roughly 230 and 201 characters.
  - Blogs contain the longest individual lines (over 40,000 characters), so working with a random sample will likely be necessary to keep memory use and processing time manageable.

🎯 Next Steps

To develop the prediction model and Shiny app (a short R sketch of steps 1-3 follows the list):

  1. Clean the data (remove punctuation, numbers, stopwords, etc.)
  2. Tokenize into n-grams (unigrams, bigrams, trigrams)
  3. Build a predictive model using n-gram frequencies and smoothing
  4. Deploy using a Shiny web application
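
A minimal sketch of steps 1-3, assuming we work on a small random sample of the blogs corpus; the object names (sample_text, bigram_freq) are illustrative, and stop-word removal and smoothing are omitted for brevity.

set.seed(123)
sample_text <- sample(blogs, 10000)

# Step 1: basic cleaning -- lower-case, keep only letters, apostrophes, spaces
clean <- tolower(sample_text)
clean <- gsub("[^a-z' ]", " ", clean)
clean <- gsub("\\s+", " ", trimws(clean))

# Step 2: tokenize each line into words, then pair adjacent words into bigrams
tokens <- strsplit(clean, " ", fixed = TRUE)
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Step 3: frequency table of bigrams -- the basis of a simple
# "most frequent continuation" predictor
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq)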

🚀 Final Goal

A Shiny app that suggests the next word based on user input, leveraging a trained n-gram model and fast lookup.
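
As an illustration of the intended lookup, here is a toy next-word function built on the bigram_freq table from the sketch above. The function name and defaults are hypothetical; the final app would use precomputed, trimmed lookup tables for speed.

# Suggest the most frequent continuations of a single word
predict_next <- function(word, freq = bigram_freq, n = 3) {
  # keep bigrams whose first word matches the input
  hits <- freq[startsWith(names(freq), paste0(tolower(word), " "))]
  # drop the first word of each matching bigram to get the suggestions
  sub("^\\S+ ", "", names(head(hits, n)))
}

predict_next("thank")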