Overview

This report explores the English SwiftKey corpus provided through the Johns Hopkins University Data Science Capstone on Coursera. The dataset contains text from three sources: blogs, news articles, and Twitter. The goal is to build a next-word prediction algorithm and deploy it as a Shiny app.

1. Loading the Data

library(tm)       # text mining: corpus construction and cleaning
library(ggplot2)  # plotting

# Read each file line by line; skipNul = TRUE drops embedded NUL bytes
blogs <- readLines("D:/SEMESTER 6 HORE/Coursera/capstone/final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
news <- readLines("D:/SEMESTER 6 HORE/Coursera/capstone/final/en_US/en_US.news.txt",
                  encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("D:/SEMESTER 6 HORE/Coursera/capstone/final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE)
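
One caveat: on some platforms readLines() stops early in en_US.news.txt at an embedded control character, silently truncating the file. If the news line count looks suspiciously low, a common workaround is to read through a binary connection (a sketch using the same path as above):

# Open a binary connection so an embedded SUB/EOF byte cannot truncate the read
con  <- file("D:/SEMESTER 6 HORE/Coursera/capstone/final/en_US/en_US.news.txt",
             open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)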

2. Summary Statistics

# Line counts per source
num_lines <- c(length(blogs), length(news), length(twitter))

# Approximate word counts: split each line on single spaces
num_words <- c(sum(lengths(strsplit(blogs,   " "))),
               sum(lengths(strsplit(news,    " "))),
               sum(lengths(strsplit(twitter, " "))))

# File sizes in megabytes
file_size_mb <- round(
  file.info(c(
    "D:/SEMESTER 6 HORE/Coursera/capstone/final/en_US/en_US.blogs.txt",
    "D:/SEMESTER 6 HORE/Coursera/capstone/final/en_US/en_US.news.txt",
    "D:/SEMESTER 6 HORE/Coursera/capstone/final/en_US/en_US.twitter.txt"
  ))$size / 1e6, 1)

summary_table <- data.frame(
  Source   = c("Blogs", "News", "Twitter"),
  Lines    = num_lines,
  Words    = num_words,
  Size_MB  = file_size_mb
)
knitr::kable(summary_table, caption = "Summary of the Three Text Sources")
Summary of the Three Text Sources

Source        Lines        Words   Size_MB
Blogs       899,288   37,334,131     210.2
News      1,010,206   34,371,031     205.8
Twitter   2,360,148   30,373,583     167.1
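
Splitting on a single space is only a rough word count: runs of spaces yield empty tokens, and punctuation stays attached to words. For a more faithful count, stringi applies proper ICU word-boundary rules (a sketch, assuming the stringi package is installed):

library(stringi)

# stri_count_words() counts words by ICU boundary rules rather than raw spaces
num_words_stri <- c(sum(stri_count_words(blogs)),
                    sum(stri_count_words(news)),
                    sum(stri_count_words(twitter)))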

3. Word Frequency Analysis

set.seed(1234)  # reproducible sampling

# 1,000 random lines per source keep the corpus small enough to process quickly
small_sample <- c(sample(blogs,   1000),
                  sample(news,    1000),
                  sample(twitter, 1000))

# Build a corpus and normalise the text
corp2 <- VCorpus(VectorSource(small_sample))
corp2 <- tm_map(corp2, content_transformer(tolower))  # lower-case everything
corp2 <- tm_map(corp2, removePunctuation)             # strip punctuation
corp2 <- tm_map(corp2, removeNumbers)                 # strip digits
corp2 <- tm_map(corp2, stripWhitespace)               # collapse runs of spaces

# Term-document matrix -> overall term frequencies, sorted highest first
dtm     <- TermDocumentMatrix(corp2)
mat     <- as.matrix(dtm)
freq    <- sort(rowSums(mat), decreasing = TRUE)
df_freq <- data.frame(word = names(freq), frequency = freq)
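
No stopword filtering is applied above, so the most frequent terms will be common function words such as "the" and "and". If content words are of more interest, tm can drop its built-in English stopword list before the term-document matrix is built (an optional step, shown as a sketch):

# Optional: remove common English function words so content words
# rise to the top of the frequency table
corp2 <- tm_map(corp2, removeWords, stopwords("english"))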

4. Plots

Top 20 Most Frequent Words

ggplot(df_freq[1:20, ], aes(x = reorder(word, frequency), y = frequency)) +
  geom_col(fill = "#2e86ab") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words",
       x = "Word", y = "Frequency") +
  theme_minimal()

Number of Lines per Source

ggplot(summary_table, aes(x = Source, y = Lines, fill = Source)) +
  geom_col() +
  scale_fill_manual(values = c("#2e86ab", "#a23b72", "#f18f01")) +
  labs(title = "Number of Lines per Source",
       x = "Source", y = "Number of Lines") +
  theme_minimal()

Number of Words per Source

ggplot(summary_table, aes(x = Source, y = Words, fill = Source)) +
  geom_col() +
  scale_fill_manual(values = c("#2e86ab", "#a23b72", "#f18f01")) +
  labs(title = "Number of Words per Source",
       x = "Source", y = "Number of Words") +
  theme_minimal()

5. Key Findings

  - Twitter contributes the most lines (2.36 million) but the fewest words, reflecting its short message format; blogs have the fewest lines yet the most words, so blog entries are by far the longest on average.
  - Because stopwords are not removed, the frequency plot is dominated by common function words such as "the" and "and".
  - The full corpus is large (over 4.2 million lines, roughly 583 MB across the three files), so sampling and pruning will be essential when building the prediction model.

6. Plan for Prediction Algorithm

The next-word prediction model will use an n-gram approach:

  1. Build bigram, trigram, and 4-gram frequency tables
  2. Use Stupid Backoff to score candidate next words (a sketch follows this list)
  3. Prune rare n-grams to keep the lookup tables lightweight
  4. Deploy the model as a Shiny app where users type text and get instant predictions
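
To make step 2 concrete, here is a minimal sketch of a Stupid Backoff lookup. The table names and the prefix/word/count schema are assumptions for illustration; none of these tables have been built yet:

# Stupid Backoff (Brants et al., 2007): score candidates at the highest
# n-gram order that matches the typed context, multiplying the score by
# lambda (0.4) for every level backed off. `ngrams` is a list of frequency
# tables, highest order first; each is a data frame with columns prefix
# (preceding words, space-joined), word (candidate), and count.
predict_next <- function(history, ngrams, lambda = 0.4, top_k = 3) {
  discount <- 1
  for (tab in ngrams) {
    ctx <- lengths(strsplit(tab$prefix[1], " "))  # context size of this table
    if (length(history) >= ctx) {
      key  <- paste(tail(history, ctx), collapse = " ")
      hits <- tab[tab$prefix == key, ]
      if (nrow(hits) > 0) {
        hits$score <- discount * hits$count / sum(hits$count)
        hits <- hits[order(-hits$score), ]
        return(head(hits[, c("word", "score")], top_k))
      }
    }
    discount <- discount * lambda  # penalise each backoff step
  }
  data.frame(word = character(0), score = numeric(0))  # no match at any order
}

# Hypothetical usage once the tables exist:
# predict_next(c("thanks", "for"), list(trigrams, bigrams))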