Introduction

The goal of this project is to build a predictive text model capable of suggesting the next word based on previously entered words. This milestone report focuses on exploratory analysis of the training data to understand its structure, size, and basic characteristics.

The dataset is provided by Coursera in partnership with SwiftKey and contains English text from blogs, news articles, and Twitter.

Libraries

library(stringi)  # fast string utilities (word counts)
library(ggplot2)  # plotting
library(dplyr)    # data manipulation

Download and Load Data

if(!file.exists("Coursera-SwiftKey.zip")) {
  download.file(
    "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
    destfile = "Coursera-SwiftKey.zip",
    mode = "wb"  # binary mode so the zip is not corrupted on Windows
  )
}

# Unzip only if the extracted files are not already present
if(!file.exists("final/en_US/en_US.blogs.txt")) {
  unzip("Coursera-SwiftKey.zip")
}

# skipNul = TRUE drops embedded NUL characters present in some of these files
blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Basic Summary Statistics

summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  Characters = c(
    sum(nchar(blogs)),
    sum(nchar(news)),
    sum(nchar(twitter))
  )
)

summary_table
##      File   Lines    Words Characters
## 1   Blogs  899288 37546806  206824505
## 2    News 1010206 34761151  203214543
## 3 Twitter 2360148 30096690  162096241
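
Relating words to lines makes the structural differences concrete: tweets average far fewer words per line than blog or news entries, consistent with Twitter's message-length limit. A quick statistic derived from the table above:

summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 1)
summary_table[, c("File", "WordsPerLine")]
##      File WordsPerLine
## 1   Blogs         41.8
## 2    News         34.4
## 3 Twitter         12.8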

Data Sampling

set.seed(123)  # reproducible sampling

# Take a 1% random sample of each source to keep the analysis fast
sample_blogs   <- sample(blogs, round(length(blogs) * 0.01))
sample_news    <- sample(news, round(length(news) * 0.01))
sample_twitter <- sample(twitter, round(length(twitter) * 0.01))

sample_data <- c(sample_blogs, sample_news, sample_twitter)

Word Frequency Analysis

# Lower-case, split on whitespace, strip punctuation, and drop empty tokens
# (without stripping punctuation, "the," and "the" would count separately)
words <- unlist(strsplit(tolower(sample_data), "\\s+"))
words <- gsub("[[:punct:]]", "", words)
words <- words[words != ""]

word_freq <- sort(table(words), decreasing = TRUE)

top_words <- data.frame(
  word = names(word_freq)[1:20],
  freq = as.numeric(word_freq[1:20])
)

Visualization

ggplot(top_words, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )

Key Findings

- Twitter supplies the most lines (about 2.36 million) but the fewest words, averaging roughly 13 words per line versus about 42 for blogs and 34 for news, consistent with its short-message format.
- Each file is large (30-38 million words), so working with a 1% sample keeps the exploratory analysis responsive without losing the broad frequency patterns.
- The most frequent words are dominated by common English stopwords such as "the", "to", and "and", which suggests that raw word frequency alone carries little predictive signal and that n-gram context will be needed.

Plan for Final Application

The final predictive model will be based on n-gram language models: the cleaned text will be tokenized into n-grams (sequences of n consecutive words), and the observed n-gram frequencies will be used to predict the most likely next word given the words a user has entered.
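
As a concrete illustration, here is a minimal sketch of a frequency-based bigram predictor built from the sampled data. It is illustrative only: predict_next is a hypothetical helper, and the final model will add higher-order n-grams, smoothing, and backoff.

# Illustrative bigram model built from the cleaned `words` vector above.
# Note: this naive pairing crosses sentence and line boundaries.
bigrams <- paste(head(words, -1), tail(words, -1))
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Hypothetical helper: the word most often observed after `input`
# (assumes `input` is a plain word with no regex metacharacters)
predict_next <- function(input) {
  hits <- grep(paste0("^", input, " "), names(bigram_freq), value = TRUE)
  if (length(hits) == 0) return(NA_character_)
  sub("^\\S+ ", "", hits[1])  # drop the first word, keep the continuation
}

predict_next("in")  # e.g., the word most often seen after "in" in the sample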

A Shiny web application will be developed to provide real-time predictions in a simple and user-friendly interface.
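
A minimal sketch of such an app, assuming the hypothetical predict_next() helper above is available, might look like this:

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    # Tokenize the input the same way as the training text
    words <- strsplit(tolower(input$phrase), "\\s+")[[1]]
    words <- words[words != ""]
    if (length(words) == 0) return("")
    pred <- predict_next(tail(words, 1))  # hypothetical helper sketched above
    if (is.na(pred)) "No prediction available" else pred
  })
}

shinyApp(ui, server)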