Introduction

This milestone report summarizes the exploratory analysis of the text data provided for the Coursera Data Science Capstone project. The objective of the project is to build a predictive text model using data from blogs, news, and Twitter. This report highlights key features of the data, summarizes initial findings, and outlines a plan for building a predictive algorithm and deploying it in a Shiny app.

Data Summary

The dataset includes three English text files, one each from blogs, news, and Twitter. We computed basic summaries of each file, including the number of lines and the total word count.

library(stringi)
# skipNul = TRUE avoids the "embedded nul" warnings raised by the Twitter file.
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)), 
            sum(stri_count_words(news)), 
            sum(stri_count_words(twitter)))
)
knitr::kable(data_summary)
Source      Lines       Words
Blogs      899288    37546806
News      1010206    34761151
Twitter   2360148    30096649

Exploratory Data Analysis

To better understand the text data, we analyzed the distribution of line lengths and the most frequent terms.

Line Length Distribution

library(ggplot2)
blog_lengths <- nchar(blogs)
# qplot() is deprecated in ggplot2 >= 3.4.0, so we call ggplot() directly.
ggplot(data.frame(length = blog_lengths), aes(x = length)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Blog Post Lengths", x = "Characters")

Most Frequent Words

We tokenized the text data and built a frequency table of the most common unigrams (single words), with stop words removed. As the output below shows, numbers and contraction tokens such as "it's" still rank highly, which the cleaning step planned in the next section will address.

library(dplyr)
library(tidytext)
library(tibble)

blog_df <- data.frame(text = blogs)
blog_tokens <- blog_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

head(blog_tokens, 10)
##      word     n
## 1    time 90920
## 2  people 59575
## 3     day 52373
## 4    love 45230
## 5    life 41254
## 6    it’s 38660
## 7       1 30907
## 8       2 29561
## 9   world 29306
## 10    i’m 29192

Plans for the Prediction Algorithm

The next steps in this project include:

  1. Data Cleaning: Remove profanity, punctuation, numbers, and excess whitespace, and normalize the text to lowercase (a sketch of this step, combined with tokenization, follows this list).
  2. N-Gram Tokenization: Build frequency tables of unigrams, bigrams, and trigrams.
  3. Modeling: Build a probabilistic next-word prediction model from the n-gram counts using a smoothing or backoff strategy such as Kneser-Ney smoothing or Stupid Backoff, with TF-IDF weighting used to explore term importance (a Stupid Backoff sketch appears below).
  4. Shiny App: Create a user-friendly interface that lets users type a phrase and receive next-word predictions in real time (a minimal skeleton appears below).
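
To make steps 1 and 2 concrete, the sketch below shows one way the cleaning and bigram tokenization could be done with tidytext and stringr on a sample of the blog lines. It is illustrative only: the sample size is arbitrary, profanity_words is a placeholder for whichever profanity lexicon we adopt, and the exact cleaning rules may change.

library(dplyr)
library(tidytext)
library(stringr)

# Illustrative sketch: clean a sample of the blog text, then count bigrams.
set.seed(123)
sample_text <- sample(blogs, 10000)

clean_text <- sample_text %>%
  tolower() %>%                          # normalize case
  str_replace_all("[0-9]+", " ") %>%     # drop numbers
  str_replace_all("[^a-z' ]", " ") %>%   # drop punctuation except apostrophes
  str_squish()                           # collapse excess whitespace

# Placeholder profanity list; a real lexicon would be loaded from a file.
profanity_words <- c("badword1", "badword2")

bigrams <- data.frame(text = clean_text) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  tidyr::separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% profanity_words, !word2 %in% profanity_words) %>%
  count(word1, word2, sort = TRUE)

head(bigrams)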
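
For step 3, the function below is a minimal sketch of the Stupid Backoff idea, assuming trigram, bigram, and unigram count tables (here named trigrams, bigrams, and unigrams, each with a count column n) that we have not built yet. It scores trigram continuations of the last two words first, then backs off to bigram and unigram counts with a fixed penalty (conventionally 0.4) when no higher-order match exists.

library(dplyr)

# Sketch only: the count tables are assumed, not yet built.
# trigrams: columns word1, word2, word3, n
# bigrams:  columns word1, word2, n
# unigrams: columns word1, n
predict_next <- function(w1, w2, trigrams, bigrams, unigrams,
                         lambda = 0.4, top_n = 3) {
  # Trigram continuations of (w1, w2).
  tri <- trigrams %>%
    filter(word1 == w1, word2 == w2) %>%
    mutate(score = n / sum(n)) %>%
    select(word = word3, score)
  if (nrow(tri) > 0) return(head(arrange(tri, desc(score)), top_n))

  # Back off to bigrams starting with the last word, with a fixed penalty.
  bi <- bigrams %>%
    filter(word1 == w2) %>%
    mutate(score = lambda * n / sum(n)) %>%
    select(word = word2, score)
  if (nrow(bi) > 0) return(head(arrange(bi, desc(score)), top_n))

  # Final fallback: the most frequent unigrams.
  unigrams %>%
    mutate(score = lambda^2 * n / sum(n)) %>%
    arrange(desc(score)) %>%
    select(word = word1, score) %>%
    head(top_n)
}

Once the tables exist, a call such as predict_next("one", "of", trigrams, bigrams, unigrams) would return the top-scoring candidate next words.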
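
For step 4, the skeleton below sketches the planned Shiny interface. It assumes the predict_next() function and the n-gram tables from the previous sketch are available in the app environment; the real app will add input cleaning, profanity filtering, and a more polished layout.

library(shiny)

# Minimal skeleton; predict_next() and the n-gram tables are assumed to exist.
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Enter a phrase:", value = ""),
  tableOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderTable({
    words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
    if (length(words) < 2) return(NULL)
    n <- length(words)
    predict_next(words[n - 1], words[n], trigrams, bigrams, unigrams)
  })
}

shinyApp(ui = ui, server = server)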

Conclusion

This report presents a high-level overview of the initial exploratory data analysis and outlines a roadmap for building a predictive model and Shiny app. The data is rich and suitable for natural language modeling, and we are now in a strong position to move forward with algorithm development.