This milestone report summarizes the exploratory analysis of the text data provided for the Coursera Data Science Capstone project. The objective of the project is to build a predictive text model using data from blogs, news, and Twitter. This report highlights key features of the data, summarizes initial findings, and outlines a plan for building a predictive algorithm and deploying it in a Shiny app.
The dataset includes three English text files from blogs, news, and Twitter. We performed basic summaries including file size, number of lines, and word counts.
library(stringi)
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8")
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul
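These warnings are triggered by embedded nul characters in the Twitter file. If desired, they can be avoided at read time with base R's skipNul argument (a minimal sketch):

twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)  # drop embedded nuls while reading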
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
knitr::kable(data_summary)
Source | Lines | Words
---|---:|---:
Blogs | 899288 | 37546806
News | 1010206 | 34761151
Twitter | 2360148 | 30096649
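The table covers lines and words; the file sizes mentioned earlier can be added with base R (a sketch, assuming the three files are in the working directory):

# File sizes in megabytes
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
round(file.size(files) / 1024^2, 1)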
To better understand the text data, we analyzed the distribution of line lengths and the most frequent terms.
library(ggplot2)
blog_lengths <- nchar(blogs)
ggplot(data.frame(length = blog_lengths), aes(x = length)) +
  geom_histogram(bins = 50) +
  labs(title = "Distribution of Blog Post Lengths", x = "Characters")
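The same check extends to the other two sources; a quick numeric summary of line lengths (a sketch, no plot needed) is:

# Five-number summaries of characters per line
summary(nchar(news))
summary(nchar(twitter))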
We tokenized the text data and created frequency tables of the most common unigrams (single words). Stop words were removed.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidytext)
## Warning: package 'tidytext' was built under R version 4.5.1
library(tibble)
blog_df <- data.frame(text = blogs)
blog_tokens <- blog_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)
head(blog_tokens, 10)
## word n
## 1 time 90920
## 2 people 59575
## 3 day 52373
## 4 love 45230
## 5 life 41254
## 6 it’s 38660
## 7 1 30907
## 8 2 29561
## 9 world 29306
## 10 i’m 29192
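The same tidytext pipeline extends to n-grams, which the prediction algorithm will rely on. A minimal sketch for bigram counts from the same blog data (trigrams are analogous with n = 3):

# Bigram frequencies; lines with fewer than two words yield NA bigrams, which are dropped
blog_bigrams <- blog_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)
head(blog_bigrams, 10)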
The next steps in this project include:

- Building n-gram frequency tables (bigrams and trigrams) from a cleaned sample of the combined corpus, extending the unigram analysis above.
- Developing a next-word prediction algorithm based on these tables, with a strategy such as backoff for word sequences not seen in training (a rough lookup sketch follows this list).
- Reducing the memory and runtime footprint of the model so it responds quickly in an interactive setting.
- Building and deploying a Shiny app that takes a phrase as input and displays the predicted next word.
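As a rough illustration of the lookup step (a sketch only; the final model will use higher-order n-grams and a fuller backoff scheme), the hypothetical helper below returns the most frequent words following a given word in the bigram table sketched above:

library(tidyr)
# Hypothetical helper: top next-word candidates for a single preceding word,
# assuming the blog_bigrams table from the sketch above (sorted by count)
predict_next <- function(prev, bigrams, top_n = 3) {
  bigrams %>%
    separate(bigram, into = c("w1", "w2"), sep = " ") %>%
    filter(w1 == prev) %>%
    slice_head(n = top_n) %>%
    pull(w2)
}
predict_next("happy", blog_bigrams)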
This report presents a high-level overview of the initial exploratory data analysis and outlines a roadmap for building a predictive model and Shiny app. The data is rich and suitable for natural language modeling, and we are now in a strong position to move forward with algorithm development.