This is the exploratory data analysis (EDA) report for the Capstone Project. The objective is to analyze and understand the data sets from blogs, news, and Twitter to prepare for building a text prediction model and a Shiny app.
# Load libraries
library(stringi)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load data
# Assume the files are saved at: D:/DSR301c/final/en_US/
blogs <- readLines("D:/DSR301c/final/en_US/en_US.blogs.txt", warn = FALSE)
news <- readLines("D:/DSR301c/final/en_US/en_US.news.txt", warn = FALSE)
twitter <- readLines("D:/DSR301c/final/en_US/en_US.twitter.txt", warn = FALSE)
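The Twitter file reportedly contains embedded nul bytes, which can trigger warnings from readLines(). A hedged alternative, assuming the same local path as above, is to pass an explicit encoding and skipNul = TRUE:
# Alternative read: skipNul = TRUE drops embedded nul bytes and
# encoding = "UTF-8" makes the character handling explicit.
twitter <- readLines("D:/DSR301c/final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE, warn = FALSE)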
# Line counts
length(blogs); length(news); length(twitter)
## [1] 899288
## [1] 1010206
## [1] 2360148
# Word counts
sum(stri_count_words(blogs))
## [1] 37546806
sum(stri_count_words(news))
## [1] 34761151
sum(stri_count_words(twitter))
## [1] 30096649
# Max line length
max(nchar(blogs))
## [1] 40833
max(nchar(news))
## [1] 11384
max(nchar(twitter))
## [1] 144
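The counts above are easier to compare side by side. A minimal sketch that collects them into one data frame, reusing the objects already in memory (the name summary_df is illustrative):
# Summary table of basic statistics for the three sources
summary_df <- data.frame(
  source    = c("blogs", "news", "twitter"),
  lines     = c(length(blogs), length(news), length(twitter)),
  words     = c(sum(stri_count_words(blogs)),
                sum(stri_count_words(news)),
                sum(stri_count_words(twitter))),
  max_chars = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
)
summary_df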
# Histogram of line lengths
hist(nchar(twitter), main="Twitter Line Lengths", xlab="Chars", col="lightblue")
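ggplot2 is loaded above but not yet used; the same distribution can be drawn with it. A sketch (the binwidth of 5 characters is an arbitrary choice):
# Equivalent histogram with ggplot2
ggplot(data.frame(chars = nchar(twitter)), aes(x = chars)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "grey40") +
  labs(title = "Twitter Line Lengths", x = "Chars", y = "Count")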
# Word frequency (example)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
# Convenience sample: first 10,000 tweets (a random sample() would be more representative)
sample_text <- paste(twitter[1:10000], collapse = " ")
corpus <- Corpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
## transformation drops documents
tdm <- TermDocumentMatrix(corpus)
freq_matrix <- as.matrix(tdm)   # avoid shadowing base::matrix
word_freq <- sort(rowSums(freq_matrix), decreasing = TRUE)
wordcloud(names(word_freq), word_freq, max.words=100)
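A word cloud is hard to read quantitatively, so a bar chart of the top terms is a useful companion. A sketch reusing the word_freq vector built above:
# Top 20 terms as a horizontal bar chart (most frequent at the top)
top_terms <- head(word_freq, 20)
barplot(rev(top_terms), horiz = TRUE, las = 1, col = "steelblue",
        main = "Top 20 Terms (Twitter sample)", xlab = "Frequency")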
We will use the following steps to move from raw text to a working predictor:

1. Sample the full data sets (sampling keeps training tractable) and clean the text: lowercasing and removing punctuation and numbers, as in the example above.
2. Build n-gram tokenizers (unigrams, bigrams, trigrams) on the cleaned sample.
3. Experiment with various predictive models based on the n-gram frequencies.
4. Deploy the final model in a Shiny app.

This analysis confirms that the data is cleanable and usable for modeling; the model will be trained on a cleaned, sampled version of the data sets.
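As a preview of the tokenization step, here is a minimal bigram sketch built on stringi (already loaded). make_ngrams() is a hypothetical helper, not from any package; it tokenizes line by line so no n-gram spans two separate tweets:
# Hypothetical helper: build n-grams within each line only
make_ngrams <- function(lines, n = 2) {
  unlist(lapply(stri_extract_all_words(stri_trans_tolower(lines)),
                function(words) {
                  words <- words[!is.na(words)]
                  if (length(words) < n) return(character(0))
                  # Slide a window of length n over the word vector
                  sapply(seq_len(length(words) - n + 1),
                         function(i) paste(words[i:(i + n - 1)], collapse = " "))
                }))
}
bigrams <- make_ngrams(twitter[1:1000], n = 2)
head(sort(table(bigrams), decreasing = TRUE), 10)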