This is the exploratory data analysis (EDA) report for the Capstone Project. The objective is to analyze and understand the data sets from blogs, news, and Twitter to prepare for building a text prediction model and a Shiny app.
# Load libraries
library(stringi)
library(ggplot2)
library(wordcloud)
## Loading required package: RColorBrewer
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load data
# Assume the files are saved at: D:/DSR301c/final/en_US/
blogs <- readLines("D:/DSR301c/final/en_US/en_US.blogs.txt", warn = FALSE)
news <- readLines("D:/DSR301c/final/en_US/en_US.news.txt", warn = FALSE)
twitter <- readLines("D:/DSR301c/final/en_US/en_US.twitter.txt", warn = FALSE)
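The Twitter file reportedly contains embedded nul bytes, which can trigger warnings from readLines(). A hedged alternative, assuming the same local path as above, is to pass an explicit encoding and skipNul = TRUE:
# Alternative read: skipNul = TRUE drops embedded nul bytes and
# encoding = "UTF-8" makes the character handling explicit.
twitter <- readLines("D:/DSR301c/final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE, warn = FALSE)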
# Line counts
length(blogs); length(news); length(twitter)
## [1] 899288
## [1] 1010206
## [1] 2360148
# Word counts
sum(stri_count_words(blogs))
## [1] 37546806
sum(stri_count_words(news))
## [1] 34761151
sum(stri_count_words(twitter))
## [1] 30096649
# Max line length
max(nchar(blogs))
## [1] 40833
max(nchar(news))
## [1] 11384
max(nchar(twitter))
## [1] 144
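The counts above are easier to compare side by side. A minimal sketch that collects them into one data frame, reusing the objects already in memory (the name summary_df is illustrative):
# Summary table of basic statistics for the three sources
summary_df <- data.frame(
  source    = c("blogs", "news", "twitter"),
  lines     = c(length(blogs), length(news), length(twitter)),
  words     = c(sum(stri_count_words(blogs)),
                sum(stri_count_words(news)),
                sum(stri_count_words(twitter))),
  max_chars = c(max(nchar(blogs)), max(nchar(news)), max(nchar(twitter)))
)
summary_df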
# Histogram of line lengths
hist(nchar(twitter), main="Twitter Line Lengths", xlab="Chars", col="lightblue")
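ggplot2 is loaded above but not yet used; the same distribution can be drawn with it. A sketch (the binwidth of 5 characters is an arbitrary choice):
# Equivalent histogram with ggplot2
ggplot(data.frame(chars = nchar(twitter)), aes(x = chars)) +
  geom_histogram(binwidth = 5, fill = "lightblue", color = "grey40") +
  labs(title = "Twitter Line Lengths", x = "Chars", y = "Count")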
# Word frequency (example)
library(tm)
## Loading required package: NLP
##
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
##
## annotate
# Convenience sample: first 10,000 tweets (a random sample() would be more representative)
sample_text <- paste(twitter[1:10000], collapse = " ")
corpus <- Corpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("en")):
## transformation drops documents
tdm <- TermDocumentMatrix(corpus)
freq_matrix <- as.matrix(tdm)   # avoid shadowing base::matrix
word_freq <- sort(rowSums(freq_matrix), decreasing = TRUE)
wordcloud(names(word_freq), word_freq, max.words=100)
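A word cloud is hard to read quantitatively, so a bar chart of the top terms is a useful companion. A sketch reusing the word_freq vector built above:
# Top 20 terms as a horizontal bar chart (most frequent at the top)
top_terms <- head(word_freq, 20)
barplot(rev(top_terms), horiz = TRUE, las = 1, col = "steelblue",
        main = "Top 20 Terms (Twitter sample)", xlab = "Frequency")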
We will use the following steps to move from raw text to a working predictor:

1. Sample the full data sets (sampling keeps training tractable) and clean the text: lowercasing and removing punctuation and numbers, as in the example above.
2. Build n-gram tokenizers (unigrams, bigrams, trigrams) on the cleaned sample.
3. Experiment with various predictive models based on the n-gram frequencies.
4. Deploy the final model in a Shiny app.

This analysis confirms that the data is cleanable and usable for modeling; the model will be trained on a cleaned, sampled version of the data sets.
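As a preview of the tokenization step, here is a minimal bigram sketch built on stringi (already loaded). make_ngrams() is a hypothetical helper, not from any package; it tokenizes line by line so no n-gram spans two separate tweets:
# Hypothetical helper: build n-grams within each line only
make_ngrams <- function(lines, n = 2) {
  unlist(lapply(stri_extract_all_words(stri_trans_tolower(lines)),
                function(words) {
                  words <- words[!is.na(words)]
                  if (length(words) < n) return(character(0))
                  # Slide a window of length n over the word vector
                  sapply(seq_len(length(words) - n + 1),
                         function(i) paste(words[i:(i + n - 1)], collapse = " "))
                }))
}
bigrams <- make_ngrams(twitter[1:1000], n = 2)
head(sort(table(bigrams), decreasing = TRUE), 10)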