This report presents the exploratory data analysis (EDA) of the text datasets used for building a next word prediction model. The datasets include blogs, news, and Twitter data in English. The goal of this analysis is to understand the structure, size, and basic characteristics of the data.
# Read only the first 10,000 lines of each file to keep the analysis fast
blogs <- readLines("en_US.blogs.txt", n = 10000)
news <- readLines("en_US.news.txt", n = 10000)
twitter <- readLines("en_US.twitter.txt", n = 10000)
library(stringi)
summary_table <- data.frame(
Dataset = c("Blogs", "News", "Twitter"),
Line_Count = c(length(blogs), length(news), length(twitter)),
Word_Count = c(
sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))
)
)
summary_table
##   Dataset Line_Count Word_Count
## 1   Blogs      10000     412805
## 2    News      10000     348070
## 3 Twitter      10000     126511
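The table above shows the number of lines and words in the first 10,000 lines read from each file, not in the full files. With the line count held equal, blogs have the most words (about 41 per line), followed by news (about 35) and Twitter (about 13), which reflects the short-message format of tweets. Blogs and Twitter tend to contain informal text, while news text is more structured and formal.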
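The per-line averages implied by the table can be checked with a few lines of base R. This is a minimal sketch that re-enters the counts printed above by hand; the `Words_Per_Line` column name is an assumption introduced for illustration and does not appear in the original analysis.

```r
# Counts copied from the summary table printed in this report.
summary_table <- data.frame(
  Dataset    = c("Blogs", "News", "Twitter"),
  Line_Count = c(10000, 10000, 10000),
  Word_Count = c(412805, 348070, 126511)
)

# Hypothetical extra column: average number of words per line.
summary_table$Words_Per_Line <- round(
  summary_table$Word_Count / summary_table$Line_Count, 1
)

summary_table
```

Normalizing by line count makes the three sources comparable even if a different number of lines is read from each file.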
To keep computation manageable, a random sample of 1,000 lines is drawn from each dataset for further analysis.
set.seed(123)
blogs_sample <- sample(blogs, 1000)
news_sample <- sample(news, 1000)
twitter_sample <- sample(twitter, 1000)
sample_data <- c(blogs_sample, news_sample, twitter_sample)
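The combined sample contains 3,000 lines, and fixing the seed makes the draw reproducible. A sample of this size reduces computation time while still reflecting the most common patterns in the data.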
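The effect of fixing the seed can be illustrated on toy input. The sketch below is not part of the original analysis: `sample_lines` is a hypothetical helper that also guards against requesting more lines than are available, a case in which a plain `sample(x, n)` call would stop with an error.

```r
# Hypothetical helper: draw up to n lines without replacement.
sample_lines <- function(lines, n) {
  lines[sample.int(length(lines), min(n, length(lines)))]
}

toy <- paste("line", 1:20)   # toy stand-in for a vector of text lines

set.seed(123)
first <- sample_lines(toy, 5)

set.seed(123)                # resetting the seed reproduces the same draw
second <- sample_lines(toy, 5)

identical(first, second)     # TRUE
```

Asking for more lines than exist, for example `sample_lines(toy, 100)`, simply returns all 20 lines in random order.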
library(tm)
# Create corpus
corpus <- Corpus(VectorSource(sample_data))
# Convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
The text data is cleaned by converting all characters to lowercase and removing punctuation, numbers, and extra whitespace. Common stopwords such as “the”, “and”, and “is” are also removed. This reduces noise in the data and makes it easier to identify meaningful patterns during analysis.
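To make the effect of each cleaning step concrete, the same sequence can be approximated in base R on a single sentence. This is a rough sketch rather than the `tm` pipeline itself: `clean_text` and `toy_stopwords` are hypothetical names, the stopword list is a tiny stand-in for `stopwords("english")`, and the final `trimws()` call is an addition that `stripWhitespace` does not perform.

```r
# Hypothetical mini stopword list; the report uses tm's stopwords("english").
toy_stopwords <- c("the", "and", "is", "a", "to")

clean_text <- function(x, stopwords = toy_stopwords) {
  x <- tolower(x)                                   # lowercase
  x <- gsub("[[:punct:]]", "", x)                   # remove punctuation
  x <- gsub("[[:digit:]]", "", x)                   # remove numbers
  pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)            # remove whole-word stopwords
  x <- gsub("\\s+", " ", x)                         # collapse repeated whitespace
  trimws(x)                                         # drop leading/trailing spaces
}

clean_text("The model is trained on 3 datasets, and it works!")
# "model trained on datasets it works"
```

One consequence of the step order is worth keeping in mind: because punctuation is removed before stopwords, a contraction such as “don't” becomes “dont” and no longer matches its stopword entry.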
dtm <- DocumentTermMatrix(corpus)
freq <- colSums(as.matrix(dtm))
freq <- sort(freq, decreasing = TRUE)
top_words <- head(freq, 10)
top_words
##   said   will   just    one    can   like   time    get people   good
##    327    281    259    259    211    202    189    159    139    133
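The most frequent words after cleaning are “said”, “will”, “just”, and “one”. Because stopwords were removed during cleaning, function words such as “the”, “and”, and “to” no longer appear in the list. The remaining top terms are still fairly generic, and the prominence of “said” likely reflects reported speech in the news sample. Whether stopwords should also be removed for the prediction model is a separate modeling decision that this frequency table alone does not settle.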
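The counting step itself can be reproduced on a toy example without `tm`. The sketch below uses base R's `table()` on whitespace-split tokens; `toy_docs` is made-up data, and splitting on whitespace is a simplification of how `DocumentTermMatrix` tokenizes text.

```r
# Made-up documents standing in for the cleaned corpus.
toy_docs <- c("data science data", "science of data", "next word")

# Split on whitespace and count how often each token occurs.
tokens <- unlist(strsplit(toy_docs, "\\s+"))
freq <- sort(table(tokens), decreasing = TRUE)

head(freq, 2)   # "data" occurs 3 times, "science" 2 times
```

For larger samples, `as.matrix(dtm)` builds a dense matrix that can exhaust memory; summing the sparse matrix directly with `slam::col_sums(dtm)` is one possible alternative.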
barplot(top_words,
main = "Top 10 Most Frequent Words",
las = 2)
The bar plot shows the ten most frequent words remaining after stopword removal. “said” leads with 327 occurrences, and the counts fall off gradually to 133 for “good”.
This exploratory analysis summarizes the size of each dataset and the most frequent terms after cleaning. These findings inform the data preparation for the next-word prediction model and the Shiny application built on it.