This report presents an exploratory data analysis (EDA) of the English text datasets (blogs, news, and Twitter) used to build a next-word prediction model. The goal is to understand the size and structure of the data and its word frequency patterns.
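# Read only the first 10,000 lines of each file (not the full corpus).
# skipNul = TRUE skips any embedded NUL characters instead of warning about them.
blogs <- readLines("en_US.blogs.txt", n = 10000, encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", n = 10000, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", n = 10000, encoding = "UTF-8", skipNul = TRUE)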
library(stringi)
summary_table <- data.frame(
Dataset = c("Blogs", "News", "Twitter"),
Line_Count = c(length(blogs), length(news), length(twitter)),
Word_Count = c(
sum(stri_count_words(blogs)),
sum(stri_count_words(news)),
sum(stri_count_words(twitter))
)
)
summary_table
##   Dataset Line_Count Word_Count
## 1   Blogs      10000     412805
## 2    News      10000     348070
## 3 Twitter      10000     126511
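The table above shows the number of lines and words in each dataset. The line counts are identical because only the first 10,000 lines of each file were read; the word counts differ because line length differs, with blog lines the longest and tweets the shortest. Blogs and Twitter contain a large amount of informal text, while news data is more structured and formal.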
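The word counts are easier to compare once they are divided by the line counts. The following is a minimal sketch that re-enters the figures from the table above; the names `summary_stats` and `Words_Per_Line` are introduced here for illustration and are not part of the original analysis.

```r
# Figures copied from the summary table above
summary_stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Line_Count = c(10000, 10000, 10000),
  Word_Count = c(412805, 348070, 126511)
)

# Average number of words per line, rounded to one decimal place
summary_stats$Words_Per_Line <- round(
  summary_stats$Word_Count / summary_stats$Line_Count, 1
)

summary_stats
```

On these figures, blog lines average roughly three times as many words as tweets.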
# Total characters per dataset; nchar() is base R, so no extra package is needed
char_count <- c(
sum(nchar(blogs)),
sum(nchar(news)),
sum(nchar(twitter))
)
data.frame(
Dataset = c("Blogs","News","Twitter"),
Character_Count = char_count
)
##   Dataset Character_Count
## 1   Blogs         2277384
## 2    News         2035687
## 3 Twitter          681674
Character counts give a rough estimate of how much text each dataset contains and, in turn, how much memory and processing time later steps may need.
Due to the large size of the dataset, a small sample is taken for further analysis.
set.seed(123)
blogs_sample <- sample(blogs, 1000)
news_sample <- sample(news, 1000)
twitter_sample <- sample(twitter, 1000)
sample_data <- c(blogs_sample, news_sample, twitter_sample)
length(sample_data)
## [1] 3000
A random sample of 1,000 lines is drawn from each dataset, giving 3,000 lines in total, and the fixed seed makes the draw reproducible. Working on this subset reduces computation time while preserving the overall structure and patterns of the data.
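The sampling step can be wrapped in a small function so that every dataset is sampled the same way. This is a sketch under stated assumptions: `sample_lines` is a hypothetical helper, not part of the original analysis, and it caps the sample size at the number of available lines so that short inputs do not raise an error.

```r
# Hypothetical helper: reproducible sample of up to n lines from a character vector.
# Note that set.seed() resets the global random number generator state.
sample_lines <- function(x, n, seed = 123) {
  set.seed(seed)
  sample(x, size = min(n, length(x)))
}

# Toy data standing in for a real dataset
toy_lines <- paste("line", 1:50)

first_draw <- sample_lines(toy_lines, 10)
second_draw <- sample_lines(toy_lines, 10)      # same seed, so the same lines
small_draw <- sample_lines(toy_lines[1:3], 10)  # only 3 lines are available
```

Resetting the seed inside the helper makes each call reproducible on its own, at the cost of producing the same draw every time unless a different `seed` is passed.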
library(tm)
# Create corpus
corpus <- Corpus(VectorSource(sample_data))
# Convert to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation
corpus <- tm_map(corpus, removePunctuation)
# Remove numbers
corpus <- tm_map(corpus, removeNumbers)
# Remove stopwords
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# Remove extra whitespace
corpus <- tm_map(corpus, stripWhitespace)
The text is cleaned by converting all characters to lowercase and removing punctuation, numbers, and extra whitespace. Common stopwords such as “the”, “and”, and “is” are also removed. This reduces noise in the data and makes meaningful patterns easier to identify during analysis.
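To see what each cleaning step does to a single line, the same sequence can be approximated in base R. This is only a sketch: the regular expressions approximate, rather than reproduce, the tm transformations, and `toy_stopwords` is a hypothetical four-word list that is far shorter than `stopwords("english")`.

```r
# Hypothetical mini stopword list, for illustration only
toy_stopwords <- c("the", "and", "is", "it")

clean_text <- function(x, stop_words = toy_stopwords) {
  x <- tolower(x)                          # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)         # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)         # remove numbers
  # Remove whole-word matches of any stopword
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("[[:space:]]+", " ", x)        # collapse runs of whitespace
  trimws(x)                                # drop leading and trailing spaces
}

cleaned <- clean_text("The 2 cats are sleeping, and it is 9 PM!")
cleaned
```

The word “are” survives only because it is missing from the toy list. The order of the steps also matters: punctuation is removed before stopwords, so a contraction such as “don't” becomes “dont” before the stopword filter sees it.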
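# Build a document-term matrix (by default tm drops words shorter than 3 characters)
dtm <- DocumentTermMatrix(corpus)
# Sum each column to get the total count of every word across all documents
freq <- colSums(as.matrix(dtm))
# Sort from most to least frequent and keep the top 10
freq <- sort(freq, decreasing = TRUE)
top_words <- head(freq, 10)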
top_words
##   said   will   just    one    can   like   time    get people   good
##    327    281    259    259    211    202    189    159    139    133
Because stopwords such as “the”, “and”, and “to” were removed during cleaning, they do not appear here. The most frequent remaining words are “said” (327), “will” (281), “just” (259), and “one” (259), which are still general-purpose words that carry limited topical meaning.
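The counting step itself can be reproduced without tm. The following is a minimal base-R sketch on two made-up documents; `toy_docs`, `toy_tokens`, and `toy_freq` are illustrative names, and splitting on whitespace is a simplification of what a document-term matrix does.

```r
# Two made-up, already-cleaned documents
toy_docs <- c("good time good people", "people said good")

# Split each document on whitespace and pool all tokens
toy_tokens <- unlist(strsplit(toy_docs, "[[:space:]]+"))

# Count occurrences of each word and sort from most to least frequent
toy_freq <- sort(table(toy_tokens), decreasing = TRUE)

head(toy_freq, 2)
```

For a sample of 3,000 lines either approach is fast; the document-term matrix becomes more useful when per-document counts are needed as well.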
barplot(top_words,
main = "Top 10 Most Frequent Words",
las = 2)
This bar plot shows the most frequently occurring words in the dataset. It helps identify commonly used words and overall patterns in text data.
barplot(rev(top_words),
horiz = TRUE,
main = "Top Words (Horizontal View)")
The horizontal bar plot improves readability of word labels and provides a clearer comparison of word frequencies.
library(wordcloud)
wordcloud(names(freq), freq, max.words = 100)
The word cloud visually highlights the most frequent words. Larger words indicate higher frequency, making it easy to identify dominant terms.
barplot(summary_table$Line_Count,
names.arg = summary_table$Dataset,
col = "lightblue",
main = "Line Count Comparison")
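This plot compares the number of lines across the Blogs, News, and Twitter datasets. Because each file was read with n = 10000, all three bars are the same height; the plot reflects the read limit, not the true size of the source files.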
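To compare the full files rather than the lines that were read, the lines can be counted in chunks without loading a whole file into memory. This is a sketch under stated assumptions: `count_lines` is a hypothetical helper, and a temporary file stands in for a real corpus file such as `en_US.blogs.txt`.

```r
# Hypothetical helper: count the lines in a text file, reading it chunk by chunk
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))  # always close the connection, even on error
  total <- 0L
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0L) break  # end of file reached
    total <- total + length(chunk)
  }
  total
}

# Demonstration on a temporary 25-line file
demo_path <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:25), demo_path)
demo_count <- count_lines(demo_path, chunk_size = 10L)
demo_count
```

Passing the resulting full-file counts to `barplot()` would show real size differences between the three sources.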
This exploratory analysis summarizes the size of each dataset, the cleaning steps applied, and the most frequent words in a sample. These findings help in preparing the data for building a next-word prediction model and developing a Shiny application.