This report is part of the Coursera Data Science Capstone Project. The goal is to perform exploratory data analysis on a large corpus of English text data from blogs, news, and Twitter. This report demonstrates that the data has been loaded successfully, summarizes key statistics, and outlines the plan to build a next-word prediction algorithm and a Shiny web application.
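The datasets come from the HC Corpora collection and include three English (en_US) text files: `en_US.blogs.txt`, `en_US.news.txt`, and `en_US.twitter.txt`.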
library(stringi)

# Load the data. Each path is built once so that readLines() and file.size()
# point at the same files; mixing absolute and relative paths returns NA
# sizes whenever the working directory is not the parent of ./final.
data_dir <- "C:/Users/Lenovo/Documents/final/en_US"
paths <- file.path(data_dir, c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
blogs <- readLines(paths[1], warn = FALSE)
news <- readLines(paths[2], warn = FALSE)
twitter <- readLines(paths[3], warn = FALSE)

# Get summary stats: file size in MB, number of lines, number of words
stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  FileSize_MB = file.size(paths) / 1024^2,
  LineCount = c(length(blogs), length(news), length(twitter)),
  WordCount = c(sum(stri_count_words(blogs)),
                sum(stri_count_words(news)),
                sum(stri_count_words(twitter)))
)
knitr::kable(stats, caption = "Summary Statistics of the Data Files")
| File | FileSize_MB | LineCount | WordCount |
|---|---|---|---|
| Blogs | NA | 899288 | 37546806 |
| News | NA | 77259 | 2674561 |
| Twitter | NA | 2360148 | 30096649 |
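
The `NA` values in the FileSize_MB column are what `file.info()` returns when a path does not resolve to an existing file. A minimal sketch of a guard against this is shown here; `file_size_mb` is a hypothetical helper that is not part of the original report, and the demonstration uses a temporary file rather than the real corpus.

```r
# Hypothetical helper (not from the original report): return the size of each
# file in MB, stopping with a clear message if any path does not exist.
file_size_mb <- function(paths) {
  not_found <- paths[!file.exists(paths)]
  if (length(not_found) > 0) {
    stop("File(s) not found: ", paste(not_found, collapse = ", "))
  }
  file.size(paths) / 1024^2
}

# Demonstration on a temporary file instead of the real corpus
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
size_mb <- file_size_mb(tmp)

# A path that does not exist silently yields NA from file.info()
na_size <- file.info(file.path(tempdir(), "no_such_file.txt"))$size
```

Failing early in this way makes a wrong working directory visible immediately instead of producing a table with missing sizes.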
## Line Length Analysis
library(ggplot2)

# Calculate line lengths (in characters) for every line of each source
line_lengths <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"),
               c(length(blogs), length(news), length(twitter))),
  LineLength = c(nchar(blogs), nchar(news), nchar(twitter))
)

# Plot one histogram per source; free scales because the ranges differ widely
ggplot(line_lengths, aes(x = LineLength, fill = Source)) +
  geom_histogram(binwidth = 100, show.legend = FALSE) +
  facet_wrap(~Source, scales = "free") +
  labs(title = "Line Length Distribution by Source",
       x = "Line Length (characters)", y = "Count")
## Word Frequency Analysis
library(tm)

# Sample to reduce memory usage
set.seed(123)
sample_data <- c(sample(blogs, 5000),
                 sample(news, 5000),
                 sample(twitter, 5000))

# Create the corpus and clean it. Stop words are removed before punctuation
# so that contractions such as "don't" still match the stop-word list.
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Term-document matrix, keeping only terms present in at least 1% of documents
tdm <- TermDocumentMatrix(corpus)
tdm <- removeSparseTerms(tdm, 0.99)
m <- as.matrix(tdm)
word_freqs <- sort(rowSums(m), decreasing = TRUE)
word_df <- data.frame(word = names(word_freqs), freq = word_freqs)

# Plot the 20 most frequent words
ggplot(head(word_df, 20), aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency")
## Key Findings
- The blogs dataset contains the longest lines, some exceeding 40,000 characters.
- The Twitter dataset contains over 2 million lines, the most of the three datasets.
- In the Twitter dataset, the word “love” appears about four times as often as the word “hate”.
- Only one tweet contains the word “biostats”; it refers to a professor.
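
The report does not show how these figures were computed. One plausible way, sketched here under the assumption that matching is done per line and is case-sensitive, uses base R pattern matching. The `tweets` vector is illustrative toy data standing in for `twitter`, and `count_lines` is a hypothetical helper.

```r
# Toy stand-in for the `twitter` character vector (illustrative data only)
tweets <- c("I love this song",
            "love love love",
            "I hate Mondays",
            "Nothing to see here",
            "We LOVE biostats")

# Hypothetical helper: number of lines containing `word` as a fixed,
# case-sensitive substring
count_lines <- function(x, word) sum(grepl(word, x, fixed = TRUE))

love_lines <- count_lines(tweets, "love")  # the uppercase "LOVE" line is not counted
hate_lines <- count_lines(tweets, "hate")
ratio <- love_lines / hate_lines

# Length of the longest line in characters, and the lines mentioning "biostats"
longest <- max(nchar(tweets))
biostats_idx <- grep("biostats", tweets, fixed = TRUE)
```

Replacing `tweets` with `twitter` (or `blogs` for the line-length check) would apply the same steps to the full data. Note that this counts matching lines, not word occurrences, so a line that repeats a word is counted once.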
## Next Steps
- N-gram Modeling: Build models from bigrams, trigrams, and 4-grams to capture common word sequences.
- Prediction Algorithm: Implement a back-off scheme (for example, Katz back-off) that predicts the next word from the preceding words.
- Shiny App: Develop a web-based interface in which the user enters a phrase and the app returns the most likely next word.
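
To make the planned approach concrete, here is a minimal sketch of next-word prediction with back-off, written in base R. It is a simplified back-off on raw counts with no discounting, so it is not Katz smoothing, and it is not the final algorithm. The names `tokenize`, `make_ngrams`, `build_counts`, and `predict_next` are hypothetical, and the corpus is toy data.

```r
# Split text into lowercase word tokens (letters and apostrophes only)
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# All n-grams of order n in a token vector, as space-separated strings
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for orders 1..max_n; n-grams never cross line boundaries
build_counts <- function(lines, max_n = 3) {
  token_list <- lapply(lines, tokenize)
  lapply(seq_len(max_n), function(n) {
    table(unlist(lapply(token_list, make_ngrams, n = n)))
  })
}

# Try the longest available context first, then back off to shorter ones
predict_next <- function(phrase, counts_by_order) {
  tokens <- tokenize(phrase)
  orders <- rev(seq_len(length(counts_by_order))[-1])  # e.g. 3, 2
  for (n in orders) {
    k <- n - 1                                  # context words needed
    counts <- counts_by_order[[n]]
    if (length(tokens) < k || length(counts) == 0) next
    prefix <- paste0(paste(utils::tail(tokens, k), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(unname(substring(best, nchar(prefix) + 1)))
    }
  }
  # No context matched: fall back to the most frequent single word
  unigrams <- counts_by_order[[1]]
  unname(names(unigrams)[which.max(unigrams)])
}

# Toy corpus (illustrative only)
corpus_lines <- c("The cat sat on the mat",
                  "The cat ate the fish",
                  "The dog sat on the rug",
                  "The cat sat down")
counts <- build_counts(corpus_lines, max_n = 3)

p_trigram <- predict_next("the cat", counts)  # matched at trigram level
p_bigram  <- predict_next("a dog", counts)    # backs off to bigrams
p_unigram <- predict_next("zebra", counts)    # falls back to unigrams
```

A production version would need discounting for unseen n-grams, pruning of rare n-grams, and a lookup structure faster than a prefix scan over table names, especially inside a Shiny app.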