The goal of this project is to build a text prediction model using Natural Language Processing (NLP) techniques and deploy it in a Shiny web application. The data includes text from blogs, news articles, and Twitter messages.
This report demonstrates that the data have been downloaded and loaded into R, summarizes basic statistics for each corpus, presents exploratory findings on line lengths and word frequencies, and outlines the plan for the prediction algorithm and Shiny app.
We are using three datasets:
- en_US.blogs.txt (blog posts)
- en_US.news.txt (news articles)
- en_US.twitter.txt (Twitter messages)
Each file contains a large collection of text lines.
# Load data (adjust path as needed)
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
# Basic statistics
library(stringi)  # for stri_count_words()

data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter))),
  MaxLineLength = c(max(nchar(blogs)),
                    max(nchar(news)),
                    max(nchar(twitter)))
)

knitr::kable(data_summary)
| Source | Lines | Words | MaxLineLength |
|---|---|---|---|
| Blogs | 899288 | 37546806 | 40833 |
| News | 1010206 | 34761151 | 11384 |
| Twitter | 2360148 | 30096649 | 140 |
library(ggplot2)

line_lengths <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"),
               times = c(length(blogs), length(news), length(twitter))),
  LineLength = c(nchar(blogs), nchar(news), nchar(twitter))
)

# Truncate the x-axis at 1,000 characters for readability
ggplot(line_lengths, aes(x = LineLength, fill = Source)) +
  geom_histogram(binwidth = 20, alpha = 0.6, position = "identity") +
  xlim(0, 1000) +
  labs(title = "Distribution of Line Lengths", x = "Line Length (characters)", y = "Count")
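Because the corpora contain millions of lines and tens of millions of words each, the rest of the analysis works on a 1% sample. As a quick sanity check on size, the in-memory footprint of each corpus can be inspected with base R's object.size (a minimal sketch; the exact numbers depend on your platform and R version):

# Approximate in-memory size of each corpus, in megabytes
sapply(list(Blogs = blogs, News = news, Twitter = twitter),
       function(x) format(object.size(x), units = "MB"))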
# Sample 1% of each corpus for quick analysis
set.seed(123)
sample_text <- c(sample(blogs, floor(length(blogs) * 0.01)),
                 sample(news, floor(length(news) * 0.01)),
                 sample(twitter, floor(length(twitter) * 0.01)))

sample_df <- data.frame(text = sample_text, stringsAsFactors = FALSE)

# Clean and tokenize, then drop common stop words
library(dplyr)
library(tidytext)

tokens <- sample_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

top_words <- tokens %>%
  count(word, sort = TRUE) %>%
  top_n(20, n)
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Most Common Words (excluding stop words)", x = "Word", y = "Frequency")
We plan to:

- build a next-word prediction model from word-sequence frequencies (e.g., bigram and trigram tables computed from a larger sample of the combined corpus); a rough sketch follows below;
- handle phrases the model has not seen by backing off to shorter contexts;
- keep the final lookup tables small enough that predictions return quickly;
- deploy the model in a Shiny app that takes a partial phrase and suggests the next word (a minimal skeleton appears at the end of this report).
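The sketch below is only an illustration of this approach, not the final implementation: it builds bigram and trigram frequency tables from the same 1% sample using tidytext (tidyr must also be installed), and predict_next_word is a placeholder helper that picks the most frequent continuation, backing off from trigrams to bigrams to a single common word.

library(dplyr)
library(tidytext)

# Bigram and trigram counts from the 1% sample (sketch only)
bigrams <- sample_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(ngram)) %>%
  count(ngram, sort = TRUE) %>%
  tidyr::separate(ngram, into = c("w1", "w2"), sep = " ")

trigrams <- sample_df %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(ngram)) %>%
  count(ngram, sort = TRUE) %>%
  tidyr::separate(ngram, into = c("w1", "w2", "w3"), sep = " ")

# Crude back-off: try the trigram table, then the bigram table,
# then fall back to a very common word
predict_next_word <- function(phrase) {
  words <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

  if (length(words) == 2) {
    hit <- trigrams %>%
      filter(w1 == words[1], w2 == words[2]) %>%
      arrange(desc(n)) %>%
      slice(1)
    if (nrow(hit) > 0) return(hit$w3)
  }

  last_word <- tail(words, 1)
  hit <- bigrams %>%
    filter(w1 == last_word) %>%
    arrange(desc(n)) %>%
    slice(1)
  if (nrow(hit) > 0) return(hit$w2)

  "the"  # placeholder fallback when no n-gram matches
}

predict_next_word("thanks for the")

A real implementation would train on a larger sample, prune rare n-grams, and apply a proper smoothing or back-off scheme rather than this simple most-frequent lookup.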
The exploratory analysis shows that:

- Twitter contributes the most lines (about 2.36 million) but the shortest ones, with a maximum line length of 140 characters;
- blogs contribute the most words (about 37.5 million) and by far the longest lines (over 40,000 characters at the extreme);
- news lines sit between the two, with a maximum length of about 11,000 characters;
- after stop words are removed, a small set of everyday words still dominates the top of the frequency ranking.
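Finally, to give a sense of the planned Shiny deployment, here is a minimal skeleton; it assumes a prediction function like the predict_next_word sketch above, and the widget names are placeholders rather than the final design.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction (sketch)"),
  textInput("phrase", "Type a phrase:", value = ""),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    # predict_next_word() is the placeholder helper sketched earlier
    paste("Predicted next word:", predict_next_word(input$phrase))
  })
}

shinyApp(ui = ui, server = server)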
Thank you for your time. Feedback is welcome!