Introduction

The goal of this project is to build a text prediction model using Natural Language Processing (NLP) techniques and deploy it in a Shiny web application. The data includes text from blogs, news articles, and Twitter messages.

This report demonstrates:

- how the three source files are loaded and summarized,
- the distribution of line lengths and the most frequent words in a sample of the text, and
- our planned next steps toward the prediction model and Shiny app.

Data Description

We are using three datasets:

- en_US.blogs.txt (blog posts)
- en_US.news.txt (news articles)
- en_US.twitter.txt (Twitter messages)

Each file contains a large collection of raw text lines, which we load with readLines():

# Libraries used throughout this report
library(stringi); library(dplyr); library(ggplot2); library(tidytext); library(knitr)

# Load data (adjust path as needed)
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
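
Before summarizing, it can also help to check how large the raw files are on disk. A minimal sketch, assuming the three files sit in the working directory exactly as referenced above (the file_paths name is ours, for illustration):

# Approximate file sizes on disk, in megabytes
file_paths <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
round(file.size(file_paths) / 1024^2, 1)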

Summary Statistics

# Basic statistics
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter))),
  MaxLineLength = c(max(nchar(blogs)),
                    max(nchar(news)),
                    max(nchar(twitter)))
)
knitr::kable(data_summary)
|Source  |   Lines|    Words| MaxLineLength|
|:-------|-------:|--------:|-------------:|
|Blogs   |  899288| 37546806|         40833|
|News    | 1010206| 34761151|         11384|
|Twitter | 2360148| 30096649|           140|
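
The table suggests the sources differ in how dense their lines are: dividing Words by Lines gives roughly 42 words per line for blogs, 34 for news, and 13 for Twitter. A small sketch (our own addition) that attaches this derived column to the summary above:

# Average words per line, derived from the counts above
data_summary$WordsPerLine <- round(data_summary$Words / data_summary$Lines, 1)
knitr::kable(data_summary[, c("Source", "WordsPerLine")])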

Line Length Distribution

# Character count of every line, by source
line_lengths <- data.frame(
  Source = rep(c("Blogs", "News", "Twitter"),
               times = c(length(blogs), length(news), length(twitter))),
  LineLength = c(nchar(blogs), nchar(news), nchar(twitter))
)

# Truncate the x-axis at 1,000 characters so the long blog tail does not dominate
ggplot(line_lengths, aes(x = LineLength, fill = Source)) +
  geom_histogram(binwidth = 20, alpha = 0.6, position = "identity") +
  xlim(0, 1000) +
  labs(title = "Distribution of Line Lengths", x = "Line Length (characters)", y = "Count")
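
Because the histogram is truncated at 1,000 characters, the long tail of blog lines is not fully visible. A complementary sketch (base R only) that reports the median, 90th, and 99th percentile line length per source:

# Line-length quantiles (characters) for each source
tapply(line_lengths$LineLength, line_lengths$Source,
       quantile, probs = c(0.5, 0.9, 0.99))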


Word Frequency Analysis

# Sample 1% of each source for quick analysis
set.seed(123)
sample_text <- c(sample(blogs, round(length(blogs) * 0.01)),
                 sample(news, round(length(news) * 0.01)),
                 sample(twitter, round(length(twitter) * 0.01)))

sample_df <- data.frame(text = sample_text, stringsAsFactors = FALSE)

# Tokenize into single words and remove common stop words
tokens <- sample_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

# 20 most frequent remaining words
top_words <- tokens %>%
  count(word, sort = TRUE) %>%
  top_n(20)

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Most Common Words", x = "Word", y = "Frequency")
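
Single-word frequencies are only the first step; the prediction model will depend on word sequences. As an illustration of where the analysis is heading, the sketch below counts bigrams (two-word sequences) in the same 1% sample, reusing the tidytext unnest_tokens() call with token = "ngrams"; the bigrams name is ours:

# Count two-word sequences (bigrams) in the sample
bigrams <- sample_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%    # lines with fewer than two words yield NA
  count(bigram, sort = TRUE)

head(bigrams, 20)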


Next Steps

We plan to:

- build n-gram frequency tables (bigrams, trigrams, and possibly 4-grams) from a cleaned sample of the corpus,
- handle word combinations not seen in training, for example with a backoff strategy (a minimal sketch follows below),
- prune the tables so the model stays small and fast enough for an interactive app, and
- deploy the final prediction model in a Shiny web application.
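
To make that plan concrete, here is one possible, deliberately simplified way to predict the next word from a trigram table built on the sample. The function name predict_next and the example call are illustrative only; the real model will add smoothing/backoff and much more aggressive pruning.

# Illustrative only: most frequent continuation of the last two words
library(tidyr)

trigrams <- sample_df %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ")

predict_next <- function(word1, word2) {
  hit <- trigrams %>% filter(w1 == word1, w2 == word2)
  if (nrow(hit) > 0) hit$w3[1] else NA_character_  # NA when the context is unseen
}

predict_next("one", "of")   # hypothetical example call
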
Conclusion

The exploratory analysis shows that:

- all three sources are large, each containing roughly 30-38 million words,
- Twitter lines are short (at most 140 characters), while blog and news lines can run to thousands of characters, and
- a 1% sample is enough to identify the most frequent words, so sampling is a practical way to keep the model-building step manageable.

Thank you for your time. Feedback is welcome!