## Data Overview

This report presents an initial exploration of three English text datasets (Blogs, News, and Twitter) and outlines the plan for a next-word prediction algorithm and a Shiny application.

```r load-packages, message=FALSE
library(tidyverse)
library(stringi)
library(ggplot2)
library(knitr)
library(wordcloud)
library(RColorBrewer)
```

```r load-data
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
```
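
The plan later in this report lists sampling as a way to balance memory usage and speed. A minimal sketch of how a reproducible random sample of lines could be drawn is shown here. The helper name `sample_lines`, the 5% fraction, and the seed are illustrative assumptions, not settings taken from the analysis.

```r
# Hypothetical helper: keep a random fraction of the lines in a character vector.
# Note: set.seed() changes the global RNG state, which is acceptable for a report.
sample_lines <- function(lines, fraction = 0.05, seed = 1234) {
  stopifnot(fraction > 0, fraction <= 1)
  if (length(lines) == 0) return(lines)
  set.seed(seed)  # fixed seed so the same sample is drawn on every run
  n_keep <- max(1, floor(length(lines) * fraction))
  # sort() keeps the sampled lines in their original order
  lines[sort(sample.int(length(lines), n_keep))]
}

# Toy input standing in for one of the corpus files
toy <- sprintf("line %d", 1:200)
toy_sample <- sample_lines(toy, fraction = 0.05)
length(toy_sample)
```

The same call could be applied to `blogs`, `news`, and `twitter` before tokenization, with the fraction tuned to the available memory.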

### Summary Table

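```r
# Number of lines and total word count per dataset
line_counts <- c(length(blogs), length(news), length(twitter))
word_counts <- c(sum(stri_count_words(blogs)),
                 sum(stri_count_words(news)),
                 sum(stri_count_words(twitter)))

summary_table <- tibble(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = line_counts,
  Word_Counts = word_counts
)

kable(summary_table)
```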


## Visualizations

### Histogram of Words Per Line

```r histograms
# Number of words in each line, per source
blogs_wc <- stri_count_words(blogs)
news_wc <- stri_count_words(news)
twitter_wc <- stri_count_words(twitter)

# Long-format table: one row per line, labelled by source
tibble(Source = rep(c("Blogs", "News", "Twitter"),
                    times = c(length(blogs_wc), length(news_wc), length(twitter_wc))),
       Words = c(blogs_wc, news_wc, twitter_wc)) %>%
  ggplot(aes(x = Words, fill = Source)) +
  geom_histogram(bins = 50, alpha = 0.6) +
  facet_wrap(~ Source, scales = "free_y") +
  theme_minimal() +
  labs(title = "Histogram of Word Counts per Line", x = "Words", y = "Frequency")
```

### Word Cloud (Combined)

```r wordcloud, echo=FALSE
# Pool all lines from the three sources into one character vector
combined <- c(blogs, news, twitter)
words <- unlist(str_split(combined, "\\s+"))
words <- words[words != ""]  # drop empty tokens left by leading or trailing whitespace
word_table <- sort(table(tolower(words)), decreasing = TRUE)
wordcloud(names(word_table), freq = as.numeric(word_table), max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```

## Prediction Algorithm Plan

I aim to build a Next Word Prediction Model using the following techniques:

- Tokenization into unigrams, bigrams, and trigrams
- Frequency tables to model sequence probabilities
- Markov chains or Stupid Backoff for prediction logic
- Smoothing for unseen n-grams
- Sampling to balance memory usage and speed
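
The steps above can be sketched in base R as follows. This is a minimal illustration on a toy corpus, not the final model: the function names (`tokenize`, `count_ngrams`, `predict_next`), the simple regex tokenizer, and the toy sentences are all assumptions made for the example. The scoring follows the Stupid Backoff idea: use the relative frequency from the longest matching n-gram, and multiply by a fixed factor (0.4 is a commonly used value) each time the model backs off to a shorter context.

```r
# Split lower-cased text into word tokens (a deliberately simple tokenizer).
tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[tokens != ""]
}

# Count n-grams within each line; returns a named numeric vector.
count_ngrams <- function(lines, n) {
  grams <- as.character(unlist(lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < n) return(character(0))
    idx <- seq_len(length(tok) - n + 1)
    vapply(idx, function(i) paste(tok[i:(i + n - 1)], collapse = " "), character(1))
  })))
  counts <- table(grams)
  setNames(as.numeric(counts), names(counts))
}

# Score candidate next words with Stupid Backoff over trigram/bigram/unigram counts.
# `models` is a list where models[[n]] holds the n-gram counts.
predict_next <- function(phrase, models, k = 3, alpha = 0.4) {
  tok <- tokenize(phrase)
  penalty <- 1
  for (n in 3:2) {
    if (length(tok) < n - 1) next            # context too short for this order
    context <- paste(tail(tok, n - 1), collapse = " ")
    counts <- models[[n]]
    hits <- counts[startsWith(as.character(names(counts)), paste0(context, " "))]
    if (length(hits) > 0) {
      scores <- penalty * hits / models[[n - 1]][[context]]
      names(scores) <- sub(".* ", "", names(hits))  # keep only the final word
      return(head(sort(scores, decreasing = TRUE), k))
    }
    penalty <- penalty * alpha               # back off to a shorter context
  }
  # Last resort: unigram relative frequencies
  head(sort(penalty * models[[1]] / sum(models[[1]]), decreasing = TRUE), k)
}

# Toy corpus standing in for the sampled Blogs, News, and Twitter lines
corpus <- c("i love data science", "i love data analysis",
            "i love coffee", "data science is fun")
models <- lapply(1:3, function(n) count_ngrams(corpus, n))

p1 <- predict_next("I love", models)       # trigram context "i love" is found
p2 <- predict_next("really love", models)  # backs off to bigram context "love"
```

On this toy corpus both calls rank "data" first; the second call's scores are scaled by 0.4 because the trigram context was not found. Stupid Backoff scores are not true probabilities, so they would serve as relative confidence values rather than calibrated ones.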

## Shiny App Strategy

The Shiny app will:

- Accept partial sentences from the user
- Suggest the most likely next word(s) based on the trained model
- Provide alternative suggestions and confidence scores
- Offer a responsive and intuitive UI
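
A minimal sketch of how such an app could be wired together is shown here, assuming the `shiny` package is installed. The function `suggest_words` is a hypothetical placeholder that returns fixed values; the trained model would replace it. The input and output IDs (`phrase`, `suggestions`) are also illustrative.

```r
library(shiny)

# Hypothetical stand-in for the trained model: returns candidate words with
# scores. The values are fixed placeholders, not real predictions.
suggest_words <- function(phrase) {
  data.frame(word = c("the", "to", "and"), score = c(0.5, 0.3, 0.2))
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a partial sentence:"),
  tableOutput("suggestions")
)

server <- function(input, output, session) {
  output$suggestions <- renderTable({
    # Wait until the user has typed something other than whitespace
    req(nzchar(trimws(input$phrase)))
    suggest_words(input$phrase)
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Because `renderTable` is reactive, the suggestions refresh each time the text input changes, which covers the responsiveness goal without an explicit submit button.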

## Key Observations

- Twitter text has more abbreviations, emoji, and slang
- Blogs and News contain structured grammar and longer phrasing
- Stopwords, punctuation, and casing require cleaning strategies
- High-frequency words dominate across all datasets
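
One possible cleaning step for the casing and punctuation issues noted above is sketched here in base R. The function name `clean_text` and the specific rules (lower-casing, dropping URLs, keeping only letters and apostrophes) are assumptions for illustration. The sketch leaves stopwords in place, since whether to remove them is a separate decision.

```r
# Hypothetical cleaning helper: normalise case and strip non-word characters.
clean_text <- function(text) {
  text <- tolower(text)
  text <- gsub("https?://\\S+", " ", text, perl = TRUE)  # drop URLs
  text <- gsub("[^a-z' ]+", " ", text)  # drop digits, punctuation, emoji, other symbols
  text <- gsub(" +", " ", text)         # collapse runs of spaces
  trimws(text)
}

cleaned <- clean_text("OMG!! Check https://example.com :) It's GREAT, isn't it?")
cleaned
# "omg check it's great isn't it"
```

Apostrophes are kept so that contractions such as "it's" survive as single tokens. Restricting the text to `a-z` also discards accented letters, which may or may not be acceptable for these datasets.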

## Conclusion

I have successfully loaded and explored the data, generated summaries and visualizations, and laid out a plan to build a robust prediction algorithm and Shiny app. My goal is to deliver a smart and user-friendly tool powered by clean data and insightful modeling.

Exploratory Data Analysis and Next Word Prediction Plan

polarbear244

20 Jul 2025