## Data Overview

This report presents an initial exploration of three text datasets (Blogs, News, and Twitter) and outlines the plan for a next-word prediction algorithm and a Shiny application built on it.

```r load-packages, message=FALSE
library(tidyverse)
library(stringi)
library(ggplot2)
library(knitr)
library(wordcloud)
library(RColorBrewer)
```

```r load-data
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
```

## Basic Summaries

```r summaries
# Line counts
line_counts <- c(length(blogs), length(news), length(twitter))

# Word counts
word_counts <- c(
  sum(stri_count_words(blogs)),
  sum(stri_count_words(news)),
  sum(stri_count_words(twitter))
)

# Summary table
summary_table <- tibble(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = line_counts,
  Word_Counts = word_counts
)

kable(summary_table)
```


## Visualizations

### Histogram of Words Per Line

```r histograms
# Words per line for each source
blogs_wc <- stri_count_words(blogs)
news_wc <- stri_count_words(news)
twitter_wc <- stri_count_words(twitter)

# tibble() replaces the deprecated data_frame()
tibble(Source = rep(c("Blogs", "News", "Twitter"),
                    times = c(length(blogs_wc), length(news_wc), length(twitter_wc))),
       Words = c(blogs_wc, news_wc, twitter_wc)) %>%
  ggplot(aes(x = Words, fill = Source)) +
  geom_histogram(bins = 50, alpha = 0.6) +
  facet_wrap(~ Source, scales = "free_y") +
  theme_minimal() +
  labs(title = "Histogram of Word Counts per Line", x = "Words", y = "Frequency")
```

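### Word Cloud (Combined)

```r wordcloud, echo=FALSE
# c() concatenates the three corpora; paste() would pair lines element-wise
# and recycle the shorter vectors, inflating their word counts.
combined <- c(blogs, news, twitter)
words <- unlist(str_split(tolower(combined), "\\s+"))
words <- words[words != ""]  # drop empty tokens from leading/trailing whitespace
word_table <- sort(table(words), decreasing = TRUE)
wordcloud(names(word_table), freq = as.numeric(word_table), max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```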

## Prediction Algorithm Plan

I plan to build a next-word prediction model using the following techniques:

  • Tokenization into unigrams, bigrams, trigrams
  • Frequency tables to model sequence probabilities
  • Markov Chains or Stupid Backoff for prediction logic
  • Smoothing for unseen n-grams
  • Sampling to balance memory usage and speed
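
To make the plan concrete, here is a minimal base-R sketch of n-gram frequency tables combined with Stupid Backoff scoring. It is an illustration under stated assumptions, not the final implementation: the function names (`tokenize`, `count_ngrams`, `build_model`, `predict_next`) are hypothetical, the tokenizer keeps only ASCII letters and apostrophes, and the backoff factor of 0.4 is the value commonly used with Stupid Backoff.

```r
# Hypothetical helper names throughout; a sketch, not the report's final code.

# Lowercase, keep only ASCII letters and apostrophes, split on whitespace.
tokenize <- function(lines) {
  cleaned <- gsub("[^a-z' ]+", " ", tolower(lines))
  lapply(strsplit(trimws(cleaned), "\\s+"), function(tok) tok[nzchar(tok)])
}

# Named numeric vector of n-gram counts; n-grams never span two lines.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    vapply(seq_len(length(tok) - n + 1),
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- table(as.character(grams))
  setNames(as.numeric(counts), names(counts))
}

# model[[n]] holds the counts of all n-grams of order n.
build_model <- function(lines, max_n = 3) {
  tokens <- tokenize(lines)
  lapply(seq_len(max_n), function(n) count_ngrams(tokens, n))
}

# Score candidates with the longest available context first; each step down
# to a shorter context multiplies the relative frequency by alpha.
predict_next <- function(model, text, k = 3, alpha = 0.4) {
  context <- tokenize(text)[[1]]
  start_n <- min(length(model), length(context) + 1)
  scores <- numeric(0)
  for (n in start_n:1) {
    counts <- model[[n]]
    if (n == 1) {
      cand <- counts / sum(counts)
    } else {
      prefix <- paste(tail(context, n - 1), collapse = " ")
      hits <- counts[startsWith(as.character(names(counts)), paste0(prefix, " "))]
      if (length(hits) == 0) next
      cand <- hits / model[[n - 1]][[prefix]]
      names(cand) <- sub(".* ", "", names(hits))  # keep only the final word
    }
    # Words already scored at a higher order keep that score.
    cand <- cand[setdiff(names(cand), names(scores))]
    scores <- c(scores, alpha^(start_n - n) * cand)
  }
  head(sort(scores, decreasing = TRUE), k)
}

# Toy corpus for illustration only.
corpus <- c("the cat sat on the mat", "the cat sat on the sofa",
            "the cat ran away", "a dog sat on the mat")
model <- build_model(corpus)
predict_next(model, "the cat", k = 2)
```

On this toy corpus, "sat" ranks ahead of "ran" after "the cat", because the trigram "the cat sat" occurs twice and "the cat ran" once. On the real data the model would likely be built from a random sample of lines (for example with `sample()`), since counting every n-gram in the full corpora is expensive in both memory and time.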

## Shiny App Strategy

The Shiny app will:

  • Accept partial sentences from the user
  • Suggest likely next word(s) based on trained model
  • Provide alternative suggestions and confidence scores
  • Offer responsive and intuitive UI interaction
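
The app described above could be laid out roughly as follows. This is a sketch under assumptions: `predict_next_words()` is a hypothetical placeholder that returns fixed values and stands in for the trained model, and the input IDs (`phrase`, `k`) and the 300 ms debounce delay are illustrative choices.

```r
library(shiny)

# Placeholder predictor (hypothetical): returns fixed words and scores.
# The real app would call the trained n-gram model here.
predict_next_words <- function(text, k = 3) {
  data.frame(word = head(c("the", "to", "and"), k),
             score = head(c(0.5, 0.3, 0.2), k))
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a partial sentence:", value = ""),
  sliderInput("k", "Number of suggestions:", min = 1, max = 3, value = 3, step = 1),
  tableOutput("suggestions")
)

server <- function(input, output, session) {
  # Wait until typing pauses before recomputing, to keep the UI responsive.
  phrase <- debounce(reactive(input$phrase), 300)

  output$suggestions <- renderTable({
    req(nzchar(trimws(phrase())))  # show nothing for empty input
    predict_next_words(phrase(), k = input$k)
  })
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Returning a data frame of words and scores lets the same table show both the alternative suggestions and their confidence values.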

## Key Observations

  • Twitter text has more abbreviations, emoji, and slang
  • Blogs and News contain structured grammar and longer phrasing
  • Stopwords, punctuation, and casing require cleaning strategies
  • High-frequency words dominate across all datasets
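
One possible cleaning step for the casing, punctuation, and stopword issues noted above is sketched below with `stringi`. The function name `clean_text` and its `stopwords` argument are hypothetical, and the rules (drop URLs, keep only letters and apostrophes) are assumptions rather than decisions already made in this report.

```r
library(stringi)

# Hypothetical cleaning helper: lowercases text, strips URLs, punctuation,
# digits and emoji, and optionally removes a caller-supplied stopword list.
clean_text <- function(lines, stopwords = character(0)) {
  x <- stri_trans_tolower(lines)
  x <- stri_replace_all_regex(x, "https?://\\S+", " ")   # remove URLs
  x <- stri_replace_all_regex(x, "[^\\p{L}' ]+", " ")    # keep letters and apostrophes
  if (length(stopwords) > 0) {
    # Assumes the stopwords contain no regex metacharacters.
    pattern <- paste0("\\b(", paste(stopwords, collapse = "|"), ")\\b")
    x <- stri_replace_all_regex(x, pattern, " ")
  }
  # Collapse runs of whitespace and trim the ends.
  stri_trim_both(stri_replace_all_regex(x, "\\s+", " "))
}

clean_text("Check THIS out!!! http://t.co/abc It's GREAT :)")
```

Stopword removal is left optional here because, for next-word prediction, common words are often the correct suggestion; removing them is probably more useful for the word cloud than for the model itself.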

## Conclusion

I have loaded and explored the three datasets, generated summaries and visualizations, and laid out a plan for the prediction algorithm and the Shiny app. The goal is a responsive, easy-to-use next-word prediction tool built on cleaned text and n-gram frequency models.