This report presents an initial exploration of three English-language text datasets (Blogs, News, and Twitter) and outlines the plan to develop a next-word prediction algorithm and a Shiny application.
```r load-packages, message=FALSE
library(tidyverse)
library(stringi)
library(ggplot2)
library(knitr)
library(wordcloud)
library(RColorBrewer)
```
```r load-data
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
```
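If a line count comes out lower than expected, one possible cause is a text-mode read stopping early at an embedded control character, which has been reported for corpus files of this kind on Windows. A minimal sketch of a workaround follows. The helper name `read_lines_binary` is hypothetical and not part of the original analysis. It opens the file in binary mode and skips embedded NULs, and the demo uses a small temporary file so it runs without the corpus.

```r
# Hypothetical helper (an assumption, not the author's code):
# read all lines through a binary-mode connection.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")   # binary mode avoids text-mode end-of-file handling
  on.exit(close(con))              # always release the connection
  readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a temporary file standing in for one of the corpus files
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

Comparing `length(read_lines_binary("en_US.news.txt"))` with `length(news)` would be one way to check whether the plain `readLines()` call read the whole file.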
```r summaries
# Line counts
line_counts <- c(length(blogs), length(news), length(twitter))

# Word counts
word_counts <- c(sum(stri_count_words(blogs)),
                 sum(stri_count_words(news)),
                 sum(stri_count_words(twitter)))

summary_table <- tibble(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = line_counts,
  Word_Counts = word_counts
)

kable(summary_table)
```

## Visualizations
### Histogram of Words Per Line
```r histograms
# Words per line for each source
blogs_wc <- stri_count_words(blogs)
news_wc <- stri_count_words(news)
twitter_wc <- stri_count_words(twitter)

# tibble() replaces the deprecated data_frame()
tibble(
  Source = rep(c("Blogs", "News", "Twitter"),
               times = c(length(blogs_wc), length(news_wc), length(twitter_wc))),
  Words = c(blogs_wc, news_wc, twitter_wc)
) %>%
  ggplot(aes(x = Words, fill = Source)) +
  geom_histogram(bins = 50, alpha = 0.6) +
  facet_wrap(~ Source, scales = "free_y") +
  theme_minimal() +
  labs(title = "Histogram of Word Counts per Line", x = "Words", y = "Frequency")
```

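### Word Cloud

```r wordcloud, echo=FALSE
# Pool all lines; c() concatenates, whereas paste() would glue lines together element-wise
combined <- c(blogs, news, twitter)
# Lowercase, split on whitespace, and drop empty tokens
words <- unlist(str_split(tolower(combined), "\\s+"))
words <- words[words != ""]
word_table <- sort(table(words), decreasing = TRUE)
wordcloud(names(word_table), freq = as.vector(word_table), max.words = 100,
          colors = brewer.pal(8, "Dark2"))
```
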
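The single-word frequencies behind the word cloud extend naturally to word pairs, which is the kind of count a next-word predictor can be built on. The sketch below is a minimal illustration in base R on a toy vector, not necessarily the technique this project will adopt. The helper names `count_bigrams` and `predict_next` are hypothetical, and the tokenizer (lowercase letters and apostrophes only) is a simplifying assumption.

```r
# Hypothetical helper: count adjacent word pairs (bigrams) in a character vector.
count_bigrams <- function(lines) {
  # Simplifying assumption: tokens are runs of lowercase letters and apostrophes
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  pairs <- unlist(lapply(tokens, function(w) {
    w <- w[w != ""]                        # drop empty tokens from leading separators
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))        # each word joined with its successor
  }))
  sort(table(pairs), decreasing = TRUE)
}

# Hypothetical helper: most frequent word seen after `word`, or NA if unseen.
predict_next <- function(bigrams, word) {
  prefix <- paste0(tolower(word), " ")
  hits <- bigrams[startsWith(names(bigrams), prefix)]
  if (length(hits) == 0) return(NA_character_)
  sub("^\\S+ ", "", names(hits)[1])        # strip the first word, keep the follower
}

toy <- c("I love data", "I love R", "you love data")
bigrams <- count_bigrams(toy)
predict_next(bigrams, "love")   # "data" ("love data" occurs twice, "love r" once)
```

On the full corpus, a sketch like this would likely need to run on a sample of lines, since the pair table grows much faster than the single-word table.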
I aim to build a Next Word Prediction Model using the following techniques:
The Shiny app will:
I have successfully loaded and explored the data, generated summaries and visualizations, and laid out a plan to build a robust prediction algorithm and Shiny app. My goal is to deliver a smart and user-friendly tool powered by clean data and insightful modeling.