Introduction

This report summarizes the initial exploratory analysis of the text data provided for the Data Science Capstone project. The goal is to build a predictive text algorithm and deploy it in a Shiny app.

Data Loading

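# Load required packages: stringi for word counts, knitr for tables, ggplot2 for plots
library(stringi)
library(knitr)
library(ggplot2)
# Load the datasets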
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
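# Note: on Windows, a text-mode read of this file can stop early at an embedded
# control character; reading from a connection opened with file(..., "rb") avoids that.
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)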
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
# Count lines and words
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
kable(data_summary, caption = "Summary Statistics of Datasets")
Summary Statistics of Datasets

Dataset      Lines      Words
Blogs       899288   37546806
News         77259    2674561
Twitter    2360148   30096690
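
The counts in the summary table also give the average number of words per line for each source. The following is a minimal sketch of that arithmetic, with the counts re-entered by hand; the names `summary_counts` and `WordsPerLine` are hypothetical and not part of the original analysis.

```r
# Counts re-entered by hand from the summary table (illustrative only)
summary_counts <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines   = c(899288, 77259, 2360148),
  Words   = c(37546806, 2674561, 30096690)
)

# Mean words per line = total words / total lines, rounded to one decimal
summary_counts$WordsPerLine <- round(summary_counts$Words / summary_counts$Lines, 1)
print(summary_counts)
```

On these counts, blog lines average roughly 42 words, news lines roughly 35, and tweets roughly 13.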
# Compute line lengths
line_lengths <- data.frame(
  Length = c(stri_count_words(blogs), stri_count_words(news), stri_count_words(twitter)),
  Source = factor(rep(c("Blogs", "News", "Twitter"), 
                      c(length(blogs), length(news), length(twitter))))
)
ggplot(line_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
  facet_wrap(~Source, ncol = 1, scales = "free_y") +
  coord_cartesian(xlim = c(0, 200)) +  # zoom without dropping bins at the axis limits
  labs(title = "Distribution of Words per Line", x = "Words per Line", y = "Number of Lines")

Interesting Findings

Plans for Prediction Algorithm and Shiny App

I plan to build an N-gram model (bigrams and trigrams) that predicts the next word of a given phrase. I will clean and preprocess the text (lowercasing, removing special characters, etc.), build N-gram frequency tables, and use them to rank candidate next words. The Shiny app will let users enter a phrase and will return the most likely next words.
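
The plan above can be illustrated with a small base R sketch. It is not the final model: it covers only the bigram case, and the toy `corpus` and the helpers `tokenize`, `build_bigrams`, and `predict_next` are hypothetical names introduced for illustration. The cleaning rule (keep only letters, apostrophes, and spaces) is likewise an assumption.

```r
# Toy corpus standing in for a sample of the real data (illustrative only)
corpus <- c("I love data science",
            "I love text mining!",
            "Data science is fun")

# Lowercase, replace everything except letters, apostrophes and spaces, then split on whitespace
tokenize <- function(line) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(line))
  words <- strsplit(trimws(cleaned), "\\s+")[[1]]
  words[nzchar(words)]
}

# Build a bigram frequency table, sorted by descending count
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < 2) return(NULL)
    data.frame(first = w[-length(w)], second = w[-1], count = 1L,
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  freq <- aggregate(count ~ first + second, data = pairs, FUN = sum)
  freq[order(-freq$count, freq$first, freq$second), ]
}

# Return up to n of the most frequent words that follow the last word of the phrase
predict_next <- function(phrase, freq, n = 3) {
  w <- tokenize(phrase)
  if (length(w) == 0) return(character(0))
  hits <- freq[freq$first == w[length(w)], ]
  as.character(head(hits$second, n))
}

bigram_freq <- build_bigrams(corpus)
print(predict_next("I really love", bigram_freq))
print(predict_next("data", bigram_freq))
```

A trigram version would key the table on the last two words and fall back to this bigram table when the pair has not been seen. That fallback is one possible design, not a choice made in this report.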

Conclusion

This milestone demonstrates successful loading, exploration, and basic analysis of the data. I welcome any feedback on my approach and look forward to developing the prediction model and app.