This report summarizes the initial exploratory analysis of the text data provided for the Data Science Capstone project. The goal is to build a predictive text algorithm and deploy it in a Shiny app.
```r
# Load required packages and the three datasets
library(stringi); library(knitr); library(ggplot2)

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```
```r
# Count lines and words in each dataset
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines   = c(length(blogs), length(news), length(twitter)),
  Words   = c(sum(stri_count_words(blogs)),
              sum(stri_count_words(news)),
              sum(stri_count_words(twitter)))
)
kable(data_summary, caption = "Summary Statistics of Datasets")
```
| Dataset | Lines | Words |
|---|---|---|
| Blogs | 899288 | 37546806 |
| News | 77259 | 2674561 |
| Twitter | 2360148 | 30096690 |
```r
# Compute words per line for each source
line_lengths <- data.frame(
  Length = c(stri_count_words(blogs), stri_count_words(news), stri_count_words(twitter)),
  Source = factor(rep(c("Blogs", "News", "Twitter"),
                      c(length(blogs), length(news), length(twitter))))
)

# Histogram of words per line, truncated at 200 words for readability
ggplot(line_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
  facet_wrap(~Source, ncol = 1, scales = "free_y") +
  xlim(0, 200) +
  labs(title = "Distribution of Words per Line", x = "Words per Line", y = "Number of Lines")
```
# Next Steps
I plan to build an N-gram model (bigram/trigram) to predict the next word for a given phrase. I will clean and preprocess the text (removing special characters, lowercasing, and so on), build N-gram frequency tables, and use the most frequent continuations as predictions. The Shiny app will let users enter a phrase and return the most likely next-word predictions; a rough sketch of both pieces follows.
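As a minimal sketch of this plan (not the final implementation), the code below cleans a small random sample of the loaded data, builds a bigram frequency table with stringi, and looks up the most frequent continuation of a single word. The sample size, the cleaning regex, and the `predict_next` helper are illustrative choices of mine, not fixed design decisions.

```r
library(stringi)

set.seed(123)
sample_text <- sample(c(blogs, news, twitter), 10000)  # small sample for speed

# Basic cleaning: lowercase, keep only letters, apostrophes, and spaces
clean <- stri_trans_tolower(sample_text)
clean <- stri_replace_all_regex(clean, "[^a-z' ]", " ")

# Tokenize each line and form bigrams within lines
tokens  <- stri_extract_all_words(clean)
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Frequency table, most common bigrams first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Predict the next word: the most frequent bigram beginning with `word`
predict_next <- function(word) {
  hits <- bigram_freq[startsWith(names(bigram_freq), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  stri_extract_last_words(names(hits)[1])
}

predict_next("happy")  # e.g. might return "birthday"
```

The app could then wrap this predictor. The skeleton below is likewise hypothetical and only illustrates the intended interaction: the user types a phrase, and the predicted next word (based on the last word typed) is displayed.

```r
library(shiny)
library(stringi)

ui <- fluidPage(
  titlePanel("Next-Word Prediction (sketch)"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- stri_extract_all_words(stri_trans_tolower(input$phrase))[[1]]
    if (length(words) == 0 || all(is.na(words))) return("")
    res <- predict_next(tail(words, 1))  # bigram predictor sketched above
    if (is.na(res)) "(no prediction)" else res
  })
}

shinyApp(ui, server)
```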
This milestone report demonstrates successful loading, exploration, and basic analysis of the data. I welcome any feedback on my approach and look forward to developing the prediction model and the app.