This report summarizes the initial exploratory analysis of the text data provided for the Data Science Capstone project. The goal is to build a predictive text algorithm and deploy it in a Shiny app.
```r
# Load required packages and the three datasets
library(stringi); library(knitr); library(ggplot2)

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```
```r
# Count lines and words in each dataset
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines   = c(length(blogs), length(news), length(twitter)),
  Words   = c(sum(stri_count_words(blogs)),
              sum(stri_count_words(news)),
              sum(stri_count_words(twitter)))
)
kable(data_summary, caption = "Summary Statistics of Datasets")
```
| Dataset | Lines | Words |
|---|---|---|
| Blogs | 899288 | 37546806 |
| News | 77259 | 2674561 |
| Twitter | 2360148 | 30096690 |
```r
# Compute words per line for each source
line_lengths <- data.frame(
  Length = c(stri_count_words(blogs), stri_count_words(news), stri_count_words(twitter)),
  Source = factor(rep(c("Blogs", "News", "Twitter"),
                      c(length(blogs), length(news), length(twitter))))
)

# Histogram of words per line, truncated at 200 words for readability
ggplot(line_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(binwidth = 5, alpha = 0.7, position = "identity") +
  facet_wrap(~Source, ncol = 1, scales = "free_y") +
  xlim(0, 200) +
  labs(title = "Distribution of Words per Line", x = "Words per Line", y = "Number of Lines")
```
# Next Steps
I plan to build an N-gram model (bigram/trigram) to predict the next word for a given phrase. I will clean and preprocess the text (removing special characters, lowercasing, and so on), build N-gram frequency tables, and use the most frequent continuations as predictions. The Shiny app will let users enter a phrase and return the most likely next-word predictions; a rough sketch of both pieces follows.
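As a minimal sketch of this plan (not the final implementation), the code below cleans a small random sample of the loaded data, builds a bigram frequency table with stringi, and looks up the most frequent continuation of a single word. The sample size, the cleaning regex, and the `predict_next` helper are illustrative choices of mine, not fixed design decisions.

```r
library(stringi)

set.seed(123)
sample_text <- sample(c(blogs, news, twitter), 10000)  # small sample for speed

# Basic cleaning: lowercase, keep only letters, apostrophes, and spaces
clean <- stri_trans_tolower(sample_text)
clean <- stri_replace_all_regex(clean, "[^a-z' ]", " ")

# Tokenize each line and form bigrams within lines
tokens  <- stri_extract_all_words(clean)
bigrams <- unlist(lapply(tokens, function(w) {
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}))

# Frequency table, most common bigrams first
bigram_freq <- sort(table(bigrams), decreasing = TRUE)

# Predict the next word: the most frequent bigram beginning with `word`
predict_next <- function(word) {
  hits <- bigram_freq[startsWith(names(bigram_freq), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  stri_extract_last_words(names(hits)[1])
}

predict_next("happy")  # e.g. might return "birthday"
```

The app could then wrap this predictor. The skeleton below is likewise hypothetical and only illustrates the intended interaction: the user types a phrase, and the predicted next word (based on the last word typed) is displayed.

```r
library(shiny)
library(stringi)

ui <- fluidPage(
  titlePanel("Next-Word Prediction (sketch)"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    words <- stri_extract_all_words(stri_trans_tolower(input$phrase))[[1]]
    if (length(words) == 0 || all(is.na(words))) return("")
    res <- predict_next(tail(words, 1))  # bigram predictor sketched above
    if (is.na(res)) "(no prediction)" else res
  })
}

shinyApp(ui, server)
```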
This milestone report demonstrates successful loading, exploration, and basic analysis of the data. I welcome any feedback on my approach and look forward to developing the prediction model and the app.