Introduction

This report demonstrates that the data for the Capstone Project has been downloaded and loaded, and that a basic exploratory analysis has been performed. The results of this analysis will guide the development of the final predictive text model and Shiny app.


Data Loading

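# Packages used below: stringi (word counts), knitr (tables), ggplot2 (plots)
library(stringi)
library(knitr)
library(ggplot2)
blogs <- readLines("en_US.blogs.txt", skipNul = TRUE)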
news <- readLines("en_US.news.txt", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", skipNul = TRUE): incomplete final line
## found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
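
The "incomplete final line" warning for the News file, together with its comparatively small line count in the summary below, may indicate that the text-mode read stopped before the end of the file. On some platforms (notably Windows) an embedded control character can be treated as end of file when a file is read in text mode. A minimal sketch of a binary-mode read follows; the helper name `read_lines_binary` is hypothetical and not part of the original analysis, and whether it changes the News count depends on the platform.

```r
# Hypothetical helper (not in the original report): read a text file through
# a binary-mode connection so embedded control characters do not end the
# read early.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Small self-contained check using a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If this approach were adopted, the call would presumably be `news <- read_lines_binary("en_US.news.txt")`, and the summary statistics would need to be regenerated.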

Basic Summary Statistics

summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
kable(summary_table, caption = "Basic Summary of the Datasets")
Table: Basic Summary of the Datasets

Source     Lines      Words
Blogs      899288     37546806
News       77259      2674561
Twitter    2360148    30096690

Distribution of Line Lengths

line_lengths <- data.frame(
  Length = c(nchar(blogs), nchar(news), nchar(twitter)),
  Source = c(rep("Blogs", length(blogs)),
             rep("News", length(news)),
             rep("Twitter", length(twitter)))
)

ggplot(line_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(binwidth = 200, alpha = 0.5, position = "identity") +
  xlim(0, 3000) +
  labs(title = "Distribution of Line Lengths",
       x = "Characters per Line",
       y = "Frequency") +
  theme_minimal()
## Warning: Removed 152 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
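
The first warning reports that 152 lines fall outside the plotted range of 0 to 3000 characters. To see which sources those lines come from, a per-source count can be computed alongside a robust summary such as the median. The sketch below uses a small made-up data frame, `toy_lengths`, as a stand-in for `line_lengths`; its values are illustrative only.

```r
# Toy stand-in for the report's line_lengths data frame (values are made up
# for illustration only).
toy_lengths <- data.frame(
  Length = c(120, 3500, 80, 2999, 45, 140, 3001),
  Source = c("Blogs", "Blogs", "News", "News", "Twitter", "Twitter", "Blogs")
)

x_limit <- 3000  # same upper bound as xlim(0, 3000) in the plot

# Number of lines per source that fall outside the plotted range
over_limit <- tapply(toy_lengths$Length > x_limit, toy_lengths$Source, sum)
over_limit

# Median line length per source
median_length <- tapply(toy_lengths$Length, toy_lengths$Source, median)
median_length
```

Replacing `toy_lengths` with `line_lengths` would apply the same summary to the full data.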


Findings

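The three sources differ markedly in size and in typical line length. Based on the summary table, Twitter contributes the most lines (2360148) but averages only about 13 words per line, while Blogs averages about 42 words per line and News about 35. These differences influence how we handle each dataset for predictive modeling.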
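
One difference that can be derived directly from the summary table is the average number of words per line in each source. The sketch below recomputes it from the line and word counts reported above; the column name `WordsPerLine` is introduced here for illustration.

```r
# Line and word counts copied from the summary table in this report
summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(899288, 77259, 2360148),
  Words  = c(37546806, 2674561, 30096690)
)

# Average words per line, rounded to one decimal place
summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 1)
summary_table
```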


Plan for Prediction Model and Shiny App

To build the predictive text model:

The final product will be an interactive, user-friendly app for word prediction, suitable for mobile or desktop use.
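
As a rough illustration of what a word-prediction model can look like, the sketch below builds a tiny bigram predictor in base R. This is only an assumption about one possible approach, not the model planned in this report; the toy corpus and the function name `predict_next` are hypothetical.

```r
# Illustrative only: a tiny bigram next-word predictor in base R.
# The corpus, names, and approach are assumptions, not the report's model.
corpus <- c("i love new york", "i love new books", "new york is big")

# Lowercase and split each line into word tokens
tokens <- strsplit(tolower(corpus), "\\s+")

# Build a two-column table of (first word, next word) pairs
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Return the most frequent follower of `word`, or NA if the word is unseen
predict_next <- function(word, bigrams) {
  followers <- bigrams$second[bigrams$first == tolower(word)]
  if (length(followers) == 0) return(NA_character_)
  names(sort(table(followers), decreasing = TRUE))[1]
}

predict_next("love", bigrams)
predict_next("new", bigrams)
```

A real model would need a much larger sample of the three datasets, text cleaning, and some strategy for unseen words; none of those choices are specified here.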