The purpose of this report is to demonstrate that the data for the Capstone Project has been downloaded and loaded, and that basic exploratory analysis has been performed. This analysis will guide the development of the final predictive text model and Shiny app.
library(stringi)  # stri_count_words()
library(knitr)    # kable()
library(ggplot2)  # ggplot()
blogs <- readLines("en_US.blogs.txt", skipNul = TRUE)
news <- readLines("en_US.news.txt", skipNul = TRUE)
## Warning in readLines("en_US.news.txt", skipNul = TRUE): incomplete final line
## found on 'en_US.news.txt'
twitter <- readLines("en_US.twitter.txt", skipNul = TRUE)
summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
kable(summary_table, caption = "Basic Summary of the Datasets")
| Source | Lines | Words |
|---|---|---|
| Blogs | 899288 | 37546806 |
| News | 77259 | 2674561 |
| Twitter | 2360148 | 30096690 |
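
The table also allows a quick comparison of how long a typical line is in each source. The following is a minimal sketch that rebuilds the table from the figures reported above and derives an average words-per-line value; the `WordsPerLine` column is an addition for illustration and is not part of the original analysis.

```r
# Rebuild the summary table from the figures reported above
summary_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(899288, 77259, 2360148),
  Words  = c(37546806, 2674561, 30096690)
)

# Average words per line, rounded to one decimal place
# (WordsPerLine is an illustrative column, not in the original report)
summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 1)

print(summary_table)
```

On these figures, blog lines average about 41.8 words, news lines about 34.6, and Twitter lines about 12.8.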
line_lengths <- data.frame(
  Length = c(nchar(blogs), nchar(news), nchar(twitter)),
  Source = c(rep("Blogs", length(blogs)),
             rep("News", length(news)),
             rep("Twitter", length(twitter)))
)
ggplot(line_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(binwidth = 200, alpha = 0.5, position = "identity") +
  xlim(0, 3000) +
  labs(title = "Distribution of Line Lengths",
       x = "Characters per Line",
       y = "Frequency") +
  theme_minimal()
## Warning: Removed 152 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).
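
The first warning most likely means that 152 lines are longer than the 3000-character limit set by `xlim()`, since values outside the scale range are dropped before binning. A minimal sketch of how such lines could be counted per source is shown here; `toy_lengths` and its values are made up for illustration and are not the report's data.

```r
# Toy stand-in for line_lengths; the values are invented for illustration
toy_lengths <- data.frame(
  Length = c(120, 3500, 80, 4100, 140, 60),
  Source = c("Blogs", "Blogs", "News", "News", "Twitter", "Twitter")
)

limit <- 3000  # same upper bound as xlim(0, 3000) in the plot

# Count, per source, the lines that exceed the plotting limit
over_limit <- tapply(toy_lengths$Length > limit, toy_lengths$Source, sum)

print(over_limit)
```

If the goal is only to zoom the plot, `coord_cartesian(xlim = c(0, 3000))` in ggplot2 restricts the visible range without discarding observations, which may be preferable to `xlim()`.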
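The three sources differ markedly in size and line length: per the summary table, blog lines average roughly 42 words, news lines roughly 35, and Twitter lines roughly 13. These differences influence how we handle each dataset for predictive modeling.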
To build the predictive text model:
The final product will be an interactive, user-friendly app for word prediction, suitable for mobile or desktop use.
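
As a rough illustration of what word prediction could look like, the sketch below assumes a simple bigram frequency lookup in base R. The toy corpus and the `predict_next()` helper are hypothetical and are not part of the analysis above; the final model may use a different approach.

```r
# Hypothetical toy corpus; a real model would be trained on the three datasets
corpus <- c("i love new york", "i love data science", "new york is big")

# Lowercase each line and split it into words on whitespace
tokens <- strsplit(tolower(corpus), "\\s+")

# Collect (first word, next word) pairs from every line
word_pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1))
}))

# Count how often each bigram occurs and keep only observed pairs
bigram_counts <- as.data.frame(table(word_pairs$first, word_pairs$second),
                               stringsAsFactors = FALSE)
names(bigram_counts) <- c("first", "second", "n")
bigram_counts <- bigram_counts[bigram_counts$n > 0, ]

# Hypothetical helper: most frequent follower of a word, or NA if unseen
predict_next <- function(word) {
  hits <- bigram_counts[bigram_counts$first == tolower(word), ]
  if (nrow(hits) == 0) return(NA_character_)
  hits$second[which.max(hits$n)]
}

predict_next("new")
```

In this toy corpus, "new" is followed by "york" twice, so that is the word the lookup returns; a word that never appears as a first element yields `NA`.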