This report explores text data from blogs, news articles, and Twitter posts to understand common language patterns. The goal is to prepare for building a text prediction application that can suggest the next word while a user is typing.
The analysis compares the three text sources using basic summaries such as file size, line counts, word counts, and line length distributions. It also reviews the most frequent words and common word sequences using n-gram analysis.
The main findings are that Twitter text is shorter and more informal (a median of 63 characters per line in the sample, compared with 151.5 for blogs and 182 for news), while blog and news text contains longer and more structured sentences. These differences matter because the final prediction model should work across different writing styles.
The final application will use common word patterns to predict the next word and will be deployed as a Shiny web application.
The data comes from the HC Corpora dataset and includes three English-language text files:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

The dataset is provided through the Coursera Data Science Capstone project resources.
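# Packages used throughout the report (as listed in the session information)
library(stringi)
library(tm)
library(RWeka)
library(ggplot2)
library(dplyr)
library(wordcloud)
library(RColorBrewer)
library(knitr)

# Read each source file as UTF-8, skipping embedded NUL characters
blogs <- readLines(
  "en_US.blogs.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)
news <- readLines(
  "en_US.news.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)
twitter <- readLines(
  "en_US.twitter.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)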
# Summarise file size on disk, number of lines read, and total word count
file_stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  File_Size_MB = c(
    round(file.info("en_US.blogs.txt")$size / 1024^2, 2),
    round(file.info("en_US.news.txt")$size / 1024^2, 2),
    round(file.info("en_US.twitter.txt")$size / 1024^2, 2)
  ),
  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
kable(file_stats)
| File | File_Size_MB | Lines | Words |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546806 |
| News | 196.28 | 77259 | 2674561 |
| Twitter | 159.36 | 2360148 | 30096690 |
Each file is large, between roughly 160 and 200 MB. Blog and news entries are longer, averaging about 42 and 35 words per line in the table above, while Twitter entries average about 13 words per line and are more informal. These differences matter because the final prediction model must handle both formal and informal writing styles.
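The News line count deserves a second look. A 196 MB file with only 77,259 lines would mean more than 2,600 bytes per line, yet the sampled News lines in this report average about 200 characters. One possible explanation, offered here as an assumption rather than a confirmed diagnosis, is that `readLines()` stopped early: on Windows, a text-mode read can end at an embedded end-of-file control character. A minimal sketch of a workaround is to read through a connection opened in binary mode. The helper name `read_lines_binary` is hypothetical, and the temporary file below only simulates such a character.

```r
# Hypothetical helper: read a text file through a binary-mode connection
# so that embedded control characters do not end the read early.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Simulated input: three lines, with a 0x1A control character in the second
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(
    charToRaw("first line\n"),
    as.raw(0x1a),
    charToRaw(" second line\nthird line\n")
  ),
  tmp
)

demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If the real News file is re-read this way and the line count changes, the summary table and the News sample should be regenerated.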
Because the original dataset is large, a random sample of up to 1,000 lines from each source is used for exploratory analysis. This allows the report to be generated efficiently while still showing the main characteristics of the corpus, and the fixed seed makes the sample reproducible.
set.seed(1234)
sample_blogs <- sample(blogs, min(1000, length(blogs)))
sample_news <- sample(news, min(1000, length(news)))
sample_twitter <- sample(twitter, min(1000, length(twitter)))
sample_data <- c(sample_blogs, sample_news, sample_twitter)
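corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stop words before punctuation so contractions such as "don't"
# still match the entries in the stop-word list
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Strip whitespace last to collapse the gaps left by the earlier removals
corpus <- tm_map(corpus, stripWhitespace)
The cleaning process converts all text to lowercase, removes common English stop words, removes punctuation and numbers, and finally collapses extra whitespace. Stop words are removed before punctuation so that contractions such as "don't" still match the stop-word list.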
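To show what these cleaning steps do to a single line, here is a minimal base-R sketch. It is not the `tm` pipeline itself: `clean_line` and the tiny `demo_stop_words` list are hypothetical stand-ins, and the order of steps (stop words before punctuation) is an assumption made so that contractions can still be matched.

```r
# Hypothetical, deliberately tiny stop-word list for illustration only
demo_stop_words <- c("the", "is", "and", "to", "don't")

clean_line <- function(x, stop_words = demo_stop_words) {
  x <- tolower(x)
  # Remove stop words as whole words while apostrophes are still present
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, "", x, perl = TRUE)
  # Remove punctuation, then digits
  x <- gsub("[[:punct:]]+", "", x)
  x <- gsub("[[:digit:]]+", "", x)
  # Collapse runs of whitespace and trim the ends
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- clean_line("The meeting is at 10:30, and I don't want to be late!")
cleaned
```

Words that are not in the demo list, such as "at" and "i", survive here only because the list is so short; a full stop-word list would remove them as well.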
Line length analysis helps compare the structure of the three data sources.
# Characters per line for each sampled source
length_df <- data.frame(
  Source = c(
    rep("Blogs", length(sample_blogs)),
    rep("News", length(sample_news)),
    rep("Twitter", length(sample_twitter))
  ),
  Line_Length = c(
    nchar(sample_blogs),
    nchar(sample_news),
    nchar(sample_twitter)
  )
)
sample_summary <- length_df %>%
  group_by(Source) %>%
  summarise(
    Min_Length = min(Line_Length),
    Median_Length = median(Line_Length),
    Mean_Length = round(mean(Line_Length), 2),
    Max_Length = max(Line_Length)
  )
kable(sample_summary)
| Source | Min_Length | Median_Length | Mean_Length | Max_Length |
|---|---|---|---|---|
| Blogs | 3 | 151.5 | 219.13 | 2353 |
| News | 6 | 182.0 | 200.28 | 980 |
| Twitter | 7 | 63.0 | 67.41 | 140 |
# Faceted histogram comparing the three sources
ggplot(length_df, aes(x = Line_Length)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ Source, scales = "free_y") +
  labs(
    title = "Distribution of Line Lengths by Data Source",
    x = "Characters per Line",
    y = "Number of Lines"
  )
# Individual histograms for each source
blog_lengths <- nchar(sample_blogs)
hist(
  blog_lengths,
  main = "Distribution of Blog Line Lengths",
  xlab = "Characters",
  col = "lightblue",
  breaks = 50
)
news_lengths <- nchar(sample_news)
hist(
  news_lengths,
  main = "Distribution of News Line Lengths",
  xlab = "Characters",
  col = "lightgreen",
  breaks = 50
)
twitter_lengths <- nchar(sample_twitter)
hist(
  twitter_lengths,
  main = "Distribution of Twitter Line Lengths",
  xlab = "Characters",
  col = "lightpink",
  breaks = 50
)
Twitter posts are much shorter than blog and news text: no sampled tweet exceeds 140 characters, which is consistent with a 140-character message limit. Blog entries show the largest variation in line length, ranging from 3 to 2,353 characters in the sample.
# Word frequencies across the cleaned sample
tdm <- TermDocumentMatrix(corpus)
tdm_matrix <- as.matrix(tdm)
word_freq <- sort(rowSums(tdm_matrix), decreasing = TRUE)
freq_df <- data.frame(
  word = names(word_freq),
  freq = word_freq,
  row.names = NULL
)
top20 <- head(freq_df, 20)
ggplot(top20, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    x = "Words",
    y = "Frequency"
  )
A relatively small number of words appear very frequently throughout the corpus. This pattern is common in natural language and supports the idea that a smaller vocabulary can still cover a large portion of the text.
wordcloud(
  words = freq_df$word,
  freq = freq_df$freq,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)
The word cloud provides a visual summary of the most common words in the sampled corpus.
# Cumulative share of all word occurrences covered by the k most frequent words
coverage <- cumsum(freq_df$freq) / sum(freq_df$freq)
coverage_df <- data.frame(
  Words = seq_along(coverage),
  Coverage = coverage
)
plot(
  coverage_df$Words,
  coverage_df$Coverage,
  type = "l",
  col = "blue",
  xlab = "Number of Words",
  ylab = "Coverage",
  main = "Vocabulary Coverage"
)
The analysis shows that a relatively small vocabulary can cover a large percentage of the corpus. This is useful for the final application because the prediction model must be small and fast enough to run in a Shiny application.
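The coverage curve can also be read as a number: how many of the most frequent words are needed to reach a given share of the corpus. The following is a minimal sketch under stated assumptions; `words_for_coverage` is a hypothetical helper and `demo_freq` is made-up toy data, not output from this report.

```r
# Hypothetical helper: smallest number of top-ranked words whose combined
# frequency reaches the target share of all word occurrences.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  unname(which(coverage >= target)[1])
}

# Toy frequencies for illustration only
demo_freq <- c(time = 50, people = 20, day = 15, love = 10, zebra = 5)

n_half <- words_for_coverage(demo_freq, 0.5)
n_ninety <- words_for_coverage(demo_freq, 0.9)
c(half = n_half, ninety = n_ninety)
```

Applied to the report's own data, the call would be `words_for_coverage(freq_df$freq, 0.9)`, which gives a concrete vocabulary size to weigh against the memory budget of the Shiny application.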
N-gram analysis is important for next-word prediction. The final model will use common word sequences to estimate the most likely next word.
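To make the idea of a word sequence concrete, here is a small base-R sketch that builds n-grams from a tokenized sentence without any external package. `make_ngrams` is a hypothetical helper written for illustration; the report itself uses `NGramTokenizer` from RWeka.

```r
# Hypothetical helper: slide a window of n tokens across the sequence
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

demo_tokens <- strsplit(tolower("I want to go to the park"), "\\s+")[[1]]
demo_bigrams <- make_ngrams(demo_tokens, 2)
demo_trigrams <- make_ngrams(demo_tokens, 3)

demo_bigrams
demo_trigrams
```

A sentence of k tokens yields k - n + 1 n-grams, so longer n-grams are both fewer per sentence and rarer across the corpus.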
# Tokenizers for 1-, 2-, and 3-word sequences
unigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
bigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
trigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
# Lowercase the sample so that "The" and "the" are counted together.
# Note that collapsing all lines into one string lets n-grams span line boundaries.
bigram_text <- paste(tolower(sample_data), collapse = " ")
bigram_tokens <- bigram_tokenizer(bigram_text)
bigram_freq <- sort(table(bigram_tokens), decreasing = TRUE)
bigram_df <- data.frame(
  bigram = names(bigram_freq),
  freq = as.numeric(bigram_freq),
  row.names = NULL
)
top_bigrams <- head(bigram_df, 15)
ggplot(top_bigrams, aes(x = reorder(bigram, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 15 Bigrams",
    x = "Bigram",
    y = "Frequency"
  )
# Lowercase the sample so that case variants of a trigram are counted together
trigram_text <- paste(tolower(sample_data), collapse = " ")
trigram_tokens <- trigram_tokenizer(trigram_text)
trigram_freq <- sort(table(trigram_tokens), decreasing = TRUE)
trigram_df <- data.frame(
  trigram = names(trigram_freq),
  freq = as.numeric(trigram_freq),
  row.names = NULL
)
top_trigrams <- head(trigram_df, 15)
ggplot(top_trigrams, aes(x = reorder(trigram, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(
    title = "Top 15 Trigrams",
    x = "Trigram",
    y = "Frequency"
  )
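The exploratory analysis produced two main observations: Twitter text is much shorter and more informal than blog and news text, and a relatively small vocabulary covers a large share of the corpus.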
The final prediction model will use n-gram language modeling techniques.
The proposed approach is a simple backoff scheme. The prediction algorithm will first look for trigrams that begin with the last two words typed. If no trigram matches, it will back off to bigrams that begin with the last word. If no bigram matches, it will suggest the most frequent unigrams.
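The backoff logic can be sketched in a few lines of base R. This is an illustrative sketch under stated assumptions, not the final model: the helpers `tokenize`, `make_ngrams`, `count_ngrams`, and `predict_next` are hypothetical names, the four training lines are made up, and raw counts are used with no smoothing or discounting.

```r
# Hypothetical helpers for a count-based backoff predictor
tokenize <- function(line) {
  tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]
  tokens[nzchar(tokens)]
}

make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Count n-grams line by line so that no n-gram spans two lines
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) make_ngrams(tokenize(line), n)))
  sort(table(grams), decreasing = TRUE)
}

predict_next <- function(text, model) {
  tokens <- tokenize(text)
  k <- length(tokens)
  # Try a two-word context (trigram table) first, then a one-word context
  for (n_context in c(2, 1)) {
    if (k >= n_context) {
      context <- paste(tokens[(k - n_context + 1):k], collapse = " ")
      prefix <- paste0(context, " ")
      counts <- model[[n_context + 1]]
      hits <- counts[startsWith(names(counts), prefix)]
      if (length(hits) > 0) {
        # Counts are sorted by frequency, so the first hit is the most frequent
        return(substring(names(hits)[1], nchar(prefix) + 1))
      }
    }
  }
  # Final fallback: the single most frequent word
  names(model[[1]])[1]
}

# Made-up training lines for illustration only
training_lines <- c(
  "i want to go home",
  "i want to go to the park",
  "i want to sleep",
  "you have to go now"
)
model <- lapply(1:3, function(n) count_ngrams(training_lines, n))

from_trigram <- predict_next("I want to", model)
from_bigram <- predict_next("they want", model)
from_unigram <- predict_next("hello", model)
```

Here `from_trigram` is found in the trigram table, `from_bigram` falls back to the bigram table because "they want" was never seen, and `from_unigram` falls back to the most frequent word. A production version would likely return several candidates and store the tables in a form that supports fast lookup.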
The next steps for the project are to build the n-gram prediction model and deploy it as a Shiny web application.
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Indonesia.utf8 LC_CTYPE=English_Indonesia.utf8
## [3] LC_MONETARY=English_Indonesia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Indonesia.utf8
##
## time zone: Asia/Jakarta
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.51 dplyr_1.1.4 wordcloud_2.6 RColorBrewer_1.1-3
## [5] ggplot2_4.0.3 RWeka_0.4-48 tm_0.7-18 NLP_0.3-2
## [9] stringi_1.8.7
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_1.8.8 compiler_4.4.1 tidyselect_1.2.1
## [5] Rcpp_1.0.13 slam_0.1-55 xml2_1.3.6 parallel_4.4.1
## [9] jquerylib_0.1.4 scales_1.4.0 yaml_2.3.10 fastmap_1.2.0
## [13] R6_2.5.1 labeling_0.4.3 generics_0.1.3 tibble_3.2.1
## [17] bslib_0.8.0 pillar_1.9.0 rlang_1.1.4 utf8_1.2.4
## [21] cachem_1.1.0 xfun_0.57 sass_0.4.9 S7_0.2.2
## [25] RWekajars_3.9.3-2 otel_0.2.0 cli_3.6.3 withr_3.0.1
## [29] magrittr_2.0.3 digest_0.6.37 grid_4.4.1 rstudioapi_0.18.0
## [33] rJava_1.0-18 lifecycle_1.0.5 vctrs_0.6.5 evaluate_0.24.0
## [37] glue_1.7.0 farver_2.1.2 fansi_1.0.6 rmarkdown_2.28
## [41] pkgconfig_2.0.3 tools_4.4.1 htmltools_0.5.8.1