This report presents an exploratory analysis of the SwiftKey natural language dataset provided for the Coursera Data Science Capstone project. The dataset contains text collected from blogs, news articles, and Twitter posts.
The primary goal of this project is to build a predictive text model that suggests the next word in a sentence and to deploy that model through a Shiny web application.
This milestone report focuses on:

- loading the data and summarizing its basic properties,
- sampling and cleaning the text,
- exploring word, bigram, and trigram frequencies, and
- outlining the plan for the prediction algorithm and Shiny application.
The data comes from the HC Corpora dataset and includes three English-language text files:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

The dataset is provided through the Coursera Data Science Capstone project resources.
# Load the packages used throughout this report
library(stringi)       # stri_count_words()
library(tm)            # corpus construction and cleaning
library(RWeka)         # NGramTokenizer()
library(ggplot2)       # frequency bar charts
library(wordcloud)     # word cloud
library(RColorBrewer)  # color palettes
library(knitr)         # kable() tables

# Read the three source files; skipNul = TRUE skips embedded NUL characters
blogs <- readLines(
  "en_US.blogs.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)
news <- readLines(
  "en_US.news.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)
twitter <- readLines(
  "en_US.twitter.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)
# Summarize file size, line count, and word count for each source
file_stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  File_Size_MB = c(
    round(file.info("en_US.blogs.txt")$size / 1024^2, 2),
    round(file.info("en_US.news.txt")$size / 1024^2, 2),
    round(file.info("en_US.twitter.txt")$size / 1024^2, 2)
  ),
  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
kable(file_stats)
| File | File_Size_MB | Lines | Words |
|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546806 |
| News | 196.28 | 77259 | 2674561 |
| Twitter | 159.36 | 2360148 | 30096690 |
Together the three files contain more than 3.3 million lines and roughly 70 million words. The blog and news files generally contain longer entries, while the Twitter data consists of shorter, more informal text. These differences matter because the final prediction model must handle both formal and informal writing styles.
Because the original dataset is large, a smaller random sample is used for exploratory analysis. This allows the report to be generated efficiently while still showing the main characteristics of the corpus.
# Draw a reproducible random sample of up to 1,000 lines from each source
set.seed(1234)
sample_blogs <- sample(blogs, min(1000, length(blogs)))
sample_news <- sample(news, min(1000, length(news)))
sample_twitter <- sample(twitter, min(1000, length(twitter)))
sample_data <- c(sample_blogs, sample_news, sample_twitter)

# Build a corpus from the sample and apply the cleaning steps
corpus <- VCorpus(VectorSource(sample_data))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
The cleaning process converts all text to lowercase, removes punctuation, removes numbers, strips extra whitespace, and removes common English stop words.
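As a quick sanity check, a raw sampled line can be compared with its cleaned counterpart using tm's content() accessor (a minimal illustration; the specific line shown depends on the random sample):

# Compare a raw sampled line with its cleaned version
sample_data[1]
content(corpus[[1]])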
Line length analysis helps compare the structure of the three data sources.
blog_lengths <- nchar(sample_blogs)
hist(
blog_lengths,
main = "Distribution of Blog Line Lengths",
xlab = "Characters",
col = "lightblue",
breaks = 50
)
news_lengths <- nchar(sample_news)
hist(
news_lengths,
main = "Distribution of News Line Lengths",
xlab = "Characters",
col = "lightgreen",
breaks = 50
)
twitter_lengths <- nchar(sample_twitter)
hist(
twitter_lengths,
main = "Distribution of Twitter Line Lengths",
xlab = "Characters",
col = "lightpink",
breaks = 50
)
Twitter posts are generally shorter than blog and news text. Blog entries tend to have the largest variation in sentence length, while tweets are limited by their short-message format.
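The visual impression can be confirmed numerically. The following sketch, which uses only the length vectors computed above, tabulates the quartiles of each distribution side by side:

# Summary statistics (min, quartiles, mean, max) of line lengths per source
sapply(
  list(Blogs = blog_lengths, News = news_lengths, Twitter = twitter_lengths),
  summary
)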
# Term-document matrix over the cleaned corpus; row sums give word frequencies
tdm <- TermDocumentMatrix(corpus)
tdm_matrix <- as.matrix(tdm)
word_freq <- sort(rowSums(tdm_matrix), decreasing = TRUE)
freq_df <- data.frame(
word = names(word_freq),
freq = word_freq,
row.names = NULL
)
top20 <- head(freq_df, 20)
ggplot(top20, aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
labs(
title = "Top 20 Most Frequent Words",
x = "Words",
y = "Frequency"
)
A relatively small number of words appear very frequently throughout the corpus. This pattern is common in natural language and supports the idea that a smaller vocabulary can still cover a large portion of the text.
wordcloud(
words = freq_df$word,
freq = freq_df$freq,
max.words = 100,
random.order = FALSE,
colors = brewer.pal(8, "Dark2")
)
The word cloud provides a visual summary of the most common words in the sampled corpus.
# Cumulative share of all word occurrences covered by the top-ranked words
coverage <- cumsum(freq_df$freq) / sum(freq_df$freq)
coverage_df <- data.frame(
  Words = seq_along(coverage),
  Coverage = coverage
)
plot(
coverage_df$Words,
coverage_df$Coverage,
type = "l",
col = "blue",
xlab = "Number of Words",
ylab = "Coverage",
main = "Vocabulary Coverage"
)
The analysis shows that a relatively small vocabulary covers a large percentage of the corpus. This matters for the final product: the prediction model must be small and fast enough to run inside a Shiny application.
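To quantify this, the coverage vector computed above can be queried for the number of unique words needed to reach a given coverage level (a small sketch; the exact counts depend on the sample):

# Unique words needed for 50% and 90% coverage of the sampled corpus
words_50 <- which(coverage >= 0.5)[1]
words_90 <- which(coverage >= 0.9)[1]
c(`50% coverage` = words_50, `90% coverage` = words_90)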
N-gram analysis is important for next-word prediction. The final model will use common word sequences to estimate the most likely next word.
# Tokenizer helpers wrapping RWeka's NGramTokenizer for a fixed n-gram order
unigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
}
bigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}
trigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
# Tokenize the concatenated sample text into bigrams and tabulate frequencies
ngram_text <- paste(sample_data, collapse = " ")
bigram_tokens <- bigram_tokenizer(ngram_text)
bigram_freq <- sort(table(bigram_tokens), decreasing = TRUE)
bigram_df <- data.frame(
  bigram = names(bigram_freq),
  freq = as.numeric(bigram_freq),
  row.names = NULL
)
top_bigrams <- head(bigram_df, 15)
ggplot(top_bigrams, aes(x = reorder(bigram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "darkgreen") +
coord_flip() +
labs(
title = "Top 15 Bigrams",
x = "Bigram",
y = "Frequency"
)
# Reuse the same concatenated text for trigram tokenization
trigram_tokens <- trigram_tokenizer(ngram_text)
trigram_freq <- sort(table(trigram_tokens), decreasing = TRUE)
trigram_df <- data.frame(
  trigram = names(trigram_freq),
  freq = as.numeric(trigram_freq),
  row.names = NULL
)
top_trigrams <- head(trigram_df, 15)
ggplot(top_trigrams, aes(x = reorder(trigram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "purple") +
coord_flip() +
labs(
title = "Top 15 Trigrams",
x = "Trigram",
y = "Frequency"
)
Several important observations were identified during the exploratory analysis:

- A small set of high-frequency words accounts for a large share of the corpus, so a compact vocabulary can still achieve high coverage.
- The three sources differ noticeably in style and length: blog and news entries are longer and more formal, while tweets are short and informal.
- Frequent bigrams and trigrams capture common word sequences, which is exactly the signal a next-word predictor needs.
- Random sampling keeps the analysis tractable while preserving the main characteristics of the corpus.
The final prediction model will use n-gram language modeling techniques.
The proposed approach includes:

- building unigram, bigram, and trigram frequency tables from a larger sample of the corpus,
- applying a backoff strategy when a higher-order n-gram is not found, and
- pruning the frequency tables so the model remains small and fast enough for a Shiny application.
The prediction algorithm will first attempt to use trigram context. If no trigram match exists, it will back off to bigram predictions. If no bigram match exists, it will use the most frequent unigram predictions.
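A minimal sketch of this backoff lookup is shown below. It assumes the trigram and bigram tables built earlier are split into a context column and a next-word column; the split_ngram and predict_next helpers are illustrative, not part of the final implementation, and matching is case-sensitive because the n-gram tables were built from the raw sample.

# Illustrative helper: split "w1 w2 w3" into context ("w1 w2") and next word ("w3")
split_ngram <- function(df, col) {
  parts <- strsplit(as.character(df[[col]]), " ")
  df$context <- sapply(parts, function(p) paste(head(p, -1), collapse = " "))
  df$next_word <- sapply(parts, function(p) tail(p, 1))
  df
}

trigram_lk <- split_ngram(trigram_df, "trigram")
bigram_lk <- split_ngram(bigram_df, "bigram")

# Backoff lookup: trigram context first, then bigram, then top unigrams
predict_next <- function(input, n = 3) {
  words <- unlist(strsplit(input, "\\s+"))
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)

  # Rows are already sorted by frequency, so hits come back most-frequent first
  hits <- trigram_lk$next_word[trigram_lk$context == last2]
  if (length(hits) == 0) {
    hits <- bigram_lk$next_word[bigram_lk$context == last1]
  }
  if (length(hits) == 0) {
    hits <- freq_df$word[1:n]  # most frequent unigrams as a last resort
  }
  head(unique(hits), n)
}

predict_next("thanks for the")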
The next steps for the project include:

- training the n-gram tables on a larger sample of the corpus,
- implementing and tuning the backoff prediction algorithm,
- optimizing the model for size and lookup speed, and
- building and deploying the Shiny web application around the final model.
sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
##
## Matrix products: default
##
##
## locale:
## [1] LC_COLLATE=English_Indonesia.utf8 LC_CTYPE=English_Indonesia.utf8
## [3] LC_MONETARY=English_Indonesia.utf8 LC_NUMERIC=C
## [5] LC_TIME=English_Indonesia.utf8
##
## time zone: Asia/Jakarta
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] knitr_1.51 dplyr_1.1.4 wordcloud_2.6 RColorBrewer_1.1-3
## [5] ggplot2_4.0.3 RWeka_0.4-48 tm_0.7-18 NLP_0.3-2
## [9] stringi_1.8.7
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_1.8.8 compiler_4.4.1 tidyselect_1.2.1
## [5] Rcpp_1.0.13 slam_0.1-55 xml2_1.3.6 parallel_4.4.1
## [9] jquerylib_0.1.4 scales_1.4.0 yaml_2.3.10 fastmap_1.2.0
## [13] R6_2.5.1 labeling_0.4.3 generics_0.1.3 tibble_3.2.1
## [17] bslib_0.8.0 pillar_1.9.0 rlang_1.1.4 utf8_1.2.4
## [21] cachem_1.1.0 xfun_0.57 sass_0.4.9 S7_0.2.2
## [25] RWekajars_3.9.3-2 otel_0.2.0 cli_3.6.3 withr_3.0.1
## [29] magrittr_2.0.3 digest_0.6.37 grid_4.4.1 rstudioapi_0.18.0
## [33] rJava_1.0-18 lifecycle_1.0.5 vctrs_0.6.5 evaluate_0.24.0
## [37] glue_1.7.0 farver_2.1.2 fansi_1.0.6 rmarkdown_2.28
## [41] pkgconfig_2.0.3 tools_4.4.1 htmltools_0.5.8.1