1 Executive Summary

This report presents an exploratory analysis of the SwiftKey natural language dataset provided for the Coursera Data Science Capstone project. The dataset contains text collected from blogs, news articles, and Twitter posts.

The primary goal of this project is to build a predictive text model capable of suggesting the next word in a sentence and deploy the model through a Shiny web application.

This milestone report focuses on:

  • Understanding the structure of the dataset
  • Performing exploratory data analysis
  • Cleaning and preprocessing the text data
  • Analyzing word frequencies and n-grams
  • Planning the prediction algorithm and Shiny application

2 The Data

2.1 Source of Data

The data comes from the HC Corpora dataset and includes three English-language text files:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

The dataset is provided through the Coursera Data Science Capstone project resources.

2.2 Loading the Data

library(tm)            # corpus handling and text cleaning
library(RWeka)         # n-gram tokenization
library(ggplot2)       # plotting
library(wordcloud)     # word clouds
library(RColorBrewer)  # color palettes
library(stringi)       # fast word counting
library(knitr)         # table formatting

blogs <- readLines(
  "en_US.blogs.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)

news <- readLines(
  "en_US.news.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)

twitter <- readLines(
  "en_US.twitter.txt",
  encoding = "UTF-8",
  skipNul = TRUE
)
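
A note on the news file: en_US.news.txt contains an embedded SUB (0x1A) control character that can cause readLines() to stop early when the file is opened in text mode on Windows. If the news line count looks suspiciously low (see Section 3.1), reading through a binary-mode connection is a common workaround:

con <- file("en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)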

3 Basic Summary Statistics

3.1 File Statistics

file_stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  
  File_Size_MB = c(
    round(file.info("en_US.blogs.txt")$size / 1024^2, 2),
    round(file.info("en_US.news.txt")$size / 1024^2, 2),
    round(file.info("en_US.twitter.txt")$size / 1024^2, 2)
  ),
  
  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),
  
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

kable(file_stats)
File     File_Size_MB    Lines     Words
Blogs          200.42   899288  37546806
News           196.28    77259   2674561
Twitter        159.36  2360148  30096690

3.2 Observations

The dataset contains a large number of lines and words. Blog and news entries are generally longer per line, while Twitter posts are shorter and more informal. The comparatively low line count for the news file likely reflects the text-mode truncation issue noted in Section 2.2 rather than the file's true size. These differences matter because the final prediction model must handle both formal and informal writing styles.

4 Sampling and Cleaning

Because the original dataset is very large, a smaller random sample (1,000 lines from each source) is used for exploratory analysis. This keeps report generation fast while still reflecting the main characteristics of the corpus.

4.1 Sampling the Data

set.seed(1234)

sample_blogs <- sample(blogs, min(1000, length(blogs)))
sample_news <- sample(news, min(1000, length(news)))
sample_twitter <- sample(twitter, min(1000, length(twitter)))

sample_data <- c(sample_blogs, sample_news, sample_twitter)
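
The final model will likely need a larger sample. One hypothetical way to scale the sample with corpus size (not used in this report) is a per-line coin flip instead of a fixed line count:

# Keep roughly 1% of lines via a per-line coin flip, so the sample
# grows in proportion to the source file (hypothetical helper)
sample_lines <- function(x, p = 0.01) {
  x[as.logical(rbinom(length(x), size = 1, prob = p))]
}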

4.2 Creating the Corpus

corpus <- VCorpus(VectorSource(sample_data))

4.3 Cleaning the Data

corpus <- tm_map(corpus, content_transformer(tolower))       # normalize case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse leftover spaces

The cleaning process converts all text to lowercase, removes punctuation and numbers, drops common English stop words, and finally collapses the extra whitespace left behind by the earlier steps. Stop words are removed here to make the exploratory frequency counts more informative; the final prediction model will likely retain them, since words such as "the" and "of" are among the most common prediction targets.
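
The same mechanism extends to custom patterns. As a sketch (not part of the pipeline above), a content_transformer can strip URLs, which are common in the Twitter data:

# Hypothetical extra cleaning step: strip URLs from the corpus
remove_urls <- content_transformer(function(x) gsub("http\\S+", "", x))
corpus <- tm_map(corpus, remove_urls)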

5 Distribution of Line Lengths

Line length analysis helps compare the structure of the three data sources.

5.1 Blogs

blog_lengths <- nchar(sample_blogs)

hist(
  blog_lengths,
  main = "Distribution of Blog Line Lengths",
  xlab = "Characters",
  col = "lightblue",
  breaks = 50
)

5.2 News

news_lengths <- nchar(sample_news)

hist(
  news_lengths,
  main = "Distribution of News Line Lengths",
  xlab = "Characters",
  col = "lightgreen",
  breaks = 50
)

5.3 Twitter

twitter_lengths <- nchar(sample_twitter)

hist(
  twitter_lengths,
  main = "Distribution of Twitter Line Lengths",
  xlab = "Characters",
  col = "lightpink",
  breaks = 50
)

5.4 Observations

Twitter posts are generally shorter than blog and news text. Blog entries show the widest variation in line length, while tweets are bounded by the platform's character limit.
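
Five-number summaries make the comparison concrete, using the length vectors computed above:

# Compare the line-length distributions of the three sources numerically
sapply(
  list(blogs = blog_lengths, news = news_lengths, twitter = twitter_lengths),
  summary
)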

6 Word Frequency Analysis

6.1 Creating a Term Document Matrix

tdm <- TermDocumentMatrix(corpus)

tdm_matrix <- as.matrix(tdm)

word_freq <- sort(rowSums(tdm_matrix), decreasing = TRUE)

freq_df <- data.frame(
  word = names(word_freq),
  freq = word_freq,
  row.names = NULL
)
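
Densifying the term-document matrix with as.matrix() is fine for this 3,000-line sample but scales poorly. For larger samples, the sparse row sums from the slam package (which tm already depends on) yield the same frequencies without the dense conversion:

# Same word counts computed directly on the sparse matrix
word_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)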

6.2 Top 20 Most Frequent Words

top20 <- head(freq_df, 20)

ggplot(top20, aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    x = "Words",
    y = "Frequency"
  )

6.3 Observations

A relatively small number of words account for a very large share of all word occurrences in the corpus. This heavily skewed distribution is typical of natural language (Zipf's law) and supports the idea that a compact vocabulary can still cover a large portion of the text.

7 Word Cloud

wordcloud(
  words = freq_df$word,
  freq = freq_df$freq,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)

The word cloud provides a visual summary of the most common words in the sampled corpus.

8 Coverage Analysis

8.1 Vocabulary Coverage

coverage <- cumsum(freq_df$freq) / sum(freq_df$freq)

coverage_df <- data.frame(
  Words = seq_along(coverage),
  Coverage = coverage
)

plot(
  coverage_df$Words,
  coverage_df$Coverage,
  type = "l",
  col = "blue",
  xlab = "Number of Words",
  ylab = "Coverage",
  main = "Vocabulary Coverage"
)
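
The curve can be summarized by how many unique words are needed to reach a given coverage level; a small hypothetical helper (results not reported here):

# Number of unique words needed to cover fraction p of all word instances
words_for_coverage <- function(p) which(coverage >= p)[1]

words_for_coverage(0.5)  # vocabulary size for 50% coverage
words_for_coverage(0.9)  # vocabulary size for 90% coverage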

8.2 Observations

The analysis shows that a relatively small vocabulary can cover a large percentage of the corpus. This is useful for the final application because the prediction model must be small and fast enough to run in a Shiny application.

9 N-gram Analysis

N-gram analysis is central to next-word prediction: the final model will use common word sequences to estimate the most likely next word. Note that the counts below are computed on the raw sampled text (sample_data) rather than the cleaned corpus, so original casing and stop words are still present in the tokens.

9.1 Tokenizer Functions

unigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
}

bigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

trigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}
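
These tokenizers can also be passed to tm's matrix builders through the control argument, which counts n-grams per document on the cleaned corpus; an alternative to the direct tokenization used in the next two subsections:

# Build a bigram term-document matrix from the cleaned corpus
bigram_tdm <- TermDocumentMatrix(
  corpus,
  control = list(tokenize = bigram_tokenizer)
)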

9.2 Bigram Analysis

bigram_text <- paste(sample_data, collapse = " ")

bigram_tokens <- bigram_tokenizer(bigram_text)

bigram_freq <- sort(table(bigram_tokens), decreasing = TRUE)

bigram_df <- data.frame(
  bigram = names(bigram_freq),
  freq = as.numeric(bigram_freq),
  row.names = NULL
)

top_bigrams <- head(bigram_df, 15)

ggplot(top_bigrams, aes(x = reorder(bigram, freq), y = freq)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 15 Bigrams",
    x = "Bigram",
    y = "Frequency"
  )

9.3 Trigram Analysis

trigram_text <- paste(sample_data, collapse = " ")

trigram_tokens <- trigram_tokenizer(trigram_text)

trigram_freq <- sort(table(trigram_tokens), decreasing = TRUE)

trigram_df <- data.frame(
  trigram = names(trigram_freq),
  freq = as.numeric(trigram_freq),
  row.names = NULL
)

top_trigrams <- head(trigram_df, 15)

ggplot(top_trigrams, aes(x = reorder(trigram, freq), y = freq)) +
  geom_col(fill = "purple") +
  coord_flip() +
  labs(
    title = "Top 15 Trigrams",
    x = "Trigram",
    y = "Frequency"
  )

10 Key Findings

Several important observations were identified during the exploratory analysis:

  • Twitter data contains shorter and more informal language.
  • Blogs contain longer and more diverse text structures.
  • News text tends to be more formal than Twitter text.
  • Common words dominate the dataset frequency distribution.
  • N-gram analysis reveals useful phrase patterns for predictive text modeling.
  • A smaller vocabulary can cover most of the text corpus efficiently.

11 Prediction Algorithm Plan and Shiny App

The final prediction model will use n-gram language modeling techniques.

The proposed approach includes:

  • Building unigram, bigram, and trigram frequency tables
  • Cleaning and tokenizing user input
  • Matching the latest words typed by the user against trigram and bigram tables
  • Applying a backoff strategy when an exact match is unavailable
  • Returning the most likely next-word prediction
  • Deploying the final model through a Shiny web application

The prediction algorithm will first attempt to use trigram context. If no trigram match exists, it will back off to bigram predictions. If no bigram match exists, it will use the most frequent unigram predictions.
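
A minimal sketch of this backoff lookup, assuming the bigram_df and trigram_df tables built in Section 9 (already sorted by descending frequency) and the unigram table freq_df from Section 6. predict_next_word() is a hypothetical helper, not code from the analysis above:

# Hypothetical backoff predictor: trigram -> bigram -> unigram
predict_next_word <- function(input, n = 3) {
  words <- tolower(unlist(strsplit(input, "\\s+")))
  words <- words[words != ""]

  # 1. Try trigram context: match the last two words typed
  if (length(words) >= 2) {
    key <- paste(tail(words, 2), collapse = " ")
    hits <- trigram_df[startsWith(trigram_df$trigram, paste0(key, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(hits$trigram, " "), tail, 1), n))
    }
  }

  # 2. Back off to bigram context: match the last word typed
  if (length(words) >= 1) {
    key <- tail(words, 1)
    hits <- bigram_df[startsWith(bigram_df$bigram, paste0(key, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(hits$bigram, " "), tail, 1), n))
    }
  }

  # 3. Final fallback: most frequent unigrams
  head(freq_df$word, n)
}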

12 Next Steps

The next steps for the project include:

  1. Building the predictive text algorithm
  2. Optimizing memory and computation efficiency
  3. Developing the Shiny user interface (a minimal skeleton is sketched after this list)
  4. Testing prediction quality
  5. Deploying the final Shiny application online
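
As a preview of the interface, a minimal Shiny skeleton, assuming the hypothetical predict_next_word() helper sketched in Section 11:

library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderPrint({
    req(input$phrase)                # wait until the user types something
    predict_next_word(input$phrase)  # top candidate next words
  })
}

shinyApp(ui = ui, server = server)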

13 Reproducibility

sessionInfo()
## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Indonesia.utf8  LC_CTYPE=English_Indonesia.utf8   
## [3] LC_MONETARY=English_Indonesia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Indonesia.utf8    
## 
## time zone: Asia/Jakarta
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.51         dplyr_1.1.4        wordcloud_2.6      RColorBrewer_1.1-3
## [5] ggplot2_4.0.3      RWeka_0.4-48       tm_0.7-18          NLP_0.3-2         
## [9] stringi_1.8.7     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6      jsonlite_1.8.8    compiler_4.4.1    tidyselect_1.2.1 
##  [5] Rcpp_1.0.13       slam_0.1-55       xml2_1.3.6        parallel_4.4.1   
##  [9] jquerylib_0.1.4   scales_1.4.0      yaml_2.3.10       fastmap_1.2.0    
## [13] R6_2.5.1          labeling_0.4.3    generics_0.1.3    tibble_3.2.1     
## [17] bslib_0.8.0       pillar_1.9.0      rlang_1.1.4       utf8_1.2.4       
## [21] cachem_1.1.0      xfun_0.57         sass_0.4.9        S7_0.2.2         
## [25] RWekajars_3.9.3-2 otel_0.2.0        cli_3.6.3         withr_3.0.1      
## [29] magrittr_2.0.3    digest_0.6.37     grid_4.4.1        rstudioapi_0.18.0
## [33] rJava_1.0-18      lifecycle_1.0.5   vctrs_0.6.5       evaluate_0.24.0  
## [37] glue_1.7.0        farver_2.1.2      fansi_1.0.6       rmarkdown_2.28   
## [41] pkgconfig_2.0.3   tools_4.4.1       htmltools_0.5.8.1