1 Executive Summary

This report explores text data from blogs, news articles, and Twitter posts to understand common language patterns. The goal is to prepare for building a text prediction application that can suggest the next word while a user is typing.

The analysis compares the three text sources using basic summaries such as file size, line counts, word counts, and line length distributions. It also reviews the most frequent words and common word sequences using n-gram analysis.

The main findings are that Twitter text is shorter and more informal, while blog and news texts contain longer and more structured sentences. These differences matter because the final prediction model should work across different writing styles.

The final application will use common word patterns to predict the next word and will be deployed as a Shiny web application.

2 The Data

2.1 Source of Data

The data comes from the HC Corpora dataset and includes three English-language text files:

en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt

The dataset is provided through the Coursera Data Science Capstone project resources.

2.2 Loading the Data

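# Packages used throughout the report
library(stringi)       # stri_count_words()
library(tm)            # VCorpus(), tm_map(), TermDocumentMatrix()
library(RWeka)         # NGramTokenizer(), Weka_control()
library(ggplot2)       # histograms and bar charts
library(dplyr)         # group_by(), summarise()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()
library(knitr)         # kable()

# Read each file as UTF-8, one element per line, skipping embedded NUL characters
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)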
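
One caveat when reading these files on Windows: readLines() on a text-mode connection can stop early if a file contains an embedded end-of-file control character (Ctrl-Z, byte 0x1A). A possible workaround, sketched below, is to open the file through a binary connection. The helper name read_lines_binary and the temporary demo file are hypothetical and only illustrate the idea; they are not part of the original analysis.

```r
# Hypothetical helper: read a text file through a binary connection so that
# an embedded Ctrl-Z byte is treated as ordinary data, not as end-of-file.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demo file (assumption: three lines, with a Ctrl-Z byte inside the second)
demo_path <- tempfile(fileext = ".txt")
writeBin(
  c(
    charToRaw("first line\nsecond"),
    as.raw(0x1a),
    charToRaw(" line\nthird line\n")
  ),
  demo_path
)

demo_lines <- read_lines_binary(demo_path)
length(demo_lines)
```

The same helper could be tried on any of the three corpus files if a line count looks too low for the file size.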

3 Basic Summary Statistics

3.1 File Statistics

file_stats <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  
  File_Size_MB = c(
    round(file.info("en_US.blogs.txt")$size / 1024^2, 2),
    round(file.info("en_US.news.txt")$size / 1024^2, 2),
    round(file.info("en_US.twitter.txt")$size / 1024^2, 2)
  ),
  
  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),
  
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)

kable(file_stats)

File	File_Size_MB	Lines	Words
Blogs	200.42	899288	37546806
News	196.28	77259	2674561
Twitter	159.36	2360148	30096690

3.2 Observations

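The three files total about 556 MB. Blog and news entries are much longer than tweets: the counts above work out to roughly 42 words per line for blogs, 35 for news, and 13 for Twitter. The News file reports only 77,259 lines even though it is nearly as large as the Blogs file, which suggests the read may have stopped before the end of the file, so the News counts are best treated as a lower bound. These differences matter because the final prediction model must handle both formal and informal writing styles.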
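
The average entry length can be derived from the table above by dividing words by lines. A small sketch, with the counts copied from the table rather than recomputed from the files (the column name Words_Per_Line is an addition for illustration):

```r
# Counts copied from the file statistics table in Section 3.1
file_stats <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 77259, 2360148),
  Words = c(37546806, 2674561, 30096690)
)

# Average number of words per line, rounded to one decimal place
file_stats$Words_Per_Line <- round(file_stats$Words / file_stats$Lines, 1)

file_stats
```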

4 Sampling and Cleaning

Because the original dataset is large, a random sample of up to 1,000 lines per source is used for exploratory analysis. This keeps report generation fast while still showing the main characteristics of the corpus.

4.1 Sampling the Data

set.seed(1234)

sample_blogs <- sample(blogs, min(1000, length(blogs)))
sample_news <- sample(news, min(1000, length(news)))
sample_twitter <- sample(twitter, min(1000, length(twitter)))

sample_data <- c(sample_blogs, sample_news, sample_twitter)

4.2 Creating the Corpus

corpus <- VCorpus(VectorSource(sample_data))

4.3 Cleaning the Data

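corpus <- tm_map(corpus, content_transformer(tolower))
# Remove stop words before punctuation so that contractions such as "don't"
# still match the entries in the stop-word list
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Strip whitespace last to collapse the gaps left by the earlier removals
corpus <- tm_map(corpus, stripWhitespace)

The cleaning process converts all text to lowercase, removes common English stop words, removes punctuation and numbers, and finally collapses extra whitespace. Stop words are removed only to highlight content words in the word frequency analysis; the n-gram analysis in Section 9 works on the uncleaned sample, so stop words are retained there.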
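
One detail worth checking in a pipeline like this is the order of the steps, because stop-word lists usually contain contractions with apostrophes. The base-R sketch below uses a toy stop-word list and hypothetical helpers (remove_words, remove_punct, squash) rather than the tm functions, purely to show the effect on a single string.

```r
# Toy stop-word list (assumption: a tiny subset chosen for illustration)
stop_words <- c("the", "don't", "i")

# Hypothetical helpers that mimic the cleaning steps with base R
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}
remove_punct <- function(x) gsub("[[:punct:]]", "", x)
squash <- function(x) trimws(gsub("\\s+", " ", x))

text <- tolower("I don't like the rain")

# Punctuation first: "don't" becomes "dont" and no longer matches the list
punct_first <- squash(remove_words(remove_punct(text), stop_words))

# Stop words first: the contraction is removed before punctuation is stripped
words_first <- squash(remove_punct(remove_words(text, stop_words)))

punct_first
words_first
```

In this toy case the first ordering leaves "dont" behind, while the second removes the contraction entirely.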

5 Distribution of Line Lengths

Line length analysis helps compare the structure of the three data sources.

5.1 Combined Line Length Summary

length_df <- data.frame(
  Source = c(
    rep("Blogs", length(sample_blogs)),
    rep("News", length(sample_news)),
    rep("Twitter", length(sample_twitter))
  ),
  Line_Length = c(
    nchar(sample_blogs),
    nchar(sample_news),
    nchar(sample_twitter)
  )
)

sample_summary <- length_df %>%
  group_by(Source) %>%
  summarise(
    Min_Length = min(Line_Length),
    Median_Length = median(Line_Length),
    Mean_Length = round(mean(Line_Length), 2),
    Max_Length = max(Line_Length)
  )

kable(sample_summary)

Source	Min_Length	Median_Length	Mean_Length	Max_Length
Blogs	3	151.5	219.13	2353
News	6	182.0	200.28	980
Twitter	7	63.0	67.41	140

5.2 Histogram of Line Lengths by Source

ggplot(length_df, aes(x = Line_Length)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ Source, scales = "free_y") +
  labs(
    title = "Distribution of Line Lengths by Data Source",
    x = "Characters per Line",
    y = "Number of Lines"
  )

5.3 Blogs

blog_lengths <- nchar(sample_blogs)

hist(
  blog_lengths,
  main = "Distribution of Blog Line Lengths",
  xlab = "Characters",
  col = "lightblue",
  breaks = 50
)

5.4 News

news_lengths <- nchar(sample_news)

hist(
  news_lengths,
  main = "Distribution of News Line Lengths",
  xlab = "Characters",
  col = "lightgreen",
  breaks = 50
)

5.5 Twitter

twitter_lengths <- nchar(sample_twitter)

hist(
  twitter_lengths,
  main = "Distribution of Twitter Line Lengths",
  xlab = "Characters",
  col = "lightpink",
  breaks = 50
)

5.6 Observations

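Twitter posts are much shorter than blog and news text: in the sample, tweets have a median length of 63 characters and never exceed 140, reflecting Twitter's 140-character limit. Blog entries show the largest variation, ranging from 3 to 2,353 characters, and their mean (219) sits well above their median (151.5), which indicates a long right tail. News lines have the highest median (182 characters) but a much smaller maximum (980).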

6 Word Frequency Analysis

6.1 Creating a Term Document Matrix

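tdm <- TermDocumentMatrix(corpus)

# Sum the counts on the sparse matrix directly; this avoids building a
# dense copy of the term-document matrix with as.matrix()
word_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)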

freq_df <- data.frame(
  word = names(word_freq),
  freq = word_freq,
  row.names = NULL
)

6.2 Top 20 Most Frequent Words

top20 <- head(freq_df, 20)

ggplot(top20, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    x = "Words",
    y = "Frequency"
  )

6.3 Observations

A relatively small number of words appear very frequently throughout the corpus. This pattern is common in natural language and supports the idea that a smaller vocabulary can still cover a large portion of the text.

7 Word Cloud

wordcloud(
  words = freq_df$word,
  freq = freq_df$freq,
  max.words = 100,
  random.order = FALSE,
  colors = brewer.pal(8, "Dark2")
)

The word cloud provides a visual summary of the most common words in the sampled corpus.

8 Coverage Analysis

8.1 Vocabulary Coverage

coverage <- cumsum(freq_df$freq) / sum(freq_df$freq)

coverage_df <- data.frame(
  Words = seq_along(coverage),
  Coverage = coverage
)

plot(
  coverage_df$Words,
  coverage_df$Coverage,
  type = "l",
  col = "blue",
  xlab = "Number of Words",
  ylab = "Coverage",
  main = "Vocabulary Coverage"
)

8.2 Observations

The analysis shows that a relatively small vocabulary can cover a large percentage of the corpus. This is useful for the final application because the prediction model must be small and fast enough to run in a Shiny application.
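
To turn the coverage curve into a concrete number, one option is to look up the smallest vocabulary size that reaches a target coverage. The sketch below uses a hypothetical helper, words_needed, and a toy frequency vector; the counts are made up for illustration and are not results from the corpus.

```r
# Hypothetical helper: smallest number of top-ranked words whose combined
# frequency reaches the target share of all word occurrences
words_needed <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  unname(which(coverage >= target)[1])
}

# Toy frequencies (assumption: illustrative values only)
toy_freq <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)

words_needed(toy_freq, 0.5)
words_needed(toy_freq, 0.9)
```

With the report's own data, the same helper could be called as words_needed(freq_df$freq, 0.5) or words_needed(freq_df$freq, 0.9).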

9 N-gram Analysis

N-gram analysis is important for next-word prediction. The final model will use common word sequences to estimate the most likely next word.
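
As a minimal illustration of what an n-gram table contains, the base-R sketch below builds bigrams line by line from three toy sentences. The helper make_ngrams and its simple tokenization rule (lowercase letters and apostrophes) are assumptions for illustration; the report itself uses the RWeka tokenizer.

```r
# Hypothetical helper: split one line into lowercase word tokens and return
# all runs of n consecutive tokens, without crossing into other lines
make_ngrams <- function(line, n) {
  tokens <- strsplit(tolower(line), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) {
    return(character(0))
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Toy input (assumption: illustrative lines, not taken from the corpus)
toy_lines <- c("See you later", "See you soon", "Hi")

bigrams <- unlist(lapply(toy_lines, make_ngrams, n = 2))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

bigram_counts
```

Tokenizing each line separately means that no bigram is formed from the last word of one line and the first word of the next.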

9.1 Tokenizer Functions

unigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 1, max = 1))
}

bigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

trigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 3, max = 3))
}

9.2 Bigram Analysis

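# Join the raw (uncleaned) sample into one string: case and stop words are
# preserved, and some bigrams will span the boundary between two lines
bigram_text <- paste(sample_data, collapse = " ")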

bigram_tokens <- NGramTokenizer(
  bigram_text,
  Weka_control(min = 2, max = 2)
)

bigram_freq <- sort(table(bigram_tokens), decreasing = TRUE)

bigram_df <- data.frame(
  bigram = names(bigram_freq),
  freq = as.numeric(bigram_freq),
  row.names = NULL
)

top_bigrams <- head(bigram_df, 15)

ggplot(top_bigrams, aes(x = reorder(bigram, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 15 Bigrams",
    x = "Bigram",
    y = "Frequency"
  )

9.3 Trigram Analysis

# Same joined raw text as in the bigram analysis
trigram_text <- paste(sample_data, collapse = " ")

trigram_tokens <- NGramTokenizer(
  trigram_text,
  Weka_control(min = 3, max = 3)
)

trigram_freq <- sort(table(trigram_tokens), decreasing = TRUE)

trigram_df <- data.frame(
  trigram = names(trigram_freq),
  freq = as.numeric(trigram_freq),
  row.names = NULL
)

top_trigrams <- head(trigram_df, 15)

ggplot(top_trigrams, aes(x = reorder(trigram, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "purple") +
  coord_flip() +
  labs(
    title = "Top 15 Trigrams",
    x = "Trigram",
    y = "Frequency"
  )

10 Key Findings

Several important observations were identified during the exploratory analysis:

Twitter data contains shorter and more informal language.
Blogs contain longer and more diverse text structures.
News text tends to be more formal than Twitter text.
A small set of common words dominates the frequency distribution.
N-gram analysis reveals useful phrase patterns for predictive text modeling.
A relatively small vocabulary covers a large share of the corpus.

11 Prediction Algorithm Plan and Shiny App

The final prediction model will use n-gram language modeling techniques.

The proposed approach includes:

Building unigram, bigram, and trigram frequency tables
Cleaning and tokenizing user input
Matching the latest words typed by the user against trigram and bigram tables
Applying a backoff strategy when an exact match is unavailable
Returning the most likely next-word prediction
Deploying the final model through a Shiny web application

The prediction algorithm will first attempt to use trigram context. If no trigram match exists, it will back off to bigram predictions. If no bigram match exists, it will use the most frequent unigram predictions.
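
The backoff order described above can be sketched in base R as follows. This is a minimal sketch under stated assumptions, not the final model: the function predict_next, the helper best_completion, and the toy frequency tables are all hypothetical, and the tables are stored as named count vectors purely for brevity.

```r
# Toy frequency tables (assumption: illustrative counts, not corpus results)
trigram_counts <- c("thanks for the" = 5, "for the follow" = 3, "for the rt" = 2)
bigram_counts  <- c("for the" = 9, "the follow" = 3, "the rt" = 2,
                    "of the" = 7, "the best" = 4)
unigram_counts <- c(the = 20, of = 8, best = 4, follow = 3)

# Hypothetical helper: most frequent n-gram starting with `prefix`,
# returning only its final word, or NULL when nothing matches
best_completion <- function(counts, prefix) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) {
    return(NULL)
  }
  best <- names(hits)[which.max(hits)]
  sub("^.* ", "", best)
}

# Try trigram context first, then bigram context, then the top unigram
predict_next <- function(input, trigrams, bigrams, unigrams) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  n <- length(words)

  if (n >= 2) {
    guess <- best_completion(trigrams, paste(words[n - 1], words[n]))
    if (!is.null(guess)) return(guess)
  }
  if (n >= 1) {
    guess <- best_completion(bigrams, words[n])
    if (!is.null(guess)) return(guess)
  }
  names(unigrams)[which.max(unigrams)]
}

predict_next("Thanks for the", trigram_counts, bigram_counts, unigram_counts)
predict_next("one of the", trigram_counts, bigram_counts, unigram_counts)
predict_next("zzz", trigram_counts, bigram_counts, unigram_counts)
```

The three calls exercise the trigram, bigram, and unigram branches in turn. A production version would likely replace the linear prefix scan with an indexed lookup to keep the Shiny app responsive.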

12 Next Steps

The next steps for the project include:

Building the predictive text algorithm
Optimizing memory and computation efficiency
Developing the Shiny user interface
Testing prediction quality
Deploying the final Shiny application online

13 Reproducibility

sessionInfo()

## R version 4.4.1 (2024-06-14 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 11 x64 (build 26200)
## 
## Matrix products: default
## 
## 
## locale:
## [1] LC_COLLATE=English_Indonesia.utf8  LC_CTYPE=English_Indonesia.utf8   
## [3] LC_MONETARY=English_Indonesia.utf8 LC_NUMERIC=C                      
## [5] LC_TIME=English_Indonesia.utf8    
## 
## time zone: Asia/Jakarta
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] knitr_1.51         dplyr_1.1.4        wordcloud_2.6      RColorBrewer_1.1-3
## [5] ggplot2_4.0.3      RWeka_0.4-48       tm_0.7-18          NLP_0.3-2         
## [9] stringi_1.8.7     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6      jsonlite_1.8.8    compiler_4.4.1    tidyselect_1.2.1 
##  [5] Rcpp_1.0.13       slam_0.1-55       xml2_1.3.6        parallel_4.4.1   
##  [9] jquerylib_0.1.4   scales_1.4.0      yaml_2.3.10       fastmap_1.2.0    
## [13] R6_2.5.1          labeling_0.4.3    generics_0.1.3    tibble_3.2.1     
## [17] bslib_0.8.0       pillar_1.9.0      rlang_1.1.4       utf8_1.2.4       
## [21] cachem_1.1.0      xfun_0.57         sass_0.4.9        S7_0.2.2         
## [25] RWekajars_3.9.3-2 otel_0.2.0        cli_3.6.3         withr_3.0.1      
## [29] magrittr_2.0.3    digest_0.6.37     grid_4.4.1        rstudioapi_0.18.0
## [33] rJava_1.0-18      lifecycle_1.0.5   vctrs_0.6.5       evaluate_0.24.0  
## [37] glue_1.7.0        farver_2.1.2      fansi_1.0.6       rmarkdown_2.28   
## [41] pkgconfig_2.0.3   tools_4.4.1       htmltools_0.5.8.1

Exploratory Analysis of the SwiftKey Text Dataset

Louis Natasha

2026-05-13