Data Science Capstone - Exploratory Analysis

R Markdown

Introduction

This project explores the SwiftKey English-language text dataset, which contains text from three sources: blogs, news articles, and Twitter posts. The goal is to understand the structure and basic characteristics of the data, such as the number of lines per source and the distribution of words and word sequences, before developing a next-word prediction model.

To make the analysis computationally manageable, a sample of 10,000 lines was taken from each source, resulting in a combined sample of 30,000 text records. The analysis focuses on word frequencies and common word sequences that may be useful for building a predictive text application.

Loading and Sampling the Data

The three English-language datasets were loaded into R. A random sample was then selected from each source using a fixed seed to ensure reproducibility.

blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

set.seed(123)

sample_data <- c(
  sample(blogs, 10000),
  sample(news, 10000),
  sample(twitter, 10000)
)

Data Cleaning

The sampled text was cleaned before analysis. The following preprocessing steps were applied:

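- Converted all text to lowercase
- Removed punctuation
- Removed numbers
- Removed extra whitespace (handled by tokenization)

Stop words were not removed. The code below keeps them, which is why words such as "the" and "to" dominate the frequency tables, and such words are often exactly what a next-word predictor needs to suggest.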

library(quanteda)

# Create quanteda corpus
corp <- corpus(sample_data)

# Tokenize + clean in ONE step
tokens_all <- tokens(
  corp,
  remove_punct = TRUE,
  remove_numbers = TRUE
)

# Convert to lowercase
tokens_all <- tokens_tolower(tokens_all)

After cleaning, the text was tokenized and transformed into unigrams, bigrams, and trigrams.

# Build document-feature matrices
unigrams_dfm <- dfm(tokens_all)
bigrams_dfm  <- dfm(tokens_ngrams(tokens_all, n = 2))
trigrams_dfm <- dfm(tokens_ngrams(tokens_all, n = 3))

# Build frequency tables
unigram_freq <- sort(colSums(unigrams_dfm), decreasing = TRUE)
bigram_freq  <- sort(colSums(bigrams_dfm), decreasing = TRUE)
trigram_freq <- sort(colSums(trigrams_dfm), decreasing = TRUE)

# Remove very rare features (frequency < 2)
unigram_freq <- unigram_freq[unigram_freq >= 2]
bigram_freq  <- bigram_freq[bigram_freq >= 2]
trigram_freq <- trigram_freq[trigram_freq >= 2]

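# Make sure n-gram names use "_" between words, which the prediction function
# relies on (tokens_ngrams() already joins tokens with "_" by default)
names(bigram_freq)  <- gsub(" ", "_", names(bigram_freq))
names(trigram_freq) <- gsub(" ", "_", names(trigram_freq))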
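To make the n-gram step concrete, the following base R sketch builds bigrams and trigrams from a short token vector. It is only an illustration of the idea, not the quanteda code path used in this report; `make_ngrams` and `toy_tokens` are hypothetical names introduced here.

```r
# Toy illustration of n-gram construction in base R.
# `make_ngrams` is a hypothetical helper, not part of quanteda.
make_ngrams <- function(tokens, n) {
  # Fewer tokens than n means no n-gram can be formed
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
    character(1)
  )
}

toy_tokens <- c("one", "of", "the", "best", "days")
bigrams  <- make_ngrams(toy_tokens, 2)
trigrams <- make_ngrams(toy_tokens, 3)
print(bigrams)   # "one_of" "of_the" "the_best" "best_days"
print(trigrams)  # "one_of_the" "of_the_best" "the_best_days"
```

Each n-gram is a sliding window over consecutive tokens, so a line with k tokens yields k - n + 1 n-grams.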

Summary of the Data

The table below summarizes the number of lines in each source file.

data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter))
)

##    Source   Lines
## 1   Blogs  899288
## 2    News 1010206
## 3 Twitter 2360148

This table shows the relative size of each text source before sampling.
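Word counts per source could be added to the same summary. The sketch below is one possible approach in base R, shown on toy data rather than the full files; `summarise_source` and the `toy_*` vectors are hypothetical, and splitting on whitespace is only a rough definition of a word.

```r
# Hypothetical helper: line and word counts for one source, base R only.
summarise_source <- function(lines) {
  # Split each line on runs of whitespace and count the non-empty pieces
  words_per_line <- vapply(
    strsplit(trimws(lines), "\\s+"),
    function(w) sum(nzchar(w)),
    integer(1)
  )
  data.frame(
    Lines = length(lines),
    Words = sum(words_per_line),
    MeanWordsPerLine = round(mean(words_per_line), 1)
  )
}

# Toy stand-ins for the real `blogs` and `twitter` vectors
toy_blogs   <- c("This is a short blog post.", "Another line here")
toy_twitter <- c("so tired", "  good morning  everyone ")

toy_summary <- rbind(
  cbind(Source = "Blogs",   summarise_source(toy_blogs)),
  cbind(Source = "Twitter", summarise_source(toy_twitter))
)
print(toy_summary)
```

On the full data this would be called as `summarise_source(blogs)` and so on, though it may take a while on files of this size.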

Most Frequent Words

Unigram frequencies were calculated from the cleaned corpus.

head(
  data.frame(
    Word = names(unigram_freq),
    Frequency = as.numeric(unigram_freq)
  ),
  20
)

##    Word Frequency
## 1   the     44172
## 2    to     24147
## 3   and     22869
## 4     a     21146
## 5    of     18719
## 6    in     14997
## 7     i     13388
## 8  that      9550
## 9   for      9207
## 10   is      8819
## 11   it      7872
## 12   on      6838
## 13 with      6568
## 14  you      6484
## 15  was      5956
## 16   at      4800
## 17   be      4755
## 18 this      4707
## 19   my      4699
## 20 have      4553

The table above displays the twenty most frequently occurring words in the sampled dataset.

Top 20 Words

library(ggplot2)

top20_uni <- head(unigram_freq,20)

ggplot(
  data.frame(
    Word = factor(names(top20_uni),
                  levels = rev(names(top20_uni))),
    Frequency = as.numeric(top20_uni)
  ),
  aes(Word, Frequency)
) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(title = "Top 20 Most Frequent Words")

The figure provides a visual representation of the most common words observed in the sample.
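A related question is how few distinct words account for most of the text. The sketch below computes this on a toy frequency vector; `terms_for_coverage` and `toy_freq` are hypothetical names. Applied to `unigram_freq`, the result would be relative to the retained words only, since words seen once were dropped earlier.

```r
# Hypothetical helper: smallest number of top-ranked terms whose counts
# add up to at least `target` (a proportion) of all occurrences.
terms_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative_share <- cumsum(freq) / sum(freq)
  which(cumulative_share >= target)[1]
}

toy_freq <- c(the = 50, to = 20, and = 15, a = 10, cat = 5)
n50 <- unname(terms_for_coverage(toy_freq, 0.50))
n90 <- unname(terms_for_coverage(toy_freq, 0.90))
print(c(n50 = n50, n90 = n90))
```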

Most Frequent Bigrams

Bigrams are sequences of two consecutive words.

head(
  data.frame(
    Bigram = names(bigram_freq),
    Frequency = as.numeric(bigram_freq)
  ),
  20
)

##      Bigram Frequency
## 1    of_the      4080
## 2    in_the      3956
## 3    to_the      1978
## 4    on_the      1738
## 5   for_the      1688
## 6     to_be      1431
## 7    at_the      1229
## 8   and_the      1211
## 9      in_a      1081
## 10 with_the       986
## 11   it_was       923
## 12     is_a       915
## 13 from_the       856
## 14    for_a       809
## 15    i_was       778
## 16     of_a       762
## 17   with_a       759
## 18   i_have       736
## 19    and_i       736
## 20    it_is       718

Top 20 Bigrams

top20_bi <- head(bigram_freq,20)

ggplot(
  data.frame(
    Bigram = factor(names(top20_bi),
                    levels = rev(names(top20_bi))),
    Frequency = as.numeric(top20_bi)
  ),
  aes(Bigram, Frequency)
) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(title = "Top 20 Most Frequent Bigrams")

These word pairs provide additional context beyond individual words and may be useful for predicting the next word in a sequence.
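One way to see why bigram counts help prediction is to turn them into conditional probabilities: the share of times each word follows a given previous word. The sketch below does this for a toy count vector; `next_word_probs` and `toy_bigrams` are hypothetical and are not part of the model in this report.

```r
# Hypothetical helper: estimate P(next word | previous word) from bigram
# counts whose names have the form "previous_next".
next_word_probs <- function(bigram_counts, previous) {
  prefix <- paste0(previous, "_")
  matches <- bigram_counts[startsWith(names(bigram_counts), prefix)]
  if (length(matches) == 0) return(numeric(0))
  # Keep only the second word as the name of each entry
  names(matches) <- substring(names(matches), nchar(prefix) + 1)
  sort(matches / sum(matches), decreasing = TRUE)
}

toy_bigrams <- c(of_the = 6, of_a = 3, of_course = 1, in_the = 5)
probs <- next_word_probs(toy_bigrams, "of")
print(probs)
```

With these toy counts, "the" follows "of" in 6 of 10 cases, so it would be the first suggestion.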

Most Frequent Trigrams

Trigrams are sequences of three consecutive words.

head(
  data.frame(
    Trigram = names(trigram_freq),
    Frequency = as.numeric(trigram_freq)
  ),
  20
)

##         Trigram Frequency
## 1    one_of_the       320
## 2         =_=_=       261
## 3      a_lot_of       251
## 4    out_of_the       154
## 5      it_was_a       148
## 6    the_end_of       141
## 7   going_to_be       139
## 8       to_be_a       138
## 9    as_well_as       131
## 10   be_able_to       125
## 11  some_of_the       122
## 12    this_is_a       122
## 13  part_of_the       112
## 14    i_want_to       111
## 15  a_couple_of       106
## 16  the_rest_of       101
## 17    i_have_to        99
## 18   end_of_the        98
## 19     i_have_a        96
## 20 in_the_first        95

Top 20 Trigrams

top20_tri <- head(trigram_freq,20)

ggplot(
  data.frame(
    Trigram = factor(names(top20_tri),
                     levels = rev(names(top20_tri))),
    Frequency = as.numeric(top20_tri)
  ),
  aes(Trigram, Frequency)
) +
geom_col(fill = "darkred") +
coord_flip() +
labs(title = "Top 20 Most Frequent Trigrams")

Trigrams capture longer patterns of language and may improve prediction accuracy by incorporating more context.

Code:

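library(tm)  # provides removePunctuation()

# Word lists used by the hand-written filtering rules below
pronouns <- c("you", "we", "they", "i", "he", "she", "it")
# Note: `verbs` is defined but not used by predict_next_word()
verbs <- c("are", "is", "was", "were", "have", "do", "make", "go", "get", "know", "think")
# Candidate words that are skipped as uninformative predictions
stop_junk <- c("said", "will", "one", "new", "like", "just", "get", "go", "can", "say")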

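# Note: `seed` is not used inside the function; call set.seed() beforehand
# to make the random unigram fallback reproducible.
predict_next_word <- function(text_input, seed = 123) {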
    text_input <- tolower(as.character(text_input))
    text_input <- removePunctuation(text_input)
    
    words <- unlist(strsplit(text_input, " "))
    words <- words[words != ""]
    
    n <- length(words)
    
    # ---------------- TRIGRAM ----------------
    if (n >= 2) {
        pattern <- paste(words[n-1], words[n], sep = "_")
        candidates <- trigram_freq[grepl(paste0("^", pattern, "_"), names(trigram_freq))]
        
        if (length(candidates) > 0) {
            best <- names(sort(candidates, decreasing = TRUE))
            for (b in best) {
                next_word <- strsplit(b, "_")[[1]][3]
                if (next_word %in% stop_junk) next
                if (next_word == words[n]) next
                if (paste(words[n-1], words[n]) == "how are" && !(next_word %in% pronouns)) next
                return(next_word)
            }
        }
    }
    
    # ---------------- BIGRAM ----------------
    if (n >= 1) {
        pattern <- words[n]
        candidates <- bigram_freq[grepl(paste0("^", pattern, "_"), names(bigram_freq))]
        if (length(candidates) > 0) {
            best <- names(sort(candidates, decreasing = TRUE))
            for (b in best) {
                next_word <- strsplit(b, "_")[[1]][2]
                if (next_word %in% stop_junk) next  # skip filler words
                if (next_word == words[n]) next     # avoid repeating the last word
                return(next_word)
            }
        }
    }
    
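    # ---------------- UNIGRAM ----------------
    # Last resort: pick a word at random from the 5th to 50th most frequent
    # unigrams (unigram_freq is already sorted in decreasing order)
    unigram_sorted <- sort(unigram_freq, decreasing = TRUE)
    return(sample(names(unigram_sorted)[5:50], 1))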
}

Preliminary Prediction Model

A simple prediction function was created using unigram, bigram, and trigram frequency tables.

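The model first looks for trigrams that begin with the last two words entered and returns the most frequent continuation that passes a few hand-written filters (for example, skipping the filler words in stop_junk and avoiding a repeat of the last word). If no trigram yields a prediction, it applies the same search to the bigram table using only the last word. If that also fails, it picks a word at random from the 5th to 50th most frequent unigrams.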
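The backoff order can be shown in isolation with a much smaller sketch. It is a simplified stand-in rather than the function used in this report: it uses toy tables, has no word filters, and falls back to the single most frequent unigram instead of a random pick. All names below (`toy_tri`, `toy_bi`, `toy_uni`, `best_continuation`, `backoff_predict`) are hypothetical.

```r
# Toy frequency tables; names use "_" between words, as in the report.
toy_tri <- c(going_to_be = 5, going_to_the = 3, thank_you_for = 4)
toy_bi  <- c(to_be = 9, to_the = 7, you_are = 6)
toy_uni <- c(the = 20, to = 12, you = 8)

# Hypothetical helper: most frequent continuation of `prefix` in `table`,
# or NA when no entry starts with that prefix.
best_continuation <- function(table, prefix) {
  hits <- table[startsWith(names(table), paste0(prefix, "_"))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub("^.*_", "", best)  # keep only the final word
}

backoff_predict <- function(words) {
  n <- length(words)
  if (n >= 2) {  # try the trigram table with the last two words
    guess <- best_continuation(toy_tri, paste(words[n - 1], words[n], sep = "_"))
    if (!is.na(guess)) return(guess)
  }
  if (n >= 1) {  # back off to the bigram table with the last word
    guess <- best_continuation(toy_bi, words[n])
    if (!is.na(guess)) return(guess)
  }
  names(toy_uni)[which.max(toy_uni)]  # last resort: most frequent word
}

print(backoff_predict(c("going", "to")))  # trigram match
print(backoff_predict(c("see", "you")))   # bigram fallback
print(backoff_predict(c("hello")))        # unigram fallback
```

Matching with `startsWith()` instead of a regular expression avoids surprises if the input ever contains regex metacharacters.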

Example predictions generated by the current model are shown below.

set.seed(123)
predict_next_word("I love")

## [1] "you"

predict_next_word("how are")

## [1] "you"

predict_next_word("the weather")

## [1] "is"

predict_next_word("going to")

## [1] "be"

predict_next_word("thank you")

## [1] "for"

predict_next_word("what is")

## [1] "the"

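These examples show that the model can generate candidate next words from patterns learned from the text corpus. Note that the "how are" result is also shaped by a hand-written rule that only accepts a pronoun after that phrase.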

Future Work

The next stage of the project will focus on improving the prediction model and deploying it as a Shiny application.

Planned enhancements include:

- Refining the n-gram prediction strategy
- Improving handling of previously unseen word combinations
- Evaluating prediction accuracy
- Creating an interactive Shiny interface for real-time next-word prediction
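As a starting point for the accuracy evaluation, one simple measure is top-1 accuracy on held-out text: the share of contexts for which the predicted word equals the word that actually followed. The sketch below assumes a predictor that takes a context string and returns one word; `top1_accuracy`, `always_the`, and the held-out vectors are hypothetical and use made-up toy data, not results from this analysis.

```r
# Hypothetical evaluation helper: `predict_fn` takes a context string and
# returns one predicted word; `contexts` and `truth` are parallel vectors.
top1_accuracy <- function(predict_fn, contexts, truth) {
  stopifnot(length(contexts) == length(truth))
  predicted <- vapply(contexts, predict_fn, character(1), USE.NAMES = FALSE)
  mean(predicted == truth)
}

# Stand-in predictor for the demo: always guesses "the".
always_the <- function(context) "the"

# Toy held-out pairs: each context and the word that followed it
held_out_contexts <- c("one of", "going to", "in", "thank you")
held_out_truth    <- c("the",    "be",       "the", "for")

acc <- top1_accuracy(always_the, held_out_contexts, held_out_truth)
print(acc)
```

In practice `predict_next_word` could be passed as `predict_fn`, with contexts and true next words taken from lines that were not part of the 30,000-line sample.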

Conclusion

This exploratory analysis successfully loaded, cleaned, and analyzed the SwiftKey text data. Frequency analysis of words, bigrams, and trigrams provides a foundation for developing a predictive text model. The results from this analysis will be used to guide the design of the final prediction algorithm and Shiny application.

Shradha Verma

2026-06-07
