The goal of this report is to show that I have become familiar with the data and that I am on track to create my prediction algorithm. It explains my exploratory data analysis (EDA) and my goals for the app and the algorithm.

Loading and reading in data

Let’s start by loading the packages, then downloading the data and reading it in.

Load packages

library(quanteda.textstats)
library(tibble)
library(tidytext)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ purrr     1.0.4
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(stringr)
library(quanteda)
## Package version: 4.2.0
## Unicode version: 14.0
## ICU version: 71.1
## Parallel computing: disabled
## See https://quanteda.io for tutorials and examples.
library(readtext)
## 
## Attaching package: 'readtext'
## 
## The following object is masked from 'package:quanteda':
## 
##     texts
library(patchwork)
set.seed(9102015)

Download data

start <- Sys.time()

download_and_prepare_data <- function() {
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

  if (!file.exists("data")) {
    dir.create("data")
    dir.create("data/raw")
    download.file(url, "./data/data.zip")
    unzip("./data/data.zip", exdir = "./data/raw")
    file.remove("./data/data.zip")
    # Keep only the English (en_US) files
    unlink(c("./data/raw/final/de_DE/", "./data/raw/final/fi_FI", "./data/raw/final/ru_RU"), recursive = TRUE)
    return(date())  # return the download date
  } else {
    return("Data already exists.")
  }
}

dataDownloaded <- download_and_prepare_data()
## Warning in unzip("./data/data.zip", exdir = "./data/raw"): write error in
## extracting from zip file
print(dataDownloaded)
## [1] "Wed Jun 25 13:34:33 2025"
list.files("./data/raw/final/en_US")
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"
end <- Sys.time()
end-start
## Time difference of 49.70516 secs

Read in data

start<-Sys.time()

blogs <- read_lines("./data/raw/final/en_US/en_US.blogs.txt", skip_empty_rows = TRUE, locale = locale(encoding = "UTF-8"))
news <- read_lines("./data/raw/final/en_US/en_US.news.txt", skip_empty_rows = TRUE, locale = locale(encoding = "UTF-8"))
twitter <- suppressWarnings(read_lines("./data/raw/final/en_US/en_US.twitter.txt", skip_empty_rows = TRUE, locale = locale(encoding = "UTF-8")))

Exploratory Data Analysis

Here is a basic summary of the data by source.

MB, words and lines per source

blog_path    <- "./data/raw/final/en_US/en_US.blogs.txt"
news_path    <- "./data/raw/final/en_US/en_US.news.txt"
twitter_path <- "./data/raw/final/en_US/en_US.twitter.txt"
        
# Word count per line
count_words <- function(text) str_count(text, "\\S+")
blogs_words   <- count_words(blogs)
news_words    <- count_words(news)
twitter_words <- count_words(twitter)

# Totals
lines_total <- c(length(blogs), length(news), length(twitter))
words_total <- c(sum(blogs_words), sum(news_words), sum(twitter_words))

total_lines_all <- sum(lines_total)
total_words_all <- sum(words_total)

# File sizes in MB
file_sizes_bytes <- c(
  Blogs   = file.info(blog_path)$size,
  News    = file.info(news_path)$size,
  Twitter = file.info(twitter_path)$size
)
file_sizes_mb <- round(file_sizes_bytes / (1024^2), 2)
total_mb <- sum(file_sizes_mb)

# Build tibble
source_stats <- tibble(
  source = c("Blogs", "News", "Twitter"),
  file_size_mb = as.numeric(file_sizes_mb),
  percent_of_total_mb = round(file_sizes_mb / total_mb * 100, 1),
  lines = lines_total,
  total_words = words_total,
  avg_words_per_line = round(words_total / lines_total, 2),
  percent_of_total_lines = round(lines_total / total_lines_all * 100, 1),
  percent_of_total_words = round(words_total / total_words_all * 100, 1)
)

# View result
source_stats

We observe that Twitter has the smallest file size but many more lines than blogs and news. Its total word count and its average number of words per line are much lower than those of blogs and news.

Example lines from each data source

Next, we explore the data intuitively by printing the first few lines of each source:

cat("BLOG EXAMPLES:\n\n", paste(blogs[1:5], collapse = "\n\n"), "\n\n")
## BLOG EXAMPLES:
## 
##  In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
## 
## We love you Mr. Brown.
## 
## Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
## 
## so anyways, i am going to share some home decor inspiration that i have been storing in my folder on the puter. i have all these amazing images stored away ready to come to life when we get our home.
## 
## With graduation season right around the corner, Nancy has whipped up a fun set to help you out with not only your graduation cards and gifts, but any occasion that brings on a change in one's life. I stamped the images in Memento Tuxedo Black and cut them out with circle Nestabilities. I embossed the kraft and red cardstock with TE's new Stars Impressions Plate, which is double sided and gives you 2 fantastic patterns. You can see how to use the Impressions Plates in this tutorial Taylor created. Just one pass through your die cut machine using the Embossing Pad Kit is all you need to do - super easy!
cat("NEWS EXAMPLES:\n\n", paste(news[1:5], collapse = "\n\n"), "\n\n")
## NEWS EXAMPLES:
## 
##  He wasn't home alone, apparently.
## 
## The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
## 
## WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
## 
## The Alaimo Group of Mount Holly was up for a contract last fall to evaluate and suggest improvements to Trenton Water Works. But campaign finance records released this week show the two employees donated a total of $4,500 to the political action committee (PAC) Partners for Progress in early June. Partners for Progress reported it gave more than $10,000 in both direct and in-kind contributions to Mayor Tony Mack in the two weeks leading up to his victory in the mayoral runoff election June 15.
## 
## And when it's often difficult to predict a law's impact, legislators should think twice before carrying any bill. Is it absolutely necessary? Is it an issue serious enough to merit their attention? Will it definitely not make the situation worse?
cat("TWITTER EXAMPLES:\n\n", paste(twitter[1:5], collapse = "\n\n"), "\n\n")
## TWITTER EXAMPLES:
## 
##  How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
## 
## When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
## 
## they've decided its more fun if I don't.
## 
## So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)
## 
## Words from a complete stranger! Made my birthday even better :)

We see that the language on Twitter is very different: it contains more slang, special characters and abbreviations. On the one hand, it has the potential to enrich our algorithm by adding diversity; on the other hand, we must be careful not to train our model on erroneous or confusing data.

We also observe again that Twitter lines are much shorter than those in blogs and news. This is understandable, since Twitter limits the number of characters per post.

In the following, we deepen our exploration into the word count per line with a histogram to see the distribution and a boxplot to track the outliers:

# Calculate words per line for each source
words_per_line_blog <- count_words(blogs)
words_per_line_news <- count_words(news)
words_per_line_twitter <- count_words(twitter)

# Combine into a single vector for plotting
words_per_line_data <- c(words_per_line_blog, words_per_line_news, words_per_line_twitter)
source_labels <- rep(c("Blogs", "News", "Twitter"), times = c(length(blogs), length(news), length(twitter)))

# Create a data frame for ggplot
words_per_line_df <- tibble(words_per_line = words_per_line_data, source = source_labels)

# Plot separate histograms stacked vertically using faceting
ggplot(words_per_line_df, aes(x = words_per_line, fill = source)) +
  geom_histogram(binwidth = 5, color = "black", alpha = 0.7) +
  labs(title = "Histogram of Words per Line",
       x = "Words per Line",
       y = "Frequency") +
  theme_minimal() +
  scale_x_continuous(limits = c(0, 250)) +  # Limit x-axis to 250 words
  facet_wrap(~source, scales = "free_y", ncol = 1)  # Stack histograms vertically with 1 column
## Warning: Removed 3895 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_bar()`).

The histogram also shows that a sample with the same number of lines per source might be biased, because the average number of words per line differs so much between sources. Instead, we should sample the same number of words per source.
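To make the bias concrete, here is a small arithmetic sketch. The average words per line used below are hypothetical round numbers chosen for illustration, not the measured values from the summary table.

```r
# Hypothetical average words per line (illustrative values, not measured results)
avg_words <- c(Blogs = 40, News = 35, Twitter = 13)

# Option 1: sample the same number of lines from each source
lines_per_source <- 1000
words_by_lines <- avg_words * lines_per_source
share_by_lines <- round(words_by_lines / sum(words_by_lines), 3)

# Option 2: sample the same number of words from each source
words_per_source <- 100000
lines_needed <- ceiling(words_per_source / avg_words)

share_by_lines  # word share per source under equal-line sampling
lines_needed    # lines required per source under equal-word sampling
```

With values like these, equal-line sampling would leave Twitter with a clearly smaller share of the words, while equal-word sampling simply requires drawing more Twitter lines.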

# Create a boxplot of words per line by source
words_per_line_df <- tibble(
  words_per_line = c(words_per_line_blog, words_per_line_news, words_per_line_twitter),
  source = rep(c("Blogs", "News", "Twitter"), times = c(length(blogs), length(news), length(twitter)))
)

ggplot(words_per_line_df, aes(x = source, y = words_per_line, fill = source)) +
  geom_boxplot() +
  labs(title = "Box Plot of Words per Line by Source",
       x = "Source",
       y = "Words per Line") +
  theme_minimal()

The boxplot shows a few extreme outliers in blogs. We check them to make sure they are valid entries, because otherwise they might bias our data:

# Reuse the per-line word counts for blogs computed above (blogs_words)

# Select lines with more than 6000 words
blogs_more_than_6000_words <- blogs[blogs_words > 6000]

# Display the first 1000 characters of each of these lines
blogs_more_than_6000_words_snippets <- substr(blogs_more_than_6000_words, 1, 1000)
blogs_more_than_6000_words_snippets
## [1] "UPDATE AS OF 11:30 A.M. EDT, MONDAY, APRIL 11: No damage to Japan's nuclear power plants was reported today after another strong aftershock hit the northeast coast. The temblor, measured at magnitude 6.6 by the U.S. Geological Survey, rocked the country one month after the magnitude 9.0 earthquake and tsunami struck March 11, damaging the Fukushima Daiichi nuclear power plant. A magnitude 7.1 aftershock rattled Japan April 7. The Monday earthquake prompted the temporary evacuation of workers at the plant and interrupted the offsite electric power supply for less than an hour. Injection of cooling water to reactors 1, 2, and 3 resumed within an hour. Officials reported no new damage or increased radiation levels. Workers continued to spray water into the spent fuel pools of reactors 1-4 as needed. As an additional safety measure, Tokyo Electric Power Co. (TEPCO) has brought additional diesel generators to the site as a backup in case offsite power is disabled. Preparations are being mad"
## [2] "THE relationship between business and the public has become closer in the past few decades. Business to-day is taking the public into partnership. A number of causes, some economic, others due to the growing public understanding of business and the public interest in business, have produced this situation. Business realizes that its relationship to the public is not confined to the manufacture and sale of a given product, but includes at the same time the selling of itself and of all those things for which it stands in the public mind. Twenty or twenty-five years ago, business sought to run its own affairs regardless of the public. The reaction was the muck-raking period, in which a multitude of sins were, justly and unjustly, laid to the charge of the interests. In the face of an aroused public conscience the large corporations were obliged to renounce their contention that their affairs were nobody's business. If to-day big business were to seek to throttle the public, a new reaction"

After checking, everything seems to be OK with the biggest outliers.

To summarize, we need to be careful to:

  • include data in our sample by word count rather than by line count.
  • thoroughly clean the data, especially the Twitter data.

What we could explore in a deeper analysis (but will not do here)

We could go deeper with a sentiment analysis or topic modeling. We would probably see that the sources differ in the sentiments they express and in the topics that are most prominent.

Data preparation

Next, we prepare the data:

Creating sample for model evaluation

set.seed(1234567)
start <- Sys.time()

# Create or load sample for model evaluation
if (file.exists("data/sample_data_eval_raw.rds")) {
  message("Loading existing sample...")
  sample_data_eval_raw <- readRDS("data/sample_data_eval_raw.rds")
} else {
  message("Sample not found. Creating a new one...")

  # Function to sample lines until target word count is reached
  sample_by_word_target <- function(text_vector, target_words) {

    # Randomly shuffle lines
    shuffled <- sample(text_vector)

    # Count cumulative words per line
    word_counts <- str_count(shuffled, "\\S+")
    cum_words <- cumsum(word_counts)

    # Get enough lines to meet the word target
    cutoff_index <- which(cum_words >= target_words)[1]
    sampled <- shuffled[1:cutoff_index]

    return(sampled)
  }

  blogs_sample_eval   <- sample_by_word_target(blogs, 100000)
  news_sample_eval    <- sample_by_word_target(news, 100000)
  twitter_sample_eval <- sample_by_word_target(twitter, 100000)
  # Check resulting word count
  message("Blogs:", sum(str_count(blogs_sample_eval, "\\S+")), "words\n")
  message("News:", sum(str_count(news_sample_eval, "\\S+")), "words\n")
  message("Twitter:", sum(str_count(twitter_sample_eval, "\\S+")), "words\n")

  # Combine samples
  sample_data_eval_raw <- c(blogs_sample_eval, news_sample_eval, twitter_sample_eval)

  # Save for reuse
  saveRDS(sample_data_eval_raw, "data/sample_data_eval_raw.rds")
}
## Sample not found. Creating a new one...
## Blogs:100021words
## News:100025words
## Twitter:100002words
end <- Sys.time()
end-start
## Time difference of 23.20675 secs

Light pre-clean of model evaluation data

Twitter data in particular often includes mentions (@user), hashtags (#topic), URLs, emojis/symbols and excessive whitespace or line breaks. That is why we pre-clean the data:

if (file.exists("data/sample_data_eval_clean.rds")) {
  message("Loading existing clean sample...")
  sample_data_eval_clean <- readRDS("data/sample_data_eval_clean.rds")
} else {
  message("Creating clean sample...")
  
  # Initial string cleaning steps
  sample_data_eval_clean <- sample_data_eval_raw %>%
    str_to_lower() %>%  # Convert to lowercase
    str_replace_all("http\\S+", "") %>%  # Remove URLs
    str_replace_all("@\\w+", "") %>%  # Remove mentions
    str_replace_all("#\\w+", "") %>%  # Remove hashtags
    str_replace_all("[^\\w\\s']", " ") %>%  # Remove symbols except apostrophes
    str_squish() # Remove excessive whitespace and normalize
  
  # Convert cleaned text to a tibble
  sample_data_eval_clean <- tibble(text=sample_data_eval_clean)
  
  # Filter out lines that end in a repeated word (like "rain rain rain rain")
  sample_data_eval_clean <- sample_data_eval_clean %>%
    filter(!str_detect(text, "(\\b\\w+\\b)(\\s+\\1)+$"))

  # Expand contractions
  expand_contractions <- function(text) {
    text %>%
      str_replace_all("\\bi'm\\b", "i am") %>%
      str_replace_all("\\bcan't\\b", "cannot") %>%
      str_replace_all("\\bwon't\\b", "will not") %>%
      str_replace_all("\\bit's\\b", "it is") %>%
      str_replace_all("\\bthat's\\b", "that is") %>%
      str_replace_all("\\bwe're\\b", "we are") %>%
      str_replace_all("\\bthey're\\b", "they are") %>%
      str_replace_all("\\bdon't\\b", "do not")
  }

  sample_data_eval_clean <- sample_data_eval_clean %>%
    mutate(text = expand_contractions(text))
  print(head(sample_data_eval_clean))

  # Save the cleaned data
  saveRDS(sample_data_eval_clean, "data/sample_data_eval_clean.rds")
  
  # Clean up
  rm(sample_data_eval_raw)
  gc()
}
## Creating clean sample...
## # A tibble: 6 × 1
##   text                                                                          
##   <chr>                                                                         
## 1 yet it seems i must go away and leave it all behind to find my future         
## 2 06 fast one                                                                   
## 3 there is a big chance when you open the fridge in my house there will be a pa…
## 4 i as the ceo of this multinational company hereby relate with you because of …
## 5 the following are synopses of the incidents                                   
## 6 before picking any match for yourself you must ensure that the site you are r…
##             used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   6907428 368.9   10673960  570.1         NA   9580768  511.7
## Vcells 110882117 846.0  227927202 1739.0      16384 227927202 1739.0
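To illustrate what the string-cleaning steps do, here is a minimal sketch that applies the same regular expressions, in the same order, to a single made-up line. The example text is hypothetical and not taken from the data set.

```r
library(stringr)

# Made-up example line (hypothetical, not taken from the data set)
raw_line <- "Check out this #rstats post by @someone: https://example.com/x It's GREAT!!! :)"

step <- str_to_lower(raw_line)                     # convert to lowercase
step <- str_replace_all(step, "http\\S+", "")      # remove URLs
step <- str_replace_all(step, "@\\w+", "")         # remove mentions
step <- str_replace_all(step, "#\\w+", "")         # remove hashtags (the whole tag is dropped)
step <- str_replace_all(step, "[^\\w\\s']", " ")   # replace symbols with spaces, keep apostrophes
cleaned_line <- str_squish(step)                   # collapse repeated whitespace

cleaned_line
```

Note that the hashtag rule removes the tagged word itself, so a word like "rstats" does not survive cleaning. Whether that is desirable for next-word prediction is a design choice worth revisiting.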

Next, we further clean the data and tokenize it.

Clean and tokenize

clean_and_tokenize <- function(text_vector) {
  # Create corpus
  corp <- corpus(text_vector)

  # Tokenize and clean
  toks <- tokens(
    corp,
    remove_punct = TRUE,
    remove_numbers = TRUE,
    remove_symbols = TRUE
  )

  # Load and clean bad words (profanity filtering)
  if (!file.exists("./data/badwords.txt")) {
    url_bad_words <- "https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/badwordslist/badwords.txt"
    download.file(url_bad_words, "./data/badwords.txt")
  }

  bad_words <- readLines("./data/badwords.txt", warn = FALSE) %>%
    str_trim() %>%
    tolower()
  bad_words <- bad_words[bad_words != ""]

  # Remove bad words (profanity filtering)
  toks_clean <- tokens_remove(toks, pattern = bad_words, valuetype = "fixed")

  message("Tokenization and profanity filtering complete for current dataset.")
  return(toks_clean)
}
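As a quick illustration of what this function does, here is a minimal sketch on a single made-up sentence. The input text and the one-word stand-in for the profanity list are both hypothetical. The sketch uses the same quanteda calls as above.

```r
library(quanteda)

# Made-up input and a stand-in "bad word" list (both hypothetical)
toy_text <- c(doc1 = "I have 2 darn cats!")
toy_bad_words <- c("darn")

# Tokenize, dropping punctuation, numbers and symbols
toy_toks <- tokens(
  toy_text,
  remove_punct = TRUE,
  remove_numbers = TRUE,
  remove_symbols = TRUE
)

# Remove the listed words by exact (fixed) match
toy_toks <- tokens_remove(toy_toks, pattern = toy_bad_words, valuetype = "fixed")

as.list(toy_toks)
```

The number, the exclamation mark and the listed word are dropped, and the remaining tokens keep their original order.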

Save tokens

We save the tokens in a separate file:

if (file.exists("data/tokens_all.rds")) {
  tokens_all <- readRDS("data/tokens_all.rds")
} else {
  tokens_all <- clean_and_tokenize(sample_data_eval_clean)
  saveRDS(tokens_all, "data/tokens_all.rds")
}

Split cleaned data

Now that we have the cleaned evaluation data, we can split it into a training and test set for future model evaluation.

# Split the tokenized data into training and testing sets
set.seed(42)  # Set a seed for reproducibility

# Create indices for 80% train, 20% test split
indices <- sample(seq_along(tokens_all), size = 0.8 * length(tokens_all))

# Split tokens_all into tokens_train and tokens_test
tokens_train <- tokens_all[indices]
tokens_test <- tokens_all[-indices]

# Report the sizes of the two sets
message("Split complete: ", length(tokens_train), " training lines, ", length(tokens_test), " testing lines.")
## Split complete: 10499 training lines, 2625 testing lines.
# Check if split worked
length(tokens_train) / length(tokens_all)  # Should be ≈ 0.8
## [1] 0.7999848
length(tokens_test) / length(tokens_all)   # Should be ≈ 0.2
## [1] 0.2000152
# For training
if (file.exists("data/tokens_train.rds")) {
  tokens_train <- readRDS("data/tokens_train.rds")
} else {
  saveRDS(tokens_train, "data/tokens_train.rds")
}

# For testing
if (file.exists("data/tokens_test.rds")) {
  tokens_test <- readRDS("data/tokens_test.rds")
} else {
  saveRDS(tokens_test, "data/tokens_test.rds")
}
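The split relies on negative indexing, which by construction gives two disjoint sets. If we wanted to verify this explicitly, a minimal sketch could look as follows. It uses a toy vector as a stand-in for the tokenized lines, and all names in it are illustrative.

```r
# Toy stand-in for the tokenized lines (hypothetical data)
toy_docs <- paste("line", 1:10)

set.seed(42)
n_train <- floor(0.8 * length(toy_docs))
train_idx <- sample(seq_along(toy_docs), size = n_train)
test_idx <- setdiff(seq_along(toy_docs), train_idx)

# The two index sets should be disjoint and together cover every line
no_overlap <- length(intersect(train_idx, test_idx)) == 0
full_cover <- setequal(union(train_idx, test_idx), seq_along(toy_docs))

c(no_overlap = no_overlap, full_cover = full_cover)
```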

Generate n-gram tables

A common approach to building a text prediction model is based on n-gram frequency tables, so we start there.

Create function that generates n-gram frequency tables

build_ngram_tables <- function(tokens_input, min_frequency = 2) {
  # Unigram table
  unigram_table <- dfm(tokens_input) %>% 
    textstat_frequency() %>% 
    rename(next_word = feature) %>% 
    select(next_word, frequency) %>%
    filter(frequency >= min_frequency)  # Filter out n-grams with frequency < 2

  # Bigram table
  bigram_table <- tokens_ngrams(tokens_input, n = 2) %>% 
    dfm() %>% 
    textstat_frequency() %>% 
    separate(feature, into = c("prefix", "next_word"), sep = "_", extra = "merge") %>%
    select(prefix, next_word, frequency) %>%
    filter(frequency >= min_frequency)  # Filter out n-grams with frequency < min_frequency

  # Trigram table
  trigram_table <- tokens_ngrams(tokens_input, n = 3) %>% 
    dfm() %>% 
    textstat_frequency() %>% 
    separate(feature, into = c("w1", "w2", "next_word"), sep = "_", extra = "merge") %>% 
    mutate(prefix = paste(w1, w2)) %>% 
    select(prefix, next_word, frequency) %>%
    filter(frequency >= min_frequency)  # Filter out n-grams with frequency < 2

  # Fourgram table
  fourgram_table <- tokens_ngrams(tokens_input, n = 4) %>% 
    dfm() %>% 
    textstat_frequency() %>% 
    separate(feature, into = c("w1", "w2", "w3", "next_word"), sep = "_", extra = "merge") %>% 
    mutate(prefix = paste(w1, w2, w3)) %>% 
    select(prefix, next_word, frequency) %>%
    filter(frequency >= min_frequency)  # Filter out n-grams with frequency < 2

  return(list(
    unigram_table = unigram_table,
    bigram_table = bigram_table,
    trigram_table = trigram_table,
    fourgram_table = fourgram_table
  ))
}
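To show the shape of such a table, here is a minimal base R sketch that builds a bigram table with the same `prefix`, `next_word` and `frequency` columns from a made-up token sequence. It is only an illustration of the idea, not a replacement for the quanteda-based function above.

```r
# Made-up token sequence (hypothetical, for illustration only)
toy_tokens <- c("i", "am", "here", "and", "i", "am", "happy")

# Pair each token with the one that follows it
toy_bigrams <- data.frame(
  prefix = head(toy_tokens, -1),
  next_word = tail(toy_tokens, -1),
  stringsAsFactors = FALSE
)

# Count how often each (prefix, next_word) pair occurs
toy_bigram_table <- aggregate(
  frequency ~ prefix + next_word,
  data = transform(toy_bigrams, frequency = 1),
  FUN = sum
)

# Most frequent pairs first
toy_bigram_table <- toy_bigram_table[order(-toy_bigram_table$frequency), ]
toy_bigram_table
```

Looking up a prefix in such a table and ranking the matching rows by `frequency` is the basic operation that any backoff-style predictor builds on.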

Applying function on tokens_train

if (
  file.exists("data/train_unigram_table.rds") &&
  file.exists("data/train_bigram_table.rds") &&
  file.exists("data/train_trigram_table.rds") &&
  file.exists("data/train_fourgram_table.rds")
) {
  message("Training n-gram tables already exist. Skipping build.")
  train_unigram_table <- readRDS("data/train_unigram_table.rds")
  train_bigram_table <- readRDS("data/train_bigram_table.rds")
  train_trigram_table <- readRDS("data/train_trigram_table.rds")
  train_fourgram_table <- readRDS("data/train_fourgram_table.rds")
} else {
  message("Building training n-gram tables...")

  train_ngrams <- build_ngram_tables(tokens_train)

  train_unigram_table <- train_ngrams$unigram_table
  train_bigram_table <- train_ngrams$bigram_table
  train_trigram_table <- train_ngrams$trigram_table
  train_fourgram_table <- train_ngrams$fourgram_table
  # Save each table
  saveRDS(train_ngrams$unigram_table,  "data/train_unigram_table.rds")
  saveRDS(train_ngrams$bigram_table,   "data/train_bigram_table.rds")
  saveRDS(train_ngrams$trigram_table,  "data/train_trigram_table.rds")
  saveRDS(train_ngrams$fourgram_table, "data/train_fourgram_table.rds")

  message("Training n-gram tables saved to /data.")
}
## Building training n-gram tables...
## Training n-gram tables saved to /data.
rm(sample_data_eval_clean)
gc()
##             used  (Mb) gc trigger   (Mb) limit (Mb)  max used   (Mb)
## Ncells   7026591 375.3   14260221  761.6         NA  14260221  761.6
## Vcells 111586787 851.4  227927202 1739.0      16384 227927202 1739.0

Visualize n-grams

library(ggplot2)
library(dplyr)
library(patchwork)

plot_ngrams <- function(n = 20) {
  # --- Unigrams ---
  top_unigrams <- train_unigram_table %>%
    arrange(desc(frequency)) %>%
    slice_head(n = n)

  plot_uni <- ggplot(top_unigrams, aes(x = reorder(next_word, frequency), y = frequency)) +
    geom_col(fill = "dodgerblue4") +
    coord_flip() +
    labs(title = paste("Top", n, "Unigrams"), x = "Unigram", y = "Frequency") +
    theme_minimal()

  # --- Bigrams ---
  top_bigrams <- train_bigram_table %>%
    arrange(desc(frequency)) %>%
    mutate(ngram = paste(prefix, next_word, sep = " ")) %>%
    slice_head(n = n)

  plot_bi <- ggplot(top_bigrams, aes(x = reorder(ngram, frequency), y = frequency)) +
    geom_col(fill = "forestgreen") +
    coord_flip() +
    labs(title = paste("Top", n, "Bigrams"), x = "Bigram", y = "Frequency") +
    theme_minimal()

  # --- Trigrams ---
  top_trigrams <- train_trigram_table %>%
    arrange(desc(frequency)) %>%
    mutate(ngram = paste(prefix, next_word, sep = " ")) %>%
    slice_head(n = n)

  plot_tri <- ggplot(top_trigrams, aes(x = reorder(ngram, frequency), y = frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(title = paste("Top", n, "Trigrams"), x = "Trigram", y = "Frequency") +
    theme_minimal()

  # --- Fourgrams ---
  top_fourgrams <- train_fourgram_table %>%
    arrange(desc(frequency)) %>%
    mutate(ngram = paste(prefix, next_word, sep = " ")) %>%
    slice_head(n = n)

  plot_four <- ggplot(top_fourgrams, aes(x = reorder(ngram, frequency), y = frequency)) +
    geom_col(fill = "darkred") +
    coord_flip() +
    labs(title = paste("Top", n, "Fourgrams"), x = "Fourgram", y = "Frequency") +
    theme_minimal()

  # --- Combine Plots ---
  combined_plot <- (plot_uni | plot_bi) / (plot_tri | plot_four)

  # Return the combined plot
  return(combined_plot)
}

# Example usage:
# Call the function with n = 20
plot_ngrams(20)

Looking forward to creating a prediction algorithm and the Shiny app

Prediction algorithm

Next, I will try several prediction algorithms with my n-gram tables and choose the one that performs best: a simple backoff model, a stupid backoff model and an additive (add-k) smoothing model. I might also try to combine the models to increase accuracy. One main issue will probably be making sure that the very frequent unigrams do not outweigh less frequent higher-order n-grams that might be more accurate.
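As a first idea of what the stupid backoff variant could look like, here is a minimal sketch on toy tables. It is not the final model: the tables, their counts and the function name `predict_stupid_backoff` are hypothetical, the sketch stops at trigrams for brevity, and the prefix count is approximated by summing the frequencies of all table rows that share the prefix. The discount `alpha = 0.4` is the value commonly used with stupid backoff.

```r
# Toy n-gram tables (hypothetical counts, not derived from the corpus)
toy_unigrams <- data.frame(next_word = c("the", "you", "day"),
                           frequency = c(50, 30, 20),
                           stringsAsFactors = FALSE)
toy_bigrams <- data.frame(prefix = c("thank", "thank", "for"),
                          next_word = c("you", "the", "the"),
                          frequency = c(8, 2, 5),
                          stringsAsFactors = FALSE)
toy_trigrams <- data.frame(prefix = "i thank",
                           next_word = "you",
                           frequency = 3,
                           stringsAsFactors = FALSE)

# Score each candidate by its relative frequency at the highest order where it
# was seen; every backoff step to a shorter prefix multiplies the score by alpha.
predict_stupid_backoff <- function(input, alpha = 0.4, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  candidates <- data.frame(next_word = character(0), score = numeric(0),
                           stringsAsFactors = FALSE)
  penalty <- 1
  levels <- list(list(tab = toy_trigrams, n = 2),   # highest order first
                 list(tab = toy_bigrams, n = 1))
  for (lvl in levels) {
    if (length(words) >= lvl$n) {
      prefix <- paste(tail(words, lvl$n), collapse = " ")
      hits <- lvl$tab[lvl$tab$prefix == prefix, ]
      if (nrow(hits) > 0) {
        candidates <- rbind(candidates, data.frame(
          next_word = hits$next_word,
          score = penalty * hits$frequency / sum(hits$frequency),
          stringsAsFactors = FALSE))
      }
      penalty <- penalty * alpha  # discount everything found at lower orders
    }
  }
  # Unigrams are the last fallback
  candidates <- rbind(candidates, data.frame(
    next_word = toy_unigrams$next_word,
    score = penalty * toy_unigrams$frequency / sum(toy_unigrams$frequency),
    stringsAsFactors = FALSE))
  # Keep only the highest-order score for each word (its first occurrence)
  candidates <- candidates[!duplicated(candidates$next_word), ]
  head(candidates[order(-candidates$score), ], k)
}

predict_stupid_backoff("I thank")
```

Because lower-order scores are multiplied by `alpha` at each step, a very frequent unigram such as "the" can only outrank a higher-order match if that match is itself weak, which addresses the concern about frequent unigrams dominating.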

Shiny App

After that, I will create a Shiny app in which the user can type text and the app outputs the three most likely next words, based on the last one to three words of the user input.
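One small building block for the app is extracting the last one to three words from whatever the user has typed. The following base R sketch shows one way this could work. The helper names `last_words` and `candidate_prefixes` are hypothetical, and in the app they would presumably be called from the Shiny server function.

```r
# Return the last `max_n` words of the user input, lowercased (hypothetical helper)
last_words <- function(input, max_n = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[words != ""]
  tail(words, max_n)
}

# Build the prefixes to look up, longest first, e.g. for the
# fourgram, trigram and bigram tables in turn (hypothetical helper)
candidate_prefixes <- function(input, max_n = 3) {
  words <- last_words(input, max_n)
  if (length(words) == 0) return(character(0))
  vapply(
    seq_along(words),
    function(i) paste(words[i:length(words)], collapse = " "),
    character(1)
  )
}

candidate_prefixes("Thanks for the great")
```

For real input, the same cleaning steps used on the training data would also need to be applied to the user text before the lookup, so that prefixes match the entries in the n-gram tables.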