## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Package version: 4.2.0
## Unicode version: 14.0
## ICU version: 71.1
##
## Parallel computing: disabled
##
## See https://quanteda.io for tutorials and examples.
The goal of this project is to develop a predictive text model using a combination of Twitter, blogs, and news datasets. This report outlines the exploratory analysis performed on these datasets and discusses plans for building a predictive model. Due to computational constraints, the prototype model uses only the blogs data, which is sufficient for a first version.
Summary Statistics

To begin, we calculated basic statistics for each dataset:
##                file  filesize_mb    lines     words  longest_line
## 1 en_US.twitter.txt     159.3641  2360148  30373583           140
## 2   en_US.blogs.txt     200.4242   899288  37334131         40833
## 3    en_US.news.txt     196.2775  1010242  34372530         11384
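The code that produced these figures is not included in the report; a minimal sketch of how such statistics could be computed (the file paths and the `file_stats()` helper are assumptions) might look like this:

```r
library(stringi)

# Hypothetical helper: size in MB, line count, word count, and longest line for one file
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file         = basename(path),
    filesize_mb  = file.size(path) / 1024^2,
    lines        = length(lines),
    words        = sum(stri_count_words(lines)),
    longest_line = max(nchar(lines))
  )
}

files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
do.call(rbind, lapply(files, file_stats))
```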
Next, we analyzed word frequencies across the combined dataset.
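The frequency counts themselves are not reproduced here; since quanteda is loaded above, a sketch along the following lines (the file locations and the sampling step are assumptions) would produce them:

```r
library(quanteda)

# Assumed: the three raw files sit in the working directory; sampling keeps memory use manageable
corpus_lines <- unlist(lapply(
  c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"),
  readLines, encoding = "UTF-8", skipNul = TRUE
))
set.seed(42)
corpus_sample <- sample(corpus_lines, 50000)

toks <- tokens(corpus_sample, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_tolower()

# 20 most frequent words in the sample
topfeatures(dfm(toks), 20)
```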
To predict the next word in a sequence, we built a basic n-gram model using unigrams, bigrams, and trigrams. Because of processing-power limitations, the algorithm is currently limited to the blogs data. The trigram model is the deepest level of the analysis, so the algorithm looks back at most two words when deciding which word to suggest next. The distributions of the most popular bi- and trigrams are given in the chart below.
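The n-gram construction code is likewise omitted from the report; a minimal sketch of how the unigram, bigram, and trigram frequency tables could be built with quanteda (the object names are assumptions) is:

```r
library(quanteda)

# Assumed: blog_lines is a (possibly sampled) character vector read from en_US.blogs.txt
blog_lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

blog_tokens <- tokens(blog_lines, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_tolower()

# Frequency tables for 1-, 2-, and 3-grams (quanteda joins n-gram words with "_")
unigrams <- topfeatures(dfm(blog_tokens), 1e5)
bigrams  <- topfeatures(dfm(tokens_ngrams(blog_tokens, n = 2)), 1e5)
trigrams <- topfeatures(dfm(tokens_ngrams(blog_tokens, n = 3)), 1e5)

head(bigrams)  # named counts of the most frequent word pairs
```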
Handling Unseen N-Grams

To handle unseen n-grams, we implemented a backoff model that estimates probabilities by falling back to lower-order n-grams.
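The backoff code is not shown in the report; a simplified sketch of the idea, using the `unigrams`/`bigrams`/`trigrams` tables from the previous sketch (names and structure are assumptions), is:

```r
# Simple backoff: try trigrams keyed on the last two words, then bigrams
# keyed on the last word, then fall back to the most frequent unigram.
predict_next <- function(phrase, unigrams, bigrams, trigrams) {
  w <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  n <- length(w)

  if (n >= 2) {
    hits <- trigrams[startsWith(names(trigrams), paste0(w[n - 1], "_", w[n], "_"))]
    if (length(hits) > 0) return(sub(".*_", "", names(hits)[which.max(hits)]))
  }
  if (n >= 1) {
    hits <- bigrams[startsWith(names(bigrams), paste0(w[n], "_"))]
    if (length(hits) > 0) return(sub(".*_", "", names(hits)[which.max(hits)]))
  }
  names(unigrams)[1]  # last resort: most frequent word overall
}

predict_next("I love to eat", unigrams, bigrams, trigrams)
```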
Example runs:

input phrase: I love to eat... result:
## [1] "and"
input phrase: I love to eat and... result:
## [1] "drink"
input phrase: I am going to the... result:
## [1] "point"
input phrase: The company will... result:
## [1] "be"
input phrase: I will... result:
## [1] "be"
Our next steps include optimizing the model for size and run time efficiency and developing a Shiny app for user interaction.
We can see that the basic text completion is functional but relatively limited due to the 3-gram maximum design. The result delivers a basic prototype of an autocomplete suggestion engine that will be further refined and published as a Shiny app.
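The Shiny app itself has not been built yet; a minimal sketch of how it could wrap the prediction function (the saved `ngram_model.RData` file and the `predict_next()` helper from the earlier sketches are assumptions) is:

```r
library(shiny)

# Assumed: the n-gram tables and predict_next() were saved to ngram_model.RData
load("ngram_model.RData")

ui <- fluidPage(
  titlePanel("Next-word prediction prototype"),
  textInput("phrase", "Type a phrase:", value = "I love to eat"),
  verbatimTextOutput("suggestion")
)

server <- function(input, output, session) {
  output$suggestion <- renderText({
    req(input$phrase)
    predict_next(input$phrase, unigrams, bigrams, trigrams)
  })
}

shinyApp(ui, server)
```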