## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Package version: 4.2.0
## Unicode version: 14.0
## ICU version: 71.1
##
## Parallel computing: disabled
##
## See https://quanteda.io for tutorials and examples.
The goal of this project is to develop a predictive text model using a combination of Twitter, blogs, and news datasets. This report outlines the exploratory analysis performed on these datasets and discusses plans for building a predictive model. Due to computational constraints, the prototype model uses only the blogs data, which is sufficient for a first version.
Summary Statistics

To begin, we calculated basic statistics for each dataset:
##                file  filesize_mb    lines     words  longest_line
## 1 en_US.twitter.txt     159.3641  2360148  30373583           140
## 2   en_US.blogs.txt     200.4242   899288  37334131         40833
## 3    en_US.news.txt     196.2775  1010242  34372530         11384
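The code that produced these figures is not included in the report; a minimal sketch of how such statistics could be computed (the file paths and the `file_stats()` helper are assumptions) might look like this:

```r
library(stringi)

# Hypothetical helper: size in MB, line count, word count, and longest line for one file
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    file         = basename(path),
    filesize_mb  = file.size(path) / 1024^2,
    lines        = length(lines),
    words        = sum(stri_count_words(lines)),
    longest_line = max(nchar(lines))
  )
}

files <- c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt")
do.call(rbind, lapply(files, file_stats))
```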
Next, we analyzed word frequencies across the combined dataset.
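The frequency counts themselves are not reproduced here; since quanteda is loaded above, a sketch along the following lines (the file locations and the sampling step are assumptions) would produce them:

```r
library(quanteda)

# Assumed: the three raw files sit in the working directory; sampling keeps memory use manageable
corpus_lines <- unlist(lapply(
  c("en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"),
  readLines, encoding = "UTF-8", skipNul = TRUE
))
set.seed(42)
corpus_sample <- sample(corpus_lines, 50000)

toks <- tokens(corpus_sample, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_tolower()

# 20 most frequent words in the sample
topfeatures(dfm(toks), 20)
```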
To predict the next word in a sequence, we built a basic n-gram model using unigrams, bigrams, and trigrams. Because of processing-power limitations, the algorithm is currently limited to the blogs data. The trigram model is the deepest level of the analysis, so the algorithm looks back at most two words when deciding which word to suggest next. The distributions of the most popular bi- and trigrams are given in the chart below.
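The n-gram construction code is likewise omitted from the report; a minimal sketch of how the unigram, bigram, and trigram frequency tables could be built with quanteda (the object names are assumptions) is:

```r
library(quanteda)

# Assumed: blog_lines is a (possibly sampled) character vector read from en_US.blogs.txt
blog_lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

blog_tokens <- tokens(blog_lines, remove_punct = TRUE, remove_numbers = TRUE) |>
  tokens_tolower()

# Frequency tables for 1-, 2-, and 3-grams (quanteda joins n-gram words with "_")
unigrams <- topfeatures(dfm(blog_tokens), 1e5)
bigrams  <- topfeatures(dfm(tokens_ngrams(blog_tokens, n = 2)), 1e5)
trigrams <- topfeatures(dfm(tokens_ngrams(blog_tokens, n = 3)), 1e5)

head(bigrams)  # named counts of the most frequent word pairs
```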
Handling Unseen N-Grams

To handle unseen n-grams, we implemented a backoff model that estimates probabilities by falling back to lower-order n-grams.
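The backoff code is not shown in the report; a simplified sketch of the idea, using the `unigrams`/`bigrams`/`trigrams` tables from the previous sketch (names and structure are assumptions), is:

```r
# Simple backoff: try trigrams keyed on the last two words, then bigrams
# keyed on the last word, then fall back to the most frequent unigram.
predict_next <- function(phrase, unigrams, bigrams, trigrams) {
  w <- tolower(unlist(strsplit(trimws(phrase), "\\s+")))
  n <- length(w)

  if (n >= 2) {
    hits <- trigrams[startsWith(names(trigrams), paste0(w[n - 1], "_", w[n], "_"))]
    if (length(hits) > 0) return(sub(".*_", "", names(hits)[which.max(hits)]))
  }
  if (n >= 1) {
    hits <- bigrams[startsWith(names(bigrams), paste0(w[n], "_"))]
    if (length(hits) > 0) return(sub(".*_", "", names(hits)[which.max(hits)]))
  }
  names(unigrams)[1]  # last resort: most frequent word overall
}

predict_next("I love to eat", unigrams, bigrams, trigrams)
```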
Example runs:

input phrase: I love to eat... result:
## [1] "and"
input phrase: I love to eat and... result:
## [1] "drink"
input phrase: I am going to the... result:
## [1] "point"
input phrase: The company will... result:
## [1] "be"
input phrase: I will... result:
## [1] "be"
Our next steps include optimizing the model for size and run time efficiency and developing a Shiny app for user interaction.
We can see that the basic text completion is functional but relatively limited due to the 3-gram maximum design. The result delivers a basic prototype of an autocomplete suggestion engine that will be further refined and published as a Shiny app.
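The Shiny app itself has not been built yet; a minimal sketch of how it could wrap the prediction function (the saved `ngram_model.RData` file and the `predict_next()` helper from the earlier sketches are assumptions) is:

```r
library(shiny)

# Assumed: the n-gram tables and predict_next() were saved to ngram_model.RData
load("ngram_model.RData")

ui <- fluidPage(
  titlePanel("Next-word prediction prototype"),
  textInput("phrase", "Type a phrase:", value = "I love to eat"),
  verbatimTextOutput("suggestion")
)

server <- function(input, output, session) {
  output$suggestion <- renderText({
    req(input$phrase)
    predict_next(input$phrase, unigrams, bigrams, trigrams)
  })
}

shinyApp(ui, server)
```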