The goal of this project is to build a predictive text model that can suggest the next word based on previously typed words. This is a foundational step in developing a text prediction application similar to mobile keyboard autocomplete systems.
The final objective is to implement this model in a Shiny application.
The following libraries were used for text processing, analysis, and visualization:
library(tm)
library(stringi)
library(ggplot2)
library(dplyr)
library(tidytext)
library(data.table)
library(quanteda)
library(tidyr)
library(quanteda.textstats)
The dataset consists of three text sources:
- Blogs
- News articles
- Twitter posts
# Read each corpus as UTF-8, skipping embedded NUL characters
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
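To gauge the scale that motivates sampling, the line and word counts of each source can be summarized with stringi's stri_count_words() (a minimal sketch; corpus_summary is an illustrative name and the counts are computed, not quoted):

# Lines and total words per source (illustrative summary, not part of the pipeline)
corpus_summary <- data.frame(
  source = c("blogs", "news", "twitter"),
  lines = c(length(blogs), length(news), length(twitter)),
  words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
corpus_summary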
Due to the size of the full dataset, a 1% sample from each source was used for efficiency:
set.seed(123)
# Reproducible 1% sample from each source (floor() makes the integer size explicit)
sample_data <- c(
  sample(blogs, floor(length(blogs) * 0.01)),
  sample(news, floor(length(news) * 0.01)),
  sample(twitter, floor(length(twitter) * 0.01))
)
The text was cleaned by:
- converting to lowercase
- removing numbers
- removing punctuation
- removing extra whitespace
- dropping lines left empty after cleaning
clean_text <- tolower(sample_data)
clean_text <- removeNumbers(clean_text)
clean_text <- removePunctuation(clean_text)
clean_text <- stripWhitespace(clean_text)
clean_text <- clean_text[clean_text != ""]
# Tokenise the cleaned text and build a document-feature matrix of unigrams
tokens_data <- tokens(clean_text)
dfm_uni <- dfm(tokens_data)
# Rank words by corpus frequency and keep the top 20 for plotting
freq_unigram <- textstat_frequency(dfm_uni)
top_unigrams <- head(freq_unigram, 20)
# Plot the 20 most frequent unigrams
ggplot(top_unigrams, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Unigrams", x = "Words", y = "Frequency")
# Build bigrams; tokens_ngrams() joins the words of each n-gram with "_"
bigrams <- tokens_ngrams(tokens_data, n = 2)
dfm_bigrams <- dfm(bigrams)
freq_bigram <- textstat_frequency(dfm_bigrams)
# Keep the n-gram and its frequency under shorter names
bigram_df <- as.data.frame(freq_bigram) %>%
  select(ngram = feature, freq = frequency)
# Split on the "_" concatenator; splitting on " " would fill every row with NA
bigram_df <- bigram_df %>%
  separate(ngram, into = c("w1", "w2"), sep = "_")
trigrams <- tokens_ngrams(tokens_data, n = 3)
dfm_trigrams <- dfm(trigrams)
freq_trigram <- textstat_frequency(dfm_trigrams)
trigram_df <- as.data.frame(freq_trigram) %>%
  select(ngram = feature, freq = frequency)
# Again split on "_" rather than a space
trigram_df <- trigram_df %>%
  separate(ngram, into = c("w1", "w2", "w3"), sep = "_")
A predictive model was built using n-grams:

- Unigrams: single words
- Bigrams: word pairs
- Trigrams: word triplets

These models estimate the probability of the next word from observed n-gram counts; for example, P(w2 | w1) is approximated by count(w1 w2) / count(w1).

## 11. Prediction Functions

### Bigram Prediction
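Ranking candidates by raw frequency is equivalent to ranking by this maximum-likelihood estimate, because the denominator count(w1) is constant once the preceding word is fixed. As an illustrative sketch (bigram_prob and prob are names introduced here, not part of the pipeline), the explicit probabilities could be computed as:

# MLE of P(w2 | w1): each bigram's count divided by the total count
# of bigrams that share its first word
bigram_prob <- bigram_df %>%
  group_by(w1) %>%
  mutate(prob = freq / sum(freq)) %>%
  ungroup()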
# Return the three most frequent words observed after `word`
predict_bigram <- function(word) {
  result <- bigram_df %>%
    filter(w1 == word) %>%
    arrange(desc(freq)) %>%
    head(3)
  return(result)
}
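For example, the top continuations of a given word can be inspected directly (the input is illustrative and the exact rows returned depend on the random 1% sample):

predict_bigram("in")  # e.g. "in the", "in a", "in my" in typical English text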
### Trigram Prediction

# Return the three most frequent words observed after the pair (word1, word2)
predict_trigram <- function(word1, word2) {
  result <- trigram_df %>%
    filter(w1 == word1, w2 == word2) %>%
    arrange(desc(freq)) %>%
    head(3)
  return(result)
}
To handle word pairs never seen in the training sample, a backoff strategy was implemented: try the trigram table first, then fall back to the bigram table keyed on the most recent word.
predict_next_word <- function(word1, word2) {
  # First try the trigram table: words observed after the pair (word1, word2)
  result <- trigram_df %>%
    filter(w1 == word1, w2 == word2) %>%
    arrange(desc(freq)) %>%
    head(1)
  if (nrow(result) > 0) {
    return(result)
  }
  # Back off to the bigram table keyed on the most recent word
  result <- bigram_df %>%
    filter(w1 == word2) %>%
    arrange(desc(freq)) %>%
    head(1)
  return(result)
}
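A quick sanity check of the backoff behaviour (the inputs are illustrative, and results depend on the sampled corpus):

# Common pair: likely answered from the trigram table
predict_next_word("one", "of")
# Unseen first word: falls back to the most frequent bigram starting with "of"
predict_next_word("qwerty", "of")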
A basic predictive text model was successfully developed. This model forms the foundation for a real-time text prediction application.