library(tm)
## Loading required package: NLP
library(stringi)
library(quanteda)
## Package version: 4.3.1
## Unicode version: 15.1
## ICU version: 74.1
## Parallel computing: 18 of 18 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:tm':
##
## stopwords
## The following objects are masked from 'package:NLP':
##
## meta, meta<-
library(quanteda.textstats)
blogs <- readLines("en_US.blogs.txt", warn = FALSE)
news <- readLines("en_US.news.txt", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", warn = FALSE)
length(blogs)
## [1] 899288
length(news)
## [1] 1010206
length(twitter)
## [1] 2360148
set.seed(123)
sample_data <- c(
  sample(blogs, 1000),
  sample(news, 1000),
  sample(twitter, 1000)
)
length(sample_data)
## [1] 3000
tokens_data <- tokens(sample_data, remove_punct = TRUE)
head(tokens_data, 2)
## Tokens consisting of 2 documents.
## text1 :
## [1] "The" "bruschetta" "however" "missed" "the"
## [6] "mark" "Instead" "of" "manageable" "two-bite"
## [11] "crostini" "these"
## [ ... and 20 more ]
##
## text2 :
## [1] "Walden" "Pond" "Mt" "Rainier" "Big"
## [6] "Sur" "Everglades" "and" "so" "forth"
bigram <- tokens_ngrams(tokens_data, n = 2)
trigram <- tokens_ngrams(tokens_data, n = 3)
head(bigram, 2)
## Tokens consisting of 2 documents.
## text1 :
## [1] "The_bruschetta" "bruschetta_however" "however_missed"
## [4] "missed_the" "the_mark" "mark_Instead"
## [7] "Instead_of" "of_manageable" "manageable_two-bite"
## [10] "two-bite_crostini" "crostini_these" "these_were"
## [ ... and 19 more ]
##
## text2 :
## [1] "Walden_Pond" "Pond_Mt" "Mt_Rainier" "Rainier_Big"
## [5] "Big_Sur" "Sur_Everglades" "Everglades_and" "and_so"
## [9] "so_forth"
head(trigram, 2)
## Tokens consisting of 2 documents.
## text1 :
## [1] "The_bruschetta_however" "bruschetta_however_missed"
## [3] "however_missed_the" "missed_the_mark"
## [5] "the_mark_Instead" "mark_Instead_of"
## [7] "Instead_of_manageable" "of_manageable_two-bite"
## [9] "manageable_two-bite_crostini" "two-bite_crostini_these"
## [11] "crostini_these_were" "these_were_huge"
## [ ... and 18 more ]
##
## text2 :
## [1] "Walden_Pond_Mt" "Pond_Mt_Rainier" "Mt_Rainier_Big"
## [4] "Rainier_Big_Sur" "Big_Sur_Everglades" "Sur_Everglades_and"
## [7] "Everglades_and_so" "and_so_forth"
bigram_dfm <- dfm(bigram)
trigram_dfm <- dfm(trigram)
bigram_freq <- textstat_frequency(bigram_dfm)
trigram_freq <- textstat_frequency(trigram_dfm)
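predict_next_word <- function(text) {
  # Lowercase the input (dfm() lowercases features) and split it into words
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  n <- length(words)
  # Trigram match on the last two words; tokens_ngrams() joins tokens with "_"
  if (n >= 2) {
    prefix <- paste0(words[n - 1], "_", words[n], "_")
    tri_match <- subset(trigram_freq, startsWith(feature, prefix))
    if (nrow(tri_match) > 0) {
      # Rows are sorted by frequency, so the first row is the most common trigram
      return(substring(tri_match$feature[1], nchar(prefix) + 1))
    }
  }
  # Bigram match: back off to the last word only
  if (n >= 1) {
    prefix <- paste0(words[n], "_")
    bi_match <- subset(bigram_freq, startsWith(feature, prefix))
    if (nrow(bi_match) > 0) {
      return(substring(bi_match$feature[1], nchar(prefix) + 1))
    }
  }
  return("the") # fallback
}
# Example calls; the results depend on the random sample drawn above
predict_next_word("i love")
predict_next_word("data science")
predict_next_word("thank you")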
This report describes the development of a simple text prediction model using n-grams. The goal is to predict the next word based on the input text.
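The model is built from bigrams and trigrams. The sampled text (1,000 lines each from the blogs, news, and Twitter files) is tokenized with punctuation removed, the tokens are combined into n-grams, and each n-gram is counted to produce a frequency table sorted from most to least common. These tables are used to identify common word sequences.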
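As a rough illustration of the counting step, the sketch below builds a bigram frequency table in base R rather than quanteda, so it runs without the corpus files. The helper `make_ngrams()`, the table `toy_bigram_freq`, and the three toy sentences are hypothetical and not part of the original analysis; the `_` joiner mirrors the default concatenator of `tokens_ngrams()`.

```r
# Hypothetical helper: join every run of n consecutive tokens with "_"
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

# Toy documents standing in for the sampled lines (invented for illustration)
docs <- c("thank you so much", "thank you for coming", "so much fun")

# Lowercase, split on whitespace, and collect bigrams from every document
token_list <- strsplit(tolower(docs), "\\s+")
bigrams <- unlist(lapply(token_list, make_ngrams, n = 2))

# Count each bigram and sort from most to least frequent
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
toy_bigram_freq <- data.frame(
  feature = names(bigram_counts),
  frequency = as.integer(bigram_counts),
  stringsAsFactors = FALSE
)
print(toy_bigram_freq)
```

The resulting data frame has the same two key columns, `feature` and `frequency`, that the prediction function reads from the `textstat_frequency()` output.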
A backoff model is used for prediction. The model first looks for the most frequent trigram that begins with the input, then falls back to the bigram table. If neither table contains a match, the fallback word "the" is returned.
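The three-step lookup can be sketched on toy data as follows. This is a minimal illustration, not the report's exact implementation: the tables `toy_trigrams` and `toy_bigrams`, their counts, and the helpers `top_continuation()` and `backoff_predict()` are all hypothetical, and the tables are assumed to be sorted by descending frequency already.

```r
# Toy frequency tables, sorted by descending frequency (hypothetical counts)
toy_trigrams <- data.frame(
  feature = c("thank_you_for", "thank_you_so", "i_love_you"),
  frequency = c(5L, 3L, 2L),
  stringsAsFactors = FALSE
)
toy_bigrams <- data.frame(
  feature = c("of_the", "you_are", "love_it"),
  frequency = c(9L, 4L, 1L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first (most frequent) feature starting with `prefix`,
# reduced to the part after the prefix; NA when nothing matches
top_continuation <- function(freq, prefix) {
  hits <- freq$feature[startsWith(freq$feature, prefix)]
  if (length(hits) == 0) return(NA_character_)
  substring(hits[1], nchar(prefix) + 1)
}

backoff_predict <- function(text, fallback = "the") {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  n <- length(words)
  # Step 1: trigram lookup on the last two words
  if (n >= 2) {
    prefix <- paste0(words[n - 1], "_", words[n], "_")
    guess <- top_continuation(toy_trigrams, prefix)
    if (!is.na(guess)) return(guess)
  }
  # Step 2: bigram lookup on the last word
  if (n >= 1) {
    guess <- top_continuation(toy_bigrams, paste0(words[n], "_"))
    if (!is.na(guess)) return(guess)
  }
  # Step 3: nothing matched, so return the fallback word
  fallback
}

backoff_predict("Thank you")     # trigram hit
backoff_predict("see you")       # backs off to the bigram table
backoff_predict("data science")  # no match, returns the fallback
```

Ending each prefix with `_` keeps a query such as "love" from matching a different word that merely starts with the same letters, for example "loved".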
For phrases whose n-grams occur in the sample, the model returns the most frequent continuation. Because the sample contains only 3,000 lines, many inputs have no matching n-gram, and in those cases the model returns the fallback word "the".
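One way to make the effect of the small sample concrete is to count how often no n-gram matches. The sketch below is hypothetical throughout: `toy_model` is an invented lookup table, `toy_predict()` stands in for the real predictor and returns `NA` instead of a fallback word so that misses can be counted, and the test phrases are made up for illustration.

```r
# Hypothetical lookup: known contexts mapped to their most frequent next word
toy_model <- c("thank you" = "for", "i love" = "you", "of" = "the")

# Stand-in predictor: try the last two words, then the last word, else NA
toy_predict <- function(text) {
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  n <- length(words)
  keys <- c(
    if (n >= 2) paste(words[n - 1], words[n]),
    if (n >= 1) words[n]
  )
  hit <- keys[keys %in% names(toy_model)]
  if (length(hit) == 0) return(NA_character_)
  unname(toy_model[hit[1]])
}

# Invented test phrases; in practice these would come from held-out lines
phrases <- c("thank you", "I love", "data science", "one of", "quantum flux")
predictions <- vapply(phrases, toy_predict, character(1), USE.NAMES = FALSE)

# Share of phrases with no match, i.e. where the fallback word would be used
fallback_rate <- mean(is.na(predictions))
fallback_rate
```

Tracking this rate while increasing the sample size would show how quickly coverage improves as more text is included.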
The n-gram model demonstrates basic next-word prediction. It can be improved by training on a larger share of the corpus and by using more advanced techniques.