Prediction Model

library(tm)

## Loading required package: NLP

library(stringi)
library(quanteda)

## Package version: 4.3.1
## Unicode version: 15.1
## ICU version: 74.1

## Parallel computing: 18 of 18 threads used.

## See https://quanteda.io for tutorials and examples.

## 
## Attaching package: 'quanteda'

## The following object is masked from 'package:tm':
## 
##     stopwords

## The following objects are masked from 'package:NLP':
## 
##     meta, meta<-

library(quanteda.textstats)

blogs <- readLines("en_US.blogs.txt", warn = FALSE)
news <- readLines("en_US.news.txt", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", warn = FALSE)


length(blogs)

## [1] 899288

length(news)

## [1] 1010206

length(twitter)

## [1] 2360148

set.seed(123)

sample_data <- c(
  sample(blogs, 1000),
  sample(news, 1000),
  sample(twitter, 1000)
)

length(sample_data)

## [1] 3000

tokens_data <- tokens(sample_data, remove_punct = TRUE)

head(tokens_data, 2)

## Tokens consisting of 2 documents.
## text1 :
##  [1] "The"        "bruschetta" "however"    "missed"     "the"       
##  [6] "mark"       "Instead"    "of"         "manageable" "two-bite"  
## [11] "crostini"   "these"     
## [ ... and 20 more ]
## 
## text2 :
##  [1] "Walden"     "Pond"       "Mt"         "Rainier"    "Big"       
##  [6] "Sur"        "Everglades" "and"        "so"         "forth"

bigram <- tokens_ngrams(tokens_data, n = 2)
trigram <- tokens_ngrams(tokens_data, n = 3)

head(bigram, 2)

## Tokens consisting of 2 documents.
## text1 :
##  [1] "The_bruschetta"      "bruschetta_however"  "however_missed"     
##  [4] "missed_the"          "the_mark"            "mark_Instead"       
##  [7] "Instead_of"          "of_manageable"       "manageable_two-bite"
## [10] "two-bite_crostini"   "crostini_these"      "these_were"         
## [ ... and 19 more ]
## 
## text2 :
## [1] "Walden_Pond"    "Pond_Mt"        "Mt_Rainier"     "Rainier_Big"   
## [5] "Big_Sur"        "Sur_Everglades" "Everglades_and" "and_so"        
## [9] "so_forth"

head(trigram, 2)

## Tokens consisting of 2 documents.
## text1 :
##  [1] "The_bruschetta_however"       "bruschetta_however_missed"   
##  [3] "however_missed_the"           "missed_the_mark"             
##  [5] "the_mark_Instead"             "mark_Instead_of"             
##  [7] "Instead_of_manageable"        "of_manageable_two-bite"      
##  [9] "manageable_two-bite_crostini" "two-bite_crostini_these"     
## [11] "crostini_these_were"          "these_were_huge"             
## [ ... and 18 more ]
## 
## text2 :
## [1] "Walden_Pond_Mt"     "Pond_Mt_Rainier"    "Mt_Rainier_Big"    
## [4] "Rainier_Big_Sur"    "Big_Sur_Everglades" "Sur_Everglades_and"
## [7] "Everglades_and_so"  "and_so_forth"

bigram_dfm <- dfm(bigram)
trigram_dfm <- dfm(trigram)


bigram_freq <- textstat_frequency(bigram_dfm)
trigram_freq <- textstat_frequency(trigram_dfm)

predict_next_word <- function(text) {
  text <- tolower(text)
  
  # Trigram match
  tri_match <- subset(trigram_freq, grepl(paste0("^", text), feature))
  if(nrow(tri_match) > 0) {
    return(strsplit(tri_match$feature[1], " ")[[1]][3])
  }
  
  # Bigram match
  bi_match <- subset(bigram_freq, grepl(paste0("^", text), feature))
  if(nrow(bi_match) > 0) {
    return(strsplit(bi_match$feature[1], " ")[[1]][2])
  }
  
  return("the")  # fallback
}

predict_next_word("i love")

## [1] "the"

predict_next_word("data science")

## [1] "the"

predict_next_word("thank you")

## [1] "the"

Introduction

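This report describes a simple next-word prediction model built from n-gram frequencies. The model is trained on a sample of English blog, news, and Twitter text. Given an input phrase, it looks for the most frequent n-gram that begins with that phrase and returns the word that follows.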

Model Building

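The model is built from bigrams and trigrams. A random sample of 3,000 lines (1,000 each from the blogs, news, and Twitter files, drawn after set.seed(123)) is tokenized with tokens(), with punctuation removed. tokens_ngrams() then forms the bigrams and trigrams, joining the words of each n-gram with an underscore. dfm() counts the n-grams and lowercases them by default, and textstat_frequency() turns the counts into frequency tables ranked from most to least frequent. These tables are used to identify common word sequences.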
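The steps above can be tried on a small scale. The following is a minimal sketch, assuming quanteda and quanteda.textstats are installed. The three sentences in toy are made-up stand-ins for the sampled lines, and the toy_ names are hypothetical.

```r
library(quanteda)
library(quanteda.textstats)

# Toy corpus standing in for the 3,000 sampled lines (hypothetical data)
toy <- c("I love data science", "I love you", "thank you for the help")

# Tokenize and drop punctuation, as in the report
toy_tokens <- tokens(toy, remove_punct = TRUE)

# Form bigrams; words are joined with "_" by default
toy_bigrams <- tokens_ngrams(toy_tokens, n = 2)

# dfm() lowercases features by default; textstat_frequency() sorts by count
toy_bigram_freq <- textstat_frequency(dfm(toy_bigrams))

# "i_love" occurs twice and every other bigram once, so it ranks first
head(toy_bigram_freq[, c("feature", "frequency")], 3)
```

The feature column holds lowercase, underscore-joined strings such as "i_love", which is the format any later lookup has to match.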

Prediction Algorithm

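A simple backoff scheme is used for prediction. The function lowercases the input, then searches the trigram table for entries that begin with it and returns the third word of the most frequent match. If the trigram table has no match, it repeats the search in the bigram table and returns the second word. If neither table has a match, it returns the fixed fallback word "the".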
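The lookup order described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the report's function. The tables tri_freq and bi_freq are small hypothetical stand-ins shaped like textstat_frequency() output and assumed to be sorted by descending frequency. The helper lookup_next() and the name predict_next_word_sketch() are made up for this example. Two choices differ from the function shown earlier: the input words are joined with "_" to match the default concatenator of tokens_ngrams(), and the bigram step backs off to the last word of the input only.

```r
# Hypothetical frequency tables, sorted by descending frequency
tri_freq <- data.frame(
  feature   = c("thank_you_for", "i_love_you", "thank_you_so"),
  frequency = c(5, 3, 2),
  stringsAsFactors = FALSE
)
bi_freq <- data.frame(
  feature   = c("of_the", "data_science", "science_is"),
  frequency = c(9, 4, 1),
  stringsAsFactors = FALSE
)

# Return the last word of the first (most frequent) n-gram starting with prefix
lookup_next <- function(prefix, freq) {
  hits <- freq$feature[startsWith(freq$feature, paste0(prefix, "_"))]
  if (length(hits) == 0) return(NA_character_)
  parts <- strsplit(hits[1], "_", fixed = TRUE)[[1]]
  parts[length(parts)]
}

predict_next_word_sketch <- function(text, tri = tri_freq, bi = bi_freq) {
  words <- strsplit(tolower(trimws(text)), "\\s+")[[1]]
  n <- length(words)

  # Trigram step: use the last two input words as the prefix
  if (n >= 2) {
    guess <- lookup_next(paste(words[(n - 1):n], collapse = "_"), tri)
    if (!is.na(guess)) return(guess)
  }

  # Bigram step: back off to the last input word only
  if (n >= 1) {
    guess <- lookup_next(words[n], bi)
    if (!is.na(guess)) return(guess)
  }

  "the"  # fixed fallback word, as in the report
}

predict_next_word_sketch("thank you")     # "for" (trigram match)
predict_next_word_sketch("data science")  # "is"  (bigram match on "science")
predict_next_word_sketch("hello world")   # "the" (fallback)
```

Splitting on "_" assumes that no token itself contains an underscore. A different concatenator could be passed to tokens_ngrams() if that assumption does not hold for the corpus.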

Evaluation

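In the three test calls shown above ("i love", "data science", "thank you"), the model returned the fallback word "the" each time, so these examples do not yet show a successful n-gram match. One cause is visible in the code: tokens_ngrams() joins words with an underscore, so the frequency tables hold features such as "thank_you_for", while the function searches for the input with its space intact and splits the matched feature on a space. Input that contains a space therefore never matches. The limited sample size (3,000 of about 4.27 million lines) also means that many phrases have no matching n-gram, in which case the fallback word is returned.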
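One way to make the evaluation quantitative is to measure top-1 accuracy and the fallback rate on held-out phrases. The sketch below uses base R only. The data frame held_out and the stand-in predictor toy_predict() are hypothetical and exist only to make the example self-contained. In practice the predictor would be predict_next_word() and the held-out phrases would come from lines not used to build the frequency tables.

```r
# Hypothetical held-out examples: a two-word context and the true next word
held_out <- data.frame(
  context = c("thank you", "i love", "see you"),
  truth   = c("for", "you", "soon"),
  stringsAsFactors = FALSE
)

# Stand-in predictor (hypothetical); replace with predict_next_word()
toy_predict <- function(text) {
  known <- c("thank you" = "for", "i love" = "it")
  if (text %in% names(known)) known[[text]] else "the"
}

predicted <- vapply(held_out$context, toy_predict, character(1),
                    USE.NAMES = FALSE)

# Share of contexts where the predicted word equals the true next word
accuracy <- mean(predicted == held_out$truth)

# Share of predictions equal to "the"; this also counts any genuine "the" match
fallback_rate <- mean(predicted == "the")
```

A high fallback rate together with low accuracy suggests that the lookup rarely finds a match, which points to either a coverage problem or a mismatch between the query and the stored n-gram format.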

Conclusion

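The n-gram model demonstrates the basic structure of next-word prediction: frequency tables of bigrams and trigrams queried in backoff order, with a fallback word when no match is found. The model can be improved by sampling more of the three source files and by using more advanced techniques.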