2022-03-08

Using R’s data analysis capabilities to predict text.

  • Three large text files containing blog posts, tweets, and news articles were read.
  • They were tokenized with library(quanteda) and turned into three-word sequences (trigrams).
  • When the user enters a phrase, its last two words are used to predict the next one.
  • Using regular expressions, those two words are compared against the stored trigrams.
  • The matches are then displayed from most to least frequent.
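
The steps above can be illustrated with a minimal base-R sketch. The trigram counts are made up, `predict_next` is a hypothetical helper name, and `startsWith()` stands in for the regular-expression match to keep the example short; it is not the app's actual code.

```r
# Toy trigram counts, already sorted from most to least frequent.
# Names follow the "word1_word2_word3" convention used for n-grams.
toy_trigrams <- c(
  "thanks_for_the"     = 12,
  "thanks_for_your"    = 7,
  "looking_forward_to" = 5,
  "thanks_for_coming"  = 3
)

# Hypothetical helper: predict the next word from the last two words typed.
predict_next <- function(phrase, trigrams) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  # Build the "w1_w2_" prefix from the last two words
  prefix <- paste0(paste(tail(words, 2), collapse = "_"), "_")
  hits <- names(trigrams)[startsWith(names(trigrams), prefix)]
  # Whatever follows the prefix is the candidate next word
  substring(hits, nchar(prefix) + 1)
}

predict_next("Many thanks for", toy_trigrams)
```

Because the table is kept in frequency order, the candidates come back with the most common continuation first.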

Slide with R code for cleaning the data

# readLines() opens and closes each file itself, so no connection is left open
blogs_lines   <- readLines("en_US.blogs.txt")
news_lines    <- readLines("en_US.news.txt")
twitter_lines <- readLines("en_US.twitter.txt")
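library(stringi)  # provides stri_count_words()

CleanR <- function(x) {
  sampleTxt <- tolower(x)
  # Drop any token containing a mention, a hashtag, or a URL
  sampleTxt <- gsub("([^[:space:]]*)(@|#|http://|https://)([^[:space:]]*)", " ", sampleTxt)
  # Strip characters that cannot be represented in ASCII
  sampleTxt <- iconv(sampleTxt, "latin1", "ASCII", sub="")
  # Remove punctuation (apostrophes included) and digits
  sampleTxt <- gsub("[[:punct:]]", "", sampleTxt)
  sampleTxt <- gsub("[[:digit:]]","",sampleTxt)
  # Keep only lines with more than two words, enough for at least one trigram
  sampleTxt[stri_count_words(sampleTxt) > 2]
}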
blogs_lines <- CleanR(blogs_lines)
news_lines   <- CleanR(news_lines)
twitter_lines  <- CleanR(twitter_lines)
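
To see what the cleaning steps do, here is a small base-R sketch run on three made-up lines. `clean_demo` is a hypothetical stand-in for the cleaning function: it skips the `iconv()` step because the sample text is plain ASCII, and it counts words with `strsplit()` instead of stringi.

```r
sample_lines <- c(
  "Check this out: http://example.com #wow",
  "It's 2022 and I can't wait!",
  "Hi @bob"
)

# Hypothetical stand-in for the cleaning function, base R only.
clean_demo <- function(x) {
  txt <- tolower(x)
  # Replace any token holding a mention, hashtag, or URL with a space
  txt <- gsub("([^[:space:]]*)(@|#|http://|https://)([^[:space:]]*)", " ", txt)
  txt <- gsub("[[:punct:]]", "", txt)
  txt <- gsub("[[:digit:]]", "", txt)
  # Count whitespace-separated words and keep lines with more than two
  n_words <- lengths(strsplit(trimws(txt), "\\s+"))
  txt[n_words > 2]
}

cleaned <- clean_demo(sample_lines)
print(cleaned)
```

The third line is dropped because only one word is left once the mention is removed. Extra spaces remain where tokens and digits were deleted, which the tokenizer later ignores.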

Slide with the algorithm

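library(quanteda)

# Combine the cleaned sources into one corpus, freeing memory after each step
allFiles <- c(blogs_lines,news_lines, twitter_lines)
corpus <- corpus(allFiles)
remove(blogs_lines, news_lines, twitter_lines)
token <- tokens(corpus, remove_punct = TRUE)
remove(corpus)
# Drop English stopwords before building the trigrams
toks_nostop <- tokens_select(token, pattern = stopwords("en"), selection = "remove")
remove(token)
# Build trigrams named "w1_w2_w3" and keep up to 5,000,000, most frequent first
ngramTri <- tokens_ngrams(toks_nostop, n = 3)
topNgramTRI <- topfeatures(dfm(ngramTri), 5000000)
saveRDS(topNgramTRI, "topNgramTRI.rds")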
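
For readers without quanteda installed, the shape of the saved object can be approximated in base R. This is only a sketch under simplifying assumptions: `make_trigrams` is a hypothetical helper, the input lines are made up, and stopwords are not removed.

```r
# Hypothetical helper: build "w1_w2_w3" trigrams from one line of cleaned text.
make_trigrams <- function(line) {
  words <- strsplit(trimws(line), "\\s+")[[1]]
  n <- length(words)
  if (n < 3) return(character(0))
  # Slide a three-word window across the line
  paste(words[1:(n - 2)], words[2:(n - 1)], words[3:n], sep = "_")
}

clean_lines <- c("thanks for the help", "thanks for the food", "see you soon")

# Count every trigram across all lines and sort by frequency
all_trigrams <- unlist(lapply(clean_lines, make_trigrams))
counts <- sort(table(all_trigrams), decreasing = TRUE)

# A named numeric vector, most frequent trigram first
top <- setNames(as.numeric(counts), names(counts))
print(top)
```

The result is a named numeric vector whose names are the trigrams and whose values are their counts, which is the form the lookup step relies on.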

Slide with the Output Function

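library(shiny)
library(stringr)  # provides word()

# Load the trigram table once at startup rather than on every input change
topNgramTRI <- readRDS("topNgramTRI.rds")

shinyServer(function(input, output) {
  output$predicted <- renderText({
    phrase <- trimws(input$twoWords)
    if(phrase != "") {
      # Keep the last two words and normalise them the same way as the corpus
      formated <- tolower(word(phrase, -2, -1))
      formated <- gsub("'","",formated)
      formated <- gsub(" ", "_" , formated)
      # Trigram names look like "w1_w2_w3", so match on "w1_w2_" exactly
      formated <- paste("^", formated, "_.+", sep = "")
      searchTerms <- names(topNgramTRI)[grepl(formated, names(topNgramTRI))]
      # Strip the two-word prefix; the table is sorted, so head() gives the most frequent
      predicted <- gsub(".+_.+_", "",  head(searchTerms, 20))
      paste(predicted, "-----")
    } else {
      ""  # nothing typed yet (assumed; the original slide is cut off here)
    }
  })
})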

Enter a simple open-ended phrase in the text box provided and press Enter. A list of the best matches will appear.
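
One detail worth checking when matching a two-word prefix with a regular expression is the separator after the second word. The base-R sketch below uses made-up trigram names to compare a pattern without the trailing underscore against one with it; the variable names are illustrative only.

```r
trigram_names <- c("of_the_year", "of_them_all", "of_the_best", "one_of_the")

prefix <- "of_the"
loose  <- paste0("^", prefix, ".+")   # no separator after the second word
strict <- paste0("^", prefix, "_.+")  # requires "_" right after the second word

loose_hits  <- trigram_names[grepl(loose, trigram_names)]
strict_hits <- trigram_names[grepl(strict, trigram_names)]

print(loose_hits)   # also picks up "of_them_all"
print(strict_hits)

# Strip the first two words to leave the predicted third word
predicted <- sub("^[^_]+_[^_]+_", "", strict_hits)
print(predicted)
```

With the loose pattern, typing "of the" would also suggest "all", taken from "of them all". Requiring the underscore keeps the match on whole words.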

THANK YOU