Coursera Data Science Capstone

2022-03-08

Using R’s data analysis capabilities to predict text.

Three large text files containing blog posts, tweets and news articles were read into R.
The text was tokenized with the quanteda package and combined into three-word sequences (trigrams).
When the user enters a phrase, its last two words are used to predict the next one.
A regular expression matches those two words against the first two words of each stored trigram.
The third words of the matching trigrams are then displayed, from the most frequent downwards.
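
The lookup described above can be sketched on a toy frequency table. This is a minimal illustration in base R, not the app's own code: the vector `toy_trigrams` and the helper `predict_next` are hypothetical names, and the counts are made up for the example.

```r
# Hypothetical trigram counts, already sorted from most to least frequent.
# Names use the "w1_w2_w3" form.
toy_trigrams <- c(
  "thanks_for_the"     = 50,
  "thanks_for_your"    = 30,
  "thanks_for_all"     = 10,
  "looking_forward_to" = 8
)

# predict_next() is a hypothetical helper: it takes the last two words of a
# phrase and returns the third word of every trigram that starts with them.
predict_next <- function(phrase, trigrams, n = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  prefix  <- paste(tail(words, 2), collapse = "_")
  pattern <- paste0("^", prefix, "_")   # anchor on "w1_w2_"
  hits <- names(trigrams)[grepl(pattern, names(trigrams))]
  sub("^.*_", "", head(hits, n))        # keep only the third word
}

predict_next("Many thanks for", toy_trigrams)
```

Because the table is stored in descending order of frequency, taking the first few matches is enough to rank the suggestions. The sketch assumes the input contains no regular expression metacharacters.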

Slide with the data cleaning

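# Read each source file, closing its connection once the lines are in memory
blogs <- file("en_US.blogs.txt", "r"); blogs_lines <- readLines(blogs); close(blogs)
news <- file("en_US.news.txt", "r"); news_lines <- readLines(news); close(news)
twitter <- file("en_US.twitter.txt", "r"); twitter_lines <- readLines(twitter); close(twitter)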
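library(stringi)  # provides stri_count_words()

# Normalise raw lines and keep only those long enough to yield a trigram
CleanR <- function(x) {
  sampleTxt <- tolower(x)
  # Drop any token containing a mention, a hashtag or a URL
  sampleTxt <- gsub("([^[:space:]]*)(@|#|http://|https://)([^[:space:]]*)", " ", sampleTxt)
  # Strip non-ASCII characters
  sampleTxt <- iconv(sampleTxt, "latin1", "ASCII", sub = "")
  # Remove punctuation (apostrophes included) and digits
  sampleTxt <- gsub("[[:punct:]]", "", sampleTxt)
  sampleTxt <- gsub("[[:digit:]]", "", sampleTxt)
  # Keep lines with at least three words
  sampleTxt[stri_count_words(sampleTxt) > 2]
}
blogs_lines <- CleanR(blogs_lines)
news_lines <- CleanR(news_lines)
twitter_lines <- CleanR(twitter_lines)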
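
To see what the cleaning steps do, here is a small sketch that applies the same kind of transformations to three sample lines. It assumes base R only, so the word count uses `strsplit` rather than stringi; `clean_lines` and `sample_lines` are hypothetical names introduced for this illustration.

```r
# clean_lines() is a hypothetical base-R stand-in for the cleaning function.
clean_lines <- function(x) {
  x <- tolower(x)
  # Drop tokens containing a mention, a hashtag or a URL
  x <- gsub("[^[:space:]]*(@|#|https?://)[^[:space:]]*", " ", x)
  # Strip non-ASCII characters
  x <- iconv(x, "latin1", "ASCII", sub = "")
  # Remove punctuation and digits, then collapse repeated spaces
  x <- gsub("[[:punct:]]|[[:digit:]]", "", x)
  x <- trimws(gsub("\\s+", " ", x))
  # Keep lines with at least three words
  n_words <- lengths(strsplit(x, " ", fixed = TRUE))
  x[n_words > 2]
}

sample_lines <- c(
  "Check this out: http://example.com #wow",
  "I can't wait for the 2 new episodes!",
  "Hi @friend"
)
clean_lines(sample_lines)
```

The third line is dropped because only one word survives the cleaning, which is too short to form a trigram.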

Slide with the algorithm

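library(quanteda)

# Combine the three cleaned sources into one corpus
allFiles <- c(blogs_lines, news_lines, twitter_lines)
textCorpus <- corpus(allFiles)
remove(blogs_lines, news_lines, twitter_lines)  # free memory as we go
token <- tokens(textCorpus, remove_punct = TRUE)
remove(textCorpus)
# Remove English stop words before building the trigrams
toks_nostop <- tokens_select(token, pattern = stopwords("en"), selection = "remove")
remove(token)
# Build trigrams named "w1_w2_w3" and keep up to 5,000,000, sorted by frequency
ngramTri <- tokens_ngrams(toks_nostop, n = 3)
topNgramTRI <- topfeatures(dfm(ngramTri), 5000000)
saveRDS(topNgramTRI, "topNgramTRI.rds")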
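
The shape of the resulting frequency table can be illustrated on a few toy lines. This sketch uses base R instead of quanteda and skips stop word removal, so it only approximates the pipeline above; `toy_lines`, `make_trigrams` and `toy_counts` are hypothetical names.

```r
toy_lines <- c(
  "thanks for the help",
  "thanks for the ride",
  "thanks for your time"
)

# make_trigrams() is a hypothetical helper that slides a three-word window
# over one line and joins each window with underscores.
make_trigrams <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(words) < 3) return(character(0))
  idx <- seq_len(length(words) - 2)
  paste(words[idx], words[idx + 1], words[idx + 2], sep = "_")
}

all_trigrams <- unlist(lapply(toy_lines, make_trigrams))

# Count each trigram and sort from most to least frequent
toy_counts <- sort(table(all_trigrams), decreasing = TRUE)
head(toy_counts, 3)
```

Storing the counts sorted in descending order means the prediction step can simply take the first matches it finds.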

Slide with the Output Function

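library(shiny)
library(stringr)  # provides word()

# Load the trigram table once at startup rather than on every input change
topNgramTRI <- readRDS("topNgramTRI.rds")

shinyServer(function(input, output) {
  output$predicted <- renderText({
    phrase <- input$twoWords
    if (phrase != "") {
      # Reduce the input to its last two words in the "w1_w2" form of the trigram names
      formated <- word(phrase, -2, -1)
      formated <- tolower(formated)
      formated <- gsub("'", "", formated)
      formated <- gsub(" ", "_", formated)
      # Anchor on "w1_w2_" so that only whole-word prefixes match
      formated <- paste0("^", formated, "_")
      searchTerms <- names(topNgramTRI)[grepl(formated, names(topNgramTRI))]
      # Keep the third word of up to 20 matches, most frequent first
      predicted <- sub("^.*_", "", head(searchTerms, 20))
      paste(predicted, "-----")
    }
  })
})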

Enter a simple open-ended phrase in the text box provided and press Enter. A list of the best matches will appear.

THANK YOU