11/2/2021

  1. Loading and Cleaning the Data

  • I downloaded the dataset from here

  • Replace all non-alphanumeric characters with spaces;

  • Remove excess spaces (a sketch of these two steps follows the code block below);

library(tm)                                                     # VCorpus, tm_map and the cleaning transformations
dataset = sent_detect(dataset, language = "en", model = NULL)   # split the raw text into sentences
body = VCorpus(VectorSource(dataset))                           # build a corpus, one document per sentence
body = tm_map(body, removeNumbers)                              # removing numbers
body = tm_map(body, stripWhitespace)                            # removing extra whitespace
body = tm_map(body, content_transformer(tolower))               # lowercasing all contents
body = tm_map(body, removePunctuation)                          # removing punctuation / special characters
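
The replacement of non-alphanumeric characters and the removal of excess spaces mentioned in the bullets are not shown in the block above; a minimal sketch of that step, assuming the same tm corpus body and a small custom transformer, could look like this:

clean_chars = content_transformer(function(x) gsub("[^[:alnum:] ]", " ", x))  # replace non-alphanumeric characters with a space
body = tm_map(body, clean_chars)
body = tm_map(body, stripWhitespace)                            # collapse the extra spaces introduced above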

  2. N-grams

  • Before building the n-grams, first clean the text with regular expressions, deleting URLs, handles beginning with @, and similar tokens;

  • Split the text on spaces to obtain the n-grams;

  • To reduce the size of the n-gram tables, first calculate the frequency of each n-gram (see the sketch after this list);

library(RWeka)                                                     # NGramTokenizer and Weka_control
body_2 <- gsub("http\\w+", "", sapply(body, as.character))         # drop URLs from the cleaned corpus text
n <- 2                                                             # n-gram order, e.g. 2 for bigrams
token_n <- NGramTokenizer(body_2, Weka_control(min = n, max = n))
  • In total there are around 165,000 distinct 2-grams.
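
To reduce the table as described above, the frequency of every n-gram can be counted first; a minimal sketch, assuming the token_n vector produced by NGramTokenizer above, could be:

freq_n <- sort(table(token_n), decreasing = TRUE)                          # count each n-gram, most frequent first
freq_df <- data.frame(ngram = names(freq_n), count = as.integer(freq_n))   # table of n-grams and their counts
head(freq_df, 10)                                                          # inspect the ten most frequent n-grams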

Information

  • You can also find more information about where the dataset comes from

  • You can download the dataset I used as well as the Shiny app code

Thank you