Kyle Scully
One direct application of computational linguistics is text prediction: given the words a user has typed so far, the application predicts the next word.
The application can be cloned from here: https://github.com/zieka/computational_linguistics
The application is hosted on shinyapps.io: https://zieka.shinyapps.io/computational_linguistics
The data is a compilation of text from news, blogs, and tweets, retrieved from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data was cleaned in the following manner:
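library(tm)  # provides tm_map() and the transformations used below
set <- tm_map(set, stripWhitespace)                 # collapse runs of whitespace
set <- tm_map(set, content_transformer(tolower))    # lowercase all text
set <- tm_map(set, removePunctuation)               # drop punctuation characters
set <- tm_map(set, removeNumbers)                   # drop digits
badWords <- scan("./badwords", "")                  # read the word list as a character vector
set <- tm_map(set, removeWords, badWords)           # remove the listed words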
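To see what these steps do to a single line of text, here is a minimal base-R sketch that approximates the same transformations, in the same order, without the tm package. The helper name `clean_text`, the sample sentence, and the one-word stand-in for the bad-word list are all hypothetical, and the regular expressions only approximate tm's behaviour.

```r
# Hypothetical helper: approximates the tm cleaning steps using base R only.
clean_text <- function(x, bad_words = character(0)) {
  x <- gsub("[[:space:]]+", " ", x)   # like stripWhitespace: collapse whitespace runs
  x <- tolower(x)                     # lowercase
  x <- gsub("[[:punct:]]+", "", x)    # like removePunctuation
  x <- gsub("[[:digit:]]+", "", x)    # like removeNumbers
  if (length(bad_words) > 0) {
    # like removeWords: delete whole-word matches of any listed word
    pattern <- paste0("\\b(", paste(bad_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  x
}

# Sample input and word list are made up for illustration.
cleaned <- clean_text("Hello,   World! It is 2015 and darn cold.",
                      bad_words = c("darn"))
print(cleaned)
```

In this sketch the removed number and word leave double spaces behind, because whitespace is collapsed before they are deleted. Running the whitespace step last would tidy that up.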
The tokenizing algorithm basically does the following:
The end result is a matrix of strings, each n words long.
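The application's tokenizer itself is not shown here, so the following is only a rough base-R sketch of how text can be turned into a matrix of n-word strings. The function name `make_ngrams` and the sample sentence are hypothetical, and the sketch assumes the text has already been cleaned so that words are separated by spaces.

```r
# Hypothetical helper: split text on spaces and join every run of n consecutive words.
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), " +")[[1]]
  if (length(words) < n) {
    # Not enough words to form a single n-gram
    return(matrix(character(0), ncol = 1))
  }
  starts <- seq_len(length(words) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  matrix(grams, ncol = 1)  # one n-gram string per row
}

trigrams <- make_ngrams("the quick brown fox jumps", 3)
print(trigrams)
```

For the five-word sample sentence this yields three rows: "the quick brown", "quick brown fox", and "brown fox jumps".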
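ngram_needed <- number_of_input_words + 1
# Anchor the pattern so only n-grams that start with the input can match
regex <- paste("^", input_string, sep = "")
if (ngram_needed >= 4) {
  # Take the first matching quadgram and use its fourth word as the prediction
  prediction <- strsplit(quadgram.w[grep(regex, quadgram.w$word), ][1]$word, " ")[[1]][4]
}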
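The lookup can be tried on a toy table, as in the sketch below. Everything in it other than the `quadgram.w` name and its `word` column is an assumption: the rows are invented, the `freq` column and the descending-frequency ordering are guesses at how the table is built, and `predict_from_quadgrams` is a hypothetical wrapper. The sketch also appends a trailing space to the pattern so that the input only matches whole words, and returns `NA` when nothing matches.

```r
# Toy quadgram table (invented data), assumed sorted by descending frequency.
quadgram.w <- data.frame(
  word = c("thanks for the follow", "thanks for the help", "at the end of"),
  freq = c(12, 7, 5),
  stringsAsFactors = FALSE
)

# Hypothetical wrapper around the lookup; returns NA when no quadgram matches.
predict_from_quadgrams <- function(input_string, table) {
  # Trailing space: the three input words must match whole words
  regex <- paste("^", input_string, " ", sep = "")
  matches <- table$word[grep(regex, table$word)]
  if (length(matches) == 0) {
    return(NA_character_)
  }
  # The first match is the most frequent one under the ordering assumption
  strsplit(matches[1], " ")[[1]][4]
}

prediction <- predict_from_quadgrams("thanks for the", quadgram.w)
print(prediction)
```

With this toy table, "thanks for the" matches two rows and the first one supplies the prediction, "follow". An input with no matching quadgram gives `NA`, which a caller could use as the signal to fall back to a shorter n-gram table.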