owilks
11/3/2019
This algorithm pulls data from a corpus (a collection of written texts) provided by a team at TouchType/SwiftKey, a subsidiary of the Microsoft.
Stripping the corpus into its raw components, breaking those into documents, and then creating our predictive algorithm using the breakdown of n-grams, this lets us predict the “next word” in a sentence using a subset of the phrase provided.
If no such phrase can be found, it incorporates data from the user and updates the relevant probability distribution.
A few key choices were made while trimming the corpus to maximize the accuracy of the underlying dataset:
#Samples of Punctuation adjustment strategy
texts(sam)<-gsub("\\bisn't\\b","is not",texts(sam))
texts(sam)<-gsub("\\b1st\\b","first",texts(sam))
texts(sam)<-gsub("\\b4\\b","four",texts(sam))
rawB<-readLines(pathBlogs, encoding = "UTF-8", skipNul = TRUE, n=9000)
rawT<-readLines(pathTwit, encoding = "UTF-8", skipNul = TRUE, n=7000)
rawN<-readLines(pathNews, encoding = "UTF-8", skipNul = TRUE, n=3000)
Once the user enters input into the box, the app trims the phrase down to the final 3 words and uses them to predict the next word.
If the tri-gram cannot be found to create a prediction, then the algorithm searches for a prediction with the final 2 words (bi-gram) and so on so forth.
Afterwards the system is updated to include the “correct word” in the new bi-gram/tri-gram search function, weighted specifically to favor user input over the existing components of the probability distribution.
if(count>2&sum(stat4$open==phrase4)>0){nextword<-prednext(phrase4,stat4)}
else{if(count>1&sum(stat3$open==phrase3)>0){nextword<-prednext(phrase3,stat3)}
else{if(count>0&sum(stat2$open==phrase2)>0){nextword<-prednext(phrase2,stat2)}
else{nextword<-"We don't have an accurate prediction for this phrase! Try another!"}
}
Implementing longer n-grams and having a more comprehensive search model could be a straightforward next step that didn't include too much additional memory.
Additionally, there are more intensive probability distribution mechanics, e.g. the chinese restaurant process, that could be implemented in such a way that would increase our accuracy without necessarily increasing the amount of memory we need to use in a severely costly way.
That, and a more robust user-interface that gave users the option of selecting one-of-three potential predictions instead of one, and had a more robust user-feedback response tree would better flesh out the call-and-response mechanics of the app itself, better increasing the intuitiveness of the app.
Thank you for taking the time to read this! Guess well!