owilks
11/3/2019
This algorithm pulls data from a corpus (a collection of written texts) provided by a team at TouchType/SwiftKey, a subsidiary of the Microsoft.
Stripping the corpus into its raw components, breaking those into documents, and then creating our predictive algorithm using the breakdown of n-grams, this lets us predict the “next word” in a sentence using a subset of the phrase provided.
If no such phrase can be found, it incorporates data from the user and updates the relevant probability distribution.
A few key choices were made while trimming the corpus to maximize the accuracy of the underlying dataset:
fullprocess <- function(text){
sam<-sample(texts(text), size=.01*length(text),replace=FALSE)%>%corpus()
texts(sam)<-gsub("\\.{2}",".",texts(sam))
sam<-corpus_segment(sam, pattern = ".", valuetype = "fixed",
pattern_position = "after", extract_pattern = FALSE) %>%
corpus_segment(sam, pattern = "?", valuetype = "fixed",
pattern_position = "after", extract_pattern = FALSE) %>%
corpus_segment(sam, pattern = "!", valuetype = "fixed",
pattern_position = "after", extract_pattern = FALSE)
texts(sam)<-gsub("[[:punct:]]","_",texts(sam))
texts(sam)<-tolower((texts(sam)))
texts(sam)<-gsub(badp,"cnsrd",texts(sam))
texts(sam)<-gsub("\\b1st\\b","first",texts(sam))
texts(sam)<-gsub("\\b2nd\\b","second",texts(sam))
texts(sam)<-gsub("\\b3rd\\b","third",texts(sam))
texts(sam)<-gsub("\\b4th\\b","fourth",texts(sam))
texts(sam)<-gsub("\\b5th\\b","fifth",texts(sam))
texts(sam)<-gsub("\\b6th\\b","sixth",texts(sam))
texts(sam)<-gsub("\\b7th\\b","seventh",texts(sam))
texts(sam)<-gsub("\\b8th\\b","eigth",texts(sam))
texts(sam)<-gsub("\\b9th\\b","ninth",texts(sam))
texts(sam)<-gsub("\\b(\\d)+th\\b","nth",texts(sam))
texts(sam)<-gsub("\\b1\\b","one",texts(sam))
texts(sam)<-gsub("\\b2\\b","two",texts(sam))
texts(sam)<-gsub("\\b3\\b","three",texts(sam))
texts(sam)<-gsub("\\b4\\b","four",texts(sam))
texts(sam)<-gsub("\\b5\\b","five",texts(sam))
texts(sam)<-gsub("\\b6\\b","six",texts(sam))
texts(sam)<-gsub("\\b7\\b","seven",texts(sam))
texts(sam)<-gsub("\\b8\\b","eight",texts(sam))
texts(sam)<-gsub("\\b9\\b","nine",texts(sam))
texts(sam)<-gsub("\\b0\\b","zero",texts(sam))
texts(sam)<-gsub("\\b(\\d)+\\b","numrep",texts(sam))
return(sam)
}
feature frequency open last
1 thanks for the follow_ 51 thanks for the follow_
2 the rest of the 50 the rest of the
3 the end of the 49 the end of the
4 when it comes to 42 when it comes to
5 at the end of 35 at the end of
6 one of the most 34 one of the most
Once the user enters input into the box, the app trims the phrase down to the final 3 words and uses them to predict the next word.
If the tri-gram cannot be found to create a prediction, then the algorithm searches for a prediction with the final 2 words (bi-gram) and so on so forth.
Afterwards the system is updated to include the “correct word” in the new bi-gram/tri-gram search function, weighted specifically to favor user input over the existing components of the probability distribution.
predcomplex <- function(text,table){
corr<-"No"
count=0
while(corr=="No" & count<3){
nextword <- sample(
size=1,
x=table[table[,3]==text,4],
prob=table[table[,3]==text,2]
)
corr<-readline(paste("Is your next word",nextword,"? Yes/No? "))
count=count+1
}
if(corr=="Yes"){
nextword<-nextword
table[table[,3]==text,2]<-table[table[,3]==text,2]+25
}
else{
newword <-readline(paste("Sorry! Please tell us your next word! "))
nextword <-newword
}
return(nextword)
}
Implementing longer n-grams and having a more comprehensive search model could be a straightforward next step that didn't include too much additional memory.
Additionally, there are more intensive probability distribution mechanics, e.g. the chinese restaurant process, that could be implemented in such a way that would increase our accuracy without necessarily increasing the amount of memory we need to use in a severely costly way.
That, and a more robust user-interface that gave users the option of selecting one-of-three potential predictions instead of one, and had a more robust user-feedback response tree would better flesh out the call-and-response mechanics of the app itself, better increasing the intuitiveness of the app.
Thank you for taking the time to read this! Guess well!