NextWords: An App for Text Prediction

FC
December 2014

Demonstration

High Level

Here's how it works in plain English…

  • Start with a corpus (collection of text)
  • Remove punctuation and numbers, convert capitals to lowercase, and strip extra whitespace from that text
  • Create n-grams (distinct n-word phrases) and count how frequently they occur
    • For instance, “you are my best” is a 4-gram that probably occurs dozens of times in a large corpus
  • Store the n-grams in a searchable format and pass them to the Shiny app

So, when the user types “you are my”, the app searches the database for the most frequent 4-gram that starts with “you are my” and returns the fourth word of that phrase, perhaps “best”.

Now With Some Code: Tiny Example

# build a tiny corpus
tinyCorpus <- c("you are my best friend", "It's possible you are my... dog", "says you are my best fan")

# remove punctuation, convert capitals to lowercase, and split the lines into one long vector of words
tinyCorpus <- gsub("([[:punct:]])", "", tolower(tinyCorpus))
allWords <- unlist(strsplit(tinyCorpus, " "))

# create 4-grams and count frequency of those 4-grams
library(stylo)
ngrams <- make.ngrams(allWords, 4)
count <- as.data.frame(table(ngrams))
count <- count[order(-count$Freq), ]  # sort by frequency, most frequent first

# now search for 4-grams that start with "you are my" to see that "best" is the most likely next word
count[grep("^you are my", count$ngrams),]
            ngrams Freq
12 you are my best    2
13  you are my dog    1
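
The search above can be wrapped in a small helper. What follows is a minimal sketch in base R, not the app's actual code: cleanText, buildNgramCounts, and predictNext are hypothetical names introduced here, and the n-grams are built with a plain loop instead of stylo so the block runs on its own. It assumes the user's phrase has exactly n - 1 words.

```r
# Hypothetical helpers; names and structure are assumptions for illustration.

# Lowercase, drop punctuation and digits, collapse runs of whitespace
cleanText <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]|[[:digit:]]", "", x)
  x <- gsub("[[:space:]]+", " ", x)
  trimws(x)
}

# Count every n-word phrase found in a vector of text lines
buildNgramCounts <- function(lines, n = 4) {
  words <- unlist(strsplit(cleanText(lines), " ", fixed = TRUE))
  words <- words[words != ""]
  if (length(words) < n) {
    return(data.frame(ngrams = character(0), Freq = integer(0)))
  }
  starts <- seq_len(length(words) - n + 1)
  ngrams <- vapply(
    starts,
    function(i) paste(words[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  counts <- as.data.frame(table(ngrams), stringsAsFactors = FALSE)
  counts[order(-counts$Freq), ]  # most frequent first
}

# Return the last word of the most frequent n-gram that starts with `phrase`
# (assumes `phrase` has n - 1 words; returns NA when nothing matches)
predictNext <- function(phrase, counts) {
  prefix <- paste0(cleanText(phrase), " ")
  hits <- counts[startsWith(counts$ngrams, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)
  sub(".* ", "", hits$ngrams[1])
}

tinyCorpus <- c("you are my best friend",
                "It's possible you are my... dog",
                "says you are my best fan")
counts <- buildNgramCounts(tinyCorpus)
predictNext("You are my", counts)
```

With the tiny corpus, the call on the last line returns "best", and a phrase that never appears in the corpus returns NA.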

Room for Improvement

NextWords is OK, but it could be much better…

  • Increase speed with more efficient code
  • Increase accuracy with larger, more diverse corpus
  • Add functionality
    • Remove swearwords (if that's deemed necessary)
    • Part of speech recognition
    • Language recognition
    • Punctuation and emoticon prediction
    • Question answering (similar to Siri)
    • Store and learn from user input (like your smartphone)
  • Improve handling of input that contains a misspelled or unknown word
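
One possible approach to the unknown-word item is to back off to shorter n-grams when the full phrase has no match. The sketch below is an assumption about how that could look, not something NextWords does today: predictWithBackoff is a hypothetical function, `tables` is an assumed named list of count vectors (one per n-gram size), and the counts are made-up toy numbers.

```r
# Hypothetical back-off lookup. `tables` is an assumed named list of named
# count vectors, one per n-gram size, e.g. tables[["3"]] holds 3-gram counts.
predictWithBackoff <- function(phrase, tables, maxN = 4) {
  words <- unlist(strsplit(tolower(trimws(phrase)), "[[:space:]]+"))
  for (n in maxN:2) {
    if (length(words) < n - 1) next
    counts <- tables[[as.character(n)]]
    if (is.null(counts)) next
    # keep only the last n - 1 words typed by the user
    prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # last word of the best match
    }
  }
  NA_character_  # nothing matched at any n-gram size
}

# Toy count tables (made-up numbers for illustration only)
tables <- list(
  "4" = c("you are my best" = 2, "you are my dog" = 1),
  "3" = c("are my best" = 3, "is my friend" = 2),
  "2" = c("my best" = 3, "my dog" = 1, "friend of" = 1)
)

predictWithBackoff("you are my", tables)  # matched by a 4-gram
predictWithBackoff("she is my", tables)   # falls back to the 3-gram table
predictWithBackoff("zzz qqq my", tables)  # falls back to the 2-gram table
```

The trade-off is that shorter n-grams almost always produce some answer, but with less context behind it, so the fallback predictions tend to be less specific.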

Summary

NextWords is a project created for the Data Science Capstone course offered by Johns Hopkins University via Coursera. Overall, the app…

  • Is good at predicting next word(s)
  • Should be faster and offer additional functionality
    • As with all projects, lifting time and monetary constraints would help to achieve those ends
  • Provides a solid foundation for building a practical application

The app itself lives here, the code for the app lives here, and the code for this deck lives here.

Thanks for reading!