NextWords: An App for Text Prediction

FC
December 2014

Demonstration

High Level

Here's how it works in plain English…

Start with a corpus (collection of text)
Remove punctuation and numbers, convert capitals to lower case, and collapse extra whitespace in that text
Create n-grams (distinct n-word phrases) and count how frequently they occur
- For instance “you are my best” is a 4-gram that probably occurs dozens of times
Store n-grams in a searchable format and pass to Shiny app

So, when a user types “you are my”, the app searches the database for the most frequent 4-gram that starts with “you are my” and passes back the fourth word in that phrase, perhaps “best”.
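
The cleaning step in the list above can be sketched in base R. This is a minimal sketch, not the app's actual code: `cleanText` is a hypothetical helper name, and the order of the substitutions is an assumption.

```r
# Hypothetical helper (not taken from the NextWords source):
# normalise raw text before n-grams are counted.
cleanText <- function(x) {
  x <- tolower(x)                     # capitals -> lower case
  x <- gsub("[[:punct:]]", "", x)     # drop punctuation
  x <- gsub("[[:digit:]]", "", x)     # drop numbers
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of whitespace
  trimws(x)                           # strip leading/trailing spaces
}

cleanText("  It's 2014...  You ARE my   best friend! ")
# [1] "its you are my best friend"
```

Note that removing punctuation turns "It's" into "its", so contractions and possessives become indistinguishable after cleaning.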

Now With Some Code: Tiny Example

# build a tiny corpus
tinyCorpus <- c("you are my best friend", "It's possible you are my... dog", "says you are my best fan")

# Remove punctuation, change capitals to lowers, and transform lines to long list of words
tinyCorpus <- gsub("([[:punct:]])", "", tolower(tinyCorpus))
allWords <- unlist(strsplit(tinyCorpus, " "))

# create 4-grams and count frequency of those 4-grams
library(stylo)
ngrams <- make.ngrams(allWords, 4)
count <- as.data.frame(table(ngrams))
# sort so the most frequent 4-grams come first
count <- count[order(-count$Freq), ]

# now search for "you are my " to see that "best" is the most likely next word
# (the trailing space in the pattern keeps "my" from matching longer words such as "myself")
count[grep("^you are my ", count$ngrams), ]

            ngrams Freq
12 you are my best    2
13  you are my dog    1
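
To pass back only the predicted word rather than the matching rows, the search can be wrapped in a small function. This is a sketch under assumptions: `predictNext` is a hypothetical name, and `counts` is a hand-built stand-in for the two rows printed above so the block runs without stylo.

```r
# Hypothetical helper: return the last word of the most frequent n-gram
# that starts with the user's phrase. `counts` is assumed to have the
# same two columns as the data frame above (ngrams, Freq).
predictNext <- function(phrase, counts) {
  ngrams <- as.character(counts$ngrams)
  prefix <- paste0(phrase, " ")                 # trailing space: whole words only
  hits <- counts[startsWith(ngrams, prefix), ]
  if (nrow(hits) == 0) return(NA_character_)    # no n-gram starts with the phrase
  best <- as.character(hits$ngrams[which.max(hits$Freq)])
  sub("^.* ", "", best)                         # keep only the final word
}

# Stand-in for the two matching rows printed above
counts <- data.frame(
  ngrams = c("you are my best", "you are my dog"),
  Freq = c(2, 1)
)

predictNext("you are my", counts)
# [1] "best"
```

Using `startsWith` instead of `grep` avoids having to escape regex metacharacters in whatever the user types.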

Room for Improvement

NextWords is OK, but it could be much better…

Increase speed with more efficient code
Increase accuracy with larger, more diverse corpus
Add functionality
- Remove swearwords (if that's deemed necessary)
- Part of speech recognition
- Language recognition
- Punctuation and emoticon prediction
- Question answering (similar to Siri)
- Store and learn from user input (like your smartphone)
Improve 'error' handling when the user enters a misspelled or 'unknown' word
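
One possible way to handle an 'unknown' word is to back off to a shorter context: if no 4-gram matches, drop the earliest word and try 3-grams, then 2-grams. The sketch below is an assumption about how that could look, not something NextWords currently does. `makeNgrams` is a hypothetical base-R stand-in for `make.ngrams`, and `predictWithBackoff` is likewise a made-up name. It does not attempt spelling correction.

```r
# Base-R n-gram builder (hypothetical stand-in for stylo's make.ngrams)
makeNgrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Try the longest context first; if nothing matches, drop the earliest
# word of the context and retry with shorter n-grams.
predictWithBackoff <- function(phrase, words, maxN = 4) {
  context <- strsplit(phrase, " +")[[1]]
  context <- context[nzchar(context)]
  context <- tail(context, maxN - 1)          # keep at most maxN - 1 words
  while (length(context) > 0) {
    grams <- makeNgrams(words, length(context) + 1)
    prefix <- paste0(paste(context, collapse = " "), " ")
    hits <- grams[startsWith(grams, prefix)]
    if (length(hits) > 0) {
      best <- names(which.max(table(hits)))   # most frequent matching n-gram
      return(sub("^.* ", "", best))           # its final word
    }
    context <- context[-1]                    # back off to a shorter context
  }
  NA_character_                               # nothing matched at any length
}

words <- strsplit(paste("you are my best friend its possible you are my dog",
                        "says you are my best fan"), " ")[[1]]

predictWithBackoff("zzz are my", words)
# [1] "best"
```

Here "zzz are my" matches no 4-gram, so the search falls back to 3-grams starting with "are my". This sketch rebuilds the n-grams on every call for brevity; a real app would presumably precompute one frequency table per n-gram length.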

Summary

NextWords is a project created for the Data Science Capstone course offered by Johns Hopkins University via Coursera. Overall, the app…

Is good at predicting next word(s)
Should be faster and have additional functionality
- As with all projects, lifting time/monetary constraints would help to achieve those ends
Provides solid foundation for creating practical application

The app itself lives here, the code for the app lives here, and the code for this deck lives here.

Thanks for reading!