Coursera Data Science Capstone Presentation

Jorge Bretones Santamarina
31/10/2019

Motivation: build a text prediction algorithm

  • This project lies within the field of Natural Language Processing (NLP), a subfield of artificial intelligence concerned with the understanding of language by machines.
  • The goal of this project is to develop an NLP tool that predicts the next word of a phrase given some context.
  • The training data provided consists of a compilation of text from three sources: Twitter, blogs and news. The main features of the data are summarized below (a sketch of how such counts can be computed follows the table):
            Word counts   Line counts   File size (MB)
Twitter      30,373,543     2,360,148           334.48
Blog         37,334,131       899,288           267.76
News          2,643,969        77,259            20.73
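As a minimal sketch (not the original preprocessing code), statistics like these can be computed in R with the stringi package; the file name in the example call is an assumption.

```r
library(stringi)

# Word count, line count and size on disk for one corpus file
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(
    Word.counts  = sum(stri_count_words(lines)),
    Line.counts  = length(lines),
    File.size.MB = round(file.size(path) / 1024^2, 2)
  )
}

summarise_file("en_US.twitter.txt")  # assumed file name
```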

N-gram language models and Stupid Backoff

  • Models that assign probabilities to sequences of n words (n-grams) are called n-gram models.
  • The goal is to compute the probability of a new word given some history. E.g.: \( \small P(word|history) = P(exam|he~studied~and~passed~the) \)
  • To simplify, we make the n-gram assumption, by which the probability of the next word depends only on the previous \( n-1 \) words. E.g.: \( \small n = 2 \rightarrow P(exam|he~studied~and~passed~the) \approx P(exam|the) \)
  • We used Stupid Backoff to score candidate words, using up to 3-grams when available (for more information, see Brants et al., 2007). The formula, and a minimal R sketch of it, are shown below:
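\[
\small
S(w_i \mid w_{i-k+1}^{i-1}) =
\begin{cases}
\dfrac{f(w_{i-k+1}^{i})}{f(w_{i-k+1}^{i-1})} & \text{if } f(w_{i-k+1}^{i}) > 0 \\[6pt]
\lambda \, S(w_i \mid w_{i-k+2}^{i-1}) & \text{otherwise,}
\end{cases}
\]

where \( f(\cdot) \) denotes raw corpus counts and \( \lambda = 0.4 \) is the backoff penalty recommended by Brants et al. Note that \( S \) is a relative score rather than a normalized probability.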
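The following is a minimal R sketch of this scoring rule, assuming the n-gram counts are stored in data frames with columns `prefix`, `word` and `freq`; this layout and the function name are illustrative, not the app's actual code.

```r
# Stupid Backoff score for `word` given the preceding `context` words.
# `ngrams` is a list of frequency tables, highest order first, e.g.
# list(tri, bi, uni); the unigram table uses prefix = "".
sb_score <- function(word, context, ngrams, lambda = 0.4) {
  for (i in seq_along(ngrams)) {
    tab <- ngrams[[i]]
    ord <- length(ngrams) - i + 1             # n-gram order of this table
    ctx <- paste(tail(context, ord - 1), collapse = " ")
    num <- tab$freq[tab$prefix == ctx & tab$word == word]
    den <- sum(tab$freq[tab$prefix == ctx])
    if (length(num) == 1 && num > 0)
      return(lambda^(i - 1) * num / den)      # 0.4 penalty per backoff step
  }
  0  # word never observed, even as a unigram
}

# Example call: sb_score("exam", c("passed", "the"), list(tri, bi, uni))
```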

The Shiny application

  • We deployed the algorithm as a Shiny application; a minimal sketch of its interface follows this list.
  • To use it, go to https://jorgebs94.shinyapps.io/text_prediction_app/ and allow a few moments for the application to load.
  • Then type some text in the input bar and click the predict button. The three most likely next words, along with the single most probable one, are displayed below.
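Below is a minimal sketch of a Shiny app with this interface, not the deployed application's source; `predict_word()` is a placeholder standing in for the Stupid Backoff model.

```r
library(shiny)

# Placeholder predictor standing in for the Stupid Backoff model;
# the deployed app computes real scores from the n-gram tables.
predict_word <- function(phrase) {
  data.frame(word = c("the", "to", "and"), score = c(0.30, 0.20, 0.15))
}

ui <- fluidPage(
  titlePanel("Text prediction"),
  textInput("phrase", "Type some text:"),
  actionButton("go", "Predict"),
  tableOutput("top3")
)

server <- function(input, output) {
  # Recompute predictions only when the button is clicked
  preds <- eventReactive(input$go, predict_word(input$phrase))
  output$top3 <- renderTable(preds())
}

shinyApp(ui, server)
```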

Additional information