APP USER'S MANUAL

Veronica Vedovetto
23/09/2019

INTRODUCTION

We want to build an app to predict the next word in a sentence.

What we need:

  • Example text (a corpus): 3 files from 3 different sources (blogs, Twitter, news)
  • Language model: Stupid Backoff
  • User interface: based on the Shiny package

ALGORITHM

  • The language model is built upon the “Stupid Backoff” algorithm suggested by Brants et al. (2007). This algorithm is simple, intuitive and not too computationally intensive.
  • The model uses trigrams, bigrams and unigrams calculated from a dictionary of 189861 words.
  • We decided to use a restricted dictionary in order to reserve some probability for UNKNOWN words (words not in the dictionary).
  • Some punctuation marks (such as , . ? !) are treated as “words”, the apostrophe (') is treated as part of the word, and some combinations of punctuation marks are coded as emoticons (happy smile, sad smile, winking smile).
  • The prediction of the new word is based on the last two words of the sentence (or on the single word, for a one-word sentence).
  • If either of the two words is not in the dictionary, it is marked as “UNKNOWN WORD” and treated like any other word in the dictionary.
  • The new word is selected by weighted random sampling among all the words that follow in the trigrams/bigrams, or that appear in the unigrams. The sampling weight is the conditional probability of the new word in the trigram/bigram/unigram.
  • Because of the sampling, the result is not always the same word. However, the most probable word is selected more often than a word with a lower probability.
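
The scoring and sampling steps listed above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the app's actual code: the backoff factor 0.4 is the value suggested by Brants et al. (2007), and this manual does not say whether the app applies it; the function names, the `<UNK>` marker and the toy corpus are hypothetical.

```python
import random
from collections import Counter

ALPHA = 0.4    # backoff factor from Brants et al. (2007); assumed, not confirmed for the app
UNK = "<UNK>"  # hypothetical marker for words outside the dictionary


def build_counts(tokens):
    """Count unigrams, bigrams and trigrams in a list of tokens."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return uni, bi, tri


def score(word, ctx, uni, bi, tri, total):
    """Stupid Backoff score of `word` after `ctx` (a tuple of 0 to 2 words).

    Returns (score, source), where source names the n-gram table that was used.
    """
    penalty = 1.0
    if len(ctx) == 2:
        if tri[ctx + (word,)] > 0:
            return tri[ctx + (word,)] / bi[ctx], "trigrams"
        penalty *= ALPHA  # trigram not seen: back off to the bigram
        ctx = ctx[1:]
    if len(ctx) == 1:
        if bi[ctx + (word,)] > 0:
            return penalty * bi[ctx + (word,)] / uni[ctx[0]], "bigrams"
        penalty *= ALPHA  # bigram not seen: back off to the unigram
    return penalty * uni[word] / total, "unigrams"


def predict(context, uni, bi, tri, rng=random):
    """Sample one next word, weighted by its backoff score."""
    # Keep the last two words; replace out-of-dictionary words with UNK.
    ctx = tuple(w if w in uni else UNK for w in context[-2:])
    total = sum(uni.values())
    scored = {w: score(w, ctx, uni, bi, tri, total) for w in uni}
    words = list(scored)
    weights = [scored[w][0] for w in words]
    choice = rng.choices(words, weights=weights, k=1)[0]
    return choice, scored[choice]


# Toy corpus standing in for the real blog/Twitter/news files.
tokens = "i like tea . i like coffee . i drink tea .".split()
uni, bi, tri = build_counts(tokens)
word, (prob, source) = predict(["i", "like"], uni, bi, tri, random.Random(0))
print(word, prob, source)
```

In this sketch the scores are used directly as sampling weights, so a word seen in a trigram is drawn far more often than one reached only through the unigram table, but repeated calls can still return different words. In a real corpus the rare words would be replaced by the unknown marker before counting, so that it has counts of its own like any other dictionary word.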

HOW TO USE THE APP

The user interface is very simple:

  • Text box: just type your sentence here
  • Number of words to show in the result (min = 1, max = 200)
  • Run button

Type the sentence in the text box, select the number of words that you want to display in the output and click the “Run” button.

The app can take some time to start, because it first has to load the data used by the model.

APP OUTPUT

  • The app outputs a table with one row per word; the number of rows is the number of words you chose to display.
  • Each row shows the word, its conditional probability and its source (trigrams, bigrams or unigrams).
  • Punctuation marks such as . , ! ? or combinations such as ;) :-) :) :( can appear too, as they are treated as special words.
  • If the number of words to show is high, you can navigate the table.
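
Punctuation and emoticons show up in the table because they become tokens of their own when the text is split into words. A minimal Python sketch of such a tokenizer is given below. It is an assumption about how this could be done, not the app's actual code: the names `EMOTICONS`, `TOKEN_RE` and `tokenize` are hypothetical, only the four emoticons named in this manual are covered, and digits and any other symbols are simply dropped.

```python
import re

# Emoticons named in this manual. They are tried before single punctuation
# marks so that ":-)" is not split into separate characters.
EMOTICONS = [":-)", ":)", ":(", ";)"]

TOKEN_RE = re.compile(
    "|".join(re.escape(e) for e in EMOTICONS)  # emoticons first
    + r"|[a-z]+(?:'[a-z]+)*"                   # words; apostrophes stay inside the word
    + r"|[,.?!]"                               # punctuation marks treated as words
)


def tokenize(text):
    """Lowercase the text and split it into words, punctuation and emoticons."""
    return TOKEN_RE.findall(text.lower())


print(tokenize("I don't know, maybe :-) ok!"))
```

Only the straight apostrophe (') is kept inside words here; typographic apostrophes would need to be normalized first.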