Text Prediction App

Jacky Wong Kae Perng
April 14 2015

The “Text Prediction” Shiny app predicts the next word from a word or phrase entered by the user. Next-word prediction is particularly useful for suggesting terms while typing on mobile devices. This presentation covers the app's features, its prediction algorithm and its implementation details.

URL Links : App Location | GitHub | RPubs | Data Source

User Interface

[Screenshot of the app]

  • In the screenshot, the user enters “have to” and clicks “GO!”, and the app displays “get”, “be”, “say”, “go”, “take” as the top 5 predicted words.

Behind the scenes, the Shiny app performs the following steps:

  • Process the user's input, e.g. convert it to lower case and remove punctuation and numbers.
  • Compare the terms in the user's input against the Unigram, Bigram and Trigram tables built from the text corpus.
  • Retrieve and display the top 5 terms, ranked by probability / frequency.
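
The input-processing step can be sketched as follows. This is a minimal Python illustration, not the app's own code (the app is written in R with Shiny). The function names `clean_input` and `last_two_words` are hypothetical, and the exact filtering rule (keep only the letters a-z and whitespace) is an assumption.

```python
import re


def clean_input(text):
    """Lower-case the text, drop punctuation and digits, return the words."""
    text = text.lower()
    # Assumed rule: delete every character that is not a-z or whitespace.
    text = re.sub(r"[^a-z\s]", "", text)
    return text.split()


def last_two_words(text):
    """Return at most the last two cleaned words, the context for a Trigram lookup."""
    return clean_input(text)[-2:]


print(last_two_words("I HAVE to, 2!"))
```

Only the last two words are kept because a Trigram lookup conditions on two preceding words; shorter input simply yields a shorter context.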

Detailed Algorithm (Part 1)

  • Sample 300,000 lines from the Blogs, Twitter and News texts. Build Unigram, Bigram and Trigram frequency tables.
  • E.g. the frequency table for Trigrams of the form “have to xx”:

    Trigram           Frequency
    “have to get”     50
    “have to be”      40
    “have to say”     30
    “have to go”      20
    “have to take”    10
    “have to die”     5
  • When the user enters “have to”, the app looks up the Trigrams that begin with “have to” and computes the conditional probability of the third word given the two preceding words, i.e. “have to”.
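
Building the frequency tables can be sketched as below. This is a minimal Python illustration under stated assumptions, not the app's R code: `ngram_counts` is a hypothetical helper, and `sample_lines` is made-up toy data standing in for the sampled corpus lines.

```python
from collections import Counter


def ngram_counts(lines, n):
    """Count every run of n consecutive words across the given lines."""
    counts = Counter()
    for line in lines:
        words = line.lower().split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return counts


# Toy stand-in for the sampled corpus lines (made-up data).
sample_lines = [
    "i have to get up",
    "you have to get out",
    "we have to be there",
]

unigrams = ngram_counts(sample_lines, 1)
bigrams = ngram_counts(sample_lines, 2)
trigrams = ngram_counts(sample_lines, 3)

# Trigrams that begin with "have to", most frequent first.
have_to = {k: v for k, v in trigrams.items() if k[:2] == ("have", "to")}
print(sorted(have_to.items(), key=lambda kv: -kv[1]))
```

Keying each table by a tuple of words keeps the lookup for a given two-word prefix a simple comparison on the first two elements.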

Detailed Algorithm (Part 2)

  • Continuing the previous example, the app computes the following conditional probabilities.

    Conditional probability    Value
    p(“get” | “have to”)       50/(50+40+30+20+10+5) = 0.32
    p(“be” | “have to”)        40/(50+40+30+20+10+5) = 0.26
    p(“say” | “have to”)       30/(50+40+30+20+10+5) = 0.19
    p(“go” | “have to”)        20/(50+40+30+20+10+5) = 0.13
    p(“take” | “have to”)      10/(50+40+30+20+10+5) = 0.06
    p(“die” | “have to”)        5/(50+40+30+20+10+5) = 0.03
  • Based on the computed probabilities, the 5 most probable next words given “have to” are “get”, “be”, “say”, “go”, “take”, in that order. The word “die” does not make the top 5 and hence is not displayed to the user.
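
The calculation above can be reproduced with a short Python sketch. It is an illustration only, not the app's R code; the helper names `next_word_probabilities` and `top_predictions` are hypothetical, and the counts are the example frequencies from the Trigram table.

```python
def next_word_probabilities(trigram_counts, w1, w2):
    """Return (word, probability) pairs for Trigrams starting with w1 w2, best first."""
    matches = {k[2]: v for k, v in trigram_counts.items() if k[0] == w1 and k[1] == w2}
    total = sum(matches.values())
    if total == 0:
        return []  # no Trigram starts with this two-word prefix
    ranked = sorted(matches.items(), key=lambda kv: kv[1], reverse=True)
    return [(word, count / total) for word, count in ranked]


def top_predictions(trigram_counts, w1, w2, k=5):
    """Return the k most probable next words after w1 w2."""
    return [word for word, _ in next_word_probabilities(trigram_counts, w1, w2)[:k]]


# Example frequencies from the "have to xx" table.
slide_counts = {
    ("have", "to", "get"): 50,
    ("have", "to", "be"): 40,
    ("have", "to", "say"): 30,
    ("have", "to", "go"): 20,
    ("have", "to", "take"): 10,
    ("have", "to", "die"): 5,
}

for word, p in next_word_probabilities(slide_counts, "have", "to"):
    print(word, round(p, 2))

print(top_predictions(slide_counts, "have", "to"))
# prints ['get', 'be', 'say', 'go', 'take']
```

Each probability is the Trigram's count divided by the total count of all Trigrams sharing the same two-word prefix (155 in this example), so the values sum to 1 before rounding.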

Result & Documentation Tab

The result is displayed as follows: [screenshot]

The user can also click on the documentation tab: [screenshot]

Limitations:

  • I was not able to extract a larger sample because of the memory limit on my PC. Hence, I had to compromise on accuracy by sampling only 300,000 lines.
  • I did not manage to get all the quiz questions right, so I suppose there is a problem with accuracy.
  • Thank you.