Coursera Capstone Data Science - Final Project

Leonardo Cavalcanti (cavalcanti.indg@gmail.com)
April 22, 2015

Introduction

The main goal of this project is to create a Shiny app that suggests the next word in a phrase, especially on mobile devices. It will help the user save time when typing. To accomplish this task I took the following steps.

  • Loaded the dataset provided for this project and cleaned it by removing punctuation, converting all words to lowercase, removing numbers, etc.
  • Created N-grams of order 1, 2 and 3 from the whole dataset (without sampling).
  • Created N-grams of order 4 from a sample of the dataset (because of memory limitations).
  • Created a spelling-correction suggestion, since spelling mistakes could harm the ability to predict the next word accurately.
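
The cleaning step in the first bullet can be illustrated with a minimal sketch. It is written in Python for illustration only; the project itself was done in R, and the function name `clean_text` and the exact rules (for example, dropping apostrophes and any non-ASCII letters) are assumptions, not the project's actual code.

```python
import re

def clean_text(line: str) -> list[str]:
    """Lowercase a line, drop apostrophes, numbers and punctuation, split into words.

    Assumption: only the ASCII letters a-z are kept, which suits English text.
    """
    line = line.lower()
    line = line.replace("'", "")            # "it's" becomes "its"
    line = re.sub(r"[^a-z\s]", " ", line)   # digits and punctuation become spaces
    return line.split()                     # split on any run of whitespace

print(clean_text("Hello, World! It's 2015."))
```

Replacing punctuation with a space rather than deleting it keeps words such as "hello,world" from being merged into one token.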

Dataset and Strategy for Using It

This dataset is itself a sample of text from the web. For this reason I tried many strategies to process the whole dataset, without sampling it to reduce its size. It is important to note that for this project I used a MacBook Pro with 8 GB of RAM, which made this really challenging.

I used three main R packages to do that: tm, RWeka and slam. It is worth noting that some functions are better suited to reading files and help to manage this task, for example the DirSource function, which accesses the files directly to create a corpus. I created a corpus from the whole dataset and cleaned it without sampling.

Finally, I created four databases, one for each N-gram order (1 to 4). Each database has columns holding the words, plus a last column with the frequency of that sequence in the dataset.
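
As a rough illustration of what these four frequency tables contain, the sketch below counts N-grams of order 1 to 4 in a toy sentence. It is a Python sketch, not the R code used in the project; the names `ngram_counts` and `tables` and the sample sentence are made up for the example.

```python
from collections import Counter

def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every run of n consecutive tokens; keys are word tuples."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Toy corpus standing in for the cleaned dataset (hypothetical data).
tokens = "i love new york and i love new ideas".split()

# One table per N-gram order, 1 up to 4, mapping word sequence -> frequency.
tables = {n: ngram_counts(tokens, n) for n in range(1, 5)}

for gram, freq in tables[2].most_common(3):
    print(" ".join(gram), freq)
```

Each entry pairs a word sequence with its frequency, which mirrors the layout described above: the words first, the count last.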

Algorithm Developed

The algorithm I developed follows these main steps:

  • First, count the number of words the user has typed.
  • Choose which N-gram order to try first, based on that number of words.
  • If no answer is found, try the next lower N-gram order, and so on.
  • At the end, check whether there are at least 7 suggested words. If not, use the order-1 N-grams to complete the list up to 7 words (in this case, any word the user has already typed is excluded from this N-gram).
  • Check whether the user made any misspelling. If so, show a possible correction.
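
The back-off steps above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the project's R implementation: the names `build_tables` and `suggest` are hypothetical, the search is assumed to stop at the first N-gram order that yields any candidate, and the spelling check is left out.

```python
from collections import Counter

def build_tables(tokens: list[str], max_order: int = 4) -> dict[int, Counter]:
    """Frequency table for each N-gram order from 1 to max_order."""
    return {
        n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        for n in range(1, max_order + 1)
    }

def suggest(typed: list[str], tables: dict[int, Counter], k: int = 7) -> list[str]:
    """Suggest up to k next words, backing off to lower N-gram orders."""
    suggestions: list[str] = []
    # Highest order usable: an order-n lookup needs n - 1 words of context.
    start = min(len(typed) + 1, max(tables))
    for n in range(start, 1, -1):
        context = tuple(typed[-(n - 1):])
        candidates = Counter(
            {gram[-1]: freq for gram, freq in tables[n].items() if gram[:-1] == context}
        )
        suggestions = [word for word, _ in candidates.most_common()]
        if suggestions:
            break  # assumption: stop at the first order that gives an answer
    # Pad with the most frequent single words, skipping words already typed.
    for (word,), _ in tables[1].most_common():
        if len(suggestions) >= k:
            break
        if word not in typed and word not in suggestions:
            suggestions.append(word)
    return suggestions[:k]

tokens = "i love new york and i love new ideas and i love you".split()
tables = build_tables(tokens)
print(suggest(["i", "love"], tables))
```

With this toy corpus the order-3 table answers the query directly, and the single-word table only fills the remaining slots. A real vocabulary would be large enough to always reach 7 suggestions; the toy one is not.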

Conclusion

This project helped me a lot. I had never worked with natural language processing or text analytics of any kind. Even though this app is not state of the art in the NLP field, building it was very challenging for me.