Text Prediction Project

A brief overview

Text mining and Natural Language Processing in R, used to predict n-grams from a text input in a Shiny app. The app is hosted on shinyapps.io (https://vx5cud-jhon0edicson-joya0calderon.shinyapps.io/shiny/).

Natural Language Processing (NLP) & Text Mining

Nowadays social media and the news provide a practically unlimited source of text. The total size of this data is impossible to calculate, but some estimates put the rate at gigabytes per second or more. Text mining is sometimes called an art, because there are many methods and many ways to interpret the results. Common ways to obtain text data include web scraping, relational databases and third-party APIs.

The “tm” package and RWeka

The tm and RWeka packages are open-source resources for manipulating text data, primarily plain text. The first object to consider is the corpus, which works like a library that indexes the different text sources. The second object is the DocumentTermMatrix, whose goal is to build a table relating each term to its frequency of occurrence. Once the DocumentTermMatrix is created, the next step is to build the frequency table from the tokens produced by the n-grams (an n-gram is a sequence of n consecutive words, for example a pair of words when n = 2, extracted from every document in the corpus).
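The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the two sample sentences, the cleaning steps and the names bigram_tokenizer and freq_table are assumptions made for the example, and it assumes the tm and RWeka packages (and the Java runtime RWeka depends on) are installed.

```r
library(tm)
library(RWeka)

# Tiny illustrative input; the real project would read its own text files.
texts <- c("The cat sat on the mat.", "The cat ate the fish.")

# 1. Build the corpus, the "library" that indexes each text source.
corpus <- VCorpus(VectorSource(texts))

# 2. Basic cleaning (assumed steps; adapt to the data at hand).
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# 3. Tokenizer that emits bigrams (n = 2) through RWeka.
bigram_tokenizer <- function(x) {
  NGramTokenizer(x, Weka_control(min = 2, max = 2))
}

# 4. DocumentTermMatrix whose "terms" are the bigrams.
dtm <- DocumentTermMatrix(corpus, control = list(tokenize = bigram_tokenizer))

# 5. Frequency table: total count of each bigram across all documents.
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq_table <- data.frame(
  ngram = names(freq),
  count = as.integer(freq),
  row.names = NULL,
  stringsAsFactors = FALSE
)

print(freq_table)
```

Converting the matrix with as.matrix is fine for a small sample like this one, but on a large corpus the dense matrix may not fit in memory, so a sparse-aware summation would likely be needed there.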

The final product

The use case of the Shiny app is to take a phrase typed by the user and predict its next word, using the prediction model loaded in the app.
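One simple way such a prediction could work is a lookup in a bigram frequency table: take the last word of the phrase and return the word that most often follows it. The sketch below uses base R only. The table bigram_freq and the function predict_next_word are hypothetical names with made-up counts, and the actual model in the app may work differently (for example with longer n-grams or a back-off strategy).

```r
# Hypothetical bigram frequency table; in practice it would be built from the corpus.
bigram_freq <- data.frame(
  ngram = c("the cat", "the fish", "cat sat", "cat ate", "sat on"),
  count = c(5L, 2L, 3L, 1L, 3L),
  stringsAsFactors = FALSE
)

predict_next_word <- function(phrase, freq_table) {
  # Clean the input the same way the training text is assumed to be cleaned.
  cleaned <- gsub("[[:punct:]]+", "", tolower(trimws(phrase)))
  words <- strsplit(cleaned, "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(NA_character_)
  last_word <- words[length(words)]

  # Split every bigram into its first and second word.
  parts <- strsplit(freq_table$ngram, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)

  # Keep the bigrams that start with the last word of the phrase.
  matches <- first == last_word
  if (!any(matches)) return(NA_character_)

  # Return the most frequent continuation.
  second[matches][which.max(freq_table$count[matches])]
}

print(predict_next_word("I saw the", bigram_freq))
print(predict_next_word("hello", bigram_freq))
```

Returning NA for an unseen word keeps the sketch short; a fuller model would usually fall back to shorter n-grams or to the most common words overall.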

Conclusions

  1. More data = more effective prediction.
  2. Text data requires a lot of computing resources.
  3. Nowadays faster and more efficient packages are available (quanteda, for example).
  4. Data cleaning is the most important step in the whole text mining process.