Final Project - Data Science.

PREDICTIVE TEXT

Marco Aurelio Guado Zavaleta
Sep. 2017

OBJECTIVE

Create a model that predicts the next word with the highest probability of occurrence, based on the text data stored in the system.

TOOLS

  • R-3.4.1
  • RStudio-1.0.153
  • R Markdown

Packages:
library("tm"): package for text processing (text mining).
library("SnowballC"): stemming, i.e. reducing words to their root.
library("wordcloud"): plots words as a word cloud.
library("RColorBrewer"): color palettes for the words in the word cloud.
library("dplyr"): manipulating and operating on data frames.
library("tidytext"): useful for performing text mining.

METHODOLOGY

We will use the following workflow:

DATA COLLECTED ->
  (tm)

CLEAN DATASET -> EXPLORATORY ->
  (SnowballC, wordcloud, RColorBrewer)

MODEL & ALGORITHMS -> DATA PRODUCT
  (dplyr, tidytext)

Development

  • Load the data from the files into memory.
  • Normalize the data: convert everything to lowercase, remove punctuation marks, numbers, extra whitespace and articles, and reduce the remaining words to their root (stemming).
  • Convert the normalized data into a large matrix so that it can be manipulated.
  • Split this matrix into two-word sequences (n-grams with n = 2).
  • Sort the n-grams (word pairs) by frequency, from highest to lowest.
  • Store the sorted n-grams in a data frame (word1, word2, freq).
  • To predict, look up the input word in word1; the position of each match gives the corresponding word2 (the word to predict).
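
The steps above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the project's actual code: the toy corpus, the `articles` vector and the helper names `tokenize`, `line_bigrams` and `build_bigram_table` are all hypothetical, and stemming is left out because it needs SnowballC. The real application presumably relies on tm and tidytext for these steps.

```r
# Toy corpus standing in for the project's input files (hypothetical data).
corpus_lines <- c("The cat sat on the mat.",
                  "The cat ate 2 fish!",
                  "A dog sat on the rug.")

# Articles to drop during normalization (assumed list, English only).
articles <- c("a", "an", "the")

# Normalize one line and split it into words:
# lowercase, strip punctuation and digits, split on whitespace, drop articles.
tokenize <- function(line) {
  line <- tolower(line)
  line <- gsub("[[:punct:][:digit:]]", " ", line)
  words <- strsplit(line, "[[:space:]]+")[[1]]
  words[words != "" & !(words %in% articles)]
}

# Turn one line into consecutive word pairs (word1, word2).
line_bigrams <- function(line) {
  words <- tokenize(line)
  if (length(words) < 2) return(NULL)  # no pair can be formed
  data.frame(word1 = head(words, -1),
             word2 = tail(words, -1),
             stringsAsFactors = FALSE)
}

# Count every pair and sort by frequency, highest first.
# Assumes at least one line yields a pair.
build_bigram_table <- function(lines) {
  pairs <- do.call(rbind, lapply(lines, line_bigrams))
  pairs$freq <- 1L
  counts <- aggregate(freq ~ word1 + word2, data = pairs, FUN = sum)
  counts <- counts[order(-counts$freq, counts$word1, counts$word2), ]
  rownames(counts) <- NULL
  counts
}

bigrams <- build_bigram_table(corpus_lines)
print(head(bigrams))
```

In this toy corpus the pair (sat, on) occurs twice and therefore ends up in the first row of the table; every other pair occurs once.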

Description of the application

Because the application runs in a free hosting environment, loading and processing the data takes several minutes.

When the application loads, I show a table with the most frequent words and the word that follows each of them.

If we enter a word, we will see a response on the right side. The response contains the 5 possible next words, ordered by frequency from highest to lowest.
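
A lookup of this kind might look like the following sketch. It assumes a bigram table in the (word1, word2, freq) layout; the sample rows and the function name `predict_next` are hypothetical and are not taken from the application itself.

```r
# Hypothetical bigram table in the (word1, word2, freq) layout.
bigrams <- data.frame(
  word1 = c("of", "of", "of", "of", "of", "of", "in", "in"),
  word2 = c("the", "a", "my", "this", "his", "our", "the", "a"),
  freq  = c(90L, 40L, 25L, 20L, 15L, 10L, 80L, 30L),
  stringsAsFactors = FALSE
)

# Return up to n candidate next words for `word`, most frequent first.
predict_next <- function(word, table, n = 5) {
  word <- tolower(trimws(word))                  # normalize the user input
  hits <- table[table$word1 == word, c("word2", "freq")]
  hits <- hits[order(-hits$freq), ]              # highest frequency first
  head(hits$word2, n)
}

print(predict_next("Of ", bigrams))    # "the" "a" "my" "this" "his"
print(predict_next("zebra", bigrams))  # character(0): word not in the table
```

A word that never appears as word1 yields an empty result, so an application built this way would need to decide what to display in that case.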

I also show a graph of the most frequent words.

https://magzupao.shinyapps.io/proyectmagz/