Final Project - Data Science.
Marco Aurelio Guado Zavaleta
Sep. 2017
OBJECTIVE
Create a model that predicts the most likely next word, based on the text data stored in the system.
Packages:
library("tm"): text mining framework, used to build and clean the corpus.
library("SnowballC"): reduces words to their root (stemming).
library("wordcloud"): plots words as a word cloud.
library("RColorBrewer"): provides the color palettes for the word cloud.
library("dplyr"): manipulates and operates on data frames.
library("tidytext"): text mining on tidy data frames.
We will use the following workflow:
DATA COLLECTED ->
(tm)
CLEAN DATASET -> EXPLORATORY ->
(SnowballC, wordcloud, RColorBrewer)
MODEL & ALGORITHMS -> DATA PRODUCT
(dplyr, tidytext)
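As an illustration of the CLEAN DATASET step, here is a minimal sketch using tm. The sample lines and the exact set of cleaning steps are assumptions for illustration, not necessarily the ones used in the application.

```r
library(tm)

# Hypothetical sample lines standing in for the collected data
raw_text <- c("The quick brown fox JUMPS over the lazy dog!",
              "I have 2 dogs, and they are running fast.")

# Build an in-memory corpus from the character vector
corpus <- VCorpus(VectorSource(raw_text))

# Typical cleaning steps (assumed): lower-case, drop punctuation and digits,
# then collapse the extra whitespace left behind
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Back to a plain character vector for the next stages
clean_text <- unname(sapply(corpus, as.character))
print(clean_text)
```

Stemming with SnowballC could be added through tm_map(corpus, stemDocument), although for next-word prediction it may be preferable to keep the full word forms.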
Because the application runs in a free hosting environment, loading and processing the data takes several minutes.
When the application loads, it shows a table with the most frequent words and the word that follows each of them.
If we enter a word, a response appears on the right side. It contains the 5 most likely next words, ordered by frequency from highest to lowest.
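One way to obtain such a ranking is to count word pairs (bigrams) and return the most frequent followers of the entered word. The sketch below uses dplyr and tidytext on a toy corpus. The data, the function name predict_next and its parameters are hypothetical, and the actual application may use a different model.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy corpus; the real application uses its own stored text
docs <- data.frame(text = c("i love data science",
                            "i love text mining",
                            "i love data",
                            "i like data"),
                   stringsAsFactors = FALSE)

# Split the text into bigrams, separate the two words, and count each pair
bigram_counts <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  mutate(word1 = sub(" .*$", "", bigram),
         word2 = sub("^.* ", "", bigram)) %>%
  count(word1, word2, sort = TRUE)

# Return up to k followers of input_word, most frequent first
predict_next <- function(input_word, counts, k = 5) {
  counts %>%
    filter(word1 == tolower(input_word)) %>%
    arrange(desc(n)) %>%
    head(k) %>%
    pull(word2)
}

print(predict_next("love", bigram_counts))
```

In this toy corpus "love" is followed twice by "data" and once by "text", so "data" is ranked first. A word that never appears in the corpus returns an empty result.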
The application also shows a graph of the most frequent words.
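A graph of this kind can be produced with wordcloud and RColorBrewer from a word frequency table. The sketch below is an assumption about how it could be done: the sample text and the "Dark2" palette are chosen only for illustration.

```r
library(dplyr)
library(tidytext)
library(wordcloud)
library(RColorBrewer)

# Hypothetical sample text standing in for the cleaned dataset
docs <- data.frame(text = c("data science is fun",
                            "data mining is data work"),
                   stringsAsFactors = FALSE)

# One row per word, counted and sorted from most to least frequent
word_freq <- docs %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

# Draw the cloud: larger words are more frequent
wordcloud(words = word_freq$word,
          freq = word_freq$n,
          min.freq = 1,
          random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```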