There are numerous sources of text available in digital format. We can use these sources to analyze language, an activity often called text mining. In this project we want to use digital text sources to predict the next word in a sentence given the previous words. To create a model, we first need to explore a large amount of text and find which patterns of words occur most frequently in normal language.
Imagine that you were able to memorize everything you have ever read. You would then be able to tell which combinations of words are most common, and if someone gave you two words you could tell which third word usually follows. We can do basically the same with a computer. However, just like human memory, computer memory is limited, so we need a way to store combinations of words efficiently and decide what to keep in order to make the recommendation as fast as possible.
We were provided with three large files containing lines of text extracted from Twitter, blogs, and news. The goal of this project is to use the contents of these files to create a model that is able to complete sentences introduced by a user.
We have three sources of data, containing text from Twitter, blog entries, and news articles. We can easily read them into R with the command readLines (most code details are omitted from this report to keep it simple for all readers; a minimal sketch of the reading step is shown after the examples below).
An example line from each file is:
twitter.example
## [1] "Who's going to the OVW show tomorrow? your team dominoski will be in the house! First brew after the show is on me!"
blogs.example
## [1] "The terms of 'post-authenticity' or 'inauthenticity' are misleading labels for the change in sensibility and attitudes implying a more reflexive attitude to authenticity. They lead the focus from important elements of this change, not least the weakened fundament of the American-English hegemony in popular music. For decades the insistence on rock authenticity was coupled with an understanding of culture as national in constructing centre-periphery relations in the rock world. Urban American could fake rural accents and Mick Jagger could fake cockney, but foreign accents could not be accepted as authentic. Singing in English called for mockery in the home countries and a low place in the international hierarchy."
news.example
## [1] "NEWARK A Newark woman accused of animal cruelty in a pit bull abuse case tied the dog to a railing and left New Jersey for more than a week, according to the Essex County Prosecutors Office."
We can explore the number of lines and words in each file:
## file lines words
## 1: twitter 2360148 30373543
## 2: blogs 899288 37334131
## 3: news 1010242 34372529
The blogs file contains the fewest lines but the largest number of words. We can also compute the average number of words per line:
## file words.per.line
## 1: twitter 12.87
## 2: blogs 41.52
## 3: news 34.02
Blogs have the highest number of words per line and twitter the lowest. This makes sense, considering that Twitter imposes a strict character limit on tweets.
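A minimal sketch of how these counts could be obtained, assuming the character vectors twitter, blogs and news from the reading step above:

library(data.table)
count.words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))   # total words in a vector of lines
stats <- data.table(file  = c("twitter", "blogs", "news"),
                    lines = c(length(twitter), length(blogs), length(news)),
                    words = c(count.words(twitter), count.words(blogs), count.words(news)))
stats[, words.per.line := round(words / lines, 2)]                     # average words per line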
At this stage we realized that our home computer did not have enough power to handle such a large number of lines. We decided to create ten samples of each file (30 files in total), where each sample file contains ten percent of the lines of the original file (a sketch of the sampling step is shown below). We will base our models on these sample files. A summary of the sample files is shown here:
## file sample words words.per.line
## 1: twitter 1 3034930 12.86
## 2: twitter 2 3038078 12.87
## 3: twitter 3 3037159 12.87
## 4: twitter 4 3041104 12.89
## 5: twitter 5 3035640 12.86
## 6: twitter 6 3039106 12.88
## 7: twitter 7 3037967 12.87
## 8: twitter 8 3037314 12.87
## 9: twitter 9 3036104 12.86
## 10: twitter 10 3041539 12.89
## 11: news 1 3436686 34.02
## 12: news 2 3429321 33.95
## 13: news 3 3425487 33.91
## 14: news 4 3445228 34.10
## 15: news 5 3439449 34.05
## 16: news 6 3444560 34.10
## 17: news 7 3438897 34.04
## 18: news 8 3446671 34.12
## 19: news 9 3434411 34.00
## 20: news 10 3446044 34.11
## 21: blogs 1 3732377 41.50
## 22: blogs 2 3731502 41.49
## 23: blogs 3 3738058 41.57
## 24: blogs 4 3720884 41.38
## 25: blogs 5 3752653 41.73
## 26: blogs 6 3726963 41.44
## 27: blogs 7 3747652 41.67
## 28: blogs 8 3762023 41.83
## 29: blogs 9 3745678 41.65
## 30: blogs 10 3735838 41.54
## file sample words words.per.line
The long-term objective is to use these sample files to generate multiple small models that complement each other.
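One simple way of producing the sample files is sketched below, under the assumption that each sample is an independent random draw of ten percent of the lines (the output file names are an assumption):

set.seed(1234)                                       # make the samples reproducible
write.samples <- function(lines, prefix, n.samples = 10, fraction = 0.1) {
  for (i in seq_len(n.samples)) {
    sampled <- sample(lines, size = round(fraction * length(lines)))
    writeLines(sampled, paste0(prefix, ".sample", i, ".txt"))
  }
}
write.samples(twitter, "twitter")
write.samples(blogs, "blogs")
write.samples(news, "news")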
In each file there is a large number of characters that will make our task more difficult. For example, dashes, emojis, etc. cannot be included in our model. We will limit our model to words, with no symbols or numbers. Therefore, we must clean the data. As a first attempt, we decided to clean the data in the following way (a code sketch of these steps is shown after the list):
Separate sentences into different lines: we consider the characters . ! ( ) [ ] { } as sentence separators.
Replace profanity with appropriate safe words: we found a list of common profanity words on the internet and replaced them with harmless words.
Lowercase: we don't want to take into account any difference between lowercase and uppercase letters.
Remove other characters that are not letters: any other punctuation symbol or strange character will be removed.
Remove extra white space.
Remove numbers: as a starting point we won't predict numbers; it is possible to include them in the future by replacing numbers with a token such as 'NUMBERAMOUNT' and dates with 'DATETOKEN'.
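A minimal sketch of these cleaning steps using base R; the profanity list and the replacement word are assumptions:

clean.lines <- function(lines, profanity = character(0)) {
  x <- unlist(strsplit(lines, "[].!(){}[]+"))   # split into sentences at . ! ( ) [ ] { }
  x <- tolower(x)                               # lowercase everything
  for (bad in profanity)                        # replace profanity with a harmless placeholder
    x <- gsub(bad, "stuff", x, fixed = TRUE)
  x <- gsub("[0-9]+", " ", x)                   # remove numbers
  x <- gsub("[^a-z' ]", " ", x)                 # keep only letters, apostrophes and spaces
  x <- gsub("\\s+", " ", x)                     # collapse extra white space
  trimws(x)
}
cleaned.sample <- clean.lines(readLines("twitter.sample1.txt"))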
After cleaning the data we can do a simple exploration of our documents. Again, because of our limited computational power, we present the exploratory analysis for a portion of the first sample file set (sample 1 for twitter, blogs and news).
The first exploration we can do is creating a document-term matrix. This matrix is simply a list of all words that appear in the given documents and the number of times they appear.
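This matrix could be built, for instance, with the tm package. A minimal sketch, assuming the cleaned lines from the previous step are stored in cleaned.sample:

library(tm)
corpus <- VCorpus(VectorSource(cleaned.sample))          # one document per cleaned line
dtm <- DocumentTermMatrix(corpus)                        # rows: documents, columns: words
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)     # total count of every word
head(freq, 3)                                            # the most frequent words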
The total number of different words found is:
dtm$ncol
## [1] 88663
For example, we can explore which are the most common words. The following plot shows that the most common word is 'the', followed by 'and' and 'for'. This is easy to understand because these are stop words, which are very common in English.
Another fun way to visualise the most common words is by using a word cloud. Here, the most common words appear larger in the image.
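The word cloud can be generated, for example, with the wordcloud package, reusing the frequency vector freq from the sketch above:

library(wordcloud)
library(RColorBrewer)
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(8, "Dark2"))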
Another interesting feature to explore is what the less common words look like. The following plot shows the counts for the less frequent terms. There are around 50000 words that appear only once in our texts and 1000 that appear only twice.
From the previous plot we come to an important conclusion: 69% of all the distinct words appear only once or twice in the texts. This means that we can significantly reduce the size of our model by eliminating entries that appear very rarely.
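The figures behind this conclusion can be checked directly on the frequency vector, which also shows how the pruning could be done:

singletons <- sum(freq == 1)                 # words appearing exactly once
doubletons <- sum(freq == 2)                 # words appearing exactly twice
(singletons + doubletons) / length(freq)     # fraction of the vocabulary that is rare
freq.reduced <- freq[freq > 2]               # drop rare words to shrink the model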
We need to define a way of using the knowledge we gained from the data to complete sentences. Our plan is the following:
Create frequency tables of word combinations. That is, just as we counted single words in the document-term matrix, we can count the occurrences of each word combination in the text. We can do this for combinations of two, three or four words; these combinations are called 2-grams, 3-grams and 4-grams.
Once we have frequency tables for word combinations, we can transform them into probabilities. However, we expect that not all possible combinations appear in the text, so we will assign a very small probability to unseen combinations. A sketch of how such a table could be built is shown below.
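The sketch below builds a frequency table for 3-grams, with the probabilities conditioned on the first two words (all object names are assumptions):

library(data.table)
count.ngrams <- function(lines, n) {
  tokens <- strsplit(lines, "\\s+")                      # words of each sentence
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  data.table(ngram = grams)[, .(count = .N), by = ngram][order(-count)]
}
gram3 <- count.ngrams(cleaned.sample, 3)
gram3[, prefix := sub("\\s+\\S+$", "", ngram)]           # first two words
gram3[, next.word := sub("^.*\\s", "", ngram)]           # last word
gram3[, prob := count / sum(count), by = prefix]         # P(next word | first two words)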
Our model will take three words and try to find the fourth word as follows (we use the input 'I am gonna' as an example; a code sketch of this procedure follows the list):
In the 4-gram table, find the probabilities of all 'I am gonna XXX' sequences and multiply them by a factor (0.5).
In the 3-gram table, find the probabilities of all 'am gonna XXX' sequences and multiply them by a factor (0.3).
In the 2-gram table, find the probabilities of all 'gonna XXX' sequences and multiply them by a factor (0.15).
Multiply the probabilities in the original table (1-grams) by a factor (0.05). This way, if all other probabilities are zero, we simply predict the most common word.
For each candidate word XXX, add the weighted probabilities from steps 1-4 and predict the word XXX with the highest final value.
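A sketch of this weighted back-off lookup, assuming tables gram4, gram3 and gram2 built as above (columns prefix, next.word and prob) and a 1-gram table gram1 with columns next.word and prob:

predict.next <- function(input, gram4, gram3, gram2, gram1,
                         weights = c(0.5, 0.3, 0.15, 0.05)) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 3)   # last three words of the input
  scores <- rbind(
    gram4[prefix == paste(words, collapse = " "), .(next.word, s = weights[1] * prob)],
    gram3[prefix == paste(tail(words, 2), collapse = " "), .(next.word, s = weights[2] * prob)],
    gram2[prefix == tail(words, 1), .(next.word, s = weights[3] * prob)],
    gram1[, .(next.word, s = weights[4] * prob)])
  scores[, .(score = sum(s)), by = next.word][order(-score)][1, next.word]
}
predict.next("I am gonna", gram4, gram3, gram2, gram1)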
We will deploy our model in an app using shiny. The model should load and run fast on the server.
The app will have the following characteristics:
A text input where the user can enter a sentence.
A predict button to start the prediction model.
A simple text output containing the most probable word that completes the input sentence.
For more curious users, we will provide a word cloud with the most probable outputs, as well as information about the results of each individual model.
Importantly, before running the model the input has to be pre-processed in the same way as described above. A minimal sketch of such an app is shown below.
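A minimal sketch of such a shiny app, reusing the (assumed) clean.lines and predict.next functions from the sketches above:

library(shiny)
ui <- fluidPage(
  textInput("sentence", "Type a sentence:"),
  actionButton("go", "Predict"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    input$go                                     # re-run only when the button is pressed
    isolate({
      cleaned <- clean.lines(input$sentence)     # same pre-processing as the training data
      predict.next(tail(cleaned, 1), gram4, gram3, gram2, gram1)
    })
  })
}
shinyApp(ui, server)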