There are numerous sources of text available in digital format. We can use these sources to analyze language, an activity often called text mining. In this project we want to use digital text sources to predict the next word in a sentence given the previous words. To create a model, we first need to explore a large amount of text and find which patterns of words occur most frequently in normal language.
Imagine that you were able to memorize everything you have ever read. You would then be able to tell which combinations of words are most common, and if someone gave you two words you could tell which third word usually follows. We can do basically the same with a computer. However, just like human memory, computer memory is limited, so we need a way to store combinations of words efficiently and decide what to keep in order to make the recommendation as fast as possible.
We were provided with three large files containing lines of text extracted from Twitter, blogs, and news. The goal of this project is to use the contents of these files to create a model that is able to complete sentences introduced by a user.
We have three sources of data, containing text from Twitter, blog entries, and news articles. We can easily read them into R with the command readLines (most code details are omitted from this report to keep it simple for all readers; a minimal sketch of the reading step is shown after the examples below).
An example line from each file is:
twitter.example
## [1] "Who's going to the OVW show tomorrow? your team dominoski will be in the house! First brew after the show is on me!"
blogs.example
## [1] "The terms of 'post-authenticity' or 'inauthenticity' are misleading labels for the change in sensibility and attitudes implying a more reflexive attitude to authenticity. They lead the focus from important elements of this change, not least the weakened fundament of the American-English hegemony in popular music. For decades the insistence on rock authenticity was coupled with an understanding of culture as national in constructing centre-periphery relations in the rock world. Urban American could fake rural accents and Mick Jagger could fake cockney, but foreign accents could not be accepted as authentic. Singing in English called for mockery in the home countries and a low place in the international hierarchy."
news.example
## [1] "NEWARK A Newark woman accused of animal cruelty in a pit bull abuse case tied the dog to a railing and left New Jersey for more than a week, according to the Essex County Prosecutors Office."
We can explore the number of lines and words in each file:
## file lines words
## 1: twitter 2360148 30373543
## 2: blogs 899288 37334131
## 3: news 1010242 34372529
The blogs file contains the fewest lines but the largest number of words. We can also compute the average number of words per line:
## file words.per.line
## 1: twitter 12.87
## 2: blogs 41.52
## 3: news 34.02
Blogs have the highest number of words per line and twitter the lowest. This makes sense, considering that Twitter imposes a strict character limit on tweets.
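A minimal sketch of how these counts could be obtained, assuming the character vectors twitter, blogs and news from the reading step above:

library(data.table)
count.words <- function(lines) sum(lengths(strsplit(lines, "\\s+")))   # total words in a vector of lines
stats <- data.table(file  = c("twitter", "blogs", "news"),
                    lines = c(length(twitter), length(blogs), length(news)),
                    words = c(count.words(twitter), count.words(blogs), count.words(news)))
stats[, words.per.line := round(words / lines, 2)]                     # average words per line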
At this stage we realized that our home computer did not have enough power to handle such a large number of lines. We decided to create ten samples of each file (30 files in total), where each sample file contains ten percent of the lines of the original file (a sketch of the sampling step is shown below). We will base our models on these sample files. A summary of the sample files is shown here:
## file sample words words.per.line
## 1: twitter 1 3034930 12.86
## 2: twitter 2 3038078 12.87
## 3: twitter 3 3037159 12.87
## 4: twitter 4 3041104 12.89
## 5: twitter 5 3035640 12.86
## 6: twitter 6 3039106 12.88
## 7: twitter 7 3037967 12.87
## 8: twitter 8 3037314 12.87
## 9: twitter 9 3036104 12.86
## 10: twitter 10 3041539 12.89
## 11: news 1 3436686 34.02
## 12: news 2 3429321 33.95
## 13: news 3 3425487 33.91
## 14: news 4 3445228 34.10
## 15: news 5 3439449 34.05
## 16: news 6 3444560 34.10
## 17: news 7 3438897 34.04
## 18: news 8 3446671 34.12
## 19: news 9 3434411 34.00
## 20: news 10 3446044 34.11
## 21: blogs 1 3732377 41.50
## 22: blogs 2 3731502 41.49
## 23: blogs 3 3738058 41.57
## 24: blogs 4 3720884 41.38
## 25: blogs 5 3752653 41.73
## 26: blogs 6 3726963 41.44
## 27: blogs 7 3747652 41.67
## 28: blogs 8 3762023 41.83
## 29: blogs 9 3745678 41.65
## 30: blogs 10 3735838 41.54
## file sample words words.per.line
The long-term objective is to use these sample files to generate multiple small models that complement each other.
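One simple way of producing the sample files is sketched below, under the assumption that each sample is an independent random draw of ten percent of the lines (the output file names are an assumption):

set.seed(1234)                                       # make the samples reproducible
write.samples <- function(lines, prefix, n.samples = 10, fraction = 0.1) {
  for (i in seq_len(n.samples)) {
    sampled <- sample(lines, size = round(fraction * length(lines)))
    writeLines(sampled, paste0(prefix, ".sample", i, ".txt"))
  }
}
write.samples(twitter, "twitter")
write.samples(blogs, "blogs")
write.samples(news, "news")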
In each file there is a large number of characters that will make our task more difficult. For example, dashes, emojis, etc. cannot be included in our model. We will limit our model to words, with no symbols or numbers. Therefore, we must clean the data. As a first attempt, we decided to clean the data in the following way (a code sketch of these steps is shown after the list):
Separate sentences into different lines: we consider the characters . ! ( ) [ ] { } as sentence separators.
Replace profanity with appropriate safe words: we found a list of common profanity words on the internet and replaced them with harmless words.
Lowercase: we don't want to take into account any difference between lowercase and uppercase letters.
Remove other characters that are not letters: any other punctuation symbol or strange character will be removed.
Remove extra white space.
Remove numbers: as a starting point we won't predict numbers; it is possible to include them in the future by replacing numbers with a token such as 'NUMBERAMOUNT' and dates with 'DATETOKEN'.
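A minimal sketch of these cleaning steps using base R; the profanity list and the replacement word are assumptions:

clean.lines <- function(lines, profanity = character(0)) {
  x <- unlist(strsplit(lines, "[].!(){}[]+"))   # split into sentences at . ! ( ) [ ] { }
  x <- tolower(x)                               # lowercase everything
  for (bad in profanity)                        # replace profanity with a harmless placeholder
    x <- gsub(bad, "stuff", x, fixed = TRUE)
  x <- gsub("[0-9]+", " ", x)                   # remove numbers
  x <- gsub("[^a-z' ]", " ", x)                 # keep only letters, apostrophes and spaces
  x <- gsub("\\s+", " ", x)                     # collapse extra white space
  trimws(x)
}
cleaned.sample <- clean.lines(readLines("twitter.sample1.txt"))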
After cleaning the data we can do a simple exploration of our documents. Again, because of our limited computational power, we present the exploratory analysis for a portion of the first sample file set (sample 1 for twitter, blogs and news).
The first exploration we can do is creating a document-term matrix. This matrix is simply a list of all words that appear in the given documents and the number of times they appear.
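This matrix could be built, for instance, with the tm package. A minimal sketch, assuming the cleaned lines from the previous step are stored in cleaned.sample:

library(tm)
corpus <- VCorpus(VectorSource(cleaned.sample))          # one document per cleaned line
dtm <- DocumentTermMatrix(corpus)                        # rows: documents, columns: words
freq <- sort(slam::col_sums(dtm), decreasing = TRUE)     # total count of every word
head(freq, 3)                                            # the most frequent words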
The total number of different words found is:
dtm$ncol
## [1] 88663
For example, we can explore which are the most common words. The following plot shows that the most common word is 'the', followed by 'and' and 'for'. This is easy to understand because these are stop words, which are very common in English.
Another fun way to visualise the most common words is by using a word cloud. Here, the most common words appear larger in the image.
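The word cloud can be generated, for example, with the wordcloud package, reusing the frequency vector freq from the sketch above:

library(wordcloud)
library(RColorBrewer)
wordcloud(names(freq), freq, max.words = 100, colors = brewer.pal(8, "Dark2"))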
Another interesting feature to explore is what the less common words look like. The following plot shows the counts for the less frequent terms. There are around 50000 words that appear only once in our texts and 1000 that appear only twice.
From the previous plot we come to an important conclusion: 69% of all the distinct words appear only once or twice in the texts. This means that we can significantly reduce the size of our model by eliminating entries that appear very rarely.
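The figures behind this conclusion can be checked directly on the frequency vector, which also shows how the pruning could be done:

singletons <- sum(freq == 1)                 # words appearing exactly once
doubletons <- sum(freq == 2)                 # words appearing exactly twice
(singletons + doubletons) / length(freq)     # fraction of the vocabulary that is rare
freq.reduced <- freq[freq > 2]               # drop rare words to shrink the model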
We need to define a way of using the knowledge we gained from the data to complete sentences. Our plan is the following:
Create frequency tables of word combinations. That is, just as we counted single words in the document-term matrix, we can count the occurrences of each word combination in the text. We can do this for combinations of two, three or four words; these combinations are called 2-grams, 3-grams and 4-grams.
Once we have frequency tables for word combinations, we can transform them into probabilities. However, we expect that not all possible combinations appear in the text, so we will assign a very small probability to unseen combinations. A sketch of how such a table could be built is shown below.
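The sketch below builds a frequency table for 3-grams, with the probabilities conditioned on the first two words (all object names are assumptions):

library(data.table)
count.ngrams <- function(lines, n) {
  tokens <- strsplit(lines, "\\s+")                      # words of each sentence
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  data.table(ngram = grams)[, .(count = .N), by = ngram][order(-count)]
}
gram3 <- count.ngrams(cleaned.sample, 3)
gram3[, prefix := sub("\\s+\\S+$", "", ngram)]           # first two words
gram3[, next.word := sub("^.*\\s", "", ngram)]           # last word
gram3[, prob := count / sum(count), by = prefix]         # P(next word | first two words)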
Our model will take three words and try to find the fourth word as follows (we use the input 'I am gonna' as an example; a code sketch of this procedure follows the list):
In the 4-gram table, find the probabilities of all 'I am gonna XXX' sequences and multiply them by a factor (0.5).
In the 3-gram table, find the probabilities of all 'am gonna XXX' sequences and multiply them by a factor (0.3).
In the 2-gram table, find the probabilities of all 'gonna XXX' sequences and multiply them by a factor (0.15).
Multiply the probabilities in the original table (1-grams) by a factor (0.05). This way, if all other probabilities are zero, we simply predict the most common word.
For each candidate word XXX, add the weighted probabilities from steps 1-4 and predict the word XXX with the highest final value.
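A sketch of this weighted back-off lookup, assuming tables gram4, gram3 and gram2 built as above (columns prefix, next.word and prob) and a 1-gram table gram1 with columns next.word and prob:

predict.next <- function(input, gram4, gram3, gram2, gram1,
                         weights = c(0.5, 0.3, 0.15, 0.05)) {
  words <- tail(strsplit(tolower(input), "\\s+")[[1]], 3)   # last three words of the input
  scores <- rbind(
    gram4[prefix == paste(words, collapse = " "), .(next.word, s = weights[1] * prob)],
    gram3[prefix == paste(tail(words, 2), collapse = " "), .(next.word, s = weights[2] * prob)],
    gram2[prefix == tail(words, 1), .(next.word, s = weights[3] * prob)],
    gram1[, .(next.word, s = weights[4] * prob)])
  scores[, .(score = sum(s)), by = next.word][order(-score)][1, next.word]
}
predict.next("I am gonna", gram4, gram3, gram2, gram1)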
We will deploy our model in an app using shiny. The model should load and run fast on the server.
The app will have the following characteristics:
A text input where the user can enter a sentence.
A predict button to start the prediction model.
A simple text output containing the most probable word that completes the input sentence.
For more curious users, we will provide a word cloud with the most probable outputs, as well as information about the results of each individual model.
Importantly, before running the model the input has to be pre-processed in the same way as described above. A minimal sketch of such an app is shown below.
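A minimal sketch of such a shiny app, reusing the (assumed) clean.lines and predict.next functions from the sketches above:

library(shiny)
ui <- fluidPage(
  textInput("sentence", "Type a sentence:"),
  actionButton("go", "Predict"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    input$go                                     # re-run only when the button is pressed
    isolate({
      cleaned <- clean.lines(input$sentence)     # same pre-processing as the training data
      predict.next(tail(cleaned, 1), gram4, gram3, gram2, gram1)
    })
  })
}
shinyApp(ui, server)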