The purpose of this project is to create an application that predicts the next word from text the user has typed. For example, if the user types “Hello how are you”, the predicted word might be “today”. There are many applications for such a data product, especially on mobile devices, where typing is slow and error-prone.
We are going to use three sources of data for this product: blogs, news, and Twitter.
These data can be downloaded from:
(https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip)
The following text describes the first steps toward creating such a product.
After downloading the data, we perform a basic analysis to get a first idea of the file sizes (in megabytes) and the number of lines in each file.
##             Size   Lines
## Blog    200.4242  899288
## News    196.2775 1010242
## Twitter 159.3641 2360148
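A summary like the one above could be produced along the following lines. This is a minimal Python sketch, not the code used for this report; the function name `summarize_file` is hypothetical, and the size is reported in megabytes (bytes divided by 1024²).

```python
import os


def summarize_file(path):
    """Return (size in MB, number of lines) for a text file."""
    size_mb = os.path.getsize(path) / 1024 ** 2  # bytes -> megabytes
    with open(path, "rb") as handle:
        # Binary mode avoids decoding errors on unusual characters.
        n_lines = sum(1 for _ in handle)
    return size_mb, n_lines
```

Calling `summarize_file` once per data file and collecting the results gives one row per source.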
Only English documents will be considered for this product. Looking at random lines from each document, we see that the Twitter data contain many non-English characters, which will be removed at a later stage.
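One simple way to remove such characters is sketched below in Python. It assumes that "non-English" can be approximated by "non-ASCII", which is an assumption of this sketch rather than something the report specifies; the helper name `strip_non_ascii` is hypothetical.

```python
def strip_non_ascii(line):
    """Drop every character outside the ASCII range from a line of text."""
    # Characters that cannot be encoded as ASCII are silently discarded.
    return line.encode("ascii", errors="ignore").decode("ascii")
```

This approach is blunt: it also drops accented letters and typographic quotes that can occur in otherwise English text, so a more selective filter may be preferable.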
At this stage the documents will be converted to a corpus, and all the necessary transformations will be performed. Stop words will not be removed, since they play a major role in prediction. We are going to apply the following transformations.
We are also going to work with a small sample of the data, since the computations take a long time to complete and a sample is adequate at this stage.
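Drawing such a sample could look like the Python sketch below. The report does not state the sample size or the sampling method, so the `fraction` parameter, the fixed seed, and the function name `sample_lines` are all assumptions made for illustration.

```python
import random


def sample_lines(lines, fraction, seed=42):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]
```

Sampling each line independently keeps the original line order and avoids loading a second full copy of the data, at the cost of a sample size that varies slightly around `fraction` times the number of lines.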
After that, we create a term-document matrix. The first 10 rows of the resulting word-frequency table are as follows:
##    word  freq
## 1   the 44168
## 2   and 23008
## 3  that  9488
## 4   for  9126
## 5  with  6703
## 6   you  6479
## 7   was  6010
## 8  this  4694
## 9  have  4470
## 10  but  4384
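A frequency table of this kind can be approximated with a plain word count, as in the Python sketch below. It is not the code behind the table above: the tokenization rule (runs of letters and apostrophes after lowercasing) and the function name `top_words` are assumptions.

```python
import re
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent lowercase words with their counts."""
    counts = Counter()
    for line in lines:
        # A word here is a run of letters or apostrophes (assumed tokenization).
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts.most_common(n)
```

Unlike the table above, which lists no words shorter than three letters, this sketch counts words of any length.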
The following chart displays the 20 most frequent words for our sampled data:
In the picture, the most common words appear larger.
Creating the Shiny app involves a number of requirements, as follows: