SwiftKey Dataset from Twitter, Blogs and News

The source code can be found in Milestone.Rmd. The code is not displayed here to keep the report concise.

File Sizes of the Raw and Sampled Datasets

The raw datasets were sampled using a binomial distribution with the seed set to 1. Only 1% of each dataset was used for this project, for the sake of efficiency.

Raw datasets (file sizes in MB):

##    source filesizemb totallines totalwords
## 1    blog   200.4242     899288   37546239
## 2    news   196.2775    1010242   34762395
## 3 twitter   159.3641    2360148   30093413

Sampled datasets, 1% of each (sizes in MB):

##       source filesizemb totallines totalwords
## 1    blogsam   2.000921       8924     377106
## 2    newssam   1.942724       9969     345028
## 3 twittersam   1.558938      23437     298417
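
The sampling code itself is not shown, but a minimal sketch of the approach described above might look as follows. The file paths and the use of readLines() are assumptions based on the standard SwiftKey dataset layout, not taken from Milestone.Rmd.

```r
# Sketch of the sampling step: keep each line with probability 0.01
# via a binomial draw, with the seed set to 1 as described above.
# File paths are assumed (standard SwiftKey layout), not confirmed.
set.seed(1)

sample_lines <- function(path, prob = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), size = 1, prob = prob) == 1]
}

blogsam    <- sample_lines("final/en_US/en_US.blogs.txt")
newssam    <- sample_lines("final/en_US/en_US.news.txt")
twittersam <- sample_lines("final/en_US/en_US.twitter.txt")
```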

Using quanteda to Form and Analyse the Corpus

The corpus was created by combining the sampled data generated above. Using quanteda, the text was tokenised with dfm(). The tokenisation was configured to remove (1) stopwords, (2) punctuation, (3) numbers, (4) separators, (5) symbols and (6) URLs.

Stopwords were removed because they appear very frequently and the exploration of the rest of the vocabulary was of greater interest. However, it may be useful to add them back at a later stage of the analysis, since word prediction would need to include stopwords too.
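
A sketch of the corpus creation and tokenisation step, assuming the quanteda 2.x interface reported in the session output below (in quanteda 2.x, dfm() passes tokenisation options such as remove_punct through to tokens()); the object names are illustrative:

```r
library(quanteda)

# Combine the three 1% samples into a single corpus
corp <- corpus(c(blogsam, newssam, twittersam))

# Tokenise and build the document-feature matrix, removing
# stopwords, punctuation, numbers, separators, symbols and URLs
mydfm <- dfm(corp,
             remove            = stopwords("en"),
             remove_punct      = TRUE,
             remove_numbers    = TRUE,
             remove_separators = TRUE,
             remove_symbols    = TRUE,
             remove_url        = TRUE)
```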

## Package version: 2.1.2

A summary of the first six documents in the corpus:

##    Text Types Tokens Sentences
## 1 text1    20     21         1
## 2 text2    44     52         1
## 3 text3    22     22         1
## 4 text4    56     78         7
## 5 text5     4      4         1
## 6 text6    78    101         6
## Document-feature matrix of: 6 documents, 58,904 features (100.0% sparse).
##        features
## docs    attend friend kendra's wedding joy watch people see end year's
##   text1      1      1        1       1   1     1      1   1   1      1
##   text2      0      0        0       0   0     0      0   0   0      0
##   text3      0      0        0       0   0     0      0   0   0      0
##   text4      0      1        0       0   0     0      0   0   0      0
##   text5      0      0        0       0   0     0      0   0   0      0
##   text6      0      0        0       0   0     0      0   0   0      0
## [ reached max_nfeat ... 58,894 more features ]

The most frequent features in the corpus:

##   feature frequency rank docfreq group
## 1    said      3131    1    2855   all
## 2    just      3069    2    2840   all
## 3     one      2815    3    2475   all
## 4    like      2730    4    2444   all
## 5     can      2460    5    2200   all
## 6     get      2287    6    2086   all

Plot of the Frequency of the Top 50 Words in the Corpus
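
The plotting code is not shown; a sketch using textstat_frequency() (part of quanteda in version 2.x; in quanteda 3 and later it lives in the quanteda.textstats package) and ggplot2 could look like this, where mydfm is the document-feature matrix built above:

```r
library(ggplot2)

# Rank the 50 most frequent features in the corpus
top50 <- textstat_frequency(mydfm, n = 50)

ggplot(top50, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Frequency",
       title = "Top 50 Words in the Corpus")
```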

Summary

Based on this analysis, stopwords might be added back to explore whether they improve the prediction of the next word. Given the observed word frequencies, further analysis can determine which words tend to follow this first set of frequently used words. Examining 2-gram features would also help estimate the frequency of the next word, improving the prediction model at a later stage.
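
As a pointer for that next step, a minimal 2-gram exploration with quanteda might look as follows (a sketch reusing the corpus object from above; stopwords are kept here since they matter for next-word prediction):

```r
# Tokenise without removing stopwords, then form 2-grams
toks  <- tokens(corp,
                remove_punct = TRUE, remove_numbers = TRUE,
                remove_symbols = TRUE, remove_url = TRUE)
toks2 <- tokens_ngrams(toks, n = 2)

# Most frequent 2-grams, e.g. to estimate the likely next word
head(textstat_frequency(dfm(toks2)))
```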