The source code can be found in Milestone.Rmd. The code is not displayed here in order to keep the report concise.
The raw datasets were sampled using a binomial distribution with the seed set to 1, and only 1% of each dataset was used in this project for efficiency's sake. The tables below summarise the raw files and the 1% samples, respectively; a minimal sketch of the sampling step follows them.
## source filesizemb totallines totalwords
## 1 blog 200.4242 899288 37546239
## 2 news 196.2775 1010242 34762395
## 3 twitter 159.3641 2360148 30093413
## source filesizemb totallines totalwords
## 1 blogsam 2.000921 8924 377106
## 2 newssam 1.942724 9969 345028
## 3 twittersam 1.558938 23437 298417
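For reference, the sketch below illustrates the 1% binomial sampling under some assumptions: the file paths and the helper name `sample_lines()` are invented for illustration, and the actual code is in Milestone.Rmd.

```r
# Sketch of the 1% binomial sampling (assumed file paths; actual code in Milestone.Rmd)
set.seed(1)

sample_lines <- function(path, prob = 0.01) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  keep  <- rbinom(length(lines), size = 1, prob = prob) == 1  # keep each line with 1% probability
  lines[keep]
}

blog_sample    <- sample_lines("final/en_US/en_US.blogs.txt")
news_sample    <- sample_lines("final/en_US/en_US.news.txt")
twitter_sample <- sample_lines("final/en_US/en_US.twitter.txt")
```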
A corpus was created by combining the sample data generated above. Using quanteda, the text was tokenised with dfm(). The tokenisation was configured to remove (1) stopwords, (2) punctuation, (3) numbers, (4) separators, (5) symbols and (6) URLs.
Stopwords were removed because they appear very frequently and the key interest here is exploring the rest of the vocabulary, so they are excluded from the tokens. However, it may be useful to add them back at a later stage of the analysis, since word prediction will involve stopwords too. A minimal sketch of the corpus and tokenisation step is shown below, followed by the console output.
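The sketch below assumes quanteda 2.x (the version shown in the output), where dfm() accepts tokenisation options directly, and takes the sampled character vectors from the previous step as inputs.

```r
library(quanteda)

# Combine the three 1% samples into a single corpus
corp <- corpus(c(blog_sample, news_sample, twitter_sample))

# Tokenise and build the document-feature matrix; in quanteda 2.x dfm()
# can remove stopwords, punctuation, numbers, separators, symbols and URLs
dfm_all <- dfm(corp,
               remove            = stopwords("english"),
               remove_punct      = TRUE,
               remove_numbers    = TRUE,
               remove_separators = TRUE,
               remove_symbols    = TRUE,
               remove_url        = TRUE)

head(summary(corp))  # types/tokens/sentences per document, as shown below
head(dfm_all)        # first documents and features of the dfm
```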
## Package version: 2.1.2
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following object is masked from 'package:utils':
##
## View
## Text Types Tokens Sentences
## 1 text1 20 21 1
## 2 text2 44 52 1
## 3 text3 22 22 1
## 4 text4 56 78 7
## 5 text5 4 4 1
## 6 text6 78 101 6
## Document-feature matrix of: 6 documents, 58,904 features (100.0% sparse).
## features
## docs attend friend kendra's wedding joy watch people see end year's
## text1 1 1 1 1 1 1 1 1 1 1
## text2 0 0 0 0 0 0 0 0 0 0
## text3 0 0 0 0 0 0 0 0 0 0
## text4 0 1 0 0 0 0 0 0 0 0
## text5 0 0 0 0 0 0 0 0 0 0
## text6 0 0 0 0 0 0 0 0 0 0
## [ reached max_nfeat ... 58,894 more features ]
## feature frequency rank docfreq group
## 1 said 3131 1 2855 all
## 2 just 3069 2 2840 all
## 3 one 2815 3 2475 all
## 4 like 2730 4 2444 all
## 5 can 2460 5 2200 all
## 6 get 2287 6 2086 all
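A frequency table like the one above can be produced with textstat_frequency(); a brief sketch follows (in quanteda 2.x this function is part of quanteda itself, while from version 3.0 it lives in quanteda.textstats).

```r
# Most frequent features in the combined dfm
freq <- textstat_frequency(dfm_all)
head(freq)                 # said, just, one, like, can, get ...
topfeatures(dfm_all, 10)   # quick named-vector view of the top 10 features
```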
Going forward, stopwords might be added back to explore whether they improve prediction of the words used. Based on the word frequencies above, further analysis can determine which words tend to follow this first set of frequently used words. Building a 2-gram feature would also help estimate the frequency of the next word, improving the prediction model at a later stage; a sketch of this idea follows.
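As an illustration of the planned 2-gram analysis, bigram frequencies could be built from the same corpus with tokens_ngrams(). This is an assumption about how it might be done rather than part of the current report, and stopwords are deliberately kept since the prediction model will need them.

```r
# Sketch: bigram ("2-gram") frequencies to inform next-word prediction
toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_numbers = TRUE,
               remove_symbols = TRUE,
               remove_url     = TRUE)

bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
dfm_bi  <- dfm(bigrams)
topfeatures(dfm_bi, 10)  # most frequent word pairs
```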