Jinkwan Hong
Saturday, Feb 30, 2019
This report was prepared as part of the Data Science Capstone. The final goal is to create a word prediction algorithm and a Shiny app that makes it easy for the public to use.
In this document, I present summaries of the data to get a sense of its profile.
The summary plots show about 800,000 lines in blogs, 1 million lines in news, and 2 million lines in Twitter. The word counts, however, run the opposite way: 37 million in blogs, 34 million in news, and 30 million in Twitter. This is probably because Twitter limits each tweet to 140 characters.
The docsOrg corpus is 1.5 gigabytes, which is quite large to work with, so I am going to take a random sample and analyze that instead.
Here I randomly sample 1% of the data for exploratory analysis. I write the samples to files so that I do not have to repeat the same process, and I drop the original data from memory to avoid unnecessary computation.
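The sampling step can be sketched roughly as follows. This is a minimal illustration in base R, not the report's original code: the helper `sample_lines`, the seed, the toy input, and the output file name are all assumptions made for the example.

```r
# Hypothetical helper: keep a random fraction of the lines.
sample_lines <- function(lines, fraction = 0.01, seed = 1234) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  n_keep <- max(1, round(length(lines) * fraction))
  # sort() keeps the sampled lines in their original order
  lines[sort(sample.int(length(lines), n_keep))]
}

# Toy stand-in for one of the source files (the real data are not shown here).
toy_lines <- paste("line", 1:1000)
sampled <- sample_lines(toy_lines)

# Write the sample to disk so later steps can start from the file.
out_file <- file.path(tempdir(), "sample_toy.txt")
writeLines(sampled, out_file)

# Drop the original data from memory once the sample is saved.
rm(toy_lines)

# A later session can reload the sample instead of the full data.
reloaded <- readLines(out_file)
```

Fixing the seed is a design choice for reproducibility; with a different seed the sample, and therefore the counts reported later, would differ slightly.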
After sampling and cleanup, the word count for the whole text went down to around 450,000.
There are many irregularities in the data because it comes from different sources. Here I am removing them using the tm package.
The corpus is now ready for analysis.
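The exact items removed are not listed in this report, so the following is only a sketch of what a typical tm cleanup might look like. The choice of steps (URLs, case, punctuation, numbers, extra whitespace), the helper `remove_urls`, and the toy documents are assumptions for illustration.

```r
library(tm)

# Toy documents standing in for the sampled text.
docs <- c("Visit http://example.com NOW!!!", "I have 2 cats   and 3 dogs.")
corpus <- VCorpus(VectorSource(docs))

# Hypothetical cleanup steps; the report's actual list may differ.
remove_urls <- content_transformer(function(x) gsub("http[^[:space:]]*", "", x))
corpus <- tm_map(corpus, remove_urls)                   # strip URLs
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse repeated spaces

# Extract the cleaned text as a plain character vector.
cleaned <- vapply(corpus, as.character, character(1))
```

URLs are removed before punctuation here so that the pattern can still match the intact address.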
There are a total of 536,125 words in the corpus, of which 23,234 are distinct. 20,497 of them are used more than 2 times.
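The three figures above (total words, distinct words, and words used more than 2 times) can be computed along these lines. This is a base R sketch on toy text, assuming simple whitespace tokenization; the variable names are illustrative and the report's own counting code may differ.

```r
# Toy text standing in for the cleaned corpus.
text <- c("the cat sat on the mat", "the dog sat on the log", "the end")

# Split on whitespace and flatten into one vector of words.
words <- unlist(strsplit(text, "[[:space:]]+"))
words <- words[nzchar(words)]  # guard against empty strings

freq <- table(words)             # occurrences of each distinct word
total_words    <- length(words)  # all word tokens
distinct_words <- length(freq)   # unique words
frequent_words <- sum(freq > 2)  # words used more than 2 times
```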
I removed the sparse terms to reduce computation.
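One way to drop sparse terms with tm is `removeSparseTerms` on a document-term matrix, sketched below. The toy documents and the 0.6 threshold are assumptions; the threshold used in this report is not stated.

```r
library(tm)

# Toy documents: "apple" is in all four, "banana" in two, the rest in one.
docs <- c("apple banana", "apple cherry", "apple banana date", "apple")
corpus <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(corpus)

# Keep only terms whose sparsity (share of documents without the term)
# is below 0.6. The threshold is an assumption for this example.
dtm_small <- removeSparseTerms(dtm, sparse = 0.6)

kept_terms <- Terms(dtm_small)
```

A lower `sparse` value removes more terms, which shrinks the matrix further at the cost of discarding rarer words.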