Data Science Capstone

Jinkwan Hong
Saturday, Feb 30, 2019

Milestone Report

Synopsis

This report was prepared as part of the Data Science Capstone. The final goal is to create a word prediction algorithm and a Shiny app that makes it easy for the public to use.

In this document, I present summaries of the data to give a picture of its profile.

Data Source

Getting and Cleaning the Data

Getting Data

## [1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"

Original Data Summary

The summary plots show about 800,000 lines in blogs, 1 million lines in news, and 2 million lines in Twitter. The word counts run in the opposite order: about 37 million words in blogs, 34 million in news, and 30 million in Twitter. This is probably because Twitter limited each tweet to 140 characters.

The docsOrg corpus is about 1.5 gigabytes, which is quite large to work with. I am going to draw a random sample and analyze that instead.
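The line and word counts above can be reproduced with a single streaming pass over each file. The following is a minimal Python sketch, not the code used for this report (which was written in R). The function names are hypothetical, and a "word" is assumed to be any whitespace-delimited token, so the totals may differ slightly from those in the plots.

```python
def summarize_text(lines):
    """Return (line_count, word_count) for an iterable of text lines."""
    line_count = 0
    word_count = 0
    for line in lines:
        line_count += 1
        # Assumption: a word is any whitespace-delimited token.
        word_count += len(line.split())
    return line_count, word_count


def summarize_file(path, encoding="utf-8"):
    """Stream a file line by line so it never has to fit in memory."""
    with open(path, encoding=encoding, errors="ignore") as handle:
        return summarize_text(handle)
```

Calling `summarize_file` once per source file gives the per-source figures. Dividing words by lines makes the contrast explicit: few long lines for blogs, many short lines for Twitter.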

Sampling

Here I randomly sample 1% of the data for exploratory analysis and drop the original data from memory. I then write the sampled text to files so that I do not have to repeat the same process and computation every time.

After sampling and cleanup, the word count for the whole text went down to around 450,000, and the line count dropped as well.
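The sampling step can be sketched as follows. This is an illustrative Python version under stated assumptions, not the report's R code: each line is kept independently with probability 0.01, and the seed value and function names are hypothetical. The fixed seed is only there so that the same sample can be drawn again.

```python
import random


def sample_lines(lines, fraction=0.01, seed=1234):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    return [line for line in lines if rng.random() < fraction]


def write_sample(lines, path, encoding="utf-8"):
    """Write sampled lines to disk, one per line, for later reuse."""
    with open(path, "w", encoding=encoding) as handle:
        for line in lines:
            handle.write(line.rstrip("\n") + "\n")
```

Because every line is decided independently, the sample size is only approximately 1% of the input. Drawing an exact count with `random.sample` would be an alternative, but it requires holding all lines in memory first.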

Cleaning Data

There are many irregularities in the data since they come from different sources. Here I remove the following using the tm package.

  • Whitespace
  • Punctuation
  • Numbers
  • URL
  • Hashtags
  • Twitter Handles
  • HTML Tags
  • Stopwords (common grammatical words with little to no added meaning)
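The order of these removals matters: URLs, HTML tags, handles, and hashtags are recognized by their punctuation, so they have to be stripped before punctuation itself is removed. The sketch below illustrates this in Python with regular expressions. It is not the tm pipeline used for this report. The patterns, the lowercasing step, and the tiny stopword list are all assumptions made for illustration.

```python
import re

# Assumption: a tiny illustrative stopword list, not the one tm provides.
STOPWORDS = {"a", "an", "and", "the", "is", "to", "of", "in", "at"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
NUMBER_RE = re.compile(r"\d+")
PUNCT_RE = re.compile(r"[^\w\s]|_")


def clean_line(text):
    """Apply the cleaning steps to one line and return the cleaned text."""
    # Assumption: lowercase first so stopword matching is case-insensitive.
    text = text.lower()
    # Patterns that depend on punctuation must run before punctuation removal.
    for pattern in (URL_RE, HTML_TAG_RE, HANDLE_RE, HASHTAG_RE):
        text = pattern.sub(" ", text)
    text = NUMBER_RE.sub(" ", text)
    text = PUNCT_RE.sub(" ", text)
    # split() with no argument also collapses runs of whitespace.
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)
```

For example, `clean_line("Check <b>this</b> out: http://t.co/abc #fun @bob 123 times!!")` returns `"check this out times"`. Note that replacing punctuation with a space splits contractions such as "don't" into two tokens. Deleting punctuation outright is the other common choice.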

Analysis

The corpus is now ready for analysis.

There are a total of 536,125 words in the corpus, of which 23,234 are distinct. 20,497 of them are used more than twice.
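Figures of this kind come from a single frequency table over the cleaned tokens. The following Python sketch shows the computation. The function name and the dictionary keys are hypothetical, and it assumes the corpus has already been split into a flat list of tokens.

```python
from collections import Counter


def vocabulary_stats(tokens, min_count=3):
    """Summarize a token list: total, distinct, and frequently used words."""
    counts = Counter(tokens)
    return {
        "total_words": sum(counts.values()),
        "distinct_words": len(counts),
        # "More than twice" means a count of at least 3.
        "words_used_more_than_twice": sum(
            1 for count in counts.values() if count >= min_count
        ),
    }
```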

Top 25 Frequent Words
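A ranking like this one can be read directly off a frequency table. The snippet below is a minimal Python sketch with a hypothetical function name. It assumes a flat list of cleaned tokens and is not the code that produced the original plot.

```python
from collections import Counter


def top_words(tokens, n=25):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)
```

The pairs come back sorted by descending count, which is the order a bar chart of frequent words would normally use.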

NGram Tokenize

I removed sparse terms to reduce computation.
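N-gram tokenization followed by sparse-term pruning can be sketched as below. This is a Python illustration under assumptions, not the report's implementation. Here a term's sparsity is taken to be the fraction of documents that do not contain it. The 0.99 default threshold and the function names are hypothetical, since the report does not state which threshold was used.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams from a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def drop_sparse_terms(documents, n, max_sparsity=0.99):
    """Count n-grams across documents, dropping the sparse ones.

    Sparsity of a term = fraction of documents that do not contain it.
    Terms with sparsity above `max_sparsity` are removed.
    """
    term_freq = Counter()  # total occurrences of each n-gram
    doc_freq = Counter()   # number of documents containing each n-gram
    for doc in documents:
        grams = ngrams(doc.split(), n)
        term_freq.update(grams)
        doc_freq.update(set(grams))
    total_docs = len(documents)
    return {
        gram: count
        for gram, count in term_freq.items()
        if 1 - doc_freq[gram] / total_docs <= max_sparsity
    }
```

With the three documents `"i love data"`, `"i love r"`, and `"data is big"`, calling `drop_sparse_terms(docs, 2, max_sparsity=0.5)` keeps only the bigram `("i", "love")`, which appears in two of the three documents. Pruning of this kind shrinks the term table substantially, because most n-grams occur in very few documents.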