The R Markdown file can be found at: https://github.com/Rui-Lian/datascience-capstone.git
The following packages are used for this report.
library(tidyverse)# A collection of hands-on data manipulation and visualization packages from RStudio.
library(tidytext)# Text mining package built on tidy data principles.
library(wordcloud)# Word cloud for exploration.
library(quanteda)# Document-feature matrix creation.
library(stringi) # Regular expressions to clean the word strings.
Two NLP packages are used for this project.
Package tidytext (https://cran.r-project.org/web/packages/tidytext/index.html) adopts the tidy data principle, so text can be treated as a data frame and manipulated with familiar tools such as ggplot2 and dplyr.
Package quanteda is a powerful, convenient package for NLP, although it takes some time to become familiar with.
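As a minimal illustration of the quanteda workflow (the toy text and object names here are hypothetical, not taken from the report's code), a document-feature matrix can be built from a small character vector:
library(quanteda)
txt <- c(doc1 = "The quick brown fox jumps.",
         doc2 = "The lazy dog sleeps.")     # tiny toy corpus
toks <- tokens(txt, remove_punct = TRUE)    # tokenize, dropping punctuation
dfm_toy <- dfm(toks)                        # document-feature matrix (lowercased by default)
topfeatures(dfm_toy, 5)                     # most frequent features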
Raw data were downloaded from “https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip”. The zip file was unzipped in R with the unzip function, and all of the data sit in a folder named final in the working directory.
The archive contains three text files, representing data from Twitter, news, and blogs, respectively. A 25% random sample of each of the three files was read into a data frame with base::scan.
As a result, a data frame of more than 1.06 million rows was generated. The size of this sampled raw data is 208 MB.
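A minimal sketch of the sampling step, assuming the unzipped files sit under final/en_US/ and using the 25% fraction described above (sample_lines and corpus_df are hypothetical names introduced here for illustration):
set.seed(1234)                                          # hypothetical seed for reproducibility
files <- c("final/en_US/en_US.twitter.txt",
           "final/en_US/en_US.news.txt",
           "final/en_US/en_US.blogs.txt")
sample_lines <- function(path, frac = 0.25) {
  lines <- scan(path, what = "character", sep = "\n",
                encoding = "UTF-8", skipNul = TRUE, quiet = TRUE)
  sample(lines, size = floor(length(lines) * frac))     # keep a 25% random sample of lines
}
corpus_df <- purrr::map_dfr(files, function(f) {
  tibble::tibble(text = sample_lines(f), source = basename(f))
})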
A “bad-word” list was downloaded from “https://www.cs.cmu.edu/~biglou/resources/bad-words.txt”.
Each word in the list was pasted into a regular expression of the form “\bbad\b”, so that only STANDALONE bad words are detected in the corpus. It is still possible that bad words remain hidden inside longer strings.
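A minimal sketch of this filter, assuming the list has been saved locally as bad-words.txt and the sampled text lives in the hypothetical corpus_df from the sketch above; here matching lines are simply dropped:
bad_words <- readLines("bad-words.txt", warn = FALSE)
bad_regex <- paste0("\\b", bad_words, "\\b")                # standalone matches only, e.g. "\\bbad\\b"
pattern <- paste0("(?i)", paste(bad_regex, collapse = "|")) # one case-insensitive combined pattern
keep <- !stringi::stri_detect_regex(corpus_df$text, pattern)
corpus_df <- corpus_df[keep, ]                              # drop lines containing standalone bad words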
One-word tokenization was performed with the tidytext::unnest_tokens function. The source of the text (Twitter, news, or blogs) is labeled for each row. There are more than 25 million one-word tokens, representing more than 310 K unique words or symbols. Common stop words such as “the” and “of” are not excluded, because predicting these words could make typing easier, especially in mobile settings. An example of the first two rows of the data frame:
## # A tibble: 2 x 2
## text sour…
## <chr> <chr>
## 1 with graduation season right around the corner, nancy has whipped… en_U…
## 2 state contracts worth over 100 million ringgit! en_U…
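A minimal sketch of the tokenization step, assuming the sampled text lives in the hypothetical corpus_df (columns text and source) used in the earlier sketches:
library(tidytext)
tokens_1g <- corpus_df %>%
  unnest_tokens(word, text)                  # one row per word; lowercases and strips punctuation
word_counts <- tokens_1g %>%
  dplyr::count(source, word, sort = TRUE)    # word frequencies per source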
The plot below shows that a few words cover the majority of English text. On the other hand, there is a long tail of rare words, symbols, and foreign-language terms. Further exploration or modeling could trim this list for efficiency; at this stage, we just leave the long tail as it is.
The most frequent words in the three sources are too common to convey much meaning. This is problematic because: 1) these words need to stay in the model, since they are the “backbone” of the language; 2) including such common words in the model comes at the cost of system efficiency.
## Selecting by n
Because the most frequent words are too common to be meaningful on their own, a simple frequency model is not enough. A term frequency-inverse document frequency (tf-idf) model can offset those “stop words” and bring more meaningful words to the table (https://en.wikipedia.org/wiki/Tf%E2%80%93idf). The plot below shows that, after excluding common English stop words, the top 15 words from the three sources are quite different, and many “atypical” English words appear in the Twitter data. The take-homes here: 1) all three sources are needed for training; 2) English is dynamic, and some informal words are active in social media.
## Selecting by tf_idf
## Selecting by tf_idf
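A minimal sketch of the tf-idf computation behind the plot above, assuming the hypothetical word_counts table from the tokenization sketch:
word_tf_idf <- word_counts %>%
  tidytext::bind_tf_idf(word, source, n) %>%  # tf-idf with each source treated as a document
  dplyr::arrange(dplyr::desc(tf_idf))
word_tf_idf %>%
  dplyr::group_by(source) %>%
  dplyr::slice_max(tf_idf, n = 15)            # top 15 terms per source, as plotted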
The top 20 words account for 28% of all word occurrences!
## [1] 0.2807511
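A minimal sketch of that coverage calculation, again assuming the hypothetical word_counts table:
total_counts <- word_counts %>%
  dplyr::count(word, wt = n, sort = TRUE)              # collapse counts across the three sources
sum(head(total_counts$n, 20)) / sum(total_counts$n)    # share of all tokens covered by the top 20 words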
Word cloud of words appearing more than 10,000 times in the corpus.
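A minimal sketch of the word cloud, assuming the hypothetical total_counts table from the coverage sketch:
library(wordcloud)
frequent <- dplyr::filter(total_counts, n > 10000)     # words seen more than 10,000 times
wordcloud(words = frequent$word, freq = frequent$n,
          random.order = FALSE, max.words = 200)       # cap at 200 words (arbitrary choice)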
Milestone conclusions: 1. This work is potentially demanding on system resources (memory, CPU); 2. To avoid overwhelming the system, the raw corpus needs further trimming or optimization; 3. A tf-idf model could be useful for surfacing the truly meaningful words; 4. Common stop words should be included in the prediction model, but it might be possible to treat them as defaults to minimize the system (memory, CPU) requirements.
Further steps: 1. Continue to explore 2-gram and 3-gram models (a sketch of the 2-gram tokenization follows below); 2. Build the prediction model on 1-grams, 2-grams, 3-grams, or higher-order n-grams; 3. Evaluate smoothing algorithms.
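A minimal sketch of the planned 2-gram tokenization, assuming the same hypothetical corpus_df:
bigrams <- corpus_df %>%
  tidytext::unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  dplyr::count(bigram, sort = TRUE)          # 2-gram frequencies for the next modeling step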