Data for the project was provided by Coursera in partnership with SwiftKey. The data comes from a corpus called HC Corpora.
The data set was downloaded from the course website: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The data set contains four different language sets: English, Russian, Finnish, and German. My analysis focuses on the English set only.
Dataset Summary
The en_US folder contains three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
Since all three files are very large, I took a small random subset of 100 records from each file for my initial analysis.
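A rough sketch of how such a subset could be drawn is shown below; the input file paths, the ./sample/ output folder, and the fixed seed are illustrative assumptions rather than the exact code used.

## sketch: draw a random sample of 100 lines from each raw file (paths assumed)
set.seed(1234)
sample_file <- function(infile, outfile, n = 100) {
  lines <- readLines(infile, encoding = "UTF-8", skipNul = TRUE)
  writeLines(sample(lines, n), con = outfile)
}
sample_file("./final/en_US/en_US.blogs.txt", "./sample/blogs.txt")
sample_file("./final/en_US/en_US.news.txt", "./sample/news.txt")
sample_file("./final/en_US/en_US.twitter.txt", "./sample/twitter.txt")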
Building the Corpus
A corpus is the format in which text is typically stored for text mining. To start my text-mining process, I used the tm package to create a Corpus object from the three sample files (twitter.txt, blogs.txt, and news.txt).
library(tm)

## build a corpus from the plain-text files in the sample folder
docs <- Corpus(DirSource("./sample/"))
summary(docs)
## Length Class Mode
## blogs.txt 2 PlainTextDocument list
## news.txt 2 PlainTextDocument list
## twitter.txt 2 PlainTextDocument list
Cleaning the Corpus
Next, I performed some pre-processing to clean up and prepare the raw text for analysis, using transformations from the tm package.
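The exact set of steps is not reproduced in this report; as a minimal sketch, typical tm clean-up transformations (the specific choices here are assumptions) look like this:

## sketch of common tm clean-up steps (specific choices assumed)
docs <- tm_map(docs, content_transformer(tolower))  # convert text to lower case
docs <- tm_map(docs, removePunctuation)             # remove punctuation
docs <- tm_map(docs, removeNumbers)                 # remove digits
docs <- tm_map(docs, stripWhitespace)               # collapse extra whitespace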
Tokenization and Building Document Term Matrices
The next step in text mining is tokenization (generating n-grams). This step involves breaking down the text into meaningful units such as words and phrases.
Using the tm and RWeka packages, I tokenized the corpus and created document-term matrices (DTMs).
library(RWeka)

## create unigrams
UnigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
unidtm <- DocumentTermMatrix(docs, control = list(tokenize = UnigramTokenizer))
## create bi-grams
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bidtm <- DocumentTermMatrix(docs, control = list(tokenize = BigramTokenizer))
## create tri-grams
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tridtm <- DocumentTermMatrix(docs, control = list(tokenize = TrigramTokenizer))
Inspecting the DTM
The unigram DTM of my sample data contains 2933 terms (distinct words).
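That count can be checked from the dimensions of the unigram DTM (a quick check, not output from the original run):

## documents x terms; the second value is the number of distinct words
dim(unidtm)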
Taking a peek at the first few terms in the DTM:
inspect(unidtm[1:3, 1:6])
## <<DocumentTermMatrix (documents: 3, terms: 6)>>
## Non-/sparse entries: 7/11
## Sparsity : 61%
## Maximal term length: 9
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs aaa able academy accents according accounts
## blogs.txt 1 7 5 1 1 1
## news.txt 0 3 0 0 0 0
## twitter.txt 0 0 0 0 0 0
For further exploration of the data, I decided to leverage familiar tools such as the dplyr package. To do that, I converted the DTM into a data frame with one row per document-term pair. This was done with the tidytext package, which provides functions to convert between a DTM and a tidy data frame.
library(tidytext)
library(dplyr)

## convert the dtm to a tidy data frame (one row per document-term pair)
unidtm_df <- tidy(unidtm)
head(unidtm_df)
## # A tibble: 6 × 3
## document term count
## <chr> <chr> <dbl>
## 1 blogs.txt aaa 1
## 2 blogs.txt able 7
## 3 blogs.txt academy 5
## 4 blogs.txt accents 1
## 5 blogs.txt according 1
## 6 blogs.txt accounts 1
## summarize to get word frequencies
wordfreq_df <- unidtm_df %>% count(term, wt = count, sort = TRUE)
The plots below summarize my analysis of word frequencies.
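As a sketch of how such a frequency plot could be produced with ggplot2 (the top-20 cutoff is an assumption):

library(ggplot2)

## sketch: bar chart of the 20 most frequent unigrams in the sample (cutoff assumed)
wordfreq_df %>%
  top_n(20, n) %>%
  ggplot(aes(x = reorder(term, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Term", y = "Frequency", title = "Most frequent unigrams in the sample")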
Some observations