Introduction

This is a milestone report for the Data Science Specialization SwiftKey Capstone. The goal of this milestone is to report on the exploratory analysis and on the goals for the eventual application and algorithm. This document explains the major features of the data and summarizes plans for creating the prediction algorithm.

Dataset

The dataset used in this report is available at [this link](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). Only the US-English part of the dataset was used (the files en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt).

# read the three US-English files; skipNul=TRUE skips embedded NUL characters
blogs   <- readLines("./final/en_US/en_US.blogs.txt", skipNul=TRUE)
news    <- readLines("./final/en_US/en_US.news.txt", skipNul=TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", skipNul=TRUE)

File statistics

Table 1: Basic statistics for the files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt

| File Name         | File size [MB] | Number of lines | Number of words |
|-------------------|----------------|-----------------|-----------------|
| en_US.blogs.txt   | 200.42         | 899288          | 37334147        |
| en_US.news.txt    | 196.28         | 1010242         | 34372530        |
| en_US.twitter.txt | 159.36         | 2360148         | 30373603        |
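
A minimal sketch of how these statistics can be computed, assuming the stringi package for word counting (the values in the table above come from the full files):

library(stringi)

# file sizes in megabytes
file.size("./final/en_US/en_US.blogs.txt") / 1024^2

# line counts of the vectors read above
length(blogs); length(news); length(twitter)

# total word counts per file
sum(stri_count_words(blogs))
sum(stri_count_words(news))
sum(stri_count_words(twitter))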

For the remaining analysis, 10000 lines were sampled from each file: enough to provide unbiased statistics, while also speeding up the calculations in the exploratory analysis.

lines <- 10000
set.seed(1234)  # fixed seed (arbitrary value) so the sampling is reproducible
blogs_sample   <- sample(blogs, lines, replace=FALSE)
news_sample    <- sample(news, lines, replace=FALSE)
twitter_sample <- sample(twitter, lines, replace=FALSE)

The table below shows basic statistics for the word count per line in the Twitter, blog, and news files, based on the 10000 sampled lines from each file.

Table 2: Basic statistics for the word count per line in files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt

| File Name         | Minimum | 1st Quartile | Median | Mean  | 3rd Quartile | Maximum |
|-------------------|---------|--------------|--------|-------|--------------|---------|
| en_US.twitter.txt | 2       | 7            | 12     | 13.31 | 19           | 34      |
| en_US.blogs.txt   | 1       | 9            | 29     | 42.94 | 63           | 380     |
| en_US.news.txt    | 1       | 19           | 33     | 36.08 | 47           | 296     |
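
A sketch of how these per-line summaries can be obtained, again assuming stringi for word counting:

library(stringi)

summary(stri_count_words(twitter_sample))
summary(stri_count_words(blogs_sample))
summary(stri_count_words(news_sample))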

Below is a graphical representation of the word count per line in each of the files, again based on the 10000 sampled lines from each file.
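
A minimal sketch of how such plots can be produced, assuming ggplot2 and the stringi word counts above:

library(ggplot2)
library(stringi)

# one row per sampled line, tagged with its source file
word_counts <- data.frame(
  words  = c(stri_count_words(blogs_sample),
             stri_count_words(news_sample),
             stri_count_words(twitter_sample)),
  source = rep(c("blogs", "news", "twitter"), each = lines)
)

ggplot(word_counts, aes(x = words)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ source, scales = "free") +
  labs(x = "Words per line", y = "Number of lines")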

Data cleaning

The three sampled datasets are then combined into a single dataset for cleanup. The following transformations were applied to the dataset: conversion to lowercase, removal of punctuation, removal of English stop words, and word stemming (all of which are visible in the n-gram tables below).
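
A minimal sketch of such a cleanup pipeline using the tm package; the variable name combined_sample and the exact ordering of steps are assumptions, with number removal and whitespace stripping included as common additional steps:

library(tm)

combined_sample <- c(blogs_sample, news_sample, twitter_sample)  # assumed name

corpus <- VCorpus(VectorSource(combined_sample))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits (assumed step)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces (assumed step)
corpus <- tm_map(corpus, stemDocument)                       # Porter stemming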

The cleaned dataset can now be presented as a word cloud, which gives more visual emphasis to the words that are used most frequently.
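
A sketch of how the word cloud can be generated from the cleaned corpus built in the sketch above, assuming the wordcloud and RColorBrewer packages:

library(wordcloud)
library(RColorBrewer)

# term frequencies from the cleaned corpus (tm loaded above)
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(freq), freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))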

N-gram analysis

An n-gram in natural language processing is a contiguous sequence of n words from text or speech. The table and plots below show basic statistics for the most frequent 1-, 2-, and 3-grams (uni-grams, bi-grams, and tri-grams) in the US-English dataset.
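
The report does not state which tokenizer was used; one common approach, sketched here as an assumption, combines RWeka's NGramTokenizer with tm's TermDocumentMatrix over the cleaned corpus from above:

library(tm)
library(RWeka)

BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 10)  # the ten most frequent bi-grams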

Table 3: Uni-gram, bi-gram, and tri-gram frequencies for the 10 most frequent expressions in files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt (terms appear in stemmed form as a result of the cleanup step)

| Uni-gram | Uni-gram Freq | Bi-gram     | Bi-gram Freq | Tri-gram            | Tri-gram Freq |
|----------|---------------|-------------|--------------|---------------------|---------------|
| said     | 3025          | year old    | 266          | cant wait see       | 24            |
| one      | 2787          | last year   | 213          | new york citi       | 24            |
| will     | 2772          | new york    | 191          | presid barack obama | 20            |
| get      | 2405          | right now   | 179          | new york time       | 18            |
| like     | 2365          | year ago    | 173          | happi mother day    | 17            |
| time     | 2289          | look like   | 155          | let us know         | 15            |
| year     | 2271          | dont know   | 154          | year old daughter   | 14            |
| just     | 2266          | high school | 151          | happi new year      | 13            |
| go       | 2052          | feel like   | 128          | look forward see    | 13            |
| can      | 2028          | last week   | 121          | two year ago        | 12            |

The plots below are a graphical representation of the table above.
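
A sketch of one way such frequency plots can be drawn with ggplot2, using the uni-gram frequencies (freq) computed in the word-cloud sketch above:

library(ggplot2)

top_unigrams <- data.frame(term  = names(head(freq, 10)),
                           count = head(freq, 10))

ggplot(top_unigrams, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Uni-gram", y = "Frequency")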

Summary

Several key points from this analysis are worth pointing out.

Next steps

The final application for word prediction will be built with Shiny; several further steps are needed to prepare the prediction algorithm and the application itself.
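
As a final illustration, a minimal sketch of what the Shiny application could look like; predict_next_word() is a hypothetical stub standing in for the n-gram model still to be built:

library(shiny)

# hypothetical placeholder for the prediction model from the later steps
predict_next_word <- function(phrase) {
  "the"  # stub prediction
}

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText(predict_next_word(input$phrase))
}

shinyApp(ui, server)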