Introduction

This is a milestone report for the Data Science Specialization SwiftKey Capstone. The goal of this milestone is to report on the exploratory analysis and on the goals for the eventual application and algorithm. This document explains the major features of the data and summarizes plans for creating the prediction algorithm.

Dataset

The dataset used in this report is available at [this link](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). Only the US-English part of the dataset was used (the files en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt).

# read the three US-English files; skipNul=TRUE skips embedded NUL characters
blogs   <- readLines("./final/en_US/en_US.blogs.txt", skipNul=TRUE)
news    <- readLines("./final/en_US/en_US.news.txt", skipNul=TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", skipNul=TRUE)

File statistics

Table 1: Basic statistics for the files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt

| File Name         | File size [MB] | Number of lines | Number of words |
|-------------------|----------------|-----------------|-----------------|
| en_US.blogs.txt   | 200.42         | 899288          | 37334147        |
| en_US.news.txt    | 196.28         | 1010242         | 34372530        |
| en_US.twitter.txt | 159.36         | 2360148         | 30373603        |
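
A minimal sketch of how these statistics can be computed, assuming the stringi package for word counting (the values in the table above come from the full files):

library(stringi)

# file sizes in megabytes
file.size("./final/en_US/en_US.blogs.txt") / 1024^2

# line counts of the vectors read above
length(blogs); length(news); length(twitter)

# total word counts per file
sum(stri_count_words(blogs))
sum(stri_count_words(news))
sum(stri_count_words(twitter))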

For the remaining analysis, 10000 lines were sampled from each file: enough to provide unbiased statistics, while also speeding up the calculations in the exploratory analysis.

lines <- 10000
set.seed(1234)  # fixed seed (arbitrary value) so the sampling is reproducible
blogs_sample   <- sample(blogs, lines, replace=FALSE)
news_sample    <- sample(news, lines, replace=FALSE)
twitter_sample <- sample(twitter, lines, replace=FALSE)

The table below shows basic statistics for the word count per line in the Twitter, blog, and news files, based on the 10000 sampled lines from each file.

Table 2: Basic statistics for the word count per line in files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt

| File Name         | Minimum | 1st Quartile | Median | Mean  | 3rd Quartile | Maximum |
|-------------------|---------|--------------|--------|-------|--------------|---------|
| en_US.twitter.txt | 2       | 7            | 12     | 13.31 | 19           | 34      |
| en_US.blogs.txt   | 1       | 9            | 29     | 42.94 | 63           | 380     |
| en_US.news.txt    | 1       | 19           | 33     | 36.08 | 47           | 296     |
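
A sketch of how these per-line summaries can be obtained, again assuming stringi for word counting:

library(stringi)

summary(stri_count_words(twitter_sample))
summary(stri_count_words(blogs_sample))
summary(stri_count_words(news_sample))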

Below is a graphical representation of the word count per line in each of the files, again based on the 10000 sampled lines from each file.
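
A minimal sketch of how such plots can be produced, assuming ggplot2 and the stringi word counts above:

library(ggplot2)
library(stringi)

# one row per sampled line, tagged with its source file
word_counts <- data.frame(
  words  = c(stri_count_words(blogs_sample),
             stri_count_words(news_sample),
             stri_count_words(twitter_sample)),
  source = rep(c("blogs", "news", "twitter"), each = lines)
)

ggplot(word_counts, aes(x = words)) +
  geom_histogram(binwidth = 5) +
  facet_wrap(~ source, scales = "free") +
  labs(x = "Words per line", y = "Number of lines")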

Data cleaning

The three sampled datasets are then combined into a single dataset for cleanup. The following transformations were applied to the dataset: conversion to lowercase, removal of punctuation, removal of English stop words, and word stemming (all of which are visible in the n-gram tables below).
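
A minimal sketch of such a cleanup pipeline using the tm package; the variable name combined_sample and the exact ordering of steps are assumptions, with number removal and whitespace stripping included as common additional steps:

library(tm)

combined_sample <- c(blogs_sample, news_sample, twitter_sample)  # assumed name

corpus <- VCorpus(VectorSource(combined_sample))
corpus <- tm_map(corpus, content_transformer(tolower))       # lowercase
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits (assumed step)
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse extra spaces (assumed step)
corpus <- tm_map(corpus, stemDocument)                       # Porter stemming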

The cleaned dataset can now be presented as a word cloud, which gives more visual emphasis to the words that are used most frequently.
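
A sketch of how the word cloud can be generated from the cleaned corpus built in the sketch above, assuming the wordcloud and RColorBrewer packages:

library(wordcloud)
library(RColorBrewer)

# term frequencies from the cleaned corpus (tm loaded above)
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

wordcloud(names(freq), freq, max.words = 100,
          random.order = FALSE, colors = brewer.pal(8, "Dark2"))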

N-gram analysis

An n-gram in natural language processing is a contiguous sequence of n words from text or speech. The table and plots below show basic statistics for the most frequent 1-, 2-, and 3-grams (uni-grams, bi-grams, and tri-grams) in the US-English dataset.
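
The report does not state which tokenizer was used; one common approach, sketched here as an assumption, combines RWeka's NGramTokenizer with tm's TermDocumentMatrix over the cleaned corpus from above:

library(tm)
library(RWeka)

BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
head(bigram_freq, 10)  # the ten most frequent bi-grams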

Table 3: Uni-gram, bi-gram, and tri-gram frequencies for the 10 most frequent expressions in files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt (terms appear in stemmed form as a result of the cleanup step)

| Uni-gram | Uni-gram Freq | Bi-gram     | Bi-gram Freq | Tri-gram            | Tri-gram Freq |
|----------|---------------|-------------|--------------|---------------------|---------------|
| said     | 3025          | year old    | 266          | cant wait see       | 24            |
| one      | 2787          | last year   | 213          | new york citi       | 24            |
| will     | 2772          | new york    | 191          | presid barack obama | 20            |
| get      | 2405          | right now   | 179          | new york time       | 18            |
| like     | 2365          | year ago    | 173          | happi mother day    | 17            |
| time     | 2289          | look like   | 155          | let us know         | 15            |
| year     | 2271          | dont know   | 154          | year old daughter   | 14            |
| just     | 2266          | high school | 151          | happi new year      | 13            |
| go       | 2052          | feel like   | 128          | look forward see    | 13            |
| can      | 2028          | last week   | 121          | two year ago        | 12            |

The plots below are a graphical representation of the table above.
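
A sketch of one way such frequency plots can be drawn with ggplot2, using the uni-gram frequencies (freq) computed in the word-cloud sketch above:

library(ggplot2)

top_unigrams <- data.frame(term  = names(head(freq, 10)),
                           count = head(freq, 10))

ggplot(top_unigrams, aes(x = reorder(term, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Uni-gram", y = "Frequency")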

Summary

Several key points from this analysis are worth pointing out.

Next steps

The final application for word prediction will be built with Shiny; several further steps are needed to prepare the prediction algorithm and the application itself.
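
As a final illustration, a minimal sketch of what the Shiny application could look like; predict_next_word() is a hypothetical stub standing in for the n-gram model still to be built:

library(shiny)

# hypothetical placeholder for the prediction model from the later steps
predict_next_word <- function(phrase) {
  "the"  # stub prediction
}

ui <- fluidPage(
  textInput("phrase", "Enter a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText(predict_next_word(input$phrase))
}

shinyApp(ui, server)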