This is a milestone report for the Data Science Specialization SwiftKey Capstone. The goal of this milestone is to report on the exploratory analysis and to outline the goals for the eventual application and algorithm. This document describes the major features of the data and summarizes plans for creating the prediction algorithm.
The dataset used in this report is available at [this link](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip). Only the part containing US-English text was used (files en_US.twitter.txt, en_US.blogs.txt, and en_US.news.txt).
# Read the raw text files; skipNul avoids warnings caused by embedded nul characters.
blogs   <- readLines("./final/en_US/en_US.blogs.txt", skipNul = TRUE)
news    <- readLines("./final/en_US/en_US.news.txt", skipNul = TRUE)
twitter <- readLines("./final/en_US/en_US.twitter.txt", skipNul = TRUE)
File Name | File size [MB] | Number of lines | Number of words |
---|---|---|---|
en_US.blogs.txt | 200.42 | 899288 | 37334147 |
en_US.news.txt | 196.28 | 1010242 | 34372530 |
en_US.twitter.txt | 159.36 | 2360148 | 30373603 |
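The figures in the table above can be obtained roughly as sketched below. This is only a sketch: it assumes the `stringi` package and the raw vectors `blogs`, `news`, and `twitter` from the previous chunk, and the word counts depend on the tokenizer, so they may differ slightly from the reported values.

```r
# Sketch: file size, line count, and word count for each raw file.
library(stringi)

file_stats <- function(path, text) {
  data.frame(
    file    = basename(path),
    size_MB = round(file.info(path)$size / 1024^2, 2),
    lines   = length(text),
    words   = sum(stri_count_words(text))
  )
}

rbind(
  file_stats("./final/en_US/en_US.blogs.txt",   blogs),
  file_stats("./final/en_US/en_US.news.txt",    news),
  file_stats("./final/en_US/en_US.twitter.txt", twitter)
)
```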
For the further analysis, 10000 lines were sampled from each file. This keeps the samples unbiased while providing enough data for meaningful statistics, and it also speeds up the calculations in the exploratory analysis.
# Draw a random sample of 10000 lines (without replacement) from each file.
lines <- 10000
blogs_sample   <- sample(blogs, lines, replace = FALSE)
news_sample    <- sample(news, lines, replace = FALSE)
twitter_sample <- sample(twitter, lines, replace = FALSE)
The table below shows basic statistics for the number of words per line in the Twitter, blog, and news files, based on the 10000 sampled lines from each file.
File Name | Minimum | 1st Quartile | Median | Mean | 3rd Quartile | Maximum |
---|---|---|---|---|---|---|
en_US.twitter.txt | 2 | 7 | 12 | 13.31 | 19 | 34 |
en_US.blogs.txt | 1 | 9 | 29 | 42.94 | 63 | 380 |
en_US.news.txt | 1 | 19 | 33 | 36.08 | 47 | 296 |
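Summaries like these can be produced, for example, with `summary()` over per-line word counts; the sketch below assumes the `stringi` package and the `*_sample` vectors defined above, so the exact values depend on the random sample drawn.

```r
# Sketch: five-number summary (plus mean) of words per line for each sample.
library(stringi)

summary(stri_count_words(twitter_sample))
summary(stri_count_words(blogs_sample))
summary(stri_count_words(news_sample))
```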
Below is a graphical representation of the word frequency per line in each of the files, again based on the 10000 sampled lines from each file.
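The original figures are not reproduced here; the histograms below are one possible way to visualize the words-per-line distributions, assuming base R graphics and the `stringi` word counter.

```r
# Sketch: histograms of words per line for the three samples.
library(stringi)

par(mfrow = c(1, 3))
hist(stri_count_words(twitter_sample), main = "en_US.twitter.txt", xlab = "Words per line")
hist(stri_count_words(blogs_sample),   main = "en_US.blogs.txt",   xlab = "Words per line")
hist(stri_count_words(news_sample),    main = "en_US.news.txt",    xlab = "Words per line")
```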
The three samples are then combined into a single dataset for cleanup. The transformations applied to the dataset include conversion to lower case, removal of punctuation, removal of common English stop words, and word stemming, as can be seen in the n-gram tables further below.
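The exact cleanup code is not shown here; the sketch below is a typical `tm` pipeline that would produce the lower-case, punctuation-free, stop-word-free, stemmed tokens seen in the n-gram tables (the `SnowballC` package supplies the stemmer used by `stemDocument`).

```r
# Sketch of a typical tm cleanup pipeline (the exact transformations used in
# the report are inferred from the stemmed, lower-case n-grams shown below).
library(tm)

combined <- c(blogs_sample, news_sample, twitter_sample)
corpus   <- VCorpus(VectorSource(combined))

corpus <- tm_map(corpus, content_transformer(tolower))       # lower case
corpus <- tm_map(corpus, removePunctuation)                  # drop punctuation
corpus <- tm_map(corpus, removeNumbers)                      # drop digits
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # drop stop words
corpus <- tm_map(corpus, stripWhitespace)                    # collapse whitespace
corpus <- tm_map(corpus, stemDocument)                       # Porter stemming
```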
The dataset can now be presented as a word cloud, which gives more visual emphasis to the words used most frequently.
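A word cloud can be generated, for instance, with the `wordcloud` package from the cleaned `corpus` built above; this is a sketch, not necessarily the exact code used for the figure.

```r
# Sketch: word cloud of the 100 most frequent (stemmed) terms.
library(wordcloud)
library(RColorBrewer)

tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
wordcloud(names(freq), freq, max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))
```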
An n-gram in natural language processing is a contiguous sequence of n words in text or speech. The table and plots below show basic statistics for the most frequent 1-, 2-, and 3-grams (uni-grams, bi-grams, and tri-grams) in the US-English dataset.
Uni-gram | Uni-gram Freq | Bi-gram | Bi-gram Freq | Tri-gram | Tri-gram Freq |
---|---|---|---|---|---|
said | 3025 | year old | 266 | cant wait see | 24 |
one | 2787 | last year | 213 | new york citi | 24 |
will | 2772 | new york | 191 | presid barack obama | 20 |
get | 2405 | right now | 179 | new york time | 18 |
like | 2365 | year ago | 173 | happi mother day | 17 |
time | 2289 | look like | 155 | let us know | 15 |
year | 2271 | dont know | 154 | year old daughter | 14 |
just | 2266 | high school | 151 | happi new year | 13 |
go | 2052 | feel like | 128 | look forward see | 13 |
can | 2028 | last week | 121 | two year ago | 12 |
The plots below are a graphical representation of the table above.
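One common way to obtain such n-gram frequencies is the `RWeka` tokenizer together with `tm` term-document matrices, as sketched below; this assumes the cleaned `corpus` from the cleanup step and a working Java installation for `RWeka`.

```r
# Sketch: uni-, bi-, and tri-gram frequency counts via RWeka + tm.
library(tm)
library(RWeka)

BigramTokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

unigram_freq <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus))),
                     decreasing = TRUE)
bigram_freq  <- sort(rowSums(as.matrix(TermDocumentMatrix(
  corpus, control = list(tokenize = BigramTokenizer)))), decreasing = TRUE)
trigram_freq <- sort(rowSums(as.matrix(TermDocumentMatrix(
  corpus, control = list(tokenize = TrigramTokenizer)))), decreasing = TRUE)

head(unigram_freq, 10)
head(bigram_freq, 10)
head(trigram_freq, 10)
```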
It is worth pointing out several key observations from this analysis:
The final word-prediction application will be built with Shiny. The following steps are needed to create the application:
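As an illustration of the intended architecture, the skeleton below shows what such a Shiny application might look like; `predict_next_word()` is a hypothetical placeholder standing in for the eventual n-gram prediction model.

```r
# Sketch of a minimal Shiny app for next-word prediction.
# predict_next_word() is a hypothetical placeholder for the real model.
library(shiny)

predict_next_word <- function(phrase) {
  # In the real app, look up the last words of `phrase` in precomputed
  # n-gram frequency tables and back off to shorter n-grams as needed.
  "the"  # placeholder prediction
}

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(input$phrase) == 0) "" else predict_next_word(input$phrase)
  })
}

shinyApp(ui = ui, server = server)
```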