The SwiftKey Data

The SwiftKey data used for this project consists of three large files: en_US.blogs.txt, taken arbitrarily from web logs; en_US.news.txt, sampled from recent newspaper and internet news stories; and en_US.twitter.txt, taken from Twitter messages.

## [1] "C:/Albert/Coursera Data Science/Data Science Capstone/Project/data/SwiftKey 1/en_US"
##                   file size(MB) num_lines longest_line num_words
## 1   .//en_US.blogs.txt   200.42    899288       483415  37334441
## 2    .//en_US.news.txt   196.28     77259        14556   2643972
## 3 .//en_US.twitter.txt   159.36   2360148      1484357  30373792
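The raw files can be loaded into R with readLines(); a minimal sketch, assuming the working directory shown above (the skipNul argument is a defensive assumption to guard against embedded null characters):

# load each raw file as a character vector, one element per line
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)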

Building the Sample Data

For this analysis, each of the three files is sampled using a binomial random sampling algorithm: each line is kept with probability 0.005, so each sample contains roughly 0.5 per cent of the file's lines. Each sample is then converted from UTF-8 to ASCII, which removes any non-standard characters. The data is preprocessed to remove profanity, contractions and other anomalies, and is converted to lower case.

set.seed(11149)

# keep each line with probability 0.005 (about 0.5 per cent of each dataset)
blogs.samp   <- blogs[as.logical(rbinom(length(blogs), 1, 0.005))]
news.samp    <- news[as.logical(rbinom(length(news), 1, 0.005))]
twitter.samp <- twitter[as.logical(rbinom(length(twitter), 1, 0.005))]

# convert UTF-8 to ASCII, dropping any non-standard characters
blogs.samp   <- iconv(blogs.samp, "UTF-8", "ASCII", sub = "")
news.samp    <- iconv(news.samp, "UTF-8", "ASCII", sub = "")
twitter.samp <- iconv(twitter.samp, "UTF-8", "ASCII", sub = "")

# preprocess: remove profanity, contractions and other anomalies; lower-case
# (preprocess_data() is a helper function defined elsewhere in the project)
blogs.samp   <- preprocess_data(blogs.samp)
news.samp    <- preprocess_data(news.samp)
twitter.samp <- preprocess_data(twitter.samp)

# save the individual samples
write(blogs.samp, "../../data/Wk2/blogs.samp.txt")
write(news.samp, "../../data/Wk2/news.samp.txt")
write(twitter.samp, "../../data/Wk2/twitter.samp.txt")

# merge the three samples and save the combined file
mergedData <- c(blogs.samp, news.samp, twitter.samp)
writeLines(mergedData, "../../data/Wk2/mergedData.txt")

The Sample Data

The three sample files, blogs.samp.txt, news.samp.txt and twitter.samp.txt, are merged into a single training file, mergedDataPrep.txt, which is used for most of the analysis that follows. Note that the merged file's size, line count and word count are the sums of those of the three sample files.

##                          file size(MB) num_lines longest_line num_words
## 1   ./data/Wk2/blogs.samp.txt     0.96      4481         2053    185859
## 2    ./data/Wk2/news.samp.txt     0.92      5097          149    167552
## 3 ./data/Wk2/twitter.samp.txt     0.73     11864         7348    146461
## [1] "THE TRAINING FILE STATISTICS"
## [1] "mergedDataPrep.txt, SIZE: (2.61 MB), LINES: (21442), WORDS: (499872), LONGEST LINE: (2053)"

Create the Corpus

A corpus is defined as a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. Here the merged sample is converted into a corpus object, which renders the data in a prescribed vector format that can be consumed by the text-analysis algorithms used below.
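The summary output below is in the style of the quanteda package; a minimal sketch of how the DLCorp object might have been built, assuming quanteda and the merged sample file created above (the file is re-read so the step stands alone):

library(quanteda)

# read the merged sample and build a corpus with one document per line
mergedData <- readLines("../../data/Wk2/mergedData.txt")
DLCorp <- corpus(mergedData)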


summary(DLCorp, n = 5)
## Corpus consisting of 21442 documents, showing 5 documents.
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
##   Text Types Tokens Sentences
##  text1    33     39         1
##  text2    47     60         1
##  text3    61     76         1
##  text4     9     10         1
##  text5    28     35         1
## 
## Source:  C:/Albert/Coursera Data Science/Data Science Capstone/Project/code/Wk2/* on x86-64 by tbalg
## Created: Thu Nov 17 15:18:05 2016
## Notes:

Data Dictionary

Diversity

Lexical diversity is measured here as \(Div = \frac{T}{\sqrt{2V}}\), where \(T\) is the number of word types and \(V\) is the total word count (Carroll's corrected type-token ratio). When the three samples are combined, the resulting diversity (26.959, see the table below) is larger than that of any individual sample.
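As a worked check against the table below, the blogs sample's diversity follows directly from its type and word counts:

# Carroll's corrected TTR for the blogs sample (values from the table below)
types <- 13822    # word types (T)
words <- 185859   # total words (V)
types / sqrt(2 * words)   # = 22.671, matching the Diversity row below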

Frequency-Based Dictionary

The output below shows the number of word types required as a function of the per cent of frequency-dictionary coverage.

##                blogs    news     twitter  total   
## Documents      "4481"   "5097"   "11864"  "21442" 
## Vocabulary(V)  "185859" "167552" "146461" "499872"
## Word Types (T) "13822"  "14744"  "12394"  "26956" 
## TTR (T/V)      "0.074"  "0.088"  "0.085"  "0.054" 
## Diversity      "22.671" "25.47"  "22.9"   "26.959"

## [1] "When creating a frequency dictionary of 26956 unique words from this dfm, it will take 587 words for 50 per cent, 6907 words for 90 per cent and 21460 words for 98 per cent coverage. "

N-Grams

Unigrams contain one word, bigrams two words and trigrams three words; the pattern extends to n-grams of any desired length. In the following sections, unigrams, bigrams and trigrams are extracted from the data sample and presented as frequency plots and word clouds.
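A minimal sketch of how the n-gram document-feature matrices might be built with quanteda (the dfm.trigram name appears in the trigram code below; dfm.bigram is assumed by analogy):

# tokenize once, then form n-grams of the desired order
toks <- tokens(DLCorp, remove_punct = TRUE)
dfm.bigram  <- dfm(tokens_ngrams(toks, n = 2))
dfm.trigram <- dfm(tokens_ngrams(toks, n = 3))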

Unigram Analysis

Twenty Most Frequent Word Types

top.features.plot
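A minimal sketch of how top.features.plot might be constructed, assuming quanteda's topfeatures() and ggplot2; the bigram and trigram plots bfp.plot and tfp.plot can be built the same way from their respective dfms:

library(ggplot2)

# twenty most frequent word types and their counts
tf <- topfeatures(DLCdfm, 20)
tf.df <- data.frame(word = reorder(names(tf), tf), freq = unname(tf))

top.features.plot <- ggplot(tf.df, aes(x = word, y = freq)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    labs(x = NULL, y = "Frequency")
top.features.plot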

Unigram Cloud of Word Types

# word cloud of the 75 most frequent unigrams (brewer.pal() requires n >= 3)
options(warn = -1)
require(RColorBrewer)
plot(DLCdfm, max.words = 75, colors = brewer.pal(3, "Spectral"),
     scale = c(8, 0.5), vfont = c("serif", "plain"))
options(warn = 0)

Unigram Plots of Word Frequencies and Frequency Occurrences


Bigram Analysis

Most Frequent Bigram Types

bfp.plot

Bigram Cloud of Types

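The bigram cloud can be rendered the same way as the unigram and trigram clouds; a minimal sketch, assuming the dfm.bigram object sketched earlier:

options(warn = -1)
require(RColorBrewer)
plot(dfm.bigram, max.words = 75, colors = brewer.pal(6, "Dark2"),
     scale = c(4, 0.5))
options(warn = 0)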

Trigram Analysis

Most Frequent Trigram Types

tfp.plot

Trigram Cloud of Types

# word cloud of the 75 most frequent trigrams
options(warn = -1)
require(RColorBrewer)
plot(dfm.trigram, max.words = 75, colors = brewer.pal(6, "Dark2"),
     scale = c(8, 0.5), vfont = c("gothic english", "plain"))
options(warn = 0)

# total elapsed time for the report (ptm is assumed to have been set with
# proc.time() at the start of the script)
proc.time() - ptm
##    user  system elapsed 
##  244.31    6.16  253.81