The SwiftKey data used for this project consists of three large sample files: en_US.blogs.txt, drawn arbitrarily from web logs; en_US.twitter.txt, drawn from Twitter messages; and en_US.news.txt, drawn from recent newspaper and internet news stories.
## [1] "C:/Albert/Coursera Data Science/Data Science Capstone/Project/data/SwiftKey 1/en_US"
## file size(MB) num_lines longest_line num_words
## 1 .//en_US.blogs.txt 200.42 899288 483415 37334441
## 2 .//en_US.news.txt 196.28 77259 14556 2643972
## 3 .//en_US.twitter.txt 159.36 2360148 1484357 30373792
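The code that produced the table above is not shown in this report. As a rough sketch only, the same four statistics could be gathered in base R as follows. The helper name `file_stats` and the whitespace-based word count are assumptions, so the counts may differ slightly from the author's.

```r
# Hypothetical helper (not the author's code): basic statistics for one text file
file_stats <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  # Count whitespace-separated words on each line
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file         = basename(path),
    size_MB      = round(file.info(path)$size / 1024^2, 2),
    num_lines    = length(lines),
    longest_line = max(nchar(lines, allowNA = TRUE), na.rm = TRUE),
    num_words    = sum(words_per_line)
  )
}

# Usage on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps", "over the lazy dog"), tmp)
stats <- file_stats(tmp)
print(stats)
```

Applying `file_stats` to each of the three files and combining the rows with `rbind` would give a table of the same shape as the one above.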
For this analysis, each of the three files is sampled with a binomial random sampling algorithm: every line is kept independently with probability 0.005, so each sample holds roughly 0.5 per cent of the lines in its file. Each sample is then converted from UTF-8 to ASCII, which drops any character that has no ASCII equivalent. Finally, the data is preprocessed to remove profanity, contractions and other anomalies, and is converted to lower case.
set.seed(11149)
#take .5 % of each dataset
blogs.samp<-blogs[as.logical(rbinom(length(blogs),1,0.005))]
news.samp<-news[as.logical(rbinom(length(news),1,0.005))]
twitter.samp<-twitter[as.logical(rbinom(length(twitter),1,0.005))]
# purify: convert UTF-8 to ASCII, dropping characters that cannot be represented
blogs.samp <- iconv(blogs.samp, "UTF-8", "ASCII", sub = "")
news.samp <- iconv(news.samp, "UTF-8", "ASCII", sub = "")
twitter.samp <- iconv(twitter.samp, "UTF-8", "ASCII", sub = "")
#preprocess
blogs.samp<-preprocess_data(blogs.samp)
news.samp<-preprocess_data(news.samp)
twitter.samp<-preprocess_data(twitter.samp)
#save
write(blogs.samp, "../../data/Wk2/blogs.samp.txt")
write(news.samp, "../../data/Wk2/news.samp.txt")
write(twitter.samp, "../../data/Wk2/twitter.samp.txt")
mergedData <- c(blogs.samp, news.samp, twitter.samp)
writeLines(mergedData, "../../data/Wk2/mergedData.txt")
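The `preprocess_data()` helper called above is not shown in this report. The sketch below is only a guess at what such a step might look like. The function name `preprocess_sketch`, the two-word placeholder list, and the small contraction table are all hypothetical. Expanding contractions is just one possible reading of "removing" them.

```r
# Hypothetical stand-in for preprocess_data(); the author's implementation is not shown
preprocess_sketch <- function(x,
                              banned = c("darn", "heck"),  # placeholder word list (assumption)
                              contractions = c("can't" = "cannot",
                                               "won't" = "will not",
                                               "it's"  = "it is")) {
  x <- tolower(x)
  # Expand each known contraction (literal match, no regex)
  for (k in names(contractions)) {
    x <- gsub(k, contractions[[k]], x, fixed = TRUE)
  }
  # Drop banned words that appear as whole words
  if (length(banned) > 0) {
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)
  }
  # Collapse the whitespace left behind and trim the ends
  trimws(gsub("\\s+", " ", x))
}

cleaned <- preprocess_sketch("It's a DARN shame we can't go")
print(cleaned)
```

A real profanity filter would load a much longer word list from a file; the two-word default here only keeps the example self-contained.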
The three sample files, blogs.samp.txt, news.samp.txt and twitter.samp.txt, are merged into a single file, mergedDataPrep.txt, which is used for most of the later analysis. Note that the size, line count and word count of mergedDataPrep.txt are the sums of the corresponding values for the three sample files.
## file size(MB) num_lines longest_line num_words
## 1 ./data/Wk2/blogs.samp.txt 0.96 4481 2053 185859
## 2 ./data/Wk2/news.samp.txt 0.92 5097 149 167552
## 3 ./data/Wk2/twitter.samp.txt 0.73 11864 7348 146461
## [1] "THE TRAINING FILE STATISTICS"
## [1] "mergedDataPrep.txt, SIZE: (2.61 MB), LINES: (21442), WORDS: (499872), LONGEST LINE: (2053)"
A corpus is a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject. Converting the data to a corpus renders it in a prescribed vector format that other software algorithms can use.
##
summary(DLCorp, n = 5)
## Corpus consisting of 21442 documents, showing 5 documents.
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Text Types Tokens Sentences
## text1 33 39 1
## text2 47 60 1
## text3 61 76 1
## text4 9 10 1
## text5 28 35 1
##
## Source: C:/Albert/Coursera Data Science/Data Science Capstone/Project/code/Wk2/* on x86-64 by tbalg
## Created: Thu Nov 17 15:18:05 2016
## Notes:
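To make the Types and Tokens columns of the summary above concrete, here is a small base R sketch on two toy documents. It splits on whitespace only, which is an assumption. The corpus package used in this report tokenizes differently (for example, it may count punctuation), so its numbers will not match this simple count exactly.

```r
# Toy documents (illustrative only, not taken from the sample data)
docs <- c(text1 = "the cat sat on the mat", text2 = "a dog barked")

# Tokens: every whitespace-separated word; Types: the distinct words among them
tokens <- strsplit(docs, "\\s+")
summary_tbl <- data.frame(
  Text   = names(docs),
  Types  = vapply(tokens, function(tk) length(unique(tk)), integer(1)),
  Tokens = lengths(tokens),
  row.names = NULL
)
print(summary_tbl)
```

In `text1` the word "the" occurs twice, so the document has six tokens but only five types.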
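Lexical diversity is measured here as \(\mathrm{Div} = \frac{T}{\sqrt{2V}}\), where \(T\) is the number of word types (unique words) and \(V\) is the total number of words (labelled Vocabulary in the table below). If the three samples are combined, the resulting diversity is larger than that of any single sample.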
The plot below shows the number of word types required as a function of the percentage of the sample covered by the frequency dictionary.
## blogs news twitter total
## Documents "4481" "5097" "11864" "21442"
## Vocabulary(V) "185859" "167552" "146461" "499872"
## Word Types (T) "13822" "14744" "12394" "26956"
## TTR (T/V) "0.074" "0.088" "0.085" "0.054"
## Diversity "22.671" "25.47" "22.9" "26.959"
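The TTR and Diversity rows of the table above follow directly from the Word Types and Vocabulary rows. A minimal sketch of that arithmetic follows; the function name `lexical_stats` and its argument names are assumptions, and the inputs are the values reported in the table.

```r
# Hypothetical helper: type-token ratio and diversity from the two counts
lexical_stats <- function(types, words) {
  c(TTR       = round(types / words, 3),
    Diversity = round(types / sqrt(2 * words), 3))
}

# Inputs taken from the blogs and total columns of the table
blogs_stats <- lexical_stats(types = 13822, words = 185859)
total_stats <- lexical_stats(types = 26956, words = 499872)
print(rbind(blogs = blogs_stats, total = total_stats))
```

The results agree with the blogs and total columns of the table. The merged sample has the lowest TTR but the highest diversity, because the \(\sqrt{2V}\) denominator grows more slowly than \(V\).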
## [1] "When creating a frequency dictionary of 26956 unique words from this dfm, it will take 587 words for 50 per cent, 6907 words for 90 per cent and 21460 words for 98 per cent coverage. "
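The coverage figures quoted above can be obtained by sorting the word frequencies in decreasing order and counting how many words are needed before the cumulative share reaches the target. The following is a sketch on made-up counts, not the author's code; `words_for_coverage` and the toy frequencies are assumptions.

```r
# Hypothetical helper: number of most frequent words needed to reach a coverage share
words_for_coverage <- function(freq, coverage) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  # Position of the first word at which the running share reaches the target
  which(cum_share >= coverage)[[1]]
}

# Toy frequency table (illustrative values only)
toy <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)
n50 <- words_for_coverage(toy, 0.5)
n90 <- words_for_coverage(toy, 0.9)
print(c(n50 = n50, n90 = n90))
```

On real text the frequency distribution has a long tail, which is why the last few per cent of coverage require far more words than the first 50 per cent.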
Unigrams contain one word, bigrams contain two words and trigrams contain three words. The pattern extends to n-grams of any length n. In the following sections, unigrams, bigrams and trigrams are extracted from the data sample and presented as plots of the top features and as word clouds.
top.features.plot
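options(warn = -1)  # silence warnings while drawing the word cloud
library(RColorBrewer)
# brewer.pal() needs n >= 3; asking for fewer only warns and returns 3 colours
plot(DLCdfm, max.words = 75, colors = brewer.pal(3, "Spectral"), scale = c(8, .5), vfont = c("serif", "plain"))
options(warn = 0)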
##
bfp.plot
##
tfp.plot
options(warn=-1)
require(RColorBrewer)
plot(dfm.trigram, max.words = 75, colors = brewer.pal(6, "Dark2"), scale = c(8, .5) ,vfont=c("gothic english","plain"))
options(warn=0)
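The code that builds the bigram and trigram feature matrices is not shown in this report. As a minimal sketch of the idea, assuming whitespace tokenization and an underscore joiner, n-grams can be formed in base R by sliding a window of length n over the token vector. The function name `make_ngrams` is hypothetical.

```r
# Hypothetical helper: all n-grams of a token vector, joined with "_"
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens   <- strsplit("the cat sat on the mat", " ")[[1]]
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Frequency table of the bigrams, most frequent first
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigrams)
print(trigrams)
```

A sentence of k tokens yields k - n + 1 n-grams, so the six-token example produces five bigrams and four trigrams.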
proc.time() - ptm
## user system elapsed
## 244.31 6.16 253.81