SwiftKey is an input method for Android and iOS devices, such as smartphones and tablets. It uses a blend of artificial intelligence technologies to predict the next word the user intends to type. SwiftKey learns from previous SMS messages and outputs predictions based on the currently entered text and what it has learned.[5]
The SwiftKey data used for this project consists of three large sample files: blogs.txt, containing text sampled arbitrarily from web logs; twitter.txt, taken from Twitter messages; and news.txt, sampled from recent newspaper and internet news stories.
## [1] "C:/Albert/Coursera Data Science/Data Science Capstone/Project/data/SwiftKey 1/en_US"
## file size(MB) num_lines longest_line num_words
## 1 .//en_US.blogs.txt 200.42 899288 483415 37334441
## 2 .//en_US.news.txt 196.28 77259 14556 2643972
## 3 .//en_US.twitter.txt 159.36 2360148 1484357 30373792
Viewing the output above, it is evident that the three files are very large and contain an enormous amount of data: roughly half a gigabyte in total, comprising over seventy million English words.
For this analysis, each of the three files is sampled using a binomial random sampling algorithm. Each sample is limited to approximately 0.5 per cent of the total file data.
Each sample is filtered by converting from UTF-8 to ASCII, which removes any non-standard characters. The data is then preprocessed to remove profanity, expand contractions and fix other anomalies, and it is converted to lower case.
set.seed(11149)
#take .5 % of each dataset
blogs.samp<-blogs[as.logical(rbinom(length(blogs),1,0.005))]
news.samp<-news[as.logical(rbinom(length(news),1,0.005))]
twitter.samp<-twitter[as.logical(rbinom(length(twitter),1,0.005))]
# convert from UTF-8 to ASCII, dropping non-convertible characters
blogs.samp <- iconv(blogs.samp, "UTF-8", "ASCII", sub = "")
news.samp <- iconv(news.samp, "UTF-8", "ASCII", sub = "")
twitter.samp <- iconv(twitter.samp, "UTF-8", "ASCII", sub = "")
#preprocess
blogs.samp<-preprocess_data(blogs.samp)
news.samp<-preprocess_data(news.samp)
twitter.samp<-preprocess_data(twitter.samp)
#save
write(blogs.samp, "../../data/Wk2/blogs.samp.txt")
write(news.samp, "../../data/Wk2/news.samp.txt")
write(twitter.samp, "../../data/Wk2/twitter.samp.txt")
mergedData <- c(blogs.samp, news.samp, twitter.samp)   # merge the three samples
writeLines(mergedData, "../../data/Wk2/mergedData.txt")
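The preprocess_data() function is defined elsewhere in the project code and is not shown here. As a rough, hypothetical sketch of the kind of cleaning described above (lower-casing, contraction expansion, profanity removal), assuming a placeholder profanity list:
# Hypothetical sketch only; the project's actual preprocess_data() is defined elsewhere
preprocess_sketch <- function(x, profanity = c("badword1", "badword2")) {
  x <- tolower(x)                                   # convert to lower case
  x <- gsub("n't\\b", " not", x)                    # expand a few common contractions
  x <- gsub("'re\\b", " are", x)
  x <- gsub(paste0("\\b(", paste(profanity, collapse = "|"), ")\\b"), " ", x)  # drop profanity
  x <- gsub("[^a-z' ]", " ", x)                     # remove other anomalies
  gsub("\\s+", " ", trimws(x))                      # collapse repeated whitespace
}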
The three sample files, blogs.samp.txt, news.samp.txt and twitter.samp.txt, are merged into a single file, mergedDataPrep.txt, which is used for most of the subsequent analysis. Note that the mergedDataPrep data characteristics are the sums of those of the three sample files.
## file size(MB) num_lines longest_line num_words
## 1 ./data/Wk2/blogs.samp.txt 0.96 4481 2053 185859
## 2 ./data/Wk2/news.samp.txt 0.92 5097 149 167552
## 3 ./data/Wk2/twitter.samp.txt 0.73 11864 7348 146461
write_output_file(mergedData, '../../data/Wk2/mergedDataPrep.txt')  # write the merged, preprocessed training file
display_file_stats('../../data/Wk2/mergedDataPrep.txt', 0)          # report its size, line, word and longest-line statistics
## [1] "THE TRAINING FILE STATISTICS"
## [1] "mergedDataPrep.txt, SIZE: (2.61 MB), LINES: (21442), WORDS: (499872), LONGEST LINE: (2053)"
A Corpus is defined as a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.[6]
This corpus conversion renders the data in a prescribed vector format that can be used by other software algorithms.
DLCorp <- corpus(mergedData) # build the corpus
TwitCorp <- corpus(twitter.samp)
BlogsCorp <- corpus(blogs.samp)
NewsCorp <- corpus(news.samp)
summary(DLCorp, n = 5)
## Corpus consisting of 21442 documents, showing 5 documents.
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Text Types Tokens Sentences
## text1 33 39 1
## text2 47 60 1
## text3 61 76 1
## text4 9 10 1
## text5 28 35 1
##
## Source: C:/Albert/Coursera Data Science/Data Science Capstone/Project/code/Wk2/* on x86-64 by tbalg
## Created: Thu Nov 17 10:53:04 2016
## Notes:
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. They are useful in the field of natural language processing.[7]
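As an illustration of one such weighting scheme, the sketch below computes tf-idf weights for a small, made-up document-term matrix in base R; the documents and terms are invented for the example.
# Toy document-term matrix: rows are documents, columns are term counts (illustrative only)
dtm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("cheap", "buy", "now")))
tf    <- dtm / rowSums(dtm)                 # term frequency within each document
idf   <- log(nrow(dtm) / colSums(dtm > 0))  # inverse document frequency of each term
tfidf <- sweep(tf, 2, idf, `*`)             # tf-idf weight for every entry
round(tfidf, 3)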
In search engine terminology, stemming is the comparison of a search engine query to the root form of a word used in the query. For example, a user may search for the term “cheaper,” but a search engine that uses stemming technology may return search results for any word that contains the root form of the word (e.g. cheap, cheapen, cheaper).[8]
Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.[4]
Twitdfm <- dfm(TwitCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
Blogdfm <- dfm(BlogsCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
Newsdfm <- dfm(NewsCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
DLCdfm <- dfm(DLCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
topfeatures(DLCdfm, 20)
## one will said just get like go time can day year love
## 1583 1566 1546 1534 1507 1487 1335 1267 1212 1154 1139 1074
## make dont good now new know work peopl
## 1019 960 954 945 939 934 895 800
Before examining the Data Dictionary, let's examine the training data that we have produced. In the mergedDataPrep corpus output displayed below, we see the headings Documents, Vocabulary, Word Types, TTR and Diversity. Documents is the number of sampled lines contained in the corpus, each treated as a document. Vocabulary is the total number of words (tokens) in the sample. Word Types is the total number of unique words in the sample. The type-to-token ratio (TTR) has the number of unique words (types) in the numerator and the total number of words (tokens) in the denominator.[1]
Diversity is defined as "a measure of vocabulary diversity that is approximately independent of sample size": the number of unique word types divided by the square root of twice the total number of words in the sample, \(Div = \frac{T}{\sqrt{2V}}\). When the three samples are combined, the resulting diversity is larger.[2]
The diversity analysis makes the dictionary easier to understand. In the sample dfm being used there are 499872 words (V) in total, 36796 of which are unique word types (T). This means that V is the grand total of the number of times each of the T word types is used in our sample dfm. The plot below shows the required word types as a function of the percentage of the Frequency Data Dictionary covered.
## blogs news twitter total
## Documents "4481" "5097" "11864" "21442"
## Vocabulary(V) "185859" "167552" "146461" "499872"
## Word Types (T) "13822" "14744" "12394" "26956"
## TTR (T/V) "0.074" "0.088" "0.085" "0.054"
## Diversity "22.671" "25.47" "22.9" "26.959"
## [1] "When creating a frequency dictionary of 26956 unique words from this dfm, it will take 587 words for 50 per cent, 6907 words for 90 per cent and 21460 words for 98 per cent coverage. "
N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). [9]
Unigrams contain one word, bigrams contain two words and trigrams contain three words, and this may be continued for as many words (n-grams) as desired. For example, the bigrams of "I am Sam. Sam I am." include "I am", "am Sam" and "Sam I", as the sketch below shows.
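As a minimal base-R sketch of that idea (separate from the quanteda tooling used in the rest of this report), the following slides a window of n words over the toy sentence, moving one word forward each time:
tokens <- c("i", "am", "sam", "sam", "i", "am")   # toy sentence, lower-cased
ngrams <- function(tokens, n) {                   # slide an n-word window, one word at a time
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
ngrams(tokens, 1)   # unigrams
ngrams(tokens, 2)   # bigrams: "i am" "am sam" "sam sam" "sam i" "i am"
ngrams(tokens, 3)   # trigrams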
In the following sections, unigrams, bigrams and trigrams are taken from the data sample and presented as top-feature plots, frequency plots, frequency-of-frequency plots and word clouds.
The frequency plots are similar to the histogram plots, except that the entire set of data is used. Because of this, the base-10 logarithm of the word-type frequencies is used to smooth the data into a more revealing presentation.
The frequency-of-frequency plot, or more intuitively the occurrence-vs-frequency plot, displays the number of times a given word-type frequency occurs in the sample. A logarithmic version of this plot is also provided.
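A small sketch of how these two views can be built from a named vector of word-type frequencies (illustrative values only):
freqs <- c(the = 120, to = 85, and = 60, sam = 5, green = 2, eggs = 1, ham = 1)  # illustrative
# frequency plot on a log10 scale: one bar per word type, ordered by frequency
barplot(log10(sort(freqs, decreasing = TRUE)), las = 2, ylab = "log10(frequency)")
# frequency of frequencies: how many word types occur exactly k times
table(freqs)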
N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking and text summarization.
Another use of n-grams is for developing features for supervised Machine Learning models such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams.[9]
top.features.plot
options(warn=-1)                  # suppress palette-size warnings from brewer.pal
require(RColorBrewer)
# word cloud of the 75 most frequent stems in the merged dfm
plot(DLCdfm, max.words = 75, colors = brewer.pal(1, "Spectral"), scale = c(8, .5), vfont=c("serif","plain"))
options(warn=0)
bfp.plot
bf.df<-bigram.frequency
bf.df<-data.frame(bf.df)
bf.df$bigram<-rownames(bf.df)
tfp.plot
options(warn=-1)
require(RColorBrewer)
plot(dfm.trigram, max.words = 75, colors = brewer.pal(6, "Dark2"), scale = c(8, .5) ,vfont=c("gothic english","plain"))
options(warn=0)
## user system elapsed
## 241.62 12.42 259.28
The data has already been filtered during the UTF-8 to ASCII conversion; any word suspected of containing a non-standard character was eliminated.
Unigrams, bigrams, trigrams, 4-grams and so on can be used in the model. In general, n-grams are considered an insufficient model of language because they cannot capture long-range dependencies. But for our short sentence or phrase application they should suffice. My model will implement up to 4-gram word types.
Words that are not contained in the dictionary can be accounted for by adding an unknown word type. These unknown word types share the same probabilities as words that occur only once, so a non-zero probability can be assigned to them.[3]
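A hedged sketch of this idea: words seen only once in an illustrative frequency table are collapsed into an <UNK> type, whose relative frequency then serves as a non-zero probability for unseen words.
counts <- c(the = 50, cat = 3, sat = 2, mat = 1, hat = 1)    # illustrative unigram counts
singletons <- names(counts)[counts == 1]                     # words that occur only once
unk_count  <- sum(counts[singletons])
counts <- c(counts[!(names(counts) %in% singletons)], "<UNK>" = unk_count)
probs  <- counts / sum(counts)
probs["<UNK>"]                                               # non-zero probability for unknown words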
Performance is based on measurement and evaluation. For starters, proc.time() can be used to measure execution time; this is one of the fundamental issues.
The other major issue is space, or size, which limits the usefulness of the application: can it execute on a mobile phone, or does it need a mainframe? The object.size() function reports the number of bytes that an R object occupies in memory and can prove helpful here.
Other profiling tools include the Rprof() function, which runs the R profiler and can be used to determine where bottlenecks in a function may exist; the profr package (available on CRAN) provides additional tools for visualizing and summarizing profiling data.
Finally, the gc() function runs the garbage collector to reclaim unused memory for R and, in the process, reports how much memory R is currently using.
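A minimal sketch showing how these base-R functions fit together; the workload (generating and sorting random numbers) is only a stand-in:
t0 <- proc.time()
x  <- rnorm(1e6)                            # stand-in workload
proc.time() - t0                            # user / system / elapsed seconds
object.size(x)                              # bytes occupied by the object in memory
Rprof("profile.out")                        # start the sampling profiler
invisible(replicate(20, sort(rnorm(1e6))))  # something worth profiling
Rprof(NULL)                                 # stop profiling
summaryRprof("profile.out")$by.self         # where the time was spent
gc()                                        # run the garbage collector and report memory use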
Backoff modeling, in its simplest sense, uses n-grams from largest to smallest n: if trigrams fail, switch to bigrams; if bigrams fail, switch to unigrams. An alternative to backoff modeling is interpolation, which mixes trigram, bigram and unigram estimates to predict the outcome.[3]
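A minimal sketch of simple backoff, using hypothetical named frequency tables (uni.freq, bi.freq and tri.freq are assumptions for illustration, not objects built in this report):
uni.freq <- c("sam" = 5, "i" = 4, "am" = 4)                 # hypothetical n-gram counts
bi.freq  <- c("i am" = 3, "am sam" = 2, "sam i" = 1)
tri.freq <- c("i am sam" = 2, "am sam sam" = 1)
# predict the next word from the last two words, backing off from trigrams to bigrams to unigrams
predict_next <- function(w1, w2) {
  tri <- tri.freq[grep(paste0("^", w1, " ", w2, " "), names(tri.freq))]
  if (length(tri) > 0) return(tail(strsplit(names(which.max(tri)), " ")[[1]], 1))
  bi <- bi.freq[grep(paste0("^", w2, " "), names(bi.freq))]
  if (length(bi) > 0) return(tail(strsplit(names(which.max(bi)), " ")[[1]], 1))
  names(which.max(uni.freq))                                # final fallback: most frequent unigram
}
predict_next("i", "am")                                     # returns "sam" here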
[2]http://www.modsimworld.org/papers/2015/Natural_Language_Processing.pdf
[3]http://online.stanford.edu/course/natural-language-processing
[4]https://en.wikipedia.org/wiki/Stemming
[5]https://en.wikipedia.org/wiki/SwiftKey
[6]https://cran.r-project.org/web/packages/quanteda/vignettes/quickstart.html#corpus-management-tools
[7]https://en.wikipedia.org/wiki/Document-term_matrix
[8]http://www.webopedia.com/TERM/S/stemming.html
[9]http://text-analytics101.rxnlp.com/2014/11/what-are-n-grams.html