SwiftKey is an input method for Android and iOS devices, such as smartphones and tablets. It uses a blend of artificial intelligence technologies to predict the next word the user intends to type. SwiftKey learns from previous SMS messages and outputs predictions based on the currently entered text and what it has learned.[5]
The SwiftKey data used for this project consists of three large sample files: blogs.txt, containing text sampled arbitrarily from web logs; twitter.txt, taken from Twitter messages; and news.txt, sampled from recent newspaper and internet news stories.
## [1] "C:/Albert/Coursera Data Science/Data Science Capstone/Project/data/SwiftKey 1/en_US"
## file size(MB) num_lines longest_line num_words
## 1 .//en_US.blogs.txt 200.42 899288 483415 37334441
## 2 .//en_US.news.txt 196.28 77259 14556 2643972
## 3 .//en_US.twitter.txt 159.36 2360148 1484357 30373792
Viewing the output above, it is evident that the three files are very large and contain an enormous amount of data: roughly half a gigabyte in total, comprising over seventy million English words.
For this analysis, each of the three files is sampled using a binomial random sampling algorithm. Each sample is limited to approximately 0.5 per cent of the total file data.
Each sample is filtered by converting from UTF-8 to ASCII, which removes any non-standard characters. The data is then preprocessed to remove profanity, expand contractions and fix other anomalies, and it is converted to lower case.
set.seed(11149)
#take .5 % of each dataset
blogs.samp<-blogs[as.logical(rbinom(length(blogs),1,0.005))]
news.samp<-news[as.logical(rbinom(length(news),1,0.005))]
twitter.samp<-twitter[as.logical(rbinom(length(twitter),1,0.005))]
# convert from UTF-8 to ASCII, dropping non-convertible characters
blogs.samp <- iconv(blogs.samp, "UTF-8", "ASCII", sub = "")
news.samp <- iconv(news.samp, "UTF-8", "ASCII", sub = "")
twitter.samp <- iconv(twitter.samp, "UTF-8", "ASCII", sub = "")
#preprocess
blogs.samp<-preprocess_data(blogs.samp)
news.samp<-preprocess_data(news.samp)
twitter.samp<-preprocess_data(twitter.samp)
#save
write(blogs.samp, "../../data/Wk2/blogs.samp.txt")
write(news.samp, "../../data/Wk2/news.samp.txt")
write(twitter.samp, "../../data/Wk2/twitter.samp.txt")
mergedData <- c(blogs.samp, news.samp, twitter.samp)   # merge the three samples
writeLines(mergedData, "../../data/Wk2/mergedData.txt")
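The preprocess_data() function is defined elsewhere in the project code and is not shown here. As a rough, hypothetical sketch of the kind of cleaning described above (lower-casing, contraction expansion, profanity removal), assuming a placeholder profanity list:
# Hypothetical sketch only; the project's actual preprocess_data() is defined elsewhere
preprocess_sketch <- function(x, profanity = c("badword1", "badword2")) {
  x <- tolower(x)                                   # convert to lower case
  x <- gsub("n't\\b", " not", x)                    # expand a few common contractions
  x <- gsub("'re\\b", " are", x)
  x <- gsub(paste0("\\b(", paste(profanity, collapse = "|"), ")\\b"), " ", x)  # drop profanity
  x <- gsub("[^a-z' ]", " ", x)                     # remove other anomalies
  gsub("\\s+", " ", trimws(x))                      # collapse repeated whitespace
}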
The three sample files, blogs.samp.txt, news.samp.txt and twitter.samp.txt, are merged into a single file, mergedDataPrep.txt, which is used for most of the subsequent analysis. Note that the mergedDataPrep data characteristics are the sums of those of the three sample files.
## file size(MB) num_lines longest_line num_words
## 1 ./data/Wk2/blogs.samp.txt 0.96 4481 2053 185859
## 2 ./data/Wk2/news.samp.txt 0.92 5097 149 167552
## 3 ./data/Wk2/twitter.samp.txt 0.73 11864 7348 146461
write_output_file(mergedData, '../../data/Wk2/mergedDataPrep.txt')  # write the merged, preprocessed training file
display_file_stats('../../data/Wk2/mergedDataPrep.txt', 0)          # report its size, line, word and longest-line statistics
## [1] "THE TRAINING FILE STATISTICS"
## [1] "mergedDataPrep.txt, SIZE: (2.61 MB), LINES: (21442), WORDS: (499872), LONGEST LINE: (2053)"
A Corpus is defined as a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.
A corpus is designed to be a “library” of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.[6]
This corpus conversion renders the data in a prescribed vector format that can be used by other software algorithms.
DLCorp <- corpus(mergedData) # build the corpus
TwitCorp <- corpus(twitter.samp)
BlogsCorp <- corpus(blogs.samp)
NewsCorp <- corpus(news.samp)
summary(DLCorp, n = 5)
## Corpus consisting of 21442 documents, showing 5 documents.
## Warning in nsentence.character(object, ...): nsentence() does not correctly
## count sentences in all lower-cased text
## Text Types Tokens Sentences
## text1 33 39 1
## text2 47 60 1
## text3 61 76 1
## text4 9 10 1
## text5 28 35 1
##
## Source: C:/Albert/Coursera Data Science/Data Science Capstone/Project/code/Wk2/* on x86-64 by tbalg
## Created: Thu Nov 17 10:53:04 2016
## Notes:
A document-term matrix or term-document matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. There are various schemes for determining the value that each entry in the matrix should take. One such scheme is tf-idf. They are useful in the field of natural language processing.[7]
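As an illustration of one such weighting scheme, the sketch below computes tf-idf weights for a small, made-up document-term matrix in base R; the documents and terms are invented for the example.
# Toy document-term matrix: rows are documents, columns are term counts (illustrative only)
dtm <- matrix(c(2, 0, 1,
                0, 3, 1,
                1, 1, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("doc", 1:3), c("cheap", "buy", "now")))
tf    <- dtm / rowSums(dtm)                 # term frequency within each document
idf   <- log(nrow(dtm) / colSums(dtm > 0))  # inverse document frequency of each term
tfidf <- sweep(tf, 2, idf, `*`)             # tf-idf weight for every entry
round(tfidf, 3)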
In search engine terminology, stemming is the comparison of a search engine query to the root form of a word used in the query. For example, a user may search for the term “cheaper,” but a search engine that uses stemming technology may return search results for any word that contains the root form of the word (e.g. cheap, cheapen, cheaper).[8]
Algorithms for stemming have been studied in computer science since the 1960s. Many search engines treat words with the same stem as synonyms as a kind of query expansion, a process called conflation.[4]
Twitdfm <- dfm(TwitCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
Blogdfm <- dfm(BlogsCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
Newsdfm <- dfm(NewsCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
DLCdfm <- dfm(DLCorp, ignoredFeatures = stopwords("english"), stem = TRUE, verbose = FALSE)
topfeatures(DLCdfm, 20)
## one will said just get like go time can day year love
## 1583 1566 1546 1534 1507 1487 1335 1267 1212 1154 1139 1074
## make dont good now new know work peopl
## 1019 960 954 945 939 934 895 800
Before examining the Data Dictionary, let's examine the training data that we have produced. In the mergedDataPrep corpus output displayed below, we see the headings Documents, Vocabulary, Word Types, TTR and Diversity. Documents is the number of sampled lines contained in the corpus, each treated as a document. Vocabulary is the total number of words (tokens) in the sample. Word Types is the total number of unique words in the sample. The type-to-token ratio (TTR) has the number of unique words (types) in the numerator and the total number of words (tokens) in the denominator.[1]
Diversity is defined as "a measure of vocabulary diversity that is approximately independent of sample size": the number of unique word types divided by the square root of twice the total number of words in the sample, \(Div = \frac{T}{\sqrt{2V}}\). When the three samples are combined, the resulting diversity is larger.[2]
The diversity analysis makes the dictionary easier to understand. In the sample dfm being used there are 499872 words (V) in total, 36796 of which are unique word types (T). This means that V is the grand total of the number of times each of the T word types is used in our sample dfm. The plot below shows the required word types as a function of the percentage of the Frequency Data Dictionary covered.
## blogs news twitter total
## Documents "4481" "5097" "11864" "21442"
## Vocabulary(V) "185859" "167552" "146461" "499872"
## Word Types (T) "13822" "14744" "12394" "26956"
## TTR (T/V) "0.074" "0.088" "0.085" "0.054"
## Diversity "22.671" "25.47" "22.9" "26.959"
## [1] "When creating a frequency dictionary of 26956 unique words from this dfm, it will take 587 words for 50 per cent, 6907 words for 90 per cent and 21460 words for 98 per cent coverage. "
N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). [9]
Unigrams contain one word, bigrams contain two words and trigrams contain three words, and this may be continued for as many words (n-grams) as desired. For example, the bigrams of "I am Sam. Sam I am." include "I am", "am Sam" and "Sam I", as the sketch below shows.
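As a minimal base-R sketch of that idea (separate from the quanteda tooling used in the rest of this report), the following slides a window of n words over the toy sentence, moving one word forward each time:
tokens <- c("i", "am", "sam", "sam", "i", "am")   # toy sentence, lower-cased
ngrams <- function(tokens, n) {                   # slide an n-word window, one word at a time
  if (length(tokens) < n) return(character(0))
  sapply(seq_len(length(tokens) - n + 1),
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
}
ngrams(tokens, 1)   # unigrams
ngrams(tokens, 2)   # bigrams: "i am" "am sam" "sam sam" "sam i" "i am"
ngrams(tokens, 3)   # trigrams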
In the following sections, unigrams, bigrams and trigrams are taken from the data sample and presented as top-feature plots, frequency plots, frequency-of-frequency plots and word clouds.
The frequency plots are similar to the histogram plots, except that the entire set of data is used. Because of this, the base-10 logarithm of the word-type frequencies is used to smooth the data into a more revealing presentation.
The frequency-of-frequency plot, or more intuitively the occurrence-vs-frequency plot, displays the number of times a given word-type frequency occurs in the sample. A logarithmic version of this plot is also provided.
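A small sketch of how these two views can be built from a named vector of word-type frequencies (illustrative values only):
freqs <- c(the = 120, to = 85, and = 60, sam = 5, green = 2, eggs = 1, ham = 1)  # illustrative
# frequency plot on a log10 scale: one bar per word type, ordered by frequency
barplot(log10(sort(freqs, decreasing = TRUE)), las = 2, ylab = "log10(frequency)")
# frequency of frequencies: how many word types occur exactly k times
table(freqs)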
N-grams are used for a variety of different tasks. For example, when developing a language model, n-grams are used to develop not just unigram models but also bigram and trigram models. Google and Microsoft have developed web scale n-gram models that can be used in a variety of tasks such as spelling correction, word breaking and text summarization.
Another use of n-grams is for developing features for supervised Machine Learning models such as SVMs, MaxEnt models, Naive Bayes, etc. The idea is to use tokens such as bigrams in the feature space instead of just unigrams.[9]
top.features.plot
options(warn=-1)                  # suppress palette-size warnings from brewer.pal
require(RColorBrewer)
# word cloud of the 75 most frequent stems in the merged dfm
plot(DLCdfm, max.words = 75, colors = brewer.pal(1, "Spectral"), scale = c(8, .5), vfont=c("serif","plain"))
options(warn=0)
bfp.plot
bf.df<-bigram.frequency
bf.df<-data.frame(bf.df)
bf.df$bigram<-rownames(bf.df)
tfp.plot
options(warn=-1)
require(RColorBrewer)
plot(dfm.trigram, max.words = 75, colors = brewer.pal(6, "Dark2"), scale = c(8, .5) ,vfont=c("gothic english","plain"))
options(warn=0)
## user system elapsed
## 241.62 12.42 259.28
The data has already been filtered during the UTF-8 to ASCII conversion; any word suspected of containing a non-standard character was eliminated.
Unigrams, bigrams, trigrams, 4-grams and so on can be used in the model. In general, n-grams are considered an insufficient model of language because they cannot capture long-range dependencies. But for our short sentence or phrase application they should suffice. My model will implement up to 4-gram word types.
Words that are not contained in the dictionary can be accounted for by adding an unknown word type. These unknown word types share the same probabilities as words that occur only once, so a non-zero probability can be assigned to them.[3]
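A hedged sketch of this idea: words seen only once in an illustrative frequency table are collapsed into an <UNK> type, whose relative frequency then serves as a non-zero probability for unseen words.
counts <- c(the = 50, cat = 3, sat = 2, mat = 1, hat = 1)    # illustrative unigram counts
singletons <- names(counts)[counts == 1]                     # words that occur only once
unk_count  <- sum(counts[singletons])
counts <- c(counts[!(names(counts) %in% singletons)], "<UNK>" = unk_count)
probs  <- counts / sum(counts)
probs["<UNK>"]                                               # non-zero probability for unknown words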
Performance is based on measurement and evaluation. For starters, proc.time() can be used to measure execution time; this is one of the fundamental issues.
The other major issue is space, or size, which limits the usefulness of the application: can it execute on a mobile phone, or does it need a mainframe? The object.size() function reports the number of bytes that an R object occupies in memory and can prove helpful here.
Other profiling tools include the Rprof() function, which runs the R profiler and can be used to determine where bottlenecks in a function may exist; the profr package (available on CRAN) provides additional tools for visualizing and summarizing profiling data.
Finally, the gc() function runs the garbage collector to reclaim unused memory for R and, in the process, reports how much memory R is currently using.
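A minimal sketch showing how these base-R functions fit together; the workload (generating and sorting random numbers) is only a stand-in:
t0 <- proc.time()
x  <- rnorm(1e6)                            # stand-in workload
proc.time() - t0                            # user / system / elapsed seconds
object.size(x)                              # bytes occupied by the object in memory
Rprof("profile.out")                        # start the sampling profiler
invisible(replicate(20, sort(rnorm(1e6))))  # something worth profiling
Rprof(NULL)                                 # stop profiling
summaryRprof("profile.out")$by.self         # where the time was spent
gc()                                        # run the garbage collector and report memory use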
Backoff modeling, in its simplest sense, uses n-grams from largest to smallest n: if trigrams fail, switch to bigrams; if bigrams fail, switch to unigrams. An alternative to backoff modeling is interpolation, which mixes trigram, bigram and unigram estimates to predict the outcome.[3]
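A minimal sketch of simple backoff, using hypothetical named frequency tables (uni.freq, bi.freq and tri.freq are assumptions for illustration, not objects built in this report):
uni.freq <- c("sam" = 5, "i" = 4, "am" = 4)                 # hypothetical n-gram counts
bi.freq  <- c("i am" = 3, "am sam" = 2, "sam i" = 1)
tri.freq <- c("i am sam" = 2, "am sam sam" = 1)
# predict the next word from the last two words, backing off from trigrams to bigrams to unigrams
predict_next <- function(w1, w2) {
  tri <- tri.freq[grep(paste0("^", w1, " ", w2, " "), names(tri.freq))]
  if (length(tri) > 0) return(tail(strsplit(names(which.max(tri)), " ")[[1]], 1))
  bi <- bi.freq[grep(paste0("^", w2, " "), names(bi.freq))]
  if (length(bi) > 0) return(tail(strsplit(names(which.max(bi)), " ")[[1]], 1))
  names(which.max(uni.freq))                                # final fallback: most frequent unigram
}
predict_next("i", "am")                                     # returns "sam" here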
[2]http://www.modsimworld.org/papers/2015/Natural_Language_Processing.pdf
[3]http://online.stanford.edu/course/natural-language-processing
[4]https://en.wikipedia.org/wiki/Stemming
[5]https://en.wikipedia.org/wiki/SwiftKey
[6]https://cran.r-project.org/web/packages/quanteda/vignettes/quickstart.html#corpus-management-tools
[7]https://en.wikipedia.org/wiki/Document-term_matrix
[8]http://www.webopedia.com/TERM/S/stemming.html
[9]http://text-analytics101.rxnlp.com/2014/11/what-are-n-grams.html