1. Introduction

The goal of this project is to build a predictive-text Shiny web application that takes a phrase of one or more words as input and predicts the next word as output.

The project file contains large compilations of text from the HC Corpora corpus. This corpus is compiled from three sources and is available in four languages:

Sources: Blogs, News, Twitter

Languages: English (en_US), German (de_DE), Russian (ru_RU), and Finnish (fi_FI)

2. Executive Summary

This milestone report presents the key insights from the exploratory analysis of the raw and sampled English datasets, along with the approach and plans for developing the algorithm behind the next-word prediction application ("NWP App") on Shiny.

The data from the three English (en_US) text documents (Blogs, News and Twitter) was analyzed using the R tm and RWeka packages. The data was then sampled, processed, cleaned and broken down into N-grams. These N-grams will form the basis of our predictive algorithm.

The following steps were performed in the Exploratory Data Analysis exercise on the corpora:

  1. Read the corpora
  2. Sample the data
  3. Build the corpus from the samples
  4. Clean the data and remove profanity words
  5. Build the N-Gram tokens
  6. Construct the word-frequency data frames
  7. Plot the word-frequency histograms and word clouds
  8. Calculate number of words required for percent coverage of text

3. Pre-requisites

Before starting the analysis, I set the working directory and the system locale, and installed the packages listed below in RStudio.

##--!! SET YOUR CURRENT WORKING DIRECTORY
setwd("C:/Users/ABC/Desktop/Coursera/Capstone")

##--!! SET ASPECTS OF THE LOCALE FOR THE R PROCESS
Sys.setlocale(category = "LC_ALL", locale = "English")
 

##--!! INSTALL THE FOLL PACKAGES
install.packages("stringi")     # For fast text/string manipulation
install.packages("stringr")     # For wrapping common text/string manipulation
install.packages("dplyr")       # For data manipulation
install.pacakges("NLP")         # For Natural Language Processing techniques
install.packages("tm")          # For basic text-mining
install.packages("slam")        # For Sparse Matrix Arithmetics
install.packages("SnowballC")   # For Word-Stemming
install.packages("wordcloud")   # For visualizing wordclouds
install.packages("RColorBrewer")# For visualizing Color Palettes in Plots
install.packages("rJava")       # For initializing JAVA VM
install.packages("RWeka")       # For N-gram generation and tokenization
install.packages("ggplot2")     # For plotting elegant charts,graphs
install.packages("grid")        # For grid graphics
install.packages("gridExtra")   # For arranging plots in a grid
install.packages("scales")      # For generic plot scaling methods
install.packages("knitr")       # For dynamic reports generation
install.packages("xtable")      # For printing out tables

source("http://bioconductor.org/biocLite.R") # For installing Rgraphviz as it is not a CRAN package
biocLite("Rgraphviz")                        # For plotting word correlations
install.packages("markdown")    # For 'Markdown' Rendering 
install.packages("qdap")        # For  Quantitative discourse analysis of transcripts.
install.packages("R.utils")     # For  Various Programming Utilities

The next step is to load the libraries below and set the options.

# Clear the workspace
rm(list = ls(all = TRUE))

options(warn = -1) # Suppress warnings

##--!! LOAD THE LIBRARIES

suppressMessages( library(stringi))
suppressMessages( library(stringr))
suppressMessages( library(dplyr))
suppressMessages( library(NLP))
suppressMessages( library(tm))
suppressMessages( library(slam))
suppressMessages( library(SnowballC))
suppressMessages( library(wordcloud))
suppressMessages( library(RColorBrewer))
suppressMessages( library(rJava))
suppressMessages( library(RWeka))
suppressMessages( library(ggplot2))
suppressMessages( library(grid))
suppressMessages( library(gridExtra))
suppressMessages( library(scales))
suppressMessages( library(knitr))
suppressMessages( library(markdown))
suppressMessages( library(xtable))
suppressMessages( library(Rgraphviz))
suppressMessages( library(qdap))
suppressMessages( library(R.utils))


##--!! GARBAGE COLLECTION
##--!! This function runs the garbage collector to retrieve unused RAM for R. In the process it tells you how much memory is currently being used by R.
gc()
##           used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1057794 56.5    1770749 94.6  1442291 77.1
## Vcells 1156883  8.9    2060183 15.8  1592181 12.2

4. Data Acquisition
4.1 Downloading Data

The dataset for training can be downloaded from the following link:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The unzipped file contains a directory called final with a subdirectory called en_US, which contains the texts that need to be analyzed.

##--!! CODE FOR DOWNLOADING THE FILE

# url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# 
# if (!file.exists("Coursera-SwiftKey.zip")) {
#         download.file(url, "Coursera-SwiftKey.zip")
#         }      

4.2 Extracting Data

##--!! CODE FOR UNZIPPING THE FILE

# unzip("Coursera-SwiftKey.zip")
# 
##--!! CODE FOR LISTING THE FILES

# list.files("final/en_US")

The downloaded file is large (548 MB). After unzipping it, we find the following directories:

[1] "de_DE" "en_US" "fi_FI" "ru_RU"

4.3 Loading and Reading Data in R

The files are read line by line using UTF-8 (Unicode Transformation Format, 8-bit) encoding, as UTF-8 is capable of encoding all possible characters.

##--!! READ THE ORIGINAL DATASETS (only the ENGLISH VERSION) INTO 3 DIFFERENT VECTORS

t1 <- as.numeric(Sys.time()) # Time starts.

DataBlogs <- readLines("final/en_US/en_US.blogs.txt",encoding="UTF-8",warn=FALSE,skipNul = TRUE)

## Open a connection in binary mode as the file contains special characters, and to avoid the "incomplete final line" warning

con <- file("final/en_US/en_US.news.txt",open="rb")
DataNews <- readLines(con,encoding="UTF-8",warn=FALSE,skipNul = TRUE)
close(con)
rm(con)

DataTwitter <- readLines("final/en_US/en_US.twitter.txt",encoding="UTF-8",warn=FALSE,skipNul = TRUE)

t1 <- round(as.numeric(Sys.time() - t1), 2) # Time ends.

4.4 Previewing Raw Data

Let’s preview the first and last few lines of these 3 files to become familiar with the general format and structure of the data.

##--!! DISPLAY THE FIRST FEW LINES FROM THE FILES

t2 <- as.numeric(Sys.time()) # Time starts.

head(DataBlogs,2)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
## [2] "We love you Mr. Brown."
head(DataNews,2)
## [1] "He wasn't home alone, apparently."                                                                                                                        
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
head(DataTwitter,2)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
tail(DataBlogs,2)
## [1] "(5) What's the barrier to entry and why is the business sustainable?"                                                                                                                                                       
## [2] "In response to an over-whelming number of comments we sat down and created a list of do (s) and don<U+0092>t (s) <U+0096> these recommendations are easy to follow and except for - adding some herbs to your rinse . So let<U+0092>s get begin<U+0085>"
tail(DataNews,2)
## [1] "That starts this Sunday at Chivas. The Goats aren't a great team, but they just beat one (a 1-0 win over Salt Lake at Rio Tinto). They also have the one player who can rival Roger Espinoza as \"The Best Guy in MLS That No One Talks About Because He Doesn't Play in New York, LA or the Pacific Northwest\" in goalkeeper Dan Kennedy. These will be tough points."
## [2] "The only outwardly religious adornment was a billboard-sized banner with an image of Our Lady of Charity, patron saint of Cuba, hanging on the side of the National Library."
tail(DataTwitter,2)
## [1] "It is #RHONJ time!!"                                                                                           
## [2] "The key to keeping your woman happy= attention, affection, treat her like a queen and sex her like a pornstar!"
t2 <- round(as.numeric(Sys.time() - t2), 2) # Time ends.

We can see that the Blogs file contains longer, informal text, the News file contains more formal text, and the Twitter file contains short, even more informal text.

5. Data Summaries of Original Raw Data Files

Table 1: Summary Statistics of the Raw Datasets

                              Blogs          News       Twitter
FileSize.MB                  200.42        196.28        159.36
Lines                     899288.00    1010242.00    2360148.00
Words                   37546246.00   34762395.00   30093410.00
Chars                  206824505.00  203223159.00  162096241.00
Words.PerLine                 41.75         34.41         12.75
Chars.PerLine                229.99        201.16         68.68
Max.Words                   6726.00       1796.00         47.00
Max.Chars                  40833.00      11384.00        140.00
Min.Words                      0.00          1.00          1.00
Min.Chars                      1.00          1.00          2.00
LongestLine.RowIndex      483415.00    123628.00         26.00
ShortestLine.RowIndex     278204.00     79323.00      43549.00
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  5275907 281.8    9968622  532.4   9968622  532.4
## Vcells 89970716 686.5  143912772 1098.0 142515677 1087.4
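
The code that produced Table 1 is not reproduced in this report. Below is a minimal sketch of how such statistics could be computed with the stringi package; the helper rawStats and the exact column set are illustrative assumptions, not the code actually used.

##--!! HYPOTHETICAL SKETCH : COMPUTE RAW-FILE SUMMARY STATISTICS (NOT THE EXACT CODE USED FOR TABLE 1)
rawStats <- function(x, path) {
        words <- stri_count_words(x)   # words per line
        chars <- nchar(x)              # characters per line
        c(FileSize.MB   = round(file.info(path)$size / 1024^2, 2),
          Lines         = length(x),
          Words         = sum(words),
          Chars         = sum(chars),
          Words.PerLine = round(mean(words), 2),
          Chars.PerLine = round(mean(chars), 2),
          Max.Words     = max(words),
          Max.Chars     = max(chars),
          Min.Words     = min(words),
          Min.Chars     = min(chars))
}

# rawSummary <- data.frame(Blogs   = rawStats(DataBlogs,   "final/en_US/en_US.blogs.txt"),
#                          News    = rawStats(DataNews,    "final/en_US/en_US.news.txt"),
#                          Twitter = rawStats(DataTwitter, "final/en_US/en_US.twitter.txt"))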

6. Exploratory Data Analysis Part 1 – Visualization with Barplots

Let’s visualize some of the above findings using barplots.

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  5279376 282.0    9968622  532.4   9968622  532.4
## Vcells 89977879 686.5  143912772 1098.0 142515677 1087.4
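
The barplot code is likewise not shown in the report output above. A minimal ggplot2 sketch is given below; the data frame rawCounts is built by hand from the figures in Table 1 and is an assumption about what was plotted.

##--!! HYPOTHETICAL SKETCH : BARPLOTS OF LINES AND WORDS PER SOURCE (VALUES TAKEN FROM TABLE 1)
rawCounts <- data.frame(Source = c("Blogs", "News", "Twitter"),
                        Lines  = c(899288, 1010242, 2360148),
                        Words  = c(37546246, 34762395, 30093410))

p1 <- ggplot(rawCounts, aes(x = Source, y = Lines)) +
        geom_bar(stat = "identity", fill = "steelblue") +
        ggtitle("Lines per Source")

p2 <- ggplot(rawCounts, aes(x = Source, y = Words)) +
        geom_bar(stat = "identity", fill = "darkgreen") +
        ggtitle("Words per Source")

grid.arrange(p1, p2, ncol = 2)   # gridExtra is already loaded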

7. Data Sampling

Since the corpus is huge and processing is time consuming, I sampled 10,000 lines from each of the 3 files and merged them into a single sample dataset for the purpose of this report.

t5 <- as.numeric(Sys.time()) # Time starts.

##--!! REMOVE NON-ASCII CHARACTERS (sub = "" drops them)
DataBlogs <- iconv(DataBlogs, from="latin1", to="ASCII", sub="")
DataNews <- iconv(DataNews, from="latin1", to="ASCII", sub="")
DataTwitter <- iconv(DataTwitter, from="latin1", to="ASCII", sub="")

##--!! GENERATE A RANDOM SAMPLE OF 10000 LINES FROM THE 3 FILES and MERGE INTO A SINGLE SAMPLE DATASET
set.seed(369) # For reproducibility

sampleSize <- 10000 

sampleBlogs <- DataBlogs[sample(1:length(DataBlogs), sampleSize)]

sampleNews <- DataNews[sample(1:length(DataNews), sampleSize)]

sampleTwitter <- DataTwitter[sample(1:length(DataTwitter), sampleSize)]

sampleDoc <- paste(sampleTwitter, sampleNews, sampleBlogs,sep = " ")


##--!! PREVIEW THE TOP AND LAST 2 ROWS

head(sampleDoc,2)
## [1] "is Nestor going to try and choke the shit out of him? I hear he likes to do that to people. Farmers in Japan already use small drones to automatically spray their crops with pesticides, and more recently safety inspectors used them at the crippled Fukushima Daiichi nuclear power plant. Archaeologists in Russia are using small drones and their infrared cameras to construct a 3-D model of ancient burial mounds. Officials in Tampa Bay, Fla., want to use them for security surveillance at next year's Republican National Convention. Russias Volga River is the longest waterway in Europe. It winds its way south from northwest Russia to the Caspian Sea. Many tributaries pour into the Volga, causing its swiftly flowing blue-green waters to rush even faster. Through the centuries, the river has been a major transportation route. Even today, barges carry goods to and from factories along the shore."                                                                                        
## [2] "Yea! We love to hear that! RT: Time to switch to \"Wingman\" app!!! :D In a hypothetical matchup between Obama and Christie, the president would take 55 percent and the governor 38 percent. These two procedures may be overly cautious  done for all to protect the very, very few  but neither have been proven dangerous. An argument could be made for rolling with hospital policy, if only to keep your blood pressure down at a time when youll be dealing with enough stress. If you want to avoid routine post-natal medication, first talk to your obstetrician or midwife well before the birth (before thirty-five weeks) about whether you have a choice in the matter, and if so, what he or she recommends for you. If you feel really strongly about this issue, you could consider a home birth where youd have more say in the matter. Otherwise, you can find safety (or solace?) in numbers: the overwhelming majority of babies are given both treatments, and are apparently none the worse for it."
tail(sampleDoc,2)
## [1] "Having a great day in #Brookhaven Atlanta ! Stop by for some excellent eco-friendly baby gifts! Ellen Tauscher, the U.S. special envoy for strategic stability and missile defense, said no agreement was likely this year because of the U.S. political campaign. \"But in the meantime, we've got a lot of work to do to dispel the mistrust,\" she said. You get the idea. The filmmakers heart might have bee in the right place, but where his mind was is anybodys guess."                                                                                                                                                                                                                                                                                                                                                                                                                                
## [2] "Hi ! How are you?Thank you for follow me. Regularly through the Nov. 2 election, The Chronicle will publish a few of the \"lies, half-truths and contradictions\" uttered by the California statewide campaigns and their supporters from recent days.- Joe Garofoli, jgarofoli@sfchronicle.com Chief of Staff: Peace will be answered with peace, and fire with fire Hamas-linked CAIR holding rally in New York for synagogue bomber Video Compilation Of Rocket Attack Filmed by Civilians AvitalLeibovich: @ANDYLFC2011 if #hamas would spend money on #Gaza rather than extending rockets ranges-#Israel wouldnt need to supply 70%of its electricity AvitalLeibovich: @Kevremo thanks! Turkey rejects Arkia request for extra security Video: PRC Spokesman Admits that Hamas Allows Attacks Against Israel AvitalLeibovich: A few mortars fired from #Gaza into #Israel a short while ago. Cease fire???"
t5 <- round(as.numeric(Sys.time() - t5), 2) # Time ends. 

rm(sampleBlogs,sampleNews,sampleTwitter)

gc()
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  5287775 282.4    9968622  532.4   9968622  532.4
## Vcells 90139599 687.8  143912772 1098.0 142515677 1087.4

8. Data Summaries of Random Sample Dataset before cleaning

9. Creating the Corpus

Preprocessing and cleaning is an important step of text analytics to standardise the input documents.

t7 <- as.numeric(Sys.time()) # Time starts.

##--!! SPLIT THE TEXT PARAGRAPHS INTO SENTENCES
sampleDoc <- sent_detect(sampleDoc, language = "en", model = NULL) 


##--!! NOW BUILD THE CORPUS FROM THE MERGED SAMPLE
docs <- Corpus(VectorSource(list(sampleDoc)))

inspect(docs[1])
## <<VCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 1
## 
## [[1]]
## <<PlainTextDocument>>
## Metadata:  7
## Content:  chars: 1063587
t7 <- round(as.numeric(Sys.time() - t7), 2) # Time ends.  

10. Tidying the Corpus

With the help of the "tm" (Text Mining) package we will clean the words as explained below. The function tm_map() is used to apply a transformation across all documents within a corpus. Other transformations can be implemented as R functions and wrapped in content_transformer() to create a function that can be passed to tm_map().

t8 <- as.numeric(Sys.time() ) # Time starts.  

# Remove URLs -------------------
removeURLs <- function(x) gsub("http[[:alnum:]]*", "", x) 
docs           <- tm_map(docs, content_transformer(removeURLs))  

# Remove metacharacters -------------------
toSpace   <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs    <- tm_map(docs, toSpace,"/|@|\\|\\,|\\:|\\&|\\-|\\)|\\(|\\{|\\}|\\[|\\]|\\+|=|~|<|>|\\^")

# Convert to lower case -------------------
docs <- tm_map(docs, content_transformer(tolower))

# Remove Punctuation -------------------
docs <- tm_map(docs, content_transformer(removePunctuation))

# Remove Numbers -------------------
docs <- tm_map(docs, content_transformer(removeNumbers)) 

# Remove whitespace -------------------
docs <- tm_map(docs, stripWhitespace)

# Remove stopwords --------------------
#docs <- tm_map(docs,removeWords,stopwords("en"))

# Stemming document --------------------
docs   <- tm_map(docs, stemDocument, language="en")

# Create PTD -------------------
docs <- tm_map(docs, PlainTextDocument)

# Create DTM -------------------
dtm <- DocumentTermMatrix(docs)     
dim(dtm)
## [1]     1 14200
inspect(dtm[1,1:10])
## <<DocumentTermMatrix (documents: 1, terms: 10)>>
## Non-/sparse entries: 10/0
## Sparsity           : 0%
## Maximal term length: 9
## Weighting          : term frequency (tf)
## 
##               Terms
## Docs           aaa aaaaandgo aaaahhhhh aadvantag aaron aarti aback abandon
##   character(0)   1         1         1         3     1     1     1       6
##               Terms
## Docs           abbey abbeyroad
##   character(0)     5         1
#findFreqTerms(dtm, lowfreq=15) #terms occurring at least 15 times

t8 <- round(as.numeric(Sys.time() - t8), 2) # Time ends. 

Note that the following words are not removed at this phase: Profanities and Stop Words.

I have not removed phrases containing profanities at this stage because I want the model to predict non-obscene words in place of bad words. Stop words are not removed because they help indicate the words that may follow in the model.

There are 174 stop words identified in the tm text-mining R package, such as i, me, my, myself, we, our, ours, ourselves, you, your, etc.
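
For reference, the stop-word list shipped with tm can be inspected directly with the snippet below.

##--!! INSPECT THE tm ENGLISH STOP-WORD LIST
length(stopwords("en"))    # 174
head(stopwords("en"), 10)  # "i" "me" "my" "myself" "we" "our" "ours" "ourselves" "you" "your"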

11. Tokenization and N-Gram Analysis

An N-gram is a sequence of n words. An n-gram of size 1 is referred to as a “Unigram”, size 2 as a “Bigram”, and size 3 as a “Trigram”.

Tokenization is the process of breaking a stream of text up into sequences of words, phrases, symbols, or other meaningful elements called tokens for statistical analysis and subsequent construction of prediction models.

The term-document matrices will then serve for word prediction in the algorithm to be built in the next phase of the capstone project.

t9 <- as.numeric(Sys.time()) # Time starts

##--!! Converting Corpus to Data Frame for processing by the RWeka functions
cleantext <- data.frame(text=unlist(sapply(docs, `[`, "content")), stringsAsFactors=F) 


# Constructor for tokenization : n = size of the n-gram
## ngram_tokenizer <- function(n) function(x) NGramTokenizer(x, Weka_control(min = n, max = n))

##--!! ====== ====== ===== *** N-GRAM ANALYSIS *** ====== ======= ======== ======== ====== =======


## ======= UNIGRAM ANALYSIS ==================


OneT <- NGramTokenizer(cleantext, Weka_control(min = 1, max = 1))

# Build a frequency table of the tokens as a data frame
OneD <- data.frame(table(OneT))

# Order by decreasing frequency
OneW <- OneD[order(OneD$Freq,decreasing = TRUE),]

# Keep the top 15 unigrams
OneSort <- OneW[1:15,]
colnames(OneSort) <- c("Word","Frequency")





## ======= BIGRAM ANALYSIS==================

BiT  <- NGramTokenizer(cleantext, Weka_control(min = 2, max = 2))

# Build a frequency table of the tokens as a data frame
BiD <- data.frame(table(BiT))

# Order by decreasing frequency
TwoW <- BiD[order(BiD$Freq,decreasing = TRUE),]

# Keep the top 15 bigrams
TwoSort <- TwoW[1:15,]
colnames(TwoSort) <- c("Word","Frequency")




## ======= TRIGRAM ANALYSIS ==================

TriT <- NGramTokenizer(cleantext, Weka_control(min = 3, max = 3))

# Build a frequency table of the tokens as a data frame
TriD <- data.frame(table(TriT))

# Order by decreasing frequency
TriW <- TriD[order(TriD$Freq,decreasing = TRUE),]

# Keep the top 15 trigrams
TriSort <- TriW[1:15,]
colnames(TriSort) <- c("Word","Frequency")


t9 <- round(as.numeric(Sys.time() - t9), 2) # Time ends. 

12. Top 15 Most Frequent Terms by N-grams

t10 <- as.numeric(Sys.time()) # Time starts.  

OneSort
##       Word Frequency
## 12767  the      9566
## 12958   to      5294
## 461    and      4909
## 1        a      4640
## 8879    of      4038
## 6273    in      3205
## 6157     i      2791
## 6555    it      2379
## 12759 that      2121
## 4840   for      2005
## 6535    is      1956
## 8944    on      1578
## 14195 with      1446
## 14410  you      1444
## 13914  was      1226
TwoSort
##            Word Frequency
## 65007    of the       892
## 46557    in the       799
## 97914    to the       427
## 66257    on the       406
## 34766   for the       346
## 96731     to be       327
## 10439    at the       301
## 7391    and the       288
## 45751      in a       234
## 107361 with the       224
## 36090  from the       198
## 49187     it is       195
## 102848  want to       195
## 49523    it was       185
## 34160     for a       178
TriSort
##               Word Frequency
## 103559  one of the        75
## 1970      a lot of        60
## 77148     it was a        42
## 67600    i want to        41
## 149828     to be a        39
## 107584 part of the        35
## 55475     go to be        34
## 106089  out of the        31
## 138041  the end of        31
## 17095   as well as        29
## 143557     the u s        29
## 20391    be abl to        28
## 126779 some of the        28
## 66723     i have a        27
## 141941 the rest of        24
t10 <- round(as.numeric(Sys.time() - t10), 2) # Time ends.  

13. Distribution of Word-Frequencies – Histograms

t11 <- as.numeric(Sys.time()) # Time starts

# ggplot2 graphics are not affected by par(mfrow = ); use gridExtra::grid.arrange() to place the plots side by side

p1 <- ggplot(OneSort, aes(x=Word,y=Frequency)) + 
    geom_bar(stat="identity", fill="green") +
    coord_flip() +
    ggtitle("Top 15 Unigrams by Frequency") + xlab("Unigram") + ylab("Freq") +
    scale_y_continuous(expand = c(0,0)) +
    geom_text(aes(label=Frequency),hjust=1,size=3,vjust=-0.20,angle=0) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

p2 <- ggplot(TwoSort, aes(x=Word,y=Frequency)) + 
    geom_bar(stat="identity", fill="pink") +
    coord_flip() +
    ggtitle("Top 15 Bigrams by Frequency") + xlab("Bigram") + ylab("Freq") +
    scale_y_continuous(expand = c(0,0)) +
    geom_text(aes(label=Frequency),hjust=1,size=3,vjust=-0.20,angle=0) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

p3 <- ggplot(TriSort, aes(x=Word,y=Frequency)) +
    geom_bar(stat="identity", fill="lightblue") +
    coord_flip() +
    ggtitle("Top 15 Trigrams by Frequency") + xlab("Trigram") + ylab("Freq") +
    scale_y_continuous(expand = c(0,0)) +
    geom_text(aes(label=Frequency),hjust=1,size=3,vjust=-0.20,angle=0) + 
    theme(axis.text.x = element_text(angle = 90, hjust = 1))

grid.arrange(p1, p2, p3, ncol = 3)

t11 <- round(as.numeric(Sys.time() - t11), 2) # Time ends. 

14. Exploratory Data Analysis Part 2 – Word Clouds

We chose to explore the corpus data using word clouds, as they illustrate word frequencies very effectively: the most frequent words appear larger and closer to the centre. One word cloud is plotted for each N-gram size.

t12 <- as.numeric(Sys.time()) # Time starts.

set.seed(2345) # For Reproducibility

oz <- par(mfrow = c(1, 3)) # Plot 3 graphs in 1 row

palette <- brewer.pal(8,"Dark2")

wordcloud(OneW[,1], OneW[,2], min.freq = 25, 
          random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=0, "1-gram cloud")


wordcloud(TwoW[,1], TwoW[,2], min.freq =50, 
          random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=0, "2-gram cloud")


wordcloud(TriW[,1], TriW[,2], max.words = 100, 
          random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=0, "3-gram cloud")

par(oz)

t12 <- round(as.numeric(Sys.time() - t12), 2) # Time ends. 

15. Percent Coverage of Words

## [1] 7260
## [1] 188715

## [1] 90214
## [1] 188714

## [1] 165778
## [1] 188713

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  5702249 304.6    9968622  532.4   9968622  532.4
## Vcells 92748363 707.7  143912772 1098.0 142515677 1087.4

We can see that the coverage is not linear: each additional percent of coverage requires more words from the dictionary than the previous one, because the frequency of each subsequent word decreases.
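
The code behind the coverage figures above is not reproduced here. A minimal sketch of one way to compute such coverage from the unigram frequency table OneW is shown below; the helper name wordsForCoverage is an illustrative assumption.

##--!! HYPOTHETICAL SKETCH : UNIQUE WORDS NEEDED TO COVER A GIVEN SHARE OF ALL WORD INSTANCES
wordsForCoverage <- function(freqTable, coverage = 0.5) {
        freq <- sort(freqTable$Freq, decreasing = TRUE)   # word frequencies, most common first
        which(cumsum(freq) / sum(freq) >= coverage)[1]    # first rank reaching the target coverage
}

# wordsForCoverage(OneW, 0.50)   # unique words needed for 50% coverage
# wordsForCoverage(OneW, 0.90)   # unique words needed for 90% coverage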

16. Future Development

Implement profanity filtering, substituting non-obscene words, and sample subsets for test and validation sets.

Prediction Algorithm:

The word (unigram), 2-gram and 3-gram frequencies have been calculated. These will be used to calculate probabilities for the n-grams.

Given a string W1 ... Wi-1, the word Wi that maximizes P(Wi | Wi-n+1 ... Wi-1) is chosen as the prediction, where n is the maximum n-gram order.
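
As a concrete form of this, a maximum-likelihood estimate can be computed directly from the frequency tables. The sketch below is a hypothetical helper (not final code) illustrating the idea for a trigram model.

##--!! HYPOTHETICAL SKETCH : MAXIMUM-LIKELIHOOD TRIGRAM PROBABILITY FROM THE FREQUENCY TABLES
# P(w3 | w1 w2) is estimated as count(w1 w2 w3) / count(w1 w2)
trigramProb <- function(w1, w2, w3) {
        num <- TriD$Freq[TriD$TriT == paste(w1, w2, w3)]
        den <- BiD$Freq[BiD$BiT == paste(w1, w2)]
        if (length(num) == 0 || length(den) == 0) return(NA)   # unseen n-gram
        num / den
}

# trigramProb("one", "of", "the")   # 75 / count("one of") in the sampled corpus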

Back-off Algorithm:

The above probability calculation suffers when unknown phrases are introduced, so we will consider the following models:

  1. Katz Back-off models
  2. Interpolated models
  3. Kneser-Ney models
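
As a simple starting point before implementing the smoothed models above, a minimal back-off sketch over the TriW, TwoW and OneW frequency tables is shown below. The function predictNextWord is an illustrative assumption and behaves like a "stupid back-off" on raw counts rather than a proper Katz back-off.

##--!! HYPOTHETICAL SKETCH : SIMPLE COUNT-BASED BACK-OFF PREDICTION FROM THE N-GRAM TABLES
predictNextWord <- function(phrase, n = 3) {
        tokens <- unlist(strsplit(tolower(phrase), "\\s+"))

        # Fall back to the most frequent unigrams if there is no input
        if (length(tokens) == 0)
                return(head(as.character(OneW$OneT), n))

        # Try trigrams first: match on the last two words of the input
        if (length(tokens) >= 2) {
                prefix <- paste(tail(tokens, 2), collapse = " ")
                hits   <- TriW[grepl(paste0("^", prefix, " "), TriW$TriT), ]
                if (nrow(hits) > 0)
                        return(head(sapply(strsplit(as.character(hits$TriT), " "), tail, 1), n))
        }

        # Back off to bigrams: match on the last word of the input
        prefix <- tail(tokens, 1)
        hits   <- TwoW[grepl(paste0("^", prefix, " "), TwoW$BiT), ]
        if (nrow(hits) > 0)
                return(head(sapply(strsplit(as.character(hits$BiT), " "), tail, 1), n))

        # Otherwise return the most frequent unigrams
        head(as.character(OneW$OneT), n)
}

# predictNextWord("one of")   # the trigram counts above suggest "the" as the top candidate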

Overall Run-Time: After calculating the n-gram probabilities along with the back-off models, observe the runtime for different sample sizes to balance memory usage and run-time efficiency.

Shiny app: Build a Shiny app that allows users to interact with the prediction algorithm. The app will accept a phrase and predict the next word with the highest probability for the user. The text input will be placed in the sidebar, and the top 3 word predictions will be shown on the main page. A reactive expression will be used to update the top 3 predicted words immediately as the user types in the input box.
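
A minimal sketch of the planned app layout is shown below, assuming the hypothetical predictNextWord() helper sketched earlier; the final app will differ.

##--!! HYPOTHETICAL SKETCH : SKELETON OF THE PLANNED SHINY APP
library(shiny)

ui <- fluidPage(
        titlePanel("Next Word Prediction"),
        sidebarLayout(
                sidebarPanel(textInput("phrase", "Enter a phrase:", value = "")),
                mainPanel(h4("Top 3 predicted words"), verbatimTextOutput("predictions"))
        )
)

server <- function(input, output) {
        # Reactive expression: re-evaluated whenever the user edits the input box
        predicted <- reactive(predictNextWord(input$phrase, n = 3))
        output$predictions <- renderPrint(predicted())
}

# shinyApp(ui, server)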

Appendix A : References

1. http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
2. http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
3. https://github.com/zero323/r-snippets/blob/master/R/ngram_tokenizer.R
4. https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Appendix B : Misc

##--!! SPECIFICATIONS OF THE MACHINE USED FOR COMPILING THIS REPORT
# OS Name   Microsoft Windows 8.1
# Processor Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz, 2601 Mhz, 2 Core(s), 4 Logical Processor(s)
# Memory 8GB 

##--!! CURRENT SESSION INFO
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] R.utils_2.1.0          R.oo_1.19.0            R.methodsS3_1.7.0     
##  [4] qdap_2.2.2             qdapTools_1.1.0        qdapRegex_0.4.0       
##  [7] qdapDictionaries_1.0.6 Rgraphviz_2.12.0       graph_1.46.0          
## [10] xtable_1.7-4           markdown_0.7.7         knitr_1.10.5          
## [13] scales_0.2.5           gridExtra_2.0.0        ggplot2_1.0.1         
## [16] RWeka_0.4-24           rJava_0.9-6            wordcloud_2.5         
## [19] RColorBrewer_1.1-2     SnowballC_0.5.1        slam_0.1-32           
## [22] tm_0.6-2               NLP_0.1-8              dplyr_0.4.2           
## [25] stringr_1.0.0          stringi_0.5-5         
## 
## loaded via a namespace (and not attached):
##  [1] httr_1.0.0          jsonlite_0.9.16     gender_0.4.3       
##  [4] gtools_3.5.0        assertthat_0.1      highr_0.5          
##  [7] stats4_3.2.1        xlsxjars_0.6.1      yaml_2.1.13        
## [10] chron_2.3-47        digest_0.6.8        colorspace_1.2-6   
## [13] htmltools_0.2.6     plyr_1.8.3          XML_3.98-1.3       
## [16] devtools_1.8.0      gdata_2.17.0        git2r_0.10.1       
## [19] openNLP_0.2-5       reports_0.1.4       BiocGenerics_0.14.0
## [22] proto_0.3-10        magrittr_1.5        memoise_0.2.1      
## [25] evaluate_0.7        MASS_7.3-43         xml2_0.1.1         
## [28] tools_3.2.1         data.table_1.9.4    RWekajars_3.7.12-1 
## [31] formatR_1.2         xlsx_0.5.7          munsell_0.4.2      
## [34] plotrix_3.5-12      rversions_1.0.2     RCurl_1.95-4.7     
## [37] rstudioapi_0.3.1    igraph_1.0.1        bitops_1.0-6       
## [40] labeling_0.3        rmarkdown_0.7       venneuler_1.1-0    
## [43] gtable_0.1.2        DBI_0.3.1           curl_0.9.1         
## [46] reshape2_1.4.1      R6_2.1.0            openNLPdata_1.5.3-2
## [49] parallel_3.2.1      Rcpp_0.11.6
##--!! TOTAL TIME TAKEN FOR PROCESSING .
cat("It took",round(sum(t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13)/60,2) ," minutes to complete the process !")
## It took 23.18  minutes to complete the process !
print(t1)
## [1] 130.12
print(t2)
## [1] 0.04
print(t3)
## [1] 506.69
print(t4)
## [1] 68.5
print(t5)
## [1] 179.33
print(t6)
## [1] 3.13
print(t7)
## [1] 21.45
print(t8)
## [1] 10.61
print(t9)
## [1] 270.29
print(t10)
## [1] 0.04
print(t11)
## [1] 4.08
print(t12)
## [1] 131.2
print(t13)
## [1] 65.34