1. Introduction
The goal of this project is to build a predictive text Shiny Web Application, which takes a phrase of one or more words as input and predicts the next word as output.
The project file contains large compilations of text from the HC Corpora corpus. This corpus is compiled from three sources and is available in four languages:
Sources: Blogs, News, Twitter
Languages: English (en_US), German (de_DE), Russian (ru_RU), and Finnish (fi_FI)
2. Executive Summary
This milestone report presents the key insights from the exploratory analysis of the English raw and sampled datasets, along with the approach and plan for developing the algorithm behind the next-word prediction application ("NWP App") on Shiny.
The data from the 3 English (en_US) text documents (Blogs, News and Twitter) was analyzed using the R tm and RWeka packages. The data was then sampled, processed, cleaned and broken down into N-grams. These N-grams will form the basis of our predictive algorithm.
The sections below describe the steps performed in the Exploratory Data Analysis exercise on the corpus.
3. Pre-requisites
Before starting the analysis, I set the current working directory and the system locale, and installed the packages listed below in RStudio.
##--!! SET YOUR CURRENT WORKING DIRECTORY
setwd("C:/Users/ABC/Desktop/Coursera/Capstone")
##--!! SET ASPECTS OF THE LOCALE FOR THE R PROCESS
Sys.setlocale(category = "LC_ALL", locale = "English")
##--!! INSTALL THE FOLL PACKAGES
install.packages("stringi") # For fast text/string manipulation
install.packages("stringr") # For wrapping common text/string manipulation
install.packages("dplyr") # For data manipulation
install.pacakges("NLP") # For Natural Language Processing techniques
install.packages("tm") # For basic text-mining
install.packages("slam") # For Sparse Matrix Arithmetics
install.packages("SnowballC") # For Word-Stemming
install.packages("wordcloud") # For visualizing wordclouds
install.packages("RColorBrewer")# For visualizing Color Palettes in Plots
install.packages("rJava") # For initializing JAVA VM
install.packages("RWeka") # For N-gram generation and tokenization
install.packages("ggplot2") # For plotting elegant charts,graphs
install.packages("grid") # For grid graphics
install.packages("gridExtra") # For arranging plots in a grid
install.packages("scales") # For generic plot scaling methods
install.packages("knitr") # For dynamic reports generation
install.packages("xtable") # For printing out tables
source("http://bioconductor.org/biocLite.R") # For installing Rgraphviz as it is not a CRAN package
biocLite("Rgraphviz") # For plotting word correlations
install.packages("markdown") # For 'Markdown' Rendering
install.packages("qdap") # For Quantitative discourse analysis of transcripts.
install.packages("R.utils") # For Various Programming Utilities
The next step is to load the libraries below and set the options.
# Clear the workspace
rm(list = ls(all=TRUE))
options(warn =-1)
##--!! LOAD THE LIBRARIES
suppressMessages( library(stringi))
suppressMessages( library(stringr))
suppressMessages( library(dplyr))
suppressMessages( library(NLP))
suppressMessages( library(tm))
suppressMessages( library(slam))
suppressMessages( library(SnowballC))
suppressMessages( library(wordcloud))
suppressMessages( library(RColorBrewer))
suppressMessages( library(rJava))
suppressMessages( library(RWeka))
suppressMessages( library(ggplot2))
suppressMessages( library(grid))
suppressMessages( library(gridExtra))
suppressMessages( library(scales))
suppressMessages( library(knitr))
suppressMessages( library(markdown))
suppressMessages( library(xtable))
suppressMessages( library(Rgraphviz))
suppressMessages( library(qdap))
suppressMessages( library(R.utils))
##--!! GARBAGE COLLECTION
##--!! This function runs the garbage collector to retrieve unused RAM for R. In the process it tells you how much memory is currently being used by R.
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1057794 56.5 1770749 94.6 1442291 77.1
## Vcells 1156883 8.9 2060183 15.8 1592181 12.2
4. Data Acquisition
4.1 Downloading Data
Dataset for training can be downloaded from the following link :
https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The unzipped file contains a directory called final, with a subdirectory called en_US, which contains the texts that need to be analyzed.
en_US.blogs.txt - text from blog postings
en_US.news.txt - text from news articles posted online
en_US.twitter.txt - tweets on Twitter
##--!! CODE FOR DOWNLOADING THE FILE
# url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
#
# if (!file.exists("Coursera-SwiftKey.zip")) {
# download.file(url, "Coursera-SwiftKey.zip")
# }
4.2 Extracting Data
##--!! CODE FOR UNZIPPING THE FILE
# unzip("Coursera-SwiftKey.zip")
#
##--!! CODE FOR LISTING THE FILES
# list.files("final/en_US")
The downloaded file is large (548 MB). After unzipping it, we find the following directories:
[1] "de_DE" "en_US" "fi_FI" "ru_RU"
4.3 Loading and Reading Data in R
The files are read line by line using UTF-8 (UCS Transformation Format, 8-bit) encoding, as UTF-8 is capable of encoding all possible characters.
##--!! READ THE ORIGINAL DATASETS (only the ENGLISH VERSION) INTO 3 DIFFERENT VECTORS
t1 <- as.numeric(Sys.time()) # Time starts.
DataBlogs <- readLines("final/en_US/en_US.blogs.txt",encoding="UTF-8",warn=FALSE,skipNul = TRUE)
## Use a connection in binary mode because the file contains special characters and to avoid the "incomplete final line" warning
con <- file("final/en_US/en_US.news.txt",open="rb")
DataNews <- readLines(con,encoding="UTF-8",warn=FALSE,skipNul = TRUE)
close(con)
rm(con)
DataTwitter <- readLines("final/en_US/en_US.twitter.txt",encoding="UTF-8",warn=FALSE,skipNul = TRUE)
t1 <- round(as.numeric(Sys.time() - t1), 2) # Time ends.
4.4 Previewing Raw Data
Let’s preview the first and last few lines from these 3 files to get familiar with the general format and structure of the data.
##--!! DISPLAY THE FIRST FEW LINES FROM THE FILES
t2 <- as.numeric(Sys.time()) # Time starts.
head(DataBlogs,2)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan <U+0093>gods<U+0094>."
## [2] "We love you Mr. Brown."
head(DataNews,2)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
head(DataTwitter,2)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
tail(DataBlogs,2)
## [1] "(5) What's the barrier to entry and why is the business sustainable?"
## [2] "In response to an over-whelming number of comments we sat down and created a list of do (s) and don<U+0092>t (s) <U+0096> these recommendations are easy to follow and except for - adding some herbs to your rinse . So let<U+0092>s get begin<U+0085>"
tail(DataNews,2)
## [1] "That starts this Sunday at Chivas. The Goats aren't a great team, but they just beat one (a 1-0 win over Salt Lake at Rio Tinto). They also have the one player who can rival Roger Espinoza as \"The Best Guy in MLS That No One Talks About Because He Doesn't Play in New York, LA or the Pacific Northwest\" in goalkeeper Dan Kennedy. These will be tough points."
## [2] "The only outwardly religious adornment was a billboard-sized banner with an image of Our Lady of Charity, patron saint of Cuba, hanging on the side of the National Library."
tail(DataTwitter,2)
## [1] "It is #RHONJ time!!"
## [2] "The key to keeping your woman happy= attention, affection, treat her like a queen and sex her like a pornstar!"
t2 <- round(as.numeric(Sys.time() - t2), 2) # Time ends.
We can see that the Blogs file has longer, informal text, the News file has more formal text, and the Twitter file has short, even more informal text.
5. Data Summaries of Original Raw Data Files
| | Blogs | News | Twitter |
|---|---|---|---|
| FileSize.MB | 200.42 | 196.28 | 159.36 |
| Lines | 899288.00 | 1010242.00 | 2360148.00 |
| Words | 37546246.00 | 34762395.00 | 30093410.00 |
| Chars | 206824505.00 | 203223159.00 | 162096241.00 |
| Words.PerLine | 41.75 | 34.41 | 12.75 |
| Chars.PerLine | 229.99 | 201.16 | 68.68 |
| Max.Words | 6726.00 | 1796.00 | 47.00 |
| Max.Chars | 40833.00 | 11384.00 | 140.00 |
| Min.Words | 0.00 | 1.00 | 1.00 |
| Min.Chars | 1.00 | 1.00 | 2.00 |
| LongestLine.RowIndex | 483415.00 | 123628.00 | 26.00 |
| ShortestLine.RowIndex | 278204.00 | 79323.00 | 43549.00 |
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5275907 281.8 9968622 532.4 9968622 532.4
## Vcells 89970716 686.5 143912772 1098.0 142515677 1087.4
6. Exploratory Data Analysis Part 1 – Visualization with barplots
Let’s visualize some of the above findings using barplots.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5279376 282.0 9968622 532.4 9968622 532.4
## Vcells 89977879 686.5 143912772 1098.0 142515677 1087.4
7. Data Sampling
Since the corpus data is huge and processing is time consuming, I sampled 10,000 lines from each of the 3 files and merged them into a single sample dataset for this report.
t5 <- as.numeric(Sys.time()) # Time starts.
##--!! REMOVE Non-ASCII CHARACTERS (replace with blanks)
DataBlogs <- iconv(DataBlogs, from="latin1", to="ASCII", sub="")
DataNews <- iconv(DataNews, from="latin1", to="ASCII", sub="")
DataTwitter <- iconv(DataTwitter, from="latin1", to="ASCII", sub="")
##--!! GENERATE A RANDOM SAMPLE OF 10000 LINES FROM THE 3 FILES and MERGE INTO A SINGLE SAMPLE DATASET
set.seed(369) # For reproducibility
sampleSize <- 10000
sampleBlogs <- DataBlogs[sample(1:length(DataBlogs), sampleSize)]
sampleNews <- DataNews[sample(1:length(DataNews), sampleSize)]
sampleTwitter <- DataTwitter[sample(1:length(DataTwitter), sampleSize)]
sampleDoc <- paste(sampleTwitter, sampleNews, sampleBlogs,sep = " ") # paste() merges the three samples element-wise into 10,000 combined lines
##--!! PREVIEW THE TOP AND LAST 2 ROWS
head(sampleDoc,2)
## [1] "is Nestor going to try and choke the shit out of him? I hear he likes to do that to people. Farmers in Japan already use small drones to automatically spray their crops with pesticides, and more recently safety inspectors used them at the crippled Fukushima Daiichi nuclear power plant. Archaeologists in Russia are using small drones and their infrared cameras to construct a 3-D model of ancient burial mounds. Officials in Tampa Bay, Fla., want to use them for security surveillance at next year's Republican National Convention. Russias Volga River is the longest waterway in Europe. It winds its way south from northwest Russia to the Caspian Sea. Many tributaries pour into the Volga, causing its swiftly flowing blue-green waters to rush even faster. Through the centuries, the river has been a major transportation route. Even today, barges carry goods to and from factories along the shore."
## [2] "Yea! We love to hear that! RT: Time to switch to \"Wingman\" app!!! :D In a hypothetical matchup between Obama and Christie, the president would take 55 percent and the governor 38 percent. These two procedures may be overly cautious done for all to protect the very, very few but neither have been proven dangerous. An argument could be made for rolling with hospital policy, if only to keep your blood pressure down at a time when youll be dealing with enough stress. If you want to avoid routine post-natal medication, first talk to your obstetrician or midwife well before the birth (before thirty-five weeks) about whether you have a choice in the matter, and if so, what he or she recommends for you. If you feel really strongly about this issue, you could consider a home birth where youd have more say in the matter. Otherwise, you can find safety (or solace?) in numbers: the overwhelming majority of babies are given both treatments, and are apparently none the worse for it."
tail(sampleDoc,2)
## [1] "Having a great day in #Brookhaven Atlanta ! Stop by for some excellent eco-friendly baby gifts! Ellen Tauscher, the U.S. special envoy for strategic stability and missile defense, said no agreement was likely this year because of the U.S. political campaign. \"But in the meantime, we've got a lot of work to do to dispel the mistrust,\" she said. You get the idea. The filmmakers heart might have bee in the right place, but where his mind was is anybodys guess."
## [2] "Hi ! How are you?Thank you for follow me. Regularly through the Nov. 2 election, The Chronicle will publish a few of the \"lies, half-truths and contradictions\" uttered by the California statewide campaigns and their supporters from recent days.- Joe Garofoli, jgarofoli@sfchronicle.com Chief of Staff: Peace will be answered with peace, and fire with fire Hamas-linked CAIR holding rally in New York for synagogue bomber Video Compilation Of Rocket Attack Filmed by Civilians AvitalLeibovich: @ANDYLFC2011 if #hamas would spend money on #Gaza rather than extending rockets ranges-#Israel wouldnt need to supply 70%of its electricity AvitalLeibovich: @Kevremo thanks! Turkey rejects Arkia request for extra security Video: PRC Spokesman Admits that Hamas Allows Attacks Against Israel AvitalLeibovich: A few mortars fired from #Gaza into #Israel a short while ago. Cease fire???"
t5 <- round(as.numeric(Sys.time() - t5), 2) # Time ends.
rm(sampleBlogs,sampleNews,sampleTwitter)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5287775 282.4 9968622 532.4 9968622 532.4
## Vcells 90139599 687.8 143912772 1098.0 142515677 1087.4
8. Data Summaries of Random Sample Dataset before cleaning
9. Creating the Corpus
Preprocessing and cleaning is an important step of text analytics to standardise the input documents.
t7 <- as.numeric(Sys.time()) # Time starts.
##--!! SPLIT THE TEXT PARAGRAPHS INTO SENTENCES
sampleDoc <- sent_detect(sampleDoc, language = "en", model = NULL)
##--!! NOW BUILD THE CORPUS FROM THE 3 DOCUMENTS
docs <- Corpus(VectorSource(list(sampleDoc)))
inspect(docs[1])
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1063587
t7 <- round(as.numeric(Sys.time() - t7), 2) # Time ends.
10. Tidying the Corpus
With the help of the "tm" (Text Mining) package we will clean the text as explained below. The function tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented using R functions and wrapped within content_transformer() to create a function that can be passed through to tm_map().
t8 <- as.numeric(Sys.time() ) # Time starts.
# Remove URLs -------------------
removeURLs <- function(x) gsub("http[[:alnum:]]*", "", x)
docs <- tm_map(docs, content_transformer(removeURLs))
# Remove metacharacters -------------------
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace,"/|@|\\|\\,|\\:|\\&|\\-|\\)|\\(|\\{|\\}|\\[|\\]|\\+|=|~|<|>|\\^")
# Convert to lower case -------------------
docs <- tm_map(docs, content_transformer(tolower))
# Remove Punctuation -------------------
docs <- tm_map(docs, content_transformer(removePunctuation))
# Remove Numbers -------------------
docs <- tm_map(docs, content_transformer(removeNumbers))
# Remove whitespace -------------------
docs <- tm_map(docs, stripWhitespace)
# Remove stopwords --------------------
#docs <- tm_map(docs,removeWords,stopwords("en"))
# Stemming document --------------------
docs <- tm_map(docs, stemDocument, language="en")
# Create PTD -------------------
docs <- tm_map(docs, PlainTextDocument)
# Create DTM -------------------
dtm <- DocumentTermMatrix(docs)
dim(dtm)
## [1] 1 14200
inspect(dtm[1,1:10])
## <<DocumentTermMatrix (documents: 1, terms: 10)>>
## Non-/sparse entries: 10/0
## Sparsity : 0%
## Maximal term length: 9
## Weighting : term frequency (tf)
##
## Terms
## Docs aaa aaaaandgo aaaahhhhh aadvantag aaron aarti aback abandon
## character(0) 1 1 1 3 1 1 1 6
## Terms
## Docs abbey abbeyroad
## character(0) 5 1
#findFreqTerms(dtm, lowfreq=15) #terms occurring at least 15 times
t8 <- round(as.numeric(Sys.time() - t8), 2) # Time ends.
Note that the following are not removed at this phase: profanities and stop words.
I have not removed phrases containing profanities because I want the model to be able to predict a non-obscene word after an input that contains a bad word. Stop words are not removed because they are often exactly the words that need to be predicted, and they indicate which words may follow.
There are 174 stop words identified in the text-mining tm R package, such as i, me, my, myself, we, our, ours, ourselves, you, your, etc.
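For reference, the stop-word list that ships with tm can be inspected directly (a quick check, assuming the packages loaded above):
##--!! INSPECT THE ENGLISH STOP-WORD LIST SHIPPED WITH tm
length(stopwords("en")) # number of stop words (174 in this version of tm)
head(stopwords("en"), 10) # e.g. "i", "me", "my", "myself", "we", ...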
11. Tokenization and N-Gram Analysis
An N-gram is a sequence of n words. An n-gram of size 1 is referred to as a "Uni-gram", size 2 as a "Bi-gram", and size 3 as a "Tri-gram".
Tokenization is the process of breaking a stream of text up into sequences of words, phrases, symbols, or other meaningful elements called tokens for statistical analysis and subsequent construction of prediction models.
These n-gram frequency tables will then serve for word prediction in the algorithm to be built in the next phase of the capstone project.
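As a quick illustration of how RWeka's NGramTokenizer splits text, here is a toy sentence (not taken from the corpus):
##--!! TOY EXAMPLE: SPLIT A SINGLE SENTENCE INTO BIGRAMS
toy <- "the quick brown fox jumps"
NGramTokenizer(toy, Weka_control(min = 2, max = 2))
# Yields the four bigrams: "the quick" "quick brown" "brown fox" "fox jumps"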
t9 <- as.numeric(Sys.time()) # Time starts
##--!! Converting Corpus to Data Frame for processing by the RWeka functions
cleantext <- data.frame(text=unlist(sapply(docs, `[`, "content")), stringsAsFactors=F)
# Constructor for an n-gram tokenizer: n = number of words per gram
## ngram_tokenizer <- function(n) function(x) NGramTokenizer(x, Weka_control(min = n, max = n))
##--!! ====== ====== ===== *** N-GRAM ANALYSIS *** ====== ======= ======== ======== ====== =======
## ======= UNIGRAM ANALYSIS ==================
OneT <- NGramTokenizer(cleantext, Weka_control(min = 1, max = 1))
# Build a frequency table of the tokens as a data frame
OneD <- data.frame(table(OneT))
# Order the terms by decreasing frequency
OneW <- OneD[order(OneD$Freq,decreasing = TRUE),]
# Keep the top 15 unigrams by frequency
OneSort <- OneW[1:15,]
colnames(OneSort) <- c("Word","Frequency")
## ======= BIGRAM ANALYSIS==================
BiT <- NGramTokenizer(cleantext, Weka_control(min = 2, max = 2))
# Build a frequency table of the tokens as a data frame
BiD <- data.frame(table(BiT))
# Order the terms by decreasing frequency
TwoW <- BiD[order(BiD$Freq,decreasing = TRUE),]
# Keep the top 15 bigrams by frequency
TwoSort <- TwoW[1:15,]
colnames(TwoSort) <- c("Word","Frequency")
## ======= TRIGRAM ANALYSIS ==================
TriT <- NGramTokenizer(cleantext, Weka_control(min = 3, max = 3))
# Build a frequency table of the tokens as a data frame
TriD <- data.frame(table(TriT))
# Order the terms by decreasing frequency
TriW <- TriD[order(TriD$Freq,decreasing = TRUE),]
# Keep the top 15 trigrams by frequency
TriSort <- TriW[1:15,]
colnames(TriSort) <- c("Word","Frequency")
t9 <- round(as.numeric(Sys.time() - t9), 2) # Time ends.
12. Top 15 Most Frequent Terms by N-grams
t10 <- as.numeric(Sys.time()) # Time starts.
OneSort
## Word Frequency
## 12767 the 9566
## 12958 to 5294
## 461 and 4909
## 1 a 4640
## 8879 of 4038
## 6273 in 3205
## 6157 i 2791
## 6555 it 2379
## 12759 that 2121
## 4840 for 2005
## 6535 is 1956
## 8944 on 1578
## 14195 with 1446
## 14410 you 1444
## 13914 was 1226
TwoSort
## Word Frequency
## 65007 of the 892
## 46557 in the 799
## 97914 to the 427
## 66257 on the 406
## 34766 for the 346
## 96731 to be 327
## 10439 at the 301
## 7391 and the 288
## 45751 in a 234
## 107361 with the 224
## 36090 from the 198
## 49187 it is 195
## 102848 want to 195
## 49523 it was 185
## 34160 for a 178
TriSort
## Word Frequency
## 103559 one of the 75
## 1970 a lot of 60
## 77148 it was a 42
## 67600 i want to 41
## 149828 to be a 39
## 107584 part of the 35
## 55475 go to be 34
## 106089 out of the 31
## 138041 the end of 31
## 17095 as well as 29
## 143557 the u s 29
## 20391 be abl to 28
## 126779 some of the 28
## 66723 i have a 27
## 141941 the rest of 24
t10 <- round(as.numeric(Sys.time() - t10), 2) # Time ends.
13. Distribution of N-gram Frequencies – Bar Charts
t11 <- as.numeric(Sys.time()) # Time starts
# Note: par(mfrow) does not affect ggplot2 output, so the three plots are arranged with gridExtra::grid.arrange()
p1 <- ggplot(OneSort, aes(x=Word,y=Frequency)) +
geom_bar(stat="identity", fill="green") +
coord_flip() +
ggtitle("Top 15 Unigrams by Frequency") + xlab("Unigram") + ylab("Freq") +
scale_y_continuous(expand = c(0,0)) +
geom_text(aes(label=Frequency),hjust=1,size=3,vjust=-0.20,angle=0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p2 <- ggplot(TwoSort, aes(x=Word,y=Frequency)) +
geom_bar(stat="identity", fill="pink") +
coord_flip() +
ggtitle("Top 15 Bigrams by Frequency") + xlab("Bigram") + ylab("Freq") +
scale_y_continuous(expand = c(0,0)) +
geom_text(aes(label=Frequency),hjust=1,size=3,vjust=-0.20,angle=0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
p3 <- ggplot(TriSort, aes(x=Word,y=Frequency)) +
geom_bar(stat="identity", fill="lightblue") +
coord_flip() +
ggtitle("Top 15 Trigrams by Frequency") + xlab("Trigram") + ylab("Freq") +
scale_y_continuous(expand = c(0,0)) +
geom_text(aes(label=Frequency),hjust=1,size=3,vjust=-0.20,angle=0) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
grid.arrange(p1, p2, p3, ncol = 3) # Arrange the three plots side by side
t11 <- round(as.numeric(Sys.time() - t11), 2) # Time ends.
14. Exploratory Data Analysis Part 2 – Word Clouds
We chose to explore the corpus data using word clouds, as they illustrate word frequencies very effectively: the most frequent terms are drawn larger and closer to the centre of the cloud. One word cloud is plotted for each n-gram size.
t12 <- as.numeric(Sys.time()) # Time starts.
set.seed(2345) # For Reproducibility
oz <- par(mfrow = c(1, 3)) # Plot 3 graphs in 1 row
palette <- brewer.pal(8,"Dark2")
wordcloud(OneW[,1], OneW[,2], min.freq = 25,
random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=0, "1-gram cloud")
wordcloud(TwoW[,1], TwoW[,2], min.freq =50,
random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=0, "2-gram cloud")
wordcloud(TriW[,1], TriW[,2], max.words = 100,
random.order = F, ordered.colors = F, colors=palette)
text(x=0.5, y=0, "3-gram cloud")
par(oz)
t12 <- round(as.numeric(Sys.time() - t12), 2) # Time ends.
15. Percent Coverage of Words
## [1] 7260
## [1] 188715
## [1] 90214
## [1] 188714
## [1] 165778
## [1] 188713
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 5702249 304.6 9968622 532.4 9968622 532.4
## Vcells 92748363 707.7 143912772 1098.0 142515677 1087.4
We can see that the coverage is not linear: each additional percent of coverage requires a larger number of words from the dictionary than the previous percent, because the frequency of each successive word keeps decreasing.
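The coverage figures above come from a code chunk that is not echoed here; a minimal sketch of how such coverage can be computed from the unigram frequency table built in Section 11 (an illustration, not the exact code used):
##--!! SKETCH: UNIQUE WORDS NEEDED TO COVER A GIVEN FRACTION OF ALL WORD OCCURRENCES
coverage <- function(freqs, pct) {
  freqs <- sort(freqs, decreasing = TRUE)
  which(cumsum(freqs) / sum(freqs) >= pct)[1]
}
coverage(OneD$Freq, 0.50) # words needed to cover 50% of occurrences
coverage(OneD$Freq, 0.90) # words needed to cover 90% of occurrences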
16. Future Development
Implement profanity filtering, substituting profanities with non-obscene words, and sample subsets of the data for test and validation sets.
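A minimal sketch of what this filtering could look like, assuming a plain-text word list (the file name badwords.txt and the list itself are placeholders, not part of this report); shown commented out like the other planned code, and here the words are simply removed rather than substituted:
##--!! SKETCH: PROFANITY FILTERING (badwords.txt is a placeholder word list)
# badwords <- readLines("badwords.txt") # one profane word per line
# docs <- tm_map(docs, removeWords, badwords) # drop the listed words from the corpus
# docs <- tm_map(docs, stripWhitespace) # tidy up the extra blanks left behind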
Prediction Algorithm:
The Word frequencies, 2-Gram, 3-Gram frequencies have been calculated. These have to be used to calculate Probabilities for the n-grams.
Given a string W1 ... Wi-1, the word Wi that maximizes P(Wi | Wi-n+1 ... Wi-1) has to be chosen as the prediction, where n is the maximum n-gram order.
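As an illustration of the probability calculation, a maximum-likelihood estimate for bigrams can be built from the OneD and BiD tables of Section 11 (the helper name is mine, not the final algorithm):
##--!! SKETCH: MLE BIGRAM PROBABILITY  P(w2 | w1) = count("w1 w2") / count("w1")
bigram_prob <- function(w1, w2) {
  num <- BiD$Freq[BiD$BiT == paste(w1, w2)]
  den <- OneD$Freq[OneD$OneT == w1]
  if (length(num) == 0 || length(den) == 0) return(0)
  num / den
}
bigram_prob("of", "the") # relative frequency of "the" following "of"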
Back-off Algorithm :
The above probability calculation suffers when unknown phrases are introduced, so the following models will be considered: Katz Back-Off models, Interpolated models and Kneser-Ney models.
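These models will be implemented in the next phase; as a stand-in illustration, a simplified back-off over the TriD/BiD/OneD tables from Section 11 (closer to a "stupid back-off" than to full Katz, which also requires discounting) might look like this:
##--!! SKETCH: SIMPLIFIED BACK-OFF OVER THE TRI-/BI-/UNIGRAM TABLES (not full Katz)
predict_next <- function(w1, w2) {
  # Try trigrams that start with "w1 w2"
  hits <- TriD[grepl(paste0("^", w1, " ", w2, " "), TriD$TriT), ]
  if (nrow(hits) == 0) {
    # Back off to bigrams that start with "w2"
    hits <- BiD[grepl(paste0("^", w2, " "), BiD$BiT), ]
  }
  if (nrow(hits) == 0) {
    # Back off to the single most frequent unigram
    return(as.character(OneD[which.max(OneD$Freq), 1]))
  }
  best <- as.character(hits[which.max(hits$Freq), 1])
  tail(strsplit(best, " ")[[1]], 1) # last word of the most frequent matching n-gram
}
predict_next("one", "of") # expected to return "the", given the trigram counts above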
Overall Run-Time: after calculating the n-gram probabilities along with the back-off models, observe the run-time for different sample sizes to balance memory usage and responsiveness.
Shiny app: build a Shiny app that allows users to interact with the prediction algorithm. The app will accept an n-gram and predict the next word with the highest probability. The user text input will sit in the sidebar, with the top 3 word predictions shown on the main page. A reactive expression will update the top 3 predicted words immediately as the user types into the input box.
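A bare-bones skeleton of the planned app layout (predict_top3 is a placeholder name for the eventual prediction function), again shown commented out since it belongs to the next phase:
##--!! SKETCH: SHINY APP SKELETON (predict_top3 is a placeholder for the real predictor)
# library(shiny)
# ui <- fluidPage(
#   sidebarLayout(
#     sidebarPanel(textInput("phrase", "Enter a phrase:")),
#     mainPanel(verbatimTextOutput("predictions"))
#   )
# )
# server <- function(input, output) {
#   output$predictions <- renderPrint({
#     predict_top3(input$phrase) # reactive: re-runs as the user types
#   })
# }
# shinyApp(ui = ui, server = server)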
Appendix A : References
# http://cran.r-project.org/web/packages/tm/vignettes/tm.pdf
# http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know
# https://github.com/zero323/r-snippets/blob/master/R/ngram_tokenizer.R
# https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf
Appendix B : Misc
##--!! SPECIFICATIONS OF THE MACHINE USED FOR COMPILING THIS REPORT
# OS Name Microsoft Windows 8.1
# Processor Intel(R) Core(TM) i7-4510U CPU @ 2.00GHz, 2601 Mhz, 2 Core(s), 4 Logical Processor(s)
# Memory 8GB
##--!! CURRENT SESSION INFO
sessionInfo()
## R version 3.2.1 (2015-06-18)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 8 x64 (build 9200)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] R.utils_2.1.0 R.oo_1.19.0 R.methodsS3_1.7.0
## [4] qdap_2.2.2 qdapTools_1.1.0 qdapRegex_0.4.0
## [7] qdapDictionaries_1.0.6 Rgraphviz_2.12.0 graph_1.46.0
## [10] xtable_1.7-4 markdown_0.7.7 knitr_1.10.5
## [13] scales_0.2.5 gridExtra_2.0.0 ggplot2_1.0.1
## [16] RWeka_0.4-24 rJava_0.9-6 wordcloud_2.5
## [19] RColorBrewer_1.1-2 SnowballC_0.5.1 slam_0.1-32
## [22] tm_0.6-2 NLP_0.1-8 dplyr_0.4.2
## [25] stringr_1.0.0 stringi_0.5-5
##
## loaded via a namespace (and not attached):
## [1] httr_1.0.0 jsonlite_0.9.16 gender_0.4.3
## [4] gtools_3.5.0 assertthat_0.1 highr_0.5
## [7] stats4_3.2.1 xlsxjars_0.6.1 yaml_2.1.13
## [10] chron_2.3-47 digest_0.6.8 colorspace_1.2-6
## [13] htmltools_0.2.6 plyr_1.8.3 XML_3.98-1.3
## [16] devtools_1.8.0 gdata_2.17.0 git2r_0.10.1
## [19] openNLP_0.2-5 reports_0.1.4 BiocGenerics_0.14.0
## [22] proto_0.3-10 magrittr_1.5 memoise_0.2.1
## [25] evaluate_0.7 MASS_7.3-43 xml2_0.1.1
## [28] tools_3.2.1 data.table_1.9.4 RWekajars_3.7.12-1
## [31] formatR_1.2 xlsx_0.5.7 munsell_0.4.2
## [34] plotrix_3.5-12 rversions_1.0.2 RCurl_1.95-4.7
## [37] rstudioapi_0.3.1 igraph_1.0.1 bitops_1.0-6
## [40] labeling_0.3 rmarkdown_0.7 venneuler_1.1-0
## [43] gtable_0.1.2 DBI_0.3.1 curl_0.9.1
## [46] reshape2_1.4.1 R6_2.1.0 openNLPdata_1.5.3-2
## [49] parallel_3.2.1 Rcpp_0.11.6
##--!! TOTAL TIME TAKEN FOR PROCESSING
cat("It took",round(sum(t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11,t12,t13)/60,2) ," minutes to complete the process !")
## It took 23.18 minutes to complete the process !
print(t1)
## [1] 130.12
print(t2)
## [1] 0.04
print(t3)
## [1] 506.69
print(t4)
## [1] 68.5
print(t5)
## [1] 179.33
print(t6)
## [1] 3.13
print(t7)
## [1] 21.45
print(t8)
## [1] 10.61
print(t9)
## [1] 270.29
print(t10)
## [1] 0.04
print(t11)
## [1] 4.08
print(t12)
## [1] 131.2
print(t13)
## [1] 65.34