Capstone Project - Final Test Report

Executive Summary

This is the final test report for the Coursera Capstone Project. Its goal is to show every step of the implementation as it runs.

1. Load the Necessary Libraries

For this project we will mainly use the quanteda, ggplot2, data.table and knitr packages.

library(quanteda)

## quanteda version 0.9.6.9
## 
## 
## Attaching package: 'quanteda'
## 
## The following object is masked from 'package:base':
## 
##     sample

library(data.table)
library(ggplot2)
library(knitr)

wd.R <- "D:/001 -- Coursera/Capstone Project/Coursera---Data-Science---Capstone-Project"

setwd(wd.R)

source("Create Ngrams Data Table vFinal.R")
source("Knersey-ney Optimazed vFinal.R")
source("Main Predict Word vFinal.R")
source("Pred Next Word Regex vFinal.R")
source("Pred Next Word vFinal.R")

# For reproducibility
set.seed(12345)

2. Create Ngram Data Table

We will create the ngrams table for our quadgram model using 80% of the corpora.

2.1 Load and Clean the Data

This function loads an 80% sample of the data and cleans it:

list_filenames <- c("en_US.blogs.txt","en_US.news.txt","en_US.twitter.txt")
create_mydata(list_filenames, 80)

## [1] "-----> INIT: create_mydata(80)......."
## [1] "... Loading the Data from the file: en_US.blogs.txt ..."
## [1] "... Loading the Data from the file: en_US.news.txt ..."
## [1] "... Loading the Data from the file: en_US.twitter.txt ..."
## [1] "... Taking a Training Sample of: 80% ..."
## [1] "... Removing emojies and other characters ...."
## [1] "... To Lower Data ...."
## [1] "... Replace URL ...."
## [1] "... Replace Email ...."
## [1] "... Replace twitter ...."
## [1] "... Replace Hashtag ...."
## [1] "... Replacing apostrophe between words (') for special character ffff ...."
## [1] "... Replacing left ' ...."
## [1] "... Replacing punctuation for special characters ...."
## [1] "... Replacing $ + < > ...."
## [1] "... Replace Word that start with numbers ...."
## [1] "... Replace Word that finish with numbers ...."
## [1] "... Replace Digits ...."
## [1] "... Replacing rest of punctuation ...."
## [1] "... Removing Profanity Words ...."
## [1] "... Putting back apostrophe (') ..."
## [1] "... Saving mydata file:mydata_80.RData"
## [1] "-----> FINISH: create_mydata(80): Running Time .......413 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3989025 213.1   12002346  641.0   9272007  495.2
## Vcells 56700430 432.6  158781237 1211.5 158781205 1211.5
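The "Taking a Training Sample of: 80%" step can be pictured with the minimal sketch below. It is not the code inside create_mydata(), which this report does not show. The toy vector all_lines and the names train_set and held_out are hypothetical, and the use of sample() for the split is an assumption.

```r
# Toy stand-in for the lines read from the three corpus files (hypothetical data).
all_lines <- sprintf("line %d", 1:10)

set.seed(12345)  # same seed as the report, so the split is reproducible
n_train <- floor(0.8 * length(all_lines))

# base::sample is called explicitly because quanteda masks sample().
train_idx <- base::sample(length(all_lines), size = n_train)

train_set <- all_lines[train_idx]   # 80% used to build the n-gram tables
held_out  <- all_lines[-train_idx]  # remaining 20%, left out of training
```

Calling base::sample() by its full name avoids any surprise from the masking message printed when quanteda is loaded.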

An example of ‘mydata’ content is:

mydata[1:5]

## [1] "listening to vh   presents eeee  donna summer live in concert eeee  rip"   
## [2] "one of the most interestingly informative posts yet eeee "                 
## [3] "plzzz followww me i love u so much   "                                     
## [4] "we have   new salads for spring eeee  garden ranch and carrot ginger eeee "
## [5] "beautiful sunday  beautiful brunch eeee  happy easter friends eeee "
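A simplified sketch of this kind of cleaning, using only base R, is shown below. It covers just a few of the steps listed in the log (lower-casing, URLs, e-mail addresses, Twitter handles and hashtags, digits, punctuation) and is an assumption about how such steps could be written, not the actual create_mydata() code. The function name clean_text is hypothetical, and treating "eeee" as the replacement for sentence-ending punctuation is inferred from the sample output above.

```r
# Hypothetical helper: a simplified stand-in for part of the cleaning pipeline.
clean_text <- function(x, eos = "eeee") {
  x <- tolower(x)
  x <- gsub("https?://\\S+|www\\.\\S+", " ", x, perl = TRUE)  # URLs
  x <- gsub("\\S+@\\S+", " ", x, perl = TRUE)                 # e-mail addresses
  x <- gsub("[@#]\\S+", " ", x, perl = TRUE)                  # handles and hashtags
  x <- gsub("[.!?]+", paste0(" ", eos, " "), x)               # sentence ends -> marker
  x <- gsub("[0-9]+", " ", x)                                 # digits
  x <- gsub("[^a-z' ]", " ", x)                               # remaining punctuation
  trimws(gsub(" +", " ", x))                                  # squeeze repeated spaces
}

cleaned <- clean_text("Happy Easter, friends! Visit http://example.com at 10am.")
print(cleaned)
```

Unlike the real output above, this sketch also collapses repeated spaces, which makes the result easier to compare in a quick check.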

2.2 Create All Tokens

An important step is to create alltokens, which is then used to generate the n-grams (unigrams, bigrams, trigrams and quadgrams):

create_alltokens(list_filenames,80)

## [1] "-----> INIT: create_alltokens(training_set:=80)......."
## [1] "-----> INIT: create_mydata(80)......."
## [1] "-----> FINISH: create_mydata(80): Running Time .......0 seconds ..."
## [1] "... Creating alltokens ..."
## Starting tokenization...
##   ...tokenizing texts...total elapsed:  54.96 seconds.
##   ...replacing names...total elapsed:  0.07 seconds.
## Finished tokenizing and cleaning 2,669,356 texts.
## [1] "... Saving alltokens file:alltokens_80.RData"
## [1] "-----> INIT: create_alltokens(80): Running Time .......100 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4478643 239.2   14442815  771.4  14442815  771.4
## Vcells 75381939 575.2  218007256 1663.3 213006711 1625.2

An example of ‘alltokens’ content is:

alltokens[1:5]

## [[1]]
##  [1] "listening" "to"        "vh"        "presents"  "eeee"     
##  [6] "donna"     "summer"    "live"      "in"        "concert"  
## [11] "eeee"      "rip"      
## 
## [[2]]
## [1] "one"           "of"            "the"           "most"         
## [5] "interestingly" "informative"   "posts"         "yet"          
## [9] "eeee"         
## 
## [[3]]
## [1] "plzzz"    "followww" "me"       "i"        "love"     "u"       
## [7] "so"       "much"    
## 
## [[4]]
##  [1] "we"     "have"   "new"    "salads" "for"    "spring" "eeee"  
##  [8] "garden" "ranch"  "and"    "carrot" "ginger" "eeee"  
## 
## [[5]]
## [1] "beautiful" "sunday"    "beautiful" "brunch"    "eeee"      "happy"    
## [7] "easter"    "friends"   "eeee"
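The shape of alltokens (a list with one character vector per text) can be reproduced with a small base R sketch. The report itself relies on quanteda for tokenization. The whitespace split below is only an illustrative assumption, and tokenize_simple is a hypothetical name.

```r
# Hypothetical helper: split already-cleaned texts on whitespace.
tokenize_simple <- function(texts) {
  tokens <- strsplit(texts, "\\s+")
  # Drop empty strings produced by leading whitespace.
  lapply(tokens, function(tok) tok[nzchar(tok)])
}

toy_tokens <- tokenize_simple(c("plzzz followww me i love u so much   ",
                                "we have   new salads"))
print(toy_tokens)
```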

2.3 Create Unigrams Frequency Table

To create the unigram frequency table, we use alltokens to build the unigrams, clean them by removing fake unigrams, and finally create the document-feature matrix (dfm).

2.3.1 Create and Clean Unigrams

Let’s create and clean the unigrams:

create_ngram(n=1,list_filenames,training_set= 80)

## [1] "-----> INIT: create_ngram(n:=1 training_set:=80)......."
## [1] "-----> INIT: create_alltokens(training_set:=80)......."
## [1] "-----> INIT: create_alltokens(80): Running Time .......0 seconds ..."
## [1] "... Creating Ngram:uni.ngram"
## [1] "... Saving Ngram file:uni_ngram_80.RData"
## [1] "-----> FINISH: create_ngram(n:=1 training_set:=80): Running Time .......117 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4477976 239.2   14442815  771.4  14442815  771.4
## Vcells 75382218 575.2  209595281 1599.1 295041012 2251.0

clean_ngram(n=1,list_filenames,training_set= 80)

## [1] "-----> INIT: clean_ngram(n:= 1 training_set:=80)......."
## [1] "-----> INIT: create_ngram(n:=1 training_set:=80)......."
## [1] "-----> FINISH: create_ngram(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Cleaning Ngram: uni.ngram"
## [1] "... Saving Ngram Cleaned file: uni_ngram_clean_80.RData"
## [1] "-----> INIT: clean_ngram(n:= 1 training_set:=80): Running Time .......781 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4477453 239.2   14442815  771.4  14442815  771.4
## Vcells 68379486 521.7  209594796 1599.1 295041012 2251.0

An example of ‘uni.ngram’ content is:

uni.ngram[1:5]

## [[1]]
##  [1] "listening" "to"        "vh"        "presents"  "eeee"     
##  [6] "donna"     "summer"    "live"      "in"        "concert"  
## [11] "eeee"      "rip"      
## 
## [[2]]
## [1] "one"           "of"            "the"           "most"         
## [5] "interestingly" "informative"   "posts"         "yet"          
## [9] "eeee"         
## 
## [[3]]
## [1] "plzzz"    "followww" "me"       "i"        "love"     "u"       
## [7] "so"       "much"    
## 
## [[4]]
##  [1] "we"     "have"   "new"    "salads" "for"    "spring" "eeee"  
##  [8] "garden" "ranch"  "and"    "carrot" "ginger" "eeee"  
## 
## [[5]]
## [1] "beautiful" "sunday"    "beautiful" "brunch"    "eeee"      "happy"    
## [7] "easter"    "friends"   "eeee"
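The report does not define what a "fake" n-gram is. One plausible reading, given that the "eeee" marker never appears in the frequency tables further down, is that any n-gram containing the marker is dropped. The sketch below illustrates that assumption only. drop_marker_ngrams is a hypothetical name and this is not the clean_ngram() source.

```r
# Hypothetical helper: remove n-grams in which any component equals the marker.
drop_marker_ngrams <- function(ngrams, marker = "eeee", sep = "_") {
  parts <- strsplit(ngrams, sep, fixed = TRUE)
  has_marker <- vapply(parts, function(p) marker %in% p, logical(1))
  ngrams[!has_marker]
}

kept <- drop_marker_ngrams(c("brunch_eeee", "eeee_happy", "happy_easter"))
print(kept)
```

Matching whole components rather than substrings keeps ordinary words that merely contain the letters "eeee".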

2.3.2 Create and Trim Uni-dfm

Let’s create and trim the unigram dfm:

create_dfm(n=1,list_filenames,training_set= 80)

## [1] "-----> INIT: create_dfm(n:=1 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 1 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 1 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Creating dfm:uni.dfm"
## 
##    ... indexing documents: 2,669,356 documents
##    ... indexing features: 408,925 feature types
##    ... created a 2669356 x 408926 sparse dfm
##    ... complete. 
## Elapsed time: 19.09 seconds.
## [1] "... Saving dfm file:uni_dfm_80.RData"
## [1] "-----> FINISH: create_dfm(n:=1 training_set:=80): Running Time .......43 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4484302 239.5   14442815  771.4  14442815  771.4
## Vcells 84942988 648.1  424108907 3235.7 526561784 4017.4

trim_dfm(n=1,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: trim_dfm(n:=1 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: create_dfm(n:=1 training_set:=80)......."
## [1] "-----> FINISH: create_dfm(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Trim dfm:uni.dfm.trim"
## [1] "... Saving dfm clean: uni.dfm.clean .."
## [1] "-----> FINISH: trim_dfm(n:=1 training_set:=80 mincount:=1): Running Time .......23 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  4484314 239.5   14442815  771.4  14442815  771.4
## Vcells 84943119 648.1  271429700 2070.9 526561784 4017.4

2.3.3 Create Unigram Data Table with Tokens and Frequency

Finally, let’s create the unigram data table with tokens and frequency:

create_DT(n=1,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: create_DT(n:=1 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: trim_dfm(n:=1 training_set:=80 mincount:=1)......."
## [1] "... Loading dfm trim file:uni_dfm_trim_80.RData"
## [1] "-----> FINISH: trim_dfm(n:=1 training_set:=80 mincount:=1): Running Time .......3.6 seconds ..."
## [1] "... Creating DT:DT.uni"
## [1] "... Saving DT.uni .."
## [1] "-----> FINISH: create_DT(n:=1 training_set:=80 mincount:=1): Running Time .......5.3 seconds ..."

##           used (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 1832753 97.9    9243401  493.7  14442815  771.4
## Vcells 8368213 63.9  173715008 1325.4 526561784 4017.4

An example of unigrams frequency and tokens table content is:

kable(DT.uni[order(-freq)][1:10])

t1	freq
the	2360302
to	1543244
and	1280366
a	1268408
i	1211750
of	1037463
in	826934
you	683341
is	650369
for	621363
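A table with the same two columns (t1, freq) can be built directly from a token list with data.table, as in the minimal sketch below. The report's pipeline goes through a dfm instead, so this only illustrates the resulting structure. The toy tokens and the name DT.demo are hypothetical.

```r
library(data.table)

# Hypothetical toy tokens, in the same list-of-vectors shape as alltokens.
toy_tokens <- list(c("the", "cat", "sat", "on", "the", "mat"),
                   c("the", "dog"))

# One row per distinct token, with its count in freq.
DT.demo <- data.table(t1 = unlist(toy_tokens))[, .(freq = .N), by = t1]

print(DT.demo[order(-freq)][1:3])
```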

2.4 Create Bigrams Frequency Table

To create the bigram frequency table, we use alltokens to build the bigrams, clean them by removing fake bigrams, and finally create the dfm.

2.4.1 Create and Clean Bigrams

Let’s create and clean the bigrams:

create_ngram(n=2,list_filenames,training_set= 80)

## [1] "-----> INIT: create_ngram(n:=2 training_set:=80)......."
## [1] "-----> INIT: create_alltokens(training_set:=80)......."
## [1] "... Loading file alltokens_80.RData  ...."
## [1] "-----> INIT: create_alltokens(80): Running Time .......21 seconds ..."
## [1] "... Creating Ngram:bi.ngram"
## [1] "... Saving Ngram file:bi_ngram_80.RData"
## [1] "-----> FINISH: create_ngram(n:=2 training_set:=80): Running Time .......393 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 11747240 627.4   19381835 1035.2  19381835 1035.2
## Vcells 90318593 689.1  208538009 1591.1 526561784 4017.4

clean_ngram(n=2,list_filenames,training_set= 80)

## [1] "-----> INIT: clean_ngram(n:= 2 training_set:=80)......."
## [1] "-----> INIT: create_ngram(n:=2 training_set:=80)......."
## [1] "-----> FINISH: create_ngram(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Cleaning Ngram: bi.ngram"
## [1] "... Saving Ngram Cleaned file: bi_ngram_clean_80.RData"
## [1] "-----> INIT: clean_ngram(n:= 2 training_set:=80): Running Time .......870 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells 11438747 610.9   23298202 1244.3  23298202 1244.3
## Vcells 77725068 593.0  220500314 1682.3 526561784 4017.4

An example of ‘bi.ngram’ content is:

bi.ngram[1:5]

## [[1]]
##  [1] "listening_to"  "to_vh"         "vh_presents"   "presents_eeee"
##  [5] "eeee_donna"    "donna_summer"  "summer_live"   "live_in"      
##  [9] "in_concert"    "concert_eeee"  "eeee_rip"     
## 
## [[2]]
## [1] "one_of"                    "of_the"                   
## [3] "the_most"                  "most_interestingly"       
## [5] "interestingly_informative" "informative_posts"        
## [7] "posts_yet"                 "yet_eeee"                 
## 
## [[3]]
## [1] "plzzz_followww" "followww_me"    "me_i"           "i_love"        
## [5] "love_u"         "u_so"           "so_much"       
## 
## [[4]]
##  [1] "we_have"       "have_new"      "new_salads"    "salads_for"   
##  [5] "for_spring"    "spring_eeee"   "eeee_garden"   "garden_ranch" 
##  [9] "ranch_and"     "and_carrot"    "carrot_ginger" "ginger_eeee"  
## 
## [[5]]
## [1] "beautiful_sunday" "sunday_beautiful" "beautiful_brunch"
## [4] "brunch_eeee"      "eeee_happy"       "happy_easter"    
## [7] "easter_friends"   "friends_eeee"
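The underscore-joined n-grams shown above can be generated from a token vector with a few lines of base R. This is a sketch of the idea, not the code in create_ngram(), and make_ngrams is a hypothetical name.

```r
# Hypothetical helper: build all n-grams of size n from one token vector.
make_ngrams <- function(tokens, n, sep = "_") {
  len <- length(tokens)
  if (len < n) return(character(0))  # text too short for this n
  starts <- seq_len(len - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = sep),
         character(1))
}

tok <- c("plzzz", "followww", "me", "i", "love", "u", "so", "much")
bigrams <- make_ngrams(tok, 2)
print(bigrams)
```

Applied to the third text of the sample, this gives the same seven bigrams as element [[3]] of bi.ngram above. Changing n to 3 or 4 gives the trigram and quadgram forms.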

2.4.2 Create and Trim Bi-dfm

Let’s create and trim the bigram dfm:

create_dfm(n=2,list_filenames,training_set= 80)

## [1] "-----> INIT: create_dfm(n:=2 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 2 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 2 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Creating dfm:bi.dfm"
## 
##    ... indexing documents: 2,669,356 documents
##    ... indexing features: 6,947,567 feature types
##    ... created a 2669356 x 6947568 sparse dfm
##    ... complete. 
## Elapsed time: 48.19 seconds.
## [1] "... Saving dfm file:bi_dfm_80.RData"
## [1] "-----> FINISH: create_dfm(n:=2 training_set:=80): Running Time .......81 seconds ..."

##             used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  11449635 611.5   23298202 1244.3  23298202 1244.3
## Vcells 115966483 884.8  498714379 3804.9 526561784 4017.4

trim_dfm(n=2,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: trim_dfm(n:=2 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: create_dfm(n:=2 training_set:=80)......."
## [1] "-----> FINISH: create_dfm(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Trim dfm:bi.dfm.trim"
## [1] "... Saving dfm clean: bi.dfm.clean .."
## [1] "-----> FINISH: trim_dfm(n:=2 training_set:=80 mincount:=1): Running Time .......34 seconds ..."

##             used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  11449645 611.5   23298202 1244.3  23298202 1244.3
## Vcells 115966610 884.8  319177202 2435.2 526561784 4017.4

2.4.3 Create Bigram Data Table with Tokens and Frequency

Finally, let’s create the bigram data table with tokens and frequency:

create_DT(n=2,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: create_DT(n:=2 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: trim_dfm(n:=2 training_set:=80 mincount:=1)......."
## [1] "... Loading dfm trim file:bi_dfm_trim_80.RData"
## [1] "-----> FINISH: trim_dfm(n:=2 training_set:=80 mincount:=1): Running Time .......8.6 seconds ..."
## [1] "... Creating DT:DT.bi"
## [1] "... Saving DT.bi .."
## [1] "-----> FINISH: create_DT(n:=2 training_set:=80 mincount:=1): Running Time .......35 seconds ..."

##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  1835594  98.1   18638561  995.5  23298202 1244.3
## Vcells 53436665 407.7  165369663 1261.7 526561784 4017.4

An example of bigrams frequency and tokens table content is:

kable(DT.bi[order(-freq)][1:10])

t1	t2	freq
of	the	206830
in	the	196376
for	the	109707
to	the	108480
on	the	103227
to	be	95381
at	the	71282
i	have	64409
and	the	62197
i	was	61034
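The bigram table stores each word in its own column (t1, t2). Starting from underscore-joined names, one way to get that layout with data.table is sketched below. The counts and the name DT.bi.demo are hypothetical, and whether create_DT() works this way is an assumption.

```r
library(data.table)

# Hypothetical counts keyed by underscore-joined bigram.
counts <- c(of_the = 3L, in_the = 2L, to_be = 1L)

DT.bi.demo <- data.table(ngram = names(counts), freq = unname(counts))

# Split "w1_w2" into two columns, then drop the joined name.
DT.bi.demo[, c("t1", "t2") := tstrsplit(ngram, "_", fixed = TRUE)]
DT.bi.demo[, ngram := NULL]
setcolorder(DT.bi.demo, c("t1", "t2", "freq"))

print(DT.bi.demo[order(-freq)])
```

Keeping the words in separate columns makes later group-by operations on the prefix (t1) straightforward.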

2.5 Create Trigrams Frequency Table

To create the trigram frequency table, we use alltokens to build the trigrams, clean them by removing fake trigrams, and finally create the dfm.

2.5.1 Create and Clean Trigrams

Let’s create and clean the trigrams:

create_ngram(n=3,list_filenames,training_set= 80)

## [1] "-----> INIT: create_ngram(n:=3 training_set:=80)......."
## [1] "-----> INIT: create_alltokens(training_set:=80)......."
## [1] "... Loading file alltokens_80.RData  ...."
## [1] "-----> INIT: create_alltokens(80): Running Time .......21 seconds ..."
## [1] "... Creating Ngram:tri.ngram"
## [1] "... Saving Ngram file:tri_ngram_80.RData"
## [1] "-----> FINISH: create_ngram(n:=3 training_set:=80): Running Time .......528 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  28893214 1543.1   44920746 2399.1  44920746 2399.1
## Vcells 181876255 1387.7  412087965 3144.0 526561784 4017.4

clean_ngram(n=3,list_filenames,training_set= 80)

## [1] "-----> INIT: clean_ngram(n:= 3 training_set:=80)......."
## [1] "-----> INIT: create_ngram(n:=3 training_set:=80)......."
## [1] "-----> FINISH: create_ngram(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Cleaning Ngram: tri.ngram"
## [1] "... Saving Ngram Cleaned file: tri_ngram_clean_80.RData"
## [1] "-----> INIT: clean_ngram(n:= 3 training_set:=80): Running Time .......924 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  23955286 1279.4   44920746 2399.1  44920746 2399.1
## Vcells 149555134 1141.1  493528279 3765.4 552953330 4218.7

An example of ‘tri.ngram’ content is:

tri.ngram[1:5]

## [[1]]
##  [1] "listening_to_vh"     "to_vh_presents"      "vh_presents_eeee"   
##  [4] "presents_eeee_donna" "eeee_donna_summer"   "donna_summer_live"  
##  [7] "summer_live_in"      "live_in_concert"     "in_concert_eeee"    
## [10] "concert_eeee_rip"   
## 
## [[2]]
## [1] "one_of_the"                      "of_the_most"                    
## [3] "the_most_interestingly"          "most_interestingly_informative" 
## [5] "interestingly_informative_posts" "informative_posts_yet"          
## [7] "posts_yet_eeee"                 
## 
## [[3]]
## [1] "plzzz_followww_me" "followww_me_i"     "me_i_love"        
## [4] "i_love_u"          "love_u_so"         "u_so_much"        
## 
## [[4]]
##  [1] "we_have_new"        "have_new_salads"    "new_salads_for"    
##  [4] "salads_for_spring"  "for_spring_eeee"    "spring_eeee_garden"
##  [7] "eeee_garden_ranch"  "garden_ranch_and"   "ranch_and_carrot"  
## [10] "and_carrot_ginger"  "carrot_ginger_eeee"
## 
## [[5]]
## [1] "beautiful_sunday_beautiful" "sunday_beautiful_brunch"   
## [3] "beautiful_brunch_eeee"      "brunch_eeee_happy"         
## [5] "eeee_happy_easter"          "happy_easter_friends"      
## [7] "easter_friends_eeee"

2.5.2 Create and Trim Tri-dfm

Let’s create and trim the trigram dfm:

create_dfm(n=3,list_filenames,training_set= 80)

## [1] "-----> INIT: create_dfm(n:=3 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 3 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 3 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Creating dfm:tri.dfm"
## 
##    ... indexing documents: 2,669,356 documents
##    ... indexing features: 19,499,404 feature types
##    ... created a 2669356 x 19499405 sparse dfm
##    ... complete. 
## Elapsed time: 50.29 seconds.
## [1] "... Saving dfm file:tri_dfm_80.RData"
## [1] "-----> FINISH: create_dfm(n:=3 training_set:=80): Running Time .......109 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  24004392 1282.0   44920746 2399.1  44920746 2399.1
## Vcells 200222836 1527.6  571254934 4358.4 585518015 4467.2

trim_dfm(n=3,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: trim_dfm(n:=3 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: create_dfm(n:=3 training_set:=80)......."
## [1] "-----> FINISH: create_dfm(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Trim dfm:tri.dfm.trim"
## [1] "... Saving dfm clean: tri.dfm.clean .."
## [1] "-----> FINISH: trim_dfm(n:=3 training_set:=80 mincount:=1): Running Time .......63 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  24004402 1282.0   44920746 2399.1  44920746 2399.1
## Vcells 200222962 1527.6  571254934 4358.4 585518015 4467.2

2.5.3 Create Trigram Data Table with Tokens and Frequency

Finally, let’s create the trigram data table with tokens and frequency:

create_DT(n=3,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: create_DT(n:=3 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: trim_dfm(n:=3 training_set:=80 mincount:=1)......."
## [1] "... Loading dfm trim file:tri_dfm_trim_80.RData"
## [1] "-----> FINISH: trim_dfm(n:=3 training_set:=80 mincount:=1): Running Time .......19 seconds ..."
## [1] "... Creating DT:DT.tri"
## [1] "... Saving DT.tri .."
## [1] "-----> FINISH: create_DT(n:=3 training_set:=80 mincount:=1): Running Time .......105 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells   1835654   98.1   35936596 1919.3  44920746 2399.1
## Vcells 196978578 1502.9  457003947 3486.7 585518015 4467.2

An example of trigrams frequency and tokens table content is:

kable(DT.tri[order(-freq)][1:10])

t1	t2	t3	freq
thanks	for	the	19056
one	of	the	16899
a	lot	of	15576
i	want	to	10669
to	be	a	10502
going	to	be	10182
i	have	a	8812
looking	forward	to	8425
i	have	to	8279
it	was	a	8235

2.6 Create Quadgrams Frequency Table

To create the quadgram frequency table, we use alltokens to build the quadgrams, clean them by removing fake quadgrams, and finally create the dfm.

2.6.1 Create and Clean Quadgrams

Let’s create and clean the quadgrams:

create_ngram(n=4,list_filenames,training_set= 80)

## [1] "-----> INIT: create_ngram(n:=4 training_set:=80)......."
## [1] "-----> INIT: create_alltokens(training_set:=80)......."
## [1] "... Loading file alltokens_80.RData  ...."
## [1] "-----> INIT: create_alltokens(80): Running Time .......22 seconds ..."
## [1] "... Creating Ngram:quad.ngram"
## [1] "... Saving Ngram file:quad_ngram_80.RData"
## [1] "-----> FINISH: create_ngram(n:=4 training_set:=80): Running Time .......604 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  44452070 2374.0   71835060 3836.5  59829217 3195.3
## Vcells 353594819 2697.8  658261683 5022.2 585518015 4467.2

clean_ngram(n=4,list_filenames,training_set= 80)

## [1] "-----> INIT: clean_ngram(n:= 4 training_set:=80)......."
## [1] "-----> INIT: create_ngram(n:=4 training_set:=80)......."
## [1] "-----> FINISH: create_ngram(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Cleaning Ngram: quad.ngram"
## [1] "... Saving Ngram Cleaned file: quad_ngram_clean_80.RData"
## [1] "-----> INIT: clean_ngram(n:= 4 training_set:=80): Running Time .......1023 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  31290638 1671.2   71835060 3836.5  71835060 3836.5
## Vcells 281834779 2150.3  891387480 6800.8 782282630 5968.4

An example of ‘quad.ngram’ content is:

quad.ngram[1:5]

## [[1]]
## [1] "listening_to_vh_presents"   "to_vh_presents_eeee"       
## [3] "vh_presents_eeee_donna"     "presents_eeee_donna_summer"
## [5] "eeee_donna_summer_live"     "donna_summer_live_in"      
## [7] "summer_live_in_concert"     "live_in_concert_eeee"      
## [9] "in_concert_eeee_rip"       
## 
## [[2]]
## [1] "one_of_the_most"                     
## [2] "of_the_most_interestingly"           
## [3] "the_most_interestingly_informative"  
## [4] "most_interestingly_informative_posts"
## [5] "interestingly_informative_posts_yet" 
## [6] "informative_posts_yet_eeee"          
## 
## [[3]]
## [1] "plzzz_followww_me_i" "followww_me_i_love"  "me_i_love_u"        
## [4] "i_love_u_so"         "love_u_so_much"     
## 
## [[4]]
##  [1] "we_have_new_salads"       "have_new_salads_for"     
##  [3] "new_salads_for_spring"    "salads_for_spring_eeee"  
##  [5] "for_spring_eeee_garden"   "spring_eeee_garden_ranch"
##  [7] "eeee_garden_ranch_and"    "garden_ranch_and_carrot" 
##  [9] "ranch_and_carrot_ginger"  "and_carrot_ginger_eeee"  
## 
## [[5]]
## [1] "beautiful_sunday_beautiful_brunch" "sunday_beautiful_brunch_eeee"     
## [3] "beautiful_brunch_eeee_happy"       "brunch_eeee_happy_easter"         
## [5] "eeee_happy_easter_friends"         "happy_easter_friends_eeee"

2.6.2 Create and Trim Quad-dfm

Let’s create and trim the quadgram dfm:

create_dfm(n=4,list_filenames,training_set= 80)

## [1] "-----> INIT: create_dfm(n:=4 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 4 training_set:=80)......."
## [1] "-----> INIT: clean_ngram(n:= 4 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Creating dfm:quad.dfm"
## 
##    ... indexing documents: 2,669,356 documents
##    ... indexing features: 26,919,002 feature types
##    ... created a 2669356 x 26919003 sparse dfm
##    ... complete. 
## Elapsed time: 27.78 seconds.
## [1] "... Saving dfm file:quad_dfm_80.RData"
## [1] "-----> FINISH: create_dfm(n:=4 training_set:=80): Running Time .......120 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  31424051 1678.3   71835060 3836.5  71835060 3836.5
## Vcells 341047952 2602.0  891387480 6800.8 868194854 6623.9

trim_dfm(n=4,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: trim_dfm(n:=4 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: create_dfm(n:=4 training_set:=80)......."
## [1] "-----> FINISH: create_dfm(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "... Trim dfm:quad.dfm.trim"
## [1] "... Saving dfm clean: quad.dfm.clean .."
## [1] "-----> FINISH: trim_dfm(n:=4 training_set:=80 mincount:=1): Running Time .......101 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  31424061 1678.3   71835060 3836.5  71835060 3836.5
## Vcells 341048082 2602.0  891387480 6800.8 868194854 6623.9

2.6.3 Create Quadgram Data Table with Tokens and Frequency

Finally, let’s create the quadgram data table with tokens and frequency:

create_DT(n=4,list_filenames,training_set= 80,mincount=1)

## [1] "-----> INIT: create_DT(n:=4 training_set:=80 mincount:=1)......."
## [1] "-----> INIT: trim_dfm(n:=4 training_set:=80 mincount:=1)......."
## [1] "... Loading dfm trim file:quad_dfm_trim_80.RData"
## [1] "-----> FINISH: trim_dfm(n:=4 training_set:=80 mincount:=1): Running Time .......25 seconds ..."
## [1] "... Creating DT:DT.quad"
## [1] "... Saving DT.quad ..."
## [1] "-----> FINISH: create_DT(n:=4 training_set:=80 mincount:=1): Running Time .......247 seconds ..."

##             used   (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  28754719 1535.7   71835060 3836.5  71835060 3836.5
## Vcells 536919939 4096.4  891387480 6800.8 886870239 6766.3

An example of quadgrams frequency and tokens table content is:

kable(DT.quad[order(-freq)][1:10])

t1	t2	t3	t4	freq
thanks	for	the	follow	5090
the	end	of	the	4050
the	rest	of	the	3727
at	the	end	of	3427
for	the	first	time	3030
at	the	same	time	2863
is	going	to	be	2816
thanks	for	the	rt	2685
thank	you	for	the	2540
can’t	wait	to	see	2366

3. Calculate Probability for Each Ngram

Let’s calculate the Kneser-Ney probability for each n-gram:

3.1 Create Unigram Kneser-Ney Probability Table

calculate_prob_kn(n=1,training_set=80,p1=1)

## [1] "-----> INIT: calculate_prob_kn(n:=1 training_set:=80 p1:=1)......."
## [1] "... Loading DT Prob Final file: DT_uni_prob_80.RData"

An example of unigrams probability table content is:

kable(DT.uni.prob.final[order(-freq1)][1:10])

t1	freq1	prob
the	2360302	0.0427344
to	1543244	0.0279412
and	1280366	0.0231816
a	1268408	0.0229651
i	1211750	0.0219393
of	1037463	0.0187838
in	826934	0.0149720
you	683341	0.0123722
is	650369	0.0117752
for	621363	0.0112501
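A quick arithmetic check on the first two rows suggests that prob is proportional to freq1, which is what a plain relative-frequency estimate (freq1 divided by the total token count) would produce. The report does not state how the unigram probability is computed, so this reading is an inference from the printed values. The variable names below are hypothetical.

```r
# Values copied from the first two rows of the table above.
freq1 <- c(the = 2360302, to = 1543244)
prob  <- c(the = 0.0427344, to = 0.0279412)

# If prob = freq1 / total, both rows imply (nearly) the same total.
implied_total <- freq1 / prob
print(round(implied_total))

# The ratio between rows should then match in both columns.
freq_ratio <- unname(freq1["to"] / freq1["the"])
prob_ratio <- unname(prob["to"] / prob["the"])
print(c(freq_ratio = freq_ratio, prob_ratio = prob_ratio))
```

Both rows imply a total of roughly 55.2 million tokens, and the small difference between them is within the rounding of the printed probabilities.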

3.2 Create Bigram Kneser-Ney Probability Table

calculate_prob_kn(n=2,training_set=80,p1=1)

## [1] "-----> INIT: calculate_prob_kn(n:=2 training_set:=80 p1:=1)......."
## [1] "-----> INIT: load_DT_prob_tables(n:=2 p1:=1 training_set:=80)......."
## [1] "-----> INIT: load_DT_prob_table(n:=1 p1:=1 training_set:=80)......."
## [1] "... Loading DT Prob Temp File: DT_uni_prob_temp_80.RData"
## [1] "-----> FINISH: load_DT_prob_table(n:=1 p1:=1 training_set:=80): Running Time .......0.31 seconds ..."
## [1] "-----> INIT: load_DT_prob_table(n:=2 p1:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_table(n:=2 p1:=1 training_set:=80): Running Time .......0.34 seconds ..."
## [1] "-----> FINISH: load_DT_prob_tables(n:=2 p1:=1 training_set:=80): Running Time .......0.65 seconds ..."
## [1] "... Calculating DT Prob Table:DT.bi.prob"
## [1] "---> Calculating and Adding neccesary values for Bigrams High Order Prob Calculation: Pkn(t1 t2) ..."
## [1] "... Calculating: sum(w) c(t1 w) = sum.freq2(t1) ..."
## [1] "... Calculating: N1+(t1 *) = n12(t1) ..."
## [1] "... Adding to Bigrams Table: pkn12(t2) ..."
## [1] "--> Calculating Kneser-ney Prob for High Order Bigrams ..."
## [1] "    Pkn(t1 t2) = max{ c(t1 t2) - D2, 0 } / (sum(w) c(t1 w)) + "
## [1] "                    D2 / (sum(w) c(t1 w)) * N1+(t1 *) x Pknr (t2) ..."
## [1] "... Saving DT probability Temp file:DT_bi_prob_temp_80.RData"
## [1] "... Saving DT probability final file:DT_bi_prob_final_80.RData"
## [1] "-----> FINISH: calculate_prob_kn(n:=2 training_set:=80 p1:=1): Running Time .......47 seconds ..."

An example of bigrams probability table content is:

kable(DT.bi.prob.final[order(-freq2)][1:10])

t1	t2	freq2	prob
of	the	206830	0.2022995
in	the	196376	0.2468229
for	the	109707	0.1809920
to	the	108480	0.0714742
on	the	103227	0.2364533
to	be	95381	0.0627854
at	the	71282	0.2570547
i	have	64409	0.0535525
and	the	62197	0.0493147
i	was	61034	0.0507584
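The bigram formula printed in the log can be tried on a toy table in base R. The column names sum.freq2 and n12 follow the log, while the data, the discount D2 = 0.75 and the column p.cont are assumptions. In particular, p.cont uses the textbook continuation probability N1+(* t2) divided by the number of bigram types, which may differ from the lower-order value (pkn12) that the report's code uses.

```r
# Hypothetical toy bigram counts; each row is one distinct bigram type.
bi <- data.frame(
  t1   = c("i", "i", "we", "we", "you"),
  t2   = c("have", "was", "have", "were", "have"),
  freq = c(4, 2, 3, 1, 2),
  stringsAsFactors = FALSE
)
D2 <- 0.75  # assumed discount; the report does not print its value

bi$sum.freq2 <- ave(bi$freq, bi$t1, FUN = sum)     # sum(w) c(t1 w)
bi$n12       <- ave(bi$freq, bi$t1, FUN = length)  # N1+(t1 *)
# Assumed lower-order term: N1+(* t2) / number of bigram types.
bi$p.cont    <- ave(bi$freq, bi$t2, FUN = length) / nrow(bi)

# Pkn(t1 t2) = max{c(t1 t2) - D2, 0} / sum.freq2 + D2 / sum.freq2 * n12 * p.cont
bi$prob <- pmax(bi$freq - D2, 0) / bi$sum.freq2 +
  D2 / bi$sum.freq2 * bi$n12 * bi$p.cont

print(bi[, c("t1", "t2", "freq", "prob")])
```

For the row "i have" this gives 3.25/6 + (0.75/6) * 2 * 0.6, about 0.692. The probabilities of the observed continuations of a prefix sum to less than 1 because the discounted mass is also shared with words never seen after that prefix.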

3.3 Create Trigram Kneser-Ney Probability Table

calculate_prob_kn(n=3,training_set=80,p1=1)

## [1] "-----> INIT: calculate_prob_kn(n:=3 training_set:=80 p1:=1)......."
## [1] "-----> INIT: load_DT_prob_tables(n:=3 p1:=1 training_set:=80)......."
## [1] "-----> INIT: load_DT_prob_table(n:=1 p1:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_table(n:=1 p1:=1 training_set:=80): Running Time .......0.02 seconds ..."
## [1] "-----> INIT: load_DT_prob_table(n:=2 p1:=1 training_set:=80)......."
## [1] "... Loading DT Prob Temp File: DT_bi_prob_temp_80.RData"
## [1] "-----> FINISH: load_DT_prob_table(n:=2 p1:=1 training_set:=80): Running Time .......8.1 seconds ..."
## [1] "-----> INIT: load_DT_prob_table(n:=3 p1:=1 training_set:=80)......."
## [1] "-----> INIT: load_DT_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> FINISH: load_DT_prob_table(n:=3 p1:=1 training_set:=80): Running Time .......3 seconds ..."
## [1] "-----> FINISH: load_DT_prob_tables(n:=3 p1:=1 training_set:=80): Running Time .......11 seconds ..."
## [1] "... Calculating DT Prob Table:DT.tri.prob"
## [1] "---> Calculating and Adding neccesary values for Trigrams High Order Prob Calculation: Pkn(t1 t2 t3) ..."
## [1] "... Calculating: sum(w) c(t1 t2 w) = sum.freq3(t1 t2) ..."
## [1] "... Calculating: N1+(t1 t2 *) = n22(t1 t2) ..."
## [1] "... Calculating and Adding neccesary values for Bigrams Low Order Prob Calculation: Pknr(t2 t3)..."
## [1] "...... Calculating: N1+(* t2 t3) = n21(t2 t3) ..."
## [1] "...... Calculating: sum(w) N1+(* t2 w) = sum.n21(t2) ..."
## [1] "...... Adding to Trigrams Table: n12(t2) ..."
## [1] "...... Adding to Trigrams Table: pkn12(t3) ..."
## [1] "...... Calculating Kneser-ney Prob for Low Order Bigrams = pkn22(t2 t3)..."
## [1] "...... Pknr(t2 t3) = max{ N1+(t2 t3) - D2, 0 } / (sum(w) N1+(* t2 w)) + "
## [1] "                     D2 / (sum(w) N1+(* t2 w)) * N1+(t2 *) x Pknr(t3) ..."
## [1] "--> Calculating Kneser-ney Prob for High Order Trigrams = pkn31(t1 t2 t3)..."
## [1] "    Pkn(t1 t2 t3) = max{ c(t1 t2 t3) - D3, 0 } / (sum(w) c(t1 t2 w)) + "
## [1] "                    D3 / (sum(w) c(t1 t2 w)) * N1+(t1 t2 *) x Pknr (t2 t3)"
## [1] "... Saving DT probability Temp file:DT_tri_prob_temp_80.RData"
## [1] "... Saving DT probability final file:DT_tri_prob_final_80.RData"
## [1] "-----> FINISH: calculate_prob_kn(n:=3 training_set:=80 p1:=1): Running Time .......281 seconds ..."

An example of trigrams probability table content is:

kable(DT.tri.prob.final[order(-freq3)][1:10])

t1	t2	t3	freq3	prob
thanks	for	the	19056	0.5311848
one	of	the	16899	0.4328772
a	lot	of	15576	0.6659718
i	want	to	10669	0.5680719
to	be	a	10502	0.1148977
going	to	be	10182	0.2120998
i	have	a	8812	0.1390641
looking	forward	to	8425	0.9848397
i	have	to	8279	0.1306504
it	was	a	8235	0.1507128
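The lower-order step in the log relies on continuation counts rather than raw frequencies: N1+(* t2 t3) is the number of distinct words seen before the pair (t2 t3). The sketch below computes these counts on a toy trigram table with base R. The names n21 and sum.n21 follow the log, while the data, the discount value and the column discounted are assumptions. It covers only the first (discounted) term of Pknr(t2 t3), not the full recursion.

```r
# Hypothetical toy table; each row is one distinct trigram type.
tri <- data.frame(
  t1 = c("thanks", "thank", "one", "all", "but", "a"),
  t2 = c("for", "for", "of", "of", "of", "lot"),
  t3 = c("the", "the", "the", "the", "course", "of"),
  stringsAsFactors = FALSE
)

ones <- rep(1, nrow(tri))
tri$n21     <- ave(ones, tri$t2, tri$t3, FUN = sum)  # N1+(* t2 t3)
tri$sum.n21 <- ave(ones, tri$t2, FUN = sum)          # sum(w) N1+(* t2 w)

D2 <- 0.75  # assumed discount; the report does not print its value
tri$discounted <- pmax(tri$n21 - D2, 0) / tri$sum.n21  # first term of Pknr(t2 t3)

print(tri)
```

Because every row is a distinct trigram, counting rows per (t2, t3) group is the same as counting distinct preceding words.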

3.4 Create Quadgram Kneser-Ney Probability Table

calculate_prob_kn(n=4,training_set=80,p1=1)

## [1] "-----> INIT: calculate_prob_kn(n:=4 training_set:=80 p1:=1)......."
## [1] "-----> INIT: load_DT_prob_tables(n:=4 p1:=1 training_set:=80)......."
## [1] "-----> INIT: load_DT_prob_table(n:=3 p1:=1 training_set:=80)......."
## [1] "... Loading DT Prob Temp File: DT_tri_prob_temp_80.RData"
## [1] "-----> FINISH: load_DT_prob_table(n:=3 p1:=1 training_set:=80): Running Time .......34 seconds ..."
## [1] "-----> INIT: load_DT_prob_table(n:=4 p1:=1 training_set:=80)......."
## [1] "-----> INIT: load_DT_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> FINISH: load_DT_prob_table(n:=4 p1:=1 training_set:=80): Running Time .......11 seconds ..."
## [1] "-----> FINISH: load_DT_prob_tables(n:=4 p1:=1 training_set:=80): Running Time .......45 seconds ..."
## [1] "... Calculating DT Prob Table:DT.quad.prob.final"
## [1] "---> Calculating and Adding neccesary values for Quadgrams High Order Prob Calculation: Pkn(t1 t2 t3 t4) ..."
## [1] "... Calculating: sum(w) c(t1 t2 t3 w) = sum.freq4(t1 t2 t3) ..."
## [1] "... Calculating: N1+(t1 t2 t3 *) = n32(t1 t2 t3) ..."
## [1] "... Calculating and Adding neccesary values for Trigrams Low Order Prob Calculation: Pknr(t2 t3 t4)..."
## [1] "...... Calculating: N1+(* t2 t3 t4) = n31(t2 t3 t4) ..."
## [1] "...... Calculating: sum(w) N1+(* t2 t3 w) = sum.n31(t2 t3) ..."
## [1] "...... Adding to Quadgrams Table: n22(t2 t3) ..."
## [1] "...... Adding to Quadgrams Table: pkn22(t3 t4) ..."
## [1] "...... Calculating Kneser-ney Prob for Low Order Trigrams = pkn32(t2 t3 t4)..."
## [1] "...... Pknr(t2 t3 t4) = max{ N1+(t2 t3 t4) - D3, 0 } / (sum(w) N1+(* t2 t3 w)) + "
## [1] "                        D3 / (sum(w) N1+(* t2 t3 w)) * N1+(t2 t3 *) x Pknr(t3 t4) ..."
## [1] "--> Calculating Kneser-ney Prob for High Order Quadgrams = pkn41(t1 t2 t3 t4)..."
## [1] "    Pkn(t1 t2 t3 t4) = max{ c(t1 t2 t3 t4) - D4, 0 } / (sum(w) c(t1 t2 t3 w) + "
## [1] "                       D4 / (sum(w) c(t1 t2 t3 w)) * N1+(t1 t2 t3 *) x Pknr (t2 t3 t4)"
## [1] "... Saving DT probability Temp file:DT_quad_prob_temp_80.RData"
## [1] "... Saving DT probability final file:DT_quad_prob_final_80.RData"
## [1] "-----> FINISH: calculate_prob_kn(n:=4 training_set:=80 p1:=1): Running Time .......455 seconds ..."

The ten most frequent quadgrams in the probability table are:

kable(DT.quad.prob.final[order(-freq4)][1:10])

t1	t2	t3	t4	freq4	prob
thanks	for	the	follow	5090	0.2758026
the	end	of	the	4050	0.5268055
the	rest	of	the	3727	0.5764063
at	the	end	of	3427	0.8932204
for	the	first	time	3030	0.7775625
at	the	same	time	2863	0.8641693
is	going	to	be	2816	0.4511452
thanks	for	the	rt	2685	0.1454642
thank	you	for	the	2540	0.3158351
can’t	wait	to	see	2366	0.3605652
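
The count terms named in the log above (sum.freq4, n32, n31) can be derived from a quadgram frequency table with grouped aggregations. The sketch below uses base R on a toy table; the rows, the frequencies, and the variable names are made up for illustration, and the sourced scripts may compute these values differently (for example with data.table grouping).

```r
# Toy quadgram table (hypothetical rows and counts).
quad <- data.frame(
  t1 = c("at", "by", "near", "at", "at"),
  t2 = c("the", "the", "the", "the", "the"),
  t3 = c("end", "end", "end", "same", "end"),
  t4 = c("of", "of", "of", "time", "zone"),
  freq4 = c(5, 2, 1, 4, 1),
  stringsAsFactors = FALSE
)

prefix <- paste(quad$t1, quad$t2, quad$t3)  # context t1 t2 t3
suffix <- paste(quad$t2, quad$t3, quad$t4)  # lower-order ngram t2 t3 t4

# sum(w) c(t1 t2 t3 w): total frequency of each context
sum_freq4 <- tapply(quad$freq4, prefix, sum)
# N1+(t1 t2 t3 *): number of distinct words following each context
n32 <- tapply(quad$t4, prefix, function(w) length(unique(w)))
# N1+(* t2 t3 t4): number of distinct words preceding each trigram
n31 <- tapply(quad$t1, suffix, function(w) length(unique(w)))

print(sum_freq4)
print(n32)
print(n31)
```

Here "the end of" is preceded by three distinct words, so its continuation count is 3 even though its raw frequency (8) is higher. The lower-order Kneser-Ney estimate relies on that difference.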

4. Ngrams Probability Frequency Table Analysis

The following table summarizes how ngram frequencies are distributed in each probability table:

DT.prob.freq <- DT_prob_freq(training_set=80)

## [1] "-----> INIT: DT_prob_freq(training_set:=80)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> FINISH: DT_prob_freq(training_set:=80): Running Time .......46 seconds ..."

kable(DT.prob.freq)

Ngram	Freq	Amount	Percent
1	Total Ngrams	408926	100.00000
1	Min Freq == 1	220031	53.80705
1	Freq == 2	50423	12.33059
1	Freq == 3	23894	5.84311
1	Freq == 4	14795	3.61801
1	Freq == 5	10085	2.46622
1	Max Freq == 2360302	1	0.00024
1	Freq <= 2	270454	66.13764
1	Freq <= 3	294348	71.98075
1	Freq <= 4	309143	75.59876
1	Freq <= 5	319228	78.06498
1	Freq >= 6	89698	21.93502
2	Total Ngrams	6947568	100.00000
2	Min Freq == 1	4823856	69.43230
2	Freq == 2	821013	11.81727
2	Freq == 3	342442	4.92895
2	Freq == 4	191209	2.75217
2	Freq == 5	124632	1.79389
2	Max Freq == 206830	1	1e-05
2	Freq <= 2	5644869	81.24957
2	Freq <= 3	5987311	86.17852
2	Freq <= 4	6178520	88.93069
2	Freq <= 5	6303152	90.72458
2	Freq >= 6	644416	9.27542
3	Total Ngrams	19499405	100.00000
3	Min Freq == 1	16151800	82.83227
3	Freq == 2	1610852	8.26103
3	Freq == 3	572366	2.93530
3	Freq == 4	292746	1.50131
3	Freq == 5	177797	0.91181
3	Max Freq == 19056	1	1e-05
3	Freq <= 2	17762652	91.09330
3	Freq <= 3	18335018	94.02860
3	Freq <= 4	18627764	95.52991
3	Freq <= 5	18805561	96.44172
3	Freq >= 6	693844	3.55828
4	Total Ngrams	26919003	100.00000
4	Min Freq == 1	24657722	91.59969
4	Freq == 2	1288670	4.78721
4	Freq == 3	383326	1.42400
4	Freq == 4	178843	0.66437
4	Freq == 5	101637	0.37757
4	Max Freq == 5090	1	0.00000
4	Freq <= 2	25946392	96.38690
4	Freq <= 3	26329718	97.81090
4	Freq <= 4	26508561	98.47527
4	Freq <= 5	26610198	98.85284
4	Freq >= 6	308805	1.14716
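
Each Percent value in the table is Amount divided by Total Ngrams, times 100 (for unigrams, 220031 / 408926 gives roughly 53.8%). The sketch below builds the same kind of summary for a toy frequency vector. The name freq_summary and its arguments are hypothetical, and this is not the DT_prob_freq implementation.

```r
# Summarise how many ngrams occur exactly k times (k = 1..max_k)
# and how many occur more than max_k times.
freq_summary <- function(freq, max_k = 5) {
  total <- length(freq)
  exact <- vapply(seq_len(max_k), function(k) sum(freq == k), numeric(1))
  amount <- c(exact, sum(freq > max_k))
  data.frame(
    Freq = c(paste("Freq ==", seq_len(max_k)), paste("Freq >=", max_k + 1)),
    Amount = amount,
    Percent = round(100 * amount / total, 5),
    stringsAsFactors = FALSE
  )
}

# Toy frequencies for ten ngrams (made-up values).
toy_freq <- c(1, 1, 1, 1, 2, 2, 3, 7, 9, 12)
res <- freq_summary(toy_freq)
print(res)
```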

5. Singleton Probability Table

We will create two reduced sets of probability tables for the Shiny app, removing:

Ngrams with frequency == 1
Ngrams with frequency < 5

DT.prob.sing <- DT_prob_singletons(training_set=80)

## [1] "-----> INIT: DT_prob_singletons(training_set:=80)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> FINISH: DT_prob_singletons(training_set:=80): Running Time .......45 seconds ..."

kable(DT.prob.sing)

Ngram	Object Size All Freq (Mbytes)	Object Size Freq > 1 (Mbytes)	Percent Freq > 1	Object Size Freq >= 5 (Mbytes)	Percent Freq >= 5
1	31.33272	13.62089	43.47177	7.18066	22.91744
2	272.76759	73.72966	27.03021	27.44813	10.06283
3	863.69994	135.00061	15.63050	35.86026	4.15194
4	1389.64787	109.46494	7.87717	20.42956	1.47012
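
The pruning behind these sizes can be illustrated with a small base R sketch: drop rows below a frequency threshold and measure the result with object.size. The helper names prune_ngrams and size_mb are hypothetical, the last two toy rows are made up, and DT_prob_singletons may work differently.

```r
# Keep only ngrams whose frequency is at least min_freq.
prune_ngrams <- function(tbl, min_freq) {
  tbl[tbl$freq >= min_freq, , drop = FALSE]
}

# In-memory size of an R object in megabytes.
size_mb <- function(x) as.numeric(object.size(x)) / 1024^2

# Toy table: three frequent trigrams plus two invented rare ones.
toy <- data.frame(
  ngram = c("thanks for the", "one of the", "a lot of", "zzz qqq xxx", "foo bar baz"),
  freq = c(19056, 16899, 15576, 1, 3),
  stringsAsFactors = FALSE
)

no_singletons <- prune_ngrams(toy, min_freq = 2)  # removes frequency == 1
freq_5_plus <- prune_ngrams(toy, min_freq = 5)    # removes frequency < 5

print(nrow(no_singletons))
print(nrow(freq_5_plus))
print(size_mb(toy))
```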

6. Prediction of Next Word

Let’s see some examples of prediction:

6.1 Example with Unigram

prediction1 <- predict_nextword(c("how"),p=0,n=5,training_set = 80)

## [1] "-----> INIT: predict_nextword( word:=(how), training_set:=80, prob:=0, n:=5)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: topn_predict( word:=(how), prob:=0, nun_words:=5, factor:=1)......."
## [1] "... Found: 5728  words ..."
##    word       prob
## 1:   to 0.11710652
## 2: much 0.06558854
## 3: many 0.04785503
## 4:    i 0.04610697
## 5:   do 0.04034563
## [1] "-----> topn_predict: Running Time .......0.48 seconds ..."
## [1] "-----> FINISH: predict_nextword: Running Time .......0.48 seconds ..."

kable(prediction1)

word	prob
to	0.1171065
much	0.0655885
many	0.0478550
i	0.0461070
do	0.0403456

6.2 Example with Bigram

prediction2 <- predict_nextword(c("how","are"),p=0,n=5,training_set = 80)

## [1] "-----> INIT: predict_nextword( word:=(how,are), training_set:=80, prob:=0, n:=5)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: topn_predict( word:=(how,are), prob:=0, nun_words:=5, factor:=1)......."
## [1] "...Found: 121  words ..."
##      word       prob
## 1:    you 0.73549958
## 2:      u 0.05091751
## 3: things 0.04909117
## 4:     we 0.02743726
## 5:    the 0.02378666
## [1] "-----> topn_predict: Running Time .......4.8 seconds ..."
## [1] "-----> FINISH: predict_nextword: Running Time .......4.8 seconds ..."

kable(prediction2)

word	prob
you	0.7354996
u	0.0509175
things	0.0490912
we	0.0274373
the	0.0237867

6.3 Example with Trigram

prediction3 <- predict_nextword(c("how","are","you"),p=0,n=5,training_set = 80)

## [1] "-----> INIT: predict_nextword( word:=(how,are,you), training_set:=80, prob:=0, n:=5)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: topn_predict( word:=(how,are,you), prob:=0, nun_words:=5, factor:=1)......."
## [1] "...Found: 278  words..."
##           word       prob
## 1:       doing 0.22016549
## 2:       today 0.08081740
## 3:     feeling 0.05787095
## 4:       going 0.04778157
## 5: celebrating 0.03711988
## [1] "-----> topn_predict: Running Time .......4.7 seconds ..."
## [1] "-----> FINISH: predict_nextword: Running Time .......4.7 seconds ..."

kable(prediction3)

word	prob
doing	0.2201655
today	0.0808174
feeling	0.0578709
going	0.0477816
celebrating	0.0371199
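
The lookup step in these examples amounts to filtering a probability table by the typed context and keeping the n most probable continuations. Below is a minimal base R sketch of that idea. The function top_n_words, the context column, and the reading of p as a minimum probability threshold are assumptions, not the predict_nextword implementation. The toy probabilities are copied from the tables above.

```r
# Return the n most probable next words for a given context.
top_n_words <- function(prob_table, context, n = 5, min_prob = 0) {
  keep <- prob_table$context == context & prob_table$prob >= min_prob
  hits <- prob_table[keep, c("word", "prob"), drop = FALSE]
  head(hits[order(-hits$prob), ], n)
}

# Toy probability table (values taken from the examples above).
toy <- data.frame(
  context = c("how are", "how are", "how are", "how", "how"),
  word = c("things", "you", "u", "to", "much"),
  prob = c(0.0490912, 0.7354996, 0.0509175, 0.1171065, 0.0655885),
  stringsAsFactors = FALSE
)

res <- top_n_words(toy, "how are", n = 2)
print(res)
```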

7. Prediction of Next Word using Regex

Let’s see some examples of prediction using regex:

prediction1 <- predict_nextword_regex(c("how","are",""),p=0,n=5,training_set = 80)

## [1] "-----> INIT: predict_nextword_regex( word:=(how,are,), training_set:=80, prob:=0, n:=5)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: topn_predict_regex( word:=(how,are,), prob:=0, nun_words:=5, factor:=1)......."
## [1] "...Found: 121  words ..."
##      word       prob
## 1:    you 0.73549958
## 2:      u 0.05091751
## 3: things 0.04909117
## 4:     we 0.02743726
## 5:    the 0.02378666
## [1] "-----> FINISH: topn_predict_regex: Running Time .......2.3 seconds ..."
## [1] "-----> FINISH: predict_nextword_regex: Running Time .......2.3 seconds ..."

prediction2 <- predict_nextword(c("how","are"),p=0,n=5,training_set = 80)

## [1] "-----> INIT: predict_nextword( word:=(how,are), training_set:=80, prob:=0, n:=5)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: topn_predict( word:=(how,are), prob:=0, nun_words:=5, factor:=1)......."
## [1] "...Found: 121  words ..."
##      word       prob
## 1:    you 0.73549958
## 2:      u 0.05091751
## 3: things 0.04909117
## 4:     we 0.02743726
## 5:    the 0.02378666
## [1] "-----> topn_predict: Running Time .......2.3 seconds ..."
## [1] "-----> FINISH: predict_nextword: Running Time .......2.3 seconds ..."

prediction3 <- predict_nextword_regex(c("how","are","y"),p=0,n=5,training_set = 80)

## [1] "-----> INIT: predict_nextword_regex( word:=(how,are,y), training_set:=80, prob:=0, n:=5)......."
## [1] "-----> INIT: load_DT_prob_final_table(n:=1 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=1 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=2 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=2 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=3 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=3 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: load_DT_prob_final_table(n:=4 training_set:=80)......."
## [1] "-----> FINISH: load_DT_prob_final_table(n:=4 training_set:=80): Running Time .......0 seconds ..."
## [1] "-----> INIT: topn_predict_regex( word:=(how,are,y), prob:=0, nun_words:=5, factor:=1)......."
## [1] "...Found: 19  words ..."
##     word        prob
## 1:   you 0.735499580
## 2:    ya 0.023784606
## 3:  your 0.015175273
## 4:  youu 0.002391419
## 5: y'all 0.001086965
## [1] "-----> FINISH: topn_predict_regex: Running Time .......2.3 seconds ..."
## [1] "-----> FINISH: predict_nextword_regex: Running Time .......2.3 seconds ..."

# Print the three prediction tables.
kable(prediction1)

word	prob
you	0.7354996
u	0.0509175
things	0.0490912
we	0.0274373
the	0.0237867

kable(prediction2)

word	prob
you	0.7354996
u	0.0509175
things	0.0490912
we	0.0274373
the	0.0237867

kable(prediction3)

word	prob
you	0.7354996
ya	0.0237846
your	0.0151753
youu	0.0023914
y’all	0.0010870
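
The regex variant can be pictured as the same candidate list restricted to words that start with the partially typed last word. The sketch below is an illustration only: filter_by_prefix is a hypothetical name, the partial word is assumed to contain no regex metacharacters, and probabilities are left unnormalized because "you" keeps the same probability in the outputs above. The toy values are copied from those outputs.

```r
# Keep candidates whose word starts with the partially typed text.
# An empty partial gives the pattern "^", which matches every word.
filter_by_prefix <- function(candidates, partial, n = 5) {
  keep <- grepl(paste0("^", partial), candidates$word)
  hits <- candidates[keep, , drop = FALSE]
  head(hits[order(-hits$prob), ], n)
}

candidates <- data.frame(
  word = c("you", "u", "things", "ya", "your"),
  prob = c(0.7354996, 0.0509175, 0.0490912, 0.0237846, 0.0151753),
  stringsAsFactors = FALSE
)

with_y <- filter_by_prefix(candidates, "y")
with_empty <- filter_by_prefix(candidates, "")
print(with_y)
print(with_empty)
```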