For this Natural Language Processing capstone project (partnered with SwiftKey), the goal is to create a predictive text model: given a sequence of words typed by a user, the model should suggest the most likely next word.
In this milestone report, exploratory data analysis is performed on three English-language corpora obtained by crawling three types of websites: news articles, blog articles, and tweets. The corpora are first loaded and summarized, then pre-processed to remove profane words and, for parts of the analysis, commonly used stop words. The frequencies of individual words, bigrams (pairs of words), and trigrams (triplets of words) are examined for each corpus, and bar plots are generated to visualize the relative frequency of the most common words, bigrams, and trigrams. Finally, a plan for using the data to train a text prediction algorithm is presented.
The data was initially downloaded from: Capstone Dataset
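For reproducibility, the download and extraction step can be scripted. A minimal sketch is shown below; the zip file URL is left as a placeholder to be taken from the Capstone Dataset link above, and the destination file name is illustrative only.
# Placeholder: substitute the zip file URL from the Capstone Dataset link above
datasetURL <- "<capstone-dataset-zip-url>"
if (!file.exists("capstone_dataset.zip")) {
    download.file(datasetURL, destfile = "capstone_dataset.zip", mode = "wb")
    unzip("capstone_dataset.zip")
}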
Consider the three data files en_US.news.txt, en_US.blogs.txt, and en_US.twitter.txt in the en_US folder.
Load the three .txt files, reading them in with UTF-8 encoding, and count the number of lines and number of words in each complete .txt file.
# Calculate number of lines and number of words in entire News .txt file
connectNews <- file("en_US.news.txt")
USNews_all <- readLines(connectNews, encoding = "UTF-8")
numLinesNews <- length(USNews_all)
numWordsNews <- sum(sapply(strsplit(USNews_all, "\\s+"), length))
close(connectNews)
# Calculate number of lines and number of words in entire Blogs .txt file
connectBlogs <- file("en_US.blogs.txt")
USBlogs_all <- readLines(connectBlogs, encoding = "UTF-8")
numLinesBlogs <- length(USBlogs_all)
numWordsBlogs <- sum(sapply(strsplit(USBlogs_all, "\\s+"), length))
close(connectBlogs)
# Calculate number of lines and number of words in entire Twitter .txt file
connectTwitter <- file("en_US.twitter.txt")
USTwitter_all <- readLines(connectTwitter, encoding = "UTF-8")
numLinesTwitter <- length(USTwitter_all)
numWordsTwitter <- sum(sapply(strsplit(USTwitter_all, "\\s+"), length))
close(connectTwitter)
Create a table of the total number of lines and total number of words in each file, prior to any processing:
# Make a table of line count and word count for each of the three files, before any processing
numLines <- matrix(c(numLinesNews, numLinesBlogs, numLinesTwitter,
                     numWordsNews, numWordsBlogs, numWordsTwitter,
                     round(numWordsNews/numLinesNews),
                     round(numWordsBlogs/numLinesBlogs),
                     round(numWordsTwitter/numLinesTwitter)),
                   ncol = 3, byrow = FALSE)
colnames(numLines) <- c("Line Count","Word Count","Avg Word Count per Line")
rownames(numLines) <- c("News","Blogs","Twitter")
numLines <- as.table(numLines)
numLines
## Line Count Word Count Avg Word Count per Line
## News 77259 2643969 34
## Blogs 899288 37334131 42
## Twitter 2360148 30373543 13
From the table above, en_US.twitter.txt has the largest number of lines; however, since each tweet is relatively short, its total word count is lower than that of the Blogs corpus. en_US.blogs.txt has both the highest average word count per line and the highest total word count.
Profanity should be removed from each of the three files, since profane words will not be included in the prediction model. A list of profane words is obtained from: List of Profane Words
After the profane words are removed, the number of words removed from each file is calculated, both as an absolute count and as a percentage of the original word count.
library(tm)
# Create list of profane words from the github link above
profane <- read.table("profanity.txt",sep="\n")$V1
# Remove profanities from News and calculate the number of profanities removed. Additionally, express the number of
# profanities removed as a percentage
USNews_all_NoProf <- removeWords(USNews_all,profane)
numWordsNews_NoProf <- sum(sapply(strsplit(USNews_all_NoProf,"\\s+"), length))
numWordsNews_removed <- numWordsNews - numWordsNews_NoProf
numWordsNews_Premoved <- numWordsNews_removed/numWordsNews*100
# Remove profanities from Blogs and calculate the number of profanities removed. Additionally, express the number of
# profanities removed as a percentage
USBlogs_all_NoProf <- removeWords(USBlogs_all,profane)
numWordsBlogs_NoProf <- sum(sapply(strsplit(USBlogs_all_NoProf,"\\s+"), length))
numWordsBlogs_removed <- numWordsBlogs - numWordsBlogs_NoProf
numWordsBlogs_Premoved <- numWordsBlogs_removed/numWordsBlogs*100
# Remove profanities from Twitter and calculate the number of profanities removed. Additionally, express the number of
# profanities removed as a percentage
USTwitter_all_NoProf <- removeWords(USTwitter_all,profane)
numWordsTwitter_NoProf <- sum(sapply(strsplit(USTwitter_all_NoProf,"\\s+"), length))
numWordsTwitter_removed <- numWordsTwitter - numWordsTwitter_NoProf
numWordsTwitter_Premoved <- numWordsTwitter_removed/numWordsTwitter*100
Create a table of the number of profane words removed in each file, expressed as an absolute number and also as a percentage of the original number of words:
# Make a table of profanities removed and percentage removed for each of the three files
numWords <- matrix(c(numWordsNews_removed,numWordsBlogs_removed,numWordsTwitter_removed,numWordsNews_Premoved,numWordsBlogs_Premoved,numWordsTwitter_Premoved),ncol=2, byrow=FALSE)
colnames(numWords) <- c("Words Removed","Percent Removed (%)")
rownames(numWords) <- c("News","Blogs","Twitter")
numWords <- as.table(format(numWords,scientific=FALSE,digits=1))
numWords
## Words Removed Percent Removed (%)
## News 470.00 0.02
## Blogs 18073.00 0.05
## Twitter 72927.00 0.24
From the above table, as expected very few profane words are removed from en_US.news.txt and en_US.blogs.txt, but a relatively high 0.24% of the words in en_US.twitter.txt are classified as profanity and removed. This is most likely because news and blog sites contain more professional content in which fewer profanities are used.
Remove non-standard characters from each file, such as curly double quotation marks, right single quotation marks, en dashes and em dashes.
library(qdap)
# Remove special characters using their Unicode code point
USNews_all_NoProf <- multigsub(c("\u201C","\u201D","\u2013","\u2014","\u2019"),c("","","","",""),USNews_all_NoProf)
USBlogs_all_NoProf <- multigsub(c("\u201C","\u201D","\u2013","\u2014","\u2019"),c("","","","",""),USBlogs_all_NoProf)
USTwitter_all_NoProf <- multigsub(c("\u201C","\u201D","\u2013","\u2014","\u2019"),c("","","","",""),USTwitter_all_NoProf)
For each of the three .txt files, take a subset of 20% of the data, since the .txt files are so large; using 20% balances computation speed against a sufficiently large sample size. Then split the randomly selected 20% subset into training, validation and testing sets, with the following split: 60% training, 20% validation and 20% testing.
All partitioning is randomized using the sample() function, which draws line indices at random for each .txt file. After setting the seed, a random 20% of the lines in each file is selected for use in this study. The lines in this 20% subset are then shuffled again and split into the training, validation and testing sets.
set.seed(12)
# Create a function that partitions a data set into training, validation and testing sets
partitionData <- function(curLines, US_all_NoProf){
    vecRand <- sample(curLines)
    numTraining <- floor(curLines*0.6)
    numValid <- floor(curLines*0.2)
    training <- US_all_NoProf[vecRand[1:numTraining]]
    valid <- US_all_NoProf[vecRand[(numTraining+1):(numTraining+numValid)]]
    testing <- US_all_NoProf[vecRand[(numTraining+numValid+1):curLines]]
    list(training, valid, testing)
}
# Partition the News data, after taking a random 20% subset
NewsList <- partitionData(floor(0.2*numLinesNews), USNews_all_NoProf[sample(numLinesNews, floor(0.2*numLinesNews))])
trainingNews <- NewsList[[1]]
validNews <- NewsList[[2]]
testingNews <- NewsList[[3]]
# Partition the Blogs data, after taking a random 20% subset
BlogsList <- partitionData(floor(0.2*numLinesBlogs), USBlogs_all_NoProf[sample(numLinesBlogs, floor(0.2*numLinesBlogs))])
trainingBlogs <- BlogsList[[1]]
validBlogs <- BlogsList[[2]]
testingBlogs <- BlogsList[[3]]
# Partition the Twitter data, after taking a random 20% subset
TwitterList <- partitionData(floor(0.2*numLinesTwitter), USTwitter_all_NoProf[sample(numLinesTwitter, floor(0.2*numLinesTwitter))])
trainingTwitter <- TwitterList[[1]]
validTwitter <- TwitterList[[2]]
testingTwitter <- TwitterList[[3]]
Tokenization is important in Natural Language Processing: it breaks the text into smaller chunks, or "tokens." In this report, single words, bigrams and trigrams are analyzed from the training sets of each .txt file. Further processing is applied to the training sets, such as removing punctuation. Single words are examined with and without stop words (the most commonly used words, such as "i", "me", "we", "have" and "from"), while bigrams and trigrams are analyzed only without stop words. Tables and graphs show the single words, bigrams and trigrams that appear most frequently.
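As a quick illustration of what these tokens look like, the tokenizers package (used later in this report) can split a made-up example sentence into words, bigrams and trigrams; the sentence below is purely illustrative.
library(tokenizers)
# A made-up example sentence, purely for illustration
exampleText <- "thanks for the follow have a great day"
tokenize_words(exampleText)                      # single words
tokenize_ngrams(exampleText, n = 2, n_min = 2)   # bigrams
tokenize_ngrams(exampleText, n = 3, n_min = 3)   # trigrams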
As part of the data processing, for each .txt file, transform all characters to lower case, remove punctuation and remove stop words if specified. Use functions in the tm package to manipulate the text data and count the frequency of single words that appear.
library(tm)
# Create a function that obtains the frequency of all words in a data set, with the option to filter out
# stop words or not
getWordFreq <- function(trainingData, rmStop, lowCut){
    curCorpus <- Corpus(VectorSource(trainingData))
    curCorpus <- tm_map(curCorpus, content_transformer(tolower))
    if (rmStop){
        curCorpus <- tm_map(curCorpus, removeWords, stopwords('english'))
    }
    curCorpus <- tm_map(curCorpus, removePunctuation)
    curWords <- TermDocumentMatrix(curCorpus, control = list(bounds = list(global = c(lowCut, Inf))))
    curWordsMatrix <- as.matrix(curWords)
    curWordFreq <- sort(rowSums(curWordsMatrix), decreasing = TRUE)
    curWordFreq
}
# Frequency of words for News, including common stop words
wordfreqNews <- getWordFreq(trainingNews,FALSE,500)
# Frequency of words for News, removing common stop words
wordfreqNewsNoSW <- getWordFreq(trainingNews,TRUE,200)
# Frequency of words for Blogs, including common stop words
wordfreqBlogs <- getWordFreq(trainingBlogs,FALSE,5000)
# Frequency of words for Blogs, removing common stop words
wordfreqBlogsNoSW <- getWordFreq(trainingBlogs,TRUE,2000)
# Frequency of words for Twitter, including common stop words
wordfreqTwitter <- getWordFreq(trainingTwitter,FALSE,6000)
# Frequency of words for Twitter, removing common stop words
wordfreqTwitterNoSW <- getWordFreq(trainingTwitter,TRUE,5000)
Show the 30 most common words in each file in a tabular form:
# Create a function that gets the top N most common words in a given set with and without stop words
getWordFreqTopN <- function(wordFreq1, wordFreq2, N){
    wordfreqN <- data.frame(word = names(wordFreq1[1:N]), freq = unname(wordFreq1[1:N]),
                            wordNoSW = names(wordFreq2[1:N]), freqNoSW = unname(wordFreq2[1:N]))
    wordfreqN
}
# Get the top 30 most common words for News
wordfreqNews30 <- getWordFreqTopN(wordfreqNews,wordfreqNewsNoSW,30)
# Get the top 30 most common words for Blogs
wordfreqBlogs30 <- getWordFreqTopN(wordfreqBlogs,wordfreqBlogsNoSW,30)
# Get the top 30 most common words for Twitter
wordfreqTwitter30 <- getWordFreqTopN(wordfreqTwitter,wordfreqTwitterNoSW,30)
wordfreqNews30
## word freq wordNoSW freqNoSW
## 1 the 17792 said 2277
## 2 and 8187 will 1008
## 3 for 3231 one 754
## 4 that 3069 new 628
## 5 with 2342 two 553
## 6 said 2277 can 543
## 7 was 2074 also 527
## 8 his 1375 year 509
## 9 from 1371 just 488
## 10 but 1358 years 485
## 11 are 1263 first 455
## 12 have 1252 time 452
## 13 its 1164 last 450
## 14 has 1121 like 434
## 15 this 1090 state 432
## 16 will 1008 people 420
## 17 they 1008 get 384
## 18 not 986 percent 335
## 19 who 945 now 333
## 20 you 866 million 312
## 21 about 850 school 309
## 22 were 823 three 302
## 23 more 796 city 293
## 24 their 779 many 292
## 25 had 757 going 290
## 26 one 751 game 290
## 27 when 722 back 288
## 28 would 676 home 285
## 29 out 666 says 274
## 30 new 628 may 272
wordfreqBlogs30
## word freq wordNoSW freqNoSW
## 1 the 221113 one 14690
## 2 and 129799 will 13426
## 3 that 54698 just 11894
## 4 for 43431 can 11850
## 5 you 35489 like 11690
## 6 with 34118 time 10670
## 7 was 33200 get 8414
## 8 this 30799 now 7357
## 9 have 26243 know 7353
## 10 but 24368 people 7055
## 11 are 22830 also 6636
## 12 not 20718 new 6483
## 13 from 17569 even 6118
## 14 all 17280 back 6094
## 15 they 16478 make 6083
## 16 one 14627 first 6080
## 17 about 13773 day 6022
## 18 will 13421 really 5944
## 19 its 13275 see 5909
## 20 what 13094 good 5866
## 21 out 13023 well 5803
## 22 his 12863 much 5783
## 23 had 12672 think 5716
## 24 her 12510 way 5595
## 25 when 12366 little 5525
## 26 your 12096 love 5284
## 27 just 11885 two 4859
## 28 can 11841 going 4780
## 29 like 11671 life 4696
## 30 there 11341 things 4670
wordfreqTwitter30
## word freq wordNoSW freqNoSW
## 1 the 112469 just 17921
## 2 you 65100 like 14616
## 3 and 52232 get 13309
## 4 for 46168 love 13023
## 5 that 28315 good 11960
## 6 with 20792 will 11216
## 7 your 20582 can 10776
## 8 have 20157 day 10775
## 9 this 19547 thanks 10555
## 10 are 19095 now 10034
## 11 just 17889 one 9820
## 12 but 14721 know 9501
## 13 not 14686 time 9235
## 14 its 14600 great 9191
## 15 like 14593 today 8491
## 16 all 14575 new 8237
## 17 was 14137 lol 8090
## 18 out 13612 see 7881
## 19 what 13316 back 6882
## 20 get 13293 got 6778
## 21 love 12994 going 6598
## 22 good 11920 think 6490
## 23 will 11200 people 6324
## 24 about 11145 need 6141
## 25 can 10762 want 5848
## 26 day 10713 happy 5798
## 27 dont 10695 make 5705
## 28 thanks 10526 follow 5684
## 29 from 10056 right 5439
## 30 now 9965 really 5417
Graph the 10 most common words (including and excluding stop words) for each of the three .txt files:
library(ggplot2)
library(patchwork)
# Create a function that generates a bar plot from a given data frame of word frequencies
getBarplot <- function(wordfreq, curTitle, SW, range){
    if (SW){
        wordfreq$word <- factor(wordfreq$word, levels = wordfreq$word)
        p1 <- ggplot(wordfreq[range,], aes(x = freq, y = word)) +
            geom_bar(stat = "identity", color = "salmon", fill = "salmon") +
            xlab("Frequency") + ylab("Word") + ggtitle(curTitle) +
            geom_text(aes(label = freq), size = 6, hjust = 0.6, vjust = 1) +
            theme(text = element_text(size = 25))
    } else {
        wordfreq$wordNoSW <- factor(wordfreq$wordNoSW, levels = wordfreq$wordNoSW)
        p1 <- ggplot(wordfreq[range,], aes(x = freqNoSW, y = wordNoSW)) +
            geom_bar(stat = "identity", color = "salmon", fill = "salmon") +
            xlab("Frequency") + ylab("Word") + ggtitle(curTitle) +
            geom_text(aes(label = freqNoSW), size = 6, hjust = 0.6, vjust = 1) +
            theme(text = element_text(size = 25))
    }
    p1
}
# Create plot objects for the case excluding stop words
newsNoSW <- getBarplot(wordfreqNews30,"Word Frequency excl. Stop \nWords For News \nin Training Set",FALSE,seq(from=1,to=10,by=1))
blogsNoSW <- getBarplot(wordfreqBlogs30,"Word Frequency excl. Stop \nWords For Blogs \nin Training Set",FALSE,seq(from=1,to=10,by=1))
twitterNoSW <- getBarplot(wordfreqTwitter30,"Word Frequency excl. Stop \nWords For Twitter \nin Training Set",FALSE,seq(from=1,to=10,by=1))
# Create plot objects for the case including stop words
newsSW <- getBarplot(wordfreqNews30,"Word Frequency incl. Stop \nWords For News \nin Training Set",TRUE,seq(from=1,to=10,by=1))
blogsSW <- getBarplot(wordfreqBlogs30,"Word Frequency incl. Stop \nWords For Blogs \nin Training Set",TRUE,seq(from=1,to=10,by=1))
twitterSW <- getBarplot(wordfreqTwitter30,"Word Frequency incl. Stop \nWords For Twitter \nin Training Set",TRUE,seq(from=1,to=10,by=1))
# Plot both graphs
newsNoSW + blogsNoSW + twitterNoSW
newsSW + blogsSW + twitterSW
As can be seen from the above graphs, in the instance where stop words are excluded, the following words are in the top 10 across all three corpora: “will,” “can,” and “just.”
In the instance where stop words are included, the following words are in the top 10 across all three corpora: "the," "and," "for," "that," and "with."
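These overlaps can be verified directly from the word-frequency vectors computed above; for example, a quick sketch:
# Words in the top 10 of all three corpora, excluding stop words
Reduce(intersect, list(names(wordfreqNewsNoSW)[1:10],
                       names(wordfreqBlogsNoSW)[1:10],
                       names(wordfreqTwitterNoSW)[1:10]))
# Words in the top 10 of all three corpora, including stop words
Reduce(intersect, list(names(wordfreqNews)[1:10],
                       names(wordfreqBlogs)[1:10],
                       names(wordfreqTwitter)[1:10]))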
As part of the data processing, for each .txt file, transform all characters to lower case, remove punctuation, remove stop words, remove empty strings, and remove numbers before forming bigrams and trigrams. Care is taken to split the text into sentences before creating the n-grams, to avoid n-grams that span sentence boundaries and therefore do not make sense. The most frequent bigrams and trigrams from each file are tabulated and plotted.
library(tokenizers)
library(dplyr)
library(tm)
# Create a function that returns the most frequent n-grams from a given data set while splitting sentences
getNgramFreq <- function(trainingData, N, lowCut, rmU){
    trainingData <- unlist(tokenize_sentences(trainingData))
    trainingData <- tolower(trainingData)
    trainingData <- removeNumbers(trainingData)
    trainingData <- removePunctuation(trainingData)
    if (rmU){
        trainingData <- removeWords(trainingData, "u")
    }
    trainingData <- stripWhitespace(trainingData)
    trainingData <- trainingData[trainingData != ""]
    ngrams <- unlist(tokenize_ngrams(trainingData, n = N, n_min = N))
    curNgramFreq <- as.data.frame(table(ngrams))
    curNgramFreq <- curNgramFreq %>%
        rename(ngram = ngrams, freq = Freq)
    curNgramFreq <- curNgramFreq[curNgramFreq$freq >= lowCut, ]
    curNgramFreq %>%
        arrange(desc(freq))
}
# Find most frequent bigrams from each data source
bigramfreqNews <- getNgramFreq(removeWords(tolower(trainingNews),stopwords("english")),2,25,TRUE)
bigramfreqBlogs <- getNgramFreq(removeWords(tolower(trainingBlogs),stopwords("english")),2,20,FALSE)
bigramfreqTwitter <- getNgramFreq(removeWords(tolower(trainingTwitter),stopwords("english")),2,300,FALSE)
# Find most frequent trigrams from each data source
trigramfreqNews <- getNgramFreq(removeWords(tolower(trainingNews),stopwords("english")),3,5,TRUE)
trigramfreqBlogs <- getNgramFreq(removeWords(tolower(trainingBlogs),stopwords("english")),3,20,FALSE)
trigramfreqTwitter <- getNgramFreq(removeWords(tolower(trainingTwitter),stopwords("english")),3,20,FALSE)
Show the 20 most common bigrams and trigrams in each file in tabular form, excluding stop words:
# Get the top 20 most common bigrams for News
bigramFreqNews20 <- bigramfreqNews[1:20,]
# Get the top 20 most common bigrams for Blogs
bigramFreqBlogs20 <- bigramfreqBlogs[1:20,]
# Get the top 20 most common bigrams for Twitter
bigramFreqTwitter20 <- bigramfreqTwitter[1:20,]
bigramFreqNews20
## ngram freq
## 1 last year 114
## 2 new york 95
## 3 st louis 80
## 4 high school 76
## 5 years ago 64
## 6 new jersey 62
## 7 last week 54
## 8 two years 46
## 9 officials said 40
## 10 health care 39
## 11 san francisco 38
## 12 first time 37
## 13 united states 37
## 14 next year 36
## 15 los angeles 34
## 16 police said 30
## 17 right now 30
## 18 three years 29
## 19 even though 28
## 20 four years 26
bigramFreqBlogs20
## ngram freq
## 1 years ago 627
## 2 right now 609
## 3 even though 569
## 4 new york 569
## 5 feel like 511
## 6 first time 498
## 7 can see 494
## 8 last year 484
## 9 make sure 482
## 10 dont know 463
## 11 last night 438
## 12 last week 427
## 13 one day 410
## 14 every day 391
## 15 can get 375
## 16 just like 357
## 17 one thing 325
## 18 many people 324
## 19 long time 313
## 20 look like 313
bigramFreqTwitter20
## ngram freq
## 1 right now 1994
## 2 last night 1346
## 3 looking forward 1080
## 4 happy birthday 1040
## 5 good morning 934
## 6 just got 890
## 7 feel like 832
## 8 good luck 792
## 9 thanks follow 734
## 10 follow back 726
## 11 looks like 714
## 12 can get 690
## 13 let know 671
## 14 mothers day 629
## 15 next week 577
## 16 sounds like 536
## 17 great day 535
## 18 make sure 495
## 19 please follow 481
## 20 thanks rt 476
# Get the top 20 most common trigrams for News
trigramFreqNews20 <- trigramfreqNews[1:20,]
# Get the top 20 most common trigrams for Blogs
trigramFreqBlogs20 <- trigramfreqBlogs[1:20,]
# Get the top 20 most common trigrams for Twitter
trigramFreqTwitter20 <- trigramfreqTwitter[1:20,]
trigramFreqNews20
## ngram freq
## 1 two years ago 14
## 2 pates fountain parks 11
## 3 president barack obama 11
## 4 st louis county 10
## 5 chief financial officer 9
## 6 new york city 9
## 7 classic pates fountain 8
## 8 first time since 7
## 9 new york times 7
## 10 past two years 7
## 11 chief operating officer 6
## 12 gov chris christie 6
## 13 gov john kasich 6
## 14 late last year 6
## 15 said written statement 6
## 16 us district court 6
## 17 will take place 6
## 18 world war ii 6
## 19 according court records 5
## 20 carter kick p 5
trigramFreqBlogs20
## ngram freq
## 1 new york city 90
## 2 new york times 68
## 3 amazon services llc 54
## 4 llc amazon eu 54
## 5 services llc amazon 54
## 6 new york ny 49
## 7 two years ago 44
## 8 couple weeks ago 42
## 9 every single day 40
## 10 please let know 40
## 11 incorporated item c 35
## 12 lord jesus christ 35
## 13 preheat oven degrees 35
## 14 world war ii 35
## 15 please feel free 34
## 16 let us know 33
## 17 long time ago 33
## 18 blah blah blah 32
## 19 item c pp 32
## 20 two weeks ago 32
trigramFreqTwitter20
## ngram freq
## 1 happy mothers day 375
## 2 let us know 289
## 3 happy new year 209
## 4 cinco de mayo 123
## 5 looking forward seeing 92
## 6 happy valentines day 80
## 7 st patricks day 71
## 8 keep good work 70
## 9 come see us 69
## 10 just got home 64
## 11 love love love 64
## 12 follow back please 61
## 13 just got back 61
## 14 please follow back 61
## 15 good morning everyone 59
## 16 new years eve 57
## 17 cake cake cake 53
## 18 please please please 51
## 19 dreams come true 50
## 20 happy th birthday 49
Graph the 10 most common bigrams and trigrams (excluding stop words) for each of the three files:
# Create a function that generates a barplot object from a given table of n-grams
getBarplotNgrams <- function(ngramfreqPlot, curTitle, bi_tri, range){
    ngramfreqPlot$ngram <- factor(ngramfreqPlot$ngram, levels = ngramfreqPlot$ngram)
    p1 <- ggplot(ngramfreqPlot[range,], aes(x = freq, y = ngram)) +
        geom_bar(stat = "identity", color = "cadetblue2", fill = "cadetblue2") +
        xlab("Frequency") + ylab(bi_tri) + ggtitle(curTitle) +
        geom_text(aes(label = freq), size = 6, hjust = 0.6, vjust = 1) +
        theme(text = element_text(size = 23))
    p1
}
# Create plot objects for the bigrams plot
newsBi <- getBarplotNgrams(bigramFreqNews20,"Bigram Frequency excl. \nStop Words For \nNews in Training \nSet","Bigram",seq(from=1,to=10,by=1))
blogsBi <- getBarplotNgrams(bigramFreqBlogs20,"Bigram Frequency excl. \nStop Words For \nBlogs in Training \nSet","Bigram",seq(from=1,to=10,by=1))
twitterBi <- getBarplotNgrams(bigramFreqTwitter20,"Bigram Frequency excl. \nStop Words For \nTwitter in Training \nSet","Bigram",seq(from=1,to=10,by=1))
# Create plot objects for the trigrams plot
newsTri <- getBarplotNgrams(trigramFreqNews20,"Trigram Freq. excl. \nStop Words For \nNews in Training \nSet","Trigram",seq(from=1,to=10,by=1))
blogsTri <- getBarplotNgrams(trigramFreqBlogs20,"Trigram Freq. excl. \nStop Words For \nBlogs in Training \nSet","Trigram",seq(from=1,to=10,by=1))
twitterTri <- getBarplotNgrams(trigramFreqTwitter20,"Trigram Freq. excl. \nStop Words For \nTwitter in Training \nSet","Trigram",seq(from=1,to=10,by=1))
# Plot both graphs
newsBi + blogsBi + twitterBi
newsTri + blogsTri + twitterTri
From the graphs above, no bigram appears in the top 10 of all three corpora, but the bigram "right now" appears in the top 20 of all three.
From the graphs and tables above, no trigram appears in the top 20 of all three corpora; however, the trigram "new york city", for example, appears in the top 10 for two of the corpora (News and Blogs).
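As above, these overlaps can be checked directly from the n-gram frequency tables, for example:
# Bigrams in the top 20 of all three corpora
Reduce(intersect, list(as.character(bigramFreqNews20$ngram),
                       as.character(bigramFreqBlogs20$ngram),
                       as.character(bigramFreqTwitter20$ngram)))
# Trigrams in the top 20 of all three corpora (none are expected)
Reduce(intersect, list(as.character(trigramFreqNews20$ngram),
                       as.character(trigramFreqBlogs20$ngram),
                       as.character(trigramFreqTwitter20$ngram)))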
Consider more advanced data processing steps, such as filtering out any non-English text that may remain in these three predominantly English-language files, en_US.news.txt, en_US.blogs.txt and en_US.twitter.txt.
Consider whether to include other sources of data in addition to the three files explored above.
Train a Markov chain model that predicts the next word based on the previous word typed by the user (a rough sketch of this approach is shown below, after this list of next steps).
Evaluate whether using the previous bigram or trigram in the Markov chain model rather than the previous word improves model performance.
Explore more complex modeling strategies, such as recurrent neural networks with long short-term memory (LSTM).
Decide on a model that has the best combination of accuracy and computational efficiency.
Create a Shiny app where users can enter text and see the words that are most likely to follow, and create a supporting presentation to explain the features of the algorithms used and the app itself.
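As a starting point for the Markov chain idea above, a minimal sketch is shown below. It builds bigram counts from a training set while keeping stop words (assuming the final model should be able to suggest them) and returns the most frequent continuations of the last word typed. The function names buildBigramModel and predictNext are illustrative only, not part of any package.
library(tokenizers)
library(tm)
# Build a simple first-order Markov chain (bigram) model from a training set.
# Stop words are kept, since the predictor must be able to suggest them.
buildBigramModel <- function(trainingData){
    sents <- unlist(tokenize_sentences(trainingData))
    sents <- stripWhitespace(removePunctuation(removeNumbers(tolower(sents))))
    sents <- sents[sents != ""]
    bigrams <- unlist(tokenize_ngrams(sents, n = 2, n_min = 2))
    freq <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
    parts <- do.call(rbind, strsplit(freq$bigrams, " ", fixed = TRUE))
    data.frame(first = parts[, 1], second = parts[, 2], freq = freq$Freq)
}
# Return the topN words most frequently observed to follow the last word typed
predictNext <- function(model, lastWord, topN = 3){
    cands <- model[model$first == tolower(lastWord), ]
    head(cands[order(-cands$freq), "second"], topN)
}
# Example usage on the Twitter training set
bigramModel <- buildBigramModel(trainingTwitter)
predictNext(bigramModel, "happy")
The natural extension, relevant to the evaluation step above, is to condition on the previous bigram or trigram and back off to shorter histories when no match is found.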