For this Natural Language Processing capstone project (partnered with SwiftKey), the goal is to create a predictive text model: given a sequence of words typed by a user, the model should suggest the most likely next word.
In this milestone report, exploratory data analysis is performed on three English-language corpora obtained by crawling three types of websites: news articles, blog articles, and tweets. The corpora are first loaded and summarized, then pre-processed to remove profane words and, for parts of the analysis, commonly used stop words. The frequencies of individual words, bigrams (pairs of words), and trigrams (triplets of words) are examined for each corpus, and bar plots are generated to visualize the relative frequency of the most common words, bigrams, and trigrams. Finally, a plan for using the data to train a text prediction algorithm is presented.
The data was initially downloaded from: Capstone Dataset
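For reproducibility, the download and extraction step can be scripted. A minimal sketch is shown below; the zip file URL is left as a placeholder to be taken from the Capstone Dataset link above, and the destination file name is illustrative only.
# Placeholder: substitute the zip file URL from the Capstone Dataset link above
datasetURL <- "<capstone-dataset-zip-url>"
if (!file.exists("capstone_dataset.zip")) {
    download.file(datasetURL, destfile = "capstone_dataset.zip", mode = "wb")
    unzip("capstone_dataset.zip")
}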
Consider the three data files en_US.news.txt, en_US.blogs.txt, and en_US.twitter.txt in the en_US folder.
Load the three .txt files, reading them in with UTF-8 encoding, and count the number of lines and number of words in each complete .txt file.
# Calculate number of lines and number of words in entire News .txt file
connectNews <- file("en_US.news.txt")
USNews_all <- readLines(connectNews, encoding = "UTF-8")
numLinesNews <- length(USNews_all)
numWordsNews <- sum(sapply(strsplit(USNews_all, "\\s+"), length))
close(connectNews)
# Calculate number of lines and number of words in entire Blogs .txt file
connectBlogs <- file("en_US.blogs.txt")
USBlogs_all <- readLines(connectBlogs, encoding = "UTF-8")
numLinesBlogs <- length(USBlogs_all)
numWordsBlogs <- sum(sapply(strsplit(USBlogs_all, "\\s+"), length))
close(connectBlogs)
# Calculate number of lines and number of words in entire Twitter .txt file
connectTwitter <- file("en_US.twitter.txt")
USTwitter_all <- readLines(connectTwitter, encoding = "UTF-8")
numLinesTwitter <- length(USTwitter_all)
numWordsTwitter <- sum(sapply(strsplit(USTwitter_all, "\\s+"), length))
close(connectTwitter)
Create a table of the total number of lines and total number of words in each file, prior to any processing:
# Make a table of line count and word count for each of the three files, before any processing
numLines <- matrix(c(numLinesNews, numLinesBlogs, numLinesTwitter,
                     numWordsNews, numWordsBlogs, numWordsTwitter,
                     round(numWordsNews/numLinesNews),
                     round(numWordsBlogs/numLinesBlogs),
                     round(numWordsTwitter/numLinesTwitter)),
                   ncol = 3, byrow = FALSE)
colnames(numLines) <- c("Line Count","Word Count","Avg Word Count per Line")
rownames(numLines) <- c("News","Blogs","Twitter")
numLines <- as.table(numLines)
numLines
## Line Count Word Count Avg Word Count per Line
## News 77259 2643969 34
## Blogs 899288 37334131 42
## Twitter 2360148 30373543 13
From the table above, en_US.twitter.txt has the largest number of lines; however, since each tweet is relatively short, its total word count is lower than that of the Blogs corpus. en_US.blogs.txt has both the highest average word count per line and the highest total word count.
Profanity should be removed from each of the three files, since profane words will not be included in the prediction model. A list of profane words is obtained from: List of Profane Words
After the profane words are removed, the number of words removed from each file is calculated, both as an absolute count and as a percentage of the original word count.
library(tm)
# Create list of profane words from the github link above
profane <- read.table("profanity.txt",sep="\n")$V1
# Remove profanities from News and calculate the number of profanities removed. Additionally, express the number of
# profanities removed as a percentage
USNews_all_NoProf <- removeWords(USNews_all,profane)
numWordsNews_NoProf <- sum(sapply(strsplit(USNews_all_NoProf,"\\s+"), length))
numWordsNews_removed <- numWordsNews - numWordsNews_NoProf
numWordsNews_Premoved <- numWordsNews_removed/numWordsNews*100
# Remove profanities from Blogs and calculate the number of profanities removed. Additionally, express the number of
# profanities removed as a percentage
USBlogs_all_NoProf <- removeWords(USBlogs_all,profane)
numWordsBlogs_NoProf <- sum(sapply(strsplit(USBlogs_all_NoProf,"\\s+"), length))
numWordsBlogs_removed <- numWordsBlogs - numWordsBlogs_NoProf
numWordsBlogs_Premoved <- numWordsBlogs_removed/numWordsBlogs*100
# Remove profanities from Twitter and calculate the number of profanities removed. Additionally, express the number of
# profanities removed as a percentage
USTwitter_all_NoProf <- removeWords(USTwitter_all,profane)
numWordsTwitter_NoProf <- sum(sapply(strsplit(USTwitter_all_NoProf,"\\s+"), length))
numWordsTwitter_removed <- numWordsTwitter - numWordsTwitter_NoProf
numWordsTwitter_Premoved <- numWordsTwitter_removed/numWordsTwitter*100
Create a table of the number of profane words removed in each file, expressed as an absolute number and also as a percentage of the original number of words:
# Make a table of profanities removed and percentage removed for each of the three files
numWords <- matrix(c(numWordsNews_removed,numWordsBlogs_removed,numWordsTwitter_removed,numWordsNews_Premoved,numWordsBlogs_Premoved,numWordsTwitter_Premoved),ncol=2, byrow=FALSE)
colnames(numWords) <- c("Words Removed","Percent Removed (%)")
rownames(numWords) <- c("News","Blogs","Twitter")
numWords <- as.table(format(numWords,scientific=FALSE,digits=1))
numWords
## Words Removed Percent Removed (%)
## News 470.00 0.02
## Blogs 18073.00 0.05
## Twitter 72927.00 0.24
From the above table, as expected very few profane words are removed from en_US.news.txt and en_US.blogs.txt, but a relatively high 0.24% of the words in en_US.twitter.txt are classified as profanity and removed. This is most likely because news and blog sites contain more professional content in which fewer profanities are used.
Remove non-standard characters from each file, such as curly double quotation marks, right single quotation marks, en dashes and em dashes.
library(qdap)
# Remove special characters using their Unicode code point
USNews_all_NoProf <- multigsub(c("\u201C","\u201D","\u2013","\u2014","\u2019"),c("","","","",""),USNews_all_NoProf)
USBlogs_all_NoProf <- multigsub(c("\u201C","\u201D","\u2013","\u2014","\u2019"),c("","","","",""),USBlogs_all_NoProf)
USTwitter_all_NoProf <- multigsub(c("\u201C","\u201D","\u2013","\u2014","\u2019"),c("","","","",""),USTwitter_all_NoProf)
For each of the three .txt files, take a subset of 20% of the data, since the .txt files are so large; using 20% balances computation speed against a sufficiently large sample size. Then split the randomly selected 20% subset into training, validation and testing sets, with the following split: 60% training, 20% validation and 20% testing.
All partitioning is randomized using the sample() function, which draws line indices at random for each .txt file. After setting the seed, a random 20% of the lines in each file is selected for use in this study. The lines in this 20% subset are then shuffled again and split into the training, validation and testing sets.
set.seed(12)
# Create a function that partitions a data set into training, validation and testing sets
partitionData <- function(curLines, US_all_NoProf){
    vecRand <- sample(curLines)
    numTraining <- floor(curLines*0.6)
    numValid <- floor(curLines*0.2)
    training <- US_all_NoProf[vecRand[1:numTraining]]
    valid <- US_all_NoProf[vecRand[(numTraining+1):(numTraining+numValid)]]
    testing <- US_all_NoProf[vecRand[(numTraining+numValid+1):curLines]]
    list(training, valid, testing)
}
# Partition the News data, after taking a random 20% subset
NewsList <- partitionData(floor(0.2*numLinesNews), USNews_all_NoProf[sample(numLinesNews, floor(0.2*numLinesNews))])
trainingNews <- NewsList[[1]]
validNews <- NewsList[[2]]
testingNews <- NewsList[[3]]
# Partition the Blogs data, after taking a random 20% subset
BlogsList <- partitionData(floor(0.2*numLinesBlogs), USBlogs_all_NoProf[sample(numLinesBlogs, floor(0.2*numLinesBlogs))])
trainingBlogs <- BlogsList[[1]]
validBlogs <- BlogsList[[2]]
testingBlogs <- BlogsList[[3]]
# Partition the Twitter data, after taking a random 20% subset
TwitterList <- partitionData(floor(0.2*numLinesTwitter), USTwitter_all_NoProf[sample(numLinesTwitter, floor(0.2*numLinesTwitter))])
trainingTwitter <- TwitterList[[1]]
validTwitter <- TwitterList[[2]]
testingTwitter <- TwitterList[[3]]
Tokenization is important in Natural Language Processing: it breaks the text into smaller chunks, or "tokens." In this report, single words, bigrams and trigrams are analyzed from the training sets of each .txt file. Further processing is applied to the training sets, such as removing punctuation. Single words are examined with and without stop words (the most commonly used words, such as "i", "me", "we", "have" and "from"), while bigrams and trigrams are analyzed only without stop words. Tables and graphs show the single words, bigrams and trigrams that appear most frequently.
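As a quick illustration of what these tokens look like, the tokenizers package (used later in this report) can split a made-up example sentence into words, bigrams and trigrams; the sentence below is purely illustrative.
library(tokenizers)
# A made-up example sentence, purely for illustration
exampleText <- "thanks for the follow have a great day"
tokenize_words(exampleText)                      # single words
tokenize_ngrams(exampleText, n = 2, n_min = 2)   # bigrams
tokenize_ngrams(exampleText, n = 3, n_min = 3)   # trigrams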
As part of the data processing, for each .txt file, transform all characters to lower case, remove punctuation and remove stop words if specified. Use functions in the tm package to manipulate the text data and count the frequency of single words that appear.
library(tm)
# Create a function that obtains the frequency of all words in a data set, with the option to filter out
# stop words or not
getWordFreq <- function(trainingData, rmStop, lowCut){
    curCorpus <- Corpus(VectorSource(trainingData))
    curCorpus <- tm_map(curCorpus, content_transformer(tolower))
    if (rmStop){
        curCorpus <- tm_map(curCorpus, removeWords, stopwords('english'))
    }
    curCorpus <- tm_map(curCorpus, removePunctuation)
    curWords <- TermDocumentMatrix(curCorpus, control = list(bounds = list(global = c(lowCut, Inf))))
    curWordsMatrix <- as.matrix(curWords)
    curWordFreq <- sort(rowSums(curWordsMatrix), decreasing = TRUE)
    curWordFreq
}
# Frequency of words for News, including common stop words
wordfreqNews <- getWordFreq(trainingNews,FALSE,500)
# Frequency of words for News, removing common stop words
wordfreqNewsNoSW <- getWordFreq(trainingNews,TRUE,200)
# Frequency of words for Blogs, including common stop words
wordfreqBlogs <- getWordFreq(trainingBlogs,FALSE,5000)
# Frequency of words for Blogs, removing common stop words
wordfreqBlogsNoSW <- getWordFreq(trainingBlogs,TRUE,2000)
# Frequency of words for Twitter, including common stop words
wordfreqTwitter <- getWordFreq(trainingTwitter,FALSE,6000)
# Frequency of words for Twitter, removing common stop words
wordfreqTwitterNoSW <- getWordFreq(trainingTwitter,TRUE,5000)
Show the 30 most common words in each file in a tabular form:
# Create a function that gets the top N most common words in a given set with and without stop words
getWordFreqTopN <- function(wordFreq1, wordFreq2, N){
    wordfreqN <- data.frame(word = names(wordFreq1[1:N]), freq = unname(wordFreq1[1:N]),
                            wordNoSW = names(wordFreq2[1:N]), freqNoSW = unname(wordFreq2[1:N]))
    wordfreqN
}
# Get the top 30 most common words for News
wordfreqNews30 <- getWordFreqTopN(wordfreqNews,wordfreqNewsNoSW,30)
# Get the top 30 most common words for Blogs
wordfreqBlogs30 <- getWordFreqTopN(wordfreqBlogs,wordfreqBlogsNoSW,30)
# Get the top 30 most common words for Twitter
wordfreqTwitter30 <- getWordFreqTopN(wordfreqTwitter,wordfreqTwitterNoSW,30)
wordfreqNews30
## word freq wordNoSW freqNoSW
## 1 the 17792 said 2277
## 2 and 8187 will 1008
## 3 for 3231 one 754
## 4 that 3069 new 628
## 5 with 2342 two 553
## 6 said 2277 can 543
## 7 was 2074 also 527
## 8 his 1375 year 509
## 9 from 1371 just 488
## 10 but 1358 years 485
## 11 are 1263 first 455
## 12 have 1252 time 452
## 13 its 1164 last 450
## 14 has 1121 like 434
## 15 this 1090 state 432
## 16 will 1008 people 420
## 17 they 1008 get 384
## 18 not 986 percent 335
## 19 who 945 now 333
## 20 you 866 million 312
## 21 about 850 school 309
## 22 were 823 three 302
## 23 more 796 city 293
## 24 their 779 many 292
## 25 had 757 going 290
## 26 one 751 game 290
## 27 when 722 back 288
## 28 would 676 home 285
## 29 out 666 says 274
## 30 new 628 may 272
wordfreqBlogs30
## word freq wordNoSW freqNoSW
## 1 the 221113 one 14690
## 2 and 129799 will 13426
## 3 that 54698 just 11894
## 4 for 43431 can 11850
## 5 you 35489 like 11690
## 6 with 34118 time 10670
## 7 was 33200 get 8414
## 8 this 30799 now 7357
## 9 have 26243 know 7353
## 10 but 24368 people 7055
## 11 are 22830 also 6636
## 12 not 20718 new 6483
## 13 from 17569 even 6118
## 14 all 17280 back 6094
## 15 they 16478 make 6083
## 16 one 14627 first 6080
## 17 about 13773 day 6022
## 18 will 13421 really 5944
## 19 its 13275 see 5909
## 20 what 13094 good 5866
## 21 out 13023 well 5803
## 22 his 12863 much 5783
## 23 had 12672 think 5716
## 24 her 12510 way 5595
## 25 when 12366 little 5525
## 26 your 12096 love 5284
## 27 just 11885 two 4859
## 28 can 11841 going 4780
## 29 like 11671 life 4696
## 30 there 11341 things 4670
wordfreqTwitter30
## word freq wordNoSW freqNoSW
## 1 the 112469 just 17921
## 2 you 65100 like 14616
## 3 and 52232 get 13309
## 4 for 46168 love 13023
## 5 that 28315 good 11960
## 6 with 20792 will 11216
## 7 your 20582 can 10776
## 8 have 20157 day 10775
## 9 this 19547 thanks 10555
## 10 are 19095 now 10034
## 11 just 17889 one 9820
## 12 but 14721 know 9501
## 13 not 14686 time 9235
## 14 its 14600 great 9191
## 15 like 14593 today 8491
## 16 all 14575 new 8237
## 17 was 14137 lol 8090
## 18 out 13612 see 7881
## 19 what 13316 back 6882
## 20 get 13293 got 6778
## 21 love 12994 going 6598
## 22 good 11920 think 6490
## 23 will 11200 people 6324
## 24 about 11145 need 6141
## 25 can 10762 want 5848
## 26 day 10713 happy 5798
## 27 dont 10695 make 5705
## 28 thanks 10526 follow 5684
## 29 from 10056 right 5439
## 30 now 9965 really 5417
Graph the 10 most common words (including and excluding stop words) for each of the three .txt files:
library(ggplot2)
library(patchwork)
# Create a function that generates a bar plot from a given data frame of word frequencies
getBarplot <- function(wordfreq, curTitle, SW, range){
    if (SW){
        wordfreq$word <- factor(wordfreq$word, levels = wordfreq$word)
        p1 <- ggplot(wordfreq[range,], aes(x = freq, y = word)) +
            geom_bar(stat = "identity", color = "salmon", fill = "salmon") +
            xlab("Frequency") + ylab("Word") + ggtitle(curTitle) +
            geom_text(aes(label = freq), size = 6, hjust = 0.6, vjust = 1) +
            theme(text = element_text(size = 25))
    } else {
        wordfreq$wordNoSW <- factor(wordfreq$wordNoSW, levels = wordfreq$wordNoSW)
        p1 <- ggplot(wordfreq[range,], aes(x = freqNoSW, y = wordNoSW)) +
            geom_bar(stat = "identity", color = "salmon", fill = "salmon") +
            xlab("Frequency") + ylab("Word") + ggtitle(curTitle) +
            geom_text(aes(label = freqNoSW), size = 6, hjust = 0.6, vjust = 1) +
            theme(text = element_text(size = 25))
    }
    p1
}
# Create plot objects for the case excluding stop words
newsNoSW <- getBarplot(wordfreqNews30,"Word Frequency excl. Stop \nWords For News \nin Training Set",FALSE,seq(from=1,to=10,by=1))
blogsNoSW <- getBarplot(wordfreqBlogs30,"Word Frequency excl. Stop \nWords For Blogs \nin Training Set",FALSE,seq(from=1,to=10,by=1))
twitterNoSW <- getBarplot(wordfreqTwitter30,"Word Frequency excl. Stop \nWords For Twitter \nin Training Set",FALSE,seq(from=1,to=10,by=1))
# Create plot objects for the case including stop words
newsSW <- getBarplot(wordfreqNews30,"Word Frequency incl. Stop \nWords For News \nin Training Set",TRUE,seq(from=1,to=10,by=1))
blogsSW <- getBarplot(wordfreqBlogs30,"Word Frequency incl. Stop \nWords For Blogs \nin Training Set",TRUE,seq(from=1,to=10,by=1))
twitterSW <- getBarplot(wordfreqTwitter30,"Word Frequency incl. Stop \nWords For Twitter \nin Training Set",TRUE,seq(from=1,to=10,by=1))
# Plot both graphs
newsNoSW + blogsNoSW + twitterNoSW
newsSW + blogsSW + twitterSW
As can be seen from the above graphs, in the instance where stop words are excluded, the following words are in the top 10 across all three corpora: “will,” “can,” and “just.”
In the instance where stop words are included, the following words are in the top 10 across all three corpora: "the," "and," "for," "that," and "with."
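These overlaps can be verified directly from the word-frequency vectors computed above; for example, a quick sketch:
# Words in the top 10 of all three corpora, excluding stop words
Reduce(intersect, list(names(wordfreqNewsNoSW)[1:10],
                       names(wordfreqBlogsNoSW)[1:10],
                       names(wordfreqTwitterNoSW)[1:10]))
# Words in the top 10 of all three corpora, including stop words
Reduce(intersect, list(names(wordfreqNews)[1:10],
                       names(wordfreqBlogs)[1:10],
                       names(wordfreqTwitter)[1:10]))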
As part of the data processing, for each .txt file, transform all characters to lower case, remove punctuation, remove stop words, remove empty strings, and remove numbers before forming bigrams and trigrams. Care is taken to split the text into sentences before creating the n-grams, to avoid n-grams that span sentence boundaries and therefore do not make sense. The most frequent bigrams and trigrams from each file are tabulated and plotted.
library(tokenizers)
library(dplyr)
library(tm)
# Create a function that returns the most frequent n-grams from a given data set while splitting sentences
getNgramFreq <- function(trainingData, N, lowCut, rmU){
    trainingData <- unlist(tokenize_sentences(trainingData))
    trainingData <- tolower(trainingData)
    trainingData <- removeNumbers(trainingData)
    trainingData <- removePunctuation(trainingData)
    if (rmU){
        trainingData <- removeWords(trainingData, "u")
    }
    trainingData <- stripWhitespace(trainingData)
    trainingData <- trainingData[trainingData != ""]
    ngrams <- unlist(tokenize_ngrams(trainingData, n = N, n_min = N))
    curNgramFreq <- as.data.frame(table(ngrams))
    curNgramFreq <- curNgramFreq %>%
        rename(ngram = ngrams, freq = Freq)
    curNgramFreq <- curNgramFreq[curNgramFreq$freq >= lowCut, ]
    curNgramFreq %>%
        arrange(desc(freq))
}
# Find most frequent bigrams from each data source
bigramfreqNews <- getNgramFreq(removeWords(tolower(trainingNews),stopwords("english")),2,25,TRUE)
bigramfreqBlogs <- getNgramFreq(removeWords(tolower(trainingBlogs),stopwords("english")),2,20,FALSE)
bigramfreqTwitter <- getNgramFreq(removeWords(tolower(trainingTwitter),stopwords("english")),2,300,FALSE)
# Find most frequent trigrams from each data source
trigramfreqNews <- getNgramFreq(removeWords(tolower(trainingNews),stopwords("english")),3,5,TRUE)
trigramfreqBlogs <- getNgramFreq(removeWords(tolower(trainingBlogs),stopwords("english")),3,20,FALSE)
trigramfreqTwitter <- getNgramFreq(removeWords(tolower(trainingTwitter),stopwords("english")),3,20,FALSE)
Show the 20 most common bigrams and trigrams in each file in tabular form, excluding stop words:
# Get the top 20 most common bigrams for News
bigramFreqNews20 <- bigramfreqNews[1:20,]
# Get the top 20 most common bigrams for Blogs
bigramFreqBlogs20 <- bigramfreqBlogs[1:20,]
# Get the top 20 most common bigrams for Twitter
bigramFreqTwitter20 <- bigramfreqTwitter[1:20,]
bigramFreqNews20
## ngram freq
## 1 last year 114
## 2 new york 95
## 3 st louis 80
## 4 high school 76
## 5 years ago 64
## 6 new jersey 62
## 7 last week 54
## 8 two years 46
## 9 officials said 40
## 10 health care 39
## 11 san francisco 38
## 12 first time 37
## 13 united states 37
## 14 next year 36
## 15 los angeles 34
## 16 police said 30
## 17 right now 30
## 18 three years 29
## 19 even though 28
## 20 four years 26
bigramFreqBlogs20
## ngram freq
## 1 years ago 627
## 2 right now 609
## 3 even though 569
## 4 new york 569
## 5 feel like 511
## 6 first time 498
## 7 can see 494
## 8 last year 484
## 9 make sure 482
## 10 dont know 463
## 11 last night 438
## 12 last week 427
## 13 one day 410
## 14 every day 391
## 15 can get 375
## 16 just like 357
## 17 one thing 325
## 18 many people 324
## 19 long time 313
## 20 look like 313
bigramFreqTwitter20
## ngram freq
## 1 right now 1994
## 2 last night 1346
## 3 looking forward 1080
## 4 happy birthday 1040
## 5 good morning 934
## 6 just got 890
## 7 feel like 832
## 8 good luck 792
## 9 thanks follow 734
## 10 follow back 726
## 11 looks like 714
## 12 can get 690
## 13 let know 671
## 14 mothers day 629
## 15 next week 577
## 16 sounds like 536
## 17 great day 535
## 18 make sure 495
## 19 please follow 481
## 20 thanks rt 476
# Get the top 20 most common trigrams for News
trigramFreqNews20 <- trigramfreqNews[1:20,]
# Get the top 20 most common trigrams for Blogs
trigramFreqBlogs20 <- trigramfreqBlogs[1:20,]
# Get the top 20 most common trigrams for Twitter
trigramFreqTwitter20 <- trigramfreqTwitter[1:20,]
trigramFreqNews20
## ngram freq
## 1 two years ago 14
## 2 pates fountain parks 11
## 3 president barack obama 11
## 4 st louis county 10
## 5 chief financial officer 9
## 6 new york city 9
## 7 classic pates fountain 8
## 8 first time since 7
## 9 new york times 7
## 10 past two years 7
## 11 chief operating officer 6
## 12 gov chris christie 6
## 13 gov john kasich 6
## 14 late last year 6
## 15 said written statement 6
## 16 us district court 6
## 17 will take place 6
## 18 world war ii 6
## 19 according court records 5
## 20 carter kick p 5
trigramFreqBlogs20
## ngram freq
## 1 new york city 90
## 2 new york times 68
## 3 amazon services llc 54
## 4 llc amazon eu 54
## 5 services llc amazon 54
## 6 new york ny 49
## 7 two years ago 44
## 8 couple weeks ago 42
## 9 every single day 40
## 10 please let know 40
## 11 incorporated item c 35
## 12 lord jesus christ 35
## 13 preheat oven degrees 35
## 14 world war ii 35
## 15 please feel free 34
## 16 let us know 33
## 17 long time ago 33
## 18 blah blah blah 32
## 19 item c pp 32
## 20 two weeks ago 32
trigramFreqTwitter20
## ngram freq
## 1 happy mothers day 375
## 2 let us know 289
## 3 happy new year 209
## 4 cinco de mayo 123
## 5 looking forward seeing 92
## 6 happy valentines day 80
## 7 st patricks day 71
## 8 keep good work 70
## 9 come see us 69
## 10 just got home 64
## 11 love love love 64
## 12 follow back please 61
## 13 just got back 61
## 14 please follow back 61
## 15 good morning everyone 59
## 16 new years eve 57
## 17 cake cake cake 53
## 18 please please please 51
## 19 dreams come true 50
## 20 happy th birthday 49
Graph the 10 most common bigrams and trigrams (excluding stop words) for each of the three files:
# Create a function that generates a barplot object from a given table of n-grams
getBarplotNgrams <- function(ngramfreqPlot, curTitle, bi_tri, range){
    ngramfreqPlot$ngram <- factor(ngramfreqPlot$ngram, levels = ngramfreqPlot$ngram)
    p1 <- ggplot(ngramfreqPlot[range,], aes(x = freq, y = ngram)) +
        geom_bar(stat = "identity", color = "cadetblue2", fill = "cadetblue2") +
        xlab("Frequency") + ylab(bi_tri) + ggtitle(curTitle) +
        geom_text(aes(label = freq), size = 6, hjust = 0.6, vjust = 1) +
        theme(text = element_text(size = 23))
    p1
}
# Create plot objects for the bigrams plot
newsBi <- getBarplotNgrams(bigramFreqNews20,"Bigram Frequency excl. \nStop Words For \nNews in Training \nSet","Bigram",seq(from=1,to=10,by=1))
blogsBi <- getBarplotNgrams(bigramFreqBlogs20,"Bigram Frequency excl. \nStop Words For \nBlogs in Training \nSet","Bigram",seq(from=1,to=10,by=1))
twitterBi <- getBarplotNgrams(bigramFreqTwitter20,"Bigram Frequency excl. \nStop Words For \nTwitter in Training \nSet","Bigram",seq(from=1,to=10,by=1))
# Create plot objects for the trigrams plot
newsTri <- getBarplotNgrams(trigramFreqNews20,"Trigram Freq. excl. \nStop Words For \nNews in Training \nSet","Trigram",seq(from=1,to=10,by=1))
blogsTri <- getBarplotNgrams(trigramFreqBlogs20,"Trigram Freq. excl. \nStop Words For \nBlogs in Training \nSet","Trigram",seq(from=1,to=10,by=1))
twitterTri <- getBarplotNgrams(trigramFreqTwitter20,"Trigram Freq. excl. \nStop Words For \nTwitter in Training \nSet","Trigram",seq(from=1,to=10,by=1))
# Plot both graphs
newsBi + blogsBi + twitterBi
newsTri + blogsTri + twitterTri
From the graphs above, no bigram appears in the top 10 of all three corpora, but the bigram "right now" appears in the top 20 of all three.
From the graphs and tables above, no trigram appears in the top 20 of all three corpora; however, the trigram "new york city", for example, appears in the top 10 for two of the corpora (News and Blogs).
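As above, these overlaps can be checked directly from the n-gram frequency tables, for example:
# Bigrams in the top 20 of all three corpora
Reduce(intersect, list(as.character(bigramFreqNews20$ngram),
                       as.character(bigramFreqBlogs20$ngram),
                       as.character(bigramFreqTwitter20$ngram)))
# Trigrams in the top 20 of all three corpora (none are expected)
Reduce(intersect, list(as.character(trigramFreqNews20$ngram),
                       as.character(trigramFreqBlogs20$ngram),
                       as.character(trigramFreqTwitter20$ngram)))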
Consider more advanced data processing steps, such as filtering out any non-English text that may remain in these three predominantly English-language files, en_US.news.txt, en_US.blogs.txt and en_US.twitter.txt.
Consider whether to include other sources of data in addition to the three files explored above.
Train a Markov chain model that predicts the next word based on the previous word typed by the user (a rough sketch of this approach is shown below, after this list of next steps).
Evaluate whether using the previous bigram or trigram in the Markov chain model rather than the previous word improves model performance.
Explore more complex modeling strategies, such as recurrent neural networks with long short-term memory (LSTM).
Decide on a model that has the best combination of accuracy and computational efficiency.
Create a Shiny app where users can enter text and see the words that are most likely to follow, and create a supporting presentation to explain the features of the algorithms used and the app itself.
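As a starting point for the Markov chain idea above, a minimal sketch is shown below. It builds bigram counts from a training set while keeping stop words (assuming the final model should be able to suggest them) and returns the most frequent continuations of the last word typed. The function names buildBigramModel and predictNext are illustrative only, not part of any package.
library(tokenizers)
library(tm)
# Build a simple first-order Markov chain (bigram) model from a training set.
# Stop words are kept, since the predictor must be able to suggest them.
buildBigramModel <- function(trainingData){
    sents <- unlist(tokenize_sentences(trainingData))
    sents <- stripWhitespace(removePunctuation(removeNumbers(tolower(sents))))
    sents <- sents[sents != ""]
    bigrams <- unlist(tokenize_ngrams(sents, n = 2, n_min = 2))
    freq <- as.data.frame(table(bigrams), stringsAsFactors = FALSE)
    parts <- do.call(rbind, strsplit(freq$bigrams, " ", fixed = TRUE))
    data.frame(first = parts[, 1], second = parts[, 2], freq = freq$Freq)
}
# Return the topN words most frequently observed to follow the last word typed
predictNext <- function(model, lastWord, topN = 3){
    cands <- model[model$first == tolower(lastWord), ]
    head(cands[order(-cands$freq), "second"], topN)
}
# Example usage on the Twitter training set
bigramModel <- buildBigramModel(trainingTwitter)
predictNext(bigramModel, "happy")
The natural extension, relevant to the evaluation step above, is to condition on the previous bigram or trigram and back off to shorter histories when no match is found.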