Swiftkey - Next Word Prediction Milestone Report

1. Introduction


1.1 Background

Natural Language Processing (NLP) has come a long way, according to Dan Jurafsky of Stanford. It now powers language translation and has helped solve problems such as spam detection, part-of-speech tagging and named entity recognition. However, tasks such as dialogue, question answering and summarisation remain very hard because of ambiguity.

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the …

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant.


1.2 Objective

In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.


1.3 How to achieve this objective?

The first step in analyzing any new data set is figuring out:

  1. what data you have and
  2. what are the standard tools and models used for that type of data.


1.3.1 Data:

This project uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data come from a corpus called HC Corpora. The files have been language filtered but may still contain some foreign text. This training data will be the basis for most of the capstone. Download the data from the link below, not from external websites.

HC Corpora Dataset


1.3.2 Tools:

The first step toward working on this project is to become familiar with Natural Language Processing, Text Mining, and the associated tools in R. Please refer to the references below for the materials on NLP and Text Mining.

The main development application for this project is RStudio, and the main libraries for NLP will be tm, quanteda, wordcloud and tidytext.

Other libraries are also used, such as ggplot2 for plotting, caret for machine learning and dplyr for data manipulation.


1.3.3 Modelling:

The first step in building a predictive text mining application is a simple model of the relationship between words.

Given the scale of the data (more than 500 MB), it is inefficient and impractical to work on the full files, as doing so would take a great deal of computing power and time. Therefore we will take a sample from each en_US text file and combine the samples into a single corpus. A good maximum sample size is usually around 10% of the population: in a population of 5,000, 10% would be 500; in a population of 200,000, 10% would be 20,000. We will therefore create samples of the blogs, news and Twitter datasets.

According to the R-bloggers article cited below, after loading the data we need to generate a corpus. A corpus is a type of dataset used in text analysis. It contains “a collection of text or speech material that has been brought together according to a certain set of predetermined criteria”.

Another essential component for text analysis is the document-feature matrix (DFM), also called a document-term matrix (DTM). The two terms are near-synonyms: quanteda refers to a DFM whereas other packages refer to a DTM. It describes how frequently terms occur in the corpus by counting single terms. To generate a DFM, we first split the text into its individual terms (tokens) and then count how frequently each token occurs in each document.

A corpus is positional (string of words) and a DFM is non-positional (bag of words). Put differently, the order of the words matters in a corpus whereas a DFM does not have information on the position of words.
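
As a minimal sketch of this workflow in quanteda (the two-document text vector below is made up purely for illustration and is not part of the project data):

library(quanteda)

txt <- c(doc1 = "I went to the gym", doc2 = "I went to the store")

toy.Corpus <- corpus(txt)  #Positional: the order of the words is kept
toy.Tokens <- tokens(toy.Corpus)  #Split each document into tokens
toy.Dfm <- dfm(toy.Tokens)  #Non-positional: counts of each token per document

topfeatures(toy.Dfm, 5)  #Most frequent terms across the documents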

Next, to understand the relationship between words, I will use the exploratory analysis to build a basic n-gram model for predicting the next word from the previous 1, 2 or 3 words. After that, the model will need to handle unseen n-grams, since people will sometimes type a combination of words that does not appear in the corpus.


2. Preprocessing


Please refer to the code appendix below for the details of the code used to perform these tasks.


2.1 Gathering and Cleaning Data

Download the data to the specified directory and unzip the file at the same time.


Below is the information for the en_US files that we have downloaded and unzipped:

File         File Size (MB)   Lines      Characters    Words       Longest Line
US Blogs     255.35           899288     206824505     37334131    40833
US News      19.77            77259      15639408      2643969     5760
US Twitter   318.99           2360148    162096241     30373583    140


From this information for the three en_US sources, we can see that the blogs source contains the highest number of words and has the longest line. The Twitter source is the largest at roughly 319 MB and has the shortest maximum line length (140 characters), whereas the news source is the smallest. The wordcloud also shows that words such as “the”, “just” and “like” are the most frequently used.


3. Exploratory Data Analysis

As mentioned above (following the R-bloggers article), after loading the data we need to generate a corpus, then a DFM (or DTM). The two terms are near-synonyms: quanteda refers to a DFM whereas other packages refer to a DTM. It describes how frequently terms occur in the corpus by counting single terms. Before generating a DFM, we first split the text into its individual terms (tokens) and then count how frequently each token occurs in each document.

A token is each individual word in a text (although it could also be a sentence, paragraph or character). This is why creating a “bag of words” is also called tokenizing text. In a nutshell, a DFM is a very efficient way of organizing the frequency of features/tokens, but it does not contain any information on their position. In our case, the features of the text are represented by the columns of the DFM, which aggregate the frequency of each token.

In most projects you want one corpus to contain all your data and generate many DFMs from that.


3.1 We will explore the data by looking at

  1. Top Words - Visualise the top words using the wordcloud and ggplot functions.
  2. Word Coverage - How many unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?



3.2 The top 100 words from the blogs, news and Twitter combined into a single corpus:


3.3 Cleaning Data

While doing some exploratory analysis with the grepl() function on the raw corpus data, I realised there was a lot of profanity. The word “fuck” and related forms appeared 27236 times in the raw data, which is not acceptable for our text prediction. To prevent profanity from appearing in the predictions, we will create a list of profanity words using the dictionary() function.

The list of profanity words (in txt format) is downloaded from the bad-words.txt list linked in the code appendix.

As mentioned above, we want one corpus to contain all our data and to generate many DFMs from it. To avoid repeating the DFM-creation step for blogs, news and Twitter, I will create a dfm.Function using the quanteda package. I will also create an ngram.Function based on tokens_ngrams(), since we need to look at the relationships between words such as bigrams, trigrams and quadgrams.
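
Below is a condensed sketch of the profanity dictionary and the two helper functions; the full versions are in the code appendix, and the commented usage line is illustrative only.

profanityWords <- readLines("./data/profanity_words.txt", encoding = "UTF-8", skipNul = TRUE)
dict.Profanity <- dictionary(list(badWord = profanityWords))  #quanteda dictionary of bad words

dfm.Function <- function(toks) {
    dfm(toks, remove = dict.Profanity)  #Build a DFM from tokens and drop profanity terms
}

ngram.Function <- function(toks, n) {
    tokens_ngrams(toks, n = n)  #Build n-grams from tokens (n = 2 bigram, 3 trigram, 4 quadgram)
}

## Illustrative usage on a cleaned tokens object:
## sample.TokenV1 %>% ngram.Function(n = 2) %>% dfm.Function() %>% topfeatures(10)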


The best way to explore text data is to look at the data visually.


Below is the top 30 word frequency plot:

Figure: Top 30 word frequencies in the raw data of blogs, news and Twitter combined in a single corpus.


Initial Observation:

From our initial observation of the raw corpus plot and wordcloud, the data contain a lot of punctuation and stop words. From the top 30 plot above, we can see that periods and commas are among the top three most frequent tokens, with the stop word “the” ranking second; other stop words are also among the top 10.

For next word prediction it makes no sense to offer punctuation as the next word, so we need to remove punctuation during data cleaning.


3.4 Sampling and Tokenization

As mentioned above, to improve efficiency given the large dataset, it is better to work on samples of the data. The sample will be 10% of each source, and the set.seed() function is used to ensure reproducibility.
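
A sketch of this sampling step (the full version, including the sample summary below, is in the code appendix):

set.seed(1234)  #Fixed seed so the sample is reproducible

sampleSize <- 0.1  #Sample size is 10% of the population

sample.Blogs <- sample(raw.Blogs, size = sampleSize * length(raw.Blogs), replace = FALSE)
sample.News <- sample(raw.News, size = sampleSize * length(raw.News), replace = FALSE)
sample.Twitter <- sample(raw.Twitter, size = sampleSize * length(raw.Twitter), replace = FALSE)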

Below is the information for the sampled blogs, news and Twitter data:

File                  File Size (MB)   Lines     Characters   Words      Longest Line
US Blogs - Sample     25.49            89928     20630198     3724025    9810
US News - Sample      1.98             7725      1570651      264978     1477
US Twitter - Sample   32.18            236014    16190208     3035147    140


Comparing the raw and sample data, we can see that each sample contains 10% of the lines of the raw data, with the file size, characters and words scaled down roughly in proportion.



As part of the data cleaning for next word prediction, the following will be removed (see the sketch after the list):

  • Numbers
  • Punctuations
  • Symbols
  • URL
  • Profanity
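
A sketch of this cleaning step, using quanteda's tokens() options together with the profanity dictionary built earlier (sample.Corpus and the full code are in the appendix):

sample.Token <- tokens(sample.Corpus, remove_numbers = TRUE, remove_punct = TRUE, 
    remove_symbols = TRUE, remove_url = TRUE)  #Drop numbers, punctuation, symbols and URLs

sample.TokenV1 <- tokens_remove(sample.Token, pattern = dict.Profanity)  #Drop profanity terms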



After tokenising and analysing the sample, the corpus contains 174445 features.

The Table of Top 20 Most Frequent Words
feature   frequency   rank   docfreq   group
the       294295      1      133784    all
to        192336      2      114545    all
and       159323      3      90483     all
a         157627      4      97656     all
i         150440      5      87786     all
of        129107      6      76730     all
in        102119      7      70979     all
you       85407       8      61691     all
is        81423       9      60530     all
for       77008       10     62095     all
that      72345       11     50890     all
it        71115       12     51735     all
on        57035       13     47530     all
my        56613       14     42756     all
with      47948       15     38831     all
this      43122       16     35650     all
was       40993       17     27893     all
be        40439       18     34058     all
have      39773       19     33108     all
at        37558       20     32212     all


3.5 Word Coverage

Question: How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

To address this we use the textstat_frequency() function to obtain the word frequencies, then take the cumulative sum (cumsum) divided by the total frequency to get the coverage. Credit to the members of the course discussion forums for this method.
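
A sketch of the coverage calculation on the frequency-sorted unigram DFM (full version in the code appendix):

unigram <- sample.TokenV1 %>% ngram.Function(n = 1) %>% dfm.Function()

unigram.Frequency <- textstat_frequency(unigram)$frequency  #Frequencies, most frequent first
p <- cumsum(unigram.Frequency)/sum(unigram.Frequency)  #Cumulative share of all word instances

get.Coverage <- function(n = 0.5) {
    which(p >= n)[1]  #First rank whose cumulative share reaches the target coverage
}

get.Coverage(0.5)  #Unique words needed for 50% coverage
get.Coverage(0.9)  #Unique words needed for 90% coverage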

This table shows the number of unique words needed to cover from 50% to 95% of all word instances:

50% - 95% Unique Word Coverage

Coverage in %   No. of Words
50%             118
55%             177
60%             277
65%             438
70%             705
75%             1164
80%             1959
85%             3486
90%             6894
95%             18372

The plot below shows the number of unique words needed to cover up to 95% of the sampled corpus:

Figure: Unique word coverage in the sample data of blogs, news and Twitter combined in a single corpus.


Observation:

We need 118 unique words to cover 50% of all word instances in the language, and 6894 unique words to cover 90%.

From the plot we also notice that a small fraction of unique words accounts for the majority of the text.



4. Modelling

Let us build a first simple model for viewing the relationships between words by using the ngram.Function described earlier, which wraps tokens_ngrams(). We will use it to create bigrams, trigrams and quadgrams, then visually inspect the top 20 most frequent n-grams in plots and wordclouds.
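
For example, the top 20 bigrams can be pulled from the cleaned sample tokens as follows; the appendix loops over n = 1 to 4 and arranges the resulting plots side by side:

bigram.Top20 <- sample.TokenV1 %>% ngram.Function(n = 2) %>% dfm.Function() %>% 
    topfeatures(20)
bigram.Top20  #Named vector of the 20 most frequent bigrams and their counts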


Side-by-side comparison of the bigram, trigram and quadgram wordclouds.


Summary and Next Step

From our initial observation of the raw corpus plot and wordcloud, the data contain a lot of punctuation and stop words: periods and commas are among the top three most frequent tokens, the stop word “the” ranks second, and other stop words are also among the top 10.

For next word prediction it makes no sense to offer punctuation as the next word, so punctuation is removed during data cleaning.

On the cleaned sample, 118 unique words are needed to cover 50% of all word instances in the language and 6894 unique words to cover 90%. A small fraction of unique words therefore accounts for the majority of the text.

We will also need to improve processing speed by taking a smaller sample of the raw data.

Next Plan:

Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.

We are going to build a simple Markov chain function that creates a dictionary (sketched after this list):

  1. The keys should be all of the words in the corpus
  2. The values should be a list of the words that follow the keys
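
A minimal sketch of this planned dictionary in R; it is not part of the appendix code yet, and build.Chain() and the toy token vector are placeholder names used purely for illustration:

build.Chain <- function(words) {
    ## words is a character vector of tokens in text order; group each word's
    ## successors by the word that precedes them
    chain <- split(words[-1], words[-length(words)])
    lapply(chain, unique)
}

toy <- c("i", "went", "to", "the", "gym", "and", "to", "the", "store")
chain <- build.Chain(toy)
chain[["the"]]  #Returns "gym" "store"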

We will apply Katz's Backoff Model, whereby if a matching n-gram is not available, the model will “back off” to the next lowest (n-1)-gram. This works through a set of n tables containing the corresponding lists of n-grams and their frequencies.
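
A simplified sketch of this backoff lookup, without Katz's discounting weights; the table names (freq4, freq3, freq2), their prefix/nextword columns and the underscore-joined prefixes are assumptions for illustration rather than the final design:

predict.Backoff <- function(prefix3, freq4, freq3, freq2) {
    ## prefix3 is the last three typed words; each freqN table is assumed to be
    ## sorted by frequency, with a 'prefix' column (the first n-1 words joined
    ## by "_") and a 'nextword' column
    hit <- freq4$nextword[freq4$prefix == paste(prefix3, collapse = "_")]
    if (length(hit) > 0) return(hit[1])
    
    hit <- freq3$nextword[freq3$prefix == paste(tail(prefix3, 2), collapse = "_")]
    if (length(hit) > 0) return(hit[1])
    
    hit <- freq2$nextword[freq2$prefix == tail(prefix3, 1)]
    if (length(hit) > 0) hit[1] else NA_character_
}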

Create a Text Generator: we are going to create a function that generates sentences (see the sketch after this list). It will take two inputs:

  1. The dictionary you just created
  2. The number of words you want generated
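
A sketch of such a generator, assuming the Markov-chain dictionary from the sketch above (generate.Text is a placeholder name):

generate.Text <- function(chain, n.words, start = sample(names(chain), 1)) {
    out <- start
    current <- start
    for (i in seq_len(n.words - 1)) {
        followers <- chain[[current]]
        if (is.null(followers)) break  #Dead end: no observed follower for this word
        current <- sample(followers, 1)
        out <- c(out, current)
    }
    paste(out, collapse = " ")
}

## generate.Text(chain, 5)  #e.g. "to the gym and to" from the toy chain above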

Saving Data Objects

To preserve memory, so that only the relevant data objects are carried into the next step, I will save these objects as .rds files.


Appendix Code

library(knitr)  ##Load Knitr package
library(ggplot2)  ##Plotting and data
library(caret)  ##Load package for ML
library(dplyr)  ##Data transformation package
library(quanteda)  ##Text analysis: corpus, tokens, dfm
library(ngram)  ##Word counting (wordcount)
library(tm)  ##Text mining utilities
library(RColorBrewer)  ##Colour palettes for the wordclouds
library(ggthemes)  ##Extra ggplot2 themes
library(gridExtra)  ##Arrange multiple plots in a grid
library(tidytext)  ##Tidy text mining tools
library(wordcloud)  ##Wordcloud plots
## Setting global options: echo = FALSE hides the code chunks in the report body
## (the full code is collected in this appendix); results = "hold" keeps output together.
knitr::opts_chunk$set(echo = FALSE, results = "hold", tidy = TRUE)
# Task 1: Getting and cleaning data

# URL is the given link to download from HC Corpora dataset
data.Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filepath <- "./data/SwiftKey.zip"  #set the location and file name of the downloaded zip file

# Create directory named data for the file to download
if (!file.exists("./data")) {
    dir.create("./data")
}

# Download and unzip the file.
if (!file.exists(filepath)) {
    download.file(data.Url, destfile = filepath, method = "curl")
    unzip(zipfile = filepath, exdir = "./data")  #Unzip the file and store the folder in the data folder
}

# Assign English data location
loc.Blogs <- "./data/final/en_US/en_US.blogs.txt"
loc.News <- "./data/final/en_US/en_US.news.txt"
loc.Twitter <- "./data/final/en_US/en_US.twitter.txt"

# Read the data files
raw.Blogs <- readLines(loc.Blogs, encoding = "UTF-8", skipNul = TRUE)
raw.News <- readLines(loc.News, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
raw.Twitter <- readLines(loc.Twitter, encoding = "UTF-8", skipNul = TRUE)
Rprof(checking <- "./data/checking.txt")  #Checking for bottleneck

# En_US Files information

MB <- 2^20  #Byte to MB conversion

# Check File Size
size <- round(c(object.size(raw.Blogs)/MB, object.size(raw.News)/MB, object.size(raw.Twitter)/MB), 
    2)

# Number of lines in each file
lines <- c(length(raw.Blogs), length(raw.News), length(raw.Twitter))

# Number of characters in each file
char <- c(sum(nchar(raw.Blogs)), sum(nchar(raw.News)), sum(nchar(raw.Twitter)))

# Number of words
words <- c(wordcount(raw.Blogs, sep = " "), wordcount(raw.News, sep = " "), wordcount(raw.Twitter, 
    sep = " "))

# Longest line
long.Line <- c(max(sapply(raw.Blogs, nchar)), max(sapply(raw.News, nchar)), max(sapply(raw.Twitter, 
    nchar)))

raw.Info <- cbind(size, lines, char, words, long.Line)
colnames(raw.Info) <- c("File Size (MB)", "Lines", "Characters", "Words", "Longest Line")
rownames(raw.Info) <- c("US Blogs", "US News", "US Twitter")
kable(raw.Info)

Rprof(NULL)
# summaryRprof(checking)
# The following is the analysis of the Twitter data

# In the en_US twitter data set, if you divide the number of lines where the word
# 'love' (all lowercase) occurs by the number of lines the word 'hate' (all
# lowercase) occurs, about what do you get?

love <- sum(grepl("love", raw.Twitter, ignore.case = FALSE))
hate <- sum(grepl("hate", raw.Twitter, ignore.case = FALSE))
love/hate

# The one tweet in the en_US twitter data set that matches the word 'biostats'
# says what?
raw.Twitter[grepl("biostats", raw.Twitter, ignore.case = FALSE)]


# How many tweets have the exact characters 'A computer once beat me at chess,
# but it was no match for me at kickboxing'. (I.e. the line matches those
# characters exactly.)
sum(grepl("A computer once beat me at chess, but it was no match for me at kickboxing", 
    raw.Twitter, ignore.case = TRUE))
# Wordcloud of the raw data combined from blogs, news and Twitter

# Rprof(corpus <- './data/corpus.txt') #Checking for bottleneck

# create one corpus consists of blogs, news and twitter samples
raw.Corpus <- corpus(c(raw.Blogs, raw.News, raw.Twitter))
wordcloud::wordcloud(raw.Corpus, max.words = 100, random.order = FALSE, rot.per = 0.35, 
    use.r.layout = TRUE, colors = brewer.pal(8, "Dark2"))
# Rprof(NULL)
# Profanity word filter

# Download the profanity word list (bad-words.txt) from the CMU resources page
url_1 <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
filepath_1 <- "./data/profanity_words.txt"  #set the location and file name of the downloaded zip file

# Create directory named data for the file to download
if (!file.exists("./data")) {
    dir.create("./data")
}

if (!file.exists(filepath_1)) {
    download.file(url_1, destfile = filepath_1)
}

profanityWords <- readLines("./data/profanity_words.txt", encoding = "UTF-8", skipNul = TRUE)
dict.Profanity <- dictionary(list(badWord = profanityWords))
# For the purpose of text analysis, we create two helper functions (dfm and
# ngram) to avoid repeating these steps
dfm.Function <- function(toks) {
    dfm(toks, remove = dict.Profanity)  #Build a DFM from tokens and drop profanity terms
}

ngram.Function <- function(toks, n) {
    tokens_ngrams(toks, n = n)  #Build n-grams from tokens (n = 2 bigram, 3 trigram, ...)
}
# Create data for the plot
raw.Plot <- raw.Corpus %>% tokens(what = "word") %>% ngram.Function(n = 1) %>% dfm.Function() %>% 
    topfeatures(30) %>% as.data.frame()

# Change column name to frequency
colnames(raw.Plot) <- "frequency"

# Added a column to the dataframe for plotting purpose
raw.Plot$ngram <- row.names(raw.Plot)

## Generate plots for the raw data
r <- ggplot(raw.Plot, aes(y = frequency, x = reorder(ngram, frequency)))
r <- r + geom_bar(stat = "identity") + coord_flip()
r <- r + ggtitle("Top 30 Frequency of Word in the Data")
r <- r + geom_text(aes(label = frequency), position = position_stack(vjust = 0.5), 
    color = "white", size = 3, fontface = "bold")
r <- r + ylab("Frequency") + xlab("Word")
r <- r + theme_few()
plot(r)
# Rprof(sampling <- './data/sampling.txt') #Checking for bottleneck

set.seed(1234)  #Ensure the same result for reproducibilty

sampleSize <- 0.1  #sample size is 10% of the population

sample.Blogs <- sample(raw.Blogs, size = sampleSize * length(raw.Blogs), replace = FALSE)
sample.News <- sample(raw.News, size = sampleSize * length(raw.News), replace = FALSE)
sample.Twitter <- sample(raw.Twitter, size = sampleSize * length(raw.Twitter), replace = FALSE)

# Check size
sample.Size <- round(c(object.size(sample.Blogs)/MB, object.size(sample.News)/MB, 
    object.size(sample.Twitter)/MB), 2)


# Number of lines in each file
sample.Lines <- c(length(sample.Blogs), length(sample.News), length(sample.Twitter))


# Number of characters in each file
sample.Char <- c(sum(nchar(sample.Blogs)), sum(nchar(sample.News)), sum(nchar(sample.Twitter)))


# Number of words
sample.Words <- c(wordcount(sample.Blogs, sep = " "), wordcount(sample.News, sep = " "), 
    wordcount(sample.Twitter, sep = " "))


# Longest line
sample.LongLine <- c(max(sapply(sample.Blogs, nchar)), max(sapply(sample.News, nchar)), 
    max(sapply(sample.Twitter, nchar)))


sample.Info <- cbind(sample.Size, sample.Lines, sample.Char, sample.Words, sample.LongLine)
colnames(sample.Info) <- c("File Size (MB)", "Lines", "Characters", "Words", "Longest Line")
rownames(sample.Info) <- c("US Blogs - Sample", "US News - Sample", "US Twitter - Sample")
kable(sample.Info)

# Rprof(NULL); summaryRprof(sampling)
# Rprof(corpus <- './data/corpus.txt')  #Checking for bottleneck


# create one corpus consists of blogs, news and twitter samples
sample.Corpus <- corpus(c(sample.Blogs, sample.News, sample.Twitter))

# wordcloud::wordcloud(sample.Corpus, max.words = 50, random.order = FALSE,
# rot.per=0.35, use.r.layout=TRUE, colors=brewer.pal(8, 'Dark2'))

# head(sample.Corpus) length(sample.Corpus)

# Rprof(NULL)
# Rprof(tokenising <- './data/tokenising.txt')  #Checking for bottleneck
# Preprocess the text

# Create tokens
sample.Token <- tokens(sample.Corpus, remove_numbers = TRUE, remove_punct = TRUE, 
    remove_symbols = TRUE, remove_url = TRUE, include_docvars = TRUE)

# remove profanity words
sample.TokenV1 <- tokens_remove(sample.Token, pattern = dict.Profanity)

sample.Dfm <- dfm(sample.TokenV1, remove = dict.Profanity)
# head(sample.Dfm)


## Word frequency
document.Frequency <- docfreq(sample.Dfm, scheme = "count")
# document.Frequency

# Total number of feature
total.Features <- length(featnames(sample.Dfm))


# Top 20 features using topfeature function
top20.Features <- sample.Dfm %>% topfeatures(20) %>% head(20) %>% data.frame()
# colnames(top20.Features) <- 'Frequency'


# Top 20 features using textstat_frequency function
tstat <- textstat_frequency(sample.Dfm, n = 20, ties_method = c("min"))


# Top 20 Textstat_keyness
tstat2 <- textstat_keyness(sample.Dfm, measure = c("chi2", "exact", "lr", "pmi"), 
    sort = TRUE, 20)

# Token summary
tstat3 <- textstat_summary(sample.Dfm, cache = TRUE)

# Rprof(NULL); summaryRprof(tokenising)
unigram <- sample.TokenV1 %>% ngram.Function(n = 1) %>% dfm.Function()

unigram.Frequency <- textstat_frequency(unigram)$frequency
p <- cumsum(unigram.Frequency)/sum(unigram.Frequency)

get.Coverage <- function(n = 0.5) {
    which(p >= n)[1]
}

coverage.50 <- get.Coverage(0.5)
coverage.90 <- get.Coverage(0.9)

coverages <- seq(0.1, 0.95, 0.05)

coverage.Seq <- sapply(coverages, get.Coverage)


cov.Df <- as.data.frame(cbind(coverages, coverage.Seq))


g <- ggplot(data = cov.Df, aes(y = coverages, x = coverage.Seq))
g <- g + geom_point(size = 2, colour = "red") + geom_line(size = 1, colour = "red3")
g <- g + geom_hline(yintercept = 0.5) + geom_hline(yintercept = 0.9)
g <- g + scale_y_continuous(labels = scales::percent)
g <- g + scale_x_continuous(breaks = c(coverage.50, coverage.90, get.Coverage(0.95)))
g <- g + geom_vline(xintercept = coverage.50, linetype = "dashed") + geom_vline(xintercept = coverage.90, 
    linetype = "dashed")
g <- g + ylab("Coverage %") + xlab("Number of words")
g <- g + ggtitle("Unique Word Coverage in A Corpus")
g <- g + theme_bw()

cov.Df2 <- cov.Df
cov.Df2$coverages <- paste(cov.Df2$coverages * 100, "%", sep = "")
plot(g)

for (i in 1:4) {
    ## Prepare data frame for plotting
    ngram.Plot <- sample.TokenV1 %>% ngram.Function(n = i) %>% dfm.Function() %>% 
        topfeatures(20) %>% as.data.frame()
    
    colnames(ngram.Plot) <- "frequency"
    ngram.Plot$ngram <- row.names(ngram.Plot)
    
    
    ## Generate plots
    g <- ggplot(ngram.Plot, aes(y = frequency, x = reorder(ngram, frequency)))
    g <- g + geom_bar(stat = "identity") + coord_flip()
    g <- g + ggtitle(paste("Top 20 - ", i, " grams", sep = " "))
    g <- g + geom_text(aes(label = frequency), position = position_stack(vjust = 0.5), 
        color = "white", size = 3, fontface = "bold")
    g <- g + ylab("") + xlab("")
    g <- g + theme_few()
    assign(paste("p", i, sep = ""), g)
}

grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2, top = "Top 20 ngrams")

# par(mfrow=c(1,3))

# Create wordcloud of bigram
biCloud <- sample.TokenV1 %>% ngram.Function(n = 2) %>% dfm.Function() %>% textplot_wordcloud(max.words = 50, 
    colors = brewer.pal(8, "Dark2"))

# Create wordcloud of trigram
triCloud <- sample.TokenV1 %>% ngram.Function(n = 3) %>% dfm.Function() %>% textplot_wordcloud(max.words = 35, 
    colors = brewer.pal(8, "Dark2"))

# Create wordcloud of quadgram
quadCloud <- sample.TokenV1 %>% ngram.Function(n = 4) %>% dfm.Function() %>% textplot_wordcloud(max.words = 20, 
    colors = brewer.pal(8, "Dark2"))

## Save the corpus, tokens and DFM objects for the next step
saveRDS(raw.Corpus, file = "./data/rawCorpus.rds")
saveRDS(sample.Corpus, file = "./data/sampleCorpus.rds")
saveRDS(sample.Token, file = "./data/sampleToken.rds")
saveRDS(sample.TokenV1, file = "./data/sampleTokenV1.rds")
saveRDS(sample.Dfm, file = "./data/sampleDfm.rds")

References:

Natural Language Processing Stanford Uni by Jurafsky & Manning - https://www.youtube.com/watch?v=oWsMIW-5xUc&list=PLLssT5z_DsK8HbD2sPcUIDfQ7zmBarMYv

N-Gram Models by Jurafsky & Martin - https://web.stanford.edu/~jurafsky/slp3/3.pdf

Text Analysis in R by Welbers, Atteveldt & Benoit - https://kenbenoit.net/pdfs/text_analysis_in_R.pdf

Basic Text Analysis in R by Bail - https://compsocialscience.github.io/summer-institute/2018/materials/day3-text-analysis/basic-text-analysis/rmarkdown/Basic_Text_Analysis_in_R.html

Introduction to Text Analytics in R by Langer - https://www.youtube.com/playlist?list=PL8eNk_zTBST8olxIRFoo0YeXxEOkYdoxi

Cran Task View: Natural Language Processing by Fridolin Wild - https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Text Mining Infrastructure in R by Feinerer, Hornik & Meyer - https://www.researchgate.net/publication/26539008_Text_Mining_Infrastructure_in_R

Guide to Ngram Package by Schmidt- https://cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdf

Text Mining in R: A Tidy Approach, by Julia Silge and David Robinson - https://www.tidytextmining.com/index.html

Text Mining and Word Cloud Fundamentals in R - http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

textstat_frequency: Tabulate feature frequencies - https://rdrr.io/cran/quanteda/man/textstat_frequency.html

Predicting the Next Word: Back-Off Language Modeling by Masse - https://medium.com/@davidmasse8/predicting-the-next-word-back-off-language-modeling-8db607444ba9

Natural language processing: What would Shakespeare say? by Ganesh - https://www.r-bloggers.com/natural-language-processing-what-would-shakespeare-say/

Natural Language processing in Python by Zhao - https://youtu.be/xvqsFTUsOmc

How to Generate Word Clouds in R by Van den Rul - https://towardsdatascience.com/create-a-word-cloud-with-r-bde3e7422e8a


The system platform specification used:

Spec      Description
OS        Windows 10 Pro - 64 bit
CPU       AMD Ryzen 5 - 3400G (4 cores & 8 threads)
RAM       16GB DDR4 3000MHz
Storage   500GB SSD - M.2 NVMe (PCIe)
Tool      RStudio