Swiftkey - Next Word Prediction Milestone Report

1. Introduction


1.1 Background

Natural Language Processing (NLP) has come a long way, according to Dan Jurafsky of Stanford. It now powers language translation and has helped solve problems such as spam detection, part-of-speech tagging and named entity recognition. However, tasks such as dialogue, question answering and summarisation remain very hard because of ambiguity.

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. But typing on mobile devices can be a serious pain. SwiftKey, our corporate partner in this capstone, builds a smart keyboard that makes it easier for people to type on their mobile devices. One cornerstone of their smart keyboard is predictive text models. When someone types:

I went to the …

the keyboard presents three options for what the next word might be. For example, the three words might be gym, store, restaurant.


1.2 Objective

In this capstone we will work on understanding and building predictive text models like those used by SwiftKey.


1.3 How to achieve this objective?

The first step in analyzing any new data set is figuring out:

  1. what data you have and
  2. what are the standard tools and models used for that type of data.


1.3.1 Data:

This project uses the files named LOCALE.blogs.txt, where LOCALE is each of the four locales en_US, de_DE, ru_RU and fi_FI. The data come from a corpus called HC Corpora. The files have been language filtered but may still contain some foreign text. This training data will be the basis for most of the capstone. Download the data from the link below, not from external websites.

HC Corpora Dataset


1.3.2 Tools:

The first step toward working on this project is to become familiar with Natural Language Processing, Text Mining, and the associated tools in R. Please refer to the references below for the materials on NLP and Text Mining.

The main development application for this project is RStudio, and the main libraries for NLP will be tm, quanteda, wordcloud and tidytext.

Other libraries are also used, such as ggplot2 for plotting, caret for machine learning and dplyr for data manipulation.


1.3.3 Modelling:

The first step in building a predictive text mining application is a simple model of the relationship between words.

Given the scale of the data (more than 500 MB), it is inefficient and impractical to work on the full files, as doing so would take a great deal of computing power and time. Therefore we will take a sample from each en_US text file and combine the samples into a single corpus. A good maximum sample size is usually around 10% of the population: in a population of 5,000, 10% would be 500; in a population of 200,000, 10% would be 20,000. We will therefore create samples of the blogs, news and Twitter datasets.

According to the R-bloggers article cited below, after loading the data we need to generate a corpus. A corpus is a type of dataset used in text analysis. It contains “a collection of text or speech material that has been brought together according to a certain set of predetermined criteria”.

Another essential component for text analysis is the document-feature matrix (DFM), also called a document-term matrix (DTM). The two terms are near-synonyms: quanteda refers to a DFM whereas other packages refer to a DTM. It describes how frequently terms occur in the corpus by counting single terms. To generate a DFM, we first split the text into its individual terms (tokens) and then count how frequently each token occurs in each document.

A corpus is positional (string of words) and a DFM is non-positional (bag of words). Put differently, the order of the words matters in a corpus whereas a DFM does not have information on the position of words.
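
As a minimal sketch of this workflow in quanteda (the two-document text vector below is made up purely for illustration and is not part of the project data):

library(quanteda)

txt <- c(doc1 = "I went to the gym", doc2 = "I went to the store")

toy.Corpus <- corpus(txt)  #Positional: the order of the words is kept
toy.Tokens <- tokens(toy.Corpus)  #Split each document into tokens
toy.Dfm <- dfm(toy.Tokens)  #Non-positional: counts of each token per document

topfeatures(toy.Dfm, 5)  #Most frequent terms across the documents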

Next, to understand the relationship between words, I will use the exploratory analysis to build a basic n-gram model for predicting the next word from the previous 1, 2 or 3 words. After that, the model will need to handle unseen n-grams, since people will sometimes type a combination of words that does not appear in the corpus.


2. Preprocessing


Please refer to the code appendix below for the details of the code used to perform these tasks.


2.1 Gathering and Cleaning Data

Download the data to the specified directory and unzip the file at the same time.


Below is the information for the en_US files that we have downloaded and unzipped:

File         File Size (MB)   Lines      Characters    Words       Longest Line
US Blogs     255.35           899288     206824505     37334131    40833
US News      19.77            77259      15639408      2643969     5760
US Twitter   318.99           2360148    162096241     30373583    140


From this information for the three en_US sources, we can see that the blogs source contains the highest number of words and has the longest line. The Twitter source is the largest at roughly 319 MB and has the shortest maximum line length (140 characters), whereas the news source is the smallest. The wordcloud also shows that words such as “the”, “just” and “like” are the most frequently used.


3. Exploratory Data Analysis

As mentioned above (following the R-bloggers article), after loading the data we need to generate a corpus, then a DFM (or DTM). The two terms are near-synonyms: quanteda refers to a DFM whereas other packages refer to a DTM. It describes how frequently terms occur in the corpus by counting single terms. Before generating a DFM, we first split the text into its individual terms (tokens) and then count how frequently each token occurs in each document.

A token is each individual word in a text (although it could also be a sentence, paragraph or character). This is why creating a “bag of words” is also called tokenizing text. In a nutshell, a DFM is a very efficient way of organizing the frequency of features/tokens, but it does not contain any information on their position. In our case, the features of the text are represented by the columns of the DFM, which aggregate the frequency of each token.

In most projects you want one corpus to contain all your data and generate many DFMs from that.


3.1 We will explore the data by looking at

  1. Top Words - Visualise the top words using the wordcloud and ggplot functions.
  2. Word Coverage - How many unique words are needed in a frequency-sorted dictionary to cover 50% of all word instances in the language? 90%?



3.2 The top 100 words from the blogs, news and Twitter combined into a single corpus:


3.3 Cleaning Data

While doing some exploratory analysis with the grepl() function on the raw corpus data, I realised there was a lot of profanity. The word “fuck” and related forms appeared 27236 times in the raw data, which is not acceptable for our text prediction. To prevent profanity from appearing in the predictions, we will create a list of profanity words using the dictionary() function.

The list of profanity words (in txt format) is downloaded from the bad-words.txt list linked in the code appendix.

As mentioned above, we want one corpus to contain all our data and to generate many DFMs from it. To avoid repeating the DFM-creation step for blogs, news and Twitter, I will create a dfm.Function using the quanteda package. I will also create an ngram.Function based on tokens_ngrams(), since we need to look at the relationships between words such as bigrams, trigrams and quadgrams.
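
Below is a condensed sketch of the profanity dictionary and the two helper functions; the full versions are in the code appendix, and the commented usage line is illustrative only.

profanityWords <- readLines("./data/profanity_words.txt", encoding = "UTF-8", skipNul = TRUE)
dict.Profanity <- dictionary(list(badWord = profanityWords))  #quanteda dictionary of bad words

dfm.Function <- function(toks) {
    dfm(toks, remove = dict.Profanity)  #Build a DFM from tokens and drop profanity terms
}

ngram.Function <- function(toks, n) {
    tokens_ngrams(toks, n = n)  #Build n-grams from tokens (n = 2 bigram, 3 trigram, 4 quadgram)
}

## Illustrative usage on a cleaned tokens object:
## sample.TokenV1 %>% ngram.Function(n = 2) %>% dfm.Function() %>% topfeatures(10)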


The best way to explore text data is to look at the data visually.


Below is the top 30 word frequency plot:

Figure: Top 30 word frequencies in the raw data of blogs, news and Twitter combined in a single corpus.


Initial Observation:

From our initial observation of the raw corpus plot and wordcloud, the data contain a lot of punctuation and stop words. From the top 30 plot above, we can see that periods and commas are among the top three most frequent tokens, with the stop word “the” ranking second; other stop words are also among the top 10.

For next word prediction it makes no sense to offer punctuation as the next word, so we need to remove punctuation during data cleaning.


3.4 Sampling and Tokenization

As mentioned above, to improve efficiency given the large dataset, it is better to work on samples of the data. The sample will be 10% of each source, and the set.seed() function is used to ensure reproducibility.
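
A sketch of this sampling step (the full version, including the sample summary below, is in the code appendix):

set.seed(1234)  #Fixed seed so the sample is reproducible

sampleSize <- 0.1  #Sample size is 10% of the population

sample.Blogs <- sample(raw.Blogs, size = sampleSize * length(raw.Blogs), replace = FALSE)
sample.News <- sample(raw.News, size = sampleSize * length(raw.News), replace = FALSE)
sample.Twitter <- sample(raw.Twitter, size = sampleSize * length(raw.Twitter), replace = FALSE)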

Below is the information for the sampled blogs, news and Twitter data:

File                  File Size (MB)   Lines     Characters   Words      Longest Line
US Blogs - Sample     25.49            89928     20630198     3724025    9810
US News - Sample      1.98             7725      1570651      264978     1477
US Twitter - Sample   32.18            236014    16190208     3035147    140


Comparing the raw and sample data, we can see that each sample contains 10% of the lines of the raw data, with the file size, characters and words scaled down roughly in proportion.



As part of the data cleaning for next word prediction, the following will be removed (see the sketch after the list):

  • Numbers
  • Punctuations
  • Symbols
  • URL
  • Profanity
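
A sketch of this cleaning step, using quanteda's tokens() options together with the profanity dictionary built earlier (sample.Corpus and the full code are in the appendix):

sample.Token <- tokens(sample.Corpus, remove_numbers = TRUE, remove_punct = TRUE, 
    remove_symbols = TRUE, remove_url = TRUE)  #Drop numbers, punctuation, symbols and URLs

sample.TokenV1 <- tokens_remove(sample.Token, pattern = dict.Profanity)  #Drop profanity terms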



After tokenising and analysing the sample, the corpus contains 174445 features.

The Table of Top 20 Most Frequent Words
feature   frequency   rank   docfreq   group
the       294295      1      133784    all
to        192336      2      114545    all
and       159323      3      90483     all
a         157627      4      97656     all
i         150440      5      87786     all
of        129107      6      76730     all
in        102119      7      70979     all
you       85407       8      61691     all
is        81423       9      60530     all
for       77008       10     62095     all
that      72345       11     50890     all
it        71115       12     51735     all
on        57035       13     47530     all
my        56613       14     42756     all
with      47948       15     38831     all
this      43122       16     35650     all
was       40993       17     27893     all
be        40439       18     34058     all
have      39773       19     33108     all
at        37558       20     32212     all


3.5 Word Coverage

Question: How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

To address this we use the textstat_frequency() function to obtain the word frequencies, then take the cumulative sum (cumsum) divided by the total frequency to get the coverage. Credit to the members of the course discussion forums for this method.
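
A sketch of the coverage calculation on the frequency-sorted unigram DFM (full version in the code appendix):

unigram <- sample.TokenV1 %>% ngram.Function(n = 1) %>% dfm.Function()

unigram.Frequency <- textstat_frequency(unigram)$frequency  #Frequencies, most frequent first
p <- cumsum(unigram.Frequency)/sum(unigram.Frequency)  #Cumulative share of all word instances

get.Coverage <- function(n = 0.5) {
    which(p >= n)[1]  #First rank whose cumulative share reaches the target coverage
}

get.Coverage(0.5)  #Unique words needed for 50% coverage
get.Coverage(0.9)  #Unique words needed for 90% coverage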

This table shows the number of unique words needed to cover from 50% to 95% of all word instances:

50% - 95% Unique Word Coverage

Coverage in %   No. of Words
50%             118
55%             177
60%             277
65%             438
70%             705
75%             1164
80%             1959
85%             3486
90%             6894
95%             18372

The plot below shows the number of unique words needed to cover up to 95% of the sampled corpus:

Figure: Unique word coverage in the sample data of blogs, news and Twitter combined in a single corpus.


Observation:

We need 118 unique words to cover 50% of all word instances in the language, and 6894 unique words to cover 90%.

From the plot we also notice that a small fraction of unique words accounts for the majority of the text.



4. Modelling

Let us build a first simple model for viewing the relationships between words by using the ngram.Function described earlier, which wraps tokens_ngrams(). We will use it to create bigrams, trigrams and quadgrams, then visually inspect the top 20 most frequent n-grams in plots and wordclouds.
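
For example, the top 20 bigrams can be pulled from the cleaned sample tokens as follows; the appendix loops over n = 1 to 4 and arranges the resulting plots side by side:

bigram.Top20 <- sample.TokenV1 %>% ngram.Function(n = 2) %>% dfm.Function() %>% 
    topfeatures(20)
bigram.Top20  #Named vector of the 20 most frequent bigrams and their counts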


Side-by-side comparison of the bigram, trigram and quadgram wordclouds.


Summary and Next Step

From our initial observation of the raw corpus plot and wordcloud, the data contain a lot of punctuation and stop words: periods and commas are among the top three most frequent tokens, the stop word “the” ranks second, and other stop words are also among the top 10.

For next word prediction it makes no sense to offer punctuation as the next word, so punctuation is removed during data cleaning.

On the cleaned sample, 118 unique words are needed to cover 50% of all word instances in the language and 6894 unique words to cover 90%. A small fraction of unique words therefore accounts for the majority of the text.

We will also need to improve processing speed by taking a smaller sample of the raw data.

Next Plan:

Markov chains can be used for very basic text generation. Think about every word in a corpus as a state. We can make a simple assumption that the next word is only dependent on the previous word - which is the basic assumption of a Markov chain.

We are going to build a simple Markov chain function that creates a dictionary (sketched after this list):

  1. The keys should be all of the words in the corpus
  2. The values should be a list of the words that follow the keys
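
A minimal sketch of this planned dictionary in R; it is not part of the appendix code yet, and build.Chain() and the toy token vector are placeholder names used purely for illustration:

build.Chain <- function(words) {
    ## words is a character vector of tokens in text order; group each word's
    ## successors by the word that precedes them
    chain <- split(words[-1], words[-length(words)])
    lapply(chain, unique)
}

toy <- c("i", "went", "to", "the", "gym", "and", "to", "the", "store")
chain <- build.Chain(toy)
chain[["the"]]  #Returns "gym" "store"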

We will apply Katz's Backoff Model, whereby if a matching n-gram is not available, the model will “back off” to the next lowest (n-1)-gram. This works through a set of n tables containing the corresponding lists of n-grams and their frequencies.
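
A simplified sketch of this backoff lookup, without Katz's discounting weights; the table names (freq4, freq3, freq2), their prefix/nextword columns and the underscore-joined prefixes are assumptions for illustration rather than the final design:

predict.Backoff <- function(prefix3, freq4, freq3, freq2) {
    ## prefix3 is the last three typed words; each freqN table is assumed to be
    ## sorted by frequency, with a 'prefix' column (the first n-1 words joined
    ## by "_") and a 'nextword' column
    hit <- freq4$nextword[freq4$prefix == paste(prefix3, collapse = "_")]
    if (length(hit) > 0) return(hit[1])
    
    hit <- freq3$nextword[freq3$prefix == paste(tail(prefix3, 2), collapse = "_")]
    if (length(hit) > 0) return(hit[1])
    
    hit <- freq2$nextword[freq2$prefix == tail(prefix3, 1)]
    if (length(hit) > 0) hit[1] else NA_character_
}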

Create a Text Generator: we are going to create a function that generates sentences (see the sketch after this list). It will take two inputs:

  1. The dictionary you just created
  2. The number of words you want generated
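
A sketch of such a generator, assuming the Markov-chain dictionary from the sketch above (generate.Text is a placeholder name):

generate.Text <- function(chain, n.words, start = sample(names(chain), 1)) {
    out <- start
    current <- start
    for (i in seq_len(n.words - 1)) {
        followers <- chain[[current]]
        if (is.null(followers)) break  #Dead end: no observed follower for this word
        current <- sample(followers, 1)
        out <- c(out, current)
    }
    paste(out, collapse = " ")
}

## generate.Text(chain, 5)  #e.g. "to the gym and to" from the toy chain above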

Saving Data Objects

To preserve memory, so that only the relevant data objects are carried into the next step, I will save these objects as .rds files.


Appendix Code

library(knitr)  ##Load Knitr package
library(ggplot2)  ##Plotting and data
library(caret)  ##Load package for ML
library(dplyr)  ##Data transformation package
library(quanteda)  ##Text analysis: corpus, tokens, dfm
library(ngram)  ##Word counting (wordcount)
library(tm)  ##Text mining utilities
library(RColorBrewer)  ##Colour palettes for the wordclouds
library(ggthemes)  ##Extra ggplot2 themes
library(gridExtra)  ##Arrange multiple plots in a grid
library(tidytext)  ##Tidy text mining tools
library(wordcloud)  ##Wordcloud plots
## Setting global options: echo = FALSE hides the code chunks in the report body
## (the full code is collected in this appendix); results = "hold" keeps output together.
knitr::opts_chunk$set(echo = FALSE, results = "hold", tidy = TRUE)
# Task 1: Getting and cleaning data

# URL is the given link to download from HC Corpora dataset
data.Url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
filepath <- "./data/SwiftKey.zip"  #set the location and file name of the downloaded zip file

# Create directory named data for the file to download
if (!file.exists("./data")) {
    dir.create("./data")
}

# Download and unzip the file.
if (!file.exists(filepath)) {
    download.file(data.Url, destfile = filepath, method = "curl")
    unzip(zipfile = filepath, exdir = "./data")  #Unzip the file and store the folder in the data folder
}

# Assign English data location
loc.Blogs <- "./data/final/en_US/en_US.blogs.txt"
loc.News <- "./data/final/en_US/en_US.news.txt"
loc.Twitter <- "./data/final/en_US/en_US.twitter.txt"

# Read the data files
raw.Blogs <- readLines(loc.Blogs, encoding = "UTF-8", skipNul = TRUE)
raw.News <- readLines(loc.News, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
raw.Twitter <- readLines(loc.Twitter, encoding = "UTF-8", skipNul = TRUE)
Rprof(checking <- "./data/checking.txt")  #Checking for bottleneck

# En_US Files information

MB <- 2^20  #Byte to MB conversion

# Check File Size
size <- round(c(object.size(raw.Blogs)/MB, object.size(raw.News)/MB, object.size(raw.Twitter)/MB), 
    2)

# Number of lines in each file
lines <- c(length(raw.Blogs), length(raw.News), length(raw.Twitter))

# Number of characters in each file
char <- c(sum(nchar(raw.Blogs)), sum(nchar(raw.News)), sum(nchar(raw.Twitter)))

# Number of words
words <- c(wordcount(raw.Blogs, sep = " "), wordcount(raw.News, sep = " "), wordcount(raw.Twitter, 
    sep = " "))

# Longest line
long.Line <- c(max(sapply(raw.Blogs, nchar)), max(sapply(raw.News, nchar)), max(sapply(raw.Twitter, 
    nchar)))

raw.Info <- cbind(size, lines, char, words, long.Line)
colnames(raw.Info) <- c("File Size (MB)", "Lines", "Characters", "Words", "Longest Line")
rownames(raw.Info) <- c("US Blogs", "US News", "US Twitter")
kable(raw.Info)

Rprof(NULL)
# summaryRprof(checking)
# The following is the analysis of the Twitter data

# In the en_US twitter data set, if you divide the number of lines where the word
# 'love' (all lowercase) occurs by the number of lines the word 'hate' (all
# lowercase) occurs, about what do you get?

love <- sum(grepl("love", raw.Twitter, ignore.case = FALSE))
hate <- sum(grepl("hate", raw.Twitter, ignore.case = FALSE))
love/hate

# The one tweet in the en_US twitter data set that matches the word 'biostats'
# says what?
raw.Twitter[grepl("biostats", raw.Twitter, ignore.case = FALSE)]


# How many tweets have the exact characters 'A computer once beat me at chess,
# but it was no match for me at kickboxing'. (I.e. the line matches those
# characters exactly.)
sum(grepl("A computer once beat me at chess, but it was no match for me at kickboxing", 
    raw.Twitter, ignore.case = TRUE))
# Wordcloud of the raw data combined from blogs, news and Twitter

# Rprof(corpus <- './data/corpus.txt') #Checking for bottleneck

# create one corpus consists of blogs, news and twitter samples
raw.Corpus <- corpus(c(raw.Blogs, raw.News, raw.Twitter))
wordcloud::wordcloud(raw.Corpus, max.words = 100, random.order = FALSE, rot.per = 0.35, 
    use.r.layout = TRUE, colors = brewer.pal(8, "Dark2"))
# Rprof(NULL)
# Profanity word filter

# Download the profanity word list (bad-words.txt) from the CMU resources page
url_1 <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
filepath_1 <- "./data/profanity_words.txt"  #set the location and file name of the downloaded zip file

# Create directory named data for the file to download
if (!file.exists("./data")) {
    dir.create("./data")
}

if (!file.exists(filepath_1)) {
    download.file(url_1, destfile = filepath_1)
}

profanityWords <- readLines("./data/profanity_words.txt", encoding = "UTF-8", skipNul = TRUE)
dict.Profanity <- dictionary(list(badWord = profanityWords))
# For the purpose of text analysis, we create two helper functions (dfm and
# ngram) to avoid repeating these steps
dfm.Function <- function(toks) {
    dfm(toks, remove = dict.Profanity)  #Build a DFM from tokens and drop profanity terms
}

ngram.Function <- function(toks, n) {
    tokens_ngrams(toks, n = n)  #Build n-grams from tokens (n = 2 bigram, 3 trigram, ...)
}
# Create data for the plot
raw.Plot <- raw.Corpus %>% tokens(what = "word") %>% ngram.Function(n = 1) %>% dfm.Function() %>% 
    topfeatures(30) %>% as.data.frame()

# Change column name to frequency
colnames(raw.Plot) <- "frequency"

# Added a column to the dataframe for plotting purpose
raw.Plot$ngram <- row.names(raw.Plot)

## Generate plots for the raw data
r <- ggplot(raw.Plot, aes(y = frequency, x = reorder(ngram, frequency)))
r <- r + geom_bar(stat = "identity") + coord_flip()
r <- r + ggtitle("Top 30 Frequency of Word in the Data")
r <- r + geom_text(aes(label = frequency), position = position_stack(vjust = 0.5), 
    color = "white", size = 3, fontface = "bold")
r <- r + ylab("Frequency") + xlab("Word")
r <- r + theme_few()
plot(r)
# Rprof(sampling <- './data/sampling.txt') #Checking for bottleneck

set.seed(1234)  #Ensure the same result for reproducibilty

sampleSize <- 0.1  #sample size is 10% of the population

sample.Blogs <- sample(raw.Blogs, size = sampleSize * length(raw.Blogs), replace = FALSE)
sample.News <- sample(raw.News, size = sampleSize * length(raw.News), replace = FALSE)
sample.Twitter <- sample(raw.Twitter, size = sampleSize * length(raw.Twitter), replace = FALSE)

# Check size
sample.Size <- round(c(object.size(sample.Blogs)/MB, object.size(sample.News)/MB, 
    object.size(sample.Twitter)/MB), 2)


# Number of lines in each file
sample.Lines <- c(length(sample.Blogs), length(sample.News), length(sample.Twitter))


# Number of characters in each file
sample.Char <- c(sum(nchar(sample.Blogs)), sum(nchar(sample.News)), sum(nchar(sample.Twitter)))


# Number of words
sample.Words <- c(wordcount(sample.Blogs, sep = " "), wordcount(sample.News, sep = " "), 
    wordcount(sample.Twitter, sep = " "))


# Longest line
sample.LongLine <- c(max(sapply(sample.Blogs, nchar)), max(sapply(sample.News, nchar)), 
    max(sapply(sample.Twitter, nchar)))


sample.Info <- cbind(sample.Size, sample.Lines, sample.Char, sample.Words, sample.LongLine)
colnames(sample.Info) <- c("File Size (MB)", "Lines", "Characters", "Words", "Longest Line")
rownames(sample.Info) <- c("US Blogs - Sample", "US News - Sample", "US Twitter - Sample")
kable(sample.Info)

# Rprof(NULL); summaryRprof(sampling)
# Rprof(corpus <- './data/corpus.txt')  #Checking for bottleneck


# create one corpus consists of blogs, news and twitter samples
sample.Corpus <- corpus(c(sample.Blogs, sample.News, sample.Twitter))

# wordcloud::wordcloud(sample.Corpus, max.words = 50, random.order = FALSE,
# rot.per=0.35, use.r.layout=TRUE, colors=brewer.pal(8, 'Dark2'))

# head(sample.Corpus) length(sample.Corpus)

# Rprof(NULL)
# Rprof(tokenising <- './data/tokenising.txt')  #Checking for bottleneck
# Preprocess the text

# Create tokens
sample.Token <- tokens(sample.Corpus, remove_numbers = TRUE, remove_punct = TRUE, 
    remove_symbols = TRUE, remove_url = TRUE, include_docvars = TRUE)

# remove profanity words
sample.TokenV1 <- tokens_remove(sample.Token, pattern = dict.Profanity)

sample.Dfm <- dfm(sample.TokenV1, remove = dict.Profanity)
# head(sample.Dfm)


## Word frequency
document.Frequency <- docfreq(sample.Dfm, scheme = "count")
# document.Frequency

# Total number of feature
total.Features <- length(featnames(sample.Dfm))


# Top 20 features using topfeature function
top20.Features <- sample.Dfm %>% topfeatures(20) %>% head(20) %>% data.frame()
# colnames(top20.Features) <- 'Frequency'


# Top 20 features using textstat_frequency function
tstat <- textstat_frequency(sample.Dfm, n = 20, ties_method = c("min"))


# Top 20 Textstat_keyness
tstat2 <- textstat_keyness(sample.Dfm, measure = c("chi2", "exact", "lr", "pmi"), 
    sort = TRUE, 20)

# Token summary
tstat3 <- textstat_summary(sample.Dfm, cache = TRUE)

# Rprof(NULL); summaryRprof(tokenising)
unigram <- sample.TokenV1 %>% ngram.Function(n = 1) %>% dfm.Function()

unigram.Frequency <- textstat_frequency(unigram)$frequency
p <- cumsum(unigram.Frequency)/sum(unigram.Frequency)

get.Coverage <- function(n = 0.5) {
    which(p >= n)[1]
}

coverage.50 <- get.Coverage(0.5)
coverage.90 <- get.Coverage(0.9)

coverages <- seq(0.1, 0.95, 0.05)

coverage.Seq <- sapply(coverages, get.Coverage)


cov.Df <- as.data.frame(cbind(coverages, coverage.Seq))


g <- ggplot(data = cov.Df, aes(y = coverages, x = coverage.Seq))
g <- g + geom_point(size = 2, colour = "red") + geom_line(size = 1, colour = "red3")
g <- g + geom_hline(yintercept = 0.5) + geom_hline(yintercept = 0.9)
g <- g + scale_y_continuous(labels = scales::percent)
g <- g + scale_x_continuous(breaks = c(coverage.50, coverage.90, get.Coverage(0.95)))
g <- g + geom_vline(xintercept = coverage.50, linetype = "dashed") + geom_vline(xintercept = coverage.90, 
    linetype = "dashed")
g <- g + ylab("Coverage %") + xlab("Number of words")
g <- g + ggtitle("Unique Word Coverage in A Corpus")
g <- g + theme_bw()

cov.Df2 <- cov.Df
cov.Df2$coverages <- paste(cov.Df2$coverages * 100, "%", sep = "")
plot(g)

for (i in 1:4) {
    ## Prepare data frame for plotting
    ngram.Plot <- sample.TokenV1 %>% ngram.Function(n = i) %>% dfm.Function() %>% 
        topfeatures(20) %>% as.data.frame()
    
    colnames(ngram.Plot) <- "frequency"
    ngram.Plot$ngram <- row.names(ngram.Plot)
    
    
    ## Generate plots
    g <- ggplot(ngram.Plot, aes(y = frequency, x = reorder(ngram, frequency)))
    g <- g + geom_bar(stat = "identity") + coord_flip()
    g <- g + ggtitle(paste("Top 20 - ", i, " grams", sep = " "))
    g <- g + geom_text(aes(label = frequency), position = position_stack(vjust = 0.5), 
        color = "white", size = 3, fontface = "bold")
    g <- g + ylab("") + xlab("")
    g <- g + theme_few()
    assign(paste("p", i, sep = ""), g)
}

grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2, top = "Top 20 ngrams")

# par(mfrow=c(1,3))

# Create wordcloud of bigram
biCloud <- sample.TokenV1 %>% ngram.Function(n = 2) %>% dfm.Function() %>% textplot_wordcloud(max.words = 50, 
    colors = brewer.pal(8, "Dark2"))

# Create wordcloud of trigram
triCloud <- sample.TokenV1 %>% ngram.Function(n = 3) %>% dfm.Function() %>% textplot_wordcloud(max.words = 35, 
    colors = brewer.pal(8, "Dark2"))

# Create wordcloud of quadgram
quadCloud <- sample.TokenV1 %>% ngram.Function(n = 4) %>% dfm.Function() %>% textplot_wordcloud(max.words = 20, 
    colors = brewer.pal(8, "Dark2"))

## Save the corpus, tokens and DFM objects for the next step
saveRDS(raw.Corpus, file = "./data/rawCorpus.rds")
saveRDS(sample.Corpus, file = "./data/sampleCorpus.rds")
saveRDS(sample.Token, file = "./data/sampleToken.rds")
saveRDS(sample.TokenV1, file = "./data/sampleTokenV1.rds")
saveRDS(sample.Dfm, file = "./data/sampleDfm.rds")

References:

Natural Language Processing Stanford Uni by Jurafsky & Manning - https://www.youtube.com/watch?v=oWsMIW-5xUc&list=PLLssT5z_DsK8HbD2sPcUIDfQ7zmBarMYv

N-Gram Models by Jurafsky & Martin - https://web.stanford.edu/~jurafsky/slp3/3.pdf

Text Analysis in R by Welbers, Atteveldt & Benoit - https://kenbenoit.net/pdfs/text_analysis_in_R.pdf

Basic Text Analysis in R by Bail - https://compsocialscience.github.io/summer-institute/2018/materials/day3-text-analysis/basic-text-analysis/rmarkdown/Basic_Text_Analysis_in_R.html

Introduction to Text Analytics in R by Langer - https://www.youtube.com/playlist?list=PL8eNk_zTBST8olxIRFoo0YeXxEOkYdoxi

Cran Task View: Natural Language Processing by Fridolin Wild - https://cran.r-project.org/web/views/NaturalLanguageProcessing.html

Text Mining Infrastructure in R by Feinerer, Hornik & Meyer - https://www.researchgate.net/publication/26539008_Text_Mining_Infrastructure_in_R

Guide to Ngram Package by Schmidt- https://cran.r-project.org/web/packages/ngram/vignettes/ngram-guide.pdf

Text Mining in R: A Tidy Approach, by Julia Silge and David Robinson - https://www.tidytextmining.com/index.html

Text Mining and Word Cloud Fundamentals in R - http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know

textstat_frequency: Tabulate feature frequencies - https://rdrr.io/cran/quanteda/man/textstat_frequency.html

Predicting the Next Word: Back-Off Language Modeling by Masse - https://medium.com/@davidmasse8/predicting-the-next-word-back-off-language-modeling-8db607444ba9

Natural language processing: What would Shakespeare say? by Ganesh - https://www.r-bloggers.com/natural-language-processing-what-would-shakespeare-say/

Natural Language processing in Python by Zhao - https://youtu.be/xvqsFTUsOmc

How to Generate Word Clouds in R by Van den Rul - https://towardsdatascience.com/create-a-word-cloud-with-r-bde3e7422e8a


The system platform specification used:

Spec      Description
OS        Windows 10 Pro - 64 bit
CPU       AMD Ryzen 5 - 3400G (4 cores & 8 threads)
RAM       16GB DDR4 3000MHz
Storage   500GB SSD - M.2 NVMe (PCIe)
Tool      RStudio