The capstone project of the Coursera Data Science Specialization deals with NLP (Natural Language Processing) and proposes to create an application from scratch that takes user text input and predicts the next word in the sentence, similar to the SwiftKey technology.
Goals of this milestone report:
Load the data, produce basic summary statistics about the data sets, highlight interesting findings, and outline the plan for creating the prediction algorithm and Shiny app.
library(tm)
library(dplyr)
library(ggplot2)
library(stringi)
library(stringr)
library(tokenizers)
library(tidytext)
library(knitr)
The data can be found here. We assume that the user downloaded and unzipped the data, and placed it into the working directory, retaining the folder structure.
Datasets are available in English, Finnish, German, and Russian.
We use only English data for this exploratory phase, using the twitter, blog and news datasets.
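For reproducibility, a small helper can fetch and unzip the archive if it is not already present. This is only a sketch: the URL below is the Coursera-SwiftKey link usually referenced for this capstone, an assumption to adjust if your link differs.
#optional: download and unzip the data if the final/ folder is missing (sketch)
if(!dir.exists("final")){
  url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
  download.file(url, "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip") #recreates the final/en_US/... folder structure
}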
Load the data.
#loading twitter and blogs data
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul=TRUE)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
#loading news dataset using "rb" (read binary) attribute
con_news <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con_news, encoding="UTF-8")
close(con_news); rm(con_news)
Tokenize per sentence
The purpose of the application is to predict the next word within a sentence, not how the next sentence will start. Therefore the base unit for the examples to train the model on should be sentences and not full blogs or news articles.
I use the “tokenizers” package to divide the text into sentences.
#tokenize per sentence as a first step
blogs <- unlist(tokenize_sentences(blogs))
news <- unlist(tokenize_sentences(news))
twitter <- unlist(tokenize_sentences(twitter))
Perform basic word and line count summary.
#file sizes
blogsSize <- paste(round(file.info("final/en_US/en_US.blogs.txt")$size /2^20, 2), "MB")
newsSize <- paste(round(file.info("final/en_US/en_US.news.txt")$size /2^20, 2), "MB")
twitterSize <- paste(round(file.info("final/en_US/en_US.twitter.txt")$size /2^20, 2), "MB")
totalSize <- paste(round((file.info("final/en_US/en_US.blogs.txt")$size + file.info("final/en_US/en_US.news.txt")$size + file.info("final/en_US/en_US.twitter.txt")$size) /2^20, 2), "MB")
#lines in each file
blogsLength <- length(blogs)
newsLength <- length(news)
twitterLength <- length(twitter)
totalLength <- blogsLength + newsLength + twitterLength
#word count per file (from stringi package)
blogsWC <- stri_stats_latex(blogs)[[4]]
newsWC <- stri_stats_latex(news)[[4]]
twitterWC <- stri_stats_latex(twitter)[[4]]
totalWC <- blogsWC + newsWC + twitterWC
#average words per sentence
blogsAveWords <- ceiling(blogsWC/blogsLength)
newsAveWords <- ceiling(newsWC/newsLength)
twitterAveWords <- ceiling(twitterWC/twitterLength)
totalAveWords <- ceiling(totalWC/totalLength)
#summarize in a table
summary <- data.frame(File_Size = c(blogsSize, newsSize, twitterSize, totalSize), Lines = c(blogsLength, newsLength, twitterLength, totalLength), Total_Words = c(blogsWC, newsWC, twitterWC, totalWC), Mean_Words_perLine = c(blogsAveWords, newsAveWords, twitterAveWords, totalAveWords), row.names = c("blogs", "news", "twitter", "total"))
#Format the numbers
summary$Lines <- formatC(summary$Lines, big.mark = ",", format = "f", digits = 0)
summary$Total_Words <- formatC(summary$Total_Words, big.mark = ",", format = "f", digits = 0)
#Show in a table
kable(summary)
| | File_Size | Lines | Total_Words | Mean_Words_perLine |
|---|---|---|---|---|
| blogs | 200.42 MB | 2,380,481 | 37,777,303 | 16 |
| news | 196.28 MB | 2,025,776 | 34,550,069 | 18 |
| twitter | 159.36 MB | 3,780,376 | 30,476,074 | 9 |
| total | 556.07 MB | 8,186,633 | 102,803,446 | 13 |
The average word count per line is fairly similar for blogs and news, and noticeably lower for twitter.
The main takeaway is the large size of the dataset: 102,803,446 words, 8,186,633 lines of text and a total size of 556.07 MB. To keep the exploration phase from becoming too computationally intensive, we choose a sample of 10% of the total lines.
Sampling 10% of the lines of each dataset
#remove unnecessary variables, we keep only blogs, news and twitter.
rm(list=setdiff(ls(), c("blogs", "news", "twitter")))
#create samples per source, 10% of the lines of each dataset
set.seed(23456)
blogsSample <- blogs[sample(1:length(blogs),length(blogs)/10)]
#convert character vector from UTF-8 to ASCII format, this command is needed for the tm package to work properly
blogsSample <- iconv(blogsSample,'UTF-8', 'ASCII', "byte")
newsSample <- news[sample(1:length(news),length(news)/10)]
newsSample <- iconv(newsSample,'UTF-8', 'ASCII', "byte")
twitterSample <- twitter[sample(1:length(twitter),length(twitter)/10)]
twitterSample <- iconv(twitterSample,'UTF-8', 'ASCII', "byte")
#remove the source variables
rm(blogs, news, twitter)
#merge all samples into one data file
sampleData <- c(blogsSample, newsSample, twitterSample)
#save the sampleData file
writeLines(sampleData, "sampleData.txt")
#remove unnecessary objects
rm(blogsSample, newsSample, twitterSample)
The tm package offers the following functions to process NLP tasks:
- Corpus() converts the dataset into corpora, collections of documents containing natural language text.
- tm_map() and content_transformer() create and apply functions that modify the content of the corpus for text cleaning tasks (a toy example follows below).
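As a quick, self-contained illustration of this pattern (a toy example, separate from the actual pipeline below):
#toy example of the Corpus / tm_map / content_transformer pattern
toy <- VCorpus(VectorSource(c("Hello WORLD!", "Visit http://example.com now")))
toy <- tm_map(toy, content_transformer(tolower))
toy <- tm_map(toy, content_transformer(function(x) gsub("http\\S*", " ", x)))
as.character(toy[[1]]) #"hello world!"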
We will perform cleaning operations in the following order:
- lowercase the whole text (we want to avoid lower- and uppercase variations of the same word in order to save space)
- remove urls (by removing every string that starts with http… and www…, until the next white space)
- remove profanity: we don’t want to suggest any offensive words, so we remove profanity using the bannedwordlist.com “swearWords.txt” file.
- remove numbers, since we want to predict text and not numbers. Instead of the built-in removeNumbers() function, we use a custom function that removes the entire word containing a number so that we don’t leave behind meaningless characters.
- remove special characters and punctuation, keep apostrophes: we don’t use the built-in functions for this, but simply remove everything that is not a character ([^a-zA-Z]) or that is not an apostrophe.
- collapse extra blanks between words: our transformations may have left more than one space between words; we remove them as a last step.
We decided not to run some text cleaning tasks that are usual to NLP:
- Apostrophes: we decided to keep them since we don’t want to confuse “it’s” and “its”, or “I’ll” and “ill”.
- Stemming, meaning converting words to their root form (learning, learner, learned, … –> “learn”): we want to predict the next words for this project and we don’t want to predict only the root of the word.
- Stopwords: these are articles or basic verbs that are very common in the language: a, the, she, he, you, … In NLP they are often considered to hold poor predictive value since they are overwhelmingly present and carry little semantic interest, but for predicting the next word, considering strings of 2, 3 or more words, they regain all their importance (see the quick illustration below).
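A quick illustration of why stopwords are kept (a toy example using tm’s removeWords() and its default English stopword list):
#removing stopwords destroys exactly the word sequences we want to predict
removeWords("one of the best days", stopwords("en"))
#returns something like "one   best days": the sequence "of the" can no longer be learned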
#converting sample dataset into a corpus
sampleCorpus <- VCorpus(VectorSource(sampleData))
#1. lowercase all text
sampleCorpus <- tm_map(sampleCorpus, content_transformer(tolower))
#2. create replacement function, replace a pattern with a space
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
#3. remove URLS http first, then www, until the next space
sampleCorpus <- tm_map(sampleCorpus, toSpace, "http\\S*")
sampleCorpus <- tm_map(sampleCorpus, toSpace, "www\\S*")
#4. remove bad words (I needed to edit the file a little bit)
if(!file.exists("swearWords.txt")){
download.file("http://www.bannedwordlist.com/lists/swearWords.txt", "swearWords.txt", mode="wb")
}
swearWords <- read.table(file = "swearWords.txt",header = F,as.is = T)
swearWords <- data.frame(swearWords[nchar(swearWords[,1]) >1, 1])
swearWords <- data.frame(swearWords[-c(12, 13, 42, 43, 44, 49, 50, 57, 66, 68),])
swearWords <- data.frame(c(as.character(swearWords[,1]), c("fucked", "fucking", "fuckin", "fucker", "fuckers")))
sampleCorpus <- tm_map(sampleCorpus, removeWords, swearWords[,1])
#5. Remove words that contain numbers, not just the numbers since it may leave letters without any meaning
sampleCorpus <- tm_map(sampleCorpus, toSpace, "\\w*[0-9]+\\w*\\s*")
#6. Remove everything that is not a character (punctuation, special characters, numbers), except for apostrophe (I want to distinguish "it's" from "its", or "I'll" from "ill")
sampleCorpus <- tm_map(sampleCorpus, toSpace, "[^a-zA-Z']")
#7. remove extra white space
sampleCorpus <- tm_map(sampleCorpus, stripWhitespace)
#remove unnecessary variables
rm(sampleData, swearWords, toSpace)
save(sampleCorpus, file = "sampleCorpus.Rda")
Now we can do a quick check: did all our transformations produce a sensible result? We take a few lines.
lapply(sampleCorpus[20:23], as.character)
## $`20`
## [1] "gossip girl"
##
## $`21`
## [1] "i apologize to all of my readers "
##
## $`22`
## [1] "so the first thing you need to do when you get your baby home is take a deep cleansing breath "
##
## $`23`
## [1] "out of his experience working with older folks he has gained interesting insights regarding aging today "
We create unigrams (each word isolated, showing their frequency across all lines or sentences of the corpus), bigrams (a sequence of two consecutive words), and trigrams. We will stop at the trigrams although additional n-grams will be considered in the second phase of the project.
The RWeka package is often used for this, but it requires the extra step of creating a document-term frequency matrix (each row a line of text, each column an ngram), which is then reduced to an ngram frequency data frame (each ngram with its total count across all documents in the corpus). Since we do not need the document-term matrix itself and this extra step costs a lot of memory, we use the tidytext package instead, which lets us create the frequency table directly.
#transform corpus into a data frame that the tidytext package can understand
#keep the text as character (not factor) so unnest_tokens() can tokenize it
sample_df <- data.frame(unlist(lapply(sampleCorpus, as.character)), stringsAsFactors = FALSE)
#rename the column
names(sample_df) <- "texts"
#unigrams
#create unigrams with the unnest_tokens() function
unigrams <- sample_df %>%
unnest_tokens(ngrams, texts, token = "ngrams", n = 1)
#calculate frequency for each unigram, and sort the data frame in descending order
unigrams <- unigrams %>%
count(ngrams, sort = TRUE)
#bigrams
bigrams <- sample_df %>%
unnest_tokens(ngrams, texts, token = "ngrams", n = 2)
bigrams <- bigrams %>%
count(ngrams, sort = TRUE)
#trigrams
trigrams <- sample_df %>%
unnest_tokens(ngrams, texts, token = "ngrams", n = 3)
trigrams <- trigrams %>%
count(ngrams, sort = TRUE)
Saving our three datasets:
save(unigrams, file = "unigrams.Rda")
save(bigrams, file = "bigrams.Rda")
save(trigrams, file = "trigrams.Rda")
Now that we have our three objects to work with, we can explore the datasets.
First we can show a frequency plot of the top 30 words for 1-grams, 2-grams, 3-grams:
ggplot(unigrams[1:30,], aes(x=reorder(ngrams, n), y=n)) + coord_flip() +
labs(x = "", y = "Frequency") + ggtitle("unigrams, top 30, by frequency") +
geom_bar(stat = "identity", fill = "lightblue", colour = "darkgrey")
ggplot(bigrams[1:30,], aes(x=reorder(ngrams, n), y=n)) + coord_flip() +
labs(x = "", y = "Frequency") + ggtitle("bigrams, top 30, by frequency") +
geom_bar(stat = "identity", fill = "lightgreen", colour = "darkgrey")
ggplot(trigrams[1:30,], aes(x=reorder(ngrams, n), y=n)) + coord_flip() +
labs(x = "", y = "Frequency") + ggtitle("trigrams, top 30, by frequency") +
geom_bar(stat = "identity", fill = "beige", colour = "darkgrey")
The most frequent words and expressions hold no surprises and read much like a stopword list, since we did not remove stopwords during the cleanup. We can go a bit deeper into the list and see whether terms with lower frequencies also make sense.
kable(data.frame(Unigrams = unigrams[900:905,], Bigrams = bigrams[900:905,], Trigrams = trigrams[900:905,]))
| Unigrams.ngrams | Unigrams.n | Bigrams.ngrams | Bigrams.n | Trigrams.ngrams | Trigrams.n |
|---|---|---|---|---|---|
| paid | 1208 | which was | 698 | i can’t believe | 167 |
| opening | 1206 | amount of | 697 | i want a | 167 |
| we’ve | 1206 | expected to | 697 | is a little | 167 |
| gotta | 1204 | made it | 697 | it might be | 167 |
| plus | 1203 | and even | 696 | know what to | 167 |
| practice | 1199 | out a | 695 | the next few | 167 |
These seem to make sense as well.
We have now three datasets we can work with to build our predictive models.
We look first at a few summary elements:
#get file sizes
unigrams_size <- paste(round(file.info("unigrams.Rda")$size /2^20, 2), "MB")
bigrams_size <- paste(round(file.info("bigrams.Rda")$size /2^20, 2), "MB")
trigrams_size <- paste(round(file.info("trigrams.Rda")$size /2^20, 2), "MB")
#get qty ngrams
unigrams_length <- dim(unigrams)[1]
bigrams_length <- dim(bigrams)[1]
trigrams_length <- dim(trigrams)[1]
#collate info in data frame, rename rows, format numbers
summary <- data.frame(FileSize = c(unigrams_size, bigrams_size, trigrams_size), QtyNgrams = c(unigrams_length, bigrams_length, trigrams_length))
rownames(summary) <- c("Unigrams", "Bigrams", "Trigrams")
summary$QtyNgrams <- formatC(summary$QtyNgrams, big.mark = ",", format = "f", digits = 0)
#Display in a table
kable(summary)
| | FileSize | QtyNgrams |
|---|---|---|
| Unigrams | 0.78 MB | 169,121 |
| Bigrams | 13.91 MB | 2,829,322 |
| Trigrams | 41.83 MB | 6,997,804 |
File size and quantity of ngrams increase dramatically from unigrams to bigrams, trigrams, …
We have a hunch that ngrams with low frequency (appearing only 1 or a few times) take a lot of space. We can look more into this, first by checking examples of ngrams with low frequencies.
head(unigrams[unigrams$n == 1,])
## # A tibble: 6 x 2
## ngrams n
## <chr> <int>
## 1 a'bloomin 1
## 2 a'brewing 1
## 3 a'comin 1
## 4 a'famosa 1
## 5 a'ight 1
## 6 a'sauras 1
head(bigrams[bigrams$n == 1,])
## # A tibble: 6 x 2
## ngrams n
## <chr> <int>
## 1 a'bloomin the 1
## 2 a'brewing this 1
## 3 a'comin cc 1
## 4 a'f doe 1
## 5 a'f endangering 1
## 6 a'f hey 1
These examples of low frequency ngrams show us expressions with little or no use for our end purpose. We will want to remove ngrams with low frequency but before doing so, we can explore the distribution of the ngram frequencies.
#frequency counts tables with 3 values:
#Frequency (number of times an ngram appears in the dataset)
#Count (how many ngrams correspond to this frequency)
#Perc (percentage of ngrams with a specific frequency over all ngrams/frequencies in the dataset)
unigrams_freqs <- data.frame(table(unigrams$n))
bigrams_freqs <- data.frame(table(bigrams$n))
trigrams_freqs <- data.frame(table(trigrams$n))
#create the Perc variable
unigrams_freqs$Perc <- unigrams_freqs$Freq/sum(unigrams_freqs$Freq)
bigrams_freqs$Perc <- bigrams_freqs$Freq/sum(bigrams_freqs$Freq)
trigrams_freqs$Perc <- trigrams_freqs$Freq/sum(trigrams_freqs$Freq)
#rename variables
names(unigrams_freqs) <- c("Frequency", "Count", "Perc")
names(bigrams_freqs) <- c("Frequency", "Count", "Perc")
names(trigrams_freqs) <- c("Frequency", "Count", "Perc")
#plot the distribution
ggplot(unigrams_freqs, aes(Perc)) +
geom_histogram(show.legend = FALSE) +
xlim(NA, 0.0009) +
ggtitle("Zipf Distribution of unigram frequencies")
Our ngram frequency tables follow long-tailed distributions, which corresponds to Zipf’s law, very common in large NLP corpora: the frequency with which a word appears is inversely proportional to its rank [1].
In other words, we end up with a sparse term frequency matrix, with a very large number of ngrams appearing only once or a few times. Not only do these low-frequency ngrams bring very little predictive value to the model, but they also dramatically increase the size of the datasets, negatively impacting the performance of our model.
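As a complementary check (a sketch using the unigrams table built above, which is already sorted by frequency), Zipf’s law can also be visualised as a rank-frequency plot on log-log axes, where the relationship shows up as a roughly straight line:
#rank vs frequency of unigrams on log-log axes (sketch)
zipf_check <- unigrams %>%
  mutate(rank = row_number())
ggplot(zipf_check, aes(x = rank, y = n)) +
  geom_line() +
  scale_x_log10() + scale_y_log10() +
  labs(x = "Rank (log scale)", y = "Frequency (log scale)") +
  ggtitle("Rank vs frequency of unigrams")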
We need to limit the size of our dataset while preserving sufficient predictive value. Before setting a frequency threshold, we take a look at the share of ngrams accounted for by low frequencies, limiting this view to frequencies from 1 to 6:
kable(head(unigrams_freqs))
| Frequency | Count | Perc |
|---|---|---|
| 1 | 86781 | 0.5131297 |
| 2 | 20574 | 0.1216525 |
| 3 | 10183 | 0.0602113 |
| 4 | 6418 | 0.0379492 |
| 5 | 4501 | 0.0266141 |
| 6 | 3372 | 0.0199384 |
kable(head(bigrams_freqs))
| Frequency | Count | Perc |
|---|---|---|
| 1 | 2121449 | 0.7498083 |
| 2 | 296394 | 0.1047580 |
| 3 | 117931 | 0.0416817 |
| 4 | 65144 | 0.0230246 |
| 5 | 40878 | 0.0144480 |
| 6 | 28498 | 0.0100724 |
kable(head(trigrams_freqs))
| Frequency | Count | Perc |
|---|---|---|
| 1 | 6217586 | 0.8885053 |
| 2 | 416366 | 0.0594995 |
| 3 | 134701 | 0.0192490 |
| 4 | 65178 | 0.0093141 |
| 5 | 38680 | 0.0055274 |
| 6 | 24762 | 0.0035385 |
This shows several things:
- how sparse the term frequency matrices are
- the higher the ngram orders (unigram, bigram, …), the more sparse the distributions are
- the exploration phase is performed on only 10% of the data; we can expect the sparsity to grow further when working on the full dataset and extending the process to fourgrams and fivegrams.
- by removing ngrams that appear 4 times or fewer, for instance, we would reduce the number of ngrams (and roughly the file size) by 73.29% for unigrams, 91.93% for bigrams and 97.66% for trigrams (the sketch below shows how these figures can be computed from the frequency tables).
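A small sketch of that computation, using the *_freqs tables built above (Frequency is a factor created by table(), hence the conversion):
#share of ngrams appearing k times or fewer (sketch)
cutoff_share <- function(freqs, k){
  sum(freqs$Perc[as.numeric(as.character(freqs$Frequency)) <= k])
}
sapply(list(unigrams = unigrams_freqs, bigrams = bigrams_freqs, trigrams = trigrams_freqs),
       cutoff_share, k = 4)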
I will reiterate this exploration phase when working on the full dataset in order to decide which frequency threshold should be chosen. The choice is fairly arbitrary though and consists of a tradeoff between reducing the size of the dataset (i.e. improving the performance/speed of the language model) and the expected loss of predictive value by trimming too many low frequency ngrams.
The final project will include the following files:
- 1_PredNext_Milestone_Report: this document
- 2_PredNext_Prepare_FullDataSet: the steps and code used to create the full dataset
- 3_PredNext_Language_Models: code for testing the selected models (a simplified sketch of the prediction logic is shown after this list)
- shiny_app: folder with the ui.R and server.R files
- The 5 data files: unigrams, bigrams, trigrams, fourgrams, fivegrams
- Pitch
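To make the plan concrete, here is a minimal, deliberately unoptimised sketch of the kind of lookup the language-model code could start from. It is not the final model: it simply backs off from the trigram table to the bigram table, relies on the ngrams/n columns created earlier, and the function name predict_next is a placeholder.
#illustrative next-word lookup with simple backoff (sketch, not the final model)
predict_next <- function(input, trigrams, bigrams, top = 3){
  words <- unlist(strsplit(tolower(input), "\\s+"))
  last2 <- paste(tail(words, 2), collapse = " ")
  last1 <- tail(words, 1)
  #trigrams whose first two words match the end of the input
  hits <- trigrams[startsWith(trigrams$ngrams, paste0(last2, " ")), ]
  if(nrow(hits) == 0){
    #back off to bigrams whose first word matches the last input word
    hits <- bigrams[startsWith(bigrams$ngrams, paste0(last1, " ")), ]
  }
  #the tables are sorted by frequency, so the first rows are the best guesses
  head(sapply(strsplit(hits$ngrams, " "), tail, 1), top)
}
#example: predict_next("thanks for the", trigrams, bigrams)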
On top of a reactive application (a minimal reactive skeleton is sketched after this list), I will endeavour to provide additional features such as:
- keyboard shortcuts to select a suggested word
- suggest the currently typed word
- user-based login allowing to prioritize the history of ngrams from a specific user in order to improve individual user experience over time
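A minimal sketch of the reactive skeleton for the Shiny app (widget ids such as "text" and "pred" are placeholders, and predict_next refers to the sketch above):
library(shiny)
#minimal reactive skeleton (sketch): type text, see suggested next words
ui <- fluidPage(
  textInput("text", "Type a sentence:"),
  textOutput("pred")
)
server <- function(input, output){
  output$pred <- renderText({
    paste(predict_next(input$text, trigrams, bigrams), collapse = " | ")
  })
}
#shinyApp(ui, server)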
“Speech and Language Processing”, Jurafsky & Martin, 28.08.2017, Chapter 4, https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf: the book as well as the course videos (https://www.youtube.com/watch?v=s3kKlUBa3b0) are a must-read/view.
“Text Mining with R”, Julia Silge and David Robinson, 20.01.2018, https://www.tidytextmining.com/tfidf.html#zipfs-law.
“Word Prediction Using Stupid Backoff With a 5-gram Language Model”, Phil Ferriere, April 2016, https://rpubs.com/pferriere/dscapreport: a great explanation of the stupid backoff model and Kneser-Ney smoothing.