This report presents the initial steps towards building a predictive algorithm in natural language processing (NLP) with the goal of predicting the next word in a sequence. We will describe the process of importing the data, an exploratory analysis and the concept and approach of the potential model.
In natural language processing, two of the most common tasks are tokenizing and forming n-grams. Tokenizing is the process of breaking a text into words, sentences or other meaningful units of analysis, which are called tokens. In this report we only explore the data so that we can form ideas on how to approach the predictive model, so tokenizing is the main step for now.
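To make the idea of tokens concrete, here is a minimal sketch of a word tokenizer in base R. The function name `tokenize_simple` and the rule of keeping only letters and apostrophes are assumptions made for illustration; the quanteda tokenizer used in this report is more thorough.

```r
# Minimal word tokenizer (illustrative only, not the quanteda implementation).
tokenize_simple <- function(text) {
  text <- tolower(text)
  # Replace anything that is not a letter, an apostrophe or a space with a space
  text <- gsub("[^a-z' ]+", " ", text)
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens[tokens != ""]  # drop empty strings left by leading spaces
}

tokens <- tokenize_simple("The quick brown fox, it's quick!")
print(tokens)
# "the" "quick" "brown" "fox" "it's" "quick"
```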
The method follows the usual sequence: import the data, clean and process it, then explore it. We will use a list of “bad words” to filter out the profanities when we clean the data set.
The data is provided by Coursera: HC Corpora
When coding with larger data sets, we must pay attention to memory use and to what is known in programming as “garbage collection” (GC). This is simply a way to reclaim memory taken up by objects that are no longer used in the code. GC can be triggered manually or left to run automatically. We use the pryr package to keep an eye on memory usage and the gc() function to reclaim memory after removing unused objects as we go. This technique will be useful when we build the model later.
| Dataset | Longest line (characters) | Lines | Words | Size (MB) |
|---|---|---|---|---|
| Blogs | 40,833 | 899,288 | 38,309,620 | 200.4242 |
| News | 11,384 | 1,010,242 | 35,624,454 | 196.2775 |
| Twitter | 140 | 2,360,148 | 31,003,501 | 159.3641 |
| All | 52,357 | 4,269,678 | 104,937,575 | 556.0658 |
We are dealing with three data sets of about 160 to 200MB varying in characteristics. In terms of number of words, they appear similar but because we have not done any cleaning yet, those estimates might be wrong. We will process the data into a corpus and we will see where that takes us. The objectives for now are to clean the data to prepare the sets for a model and explore them to get a sense of what we are dealing with.
The data is quite large for processing so we will take a sample for now to keep things under control. A sample of 40% should be manageable.
sampblogs <- sample(blogs, length(blogs) * 0.4)
sampnews <- sample(news, length(news) * 0.4)
samptwitter <- sample(twitter, length(twitter) * 0.4)
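Sampling is random, so results change from run to run unless a seed is fixed. The sketch below shows the same 40% sampling on a toy vector; the seed value and the `toy_lines` vector are assumptions made for illustration and are not part of the original analysis.

```r
# Fix the random seed so the sample is reproducible (hypothetical seed value).
set.seed(2018)

# Toy stand-in for one of the text vectors (blogs, news or twitter).
toy_lines <- paste("line", 1:10)

# Draw 40% of the lines without replacement.
samp <- sample(toy_lines, size = floor(length(toy_lines) * 0.4))
print(length(samp))
```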
Garbage collection:
## 1.11 GB
## -595 MB
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3470979 185.4 9968622 532.4 6929769 370.1
## Vcells 40298892 307.5 129044425 984.6 111256161 848.9
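The pattern behind this output is: measure, remove objects that are no longer needed, then call gc(). The sketch below reproduces that pattern in base R on a throwaway object; `big_vector` is a hypothetical stand-in, and `object.size()` is used here instead of the pryr functions so that the example has no dependencies.

```r
# Hypothetical stand-in for a large intermediate object.
big_vector <- rnorm(1e6)
size_before <- object.size(big_vector)
print(format(size_before, units = "MB"))

# Drop the object, then ask R to reclaim the memory right away.
rm(big_vector)
mem <- gc()        # matrix with one row for Ncells and one for Vcells
print(mem[, 1:2])  # cells currently in use and their size in Mb
```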
There are a couple of things we should do with the data before proceeding. The first is to remove what is not desired or not useful for a predictive model: numbers, punctuation, Twitter symbols, URLs and profanities.
We assemble everything in a corpus, which is just a collection of texts that we can start from for analysis. We will use the function dfm() in the quanteda package because it acts like a Swiss army knife, applying many of the cleaning steps we need in one call. Cleaning can also be applied step by step with individual quanteda functions such as char_tolower(), or with the removal options of the tokenizer (for example remove_punct = TRUE). Many of those options can be passed to dfm() as well.
We are also going to eliminate some unsavory words we do not wish to predict. We are using a list of words banned by Google: “Full list of bad words banned by Google”.
Top 10 features, blogs:
## the and to a of i in that is it
## 741893 435886 425810 359073 350074 309491 237953 184370 172864 161455
Top 10 features, news:
## the to and a of in for that is on
## 788575 360045 353825 349688 308196 269785 140346 139118 113773 106468
Top 10 features, Twitter:
## the to i a you and for in is of
## 375141 314554 288514 243667 218953 175374 154154 151109 144021 143353
Further garbage collection:
## 1.4 GB
## -450 MB
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3208324 171.4 9601876 512.8 12002346 641.0
## Vcells 95949879 732.1 257359320 1963.5 321596828 2453.6
We now have clean corpora that we can look at. We will extract the top features and visualize them. “Features” is a more generic term than “words”. Objects in a text can take many forms depending on how we process the text: words are only one type of object; there are also sentences, dictionary classes and more. We refer to features as a way to speak about whatever objects or units we are processing. We extracted the top 20 features of each of the three data sets, converted them to data frames and visualized them.
There is a very high similarity in the most frequent single words across the three data sets. Most of them are “stop words”, that is, words that carry little significance when queries are run in computing. Examples are “the”, “and”, “me”, and such. It is not yet clear whether they should be removed for the model we will develop later, where the objective is to predict the next word.
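The effect of stop words on a frequency ranking can be sketched in base R with a toy example. The token vector and the two-word `stop_list` below are assumptions made for illustration; the report itself relies on quanteda's topfeatures() and its English stop word list.

```r
# Toy token vector (hypothetical data).
toy_tokens <- c("the", "cat", "and", "the", "dog", "and", "the", "bird", "cat")

# Frequency table sorted from most to least frequent.
freq_all <- sort(table(toy_tokens), decreasing = TRUE)
top_all <- names(freq_all)[1]

# Same ranking after dropping a small, hypothetical stop word list.
stop_list <- c("the", "and")
content_tokens <- toy_tokens[!toy_tokens %in% stop_list]
freq_content <- sort(table(content_tokens), decreasing = TRUE)
top_content <- names(freq_content)[1]

print(top_all)      # "the"
print(top_content)  # "cat"
```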
## 955 MB
## -425 MB
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 3330236 177.9 9601876 512.8 12002346 641.0
## Vcells 42959552 327.8 164709964 1256.7 321596828 2453.6
Exploring the data sets helped to identify potential next steps. On cleaning and tokenizing: we will use a more structured cleaning process with iterative steps. Instead of removing the words in the profanity list, we will replace them with a token like BLEEP. This reduces the loss of context, which would otherwise lower predictive accuracy. Using the tokenize function in the quanteda package will allow for finer tuning. This will affect prediction, so we will test that part carefully.
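The replacement idea can be sketched as follows. The function `replace_profanity`, the example tokens and the one-word banned list are all hypothetical; the point is only that the position of the word in the sequence is kept, so the surrounding n-grams stay intact.

```r
# Replace banned tokens with a placeholder instead of deleting them.
replace_profanity <- function(tokens, banned, placeholder = "BLEEP") {
  tokens[tolower(tokens) %in% banned] <- placeholder
  tokens
}

example_tokens <- c("that", "darn", "cat", "is", "Darn", "loud")
banned_words <- c("darn")  # hypothetical stand-in for the profanity list

cleaned <- replace_profanity(example_tokens, banned_words)
print(cleaned)
# "that" "BLEEP" "cat" "is" "BLEEP" "loud"
```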
We will use an n-gram model, a probabilistic language model, to predict the next term in a sequence. An n-gram model predicts a word from the (n-1) words that precede it. It is a simple model, and it scales by increasing n.
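As a minimal sketch of this idea, the code below forms n-grams from a token vector and predicts the next word with an unsmoothed bigram model (n = 2), picking the most frequent follower of the previous word. The function names and the toy sentence are assumptions made for illustration, not the model that will eventually be built.

```r
# Form n-grams by sliding a window of length n over the tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Bigram prediction: most frequent word observed right after `prev`.
predict_next <- function(tokens, prev) {
  idx <- which(head(tokens, -1) == prev)
  if (length(idx) == 0) return(NA_character_)  # unseen history
  followers <- table(tokens[idx + 1])
  names(followers)[which.max(followers)]
}

toy <- c("i", "like", "tea", "i", "like", "coffee",
         "i", "like", "tea", "you", "like", "milk")

bigrams <- make_ngrams(toy, 2)
print(head(bigrams, 3))           # "i like" "like tea" "tea i"
print(predict_next(toy, "like"))  # "tea"
```

The NA returned for an unseen history is exactly the zero-frequency problem that smoothing and back-off are meant to address.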
Because the model is probabilistic, some form of smoothing will be used. We expect problems of balance between frequent and rare n-grams, and also with n-grams that never appear in the training data (the zero-frequency problem). We will test Good-Turing discounting, back-off, and Kneser-Ney models.
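To illustrate one of these options, here is a rough sketch of basic Good-Turing discounting in base R. It uses the textbook adjusted count c* = (c + 1) N(c+1) / N(c), where N(c) is the number of distinct n-grams seen exactly c times, and reserves N(1) / N of the probability mass for unseen n-grams. The function name, the toy counts and the fallback to the raw count when N(c+1) is zero are simplifying assumptions; a practical implementation would smooth the N(c) values first.

```r
good_turing <- function(counts) {
  tab <- table(counts)
  # Plain named vector: how many n-grams were seen exactly c times
  freq_of_freq <- setNames(as.vector(tab), names(tab))
  n_of <- function(c) {
    v <- unname(freq_of_freq[as.character(c)])
    v[is.na(v)] <- 0
    v
  }
  n_c  <- n_of(counts)
  n_c1 <- n_of(counts + 1)
  # Simplification: keep the raw count when no n-gram was seen c + 1 times
  adjusted <- ifelse(n_c1 > 0, (counts + 1) * n_c1 / n_c, counts)
  names(adjusted) <- names(counts)
  list(adjusted = adjusted, p_unseen = n_of(1) / sum(counts))
}

# Hypothetical unigram counts
toy_counts <- c(the = 3, cat = 1, dog = 1, sat = 2, mat = 1, on = 2)
gt <- good_turing(toy_counts)
print(gt$adjusted)
print(gt$p_unseen)  # 0.3
```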
library(tidyverse)
library(stringr)
library(quanteda)
library(kableExtra)
library(pryr)
library(plotly)
library(scales)
# Download and unzip the data only if it is not already on disk
if (!dir.exists("./data")) dir.create("./data")
fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("./data/data.zip")) {
  download.file(fileUrl, destfile = "./data/data.zip")
  unzip("./data/data.zip", exdir = "./data")
}
blogs <- read_lines("./data/final/en_US/en_US.blogs.txt")
news <- read_lines("./data/final/en_US/en_US.news.txt")
twitter <- read_lines("./data/final/en_US/en_US.twitter.txt")
dataAll <- c(blogs, news, twitter)
size_blogs <- file.info("./data/final/en_US/en_US.blogs.txt")$size / 1024^2
size_news <- file.info("./data/final/en_US/en_US.news.txt")$size / 1024^2
size_twitter <- file.info("./data/final/en_US/en_US.twitter.txt")$size / 1024^2
size_dataAll <- sum(size_blogs, size_news, size_twitter)
char_blogs <- max(nchar(blogs))
char_news <- max(nchar(news))
char_twitter <- max(nchar(twitter))
char_dataAll <- sum(char_blogs, char_news, char_twitter)
len_blogs <- length(blogs)
len_news <- length(news)
len_twitter <- length(twitter)
len_dataAll <- length(dataAll)
wcblogs <- sum(str_count(blogs, '\\w+'))
wcnews <- sum(str_count(news, '\\w+'))
wctwitter <- sum(str_count(twitter, '\\w+'))
wcdataAll <- sum(str_count(dataAll, '\\w+'))
datasets <- c("Blogs", "News", "Twitter", "All")
num_char <- c(char_blogs, char_news, char_twitter, char_dataAll)
length_dataAll <- c(len_blogs, len_news, len_twitter, len_dataAll)
wc <- c(wcblogs, wcnews, wctwitter, wcdataAll)
size <- c(size_blogs, size_news, size_twitter, size_dataAll)
# tibble() replaces the deprecated data_frame()
dataf <- tibble(datasets, num_char, length_dataAll, wc, size)
names(dataf) <- c("dataset", "Characters", "Lines", "Words", "Size")
knitr::kable(dataf, format = "html", format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
# Import profanities (I always wanted to say that!)
profanities <- read_lines('unsavouryWords.txt')
# Create corpus
corpblogs <- corpus(sampblogs)
corpnews <- corpus(sampnews)
corptwitter <- corpus(samptwitter)
# Transformations
corpblogs_clean <- dfm(corpblogs, remove_numbers = TRUE, remove_punct = TRUE, remove_twitter = TRUE, remove_url = TRUE, remove = profanities)
corpnews_clean <- dfm(corpnews, remove_numbers = TRUE, remove_punct = TRUE, remove_twitter = TRUE, remove_url = TRUE, remove = profanities)
corptwitter_clean <- dfm(corptwitter, remove_numbers = TRUE, remove_punct = TRUE, remove_twitter = TRUE, remove_url = TRUE, remove = profanities)
corpblogs_noSW <- dfm(corpblogs, remove_numbers = TRUE, remove_punct = TRUE, remove_twitter = TRUE, remove_url = TRUE, remove = c(profanities, stopwords("english")))
corpnews_noSW <- dfm(corpnews, remove_numbers = TRUE, remove_punct = TRUE, remove_twitter = TRUE, remove_url = TRUE, remove = c(profanities, stopwords("english")))
corptwitter_noSW <- dfm(corptwitter, remove_numbers = TRUE, remove_punct = TRUE, remove_twitter = TRUE, remove_url = TRUE, remove = c(profanities, stopwords("english")))
topfeatures(corpblogs_clean, 10)
topfeatures(corpnews_clean, 10)
topfeatures(corptwitter_clean, 10)
blogsfeat <- topfeatures(corpblogs_clean, 20)
newsfeat <- topfeatures(corpnews_clean, 20)
twitterfeat <- topfeatures(corptwitter_clean, 20)
# Create a data.frame for ggplot
blogsdf <- data.frame(features=names(blogsfeat), NumWords=blogsfeat)
newsdf <- data.frame(features=names(newsfeat), NumWords=newsfeat)
twitterdf <- data.frame(features=names(twitterfeat), NumWords=twitterfeat)
# Helper: interactive lollipop chart of the top features of one data set
plot_top_features <- function(df, panel_title) {
  note <- list(x = 0.5, y = 1.05, text = panel_title, showarrow = FALSE, xref = "paper", yref = "paper")
  gg <- ggplot(df, aes(x = NumWords, y = reorder(features, NumWords), text = comma(NumWords))) +
    geom_segment(aes(xend = 0, yend = features), color = "#ececec") +
    geom_point(color = "#3282bd", size = 2) +
    # NumWords is numeric, so the x axis needs a continuous scale
    scale_x_continuous(expand = c(0.03, 0.03), labels = comma) +
    labs(x = NULL, y = NULL) +
    theme_bw(base_family = "Helvetica") +
    theme(strip.background = element_blank(),
          panel.border = element_blank(),
          panel.grid.major = element_blank(),
          panel.grid.minor = element_blank(),
          axis.ticks.y = element_blank(),
          axis.text.x = element_text(size = 9),
          axis.text.y = element_text(size = 9)) +
    ggtitle("Top 20 features")
  ggplotly(gg, tooltip = "text") %>%
    layout(annotations = note)
}
# Top 20 features, stop words included
blogsgg <- plot_top_features(blogsdf, "Blogs")
newsgg <- plot_top_features(newsdf, "News")
twittergg <- plot_top_features(twitterdf, "Twitter")
subplot(blogsgg, newsgg, twittergg, titleX = TRUE, titleY = TRUE, margin = 0.05)
blogsfeat_noSW <- topfeatures(corpblogs_noSW, 20)
newsfeat_noSW <- topfeatures(corpnews_noSW, 20)
twitterfeat_noSW <- topfeatures(corptwitter_noSW, 20)
# Create a data.frame for ggplot
blogsdf_noSW <- data.frame(features=names(blogsfeat_noSW), NumWords=blogsfeat_noSW)
newsdf_noSW <- data.frame(features=names(newsfeat_noSW), NumWords=newsfeat_noSW)
twitterdf_noSW <- data.frame(features=names(twitterfeat_noSW), NumWords=twitterfeat_noSW)
# Top 20 features, stop words removed
blogsgg_noSW <- plot_top_features(blogsdf_noSW, "Blogs")
newsgg_noSW <- plot_top_features(newsdf_noSW, "News")
twittergg_noSW <- plot_top_features(twitterdf_noSW, "Twitter")
subplot(blogsgg_noSW, newsgg_noSW, twittergg_noSW, titleX = TRUE, titleY = TRUE, margin = 0.05)