Summary

This report presents the initial steps towards building a predictive algorithm in natural language processing (NLP) with the goal of predicting the next word in a sequence. We describe how the data is imported, present an exploratory analysis, and outline the concept and approach for the eventual model.

In natural language processing, two of the most common tasks are tokenizing and forming n-grams. Tokenizing is the process of breaking a text into words, sentences, or other meaningful units of analysis, which are called tokens. This report only explores the data, so that we can form ideas on how to approach the predictive model; tokenizing is the main step for now.
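
To make the two terms concrete, here is a minimal sketch in base R, assuming a toy sentence rather than the real corpora. The variable names (txt, tokens, bigrams) are illustrative only, and the splitting rule is a simplification of what a full tokenizer would do.

```r
# Toy input; the real corpora are far larger.
txt <- "The quick brown fox jumps over the lazy dog"

# Tokenize: lower-case, then split on anything that is not a letter or apostrophe.
tokens <- strsplit(tolower(txt), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]

# Form bigrams (n-grams with n = 2) by pairing each token with the one that follows it.
bigrams <- paste(head(tokens, -1), tail(tokens, -1))

print(tokens)
print(bigrams)
```

A sentence of nine tokens yields eight bigrams; the same pairing idea extends to longer n-grams.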

The method follows the expected sequence: import the data, clean and process it, then explore it. We will use a list of “bad words” to filter out profanities when we clean the data set.

Data

The data is provided by Coursera: HC Corpora

Memory and garbage collection

When coding with larger data sets, we must pay attention to memory use and to what is known in programming as “garbage collection”, or GC. This is simply a method to reclaim memory taken by objects that are no longer used in the code. GC can be triggered manually or left to run automatically. We use the pryr package to keep an eye on memory usage, remove unused objects as we go, and call the gc() function to reclaim memory after each clean-up. This technique will be useful when we build the model later.
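
The clean-up pattern can be sketched with base R alone. This is an illustration under assumptions, not the code used for the report: object.size() stands in for the pryr helpers, and the object name big is hypothetical.

```r
# A throwaway object standing in for a large intermediate result.
big <- numeric(1e6)

# object.size() reports the memory taken by a single object, in bytes.
size_mb <- as.numeric(object.size(big)) / 1024^2
print(round(size_mb, 1))

# Drop the object once it is no longer needed, then ask R to reclaim the memory.
rm(big)
invisible(gc())

# The object is gone from the current environment.
still_there <- exists("big", inherits = FALSE)
print(still_there)
```

R runs the garbage collector on its own as well; calling gc() by hand mainly helps after removing very large objects.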

Importing the data

Dataset   Longest line (chars)   Lines       Words         Size (MB)
Blogs     40,833                 899,288     38,309,620    200.4242
News      11,384                 1,010,242   35,624,454    196.2775
Twitter   140                    2,360,148   31,003,501    159.3641
All       40,833                 4,269,678   104,937,575   556.0658

We are dealing with three data sets of about 160 to 200 MB each, with varying characteristics. Their word counts appear similar, but because no cleaning has been done yet, these estimates may be off. We will process the data into a corpus and see where that takes us. The objectives for now are to clean the data to prepare the sets for a model and to explore them to get a sense of what we are dealing with.

Sampling

The data is quite large for processing so we will take a sample for now to keep things under control. A sample of 40% should be manageable.

sampblogs <- sample(blogs, length(blogs) * 0.4)
sampnews <- sample(news, length(news) * 0.4)
samptwitter <- sample(twitter, length(twitter) * 0.4)
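
Because sample() draws at random, the sample changes on every run unless a seed is set. The sketch below shows one way to make the 40% sample repeatable; the seed value and the vector lines_in are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical stand-in for one of the imported character vectors.
lines_in <- paste("line", 1:1000)

# Fixing the seed makes the same 40% sample come back on every run.
set.seed(2024)  # arbitrary seed, an assumption of this sketch
samp_a <- sample(lines_in, floor(length(lines_in) * 0.4))

set.seed(2024)
samp_b <- sample(lines_in, floor(length(lines_in) * 0.4))

print(length(samp_a))
print(identical(samp_a, samp_b))
```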

Garbage collection:

## 1.11 GB
## -595 MB
##            used  (Mb) gc trigger  (Mb)  max used  (Mb)
## Ncells  3470979 185.4    9968622 532.4   6929769 370.1
## Vcells 40298892 307.5  129044425 984.6 111256161 848.9

Cleaning and processing

There are several things we should do with the data before proceeding. The first is to remove what is not desired or not useful for a predictive model:

  • Transform the data to lower case so that matching is case insensitive
  • Remove special characters and some unsavory words
  • Remove numbers and punctuation
  • Remove extra white space

We assemble everything in a corpus, which is just a collection of texts that serves as the starting point for analysis. We will use the dfm() function in the quanteda package because it acts like a Swiss Army knife, applying many of the cleaning steps we need in one call. Cleaning can also be done with individual quanteda functions such as char_tolower(), and many of the same options, such as remove_punct and remove_numbers, can be passed to dfm() directly.
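
To show what these cleaning steps amount to, here is a rough base R sketch that does not use quanteda. The function clean_text, the sample strings and the one-word banned list are all hypothetical, and the regular expressions are simpler than what a dedicated text package applies.

```r
raw <- c("Visit http://example.com NOW!!", "I have 2   cats, and 10 darn dogs.")
banned <- c("darn")  # hypothetical stand-in for the profanity list

clean_text <- function(x, banned_words) {
  x <- tolower(x)                                   # make the text case insensitive
  x <- gsub("http[^[:space:]]*", " ", x)            # drop URLs
  x <- gsub("[[:digit:]]+", " ", x)                 # drop numbers
  x <- gsub("[[:punct:]]+", " ", x)                 # drop punctuation
  pattern <- paste0("\\b(", paste(banned_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)           # drop banned words
  x <- gsub("[[:space:]]+", " ", x)                 # collapse runs of white space
  trimws(x)
}

cleaned <- clean_text(raw, banned)
print(cleaned)
```

The order matters: URLs are removed before punctuation, otherwise the address would be broken into stray word fragments.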

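We are also going to eliminate some unsavory words we do not wish to predict, using a list of words banned by Google: “Full list of bad words banned by Google”. After cleaning, the ten most frequent features in the blogs, news and Twitter samples, in that order, are: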

##    the    and     to      a     of      i     in   that     is     it 
## 741893 435886 425810 359073 350074 309491 237953 184370 172864 161455
##    the     to    and      a     of     in    for   that     is     on 
## 788575 360045 353825 349688 308196 269785 140346 139118 113773 106468
##    the     to      i      a    you    and    for     in     is     of 
## 375141 314554 288514 243667 218953 175374 154154 151109 144021 143353

Further garbage collection:

## 1.4 GB
## -450 MB
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3208324 171.4    9601876  512.8  12002346  641.0
## Vcells 95949879 732.1  257359320 1963.5 321596828 2453.6

Exploratory analysis

We now have clean corpora that we can examine. We will extract the top features and visualize them. “Features” is a generic term: objects in a text can take many forms depending on how we process the text. Words are only one type of object; there are also sentences, dictionary classes and more. We refer to features as a way to speak about whatever objects or units we are processing. We extracted the top 20 features of each of the three data sets, then converted the results to a data frame for plotting.

There is a very high similarity in the occurrences of single words across the three sets. Most of the top features are “stop words”: words that carry little significance when queries are run in computing. Examples are “the”, “and”, “me”, and the like. It is not yet clear whether they should be removed for the model we will develop later, where the objective is to predict the next word.
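
The effect of stop words on a frequency ranking can be seen on a toy example. This is a sketch in base R; the token vector and the three-word stop list are invented for illustration and are much smaller than the English list shipped with quanteda.

```r
tokens <- c("the", "cat", "and", "the", "dog", "saw", "the", "cat")
stop_list <- c("the", "and", "me")  # tiny illustrative list, not quanteda's

# Frequency of every token, most frequent first.
freq_all <- sort(table(tokens), decreasing = TRUE)

# Same ranking after dropping the stop words.
freq_content <- sort(table(tokens[!tokens %in% stop_list]), decreasing = TRUE)

print(head(freq_all, 3))
print(head(freq_content, 3))
```

With stop words kept, “the” leads the ranking; once they are dropped, a content word takes its place.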

Further garbage collection:

## 955 MB
## -425 MB
##            used  (Mb) gc trigger   (Mb)  max used   (Mb)
## Ncells  3330236 177.9    9601876  512.8  12002346  641.0
## Vcells 42959552 327.8  164709964 1256.7 321596828 2453.6

Stop words removed

When stop words are removed, we see the expected difference in the features. A different cleaning approach should be used for the Twitter set because it has a high occurrence of tokens that are not real words, such as “rt” and “u”.

Predictive model

Exploring the data sets helped to identify potential next steps. For cleaning and tokenizing, we will use a more structured cleaning process with iterative steps. Instead of removing the words in the profanity list, we will replace them with a placeholder token such as BLEEP. This preserves the position of the removed word, reducing the loss of context that would otherwise hurt predictive accuracy. Using the tokenize function in the quanteda package will allow for finer tuning. Because these choices affect prediction, we will test that part carefully.
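
The difference between removing and masking a word can be sketched as follows. The banned list and the token vector are hypothetical, and this is base R rather than the quanteda workflow planned for the model.

```r
banned <- c("darn", "heck")  # hypothetical stand-in for the profanity list
tokens <- c("what", "the", "heck", "is", "that", "darn", "thing")

# Masking keeps every position in the sequence.
masked <- ifelse(tokens %in% banned, "BLEEP", tokens)

# Removal shortens the sequence and joins words that were never adjacent.
removed <- tokens[!tokens %in% banned]

print(paste(masked, collapse = " "))
print(paste(removed, collapse = " "))
```

After removal, “the” is followed directly by “is”, a pair that never occurred in the text; masking avoids creating such spurious n-grams.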

We will use an n-gram model, a probabilistic language model, to predict the next term in a sequence. An n-gram model predicts each word from the (n-1) words that precede it. It is a simple model, and it scales by increasing n.
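
As an illustration of the idea for n = 2, the sketch below counts bigrams in a toy sentence and predicts the most likely next word from the previous one. It is a minimal example under assumptions, not the model we will build: the sentence and the function predict_next are invented, and no smoothing is applied.

```r
tokens <- strsplit("i like green eggs and i like ham and i am sam", " ")[[1]]

# Pair each word with its successor and count the pairs.
prev_word <- head(tokens, -1)
next_word <- tail(tokens, -1)
bigram_counts <- table(prev_word, next_word)

# Maximum-likelihood estimate: P(next | prev) = count(prev, next) / count(prev, any).
predict_next <- function(word) {
  if (!word %in% rownames(bigram_counts)) return(NA_character_)
  counts <- bigram_counts[word, ]
  probs <- counts / sum(counts)
  names(probs)[which.max(probs)]
}

print(predict_next("i"))
print(predict_next("zebra"))
```

A word never seen in the training text returns NA here, which is exactly the gap that smoothing and back-off are meant to fill.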

Because the model is probabilistic, some form of smoothing will be needed. N-grams will be unevenly balanced in frequency, and some will be absent from the training data altogether (the zero-frequency problem). We will test Good-Turing discounting, back-off, and Kneser-Ney smoothing.
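
To give a feel for one of these options, here is a simplified sketch of Good-Turing discounting on invented counts. It uses the basic adjusted-count formula only; keeping the raw count where the next count class is empty is a shortcut of this sketch, and the result is not renormalized as a full implementation would require.

```r
# Toy n-gram counts (hypothetical).
counts <- c(a = 3, b = 2, c = 2, d = 1, e = 1, f = 1)

N <- sum(counts)
count_of_counts <- table(counts)  # N_c: how many n-grams were seen exactly c times

n_c <- function(k) {
  key <- as.character(k)
  if (key %in% names(count_of_counts)) as.numeric(count_of_counts[[key]]) else 0
}

# Good-Turing adjusted count: c* = (c + 1) * N_{c+1} / N_c.
# Where N_{c+1} is zero the raw count is kept, a simplification of this sketch.
adjusted <- sapply(counts, function(k) {
  if (n_c(k + 1) == 0) k else (k + 1) * n_c(k + 1) / n_c(k)
})

# Probability mass set aside for n-grams never seen in training.
p_unseen <- n_c(1) / N

print(adjusted)
print(p_unseen)
```

Counts of rare n-grams are discounted, and the mass taken from them is what allows unseen n-grams to receive a non-zero probability.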

Appendix

Needed libraries

library(tidyverse)
library(stringr)
library(quanteda)
library(kableExtra)
library(pryr)
library(plotly)
library(scales)

Importing the data

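# Create the data directory if it does not exist yet
if (!dir.exists("./data")) dir.create("./data")

fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# Download only once; mode = "wb" keeps the zip archive intact on Windows
if (!file.exists("./data/data.zip")) download.file(fileUrl, destfile = "./data/data.zip", mode = "wb")
unzip("./data/data.zip", exdir = "./data")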

blogs <- read_lines("./data/final/en_US/en_US.blogs.txt")
news <- read_lines("./data/final/en_US/en_US.news.txt")
twitter <- read_lines("./data/final/en_US/en_US.twitter.txt")
dataAll <- c(blogs, news, twitter)

size_blogs <- file.info("./data/final/en_US/en_US.blogs.txt")$size / 1024^2
size_news <- file.info("./data/final/en_US/en_US.news.txt")$size / 1024^2
size_twitter <- file.info("./data/final/en_US/en_US.twitter.txt")$size / 1024^2
size_dataAll <- sum(size_blogs, size_news, size_twitter)

# Length, in characters, of the longest line in each data set
char_blogs <- max(nchar(blogs))
char_news <- max(nchar(news))
char_twitter <- max(nchar(twitter))
char_dataAll <- max(char_blogs, char_news, char_twitter)

len_blogs <- length(blogs)
len_news <- length(news)
len_twitter <- length(twitter)
len_dataAll <- length(dataAll)

wcblogs <- sum(str_count(blogs, '\\w+'))
wcnews <- sum(str_count(news, '\\w+'))
wctwitter <- sum(str_count(twitter, '\\w+'))
wcdataAll <- sum(str_count(dataAll, '\\w+'))

datasets <- c("Blogs", "News", "Twitter", "All")
num_char <- c(char_blogs, char_news, char_twitter, char_dataAll)
length_dataAll <- c(len_blogs, len_news, len_twitter, len_dataAll)
wc <- c(wcblogs, wcnews, wctwitter, wcdataAll)
size <- c(size_blogs, size_news, size_twitter, size_dataAll)

dataf <- tibble(datasets, num_char, length_dataAll, wc, size)  # data_frame() is deprecated in favour of tibble()

names(dataf) <- c("dataset", "Characters", "Lines", "Words", "Size")
knitr::kable(dataf, format = "html", format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Cleaning and processing

# Import profanities (I always wanted to say that!)
profanities <- read_lines('unsavouryWords.txt')

# Create corpus
corpblogs <-  corpus(sampblogs)
corpnews <-  corpus(sampnews)
corptwitter <-  corpus(samptwitter)

# Transformations: the same cleaning options are applied to every corpus
clean_dfm <- function(corp, drop) {
  dfm(corp, remove_numbers = TRUE, remove_punct = TRUE, remove_twitter = TRUE, remove_url = TRUE, remove = drop)
}

corpblogs_clean <- clean_dfm(corpblogs, profanities)
corpnews_clean <- clean_dfm(corpnews, profanities)
corptwitter_clean <- clean_dfm(corptwitter, profanities)

# Same cleaning, with English stop words dropped as well
no_sw <- c(profanities, stopwords("english"))
corpblogs_noSW <- clean_dfm(corpblogs, no_sw)
corpnews_noSW <- clean_dfm(corpnews, no_sw)
corptwitter_noSW <- clean_dfm(corptwitter, no_sw)

topfeatures(corpblogs_clean, 10)
topfeatures(corpnews_clean, 10)
topfeatures(corptwitter_clean, 10)

Exploratory analysis

blogsfeat <- topfeatures(corpblogs_clean, 20)
newsfeat <- topfeatures(corpnews_clean, 20)
twitterfeat <- topfeatures(corptwitter_clean, 20)

# Create a data.frame for ggplot
blogsdf <- data.frame(features=names(blogsfeat), NumWords=blogsfeat)
newsdf <- data.frame(features=names(newsfeat), NumWords=newsfeat)
twitterdf <- data.frame(features=names(twitterfeat), NumWords=twitterfeat)

a <- list(x = 0.5 , y = 1.05, text = "Blogs", showarrow = F, xref='paper', yref='paper')
b <- list(x = 0.5 , y = 1.05, text = "News", showarrow = F, xref='paper', yref='paper')
c <- list(x = 0.5 , y = 1.05, text = "Twitter", showarrow = F, xref='paper', yref='paper')

gg <- ggplot(blogsdf, aes(x=NumWords, y=reorder(features, NumWords), text = comma(NumWords)))
gg <- gg + geom_segment(aes(xend = 0, yend=features), color="#ececec")
gg <- gg + geom_point(color = "#3282bd", size = 2)
gg <- gg + scale_x_continuous(expand = c(0.03,0.03), labels = comma)  # counts are numeric, so the x axis is continuous
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme_bw(base_family = "Helvetica")
gg <- gg + theme(panel.border = element_blank())
gg <- gg + theme(panel.grid.major = element_blank())
gg <- gg + theme(panel.grid.minor = element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(size=9))
gg <- gg + theme(axis.text.y=element_text(size=9))
gg <- gg + ggtitle("Top 20 features")
blogsgg <- ggplotly(gg, tooltip = "text") %>%
  layout(annotations = a)

gg <- ggplot(newsdf, aes(x=NumWords, y=reorder(features, NumWords), text = comma(NumWords)))
gg <- gg + geom_segment(aes(xend = 0, yend=features), color="#ececec")
gg <- gg + geom_point(color = "#3282bd", size = 2)
gg <- gg + scale_x_continuous(expand = c(0.03,0.03), labels = comma)  # counts are numeric, so the x axis is continuous
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme_bw(base_family = "Helvetica")
gg <- gg + theme(panel.border = element_blank())
gg <- gg + theme(panel.grid.major = element_blank())
gg <- gg + theme(panel.grid.minor = element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(size=9))
gg <- gg + theme(axis.text.y=element_text(size=9))
gg <- gg + ggtitle("Top 20 features")
newsgg <- ggplotly(gg, tooltip = "text") %>%
  layout(annotations = b)

gg <- ggplot(twitterdf, aes(x=NumWords, y=reorder(features, NumWords), text = comma(NumWords)))
gg <- gg + geom_segment(aes(xend = 0, yend=features), color="#ececec")
gg <- gg + geom_point(color = "#3282bd", size = 2)
gg <- gg + scale_x_continuous(expand = c(0.03,0.03), labels = comma)  # counts are numeric, so the x axis is continuous
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme_bw(base_family = "Helvetica")
gg <- gg + theme(panel.border = element_blank())
gg <- gg + theme(panel.grid.major = element_blank())
gg <- gg + theme(panel.grid.minor = element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(size=9))
gg <- gg + theme(axis.text.y=element_text(size=9))
gg <- gg + ggtitle("Top 20 features")
twittergg <- ggplotly(gg, tooltip = "text") %>%
  layout(annotations = c)

subplot(blogsgg, newsgg, twittergg, titleX = TRUE, titleY = TRUE, margin = 0.05)

Stop words removed

blogsfeat_noSW <- topfeatures(corpblogs_noSW, 20)
newsfeat_noSW <- topfeatures(corpnews_noSW, 20)
twitterfeat_noSW <- topfeatures(corptwitter_noSW, 20)

# Create a data.frame for ggplot
blogsdf_noSW <- data.frame(features=names(blogsfeat_noSW), NumWords=blogsfeat_noSW)
newsdf_noSW <- data.frame(features=names(newsfeat_noSW), NumWords=newsfeat_noSW)
twitterdf_noSW <- data.frame(features=names(twitterfeat_noSW), NumWords=twitterfeat_noSW)

a <- list(x = 0.5 , y = 1.05, text = "Blogs", showarrow = F, xref='paper', yref='paper')
b <- list(x = 0.5 , y = 1.05, text = "News", showarrow = F, xref='paper', yref='paper')
c <- list(x = 0.5 , y = 1.05, text = "Twitter", showarrow = F, xref='paper', yref='paper')

gg <- ggplot(blogsdf_noSW, aes(x=NumWords, y=reorder(features, NumWords), text = comma(NumWords)))
gg <- gg + geom_segment(aes(xend = 0, yend=features), color="#ececec")
gg <- gg + geom_point(color = "#3282bd", size = 2)
gg <- gg + scale_x_continuous(expand = c(0.03,0.03), labels = comma)  # counts are numeric, so the x axis is continuous
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme_bw(base_family = "Helvetica")
gg <- gg + theme(panel.border = element_blank())
gg <- gg + theme(panel.grid.major = element_blank())
gg <- gg + theme(panel.grid.minor = element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(size=9))
gg <- gg + theme(axis.text.y=element_text(size=9))
gg <- gg + ggtitle("Top 20 features")
blogsgg_noSW <- ggplotly(gg, tooltip = "text") %>%
  layout(annotations = a)

gg <- ggplot(newsdf_noSW, aes(x=NumWords, y=reorder(features, NumWords), text = comma(NumWords)))
gg <- gg + geom_segment(aes(xend = 0, yend=features), color="#ececec")
gg <- gg + geom_point(color = "#3282bd", size = 2)
gg <- gg + scale_x_continuous(expand = c(0.03,0.03), labels = comma)  # counts are numeric, so the x axis is continuous
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme_bw(base_family = "Helvetica")
gg <- gg + theme(panel.border = element_blank())
gg <- gg + theme(panel.grid.major = element_blank())
gg <- gg + theme(panel.grid.minor = element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(size=9))
gg <- gg + theme(axis.text.y=element_text(size=9))
gg <- gg + ggtitle("Top 20 features")
newsgg_noSW <- ggplotly(gg, tooltip = "text") %>%
  layout(annotations = b)

gg <- ggplot(twitterdf_noSW, aes(x=NumWords, y=reorder(features, NumWords), text = comma(NumWords)))
gg <- gg + geom_segment(aes(xend = 0, yend=features), color="#ececec")
gg <- gg + geom_point(color = "#3282bd", size = 2)
gg <- gg + scale_x_continuous(expand = c(0.03,0.03), labels = comma)  # counts are numeric, so the x axis is continuous
gg <- gg + labs(x = NULL, y = NULL)
gg <- gg + theme(strip.background=element_blank())
gg <- gg + theme_bw(base_family = "Helvetica")
gg <- gg + theme(panel.border = element_blank())
gg <- gg + theme(panel.grid.major = element_blank())
gg <- gg + theme(panel.grid.minor = element_blank())
gg <- gg + theme(axis.ticks.y=element_blank())
gg <- gg + theme(axis.text.x=element_text(size=9))
gg <- gg + theme(axis.text.y=element_text(size=9))
gg <- gg + ggtitle("Top 20 features")
twittergg_noSW <- ggplotly(gg, tooltip = "text") %>%
  layout(annotations = c)

subplot(blogsgg_noSW, newsgg_noSW, twittergg_noSW, titleX = TRUE, titleY = TRUE, margin = 0.05)