Predicting Next Word - Milestone #1

This is the first milestone report for the Coursera Data Science Capstone, conducted by Johns Hopkins University in partnership with SwiftKey.

Loading Data

Of the four language corpora given to us, English is the only one I know, so that is the one I am going to use for now.

twitter <- readLines("en_US.twitter.txt.bz2")
blogs <- readLines("en_US.blogs.txt.bz2")
news <- readLines("en_US.news.txt.bz2")

Each file is close to 200MB in size. The twitter file contains around 2.3 million lines, the blogs file contains 0.9 million lines, and the news file contains 1 million lines. Analysing the whole corpus would require more computing power than most individuals have access to. I will therefore take a small random sample from each file, which will hopefully be representative of the whole.

Sampling and Pre-processing

set.seed(1234) #for reproducibility
sizeSample <- 0.05 #only taking 5% of the data.

# creating samples for each dataset

# sample of twitter
ts <- sample(length(twitter),length(twitter)*sizeSample)
twitSample <- twitter[ts]

# sample of news
ns <- sample(length(news),length(news)*sizeSample)
newsSample <- news[ns]

# sample of blogs
bs <- sample(length(blogs),length(blogs)*sizeSample)
blSample <- blogs[bs]
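
The three sampling steps above follow the same pattern, so they could be wrapped in a small helper. The following is only a sketch: sample_lines is a hypothetical name that does not appear in the original analysis, and the toy vector stands in for the real files.

```r
# Hypothetical helper: draw a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction) {
  n <- floor(length(lines) * fraction)   # number of lines to keep
  lines[sample(length(lines), n)]        # sample indices without replacement
}

set.seed(1234)                           # for reproducibility
toy <- paste("line", 1:100)              # toy stand-in for a real corpus file
toySample <- sample_lines(toy, 0.05)     # keeps 5% of the 100 lines
length(toySample)
```

Calling the helper once per file would replace the three index/subset pairs, at the cost of not keeping the sampled indices around.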

The data, as in most cases, requires cleaning. For instance, people use emoticons, which are not represented properly, and other characters (perhaps from foreign languages). All of these need to be removed.

# cleaning
twitter <- iconv(twitSample,to = "ASCII",sub = "")
blogs <- iconv(blSample,to = "ASCII",sub = "")
news <- iconv(newsSample,to = "ASCII",sub = "")

The iconv command converts text from one encoding to another (here, everything is converted to ASCII), and the sub argument defines what any character that cannot be represented in the new encoding is replaced with. With sub = "", I have essentially removed all non-ASCII characters from the three data sources.
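
As a small illustration of what this cleaning step does, here is a sketch on a toy string. Passing from = "UTF-8" explicitly is my assumption (the code above relies on the default), and the exact result can depend on the platform's iconv implementation.

```r
# A toy string containing an accented letter and a curly apostrophe.
x <- "caf\u00e9 isn\u2019t open"

# Drop every character that has no ASCII equivalent.
cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
cleaned

# Check that no non-ASCII characters remain.
!grepl("[^\\x01-\\x7F]", cleaned, perl = TRUE)
```

Note that removing a curly apostrophe this way also merges the two halves of a contraction, which is worth keeping in mind when reading the word counts later on.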

Finally, I write out these samples for future use.

# saving the cleaned samples for easier future reading
# (after the cleaning step, twitter, news and blogs hold the cleaned 5% samples)
writeLines(twitter,"./Sample 5%/twitter.txt")
writeLines(news,"./Sample 5%/news.txt")
writeLines(blogs,"./Sample 5%/blogs.txt")

Creating Corpus

Text mining in R can be done using a number of different libraries. The tm package is the most popular one, although it can be very slow. Searching the internet and the Coursera forums led me to quanteda, a fast and efficient package for text analysis that uses data.table and C++. If the last 9 courses have taught me anything, it’s that these two are always faster.

require(quanteda)
corpSource <- textfile("Sample 5%/*.txt")
sampleCorpus <- corpus(corpSource)
save(sampleCorpus,file = "combinedCorpusSample.Rdata") # saving for future use

Here’s a brief summary of the sample corpus.

require(pander)
# creating summary to hide messages
k <- summary(sampleCorpus)

## Corpus consisting of 3 documents.
## 
##         Text Types  Tokens Sentences
##    blogs.txt 83788 2145536    118560
##     news.txt 30441  320200     15789
##  twitter.txt 88282 1840334    187835
## 
## Source:  D:/RData/Coursera-SwiftKey/* on x86-64 by Areeb Khan
## Created: Sun Mar 20 21:24:29 2016
## Notes:

pander(k)

Text	Types	Tokens	Sentences
blogs.txt	83,788	2,145,536	118,560
news.txt	30,441	320,200	15,789
twitter.txt	88,282	1,840,334	187,835

Tokenisation

Now to tokenise the text, i.e., to separate the words.

tokensAll <- tokenize(x = toLower(sampleCorpus),
                      removePunct = TRUE,
                      removeTwitter = TRUE,
                      removeNumbers = TRUE,
                      removeHyphens = TRUE,
                      verbose = TRUE)

Numbers, punctuation, hyphens, and Twitter special characters (@ and #) were all removed; the corpus was converted to lowercase and then tokenised.
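
To make the effect of these options concrete, here is a rough base R approximation of the same steps. It is only a sketch under simplifying assumptions: tokenise_simple is a hypothetical helper, not quanteda's tokeniser, and it simply treats anything other than lowercase letters, apostrophes and spaces as a separator.

```r
# Hypothetical helper: a crude stand-in for the tokenisation step.
tokenise_simple <- function(text) {
  text <- tolower(text)                    # convert to lowercase
  text <- gsub("[^a-z' ]+", " ", text)     # drop digits, punctuation, @, #, hyphens
  tokens <- strsplit(text, " +")[[1]]      # split on runs of spaces
  tokens[nzchar(tokens)]                   # discard empty strings
}

tokenise_simple("Can't wait for 2016 -- follow @me #now!")
```

On real data the quanteda tokeniser handles many more cases (Unicode, URLs, sentence boundaries), so this is meant only to show the idea.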

pander(data.frame(Tokens = ntoken(tokensAll)))

	Tokens
blogs.txt	1,855,417
news.txt	266,086
twitter.txt	1,479,499

Profanity removal and Frequency Matrix

One of our requirements was to remove profanity. In its simplest form, this involves first creating a list of swear words, and then using that list to remove words from our corpus, or to replace them with some placeholder text.

Various sources online provide such lists. One such resource is this (a list of almost 1300 bad words); another is on Shutterstock. However, these contain a number of words that are not really swear words; for instance, the first list contains words like “abuse”, “violence”, “arab”, etc. One good list that I found was the one used by Google, which can be found here. It had the fewest incorrect words in it (I only found “God”).

profane <- readLines("Profanity/google bad words.txt")

# tokenising it since without it, the next step hangs up the computer for some reason
profanity <- tokenize(profane,
                      removePunct = TRUE,
                      removeSeparators = TRUE,
                      removeHyphens = TRUE,
                      simplify = TRUE)

Now we could either remove those words or replace them with placeholder text, something like "@#$%&!" (grawlix). However, since we are supposed to be predicting the next word, we would not want to recommend “Use a swear word of your choice please…”, which is why it is better to remove all swear words altogether.

newTokens <- removeFeatures(tokensAll,profanity)
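
The idea behind this removal can be sketched in base R as a simple filter against the list. The banned words below are mild, hypothetical stand-ins for a real profanity list, and the token vector is a toy example.

```r
# Hypothetical stand-ins for a real profanity list.
banned <- c("darn", "heck")

tokens <- c("well", "darn", "that", "was", "a", "heck", "of", "a", "game")

# Keep only the tokens that are not on the list.
cleanTokens <- tokens[!tokens %in% banned]
cleanTokens
```

Because the match is on whole tokens, words that merely contain a banned string are left alone, which avoids the over-matching that a plain substring search would cause.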

Let us now finally create a frequency matrix, which shows how many times a certain word appears.

dfm1 <- dfm(newTokens,
            stem = TRUE,
            verbose = TRUE)
dfm2 <- dfm(newTokens,
            stem = TRUE,
            ignoredFeatures = stopwords("english"),
            verbose = TRUE)

Both commands create a document-feature matrix (a frequency matrix) using the dfm command from the quanteda package. Neither matrix contains the swear words, since those were already removed from the tokens in the previous step. The ignoredFeatures argument is used to drop further words: the second matrix was created after removing stopwords (common words like “and”, “or”, “of”, “in”, “it”, etc., which don’t add much meaning to the text), to see which words are most frequent besides the most common words in the language. Stemming was used, which reduces different forms of a word to a common root. For example, “thank”, “thanks”, and “thanked” are all reduced to just “thank”.
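
For a single document, the counting that such a matrix performs can be sketched in base R. This is a toy illustration, not how quanteda builds its matrix, and the two-word stopword list is a hypothetical stand-in for stopwords("english").

```r
tokens <- c("the", "cat", "and", "the", "dog", "and", "the", "bird")

# Count each word and sort from most to least frequent.
freq <- sort(table(tokens), decreasing = TRUE)

# Hypothetical mini stopword list; the real one is far longer.
stopWords <- c("the", "and")
freqNoStop <- freq[!names(freq) %in% stopWords]

names(freq)[1]      # most frequent word overall
names(freqNoStop)   # words left after dropping stopwords
```

This mirrors the difference between the two matrices above: the first is dominated by function words, while the second shows what remains once they are filtered out.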

Analysis

The top 100 words can be easily visualised using a wordcloud, in which the size of the word represents its frequency.

require(RColorBrewer)
# wordcloud including common words
plot(dfm1,max.words = 100,
     colors = brewer.pal(6,"Dark2"),
     random.order = FALSE,
     scale = c(8,1))

# wordcloud excluding common words
plot(dfm2,max.words = 100,
     colors = brewer.pal(6,"Dark2"),
     random.order = FALSE,use.r.layout = TRUE)

Here are the top 20 most frequent words.

pander(t(data.frame(Freq = topfeatures(dfm1,20))),justify = "left",caption = "Including common words")

Including common words (continued below)
	the	to	and	i	in	it	you	is	that
Freq	155,118	100,103	83,367	75,752	54,271	49,467	43,216	41,652	41,164

	for	on	my	with	be	have	this	was	at	are	but
Freq	40,524	29,759	28,336	24,754	24,622	22,173	21,982	21,657	19,668	19,065	18,008

pander(t(data.frame(Freq = topfeatures(dfm2,20))),justify = "left",caption = "Excluding common words")

Excluding common words (continued below)
	just	get	like	one	will	go	time	can	love	day
Freq	12,877	12,609	12,365	11,852	11,439	11,062	10,358	10,078	9,635	9,322

	make	know	good	thank	now	see	work	new	year	think
Freq	8,341	8,085	7,789	7,534	7,406	6,826	6,785	6,756	6,692	6,552

n-gram models

Let us also create a 3-gram (trigram) model, which can later be used for prediction by applying the chain rule of probability. A trigram is a sequence of 3 words that appear together, for instance “I am going”. The n-gram step finds every sequence of n consecutive words that appears in our sample, and a frequency matrix is then created from those.
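
To show how trigram counts could feed into prediction, here is a base R sketch on a toy token sequence. The helper make_ngrams is hypothetical, and the simple count ratio used below (trigram count divided by the count of its first two words) is my assumption about how the probabilities might be estimated; the report itself does not fix a method yet.

```r
# Hypothetical helper: join every run of n consecutive tokens with "_".
make_ngrams <- function(tokens, n = 3) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens <- c("i", "want", "to", "go", "i", "want", "to", "sleep",
            "i", "want", "a", "nap")

triCounts <- table(make_ngrams(tokens, 3))
biCounts  <- table(make_ngrams(tokens, 2))

# Estimate P("to" | "i want") as count("i want to") / count("i want").
pTo <- as.numeric(triCounts["i_want_to"]) / as.numeric(biCounts["i_want"])
pTo
```

In this toy sequence “i want” occurs three times and is followed by “to” twice, so “to” would be the preferred suggestion after “i want”.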

Going step by step, rather than putting everything directly into the dfm command, really speeds up the process, which is why this approach (separately tokenising, removing profanity, and creating trigrams) was used.

trigrams <- ngrams(newTokens,n = 3)

dfm3 <- dfm(trigrams,
            stem = TRUE,
            verbose = TRUE)

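Let us now look at the top 20 trigrams with the highest frequency. Note that because stem = TRUE treats each trigram as a single feature, only the end of its last word is stemmed, which appears to be why some entries finish in fragments such as “th” (the), “b” (be) or “ar” (are).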

pander(data.frame(Freq = topfeatures(dfm3,20)),justify = "left")

	Freq
thanks_for_th	1164
going_to_b	749
i_want_to	686
looking_forward_to	528
thank_you_for	519
i_have_to	518
i_love_you	507
as_well_a	465
be_able_to	460
i_need_to	442
can’t_wait_to	427
for_the_follow	421
you_want_to	407
in_the_world	394
you_have_to	385
is_going_to	372
i_don’t_know	364
if_you_ar	360
the_fact_that	356
i_think_i	350

Here are the same frequencies shown as a bar plot.

require(ggplot2)
temp <- topfeatures(dfm3,20)
dat <- data.frame(freq = temp,gram = names(temp))

ggplot(dat,aes(gram,freq)) + 
     geom_bar(stat = "identity",fill = "steelblue") + 
     scale_x_discrete(limits = dat$gram) + 
     coord_flip() + 
     labs(title = "Top trigrams in the sample dataset") +
     theme(axis.text.y = element_text(size = 13,face = "italic"),
           axis.text.x = element_text(size = 14),
           axis.title = element_text(size = 16),
           plot.title = element_text(size = 20))