This is the first milestone report for the Coursera Data Science Capstone, conducted by Johns Hopkins University in partnership with SwiftKey.

Loading Data

Of the four language corpora given to us, English is the only language I know, so that is the one I am going to use for now.

twitter <- readLines("en_US.twitter.txt.bz2")
blogs <- readLines("en_US.blogs.txt.bz2")
news <- readLines("en_US.news.txt.bz2")

Each file is close to 200MB in size. The twitter file contains around 2.3 million lines, the blogs file contains 0.9 million lines, and the news file contains 1 million lines. Analysing the whole corpus would require far more memory and computing power than most personal computers have. Therefore I will take a small random sample from each file, which will hopefully be representative of the whole.

Sampling and Pre-processing

set.seed(1234) #for reproducibility
sizeSample <- 0.05 #only taking 5% of the data.

# creating samples for each dataset

# sample of twitter
ts <- sample(length(twitter),length(twitter)*sizeSample)
twitSample <- twitter[ts]

# sample of news
ns <- sample(length(news),length(news)*sizeSample)
newsSample <- news[ns]

# sample of blogs
bs <- sample(length(blogs),length(blogs)*sizeSample)
blSample <- blogs[bs]
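
The three sampling steps above follow the same pattern, so they could be wrapped in a small helper. The following is only a minimal sketch: sampleLines is a hypothetical name that is not part of the original analysis, and the toy vector stands in for one of the real files.

```r
# Hypothetical helper: draw a random fraction of the lines in a character vector.
sampleLines <- function(lines, fraction) {
  n <- floor(length(lines) * fraction)   # number of lines to keep
  lines[sample(length(lines), n)]        # sample indices without replacement
}

set.seed(1234)                           # for reproducibility
toy <- paste("line", 1:200)              # stand-in for a real corpus file
toySample <- sampleLines(toy, 0.05)      # keep 5% of the lines
length(toySample)
```

With such a helper, each of the three samples would be a single call, for example sampleLines(twitter, sizeSample).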

The data, as in most cases, requires cleaning. For instance, people use emoticons, which are not represented properly, as well as other characters (perhaps foreign ones). All of these need to be cleaned up.

# cleaning
twitter <- iconv(twitSample,to = "ASCII",sub = "")
blogs <- iconv(blSample,to = "ASCII",sub = "")
news <- iconv(newsSample,to = "ASCII",sub = "")

The iconv command converts text from one encoding to another (here everything was converted to ASCII), and the sub argument defines what any character that cannot be represented in the new encoding is replaced with. Since sub is an empty string here, I have essentially removed all non-ASCII characters from the three data sources.
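
To see the effect on a small scale, here is a minimal sketch on a made-up string. It assumes the input is UTF-8 and states that explicitly through the from argument, which the calls above leave at its default.

```r
# Toy string containing an accented letter and an emoji (both non-ASCII).
x <- "caf\u00e9 \U0001F600 ok"

# Assumption: the input is UTF-8. Anything that cannot be represented in
# ASCII is replaced by sub = "", i.e. dropped.
cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")

# Check that every remaining character is plain ASCII (code point below 128).
all(utf8ToInt(cleaned) < 128)
```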

Finally, I write out the cleaned samples for future use.

# saving the cleaned samples for easier future reading
# (twitter, news and blogs now hold the ASCII-only samples created above)
writeLines(twitter,"./Sample 5%/twitter.txt")
writeLines(news,"./Sample 5%/news.txt")
writeLines(blogs,"./Sample 5%/blogs.txt")

Creating Corpus

Text mining in R can be done using a number of different libraries. The tm package is the most popular one, although it is also very slow. Searching the internet and the Coursera forums led me to the quanteda package, a fast and efficient library for text analysis that uses data.table and C++. If the last 9 courses have taught me anything, it’s that these two are always faster.

require(quanteda)
corpSource <- textfile("Sample 5%/*.txt")
sampleCorpus <- corpus(corpSource)
save(sampleCorpus,file = "combinedCorpusSample.Rdata") # saving for future use

Here’s a brief summary of the sample corpus.

require(pander)
# creating summary to hide messages
k <- summary(sampleCorpus)
## Corpus consisting of 3 documents.
## 
##         Text Types  Tokens Sentences
##    blogs.txt 83788 2145536    118560
##     news.txt 30441  320200     15789
##  twitter.txt 88282 1840334    187835
## 
## Source:  D:/RData/Coursera-SwiftKey/* on x86-64 by Areeb Khan
## Created: Sun Mar 20 21:24:29 2016
## Notes:
pander(k)
Text         Types   Tokens     Sentences
blogs.txt    83,788  2,145,536  118,560
news.txt     30,441  320,200    15,789
twitter.txt  88,282  1,840,334  187,835

Tokenisation

Now to tokenise the text, i.e., to separate the words.

tokensAll <- tokenize(x = toLower(sampleCorpus),
                      removePunct = TRUE,
                      removeTwitter = TRUE,
                      removeNumbers = TRUE,
                      removeHyphens = TRUE,
                      verbose = TRUE)

The corpus was converted to lower case and then tokenised, with numbers, punctuation, hyphens and Twitter special characters (@ and #) all removed.
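
The same steps can be sketched in base R to make clear what each option does. This is a simplified illustration, not quanteda's implementation: simpleTokenise is a hypothetical helper, it handles a single string, and it strips every digit rather than only stand-alone numbers.

```r
# Hypothetical base R tokeniser for one string.
simpleTokenise <- function(text) {
  text <- tolower(text)                       # convert to lower case
  text <- gsub("-", " ", text, fixed = TRUE)  # split hyphenated words
  text <- gsub("[0-9]+", " ", text)           # drop digits
  text <- gsub("[^a-z' ]", " ", text)         # drop punctuation, including @ and #
  tokens <- strsplit(text, " +")[[1]]         # split on runs of spaces
  tokens[tokens != ""]                        # discard empty strings
}

simpleTokenise("Can't wait for the well-known #rstats meetup in 2016, @you!")
# "can't" "wait" "for" "the" "well" "known" "rstats" "meetup" "in" "you"
```

Apostrophes inside words are kept so that contractions such as "can't" survive as single tokens.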

pander(data.frame(Tokens = ntoken(tokensAll)))
             Tokens
blogs.txt    1,855,417
news.txt     266,086
twitter.txt  1,479,499

Profanity removal and Frequency Matrix

One of our requirements was to remove profanity. In its simplest form, this involves first creating a list of swear words, and then using that list to remove words from our corpus, or to replace them with some placeholder text.

Various sources online provide such lists. For example, one resource is a list of almost 1,300 bad words, and another is published by Shutterstock. However, these contain a number of words that are not really swear words; for instance, the first list contains words like “abuse”, “violence” and “arab”. One good list that I found was the one used by Google. It had the fewest incorrect words in it (I only found “God”).

profane <- readLines("Profanity/google bad words.txt")

# tokenising it since without it, the next step hangs up the computer for some reason
profanity <- tokenize(profane,
                      removePunct = TRUE,
                      removeSeparators = TRUE,
                      removeHyphens = TRUE,
                      simplify = TRUE)

Now we could either remove those words or replace them with placeholder text, something like "@#$%&!" (a grawlix). However, since we are supposed to be predicting the next word, we would not want to recommend “Use a swear word of your choice please…”, which is why it is better to just remove all swear words altogether.

newTokens <- removeFeatures(tokensAll,profanity)
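
Conceptually, this removal is a set-membership filter on the tokens. A minimal base R sketch follows; the two-word list is a mild, made-up placeholder and not taken from the file used above.

```r
# Placeholder word list; the real analysis reads its list from a file.
banned <- c("darn", "heck")
tokens <- c("well", "darn", "that", "was", "a", "heck", "of", "a", "game")

# Keep only the tokens that do not appear in the banned list.
kept <- tokens[!tokens %in% banned]
kept
# "well" "that" "was" "a" "of" "a" "game"
```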

Let us now finally create a frequency matrix, which shows how many times a certain word appears.

dfm1 <- dfm(newTokens,
            stem = TRUE,
            verbose = TRUE)
dfm2 <- dfm(newTokens,
            stem = TRUE,
            ignoredFeatures = stopwords("english"),
            verbose = TRUE)

The dfm command from the quanteda package creates a document-feature matrix, which records how many times each word (feature) appears in each document. The swear words were already removed in the previous step, so the first matrix covers all the remaining words. The ignoredFeatures argument removes further words: the second matrix was created after removing stopwords (common words like “and”, “or”, “of”, “in”, “it”, etc., which don’t add much meaning to the text) to analyse which words are most frequent besides the most common words in the language. Stemming was also used, which reduces different forms of a word to a common root. For example, “follow”, “follows” and “following” are all reduced to just “follow”.
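
The idea behind such a matrix can be sketched with base R's table function on two toy documents. The documents and the two-word stopword list are made up for illustration, and no stemming is applied here.

```r
# Two toy "documents", already tokenised.
docs <- list(blogs = c("the", "cat", "sat"),
             news  = c("the", "dog", "sat", "on", "the", "mat"))

# One row per document, one column per word; each cell holds a count.
mat <- table(doc  = rep(names(docs), lengths(docs)),
             word = unlist(docs, use.names = FALSE))

# Total frequency of each word across documents, most frequent first.
totals <- sort(colSums(mat), decreasing = TRUE)

# Hypothetical, very short stopword list for illustration only.
stopWords <- c("the", "on")
totalsNoStop <- totals[!names(totals) %in% stopWords]
```

Here totals plays the role of topfeatures on the first matrix, and totalsNoStop the role of topfeatures on the second.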

Analysis

The top 100 words can be easily visualised using a wordcloud, in which the size of the word represents its frequency.

require(RColorBrewer)
# wordcloud including common words
plot(dfm1,max.words = 100,
     colors = brewer.pal(6,"Dark2"),
     random.order = FALSE,
     scale = c(8,1))

# wordcloud excluding common words
plot(dfm2,max.words = 100,
     colors = brewer.pal(6,"Dark2"),
     random.order = FALSE,use.r.layout = TRUE)

Here are the top 20 most frequent words.

pander(t(data.frame(Freq = topfeatures(dfm1,20))),justify = "left",caption = "Including common words")
Including common words (continued below)

      the      to       and      i        in       it       you      is       that
Freq  155,118  100,103  83,367   75,752   54,271   49,467   43,216   41,652   41,164

      for     on      my      with    be      have    this    was     at      are     but
Freq  40,524  29,759  28,336  24,754  24,622  22,173  21,982  21,657  19,668  19,065  18,008
pander(t(data.frame(Freq = topfeatures(dfm2,20))),justify = "left",caption = "Excluding common words")
Excluding common words (continued below)

      just    get     like    one     will    go      time    can     love    day
Freq  12,877  12,609  12,365  11,852  11,439  11,062  10,358  10,078  9,635   9,322

      make   know   good   thank  now    see    work   new    year   think
Freq  8,341  8,085  7,789  7,534  7,406  6,826  6,785  6,756  6,692  6,552

n-gram models

Let us also create a 3-gram (trigram) model, which can later be used for prediction by applying the chain rule of probability. A trigram is a sequence of 3 words that appear together, for instance “I am going”. The n-gram step finds every sequence of n consecutive words that appears in our sample, and a frequency matrix is then created from those sequences.
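
To illustrate how trigram counts could feed a prediction, here is a minimal base R sketch. It uses the standard maximum-likelihood estimate P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2); makeNgrams and nextWordProb are hypothetical helpers on toy data, and this is not necessarily the model the final predictor will use.

```r
# Hypothetical helper: join every run of n consecutive tokens with "_".
makeNgrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens <- c("i", "want", "to", "go", "and", "i", "want", "to", "stay",
            "but", "i", "want", "a", "break")

triCounts <- table(makeNgrams(tokens, 3))   # trigram frequencies
biCounts  <- table(makeNgrams(tokens, 2))   # bigram frequencies

# Maximum-likelihood estimate of P(w3 | w1, w2); 0 for unseen combinations.
nextWordProb <- function(w1, w2, w3) {
  tri <- paste(w1, w2, w3, sep = "_")
  bi  <- paste(w1, w2, sep = "_")
  if (!bi %in% names(biCounts) || !tri %in% names(triCounts)) return(0)
  as.numeric(triCounts[tri]) / as.numeric(biCounts[bi])
}

nextWordProb("i", "want", "to")   # 2 of the 3 "i want" bigrams continue with "to"
```

Returning 0 for unseen trigrams is the simplest choice; a real predictor would normally smooth these estimates or back off to shorter n-grams.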

Going step by step, rather than putting everything directly in the dfm command, really speeds up the process, which is why this approach (separately tokenising, removing profanity, and creating trigrams) was used.

trigrams <- ngrams(newTokens,n = 3)

dfm3 <- dfm(trigrams,
            stem = TRUE,
            verbose = TRUE)

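Let us now look at the top 20 trigrams with the highest frequency. Because stem = TRUE is applied to each trigram as a single string, the last word of some entries appears truncated (for example, “thanks_for_th” stands for “thanks for the”).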

pander(data.frame(Freq = topfeatures(dfm3,20)),justify = "left")
  Freq
thanks_for_th 1164
going_to_b 749
i_want_to 686
looking_forward_to 528
thank_you_for 519
i_have_to 518
i_love_you 507
as_well_a 465
be_able_to 460
i_need_to 442
can’t_wait_to 427
for_the_follow 421
you_want_to 407
in_the_world 394
you_have_to 385
is_going_to 372
i_don’t_know 364
if_you_ar 360
the_fact_that 356
i_think_i 350

Creating a bar plot of the same trigram frequencies.

require(ggplot2)
temp <- topfeatures(dfm3,20)
dat <- data.frame(freq = temp,gram = names(temp))

ggplot(dat,aes(gram,freq)) + 
     geom_bar(stat = "identity",fill = "steelblue") + 
     scale_x_discrete(limits = dat$gram) + 
     coord_flip() + 
     labs(title = "Top trigrams in the sample dataset") +
     theme(axis.text.y = element_text(size = 13,face = "italic"),
           axis.text.x = element_text(size = 14),
           axis.title = element_text(size = 16),
           plot.title = element_text(size = 20))