This is the first milestone for the Coursera Data Science Capstone conducted by Johns Hopkins University in partnership with SwiftKey.
Of the four language corpora given to us, English is the only language I know, so that is what I am going to use for now.
twitter <- readLines("en_US.twitter.txt.bz2")
blogs <- readLines("en_US.blogs.txt.bz2")
news <- readLines("en_US.news.txt.bz2")
Each file is close to 200 MB in size. The Twitter file contains around 2.3 million lines, the blogs file contains 0.9 million lines, and the news file contains 1 million lines. Analysing the whole corpus would require more computing power than most individuals have access to. I will therefore take a small random sample from each file, which will hopefully be representative of the whole.
set.seed(1234) #for reproducibility
sizeSample <- 0.05 #only taking 5% of the data.
# creating samples for each dataset
# sample of twitter
ts <- sample(length(twitter),length(twitter)*sizeSample)
twitSample <- twitter[ts]
# sample of news
ns <- sample(length(news),length(news)*sizeSample)
newsSample <- news[ns]
# sample of blogs
bs <- sample(length(blogs),length(blogs)*sizeSample)
blSample <- blogs[bs]
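The three sampling steps above follow the same pattern, so they could be wrapped in a small helper. The following is only a minimal sketch in base R: `sample_lines` and the toy vector `toy` are hypothetical names used for illustration and are not part of the original analysis.

```r
# Hypothetical helper: draw a random fraction of the lines in a character vector.
sample_lines <- function(lines, fraction) {
  n <- floor(length(lines) * fraction)   # number of lines to keep
  lines[sample.int(length(lines), n)]    # sample indices without replacement
}

set.seed(1234)                           # for reproducibility
toy <- paste("line", 1:1000)             # stand-in for a real corpus file
toySample <- sample_lines(toy, 0.05)     # keep 5% of the lines
length(toySample)
```

With the real data, the same helper could be called once per source, for example `sample_lines(twitter, sizeSample)`.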
The data, as in most cases, requires cleaning. For instance, people use emoticons and other characters (perhaps from other languages) that are not represented properly in the raw text. These need to be removed.
# cleaning
twitter <- iconv(twitSample,to = "ASCII",sub = "")
blogs <- iconv(blSample,to = "ASCII",sub = "")
news <- iconv(newsSample,to = "ASCII",sub = "")
The iconv function converts text from one encoding to another (here everything was converted to ASCII), and the sub argument defines what any character that does not exist in the new encoding should be replaced with. With sub = "", such characters are simply dropped, so essentially I have removed all non-ASCII characters from the three data sources.
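To see the effect of this on a small scale, here is a minimal sketch using three made-up strings (the vector `x` is purely illustrative). The `from = "UTF-8"` argument is an assumption added for the example, and the exact result can vary with the platform's iconv implementation.

```r
# Toy input: an accented letter, a heart symbol, and plain ASCII text.
x <- c("caf\u00e9 au lait", "I \u2764 R", "plain text")

# Convert to ASCII; anything that cannot be represented is replaced by "".
cleaned <- iconv(x, from = "UTF-8", to = "ASCII", sub = "")
cleaned
```

Text that is already plain ASCII passes through unchanged, which is why this step is safe to apply to every line of the sample.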
Finally writing out these samples for future use.
# saving for easier future reading
# writing the cleaned (ASCII-only) text rather than the raw samples
writeLines(twitter, "./Sample 5%/twitter.txt")
writeLines(news, "./Sample 5%/news.txt")
writeLines(blogs, "./Sample 5%/blogs.txt")
Text mining in R can be done using a number of different libraries. The tm package is the most popular one, although it is also very slow. Searching the internet and the Coursera forums led me to the quanteda package, a fast and efficient library for text analysis that uses data.table and C++. If the last 9 courses have taught me anything, it’s that these two are always faster.
require(quanteda)
corpSource <- textfile("Sample 5%/*.txt")
sampleCorpus <- corpus(corpSource)
save(sampleCorpus,file = "combinedCorpusSample.Rdata") # saving for future use
Here’s a brief summary of the sample corpus.
require(pander)
# creating summary to hide messages
k <- summary(sampleCorpus)
## Corpus consisting of 3 documents.
##
## Text Types Tokens Sentences
## blogs.txt 83788 2145536 118560
## news.txt 30441 320200 15789
## twitter.txt 88282 1840334 187835
##
## Source: D:/RData/Coursera-SwiftKey/* on x86-64 by Areeb Khan
## Created: Sun Mar 20 21:24:29 2016
## Notes:
pander(k)
| Text | Types | Tokens | Sentences |
|---|---|---|---|
| blogs.txt | 83,788 | 2,145,536 | 118,560 |
| news.txt | 30,441 | 320,200 | 15,789 |
| twitter.txt | 88,282 | 1,840,334 | 187,835 |
Now to tokenise the text, i.e., to separate the words.
tokensAll <- tokenize(x = toLower(sampleCorpus),
removePunct = TRUE,
removeTwitter = TRUE,
removeNumbers = TRUE,
removeHyphens = TRUE,
verbose = TRUE)
Numbers, punctuation, hyphens, and Twitter special characters (@ and #) were all removed, the corpus was converted to lowercase, and then tokenised.
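For intuition, the same kind of clean-up can be approximated in a few lines of base R. This is only a rough sketch and not how quanteda works internally; `simple_tokenize` is a hypothetical helper, and it assumes the text has already been reduced to ASCII.

```r
# Hypothetical helper: lowercase the text, replace everything except letters,
# apostrophes and spaces with a space, then split on whitespace.
simple_tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z' ]+", " ", text)   # drops digits, punctuation, hyphens, @ and #
  tokens <- strsplit(text, "\\s+")[[1]]
  tokens[nzchar(tokens)]                 # discard any empty strings
}

toks <- simple_tokenize("Can't wait for #rstats in 2016, @user!")
toks
```

Note that the hashtag and the mention keep their words ("rstats", "user") and only lose the leading symbol, while the standalone number disappears.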
pander(data.frame(Tokens = ntoken(tokensAll)))
| | Tokens |
|---|---|
| blogs.txt | 1,855,417 |
| news.txt | 266,086 |
| twitter.txt | 1,479,499 |
One of our requirements was to remove profanity. In its simplest form, this involves first creating a list of swear words, and then using that list to remove words from our corpus, or to replace them with some placeholder text.
Various sources online provide such lists. For example, one such resource is a list of almost 1300 bad words, and another is on Shutterstock. However, these contain a number of words that are not really swear words; for instance, the first list contains words like “abuse”, “violence”, “arab”, etc. One good list that I found was the one used by Google. It had the fewest incorrect entries (I only found “God”).
profane <- readLines("Profanity/google bad words.txt")
# tokenising it since without it, the next step hangs up the computer for some reason
profanity <- tokenize(profane,
removePunct = TRUE,
removeSeparators = TRUE,
removeHyphens = TRUE,
simplify = TRUE)
Now we could either remove those words or replace them with placeholder text, something like "@#$%&!" (grawlix). However, since we are supposed to be predicting the next word, we would not want to recommend “Use a swear word of your choice please…”, which is why it is better to just remove all swear words altogether.
newTokens <- removeFeatures(tokensAll,profanity)
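The two options discussed above (dropping the words or masking them) can be illustrated on a toy token vector in base R. The words in `blocked` are mild placeholders standing in for a real profanity list, and all names here are hypothetical.

```r
tokens  <- c("what", "a", "darn", "lovely", "heck", "of", "a", "day")
blocked <- c("darn", "heck")   # placeholder list, not the real profanity file

# Option 1: remove the blocked words entirely (the approach taken in this report).
kept <- tokens[!tokens %in% blocked]

# Option 2: keep the positions but mask the blocked words with a placeholder.
masked <- ifelse(tokens %in% blocked, "<censored>", tokens)

kept
masked
```

For next-word prediction the first option is preferable, because a masked token could otherwise end up being suggested to the user.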
Let us now finally create a document-feature matrix (a frequency matrix), which shows how many times each word appears in each document.
dfm1 <- dfm(newTokens,
stem = TRUE,
verbose = TRUE)
dfm2 <- dfm(newTokens,
stem = TRUE,
ignoredFeatures = stopwords("english"),
verbose = TRUE)
Both commands create a document-feature matrix using the dfm command from the quanteda package. The swear words had already been removed from the tokens in the previous step. The ignoredFeatures argument is used to remove words from the corpus: in the second command it removes stopwords (common words like “and”, “or”, “of”, “in”, “it”, etc., which don’t add much meaning to the text), so that we can analyse which words are most frequent besides the most common words in the language. Stemming was used, which reduces different forms of a word to a common root. For example, “thank”, “thanks”, “thanked”, and “thanking” are all reduced to just “thank”.
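To make the idea of a document-feature matrix concrete, here is a minimal base R sketch on two tiny made-up documents. It is not the quanteda implementation; `docs`, `count_tokens`, and `toy_stopwords` are hypothetical names, and the stopword list is deliberately tiny.

```r
# Two toy documents, already tokenised.
docs <- list(doc1 = c("the", "cat", "sat", "on", "the", "mat"),
             doc2 = c("the", "dog", "sat"))

vocab <- sort(unique(unlist(docs)))      # every distinct word (feature)

# Count how often each vocabulary word occurs in one document.
count_tokens <- function(tok) as.vector(table(factor(tok, levels = vocab)))

# Rows are documents, columns are features, cells are counts.
dfm_toy <- t(vapply(docs, count_tokens, integer(length(vocab))))
colnames(dfm_toy) <- vocab
dfm_toy

# Overall frequencies, most frequent first.
top <- sort(colSums(dfm_toy), decreasing = TRUE)

# Dropping stopwords simply removes their columns.
toy_stopwords <- c("the", "on")
dfm_no_stop <- dfm_toy[, !colnames(dfm_toy) %in% toy_stopwords, drop = FALSE]
```

As in the real data, the most frequent feature in the toy matrix is a stopword, which is why a second matrix without stopwords is useful.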
The top 100 words can be easily visualised using a wordcloud, in which the size of the word represents its frequency.
require(RColorBrewer)
# wordcloud including common words
plot(dfm1,max.words = 100,
colors = brewer.pal(6,"Dark2"),
random.order = FALSE,
scale = c(8,1))
# wordcloud excluding common words
plot(dfm2,max.words = 100,
colors = brewer.pal(6,"Dark2"),
random.order = FALSE,use.r.layout = TRUE)
Here are the top 20 most frequent words.
pander(t(data.frame(Freq = topfeatures(dfm1,20))),justify = "left",caption = "Including common words")
| | the | to | and | i | in | it | you | is | that |
|---|---|---|---|---|---|---|---|---|---|
| Freq | 155,118 | 100,103 | 83,367 | 75,752 | 54,271 | 49,467 | 43,216 | 41,652 | 41,164 |

| | for | on | my | with | be | have | this | was | at | are | but |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Freq | 40,524 | 29,759 | 28,336 | 24,754 | 24,622 | 22,173 | 21,982 | 21,657 | 19,668 | 19,065 | 18,008 |
pander(t(data.frame(Freq = topfeatures(dfm2,20))),justify = "left",caption = "Excluding common words")
| | just | get | like | one | will | go | time | can | love | day |
|---|---|---|---|---|---|---|---|---|---|---|
| Freq | 12,877 | 12,609 | 12,365 | 11,852 | 11,439 | 11,062 | 10,358 | 10,078 | 9,635 | 9,322 |

| | make | know | good | thank | now | see | work | new | year | think |
|---|---|---|---|---|---|---|---|---|---|---|
| Freq | 8,341 | 8,085 | 7,789 | 7,534 | 7,406 | 6,826 | 6,785 | 6,756 | 6,692 | 6,552 |
Let us also create a 3-gram (trigram) model, which can later be used for prediction by applying the chain rule of probability. A trigram is a sequence of three words that appear consecutively, for instance “I am going”. The ngrams command finds every sequence of n consecutive words that appears in our sample, and a frequency matrix is then created from those sequences.
Going step by step, rather than putting everything directly into the dfm command, speeds up the process considerably, which is why the tokenising, profanity removal, and trigram creation were done as separate steps.
trigrams <- ngrams(newTokens,n = 3)
dfm3 <- dfm(trigrams,
stem = TRUE,
verbose = TRUE)
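As a rough illustration of how trigram counts could later feed a prediction, the sketch below builds n-grams from a toy token vector in base R and estimates the probability of a word given the two words before it as count(w1 w2 w3) / count(w1 w2). This simple maximum-likelihood estimate is an assumption made for the example, not the model used in this report, and `make_ngrams` is a hypothetical helper.

```r
# Hypothetical helper: join every run of n consecutive tokens with "_".
make_ngrams <- function(tokens, n = 3) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = "_"),
         character(1))
}

tokens <- c("i", "am", "going", "to", "be", "late",
            "i", "am", "going", "to", "sleep")

tri <- table(make_ngrams(tokens, 3))   # trigram counts
bi  <- table(make_ngrams(tokens, 2))   # bigram counts

# Estimated probability that "be" follows "going to" in the toy data.
p_be <- as.integer(tri["going_to_be"]) / as.integer(bi["going_to"])
p_be
```

In the toy data “going to” occurs twice and is followed by “be” once, so the estimate is one half.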
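Let us now look at the top 20 trigrams with the highest frequency. Because stemming appears to be applied to each trigram as a single token, the last word of some trigrams is truncated; for example, “thanks_for_th” stands for “thanks for the” and “going_to_b” for “going to be”.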
pander(data.frame(Freq = topfeatures(dfm3,20)),justify = "left")
| | Freq |
|---|---|
| thanks_for_th | 1164 |
| going_to_b | 749 |
| i_want_to | 686 |
| looking_forward_to | 528 |
| thank_you_for | 519 |
| i_have_to | 518 |
| i_love_you | 507 |
| as_well_a | 465 |
| be_able_to | 460 |
| i_need_to | 442 |
| can’t_wait_to | 427 |
| for_the_follow | 421 |
| you_want_to | 407 |
| in_the_world | 394 |
| you_have_to | 385 |
| is_going_to | 372 |
| i_don’t_know | 364 |
| if_you_ar | 360 |
| the_fact_that | 356 |
| i_think_i | 350 |
Plotting the same frequencies as a bar chart.
require(ggplot2)
temp <- topfeatures(dfm3,20)
dat <- data.frame(freq = temp,gram = names(temp))
ggplot(dat,aes(gram,freq)) +
geom_bar(stat = "identity",fill = "steelblue") +
scale_x_discrete(limits = dat$gram) +
coord_flip() +
labs(title = "Top trigrams in the sample dataset") +
theme(axis.text.y = element_text(size = 13,face = "italic"),
axis.text.x = element_text(size = 14),
axis.title = element_text(size = 16),
plot.title = element_text(size = 20))