To mine text properly, it is essential to use appropriate tools. Below is a list of the packages I used to process the text-based data. To start, we install them and then load their libraries in RStudio.
require(stringi)
require(knitr)
require(tm)
require(RWeka)
require(SnowballC)
require(ggplot2)
require(wordcloud)
require(cluster)
require(fpc)
The data for this capstone come from a corpus called HC Corpora (http://www.corpora.heliohost.org/). The corpora available on that website are listed HERE; some of them contain foreign-language text. The dataset for this capstone is available HERE. It contains files for four locales (en_US, de_DE, ru_RU and fi_FI), each drawn from three sources: twitter, blogs and news. For this capstone, the English (en_US) dataset has been selected for analysis.
dataset.dir <- "./dataset/en_US"
dataset.file <- dir(dataset.dir)
dataset.path <- paste0(dataset.dir, "/", dataset.file)
dataset.size <- file.size(dataset.path)
blogs.data <- readLines(dataset.path[1], encoding = "latin1", skipNul=TRUE)
news.data <- readLines(dataset.path[2], encoding = "latin1", skipNul=TRUE)
twitter.data <- readLines(dataset.path[3], encoding = "latin1", skipNul=TRUE)
# count the non-whitespace tokens (words) in a character vector
WordCounter <- function(x) {sum(sapply(gregexpr("\\S+", x), length))}
blogs.words <- WordCounter(blogs.data)
news.words <- WordCounter(news.data)
twitter.words <- WordCounter(twitter.data)
dataset.words <- c(blogs.words, news.words, twitter.words)
dataset.summary.blogs <- stri_stats_general(blogs.data)
dataset.summary.news <- stri_stats_general(news.data)
dataset.summary.twitter <- stri_stats_general(twitter.data)
dataset.summary <- rbind(dataset.summary.blogs,
dataset.summary.news,
dataset.summary.twitter)
dataset.summary <- data.frame(dataset.summary, words = dataset.words,
size_Mb = dataset.size/1024^2)
rownames(dataset.summary) <- dataset.file
kable(dataset.summary)
|  | Lines | LinesNEmpty | Chars | CharsNWhite | words | size_Mb |
|---|---|---|---|---|---|---|
| en_US.blogs.txt | 899288 | 899288 | 208361438 | 171925775 | 37334441 | 200.4242 |
| en_US.news.txt | 1010242 | 1010242 | 203791401 | 170428696 | 34372597 | 196.2775 |
| en_US.twitter.txt | 2360148 | 2360148 | 162385035 | 134370242 | 30373832 | 159.3641 |
The datasets for this project are fairly large, so to speed up this step it is advisable to work with a smaller subset of the data. A sample of 5000 lines from each source has been selected for further analysis.
sample.news <- news.data[sample(1:length(news.data), 5000)]
sample.twitter <- twitter.data[sample(1:length(twitter.data), 5000)]
sample.blogs <- blogs.data[sample(1:length(blogs.data), 5000)]
sample.data <- c(sample.news,sample.twitter,sample.blogs)
First we convert the sampled data to a corpus, which is the base format for text analysis in the tm package.
corpus <- VCorpus(VectorSource(sample.data))
Now we can go on to preprocess the text data. In this step we remove numbers, capitalization, common words and punctuation, and otherwise prepare the text for analysis. This is a time-consuming job, but at the end of it we have high-quality data for analysis. To see the content of the corpus at any point, we can use the inspect command in the R console.
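As a small illustration (this call is not part of the original run), the first two documents can be examined with:
# peek at the metadata and content of the first two documents
inspect(corpus[1:2])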
R cannot read like a human and does not handle punctuation and other special characters the way we do. The following chunk removes punctuation and numbers from the text.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
The code chunk below removes URLs completely (punctuation was already stripped above, so a URL is left as a single alphanumeric token starting with "http"). [:alnum:] matches any alphanumeric character, i.e. letters and digits, and [:punct:] matches punctuation characters. See the details by running ?regex in R or searching for "regular expression".
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeURL))
The code chunk below removes anything other than English letters or spaces.
removeNumPunct <- function(x) gsub("[^[:alpha:][:space:]]*", "", x)
corpus <- tm_map(corpus, content_transformer(removeNumPunct))
We want a word to appear in exactly the same form every time it occurs, so we convert everything to lowercase.
corpus <- tm_map(corpus, content_transformer(tolower))
Stop words are common words in any language that usually carry no analytic value, so we remove them from the dataset.
corpus <- tm_map(corpus, removeWords, stopwords(kind = "en"))
If you would like to see some of the English stop words, the chunk below will help:
length(stopwords("english"))
## [1] 174
stopwords("english")[1:20]
## [1] "i" "me" "my" "myself" "we"
## [6] "our" "ours" "ourselves" "you" "your"
## [11] "yours" "yourself" "yourselves" "he" "him"
## [16] "his" "himself" "she" "her" "hers"
The data contain words with offensive and profane meanings. A list of these bad words (profanity) was therefore collected and used to filter them out of the dataset; they will not be used in the prediction model.
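The object data.profanity used below is not defined elsewhere in this report; as an illustration only, it could be read from a plain-text word list (the file path here is hypothetical):
# hypothetical example: read a one-word-per-line profanity list into a character vector
data.profanity <- readLines("./dataset/profanity.txt", skipNul = TRUE)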
corpus <- tm_map(corpus, removeWords, data.profanity)
Some words appear with a variety of endings in the original text. Stemming refers to removing common word endings (e.g., "ing", "es", "s") so that the different forms are recognized as the same word by the computer. For this we use the SnowballC package.
corpus <- tm_map(corpus, stemDocument)
The preprocessing above leaves a lot of white space where words were deleted. These extra white spaces are removed in this step.
corpus <- tm_map(corpus, stripWhitespace)
To finish the preprocessing, we tell R to treat the preprocessed documents as plain text documents.
corpus <- tm_map(corpus, PlainTextDocument)
To continue, we need to create a document-term matrix (or term-document matrix), a mathematical matrix that describes the frequency of terms occurring in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms (Wikipedia).
dtm <- DocumentTermMatrix(corpus)
dtm
To inspect it, we can use inspect(dtm); this will, however, fill up the console quickly, so you may prefer to view a subset, e.g. inspect(dtm[1:5, 1:20]) to view the first 5 documents and first 20 terms, or dim(dtm) to display the number of documents and terms (in that order). We will also need the transpose of this matrix:
tdm <- TermDocumentMatrix(corpus)
tdm
The code chunk below organizes terms by their frequency.
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 26055
Here we check out some of the most and least frequently occurring words.
ord <- order(freq)
head(freq[1:10])
## aaa âââ ââââââââââââââââ aaaaaaar
## 3 1 1 1
## aaah aaahahahah
## 2 1
freq[tail(ord)]
## get just like will one said
## 1206 1215 1262 1303 1351 1526
Check out the frequency of frequencies.
head(table(freq), 20)
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12
## 14583 3448 1651 973 652 465 377 304 259 226 169 159
## 13 14 15 16 17 18 19 20
## 171 124 117 101 90 86 81 68
tail(table(freq), 20)
## freq
## 632 635 648 659 686 689 704 723 763 781 824 960 1075 1124 1206
## 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1
## 1215 1262 1303 1351 1526
## 1 1 1 1 1
Another way to display the result:
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
freq[1:20 ]
## said one will like just get time can year make day new
## 1526 1351 1303 1262 1215 1206 1124 1075 960 824 781 763
## work know now peopl say love also want
## 723 704 689 689 686 659 648 635
wf <- data.frame(word=names(freq), freq=freq)
wf[1:20, ]
## word freq
## said said 1526
## one one 1351
## will will 1303
## like like 1262
## just just 1215
## get get 1206
## time time 1124
## can can 1075
## year year 960
## make make 824
## day day 781
## new new 763
## work work 723
## know know 704
## now now 689
## peopl peopl 689
## say say 686
## love love 659
## also also 648
## want want 635
Here we plot the words that appear at least 500 times.
p <- ggplot(subset(wf, freq>500), aes(word, freq))
p <- p + geom_bar(stat="identity")
p <- p + coord_flip()
p
If you have a term in mind that is particularly meaningful to your analysis, you may find it helpful to identify the words that correlate most highly with it. If two words always appear together, their correlation is 1.0.
# specifying a correlation limit of 0.20
findAssocs(dtm, "year", corlimit=0.20)
## $year
## ago last
## 0.24 0.23
A word cloud gives a quick visual impression of the most frequent words; setting a seed makes its layout reproducible.
set.seed(2015)
wordcloud(names(freq), freq, min.freq=100, scale=c(5, 0.1),
colors=brewer.pal(6, "Dark2"))
Before clustering, it pays to remove many of the uninteresting or infrequent words first. We can do so with the following code, which produces a matrix that is at most 96.5% empty space.
dtm.new <- removeSparseTerms(dtm, 0.965)
dtm.new
## <<DocumentTermMatrix (documents: 15000, terms: 28)>>
## Non-/sparse entries: 20961/399039
## Sparsity : 95%
## Maximal term length: 5
## Weighting : term frequency (tf)
Now we can calculate the distances between words and then cluster them according to similarity.
d <- dist(t(dtm.new), method="euclidean")
fit <- hclust(d=d, method="ward") # recent versions of hclust map "ward" to "ward.D" (with a warning)
fit
##
## Call:
## hclust(d = d, method = "ward")
##
## Cluster method : ward.D
## Distance : euclidean
## Number of objects: 28
plot(fit, hang = -1)
groups <- cutree(fit, k=5)
rect.hclust(fit, k=5, border="red")
The k-means clustering method attempts to cluster the words into a specified number of groups (in this case 2), such that the sum of squared distances between individual words and their group centers is minimized. You can change the number of groups by changing the number given in the kmeans() call, as shown in the sketch after the plot below.
kfit <- kmeans(d, 2)
clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)
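For instance (an illustrative re-run, not part of the original analysis), asking for five groups instead of two would look like this:
# same k-means clustering of the distance matrix, but with k = 5 groups
kfit5 <- kmeans(d, 5)
clusplot(as.matrix(d), kfit5$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)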
Tokenization is the act of breaking a sequence of strings into pieces, such as words, keywords, phrases, symbols and other elements, called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenization, some characters such as punctuation marks are discarded. The tokens become the input for further processing such as parsing and text mining.
Tokenization is widely used in computer science, where it plays a large part in lexical analysis; a bigram tokenizer is sketched below.
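As a minimal sketch (not part of the original analysis), RWeka's NGramTokenizer can be used to tokenize the corpus into bigrams and build a term-document matrix of two-word phrases:
# bigram tokenizer based on RWeka; min and max control the n-gram length
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
# term-document matrix whose terms are two-word phrases
tdm.bigram <- TermDocumentMatrix(corpus, control = list(tokenize = BigramTokenizer))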
Finally, the helper function below draws a word cloud directly from a term-document matrix.
wordCloudPlot <- function(tdm, n) {
  m <- as.matrix(tdm)
  # calculate the frequency of each word and sort in decreasing order
  word.freq <- sort(rowSums(m), decreasing = TRUE)
  df <- data.frame(word = names(word.freq), freq = word.freq)
  # plot the word cloud, keeping words that occur at least n times
  wordcloud(df$word, df$freq, scale = c(5, 0.3), min.freq = n,
            random.order = FALSE, colors = brewer.pal(8, "Dark2"))
}
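As an illustrative call (not shown in the original report), the function can be applied to the term-document matrix built earlier:
# hypothetical usage: draw a word cloud of the words occurring at least 100 times in tdm
wordCloudPlot(tdm, 100)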