Introduction

We use smartphones everywhere. Whether at home or at the beach, we see people fiddling with their phones, and it is not an isolated phenomenon: these devices now handle almost everything, including phone calls, internet browsing, messaging, GPS navigation and music playback. Alongside this increase in use, people also want to spend less time on non-crucial steps such as typing. This capstone project is developed in conjunction with SwiftKey [1], which applies natural language processing techniques to large amounts of text to make typing easier by predicting the next word to be typed or correcting misspelled words. We analysed three datasets, drawn from U.S. news, blogs and Twitter, in order to build a language model similar to the ones used by SwiftKey. This first report covers only the pre-processing step, in which we obtain the data, sample it and gather a few descriptive statistics about it.

Data Gathering and Sampling

The dataset provided by Coursera consists of .txt files compressed into a zip archive to reduce their size. The archive contains files in several languages; in this project we chose the English-language files.

1 - Data Loading and Basic Summary Statistics

Because of their size, and for testing purposes, we manually downloaded and decompressed the files in our working directory. We selected three files: “en_US.blogs.txt” (approximately 200 MB), “en_US.news.txt” (approximately 196 MB) and “en_US.twitter.txt” (approximately 159 MB). In the R code, we set the working directory, read each file and estimated its word count by counting the sequences of non-word characters in each line. For the Twitter data, which contains non-UTF-8 characters such as emoticons, we used the “iconv” function to remove them.

# set working directory
setwd("C:\\Users\\Caio\\Documents\\Coursera\\Data Science Capstone\\DataSet 0")

# read blogs data: 899,299 lines
blogsData <- readLines(file("en_US.blogs.txt", encoding = "UTF-8"))

# approximate the word count by counting sequences of non-word characters
blogNWords <- sum(sapply(gregexpr("\\W+", blogsData), length))
print(paste("Number of words in blogs dataset:", blogNWords, sep = " "))
## [1] "Number of words in blogs dataset: 38222279"
# read news data: 77,259 lines
newsData <- readLines(file("en_US.news.txt", encoding = "UTF-8"))

newsNWords <- sum(sapply(gregexpr("\\W+", newsData), length))
print(paste("Number of words in news dataset:", newsNWords, sep = " "))
## [1] "Number of words in news dataset: 2748070"
# read twitter data: 2,360,148 lines
twitterData <- readLines(file("en_US.twitter.txt", encoding = "UTF-8"))
# remove emoticons and other non-UTF-8 symbols
twitterData <- iconv(twitterData, from = "latin1", to = "UTF-8", sub = "")

twitterNWords <- sum(sapply(gregexpr("\\W+", twitterData), length))
print(paste("Number of words in twitter dataset:", twitterNWords, sep = " "))
## [1] "Number of words in twitter dataset: 30513860"

Because of the size of the files, we need to create a randomized sample of them in order to process them in a feasible amount of time. We decided to extract 10,000 lines from each file. Finally, we saved each sample as an RDS file, since RDS files are serialized and compressed and therefore take up less space. As we are not going to use the original datasets any further, we remove them from memory.

n <- 10000
set.seed(3235) # for reproducibility purposes
######
filterIndexes <- sample(length(blogsData), # draw n line indexes
                        n,                 # without
                        replace = FALSE)   # replacement
blogsData <- blogsData[filterIndexes]      # keep only the sampled lines
######
######
filterIndexes <- sample(length(newsData),
                        n,
                        replace = FALSE)
newsData <- newsData[filterIndexes]
######
######
filterIndexes <- sample(length(twitterData),
                        n,
                        replace = FALSE)
twitterData <- twitterData[filterIndexes]

saveRDS(blogsData,'SampleBlogData.rds')
saveRDS(newsData,'SampleNewsData.rds')
saveRDS(twitterData,'SampleTwitterData.rds')

rm(blogsData); rm(newsData); rm(twitterData); # memory release
######

2 - Exploratory Analysis

For the exploratory analysis, we used a few additional packages:

  • tm: the base text-mining package - corpus creation and basic data structures
  • SnowballC: word stemming, one of the pre-processing steps
  • wordcloud: creates a visual representation of the most frequently used words in a corpus
  • slam: tm stores its term matrices as simple triplet matrices, a type of sparse matrix used to save space; slam provides functions for arithmetic operations on these sparse matrices
  • RWeka: N-gram tokenization, one of the pre-processing steps

Note that some packages depend on other sources, such as other packages or environments (for example, RWeka requires Java).
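
If any of these packages are missing, a one-time installation from CRAN along the following lines should cover them (an optional sketch, not part of the original analysis):

# install the additional packages (one-time step; CRAN repository assumed)
install.packages(c("tm", "SnowballC", "wordcloud", "slam", "RWeka"))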

#load libraries
library(tm)         # basic text mining package
library(SnowballC)  # word stemming
library(wordcloud)  # word cloud
library(slam)       # sparse matrix arithmetic
library(RWeka)      # N-gram creation

The basic data structure for text mining with the tm package is the Corpus object, so we convert each sampled dataset into a Corpus.

#load data
corpusB <- Corpus(VectorSource(readRDS("SampleBlogData.rds")))
corpusN <- Corpus(VectorSource(readRDS("SampleNewsData.rds")))
corpusT <- Corpus(VectorSource(readRDS("SampleTwitterData.rds")))
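
As a quick sanity check (an optional step, not part of the original workflow), each corpus should now contain one document per sampled line, i.e. 10,000 documents:

# number of documents in the blog corpus (should be 10,000)
length(corpusB)
# peek at the first two documents
inspect(corpusB[1:2])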

Corpus Preprocessing

The next step was to apply basic transformations to each corpus that are pertinent to text mining: conversion to lower case, removal of punctuation, numbers and stopwords, and word stemming. Finally, we created the document-term matrix, which is the data structure on which we actually do our processing. An example of this type of processing can be seen here.

# concatenated list
corpusVector <- list(corpusB, corpusN, corpusT)
myTdmn <- list() # used to store document term matrix

# memory deallocation
rm(corpusB); rm(corpusN); rm(corpusT);

for (i in 1:length(corpusVector)){
    # transform to lower case
    corpusVector[[i]] <- tm_map(corpusVector[[i]], tolower)  
    # remove punctuation
    corpusVector[[i]] <- tm_map(corpusVector[[i]], removePunctuation)
    # remove numbers
    corpusVector[[i]] <- tm_map(corpusVector[[i]], removeNumbers)
    # remove english stop words
    corpusVector[[i]] <- tm_map(corpusVector[[i]], removeWords,stopwords("english"))
    # stem words, keeping only word radicals
    corpusVector[[i]] <- tm_map(corpusVector[[i]], stemDocument)
    # transform to plain text
    corpusVector[[i]] <- tm_map(corpusVector[[i]], PlainTextDocument)
    # build the document-term matrix: documents as rows, term frequencies as columns
    myTdmn[[i]] <- DocumentTermMatrix(corpusVector[[i]], control=list(wordLengths=c(0,Inf)))
}
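
To get a feel for the resulting document-term matrices, a couple of quick checks can be run (an optional sketch, shown here for the blog sample only):

# dimensions: number of documents x number of distinct terms
dim(myTdmn[[1]])
# terms appearing at least 100 times in the blog sample
findFreqTerms(myTdmn[[1]], lowfreq = 100)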

Word Cloud

One way to display the most frequently used words in a text is a word cloud, in which each word is sized according to how often it occurs. From these clouds we can already see some interesting patterns:

  • Indirect speech (e.g. “said”) and more formal words in the news word cloud
  • A wider variety of words in the blogs dataset
  • Increased use of positive sentiment words in the Twitter dataset

# create word cloud
# note: col_sums() comes from the slam package; it computes column sums of a sparse matrix
set.seed(3366)      # for reproducibility
par(mfrow = c(1,3)) # all 3 plots in 1 row
titles = c("Blog Wordcloud", "News wordcloud", "Twitter Wordcloud")

for(i in 1:3){
    # words and their frequencies
    wordcloud(words=colnames(myTdmn[[i]]), freq=col_sums(myTdmn[[i]]), 
              scale = c(3,1),max.words = 100,random.order = F,
              rot.per = 0.35,use.r.layout = F, colors = brewer.pal(8,"Dark2"))
    title(titles[i])
}

Fig. 1: Word clouds as a visual representation of word usage. More frequently used words appear larger and closer to the center. Fewer colours in a word cloud indicate a less complex text structure (one dominant word with only a few others around it).

2-Gram Histograms (Bigrams)

N-grams are contiguous sequences of n terms from a given sequence of speech or text. They are widely used in probabilistic language models because of their efficiency and simplicity. We analysed the most frequent bigrams (2-grams) in each of the three datasets.
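
As a quick standalone illustration (not part of the pipeline below), the RWeka tokenizer turns a sentence into its overlapping word pairs:

# bigrams of a single example sentence
NGramTokenizer("to be or not to be", Weka_control(min = 2, max = 2))
# expected pairs: "to be", "be or", "or not", "not to", "to be"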

# NGramTokenizer is a function from the RWeka package; it is passed as the
# tokenize control option to the DocumentTermMatrix function.
bigramTokenizer <- function(x)NGramTokenizer(x,Weka_control(min = 2, max = 2))
dsNames <- c("Blog","News","Twitter")
par(mfrow = c(3,1))
for(i in 1:3){
    # create document term matrix for 2-gram words
    bigramDTM <- DocumentTermMatrix(corpusVector[[i]],
                                    control = list(tokenize = bigramTokenizer))
    # get the 40 most frequent bigrams
    bigramTermsCount <- sort(col_sums(bigramDTM),decreasing = TRUE)[1:40]
    # create barplot
    bar <- barplot(bigramTermsCount, axes = FALSE,axisnames = FALSE,
                   density = bigramTermsCount+30,
                   border = "red",
                   ylab="Frequency", ylim = c(0,max(bigramTermsCount)+9),
                   main = paste("Frequency of bigrams for",
                                dsNames[i],
                                "data set",
                                sep = " "))
    # rotate x labels
    text(bar, par("usr")[3], labels = names(bigramTermsCount),
         srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex = 0.9)
    # add frequency number to the top of each bar
    text(bar, y = bigramTermsCount, label = bigramTermsCount, pos = 3,
         cex = 0.8, col = "red")
    # make y axis appear
    axis(2)    
}

Fig. 2: The 40 most frequent bigrams in each dataset.

Hierarchical clustering (corpus complexity)

From a complexity point of view, clustering the datasets can show us which corpus uses a wider variety of words together. The hierarchical clustering shows which words can be grouped together, i.e. which words were found together more frequently within the corpus. From these plots, we can see:

  • The blog text sample shows the most complex word structure, while the Twitter dataset has the simplest structure of used phrases.

dsNames <- c("Blog","News","Twitter")
par(mfrow = c(1,3))

for(i in 1:3){
    # remove very sparse terms (terms missing from more than 97% of documents)
    sparseDTM <- removeSparseTerms(t(myTdmn[[i]]),sparse = 0.97)
    sparseDTM <- as.matrix(sparseDTM)
    
    distMatrix <- dist(scale(sparseDTM))
    # hierarchical clustering using Ward's method
    cluster <- hclust(distMatrix,method = "ward.D2")
    
    plot(cluster, cex = 0.9,
         xlab = "Hierarchical Clustering",
         main = paste("Structure complexity for",
                      dsNames[i],
                      "data set",
                      sep = " "))
}

Fig. 3: Hierarchical clustering dendrograms for each dataset.

Conclusion and Next Steps

The following ideas can be deduced from this data:

  • The vocabulary is environment dependent. This can be an advantage when predicting words depending on where the user is typing the text.
  • The phrase structure is much more complex in the blog dataset, and the opposite is found in the Twitter dataset.
  • Indirect speech is more prevalent in the news dataset, while direct speech is more prevalent on Twitter.
  • Sentiment analysis can be applied depending on the environment: the Twitter dataset seems to use positive words, while the news and blog datasets tend to be more neutral.

For the next part of this capstone project, some ideas are:

  • Use the N-grams to create a language model that helps to predict the next word and correct misspellings (a minimal sketch of the prediction idea follows this list).
  • Evaluate the possibility of including sentiment analysis to detect the mood of a sentence and consequently enhance next-word predictability.
  • Find an approach to handle words that were not seen in the training set (the databases), i.e. some smoothing technique.
  • Reinforce the use of sparse matrices to optimize the application.
  • Implement the ideas as a Shiny App.
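
As a rough sketch of the first idea (the function and data structures below are hypothetical and only illustrate the approach), a bigram-based predictor could count how often each word follows a given word and suggest the most frequent continuations:

# hypothetical sketch of a bigram-based next-word predictor
# bigramCounts: a named numeric vector of bigram frequencies,
#               e.g. col_sums() of a bigram document-term matrix as built above
predictNextWord <- function(previousWord, bigramCounts, k = 3) {
    # keep only bigrams whose first word is the previous word
    candidates <- bigramCounts[grepl(paste0("^", previousWord, " "),
                                     names(bigramCounts))]
    # unseen previous word: no candidates (this is where smoothing is needed)
    if (length(candidates) == 0) return(character(0))
    # take the k most frequent bigrams and strip the first word
    topBigrams <- names(sort(candidates, decreasing = TRUE))
    topBigrams <- topBigrams[1:min(k, length(topBigrams))]
    sub(paste0("^", previousWord, " "), "", topBigrams)
}

# hypothetical usage with bigram counts from one of the samples:
# predictNextWord("right", bigramTermsCount)

An unseen previous word currently returns no suggestions, which is exactly the gap a smoothing technique would fill.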

Note: all code is available in this public Dropbox folder; just remember to change the working directory =]

Caio H. K. Miyashiro - Brazil