Abstract

The use of social media has grown significantly in recent years, and with it the amount of information generated by its users. Here we apply text mining to news analysis in order to predict trends of a stock in the stock market. News related to the Infosys stock from June 2014 to September 2014 is considered for the analysis.

Here we discuss the application of tokenisation, association, word cloud and clustering analysis to news media. This is demonstrated by analysing Infosys stock news from Money Control, Reuters India, CNBC-TV18, ET Now and HT News. The results of these analyses help identify keywords and concepts in the news media data, and can facilitate their use by traders and investors. By analysing this information and applying the results in the relevant area, traders and investors will be able to proactively address potential market and investment opportunities.

Keywords: Social media, news, analytics, data mining, text mining, clustering, association analysis, sentiment analysis.

Introduction

News about the Infosys stock is used as the data to analyse. The analysis starts with extracting the text of Infosys-related news items from news media. The extracted text is then transformed to build a term-document matrix. After that, frequent words and associations are found from the matrix. A word cloud is used to present important words in the documents. Finally, words and news items are clustered to find groups of words and groups of news.

There are a few important packages used in the examples: twitteR, tm and wordcloud. Package twitteR [Gentry, 2013] provides access to Twitter data, tm [Feinerer and Hornik, 2013] provides functions for text mining, and wordcloud [Fellows, 2013] visualizes the results with a word cloud.

1 Retrieving Text from News Articles

News text related to the Infosys stock, taken from Money Control, Reuters India, CNBC-TV18, ET Now and HT News, is stored in a CSV file. The news text is extracted from the file with the code below.

## Package start-up and masking messages from loading tm (with NLP), ggplot2,
## wordcloud (with RColorBrewer), latticeExtra (with lattice), randomForest,
## twitteR, plyr and dplyr are omitted here.
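
For reference, the corresponding library() calls, inferred from those start-up messages (the original loading code is not shown in the report), would be roughly:

library(tm)            # text mining; attaches NLP
library(ggplot2)
library(wordcloud)     # attaches RColorBrewer
library(latticeExtra)  # attaches lattice
library(randomForest)
library(twitteR)
library(plyr)
library(dplyr)
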
# read the news data from the CSV file
InfosysNews <- read.csv("InfosysTestWeka.csv")
InfosysNews.Details <-InfosysNews$Details
testReplace<-InfosysNews.Details

Transforming Text

The news text is converted to a corpus, which is a collection of text documents. After that, the corpus can be processed with functions provided in package tm [Feinerer and Hornik, 2013].

# build a corpus, and specify the source to be character vectors
InfosysCorpus<-Corpus(VectorSource(testReplace))

After that, the corpus needs a couple of transformations, including changing letters to lower case and removing punctuation, numbers and stop words. The general English stop-word list is extended here with additional words such as "also" and "always".

document.collection <-InfosysCorpus
document.collection <-tm_map(document.collection,stripWhitespace)
document.collection <- tm_map(document.collection, content_transformer(tolower))
document.collection <-tm_map(document.collection,removeNumbers)
document.collection <-tm_map(document.collection,removePunctuation)
document.collection <- tm_map(document.collection,removeWords, stopwords("english"))

more.stop.words <- c("also","always","among","can","come","comes","even",
                     "comming","say","saying","says","said","still","two","three","ltd") 
document.collection <- tm_map(document.collection, 
                              removeWords, more.stop.words)

dataframe<-data.frame(text=unlist(sapply(document.collection, `[`, "content")), stringsAsFactors=F)

Building a Term-Document Matrix

A term-document matrix represents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document.

initial.tdm <- TermDocumentMatrix(document.collection)
termDocMatrix <- as.matrix(initial.tdm)
Ndoc <- length(InfosysCorpus)
#write.csv(termDocMatrix, "test.csv")
initial.tdm
## <<TermDocumentMatrix (terms: 3891, documents: 343)>>
## Non-/sparse entries: 18144/1316469
## Sparsity           : 99%
## Maximal term length: 24
## Weighting          : term frequency (tf)

As we can see from the above result, the term-document matrix is composed of 3891 terms and 343 documents. It is very sparse, with 99% of the entries being zero.
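
To get a feel for the matrix itself, a small slice can be inspected with tm's inspect() (a quick check, not part of the original analysis):

# look at the first five terms in the first five documents
inspect(initial.tdm[1:5, 1:5])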

Frequent Terms and Associations

We now have a look at the popular words and the associations between words. Note that there are 343 news items in total.

# Frequent Words and Association
# build the term-document matrix for frequency and association analysis,
# keeping words of any length
myTdm <- TermDocumentMatrix(document.collection, control=list(wordLengths=c(1,Inf)))

# Frequent Words
findFreqTerms(myTdm, lowfreq=60)
##  [1] "around"      "attrition"   "back"        "big"         "board"      
##  [6] "business"    "buy"         "ceo"         "change"      "client"     
## [11] "companies"   "company"     "continue"    "deal"        "executive"  
## [16] "gain"        "going"       "good"        "growth"      "high"       
## [21] "india"       "indian"      "infosys"     "investors"   "large"      
## [26] "last"        "like"        "management"  "market"      "may"        
## [31] "murthy"      "new"         "now"         "one"         "people"     
## [36] "percent"     "positive"    "qoq"         "quarter"     "revenue"    
## [41] "rupee"       "see"         "services"    "shares"      "software"   
## [46] "stock"       "take"        "time"        "top"         "vishalsikka"
## [51] "well"        "will"        "year"
# Arrange frequent words in ascending order
wordmatrix <- as.matrix(myTdm)
vsort <- sort(rowSums(wordmatrix), decreasing=TRUE)
head(vsort,55)
##     infosys     percent vishalsikka     company        will       stock 
##         808         393         329         315         302         245 
##      growth         ceo        year         new         one     revenue 
##         215         207         145         140         122         118 
##  management         buy         top        time     quarter    services 
##         117         110         109         106         103         100 
##         may      market   companies        last      change      murthy 
##          99          98          92          91          88          86 
##        high         see      shares    continue       india         now 
##          83          83          83          81          80          80 
##        deal       rupee    software      client    business         big 
##          79          79          77          73          71          70 
##       board      indian       large        like   executive         qoq 
##          70          70          70          70          69          68 
##        take      around        gain    positive        well   investors 
##          67          66          65          65          65          64 
##        back       going        good   attrition      people      margin 
##          63          63          63          62          60          59 
##       level 
##          58
termFrequency <- rowSums(as.matrix(myTdm))
termFrequency <- subset(termFrequency, termFrequency>=60)
df <- data.frame(term=names(termFrequency), freq=termFrequency)

# Plot frequently occurring words
ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
  xlab("Terms") + ylab("Count") + coord_flip()

In the code above, findFreqTerms() finds frequent terms with frequency no less than 60. Note that they are ordered alphabetically, instead of by frequency or popularity. To show the top frequent words visually, we next make a bar plot of them. From the term-document matrix, we derive the frequency of terms with rowSums(), select terms with a frequency of at least 60, and show them with a bar plot using package ggplot2 [Wickham, 2009]. In the plotting code, geom_bar(stat="identity") specifies a bar plot and coord_flip() swaps the x- and y-axes. The bar plot clearly shows that the three most frequent words are "infosys", "percent" and "vishalsikka".

We can also find which words are highly associated with a given word using the function findAssocs(). Below we find the terms associated with "infosys" and with "vishalsikka" with correlation no less than 0.5; in each case the words are ordered by their correlation with the query term.

# which words are associated with "infosys"?
findAssocs(myTdm, 'infosys', 0.50)
##             infosys
## company        0.69
## new            0.61
## change         0.60
## ceo            0.57
## management     0.54
## vishalsikka    0.52
## fill           0.51
## murthy         0.51
## challenge      0.50
## founder        0.50
# which words are associated with "vishalsikka"?
findAssocs(myTdm, 'vishalsikka', 0.50)
##           vishalsikka
## sap              0.66
## choice           0.60
## executive        0.55
## drive            0.53
## ceo              0.52
## company          0.52
## hana             0.52
## infosys          0.52
## board            0.51
## chief            0.51

Tokenize and Plot

As single-word frequencies alone do not provide much information, let us do n-gram tokenisation of the corpus using RWeka, first with bigrams and then with trigrams.
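
The NGramTokenizer() and Weka_control() functions used below come from the RWeka package, which has to be loaded first (this call is assumed, as the original loading code is not shown):

library(RWeka)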

#Bigram Tokenise

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- DocumentTermMatrix(document.collection, control = list(tokenize = BigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)

pl <- ggplot(subset(wof, freq > 20) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="blue")
pl + theme(axis.text.x=element_text(angle=90))

#Trigram Tokenise
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm <- DocumentTermMatrix(document.collection, control = list(tokenize = TrigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)

pl <- ggplot(subset(wof, freq > 7) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="brown")
pl + theme(axis.text.x=element_text(angle=90))

Association Analysis

An analysis that highlights combinations of words, regardless of where they occur, can also help an analyst understand the key concepts in a body of text data. The technique used to highlight such combinations is association analysis. This analysis determines the likelihood of a combination of items occurring together, as well as a confidence around that projection. Ultimately, association analysis produces a set of if-then rules (if item A is present in a transaction, then item B will be present as well) and the lift associated with each rule.

#Association rule
# source("http://bioconductor.org/biocLite.R")
# biocLite("Rgraphviz")

library(graph)
## 
## Attaching package: 'graph'
## 
## The following object is masked from 'package:plyr':
## 
##     join
library(Rgraphviz)
## 
## Attaching package: 'Rgraphviz'
## 
## The following object is masked from 'package:twitteR':
## 
##     name
freq.terms=findFreqTerms(myTdm, lowfreq=60)
plot(myTdm, term = freq.terms, corThreshold = 0.50, weighting = T)
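
The plot above only shows pairwise correlations between frequent terms. To mine explicit if-then rules with support, confidence and lift, as described above, the arules package could be used. The following is a minimal sketch under the assumption that each news item is treated as a transaction of the terms it contains; arules is not part of the original analysis and the thresholds are purely illustrative.

library(arules)
# binary incidence matrix: documents as rows, TRUE where a term occurs
incidence <- t(as.matrix(myTdm)) > 0
newsTrans <- as(incidence, "transactions")
# mine if-then rules with illustrative support and confidence thresholds
rules <- apriori(newsTrans, parameter = list(support = 0.05, confidence = 0.8))
# show the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))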

Word Cloud

After building the term-document matrix, we can show the importance of words with a word cloud (also known as a tag cloud), which can be easily produced with package wordcloud [Fellows, 2013]. In the code below, we first convert the term-document matrix to a normal matrix and then calculate word frequencies. With wordcloud(), the first two parameters give a list of words and their frequencies. Words with frequency below 80 are not plotted, as specified by min.freq=80. By setting random.order=F, frequent words are plotted first, which makes them appear in the center of the cloud. Colors can be set to gray levels based on word frequency (shown commented out below), generated with rainbow(), or, as here, taken from an RColorBrewer palette.

m <- as.matrix(myTdm)
# calculate the frequency of words and sort it descendingly by frequency
wordFreq <- sort(rowSums(m), decreasing=TRUE)
# word cloud
set.seed(375) # to make it reproducible
col <- brewer.pal(5,"Dark2")
grayLevels <- gray( (wordFreq+10) / (max(wordFreq)+10) )
#wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=70, random.order=F,
#        colors=grayLevels)
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=80, rot.per=0.5, scale=c(4,1),
          random.color=T, random.order=F,colors=col)

The above word cloud clearly shows again that "infosys", "percent" and "vishalsikka" are the top three words, which confirms that the news covers the percentage growth of the Infosys stock and the new CEO Vishal Sikka. Some other important words are "company", "will", "stock", "growth" and "ceo", which shows that the focus is on the growth of the company's stock after the appointment of the new CEO. Another set of frequent words, "last", "change", "murthy" and "high", comes from news recalling that last year, when Narayana Murthy returned to the management, high growth was seen in the Infosys stock. There is also some news on the topic of employee attrition, as indicated by the words "attrition" and "people" in the cloud.

Clustering Words

We then try to find clusters of words with hierarchical clustering. Sparse terms are removed, so that the plot of the clustering will not be crowded with words. Then the distances between terms are calculated with dist() after scaling. After that, the terms are clustered with hclust() and the dendrogram is cut into 10 clusters. The agglomeration method is set to Ward's method, which merges the pair of clusters that leads to the minimum increase in within-cluster variance. Some other options are single linkage, complete linkage, average linkage, median and centroid. Details about different agglomeration methods can be found in data mining textbooks [Han and Kamber, 2000, Hand et al., 2001, Witten and Frank, 2005].

# remove sparse terms
tdm2 <- removeSparseTerms(myTdm, sparse = 0.80)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")

plot(fit)
rect.hclust(fit, k = 10) # cut tree into 10 clusters

In the above dendrogram, we can see the topics in the news. The words "buy", "may" and "one" are clustered into one group, because a couple of news items express the positive sentiment that one may buy the stock hoping for growth. The second cluster from the left comprises "year", "ceo" and "new"; these are clustered together because of news about the appointment of the new CEO, which was expected within the year.
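
Beyond reading the dendrogram, the cluster membership of each term can be extracted with cutree() (a small addition, not part of the original analysis):

# assign each term to one of the 10 clusters used above
groups <- cutree(fit, k = 10)
table(groups)                 # cluster sizes
split(names(groups), groups)  # terms in each cluster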

Clustering News

News items are clustered below with the k-means and k-medoids algorithms.

Clustering News with the k-means Algorithm

We first try k-means clustering, which takes the values in the matrix as numeric. We transpose the term-document matrix to a document-term one. The news items are then clustered with kmeans() with the number of clusters set to 100. After that, we check the popular words in the first ten clusters and also the cluster centers. Note that a fixed random seed is set with set.seed() before running kmeans(), so that the clustering result can be reproduced.

m3 <- t(m2) # transpose the matrix to cluster documents (news)
set.seed(122) # set a fixed random seed
k <- 100 # number of clusters
kmeansResult <- kmeans(m3, k)
kmeansResult$size
##   [1]  3  3  1 11  5  2 18  2  4  8  3  2  1  2 10  1  3  1  2 12  3  4  5
##  [24]  3  5  1  3  1 22  2  4  1  1  1  2  8  1  2  3  4  4 20  3  4  2  1
##  [47]  1  1  8  2  3  1  1  3  2  2  1  2  2  2  4  4  1  5  4  4  3  1  1
##  [70]  1  5  2  5  3  2  3  5  1  7  1  2  1  1  3  5  1  4  2  3  3  3  2
##  [93]  2  1  2  2  1  4  2  2
round(kmeansResult$centers, digits = 3) # cluster centers
##       buy    ceo company growth infosys   may    new   one percent stock
## 1   2.000  1.000   1.333  2.667   5.000 0.000  2.333 0.333   0.000 1.333
## 2   3.333  0.000   0.000  0.000   2.333 0.000  0.000 0.667   1.000 0.667
## 3   2.000  3.000   0.000  2.000   5.000 0.000  3.000 0.000   0.000 2.000
## 4   0.000  0.000   0.000  0.000   1.091 0.091  0.000 0.000   0.000 0.182
## 5   0.800  0.000   0.000  0.200   1.600 1.000  0.000 1.200   0.000 1.000
## 6   0.000  0.500   4.000  1.500   1.000 1.500  0.000 0.000   1.000 0.000
## 7   0.000  0.111   0.056  0.000   1.111 0.000  0.000 0.000   1.167 0.000
## 8   0.000  3.000   7.500  1.500   5.000 0.000  1.500 0.000   0.500 0.000
## 9   0.750  0.750   5.000  0.000   4.000 0.000  0.000 0.500   0.000 0.250
## 10  1.375  0.000   0.000  0.000   1.750 0.500  0.000 1.250   0.000 0.000
## 11  1.667  0.000   0.000  0.000   2.333 1.333  0.000 1.000   1.333 2.667
## 12  0.000  1.000   3.000  1.500  10.000 2.500  1.000 1.500   0.000 0.000
## 13  0.000  0.000   1.000  7.000   5.000 0.000  0.000 1.000   0.000 2.000
## 14  0.000  1.500   4.500  2.000   6.500 0.000  1.000 1.500   0.000 0.000
## 15  0.000  0.800   0.200  0.200   1.300 0.000  0.400 0.100   0.600 0.500
## 16  0.000  0.000   0.000  4.000   0.000 0.000  0.000 0.000   2.000 1.000
## 17  0.333  2.333   8.000  1.000   9.333 0.333  1.333 0.667   2.333 1.000
## 18  0.000  1.000   3.000  7.000   2.000 0.000  0.000 0.000   0.000 6.000
## 19  1.000  0.000   1.000  0.000   4.500 1.500  0.000 1.500   0.000 6.000
## 20  0.083  0.083   0.083  0.000   2.083 0.000  0.083 0.000   0.000 0.167
## 21  0.000  0.667   0.000  0.000   3.667 1.333  0.667 0.333   0.000 0.000
## 22  0.000  1.500   1.250  1.500   7.000 0.250  1.500 0.000   0.250 0.500
## 23  0.000  3.800   0.600  0.000   2.600 0.200  1.200 0.200   0.000 0.200
## 24  0.667  0.000   0.000  0.333   1.000 0.333  0.000 0.000   0.333 2.333
## 25  0.600  0.000   0.000  0.000   3.000 1.000  0.000 1.600   0.200 0.200
## 26  0.000  2.000   3.000  6.000   8.000 0.000  1.000 0.000   4.000 2.000
## 27  0.333  0.000   0.333  0.000   2.667 0.000  0.000 0.333   1.333 0.000
## 28  0.000  4.000   3.000  0.000   1.000 0.000  0.000 0.000   1.000 1.000
## 29  0.000  0.000   0.000  0.000   0.818 0.000  0.000 0.091   0.000 1.045
## 30  1.000  0.500   2.500  2.000   2.500 0.000  0.000 0.500   1.500 0.000
## 31  0.000  0.000   0.750  2.750   1.500 0.500  0.000 0.000   4.250 0.250
## 32  1.000  1.000   4.000  0.000   4.000 0.000  2.000 1.000   1.000 1.000
## 33  0.000  0.000   0.000  0.000   0.000 0.000  0.000 0.000   0.000 6.000
## 34  0.000  8.000   2.000  0.000   4.000 0.000  0.000 0.000   0.000 0.000
## 35  0.000  0.000   0.000  0.000   4.500 2.000  0.000 1.000   1.000 0.500
## 36  1.000  0.000   0.000  0.000   0.875 0.000  0.000 0.125   0.375 0.500
## 37  1.000  0.000   0.000  6.000   3.000 0.000  1.000 1.000   9.000 2.000
## 38  0.000  0.000   0.000  0.000   1.000 0.500  0.500 0.000   2.000 2.000
## 39  0.000  0.000   0.000  1.667   0.000 0.333  0.000 0.333   0.000 0.000
## 40  0.000  0.000   0.000  0.000   1.000 0.000  0.500 0.250   4.750 2.250
## 41  0.000  0.250   0.000  0.500   3.250 0.000  0.250 0.000   0.000 0.250
## 42  0.000  0.050   0.100  0.000   1.000 0.000  0.000 0.000   0.000 0.000
## 43  0.000  0.000   0.000  0.000   5.333 0.000  0.000 0.000   0.000 0.667
## 44  1.750  0.000   0.000  0.000   3.250 0.000  0.000 0.000   0.000 0.000
## 45  0.000  0.000   0.000  0.000   2.000 0.000  0.500 0.500   1.500 4.500
## 46  1.000  0.000   1.000  2.000   5.000 0.000  0.000 0.000   9.000 5.000
## 47  6.000  0.000   1.000  4.000   2.000 0.000  2.000 3.000   0.000 3.000
## 48  0.000  1.000   2.000  0.000   2.000 2.000  0.000 0.000   0.000 0.000
## 49  0.000  0.000   0.000  0.125   1.250 0.000  0.125 0.250   1.250 1.125
## 50  2.000  0.000   2.000  0.500   5.000 0.000  1.000 1.000   0.500 0.500
## 51  0.000  0.000   0.000  0.000   2.667 0.333  0.000 0.000   0.333 3.333
## 52  0.000 17.000  28.000 13.000  19.000 0.000 14.000 5.000   2.000 0.000
## 53  0.000  1.000   3.000  2.000   9.000 0.000  3.000 0.000   7.000 1.000
## 54  0.667  0.000   0.000  0.000   2.667 0.000  0.000 0.333   1.333 2.000
## 55  0.000  0.000   0.000  0.500   0.500 0.500  0.000 1.000   0.500 1.000
## 56  0.000  2.500   0.500  4.500   4.000 0.000  1.500 0.500  16.000 1.000
## 57  0.000  0.000   0.000  1.000   1.000 3.000  0.000 0.000   0.000 0.000
## 58  0.000  1.500   1.500  0.000   4.500 0.000  0.500 0.000   1.500 0.500
## 59  0.000  0.000   0.000  0.000   0.000 0.000  1.500 0.000   0.000 0.000
## 60  0.000  2.500   2.500  0.000   1.000 1.000  0.000 0.000   0.000 0.500
## 61  0.000  0.000   0.250  0.750   1.750 0.000  0.000 0.000   0.000 1.000
## 62  1.250  2.250   2.250  0.000   4.500 2.000  1.250 0.250   0.250 1.250
## 63  0.000  8.000  10.000  1.000   2.000 0.000  3.000 3.000   0.000 0.000
## 64  0.000  0.400   1.800  0.000   1.800 0.000  0.200 0.000   2.800 0.400
## 65  0.250  0.000   0.250  3.750   1.750 0.250  0.000 0.000   0.000 0.250
## 66  0.000  1.750   1.750  1.000   3.250 0.500  1.000 0.000   0.250 0.000
## 67  0.000  0.667   0.333  0.000   1.667 0.000  2.000 1.000   0.000 0.000
## 68  3.000  1.000   2.000  8.000   6.000 0.000  1.000 0.000  19.000 1.000
## 69  0.000  4.000   2.000  5.000   6.000 0.000  3.000 2.000   2.000 1.000
## 70  0.000  0.000   4.000  4.000   4.000 0.000  0.000 0.000   0.000 6.000
## 71  0.000  0.000   0.200  0.400   0.000 0.000  0.200 0.000   1.000 0.000
## 72  0.000  0.000   0.000  0.000   0.500 0.500  0.000 2.000   0.500 1.500
## 73  0.000  0.000   0.400  0.400   1.600 0.000  0.200 0.200   0.800 0.000
## 74  0.000  0.000   0.000  0.333   1.333 0.000  0.333 0.000   5.333 0.667
## 75  1.500  3.000   5.000  0.500   6.500 0.500  1.500 2.000   0.000 0.500
## 76  0.333  0.000   1.333  0.000   1.000 0.333  0.667 0.000   0.667 0.333
## 77  0.000  0.000   0.000  0.200   2.400 1.000  0.000 0.200   0.000 0.000
## 78  4.000  0.000   0.000  0.000   2.000 2.000  0.000 1.000   0.000 1.000
## 79  0.000  0.000   0.143  0.000   1.286 0.000  0.000 0.000   0.857 0.000
## 80  0.000  9.000  12.000  0.000  11.000 2.000  7.000 8.000   1.000 0.000
## 81  1.500  1.500   0.500  3.000   2.500 0.000  1.500 1.000   0.000 4.000
## 82  0.000  4.000   9.000  0.000  16.000 1.000  3.000 0.000   3.000 1.000
## 83  1.000  0.000   1.000  0.000   2.000 0.000  3.000 1.000   3.000 0.000
## 84  0.000  0.000   0.000  0.333   3.333 1.333  0.000 1.000   0.000 1.000
## 85  0.000  0.000   0.000  1.000   0.800 0.000  0.000 0.000   2.800 1.600
## 86  0.000  0.000   5.000  2.000   4.000 0.000  1.000 0.000   5.000 0.000
## 87  0.000  0.000   0.000  1.000   1.250 0.250  0.000 0.000   0.000 0.000
## 88  1.000  0.500   2.000  0.000   8.000 0.000  0.000 0.000   2.500 1.500
## 89  0.000  0.333   0.667  0.000   2.333 0.000  0.000 0.667   0.333 2.333
## 90  0.000  0.667   0.667  2.333   1.667 0.333  0.667 0.000  10.667 0.000
## 91  0.333  0.667   1.333  0.667   3.000 0.000  0.333 0.667   0.000 0.333
## 92  0.000  5.500   2.500  0.000   4.000 0.500  0.000 1.000   1.000 0.000
## 93  0.000  2.000   5.000  4.000   3.500 2.000  0.000 0.000  22.500 0.000
## 94  0.000  6.000   4.000  0.000   6.000 0.000  6.000 1.000   3.000 2.000
## 95  0.000  0.000   0.000  0.500   3.000 0.000  0.000 0.500   2.500 0.000
## 96  0.000  0.000   0.000  0.000   0.500 2.000  0.000 0.500   0.500 1.000
## 97  1.000  0.000   0.000  0.000   3.000 4.000  0.000 1.000   0.000 1.000
## 98  0.250  0.750   2.250  0.500   1.250 0.000  0.500 0.000   0.250 0.250
## 99  0.500  0.000   0.000  2.000   1.500 0.000  0.000 0.000   5.000 3.000
## 100 0.000  2.500   4.000  2.000   3.000 0.500  1.500 1.500   1.000 0.500
##       top vishalsikka   will  year
## 1   0.333       3.000  0.333 0.333
## 2   0.000       0.667  0.667 0.000
## 3   0.000       2.000  5.000 0.000
## 4   0.000       0.000  1.273 0.000
## 5   0.000       0.000  0.000 0.000
## 6   2.000       2.500 10.000 0.500
## 7   0.000       0.000  0.000 0.000
## 8   1.000       3.500  1.500 1.500
## 9   0.250       1.250  2.000 0.250
## 10  0.000       0.000  0.125 0.000
## 11  0.000       0.000  0.000 0.000
## 12  0.500       3.000  1.500 0.000
## 13  4.000       0.000  1.000 2.000
## 14  0.500      19.000  3.500 1.000
## 15  0.100       2.300  0.000 0.100
## 16  0.000       0.000  2.000 0.000
## 17  0.667       4.000  3.000 1.333
## 18  1.000       2.000  0.000 0.000
## 19  0.000       2.000  2.000 0.000
## 20  0.167       0.167  0.000 0.083
## 21  0.333       3.667  1.000 0.000
## 22  1.000       1.750  2.000 0.250
## 23  0.600       3.000  0.400 0.400
## 24  0.000       0.000  2.667 0.000
## 25  0.000       0.000  1.200 0.200
## 26  2.000       6.000  3.000 1.000
## 27  0.667       0.000  1.000 0.333
## 28  0.000       0.000  0.000 0.000
## 29  0.409       0.000  0.000 0.045
## 30  0.500       0.500  3.500 0.000
## 31  0.000       0.000  0.250 0.250
## 32  0.000       3.000  1.000 6.000
## 33  0.000       0.000  2.000 1.000
## 34  3.000       0.000  0.000 4.000
## 35  0.000       0.000  1.500 0.000
## 36  0.000       0.000  0.500 0.000
## 37  0.000       0.000  0.000 0.000
## 38  0.500       0.000  0.000 0.500
## 39  0.000       0.000  0.000 0.333
## 40  0.000       0.000  0.000 1.000
## 41  0.250       0.000  1.500 0.250
## 42  0.200       0.000  0.000 0.000
## 43  0.000       0.667  0.333 0.000
## 44  0.000       0.000  0.000 0.000
## 45  0.000       0.000  1.000 0.500
## 46  0.000       0.000  0.000 3.000
## 47  1.000       0.000  7.000 0.000
## 48  1.000      10.000  0.000 0.000
## 49  0.125       0.000  0.000 0.000
## 50  0.500       0.000  1.500 1.000
## 51  0.000       0.000  0.000 0.000
## 52  1.000       6.000  8.000 7.000
## 53  0.000       3.000  1.000 4.000
## 54  0.000       0.000  0.333 0.000
## 55  0.000       0.000  3.500 1.000
## 56  1.000       1.000  0.500 3.500
## 57  0.000       0.000  1.000 0.000
## 58  0.000       5.500  0.000 0.500
## 59  0.000       0.000  1.000 0.500
## 60  3.000       0.000  0.000 1.000
## 61  0.750       0.000  0.000 0.000
## 62  0.250       6.500  4.000 0.000
## 63  1.000      12.000  4.000 1.000
## 64  0.200       0.000  0.000 0.800
## 65  0.250       0.000  0.000 0.000
## 66  0.250       0.250  0.250 0.500
## 67  0.000       0.667  3.667 1.000
## 68  3.000       2.000  0.000 0.000
## 69  0.000       2.000  1.000 1.000
## 70  1.000       0.000  0.000 0.000
## 71  0.000       0.200  0.400 0.000
## 72  0.000       0.000  1.000 0.000
## 73  0.000       0.000  0.200 2.000
## 74  0.000       0.000  0.333 0.000
## 75  1.000       8.500  1.500 0.000
## 76  1.000       0.000  1.000 0.000
## 77  0.000       0.000  0.000 0.000
## 78  0.000       0.000  0.000 0.000
## 79  0.000       0.000  0.000 1.000
## 80  1.000       9.000 19.000 3.000
## 81  3.000       0.000  3.000 1.000
## 82  3.000      11.000  7.000 2.000
## 83  1.000       0.000  2.000 2.000
## 84  0.000       0.000  0.000 0.000
## 85  0.000       0.000  0.400 0.000
## 86  0.000       0.000  0.000 3.000
## 87  0.000       0.000  0.250 0.000
## 88  1.500       2.500  3.000 1.000
## 89  0.000       0.333  1.333 0.000
## 90  0.000       0.667  1.000 2.333
## 91  2.000       2.333  0.667 0.000
## 92  0.000      11.000  6.500 0.500
## 93  0.500       2.000  5.000 3.500
## 94  2.000       0.000  1.000 3.000
## 95  0.000       0.000  0.500 2.000
## 96  0.000       0.000  0.000 0.000
## 97  0.000       0.000  0.000 0.000
## 98  0.250       2.750  1.750 0.750
## 99  0.000       0.000  0.000 2.500
## 100 2.000       0.000  1.000 0.500
ss <- 10 # inspect the first 10 clusters
for (i in 1:ss) {
  cat(paste("cluster ", i, ": ", sep = ""))
  n <- kmeansResult$size[i]
  cat(paste("Number of news ", n, ": ", sep = ""))
  # top 5 words of the cluster center
  s <- sort(kmeansResult$centers[i, ], decreasing = T)
  cat(names(s)[1:5], "\n")
  # to print the news items of every cluster:
  # print(InfosysNews.Details[which(kmeansResult$cluster==i)])
}
## cluster 1: Number of news 3: infosys vishalsikka growth new buy 
## cluster 2: Number of news 3: buy infosys percent one stock 
## cluster 3: Number of news 1: infosys will ceo new buy 
## cluster 4: Number of news 11: will infosys stock may buy 
## cluster 5: Number of news 5: infosys one may stock buy 
## cluster 6: Number of news 2: will company vishalsikka top growth 
## cluster 7: Number of news 18: percent infosys ceo company buy 
## cluster 8: Number of news 2: company infosys vishalsikka ceo growth 
## cluster 9: Number of news 4: company infosys will vishalsikka buy 
## cluster 10: Number of news 8: infosys buy one may will

From the above top words and cluster centers, we can see that the clusters cover different topics. For instance, cluster 1 focuses on Infosys stock growth and the new CEO Vishal Sikka, cluster 2 on advice to buy the Infosys stock, and cluster 6 on the expectation that the company will grow faster under the newly appointed CEO. In general, the clusters talk about buying the Infosys stock and about future growth after the appointment of the new CEO Vishal Sikka.
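
To read the raw news items assigned to a particular cluster, the cluster labels can be used to index the original text (a small helper, assuming the rows of m3 are in the same order as the original news items):

# show the first three news items assigned to cluster 1
head(as.character(InfosysNews.Details)[kmeansResult$cluster == 1], 3)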

Clustering News with the k-medoids Algorithm

We then try k-medoids clustering with the Partitioning Around Medoids (PAM) algorithm, which uses medoids (representative objects) instead of means to represent clusters. It is more robust to noise and outliers than k-means clustering, and it provides a silhouette plot to show the quality of the clustering. In the example below, we use the function pamk() from package fpc [Hennig, 2010], which calls pam() with the number of clusters estimated by the optimum average silhouette width.

# Clustering news with the k-medoids Algorithm
library(fpc)
# partitioning around medoids with estimation of number of clusters
pamResult <- pamk(m3, metric="manhattan")
k <- pamResult$nc # number of clusters identified
pamResult <- pamResult$pamobject
# print cluster medoids
for (i in 1:k) {
  cat("cluster", i, ": ",
      colnames(pamResult$medoids)[which(pamResult$medoids[i,]==1)], "\n")
}
## cluster 1 :  infosys 
## cluster 2 :  ceo stock
# plot clustering result
layout(matrix(c(1, 2), 1, 2)) # set to two graphs per page
#layout(matrix(1)) # change back to one graph per page
par(mai=c(0.5,0.5,0.5,0.5))
plot(pamResult, col.p = pamResult$clustering)

In the figure above, the first chart is a 2D "clusplot" (clustering plot) of the k clusters, and the second one shows their silhouettes. With the silhouette, a large silhouette value si (close to 1) suggests that the corresponding observations are very well clustered, a small si (around 0) means that the observation lies between two clusters, and observations with a negative si are probably placed in the wrong cluster. The average silhouette width is 0.275, which suggests that the clusters are not well separated from one another.
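
The average silhouette width quoted above can also be read directly from the pam object (assuming, as here, that more than one cluster was found so that silhouette information is available):

# average silhouette width of the k-medoids clustering
pamResult$silinfo$avg.width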

Topic Modelling

Topic modelling groups terms that tend to occur together into topics. Here Latent Dirichlet Allocation (LDA) from package topicmodels is used to find eight topics in the news, list the top terms of each topic, and plot how the topics are distributed over time.

dtm <- as.DocumentTermMatrix(myTdm)
library(topicmodels)
lda <- LDA(dtm, k = 8) # find 8 topics
term <- terms(lda, 4) # first 4 terms of every topic
term
##      Topic 1   Topic 2       Topic 3       Topic 4   Topic 5   Topic 6    
## [1,] "infosys" "infosys"     "infosys"     "infosys" "percent" "infosys"  
## [2,] "company" "vishalsikka" "company"     "percent" "growth"  "deal"     
## [3,] "will"    "will"        "vishalsikka" "stock"   "quarter" "companies"
## [4,] "ceo"     "company"     "will"        "shares"  "infosys" "indian"   
##      Topic 7   Topic 8      
## [1,] "infosys" "infosys"    
## [2,] "growth"  "vishalsikka"
## [3,] "stock"   "company"    
## [4,] "will"    "ceo"
term <- apply(term, MARGIN = 2, paste, collapse = ", ")
topic <- topics(lda, 1)
topics <- data.frame(date=as.Date(InfosysNews$Date), topic)
par(mai=c(1,1.2,1,0.5))
qplot(date, ..count.., data=topics, geom="density",
      fill=term[topic], position="stack")

Sentiment Analysis

Sentiment scores can help predict the movement of stocks. Sentence-level sentiment scores are often more reliable because of the larger sample size. Below, each news item is scored by counting its matches against lists of positive and negative opinion words.

#Sentiment Analysis
# lists of positive and negative opinion words
pos = readLines("positive-words.txt")
neg = readLines("negative-words.txt")

score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)
  
  scores = laply(sentences, function(sentence, pos.words, neg.words) {
    # split the sentence into words; str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)
    # compare our words to the dictionaries of positive & negative terms;
    # match() returns the position of the matched term or NA
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)
    # we just want a TRUE/FALSE
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
    # conveniently, TRUE/FALSE is treated as 1/0 by sum()
    score = sum(pos.matches) - sum(neg.matches)
    return(score)
  }, pos.words, neg.words, .progress=.progress)
  
  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}

analysis = score.sentiment(testReplace, pos, neg,.progress='text')
table(analysis$score)
hist(analysis$score)

The above histogram shows that the sentiment of the market towards the Infosys stock is predominantly positive.
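
One way to quantify this is to summarise the sign and the mean of the scores, using the analysis object computed above:

# proportion of news items with negative (-1), neutral (0) and positive (1) scores
prop.table(table(sign(analysis$score)))
# average sentiment score across all news items
mean(analysis$score)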