The use of social media has grown significantly in recent years, and with it the amount of information generated by users of social media. Here we apply text mining to news analysis in order to predict trends of a stock in the stock market. News related to the Infosys stock from June 2014 to September 2014 is considered for the analysis.
We discuss the application of tokenisation, association analysis, word clouds and clustering to news media. This is demonstrated by analysing Infosys stock news from Money Control, Reuters India, CNBC-TV18, ET Now and HT News. The results of these analyses help identify keywords and concepts in the news media data and can facilitate the use of this information by traders and investors. By analysing this information and applying the results in the relevant area, traders and investors can proactively address potential market and investment opportunities.
Keywords: social media, news, analytics, data mining, text mining, clustering, association analysis, sentiment analysis.
News about the Infosys stock is used as the data to analyze. The process starts with extracting text from news related to the Infosys stock from the news media. The extracted text is then transformed to build a document-term matrix. After that, frequent words and associations are found from the matrix. A word cloud is used to present important words in the documents. Finally, words and news items are clustered to find groups of words and groups of news.
A few important packages are used in the examples: twitteR, tm and wordcloud. Package twitteR [Gentry, 2013] provides access to Twitter data, tm [Feinerer and Hornik, 2013] provides functions for text mining, and wordcloud [Fellows, 2013] visualizes the result with a word cloud.
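Besides these, ggplot2, RColorBrewer and RWeka are used below for plotting, colour palettes and n-gram tokenisation. A minimal sketch of the library() calls for the packages used directly in the code that follows is given here; plyr and stringr are loaded inside the sentiment-scoring function via require(), and graph, Rgraphviz, fpc and topicmodels are loaded where they are first used.
# load the packages used directly in the analysis below
library(tm)            # corpus handling and term-document matrices
library(ggplot2)       # bar plots and the topic-density plot
library(RColorBrewer)  # colour palettes (brewer.pal) for the word cloud
library(wordcloud)     # word cloud visualisation
library(RWeka)         # n-gram tokenisation (NGramTokenizer, Weka_control)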
News text related to the Infosys stock, taken from Money Control, Reuters India, CNBC-TV18, ET Now and HT News, is stored in a CSV file exported from Excel. The news texts are loaded with the code below.
# read the news data from the CSV file
InfosysNews <- read.csv("InfosysTestWeka.csv")
InfosysNews.Details <-InfosysNews$Details
testReplace<-InfosysNews.Details
The news items are converted to a corpus, which is a collection of text documents. After that, the corpus can be processed with functions provided in package tm [Feinerer and Hornik, 2013].
# build a corpus, and specify the source to be character vectors
InfosysCorpus<-Corpus(VectorSource(testReplace))
After that, the corpus needs a couple of transformations, including changing letters to lower case and removing punctuation, numbers and stop words. The general English stop-word list is tailored here by adding words such as "also" and "always".
document.collection <-InfosysCorpus
document.collection <-tm_map(document.collection,stripWhitespace)
document.collection <- tm_map(document.collection, content_transformer(tolower))
document.collection <-tm_map(document.collection,removeNumbers)
document.collection <-tm_map(document.collection,removePunctuation)
document.collection <- tm_map(document.collection,removeWords, stopwords("english"))
more.stop.words <- c("also","always","among","can","come","comes","even",
"comming","say","saying","says","said","still","two","three","ltd")
document.collection <- tm_map(document.collection,
removeWords, more.stop.words)
dataframe<-data.frame(text=unlist(sapply(document.collection, `[`, "content")), stringsAsFactors=F)
A term-document matrix represents the relationship between terms and documents, where each row stands for a term and each column for a document, and an entry is the number of occurrences of the term in the document.
initial.tdm <- TermDocumentMatrix(document.collection)
termDocMatrix <- as.matrix(initial.tdm)
Ndoc <- length(InfosysCorpus)
#write.csv(termDocMatrix, "test.csv")
initial.tdm
## <<TermDocumentMatrix (terms: 3891, documents: 343)>>
## Non-/sparse entries: 18144/1316469
## Sparsity : 99%
## Maximal term length: 24
## Weighting : term frequency (tf)
As we can see from the above result, the term-document matrix is composed of 3891 terms and 343 documents. It is very sparse, with 99% of the entries being zero.
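To see a few actual entries of the matrix, where each entry is the count of a term in a document as described above, a small sub-matrix can be inspected; a minimal sketch using tm's inspect():
# counts of the first five terms in the first three documents
inspect(initial.tdm[1:5, 1:3])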
Let us now have a look at the popular words and the associations between words. Note that there are 343 news items in total.
# Frequent Words and Association
# keep all words, including very short ones
myTdm <- TermDocumentMatrix(document.collection, control=list(wordLengths=c(1,Inf)))
# Frequent Words
findFreqTerms(myTdm, lowfreq=60)
## [1] "around" "attrition" "back" "big" "board"
## [6] "business" "buy" "ceo" "change" "client"
## [11] "companies" "company" "continue" "deal" "executive"
## [16] "gain" "going" "good" "growth" "high"
## [21] "india" "indian" "infosys" "investors" "large"
## [26] "last" "like" "management" "market" "may"
## [31] "murthy" "new" "now" "one" "people"
## [36] "percent" "positive" "qoq" "quarter" "revenue"
## [41] "rupee" "see" "services" "shares" "software"
## [46] "stock" "take" "time" "top" "vishalsikka"
## [51] "well" "will" "year"
# Arrange frequent words in descending order of frequency
wordmatrix <- as.matrix(myTdm)
vsort <- sort(rowSums(wordmatrix), decreasing=TRUE)
head(vsort,55)
## infosys percent vishalsikka company will stock
## 808 393 329 315 302 245
## growth ceo year new one revenue
## 215 207 145 140 122 118
## management buy top time quarter services
## 117 110 109 106 103 100
## may market companies last change murthy
## 99 98 92 91 88 86
## high see shares continue india now
## 83 83 83 81 80 80
## deal rupee software client business big
## 79 79 77 73 71 70
## board indian large like executive qoq
## 70 70 70 70 69 68
## take around gain positive well investors
## 67 66 65 65 65 64
## back going good attrition people margin
## 63 63 63 62 60 59
## level
## 58
termFrequency <- rowSums(as.matrix(myTdm))
termFrequency <- subset(termFrequency, termFrequency>=60)
df <- data.frame(term=names(termFrequency), freq=termFrequency)
# Plot frequently occurring words
ggplot(df, aes(x=term, y=freq)) + geom_bar(stat="identity") +
xlab("Terms") + ylab("Count") + coord_flip()
In the code above, findFreqTerms() finds frequent terms with frequency no less than 60. Note that they are ordered alphabetically, not by frequency or popularity. To show the top frequent words visually, we make a barplot of them. From the term-document matrix we derive the frequency of each term with rowSums(), select terms with frequency of at least 60 and show them with a barplot using package ggplot2 [Wickham, 2009], where geom_bar(stat="identity") specifies a barplot and coord_flip() swaps the x- and y-axes. The barplot clearly shows that the three most frequent words are "infosys", "percent" and "vishalsikka".
We can also find which words are highly associated with a given word using findAssocs(). Below we first find terms associated with "infosys" with correlation no less than 0.5, and then terms associated with "vishalsikka"; in each case the words are ordered by their correlation with the given term.
# which words are associated with "infosys"?
findAssocs(myTdm, 'infosys', 0.50)
## infosys
## company 0.69
## new 0.61
## change 0.60
## ceo 0.57
## management 0.54
## vishalsikka 0.52
## fill 0.51
## murthy 0.51
## challenge 0.50
## founder 0.50
# which words are associated with "vishalsikka"?
findAssocs(myTdm, 'vishalsikka', 0.50)
## vishalsikka
## sap 0.66
## choice 0.60
## executive 0.55
## drive 0.53
## ceo 0.52
## company 0.52
## hana 0.52
## infosys 0.52
## board 0.51
## chief 0.51
As we can see, single-word frequencies alone do not provide much information. Let us therefore perform n-gram tokenisation of the corpus using RWeka, first with bigrams (2-grams) and then with trigrams (3-grams).
#Bigram Tokenise
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
dtm <- DocumentTermMatrix(document.collection, control = list(tokenize = BigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)
pl <- ggplot(subset(wof, freq > 20) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="blue")
pl + theme(axis.text.x=element_text(angle=90))
#Trigram Tokenise
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
dtm <- DocumentTermMatrix(document.collection, control = list(tokenize = TrigramTokenizer))
freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
wof <- data.frame(word=names(freq), freq=freq)
pl <- ggplot(subset(wof, freq > 7) ,aes(word, freq))
pl <- pl + geom_bar(stat="identity", fill="brown")
pl + theme(axis.text.x=element_text(angle=90))
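To read the most frequent n-grams directly rather than off the plot, the sorted frequency table built above can simply be printed; a minimal sketch using the wof data frame from the trigram step:
# top 10 trigrams by frequency
head(wof, 10)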
An analysis that highlights combinations of words, regardless of where they occur, can also help an analyst understand the key concepts in a body of text data. The technique used to highlight combinations of words is association analysis. This analysis determines the likelihood of a combination of items occurring together, as well as a confidence around the projection. Ultimately, association analysis produces a set of if-then rules (if item A is present in a transaction, then item B will be present as well) and the lift associated with each rule.
#Association rule
# source("http://bioconductor.org/biocLite.R")
# biocLite("Rgraphviz")
library(graph)
##
## Attaching package: 'graph'
##
## The following object is masked from 'package:plyr':
##
## join
library(Rgraphviz)
##
## Attaching package: 'Rgraphviz'
##
## The following object is masked from 'package:twitteR':
##
## name
freq.terms=findFreqTerms(myTdm, lowfreq=60)
plot(myTdm, term = freq.terms, corThreshold = 0.50, weighting = T)
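The plot above only links frequent terms whose pairwise correlation is at least 0.5. To obtain actual if-then rules with support, confidence and lift, as described above, the arules package could be applied to the document-term incidence matrix; the following is a minimal sketch under that assumption (arules is not used elsewhere in this analysis, and the thresholds are illustrative):
library(arules)
# treat each news item as a "transaction" whose items are the terms it contains
trans <- as(t(as.matrix(myTdm)) > 0, "transactions")
# mine if-then rules between terms; support and confidence thresholds are illustrative
rules <- apriori(trans, parameter = list(support = 0.05, confidence = 0.8, minlen = 2))
# show the ten rules with the highest lift
inspect(head(sort(rules, by = "lift"), 10))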
After building a term-document matrix, we can show the importance of words with a word cloud (also known as a tag cloud), which can be easily produced with package wordcloud [Fellows, 2013]. In the code below, we first convert the term-document matrix to a normal matrix and then calculate word frequencies. With wordcloud(), the first two parameters give a list of words and their frequencies. Words with frequency below 80 are not plotted, as specified by min.freq=80. By setting random.order=F, frequent words are plotted first, which makes them appear in the centre of the cloud. Here the colours are taken from a brewer.pal() palette; a grayscale cloud with gray levels based on word frequency can be produced with the commented-out grayLevels alternative, and a fully coloured cloud can also be generated by setting colors with rainbow().
m <- as.matrix(myTdm)
# calculate the frequency of words and sort it descendingly by frequency
wordFreq <- sort(rowSums(m), decreasing=TRUE)
# word cloud
set.seed(375) # to make it reproducible
col <- brewer.pal(5,"Dark2")
grayLevels <- gray( (wordFreq+10) / (max(wordFreq)+10) )
#wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=70, random.order=F,
# colors=grayLevels)
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=80, rot.per=0.5, scale=c(4,1),
random.color=T, random.order=F,colors=col)
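As noted above, a fully coloured cloud can be produced by simply passing rainbow() colours instead of the brewer.pal() palette; a minimal variant of the call above (the number of colours, 8, is arbitrary):
wordcloud(words=names(wordFreq), freq=wordFreq, min.freq=80, rot.per=0.5, scale=c(4,1),
          random.order=F, colors=rainbow(8))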
The above word cloud clearly shows again that "infosys", "percent" and "vishalsikka" are the top three words, which confirms that the news covers the growth (in percent) of the Infosys stock and the new CEO Vishal Sikka. Some other important words are "company", "will", "stock", "growth" and "ceo", which shows that the coverage focuses on the growth of the company's stock following the appointment of the new CEO. Another set of frequent words, "last", "change", "murthy" and "high", comes from news recalling that last year, when Narayana Murthy returned to the management, high growth was seen in the Infosys stock. There is also some news on the topic of employee attrition, as indicated by the words "attrition" and "people" in the cloud.
We then try to find clusters of words with hierarchical clustering. Sparse terms are removed first, so that the plot of the clustering is not crowded with words. Then the distances between terms are calculated with dist() after scaling. After that, the terms are clustered with hclust() and the dendrogram is cut into 10 clusters. The agglomeration method is set to Ward's method (ward.D2), which minimises the increase in variance when two clusters are merged. Some other options are single linkage, complete linkage, average linkage, median and centroid. Details about the different agglomeration methods can be found in data mining textbooks [Han and Kamber, 2000, Hand et al., 2001, Witten and Frank, 2005].
# remove sparse terms
tdm2 <- removeSparseTerms(myTdm, sparse = 0.80)
m2 <- as.matrix(tdm2)
# cluster terms
distMatrix <- dist(scale(m2))
fit <- hclust(distMatrix, method = "ward.D2")
plot(fit)
rect.hclust(fit, k = 10) # cut tree into 10 clusters
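If the cluster membership of each term is needed, rather than only the rectangles drawn on the dendrogram, cutree() can be used; a minimal sketch with the same number of clusters:
# assign each term to one of the 10 clusters and list the terms by cluster
groups <- cutree(fit, k = 10)
sort(groups)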
In the above dendrogram, we can see the topics in the news. The words "buy", "may" and "one" are clustered into one group because a number of news items use them together; this indicates a positive sentiment that one may buy the stock expecting it to grow. The second cluster from the left comprises "year", "ceo" and "new"; these are clustered together because of news about the appointment of a new CEO, which was expected within the year.
News items are clustered below with the k-means and the k-medoids algorithms.
We first try k-means clustering, which takes the values in the matrix as numeric. We transpose the term-document matrix to a document-term one. The news items are then clustered with kmeans() with the number of clusters set to 100. After that, we check the popular words and the cluster centres for the first few clusters. Note that a fixed random seed is set with set.seed() before running kmeans(), so that the clustering result can be reproduced.
m3 <- t(m2) # transpose the matrix to cluster documents (news)
set.seed(122) # set a fixed random seed
k <- 100 # number of clusters
kmeansResult <- kmeans(m3, k)
kmeansResult$size
## [1] 3 3 1 11 5 2 18 2 4 8 3 2 1 2 10 1 3 1 2 12 3 4 5
## [24] 3 5 1 3 1 22 2 4 1 1 1 2 8 1 2 3 4 4 20 3 4 2 1
## [47] 1 1 8 2 3 1 1 3 2 2 1 2 2 2 4 4 1 5 4 4 3 1 1
## [70] 1 5 2 5 3 2 3 5 1 7 1 2 1 1 3 5 1 4 2 3 3 3 2
## [93] 2 1 2 2 1 4 2 2
round(kmeansResult$centers, digits = 3) # cluster centers
## buy ceo company growth infosys may new one percent stock
## 1 2.000 1.000 1.333 2.667 5.000 0.000 2.333 0.333 0.000 1.333
## 2 3.333 0.000 0.000 0.000 2.333 0.000 0.000 0.667 1.000 0.667
## 3 2.000 3.000 0.000 2.000 5.000 0.000 3.000 0.000 0.000 2.000
## 4 0.000 0.000 0.000 0.000 1.091 0.091 0.000 0.000 0.000 0.182
## 5 0.800 0.000 0.000 0.200 1.600 1.000 0.000 1.200 0.000 1.000
## 6 0.000 0.500 4.000 1.500 1.000 1.500 0.000 0.000 1.000 0.000
## 7 0.000 0.111 0.056 0.000 1.111 0.000 0.000 0.000 1.167 0.000
## 8 0.000 3.000 7.500 1.500 5.000 0.000 1.500 0.000 0.500 0.000
## 9 0.750 0.750 5.000 0.000 4.000 0.000 0.000 0.500 0.000 0.250
## 10 1.375 0.000 0.000 0.000 1.750 0.500 0.000 1.250 0.000 0.000
## 11 1.667 0.000 0.000 0.000 2.333 1.333 0.000 1.000 1.333 2.667
## 12 0.000 1.000 3.000 1.500 10.000 2.500 1.000 1.500 0.000 0.000
## 13 0.000 0.000 1.000 7.000 5.000 0.000 0.000 1.000 0.000 2.000
## 14 0.000 1.500 4.500 2.000 6.500 0.000 1.000 1.500 0.000 0.000
## 15 0.000 0.800 0.200 0.200 1.300 0.000 0.400 0.100 0.600 0.500
## 16 0.000 0.000 0.000 4.000 0.000 0.000 0.000 0.000 2.000 1.000
## 17 0.333 2.333 8.000 1.000 9.333 0.333 1.333 0.667 2.333 1.000
## 18 0.000 1.000 3.000 7.000 2.000 0.000 0.000 0.000 0.000 6.000
## 19 1.000 0.000 1.000 0.000 4.500 1.500 0.000 1.500 0.000 6.000
## 20 0.083 0.083 0.083 0.000 2.083 0.000 0.083 0.000 0.000 0.167
## 21 0.000 0.667 0.000 0.000 3.667 1.333 0.667 0.333 0.000 0.000
## 22 0.000 1.500 1.250 1.500 7.000 0.250 1.500 0.000 0.250 0.500
## 23 0.000 3.800 0.600 0.000 2.600 0.200 1.200 0.200 0.000 0.200
## 24 0.667 0.000 0.000 0.333 1.000 0.333 0.000 0.000 0.333 2.333
## 25 0.600 0.000 0.000 0.000 3.000 1.000 0.000 1.600 0.200 0.200
## 26 0.000 2.000 3.000 6.000 8.000 0.000 1.000 0.000 4.000 2.000
## 27 0.333 0.000 0.333 0.000 2.667 0.000 0.000 0.333 1.333 0.000
## 28 0.000 4.000 3.000 0.000 1.000 0.000 0.000 0.000 1.000 1.000
## 29 0.000 0.000 0.000 0.000 0.818 0.000 0.000 0.091 0.000 1.045
## 30 1.000 0.500 2.500 2.000 2.500 0.000 0.000 0.500 1.500 0.000
## 31 0.000 0.000 0.750 2.750 1.500 0.500 0.000 0.000 4.250 0.250
## 32 1.000 1.000 4.000 0.000 4.000 0.000 2.000 1.000 1.000 1.000
## 33 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 6.000
## 34 0.000 8.000 2.000 0.000 4.000 0.000 0.000 0.000 0.000 0.000
## 35 0.000 0.000 0.000 0.000 4.500 2.000 0.000 1.000 1.000 0.500
## 36 1.000 0.000 0.000 0.000 0.875 0.000 0.000 0.125 0.375 0.500
## 37 1.000 0.000 0.000 6.000 3.000 0.000 1.000 1.000 9.000 2.000
## 38 0.000 0.000 0.000 0.000 1.000 0.500 0.500 0.000 2.000 2.000
## 39 0.000 0.000 0.000 1.667 0.000 0.333 0.000 0.333 0.000 0.000
## 40 0.000 0.000 0.000 0.000 1.000 0.000 0.500 0.250 4.750 2.250
## 41 0.000 0.250 0.000 0.500 3.250 0.000 0.250 0.000 0.000 0.250
## 42 0.000 0.050 0.100 0.000 1.000 0.000 0.000 0.000 0.000 0.000
## 43 0.000 0.000 0.000 0.000 5.333 0.000 0.000 0.000 0.000 0.667
## 44 1.750 0.000 0.000 0.000 3.250 0.000 0.000 0.000 0.000 0.000
## 45 0.000 0.000 0.000 0.000 2.000 0.000 0.500 0.500 1.500 4.500
## 46 1.000 0.000 1.000 2.000 5.000 0.000 0.000 0.000 9.000 5.000
## 47 6.000 0.000 1.000 4.000 2.000 0.000 2.000 3.000 0.000 3.000
## 48 0.000 1.000 2.000 0.000 2.000 2.000 0.000 0.000 0.000 0.000
## 49 0.000 0.000 0.000 0.125 1.250 0.000 0.125 0.250 1.250 1.125
## 50 2.000 0.000 2.000 0.500 5.000 0.000 1.000 1.000 0.500 0.500
## 51 0.000 0.000 0.000 0.000 2.667 0.333 0.000 0.000 0.333 3.333
## 52 0.000 17.000 28.000 13.000 19.000 0.000 14.000 5.000 2.000 0.000
## 53 0.000 1.000 3.000 2.000 9.000 0.000 3.000 0.000 7.000 1.000
## 54 0.667 0.000 0.000 0.000 2.667 0.000 0.000 0.333 1.333 2.000
## 55 0.000 0.000 0.000 0.500 0.500 0.500 0.000 1.000 0.500 1.000
## 56 0.000 2.500 0.500 4.500 4.000 0.000 1.500 0.500 16.000 1.000
## 57 0.000 0.000 0.000 1.000 1.000 3.000 0.000 0.000 0.000 0.000
## 58 0.000 1.500 1.500 0.000 4.500 0.000 0.500 0.000 1.500 0.500
## 59 0.000 0.000 0.000 0.000 0.000 0.000 1.500 0.000 0.000 0.000
## 60 0.000 2.500 2.500 0.000 1.000 1.000 0.000 0.000 0.000 0.500
## 61 0.000 0.000 0.250 0.750 1.750 0.000 0.000 0.000 0.000 1.000
## 62 1.250 2.250 2.250 0.000 4.500 2.000 1.250 0.250 0.250 1.250
## 63 0.000 8.000 10.000 1.000 2.000 0.000 3.000 3.000 0.000 0.000
## 64 0.000 0.400 1.800 0.000 1.800 0.000 0.200 0.000 2.800 0.400
## 65 0.250 0.000 0.250 3.750 1.750 0.250 0.000 0.000 0.000 0.250
## 66 0.000 1.750 1.750 1.000 3.250 0.500 1.000 0.000 0.250 0.000
## 67 0.000 0.667 0.333 0.000 1.667 0.000 2.000 1.000 0.000 0.000
## 68 3.000 1.000 2.000 8.000 6.000 0.000 1.000 0.000 19.000 1.000
## 69 0.000 4.000 2.000 5.000 6.000 0.000 3.000 2.000 2.000 1.000
## 70 0.000 0.000 4.000 4.000 4.000 0.000 0.000 0.000 0.000 6.000
## 71 0.000 0.000 0.200 0.400 0.000 0.000 0.200 0.000 1.000 0.000
## 72 0.000 0.000 0.000 0.000 0.500 0.500 0.000 2.000 0.500 1.500
## 73 0.000 0.000 0.400 0.400 1.600 0.000 0.200 0.200 0.800 0.000
## 74 0.000 0.000 0.000 0.333 1.333 0.000 0.333 0.000 5.333 0.667
## 75 1.500 3.000 5.000 0.500 6.500 0.500 1.500 2.000 0.000 0.500
## 76 0.333 0.000 1.333 0.000 1.000 0.333 0.667 0.000 0.667 0.333
## 77 0.000 0.000 0.000 0.200 2.400 1.000 0.000 0.200 0.000 0.000
## 78 4.000 0.000 0.000 0.000 2.000 2.000 0.000 1.000 0.000 1.000
## 79 0.000 0.000 0.143 0.000 1.286 0.000 0.000 0.000 0.857 0.000
## 80 0.000 9.000 12.000 0.000 11.000 2.000 7.000 8.000 1.000 0.000
## 81 1.500 1.500 0.500 3.000 2.500 0.000 1.500 1.000 0.000 4.000
## 82 0.000 4.000 9.000 0.000 16.000 1.000 3.000 0.000 3.000 1.000
## 83 1.000 0.000 1.000 0.000 2.000 0.000 3.000 1.000 3.000 0.000
## 84 0.000 0.000 0.000 0.333 3.333 1.333 0.000 1.000 0.000 1.000
## 85 0.000 0.000 0.000 1.000 0.800 0.000 0.000 0.000 2.800 1.600
## 86 0.000 0.000 5.000 2.000 4.000 0.000 1.000 0.000 5.000 0.000
## 87 0.000 0.000 0.000 1.000 1.250 0.250 0.000 0.000 0.000 0.000
## 88 1.000 0.500 2.000 0.000 8.000 0.000 0.000 0.000 2.500 1.500
## 89 0.000 0.333 0.667 0.000 2.333 0.000 0.000 0.667 0.333 2.333
## 90 0.000 0.667 0.667 2.333 1.667 0.333 0.667 0.000 10.667 0.000
## 91 0.333 0.667 1.333 0.667 3.000 0.000 0.333 0.667 0.000 0.333
## 92 0.000 5.500 2.500 0.000 4.000 0.500 0.000 1.000 1.000 0.000
## 93 0.000 2.000 5.000 4.000 3.500 2.000 0.000 0.000 22.500 0.000
## 94 0.000 6.000 4.000 0.000 6.000 0.000 6.000 1.000 3.000 2.000
## 95 0.000 0.000 0.000 0.500 3.000 0.000 0.000 0.500 2.500 0.000
## 96 0.000 0.000 0.000 0.000 0.500 2.000 0.000 0.500 0.500 1.000
## 97 1.000 0.000 0.000 0.000 3.000 4.000 0.000 1.000 0.000 1.000
## 98 0.250 0.750 2.250 0.500 1.250 0.000 0.500 0.000 0.250 0.250
## 99 0.500 0.000 0.000 2.000 1.500 0.000 0.000 0.000 5.000 3.000
## 100 0.000 2.500 4.000 2.000 3.000 0.500 1.500 1.500 1.000 0.500
## top vishalsikka will year
## 1 0.333 3.000 0.333 0.333
## 2 0.000 0.667 0.667 0.000
## 3 0.000 2.000 5.000 0.000
## 4 0.000 0.000 1.273 0.000
## 5 0.000 0.000 0.000 0.000
## 6 2.000 2.500 10.000 0.500
## 7 0.000 0.000 0.000 0.000
## 8 1.000 3.500 1.500 1.500
## 9 0.250 1.250 2.000 0.250
## 10 0.000 0.000 0.125 0.000
## 11 0.000 0.000 0.000 0.000
## 12 0.500 3.000 1.500 0.000
## 13 4.000 0.000 1.000 2.000
## 14 0.500 19.000 3.500 1.000
## 15 0.100 2.300 0.000 0.100
## 16 0.000 0.000 2.000 0.000
## 17 0.667 4.000 3.000 1.333
## 18 1.000 2.000 0.000 0.000
## 19 0.000 2.000 2.000 0.000
## 20 0.167 0.167 0.000 0.083
## 21 0.333 3.667 1.000 0.000
## 22 1.000 1.750 2.000 0.250
## 23 0.600 3.000 0.400 0.400
## 24 0.000 0.000 2.667 0.000
## 25 0.000 0.000 1.200 0.200
## 26 2.000 6.000 3.000 1.000
## 27 0.667 0.000 1.000 0.333
## 28 0.000 0.000 0.000 0.000
## 29 0.409 0.000 0.000 0.045
## 30 0.500 0.500 3.500 0.000
## 31 0.000 0.000 0.250 0.250
## 32 0.000 3.000 1.000 6.000
## 33 0.000 0.000 2.000 1.000
## 34 3.000 0.000 0.000 4.000
## 35 0.000 0.000 1.500 0.000
## 36 0.000 0.000 0.500 0.000
## 37 0.000 0.000 0.000 0.000
## 38 0.500 0.000 0.000 0.500
## 39 0.000 0.000 0.000 0.333
## 40 0.000 0.000 0.000 1.000
## 41 0.250 0.000 1.500 0.250
## 42 0.200 0.000 0.000 0.000
## 43 0.000 0.667 0.333 0.000
## 44 0.000 0.000 0.000 0.000
## 45 0.000 0.000 1.000 0.500
## 46 0.000 0.000 0.000 3.000
## 47 1.000 0.000 7.000 0.000
## 48 1.000 10.000 0.000 0.000
## 49 0.125 0.000 0.000 0.000
## 50 0.500 0.000 1.500 1.000
## 51 0.000 0.000 0.000 0.000
## 52 1.000 6.000 8.000 7.000
## 53 0.000 3.000 1.000 4.000
## 54 0.000 0.000 0.333 0.000
## 55 0.000 0.000 3.500 1.000
## 56 1.000 1.000 0.500 3.500
## 57 0.000 0.000 1.000 0.000
## 58 0.000 5.500 0.000 0.500
## 59 0.000 0.000 1.000 0.500
## 60 3.000 0.000 0.000 1.000
## 61 0.750 0.000 0.000 0.000
## 62 0.250 6.500 4.000 0.000
## 63 1.000 12.000 4.000 1.000
## 64 0.200 0.000 0.000 0.800
## 65 0.250 0.000 0.000 0.000
## 66 0.250 0.250 0.250 0.500
## 67 0.000 0.667 3.667 1.000
## 68 3.000 2.000 0.000 0.000
## 69 0.000 2.000 1.000 1.000
## 70 1.000 0.000 0.000 0.000
## 71 0.000 0.200 0.400 0.000
## 72 0.000 0.000 1.000 0.000
## 73 0.000 0.000 0.200 2.000
## 74 0.000 0.000 0.333 0.000
## 75 1.000 8.500 1.500 0.000
## 76 1.000 0.000 1.000 0.000
## 77 0.000 0.000 0.000 0.000
## 78 0.000 0.000 0.000 0.000
## 79 0.000 0.000 0.000 1.000
## 80 1.000 9.000 19.000 3.000
## 81 3.000 0.000 3.000 1.000
## 82 3.000 11.000 7.000 2.000
## 83 1.000 0.000 2.000 2.000
## 84 0.000 0.000 0.000 0.000
## 85 0.000 0.000 0.400 0.000
## 86 0.000 0.000 0.000 3.000
## 87 0.000 0.000 0.250 0.000
## 88 1.500 2.500 3.000 1.000
## 89 0.000 0.333 1.333 0.000
## 90 0.000 0.667 1.000 2.333
## 91 2.000 2.333 0.667 0.000
## 92 0.000 11.000 6.500 0.500
## 93 0.500 2.000 5.000 3.500
## 94 2.000 0.000 1.000 3.000
## 95 0.000 0.000 0.500 2.000
## 96 0.000 0.000 0.000 0.000
## 97 0.000 0.000 0.000 0.000
## 98 0.250 2.750 1.750 0.750
## 99 0.000 0.000 0.000 2.500
## 100 2.000 0.000 1.000 0.500
ss <- 10 # number of clusters to display
for (i in 1:ss) {
cat(paste("cluster ", i, ": ", sep = ""))
n<-kmeansResult$size[i]
cat(paste("Number of news ", n, ": ", sep = ""))
s <- sort(kmeansResult$centers[i, ], decreasing = T)
cat(names(s)[1:5], "\n")
# print the news items of every cluster
# print(testReplace[which(kmeansResult$cluster==i)])
}
## cluster 1: Number of news 3: infosys vishalsikka growth new buy
## cluster 2: Number of news 3: buy infosys percent one stock
## cluster 3: Number of news 1: infosys will ceo new buy
## cluster 4: Number of news 11: will infosys stock may buy
## cluster 5: Number of news 5: infosys one may stock buy
## cluster 6: Number of news 2: will company vishalsikka top growth
## cluster 7: Number of news 18: percent infosys ceo company buy
## cluster 8: Number of news 2: company infosys vishalsikka ceo growth
## cluster 9: Number of news 4: company infosys will vishalsikka buy
## cluster 10: Number of news 8: infosys buy one may will
From the above top words and cluster centres, we can see that the clusters cover different topics. For instance, cluster 1 focuses on Infosys stock growth and the new CEO Vishal Sikka, cluster 2 on advice to buy the Infosys stock, and cluster 6 on the expectation that the company will grow faster under the newly appointed CEO. In general, the clusters talk about buying the Infosys stock and about future growth after the appointment of the new CEO Vishal Sikka.
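The number of clusters used above (k = 100) is a free parameter. A common heuristic for choosing it is the elbow method, sketched below; the range 2 to 20 and the nstart value are illustrative assumptions:
# total within-cluster sum of squares for a range of cluster counts
wss <- sapply(2:20, function(kk) kmeans(m3, centers = kk, nstart = 5)$tot.withinss)
plot(2:20, wss, type = "b", xlab = "Number of clusters k", ylab = "Total within-cluster SS")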
We then try k-medoids clustering with the Partitioning Around Medoids (PAM) algorithm, which uses medoids (representative objects) instead of means to represent clusters. It is more robust to noise and outliers than k-means clustering, and provides a display of the silhouette plot to show the quality of clustering. In the example below, we use function pamk() from package fpc [Hennig,2010], which calls the function pam() with the number of clusters estimated by optimum average silhouette.
# Clustering news with the k-medoids Algorithm
library(fpc)
# partitioning around medoids with estimation of number of clusters
pamResult <- pamk(m3, metric="manhattan")
k <- pamResult$nc # number of clusters identified
pamResult <- pamResult$pamobject
# print cluster medoids
for (i in 1:k) {
cat("cluster", i, ": ",
colnames(pamResult$medoids)[which(pamResult$medoids[i,]==1)], "\n")
}
## cluster 1 : infosys
## cluster 2 : ceo stock
# plot clustering result
layout(matrix(c(1, 2), 1, 2)) # set to two graphs per page
#layout(matrix(1)) # change back to one graph per page
par(mai=c(0.5,0.5,0.5,0.5))
plot(pamResult, col.p = pamResult$clustering)
In the figure, the first chart is a 2D "clusplot" (clustering plot) of the k clusters, and the second one shows their silhouettes. A large silhouette width s(i) (close to 1) suggests that the corresponding observation is very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster. The average silhouette width is 0.275, which suggests that the clusters are not well separated from one another.
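The average silhouette width shown in the plot can also be read directly from the pam object extracted above; a minimal check (available when more than one cluster is found):
# average silhouette width of the k-medoids clustering
pamResult$silinfo$avg.width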
Topic modelling with Latent Dirichlet Allocation (LDA) is applied below to identify the correlated topics in the news and how they are distributed over time.
dtm <- as.DocumentTermMatrix(myTdm)
library(topicmodels)
lda <- LDA(dtm, k = 8) # find 8 topics
term <- terms(lda, 4) # first 4 terms of every topic
term
## Topic 1 Topic 2 Topic 3 Topic 4 Topic 5 Topic 6
## [1,] "infosys" "infosys" "infosys" "infosys" "percent" "infosys"
## [2,] "company" "vishalsikka" "company" "percent" "growth" "deal"
## [3,] "will" "will" "vishalsikka" "stock" "quarter" "companies"
## [4,] "ceo" "company" "will" "shares" "infosys" "indian"
## Topic 7 Topic 8
## [1,] "infosys" "infosys"
## [2,] "growth" "vishalsikka"
## [3,] "stock" "company"
## [4,] "will" "ceo"
term <- apply(term, MARGIN = 2, paste, collapse = ", ")
topic <- topics(lda, 1)
topics <- data.frame(date=as.Date(InfosysNews$Date), topic)
par(mai=c(1,1.2,1,0.5))
qplot(date, ..count.., data=topics, geom="density",
fill=term[topic], position="stack")
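The ..count.. notation and the position argument of qplot() used above are deprecated in recent versions of ggplot2; assuming a recent ggplot2 (>= 3.3), an equivalent stacked density plot can be written as:
# stacked density of news items per topic over time
ggplot(topics, aes(date, after_stat(count), fill = term[topic])) +
  geom_density(position = "stack")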
Sentiment scores can help to anticipate the movement of a stock. Sentence-level sentiment scores are often more reliable because of the larger sample size.
#Sentiment Analysis
pos = readLines("positive-words.txt")
neg = readLines("negative-words.txt")
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
require(plyr)
require(stringr)
scores = laply(sentences, function(sentence, pos.words, neg.words) {
# split into words. str_split is in the stringr package
word.list = str_split(sentence, '\\s+')
# sometimes a list() is one level of hierarchy too much
words = unlist(word.list)
# compare our words to the dictionaries of positive & negative terms
pos.matches = match(words, pos.words)
neg.matches = match(words, neg.words)
# match() returns the position of the matched term or NA
# we just want a TRUE/FALSE:
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
# and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
score = sum(pos.matches) - sum(neg.matches)
return(score)
}, pos.words, neg.words, .progress=.progress )
scores.df = data.frame(score=scores, text=sentences)
return(scores.df)
}
analysis = score.sentiment(testReplace, pos, neg,.progress='text')
table(analysis$score)
hist(analysis$score)
The plot above shows that the distribution of sentiment scores is skewed towards positive values, i.e. the sentiment of the market towards the Infosys stock is largely positive.
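Since sentence-level scores were noted above as often more reliable, the same scorer can be applied after splitting each news item into sentences; a minimal sketch, where the sentence-splitting regular expression is an illustrative assumption:
# split each news item into sentences and score every sentence separately
sentences <- unlist(strsplit(as.character(testReplace), "(?<=[.!?])\\s+", perl = TRUE))
sentence.analysis <- score.sentiment(sentences, pos, neg)
table(sentence.analysis$score)
hist(sentence.analysis$score)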