Required libraries for doing the Twitter analysis
library(twitteR)
## Loading required package: ROAuth
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: digest
## Loading required package: rjson
library(bitops)
library(RCurl)
library(RJSONIO)
##
## Attaching package: 'RJSONIO'
##
## The following objects are masked from 'package:rjson':
##
## fromJSON, toJSON
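If any of these packages are not yet installed on your machine, a one-time install from CRAN does the job:
install.packages(c("twitteR", "ROAuth", "RCurl", "RJSONIO", "rjson", "bitops"))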
The Twitter handshake commands
I am attaching the code for doing the handshake below, with dummy values in place of the real keys.
library(RCurl)
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "Dra1Eyxxxxxxxxxxxxxxxxxxxx"
consumerSecret <- "RLtyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy"
twitCred <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret,
                             requestURL = reqURL, accessURL = accessURL, authURL = authURL)
download.file(url = "http://curl.haxx.se/ca/cacert.pem", destfile = "cacert.pem")
twitCred$handshake(cainfo = "cacert.pem")
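For reference, the handshake prompt in the console looks roughly like the sketch below (the token shown is a dummy value):
## To enable the connection, please direct your web browser to:
## https://api.twitter.com/oauth/authorize?oauth_token=xxxxxxxxxxxxxxxx
## When complete, record the PIN given to you and provide it here: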
Once the above portion of the code is executed, the console will provide a URL which needs to be copied, pasted into a browser and opened. The browser will take you to the Twitter authorisation page and will prompt you to authorise the handshake. Once it is authorised, it will redirect to the webpage (URL) that was given in the Twitter app creation screen when you created the app. On that webpage you will find the handshake key within the URL; it is the last portion of the URL. This handshake key has to be pasted into the RStudio console to complete the handshake. After the handshake is completed, the code below is executed.
registerTwitterOAuth(twitCred)
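The authorised credential can also be saved to disk so that future sessions can skip the browser step. A minimal sketch (the file name is my own choice):
save(twitCred, file = "twitCred.RData")
# in a later session: load("twitCred.RData"); registerTwitterOAuth(twitCred)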
Creating the function to download the tweets for a hashtag and to build the data frame for further analysis. Since all of this code requires the handshake key, I am putting it in the body of the Rmd document and not within the R chunks. The function below is a one-time function that can download any of the topics you are interested in.
Tweetanalysis <- function(search, number) {
  tweetlist <- searchTwitter(search, n = number)
  return(do.call("rbind", lapply(tweetlist, as.data.frame)))
}
Carrying out the Tweet analysis function for Citibank and JPMorgan
Cititweet <- Tweetanalysis("#Citibank", 395)
JPtweet <- Tweetanalysis("#JPMorgan", 395)
The number 395 denotes the number of tweets you want to download; you can select any number you like. The reason I selected 395 was that, at the time I ran the code, there were only 395 tweets available for Citi for the day, and I therefore selected the same number for both organisations.
Sorting the tweets by the time at which they arrived. This is done so that we can get the time delay in seconds between adjacent tweets; only then will we be able to find the mean time between tweets.
Citisort <- Cititweet[order(as.integer(Cititweet$created)),]
JPsort <- JPtweet[order(as.integer(JPtweet$created)),]
The next task is to find the frequency, in seconds, at which the tweets appear. This is done by calculating the time delay between subsequent tweets and finding the mean of these delays, as below.
Citidelay <- as.integer(diff(Citisort$created))
JPdelay <- as.integer(diff(JPsort$created))
mean(Citidelay)
## [1] 1503
mean(JPdelay)
## [1] 1528
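To put these delays into more familiar units, a quick conversion of the same values to minutes:
mean(Citidelay) / 60   # roughly 25 minutes between successive #Citibank tweets
mean(JPdelay) / 60     # roughly 25.5 minutes for #JPMorgan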
From the above it can be seen that the mean time delay between tweets is about 1503 seconds for Citi and about 1528 seconds for JPMorgan, so the figures are pretty close. Let us look at how many observations for each bank are less than or equal to 1527 seconds, roughly the larger of the two means. This will give us the probabilities of the tweet rates.
Citiprob <- sum(Citidelay <= 1527)
JPprob <- sum(JPdelay <= 1527)
Citiprob
## [1] 288
JPprob
## [1] 281
Citiprob/length(Citidelay)
## [1] 0.731
JPprob/length(JPdelay)
## [1] 0.7132
So it is found that about 73% of Citi's delays fall at or below this threshold, while for JPMorgan it is about 71%. Both observations look pretty close.
Let us do a Poisson test on each of these counts individually.
Citistat <- poisson.test(288,395)
JPstat <- poisson.test(281,395)
Citistat
##
## Exact Poisson test
##
## data: 288 time base: 395
## number of events = 288, time base = 395, p-value = 1.94e-08
## alternative hypothesis: true event rate is not equal to 1
## 95 percent confidence interval:
## 0.6473 0.8184
## sample estimates:
## event rate
## 0.7291
JPstat
##
## Exact Poisson test
##
## data: 281 time base: 395
## number of events = 281, time base = 395, p-value = 1.742e-09
## alternative hypothesis: true event rate is not equal to 1
## 95 percent confidence interval:
## 0.6306 0.7996
## sample estimates:
## event rate
## 0.7114
From the individual Poisson tests, both Citi and JP are found to have significant p-values, with 95 percent confidence intervals containing the observed rates.
Let us now proceed to a comparison between the two rates and try to draw an inference from the statistics.
poisson.test(c(288,281),c(395,395))
##
## Comparison of Poisson rates
##
## data: c(288, 281) time base: c(395, 395)
## count1 = 288, expected count1 = 284.5, p-value = 0.8014
## alternative hypothesis: true rate ratio is not equal to 1
## 95 percent confidence interval:
## 0.8665 1.2123
## sample estimates:
## rate ratio
## 1.025
From the comparative Poisson test, the p-value is found to be very large. This means we fail to reject the null hypothesis, which states that the ratio of the two rates is equal to 1; in other words, the two rates are similar.
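If these figures need to be pulled out programmatically (for reporting, say), the htest object returned by poisson.test() exposes them as list elements. A small sketch:
comptest <- poisson.test(c(288, 281), c(395, 395))
comptest$p.value    # the large p-value seen above
comptest$estimate   # the rate ratio of about 1.025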
Let us now attempt an analysis of the text strings within the tweets we downloaded for Citibank. For string analysis we require the stringr package, so let us load that first. The search() function lists the packages we currently have attached; if any of them conflicts with what we need, it can be removed with the detach() function (an illustrative call follows the search() output below).
library(stringr)
search()
## [1] ".GlobalEnv" "package:stringr" "package:RJSONIO"
## [4] "package:twitteR" "package:rjson" "package:ROAuth"
## [7] "package:digest" "package:RCurl" "package:bitops"
## [10] "package:stats" "package:graphics" "package:grDevices"
## [13] "package:utils" "package:datasets" "package:methods"
## [16] "Autoloads" "package:base"
The variable we have to analyse is the "text" variable within the Cititweet data frame. Let us first start by analysing the length of the strings within the data frame.
Cititweet$textlen <- str_length(Cititweet$text)
head(Cititweet$textlen,10)
## [1] 135 108 127 137 187 91 106 79 97 79
By default, a tweet is limited to 140 characters. On visual inspection, however, we can see that several tweets are longer than 140 characters. Let us identify those and look at the nature of those tweets.
head(Cititweet[Cititweet$textlen > 140,],2)
## text
## 5  RT @turkyepost: قصةُ نجاح| "#Citibank" خبرة استثمارية تمتد لـ 38 عاماً له بتركيا\n#تركيا_بوست\nhttp://t.co/w87TAvccfo http://t.co/bM5ZQi60wu
## 29 RT @turkyepost: قصةُ نجاح| "#Citibank" خبرة استثمارية تمتد لـ 38 عاماً له بتركيا\n#تركيا_بوست\nhttp://t.co/w87TAvccfo http://t.co/bM5ZQi60wu
## favorited favoriteCount replyToSN created truncated
## 5 FALSE 0 <NA> 2014-12-05 03:37:38 FALSE
## 29 FALSE 0 <NA> 2014-12-04 21:23:34 FALSE
## replyToSID id replyToUID
## 5 <NA> 540711566417469440 <NA>
## 29 <NA> 540617431757369346 <NA>
## statusSource
## 5 <a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>
## 29 <a href="http://twitter.com/download/android" rel="nofollow">Twitter for Android</a>
## screenName retweetCount isRetweet retweeted longitude latitude
## 5 mohammedalnqeeb 2 TRUE FALSE <NA> <NA>
## 29 mushax40 2 TRUE FALSE <NA> <NA>
## textlen
## 5 187
## 29 187
Most of the tweets with a length of more than 140 fall under the category of retweets (the two shown above are the same Arabic retweet from @turkyepost, a success story about Citibank's 38 years of investment experience in Turkey). Let us now clean up the text by collapsing extra spaces within the strings. After this we can measure the length of the text, count the words by counting the spaces and adding one, and also find the mean word count.
Cititweet$modtext <- str_replace_all(Cititweet$text, " +", " ")
Cititweet$textlen2 <- str_length(Cititweet$modtext)
Cititweet$wordcount <- (str_count(Cititweet$modtext," ")+1)
mean(Cititweet$wordcount)
## [1] 12.26
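Beyond the mean, the spread of the word counts can be inspected with summary():
summary(Cititweet$wordcount)   # minimum, quartiles and maximum of the words per tweet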
Now let's do some parsing of the strings to analyse the text data further and unearth trends within the tweets. Let us first identify only those strings with RT in them.
Cititweet$rt <- str_match(Cititweet$modtext, "RT @[a-zA-Z]*: ")
head(Cititweet$rt,10)
## [,1]
## [1,] NA
## [2,] NA
## [3,] NA
## [4,] NA
## [5,] "RT @turkyepost: "
## [6,] NA
## [7,] NA
## [8,] NA
## [9,] NA
## [10,] NA
Cititweet$rt <- Cititweet$rt[,1]
str_match() returns a character matrix, so the [,1] above flattens it to a plain vector. Having done this, let us now identify whether any influential people have been retweeted. Before doing that, however, we have to clean up the matched strings a bit to remove the "RT @" prefix and other characters. Let us do this with the str_replace() function.
Cititweet$rt <- str_replace(Cititweet$rt, "RT @","")
Cititweet$rt <- str_replace(Cititweet$rt, ": ","")
tail(Cititweet$rt,10)
## [1] "CoinTelegraph" NA NA NA
## [5] "CoinTelegraph" "CoinTelegraph" NA "CoinTelegraph"
## [9] NA NA
Now that we have cleaned the data quite a bit, let us find out who the influential tweeters are and how many times each has been retweeted. This can be done by converting the rt column to a factor and making a table out of the factor levels.
table(as.factor(Cititweet$rt))
##
## baapoffers bienaraza bsindia CaioSasaki
## 4 1 3 1
## cfo CoinTelegraph cromaretail EDSamorzadowy
## 5 35 5 1
## ericdeladiennee FilippasWay innercitypress mundrajajay
## 1 1 3 1
## nxtgenttt Pepperfry princehandley PROPHECYandNEWS
## 1 3 2 1
## rishjtt samanizad SASchoenfeld ShopClues
## 1 1 1 1
## TheKoreaHerald TraderStef turkyepost velcrolewis
## 4 1 6 2
Hm! We seem to have identified some very influential tweeters, with one account retweeted 35 times.
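When the table grows long, sorting it makes the heavy hitters easier to spot; for instance:
head(sort(table(as.factor(Cititweet$rt)), decreasing = TRUE), 3)
# CoinTelegraph tops the list with 35 retweets, followed by turkyepost with 6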
Now let us look at this data from another perspective and see how many of these influential tweeters have contributed tweets of more than 140 characters in length.
Cititweet$longtext <- (Cititweet$textlen2 > 140)
table(as.factor(Cititweet$rt),as.factor(Cititweet$longtext))
##
## FALSE TRUE
## baapoffers 2 2
## bienaraza 1 0
## bsindia 3 0
## CaioSasaki 1 0
## cfo 5 0
## CoinTelegraph 35 0
## cromaretail 0 5
## EDSamorzadowy 0 1
## ericdeladiennee 0 1
## FilippasWay 0 1
## innercitypress 0 3
## mundrajajay 0 1
## nxtgenttt 0 1
## Pepperfry 1 2
## princehandley 0 2
## PROPHECYandNEWS 0 1
## rishjtt 0 1
## samanizad 0 1
## SASchoenfeld 0 1
## ShopClues 0 1
## TheKoreaHerald 4 0
## TraderStef 0 1
## turkyepost 0 6
## velcrolewis 2 0
Now let us attempt to do for the URL strings in the text what we did for the text itself. Note that each tweet may contain multiple URLs or none at all, so we have to treat this analysis more carefully.
Let us first identify the URLs with the str_match_all() function, the multi-match counterpart of the str_match() function used earlier.
Cititweet$urlist <- str_match_all(Cititweet$text, "http://t.co/[a-zA-Z0-9]{8}")
head(Cititweet$urlist,1)
## [[1]]
## [,1]
## [1,] "http://t.co/CPB0JyZA"
Let us now count the matches and find out how many URLs there are in each element. This can be done with the rapply() function.
Cititweet$urlnum <- rapply(Cititweet$urlist,length)
table(Cititweet$urlnum)
##
## 0 1 2
## 220 102 73
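As an aside, the same counts can be obtained with the perhaps more familiar sapply(), since all we need is the length of each list element:
table(sapply(Cititweet$urlist, length))   # identical to the rapply() result above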
Let us also compare the URL counts with the longtext flag and see whether any pattern can be identified there.
table(Cititweet$urlnum,Cititweet$longtext)
##
## FALSE TRUE
## 0 196 24
## 1 85 17
## 2 62 11
prop.table(table(Cititweet$urlnum,Cititweet$longtext))
##
## FALSE TRUE
## 0 0.49620 0.06076
## 1 0.21519 0.04304
## 2 0.15696 0.02785
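The proportions above are taken over the whole table, so they are dominated by how common each URL count is. Conditioning within each row is often more revealing, and prop.table() takes a margin argument for exactly this:
prop.table(table(Cititweet$urlnum, Cititweet$longtext), margin = 1)
# margin = 1 gives, within each URL count, the share of short versus long tweets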
Let us also attempt to find out how many of these tweets are highlighting issues, and what proportion of all the tweets they represent.
Cititweet$issue <- str_match(Cititweet$modtext, "#[a-zA-Z_]*\nIssue:")
head(Cititweet$issue,2)
## [,1]
## [1,] NA
## [2,] "#mortgage\nIssue:"
Cititweet$issuelen <- !is.na(Cititweet$issue)
head(Cititweet$issuelen,2)
## [,1]
## [1,] FALSE
## [2,] TRUE
prop.table(table(Cititweet$issuelen))
##
## FALSE TRUE
## 0.6506 0.3494
Well, that is a pretty high percentage of tweets relating to issues: almost 35%. Let us see which issues are being reported.
prop.table(table(Cititweet$issue[Cititweet$issuelen]))
##
## #bank_account_or_service\nIssue: #consumer_loan\nIssue:
## 0.08696 0.02174
## #credit_card\nIssue: #debt_collection\nIssue:
## 0.54348 0.08696
## #mortgage\nIssue: #student_loan\nIssue:
## 0.24638 0.01449
So that's another revealing fact: 54% of the issues tweeted are credit card related. I bet the credit card division is looking at this and thinking about it.