As we approach 1483228800 seconds in Epoch/Unix time (midnight, January 1, 2017 UTC), I thought I would work on my last blog post of 2016! As most of my readers know by now, I like to demonstrate the how-to in a reproducible fashion.
Social media has become a major data resource for most organizations today. Think about the bulk of data generated every day, every hour & every second as large masses of people voice their opinions on literally everything! You are generating data as you post a tweet on Twitter, upload a cool selfie on Facebook or publish an article on LinkedIn. Organizations, particularly those in marketing, are leveraging this data for their social media marketing efforts, and the benefits are tangible:
Businesses of every size, from big corporations to small and medium-sized ones, are appreciating this free data straight from the consumers as they gain brand authority and grow their share of the market!
In this blog post, I will demo how you can scrape data from the web (Twitter), pre-process the data in its unstructured, raw state, mine it & finally perform sentiment analysis to extract insights about how consumers feel about a particular product. A few things to note:
In this demo, I will use the Hu Liu opinion lexicon as opposed to a learning-based technique, which means the methodology will be syntactic rather than statistical. One reason I like syntactic techniques is that they tend to deliver better accuracy because they exploit the syntactic rules of the language to detect verbs, adjectives and nouns, although such techniques depend heavily on the language of the document, so the classifiers can't be ported to other languages. Finally, I will use the Jeffrey Breen approach to code the function that applies the lexicon to score sentiment.
library(twitteR)   # connect to Twitter & search tweets
library(stringr)   # str_split()
library(plyr)      # laply()
library(lattice)   # histogram()
I will start by implementing the score function. As stated earlier, this function takes the unstructured data (tweets, in this use case) & the dictionaries of positive & negative words from the Hu Liu opinion lexicon, pre-processes the tweets & finally returns a data frame of scores that we can work with.
score.sentiment <- function(sentences, pos.words, neg.words, .progress = 'none')
{
  #parameters:
  #sentences : vector of sentences to score
  #pos.words : positive words from the lexicon/dictionary of your choice
  #neg.words : negative words from the lexicon/dictionary of your choice
  #.progress : passed through to laply() for the progress bar
  #create array of scores with laply()
  scores = laply(sentences,
                 function(sentence, pos.words, neg.words)
                 {
                   #remove punctuation using global substitution
                   sentence = gsub('[[:punct:]]', '', sentence)
                   #remove control characters
                   sentence = gsub('[[:cntrl:]]', '', sentence)
                   #remove digits
                   sentence = gsub('\\d+', '', sentence)
                   #define error-handling wrapper for tolower()
                   tryTolower = function(x)
                   {
                     #default to missing value
                     y = NA
                     #try to lower-case, catching any error
                     try_err = tryCatch(tolower(x), error = function(e) e)
                     #if not an error, keep the lower-cased text
                     if (!inherits(try_err, 'error'))
                       y = tolower(x)
                     return(y)
                   }
                   #use tryTolower with sapply
                   sentence = sapply(sentence, tryTolower)
                   #split sentence into words with str_split
                   word.list = str_split(sentence, '\\s+')
                   words = unlist(word.list)
                   #compare words to the dictionary/lexicon
                   pos.matches = match(words, pos.words)
                   neg.matches = match(words, neg.words)
                   #match() gives the position of the matched word or NA;
                   #we just want a T/F
                   pos.matches = !is.na(pos.matches)
                   neg.matches = !is.na(neg.matches)
                   #final score
                   score = sum(pos.matches) - sum(neg.matches)
                   return(score)
                 }, pos.words, neg.words, .progress = .progress)
  #dataframe with a score for each sentence
  scores.df = data.frame(text = sentences, score = scores)
  return(scores.df)
}
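Before pointing this at live data, here is a quick sanity check of the scorer on two made-up sentences with a tiny made-up dictionary (the words below are hypothetical stand-ins, not the Hu Liu lexicon):
#toy example: expect scores of 2 and -2
test <- score.sentiment(c('I love this great phone!', 'Worst update ever, terrible.'),
                        pos.words = c('love', 'great'),
                        neg.words = c('worst', 'terrible'))
test$score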
Wait a minute! If you do not have a Twitter account, please STOP & head over to R-bloggers, where you can search for how to obtain the access credentials needed to connect to Twitter from RStudio. For confidentiality, I have omitted my access info in the code chunk below.
consumer_key <- 'xxxxxx'
consumer_secret <- 'xxxxxx'
access_token <- 'xxxxxx'
access_secret <- 'xxxxxx'
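As a side note, a safer habit than hard-coding keys in a script is reading them from environment variables; a minimal sketch, assuming you have exported variables like TWITTER_CONSUMER_KEY in your .Renviron (these names are my own choice, not a twitteR convention):
#read credentials from environment variables instead of hard-coding them
consumer_key <- Sys.getenv('TWITTER_CONSUMER_KEY')
consumer_secret <- Sys.getenv('TWITTER_CONSUMER_SECRET')
access_token <- Sys.getenv('TWITTER_ACCESS_TOKEN')
access_secret <- Sys.getenv('TWITTER_ACCESS_SECRET')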
With the above objects holding the Twitter access info, I will set up the Twitter authorization as follows:
setup_twitter_oauth(consumer_key = consumer_key, consumer_secret = consumer_secret,
                    access_token = access_token, access_secret = access_secret)
## [1] "Using direct authentication"
Now that my RStudio session is connected to Twitter, it's time to extract the tweets!! I have selected 4 US tech giants…Apple, Google, Microsoft & Amazon. I am curious to see a snapshot of how consumers have been reacting to the iPhone 7, the Pixel by Google & other technologies launched this year by the 4 companies. After performing sentiment analysis, I will compare my findings against Yahoo Finance for the last 5 days as of Dec-30-2016.
google <- searchTwitter('#google', n = 1000, lang = 'en')
microsft <- searchTwitter('#microsoft', n = 1000, lang = 'en')
amazon <- searchTwitter('#amazon', n = 1000, lang = 'en')
apple <- searchTwitter('#apple', n = 1000, lang = 'en')
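One optional refinement I am not applying here (so the counts below stay at 1000 each): retweets flood a hashtag search with duplicates of the same text, which inflates whatever sentiment that text carries. The twitteR package provides strip_retweets() for exactly this; a sketch:
#optional: drop native retweets & manual 'RT ...' duplicates before scoring
google_noRT <- strip_retweets(google, strip_manual = TRUE, strip_mt = TRUE)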
Now that I have extracted the tweets, I will pull the text out of them, validate the number of tweets per object (company), and then concatenate all the records together.
#get text
google_txt <- sapply(google, function(x) x$getText())
microsft_txt <- sapply(microsft, function(x) x$getText())
amazon_txt <- sapply(amazon, function(x) x$getText())
apple_txt <- sapply(apple, function(x) x$getText())
#number of tweets per company
num_twt <- c(length(google_txt), length(microsft_txt), length(amazon_txt), length(apple_txt))
print(num_twt)
## [1] 1000 1000 1000 1000
#join all the texts
major_techs <- c(google_txt, microsft_txt, amazon_txt, apple_txt)
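One caveat before scoring: tweets often carry emoji & other non-ASCII bytes (you will spot some garbled remnants in the output further down), which is exactly why the scorer wraps tolower() in tryTolower(). If you would rather scrub them up front, a common though lossy cleanup uses iconv(); a sketch, assuming you are happy to discard non-ASCII characters (not applied to the pipeline here):
#optional, lossy: drop any character that does not map to ASCII
major_techs_ascii <- iconv(major_techs, to = 'ASCII', sub = '')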
To begin, I will load the positive & negative dictionaries from the Hu Liu lexicon, then pass major_techs & the dictionaries to the score.sentiment() function defined earlier to obtain the scores data frame, which I assign to scores. I will then define a few extra variables to put the data frame in a more intuitive form & finally calculate the overall percentage of very positive sentiments among the strongly polarized tweets.
#Read Hu Liu lexicon
pos.words <- readLines('Positive-Words.txt')
neg.words <- readLines('Negative-Words.txt')
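#NOTE (assumption about your local files): the raw lexicon files distributed by
#Bing Liu start with a comment header whose lines begin with ';'. If your copies
#are unedited, strip the header & blank lines first (harmless on clean files):
pos.words <- pos.words[!grepl('^;', pos.words) & pos.words != '']
neg.words <- neg.words[!grepl('^;', neg.words) & neg.words != '']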
#apply function score.sentiment
scores <- score.sentiment(major_techs, pos.words = pos.words, neg.words = neg.words, .progress = 'text')
#add variables to dataframe
scores$major_techs <- factor(rep(c('Google', 'Microsoft', 'Amazon', 'Apple'), num_twt))
scores$very.pos <- as.numeric(scores$score >= 2)
scores$very.neg <- as.numeric(scores$score <= -2)
#number of very positive and very negative
numpos <- sum(scores$very.pos)
numneg <- sum(scores$very.neg)
#global % score
global_score <- round(100*numpos/(numpos + numneg), 2)
print(paste('The global % of very positive sentiments is', global_score,'%'))
## [1] "The global % of very positive sentiments is 86.25 %"
Let’s check out the final data frame as follows:
head(scores)
## text
## 1 RT @TechGeekRebel: #Google is fairly eco friendly! \n#tech #science #bigdata #mobile #innovation #awesome https://t.co/cNSp2oeat9
## 2 "... seriously, #Google?" #funny #humor #lol https://t.co/nGgqWnlrmm
## 3 #USA #Google - New Article - "Should I quit my job before starting a business online?"\n\nhttps://t.co/9tHv26Z0Jn https://t.co/frhTUWI5zu
## 4 #TonyStewart #NASCAR Tony Stewart #20 Nascar Baseball Cap Hat The Home Depot Joe Gibbs Like New https://t.co/Sfm6zTd39H #Google #Trends
## 5 #BestOf2016 : #Google #AdWords #Benchmarks for YOUR Industry [NEW DATA] -- https://t.co/a41LPOR3my https://t.co/FNkrqjgQh7
## 6 #Google #Facebook - Beautiful Jewelry at the right time at the right price!\n\nhttps://t.co/nQPtVpVYqQ https://t.co/zDh5mm0iTd
## score major_techs very.pos very.neg
## 1 4 Google 1 0
## 2 0 Google 0 0
## 3 0 Google 0 0
## 4 1 Google 0 0
## 5 0 Google 0 0
## 6 3 Google 1 0
To this point, I have shown how to extract data from the web (twitter in this use case), then preprocess the data, generate a data frame & finally transform the data frame to fit purpose. Now I will analyze the data & what’s better than using some visualizations?!
boxplot(score ~ major_techs, data = scores, col = c('red', 'green'))
histogram(~score | major_techs, data = scores, main = 'Sentiment Analysis of the 4 major US tech companies',
          xlab = 'Sentiment Score', col = c('blue', 'purple'))
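Before interpreting the plots, it can help to put a number behind them; here is a small optional sketch (base R only, reusing the very.pos/very.neg flags defined above) that computes the same 'very positive' percentage per company:
#per-company % of very positive among the strongly polarized tweets
pos_by_co <- tapply(scores$very.pos, scores$major_techs, sum)
neg_by_co <- tapply(scores$very.neg, scores$major_techs, sum)
round(100 * pos_by_co / (pos_by_co + neg_by_co), 2)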
The boxplots show broadly the same IQR pattern across all 4 companies; Amazon shows some extreme positive sentiments (great for the company!) whereas Apple shows some extreme negative ones. The histograms reveal more, as they show the distribution of mildly negative, very negative (score <= -2), positive and very positive (score >= 2) tweets. Apple looks better than the rest, with the mildly negative bin having the lowest count, while Microsoft & Google have the highest counts of mildly negative sentiments, indicating less consumer satisfaction. Oddly enough, even though Apple shows fewer mildly negative sentiments, the histograms confirm its extreme negative sentiments, with a higher count there than the rest. Amazon appears to lead on positive sentiment, followed by Apple. In general, Microsoft seems to have done the worst of the four giants!
Within the time window of the data captured, Microsoft's customer-satisfaction signal was the lowest of the four. Customer satisfaction tends to feed directly into a company's financial performance.
Let's turn to Yahoo Finance. At the time of this analysis, I took a snapshot of the 4 companies' Nasdaq stock performance over the last 5 days, shown below. Notice how Apple generally performed better, especially as the week wound down. Microsoft's & Google's performance trended below the other two mid-week, with the former picking up towards the end of the week and beating Amazon & Google by Friday. In conclusion, Yahoo Finance shows Apple performing better than the other 3, which correlates positively with the higher customer satisfaction seen in the sentiment visuals.
Thanks for reading! If you found this helpful, please leave your comments, suggestions or questions below. Happy New Year to y'all!!!!
Bests!