Here’s a simulated use case. Suppose you want to know what shows up when someone Google-searches for, say, ‘ISB CBA’: which pages show up, in what order, what content they contain, what sentiment they express, and so on.

Why focus on Google search only? Because, like it or not, it remains the predominant search engine, the one most people use to navigate and [re]organize their view of the web…

Python makes it super-simple to scrape Google search results. Once those data are in, a ton of stuff opens up downstream: questions to ask, hypotheses to test, analyses to run, answers to find…

Ready to start? OK. Follow the code snippets below. At one point you’ll be asked to load the data file from your local machine.

Step 1: Clear the workspace and load the required libraries for this tutorial. Make sure all the packages are installed. (You can install a package with install.packages("packagename").)
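If some of the six packages below are missing, here is one quick way to install just the missing ones (a small sketch; adapt the package list as you like):

pkgs = c("tm", "wordcloud", "syuzhet", "dplyr", "reshape2", "ggplot2")   # packages used in this tutorial
install.packages(setdiff(pkgs, rownames(installed.packages())))          # installs only those not already present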

rm(list=ls()) # clears the workspace prior to fresh analysis

###############################################################
#             Invoke required library                         #
###############################################################

library("tm")
library("wordcloud")
library("syuzhet")
library("dplyr")
library("reshape2")
library("ggplot2")

Read in the CSV data file produced by the Python scraper and saved somewhere on your local machine. When the code below runs, it’ll open a window and ask you for the location of the data file.

data = read.csv(file.choose(), stringsAsFactors = F)
dim(data)
## [1] 53  3
names(data)
## [1] "X"    "text" "url"

The next step involves some basic data munging/cleaning.

Some links in the Google search results are PDF documents. These don’t read in well and are noisy, so for now we’ll just drop those links and their corresponding texts from the corpus. We’ll also remove non-ASCII characters from the text.

if (length(grep('\\.pdf', data$url)) != 0) {   # '\\.pdf' matches a literal ".pdf" in the URL
  data = data[-grep('\\.pdf', data$url),]      # drop the PDF links and their texts
  dim(data)
}
## [1] 52  3
data$text  =  iconv(data$text, "latin1", "ASCII", sub="")   # Keep only ASCII characters

Next step: Apply standard text-corpus operations and generate a basic wordcloud.

wordCorpus <- Corpus(VectorSource(data$text))
wordCorpus <- tm_map(wordCorpus, removePunctuation)
wordCorpus <- tm_map(wordCorpus, content_transformer(tolower))
wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
wordCorpus <- tm_map(wordCorpus, removeNumbers)
wordCorpus <- tm_map(wordCorpus, stripWhitespace)

pal <- brewer.pal(9,"YlGnBu")
pal <- pal[-(1:4)]
set.seed(123)

wordcloud(words = wordCorpus, scale=c(4,0.5), max.words=100, random.order=FALSE, 
          rot.per=0.35, use.r.layout=FALSE, colors=pal)
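If you’d like to sanity-check the cleaning before reading the wordcloud, you can peek at one cleaned document (just a quick check, not part of the main flow):

content(wordCorpus[[1]])   # text of the first document after all the cleaning steps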

Next step: From this text corpus, we create the TermDocumentMatrix (TDM) and use it for all further statistical analysis.

tdm <- TermDocumentMatrix(wordCorpus)
inspect(tdm[order(rowSums(as.matrix(tdm)),     # order TDM by row sum
                  decreasing = T)[1:10],1:10]) # show a 10x10 subset of the ordered TDM
## <<TermDocumentMatrix (terms: 10, documents: 10)>>
## Non-/sparse entries: 70/30
## Sparsity           : 30%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## 
##             Docs
## Terms         1  2  3  4  5  6  7  8  9 10
##   business   10 10 13 11 10 25 10  9 12 10
##   programme  40 37 37 35 35 40 35 35 33 40
##   analytics   5  5 12  5  5 19  5  5  9  5
##   isb         7  7  7  8  9 12  7  7  7  8
##   mba         0  0  0  0  0  0  0  0  0  0
##   management 11 11 14 12 11 14 11 11 11 11
##   data        0  0  5  0  0 23  0  3  5  0
##   india       3  3  3  3  3  6  3  4  7  3
##   course      0  0  0  0  0  3  0  0  0  1
##   will        5  0  3  0  0  6  0  0  0  4
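As a quick illustration of the kind of summary the TDM enables, here is a small sketch (base R plus tm’s findFreqTerms(); the frequency cutoff of 50 is arbitrary) that pulls the most frequent terms overall:

term_freq = sort(rowSums(as.matrix(tdm)), decreasing = TRUE)   # total frequency of each term across all documents
head(term_freq, 10)                                            # the ten most frequent terms
findFreqTerms(tdm, lowfreq = 50)                               # tm helper: terms occurring at least 50 times in total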

Oftentimes, folks may want to match words in a corpus against phrases from an external wordlist. For example, say I have a list of marketing words (or technical words, etc.) whose occurrence in a corpus I want to study; how should I go about it?

For instance, consider the following wordlist that I just came up with: ‘indian school of business’, ‘certificate programme in business analytics’, ‘data visualisation’, ‘predictive modelling’, ‘business decisions’, ‘programme leaders’, ‘digital summit’, ‘data mining’, ‘pattern matching’, ‘global best practices’, ‘business analytics’

So I first read it into a wordlist object called (say) ‘dict’.

dict = c('Indian school of Business','Certificate programme in Business Analytics','Data Visualisation',
         'Predictive Modelling','Business decisions','Programme leaders','Digital summit','Data Mining',
         'Pattern Matching','Global best Practices','Business Analytics')

First, we want to ensure that each of the wordlist’s phrases is treated as a single token. One way to do this is to replace the blank spaces within each phrase with a dash (-), which R’s tm package can be told not to disturb. Then we can use this new text corpus to create a revised TDM and run the statistical analysis as before.
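As a quick illustration of the substitution itself (the cleaning below uses preserve_intra_word_dashes = TRUE, so these dashes survive punctuation removal):

gsub(' ', '-', 'business analytics')   # returns "business-analytics", now a single token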

# Replace each phrase with a single dash-joined token and redo the wordcloud (example only)

dict = tolower(dict)   # convert the wordlist to lower case
dictr = paste('',gsub(' ','-',dict),'')  # gsub() swaps each space for a dash; the padding spaces keep the token separated in the text

# Now replace each phrase with its single-token form in every document
newdoc = NULL
for (i in 1:nrow(data)){
  document = tolower(data$text[i])
    for (j in 1:length(dict))  {
      document = gsub(dict[j],dictr[j],document)
    }
  newdoc = c(newdoc,document)
}
data$text1 = newdoc

wordCorpus <- Corpus(VectorSource(data$text1))
wordCorpus <- tm_map(wordCorpus, removePunctuation, preserve_intra_word_dashes = TRUE)
wordCorpus <- tm_map(wordCorpus, removeNumbers)
wordCorpus <- tm_map(wordCorpus, removeWords, stopwords("english"))
wordCorpus <- tm_map(wordCorpus, stripWhitespace)

pal <- brewer.pal(9,"YlGnBu")
pal <- pal[-(1:4)]
set.seed(123)

wordcloud(words = wordCorpus, scale=c(4,0.5), max.words=100, random.order=FALSE, 
          rot.per=0.35, use.r.layout=FALSE, colors=pal)

We can also subset this TermDocumentMatrix to just the wordlist phrases and use the resulting matrix for analysis.

tdm1 <- TermDocumentMatrix(wordCorpus)

list = gsub(" ","",dictr)   # strip the padding spaces added earlier
list[1] = 'indian-school--business' # because stopword removal dropped "of" from indian-school-of-business
list[2] = 'certificate-programme--business-analytics' # because stopword removal dropped "in" from certificate-programme-in-business-analytics

tdm2 = tdm1[row.names(tdm1) %in% list,]
inspect(tdm2[,1:5])
## <<TermDocumentMatrix (terms: 11, documents: 5)>>
## Non-/sparse entries: 22/33
## Sparsity           : 60%
## Maximal term length: 41
## Weighting          : term frequency (tf)
## 
##                                            Docs
## Terms                                       1 2 3 4 5
##   business-analytics                        1 1 2 1 1
##   business-decisions                        0 0 0 0 0
##   certificate-programme--business-analytics 4 4 4 4 4
##   data-mining                               0 0 2 0 0
##   data-visualisation                        0 0 0 0 0
##   digital-summit                            0 0 0 0 0
##   global-best-practices                     0 0 0 1 0
##   indian-school--business                   1 1 1 2 1
##   pattern-matching                          0 0 0 0 0
##   predictive-modelling                      0 0 0 0 0
##   programme-leaders                         1 1 1 1 1
wordcloud(words = row.names(tdm2), freq = rowSums(as.matrix(tdm2)), scale=c(4,0.5), max.words=100, colors=pal)
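If you want the raw counts behind that wordcloud, the same matrix gives them in one line (a sketch; the numbers will depend on your own scrape):

sort(rowSums(as.matrix(tdm2)), decreasing = TRUE)   # total occurrences of each wordlist phrase across all documents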

Next step: We can use the get_nrc_sentiment() function from the syuzhet package to get sentiment scores for each document, and then plot those scores across documents with the help of ggplot2.

docSentiment <- get_nrc_sentiment(data$text1)
data <- cbind(data, docSentiment)
data$DocNumber = data$X 

posnegtime <- melt(data[,13:15], id=c('DocNumber'))   # columns 13:15 are negative, positive and DocNumber
names(posnegtime) <- c("DocNumber", "sentiment", "meanvalue")
posnegtime$sentiment = factor(posnegtime$sentiment, levels(posnegtime$sentiment)[c(2,1)])   # put positive first so the colours below match

ggplot(data = posnegtime, aes(x = DocNumber, y = meanvalue, group = sentiment)) +
  geom_line(size = 2.5, alpha = 0.7, aes(color = sentiment)) +
  geom_point(size = 0.5) +
  ylim(0, NA) + 
  scale_colour_manual(values = c("springgreen4", "firebrick3")) +
  theme(legend.title=element_blank(), axis.title.x = element_blank()) +
  ylab("Average sentiment score") + 
  ggtitle("Sentiment Over Documents")