Cosine similarity is a measure of how similar two pieces of text are. The two pieces of text can be complex documents or simply two strings. As data scientists we are sometimes tasked with judging how similar texts are; for example, we may need to match a list of product descriptions against our current product range. For this problem we can build a simple search engine in R powered by cosine similarity. For more information about cosine similarity, see https://en.wikipedia.org/wiki/Cosine_similarity.
In this documentation I assume readers already know how cosine similarity works. As a brief summary, the algorithm parses each document into a vector whose entries are the number of occurrences of each word in that document; to compare two documents, we then calculate the cosine of the angle between their two vectors.
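To make that concrete, here is a tiny hand-worked example (the word counts below are made up purely for illustration): it builds two term-count vectors over a shared vocabulary and computes their cosine similarity directly.
# Toy term-count vectors over the vocabulary c("data", "science", "mining")
doc1 <- c(2, 1, 0)   # document 1 mentions "data" twice and "science" once
doc2 <- c(1, 0, 3)   # document 2 mentions "data" once and "mining" three times
# Cosine similarity = dot product / (length of doc1 * length of doc2)
sum(doc1 * doc2) / (sqrt(sum(doc1^2)) * sqrt(sum(doc2^2)))
# With non-negative counts the result lies between 0 (no shared terms) and 1 (identical direction)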
To run the following code we need the packages twitteR, tidytext, dplyr, tm and SnowballC.
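If any of them are missing, they can be installed from CRAN first; the one-off call below is just one way to do it.
install.packages(c("twitteR", "tidytext", "dplyr", "tm", "SnowballC"))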
Then we load these packages.
library(twitteR)
library(tidytext)
library(dplyr)
library(tm)
library(SnowballC)
First, set your working directory. To run the code yourself, change my working directory below to yours.
workingDir <- "C:/Users/gxw12/Desktop/Scripts/UTS/STDS/Assignments"
setwd(workingDir)
For demonstration purposes, we are going to retrieve some tweets from the user timeline of @rdatamining as our sample data. At the end we will build a search engine that searches through these sample tweets and brings back the ten tweets most similar to our query term.
url <- "http://www.rdatamining.com/data/rdmTweets-201306.RData"
download.file(url, destfile = "rdmTweets-201306.RData")
After running these two lines, you should see an .RData file saved in your chosen directory.
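If you want to double-check the download before moving on, a quick sanity check like the one below will do (purely optional):
file.exists("rdmTweets-201306.RData")  # TRUE if the file was saved successfully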
Load the data into R and convert it to a data frame object.
load(file = "rdmTweets-201306.RData")
tweets <- twListToDF(tweets)
Now we can inspect the sample dataset:
str(tweets)
## 'data.frame': 320 obs. of 14 variables:
## $ text : chr "Examples on calling Java code from R \nhttp://t.co/Yg1AivsO1R" "Simulating Map-Reduce in R for Big Data Analysis Using Flights Data http://t.co/uIAh6PgvQv via @rbloggers" "Job opportunity: Senior Analyst - Big Data at Wesfarmers Industrial & Safety - Sydney Area, Australia #jobs http://t.co/gXo"| __truncated__ "CLAVIN: an open source software package for document geotagging and geoparsing http://t.co/gTGbTanKCI" ...
## $ favorited : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ replyToSN : chr NA NA NA NA ...
## $ created : POSIXct, format: "2013-06-19 08:40:45" "2013-06-18 21:11:07" ...
## $ truncated : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ replyToSID : chr NA NA NA NA ...
## $ id : chr "347272731663945729" "347099177781719041" "346950612954533889" "346932424468492288" ...
## $ replyToUID : chr NA NA NA NA ...
## $ statusSource: chr "web" "<a href=\"http://twitter.com/tweetbutton\" rel=\"nofollow\">Tweet Button</a>" "<a href=\"http://www.linkedin.com/\" rel=\"nofollow\">LinkedIn</a>" "web" ...
## $ screenName : chr "RDataMining" "RDataMining" "RDataMining" "RDataMining" ...
## $ retweetCount: num 1 1 0 2 3 3 0 3 0 0 ...
## $ retweeted : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ longitude : logi NA NA NA NA NA NA ...
## $ latitude : logi NA NA NA NA NA NA ...
For the search engine we are going to build, we will use the "text" column as our document database. After inspecting the dataset, I found that each tweet contains some punctuation and meaningless text (URLs, line breaks and so on). As these would bias the result, we need to strip them from the text.
To remove them I use regular expressions in R. For those interested in regular expressions (regex), see https://stat.ethz.ch/R-manual/R-devel/library/base/html/regex.html.
tweets <- tweets %>%
  # Strip URLs, line breaks, ampersands and punctuation; keep a row index for joining later
  mutate(text=gsub("(http|https).+$|\\n|&|[[:punct:]]","",text),
         rowIndex=as.numeric(row.names(.))) %>%
  select(text,retweetCount,rowIndex)
Then we transform the documents into a list for later use:
docList <- as.list(tweets$text)
N.docs <- length(docList)
With the above preparations we can now build the search engine. It should take any query term as input, match it against the current document database (the tweets), and return the ten most similar documents.
QrySearch <- function(queryTerm) {
  # Record the starting time to measure search engine performance
  start.time <- Sys.time()
  # Store docs in a Corpus class, a fundamental data structure in text mining
  my.docs <- VectorSource(c(docList, queryTerm))
  # Transform/standardize docs to get them ready for analysis
  my.corpus <- VCorpus(my.docs) %>%
    tm_map(stemDocument) %>%
    tm_map(removeNumbers) %>%
    tm_map(content_transformer(tolower)) %>%
    tm_map(removeWords,stopwords("en")) %>%
    tm_map(stripWhitespace)
  # Store docs in a term-document matrix where rows=terms and cols=docs
  # Normalize term counts by applying TF-IDF weightings
  term.doc.matrix.stm <- TermDocumentMatrix(my.corpus,
                                            control=list(
                                              weighting=function(x) weightSMART(x,spec="ltc"),
                                              wordLengths=c(1,Inf)))
  # Transform the term-document matrix into a data frame of unit-length document vectors
  term.doc.matrix <- tidy(term.doc.matrix.stm) %>%
    group_by(document) %>%
    mutate(vtrLen=sqrt(sum(count^2))) %>%
    mutate(count=count/vtrLen) %>%
    ungroup() %>%
    select(term:count)
  # Split the matrix back into the document part and the query part
  docMatrix <- term.doc.matrix %>%
    mutate(document=as.numeric(document)) %>%
    filter(document<N.docs+1)
  qryMatrix <- term.doc.matrix %>%
    mutate(document=as.numeric(document)) %>%
    filter(document>=N.docs+1)
  # Calculate the top ten results by cosine similarity
  searchRes <- docMatrix %>%
    inner_join(qryMatrix,by=c("term"="term"),
               suffix=c(".doc",".query")) %>%
    mutate(termScore=round(count.doc*count.query,4)) %>%
    group_by(document.query,document.doc) %>%
    summarise(Score=sum(termScore)) %>%
    filter(row_number(desc(Score))<=10) %>%
    arrange(desc(Score)) %>%
    left_join(tweets,by=c("document.doc"="rowIndex")) %>%
    ungroup() %>%
    rename(Result=text) %>%
    select(Result,Score,retweetCount) %>%
    data.frame()
  # Record the end time and report the elapsed time
  end.time <- Sys.time()
  time.taken <- round(end.time - start.time,4)
  print(paste("Used",time.taken,"seconds"))
  return(searchRes)
}
Now let's see how this search engine performs.
QrySearch("data science")
## [1] "Used 0.391 seconds"
## Result
## 1 Free ebook on Data Science with R
## 2 Top LinkedIn Groups for Analytics Big Data Data Mining and Data Science
## 3 PostDoctoral fellow in Computer Science with specialization in Database Technology
## 4 2013 Poll hosted by KDnuggets Predictive Analytics Big Data Data Mining Data Science Software Used
## 5 Introduction to Data Science a free online course on Coursera already started on May 1st
## 6 A video from a talk on dynamic and correlated topic models applied to the journal Science
## 7 Research Associate in smart harvesting for social science open access literature and research data project Germany
## 8 Slides on big data in R
## 9 Vacancy of Data Scientist Data Miner for nPario a big data startup
## 10 Top 10 in Data Mining
## Score retweetCount
## 1 0.4947 6
## 2 0.4433 0
## 3 0.3409 0
## 4 0.3187 0
## 5 0.2736 4
## 6 0.2613 2
## 7 0.2429 0
## 8 0.0613 3
## 9 0.0548 0
## 10 0.0505 1
It returns the top ten results as expected.
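Nothing in QrySearch is specific to tweets: it only reads the global objects docList, N.docs and the tweets data frame. As a rough sketch of the product-description use case mentioned at the start (the productDesc vector below is entirely made up, and we simply reuse the same object names and column names the function expects), you could do something like this:
# Hypothetical product descriptions standing in for the tweet texts
productDesc <- c("Stainless steel water bottle 750ml",
                 "Insulated travel mug with lid",
                 "Ceramic coffee cup set of 4")
# Rebuild the objects QrySearch relies on, keeping the expected column names
tweets <- data.frame(text=productDesc,
                     retweetCount=NA,
                     rowIndex=seq_along(productDesc),
                     stringsAsFactors=FALSE)
docList <- as.list(tweets$text)
N.docs <- length(docList)
QrySearch("coffee mug")   # returns the descriptions most similar to the query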