Report 1: Introduction to K-Means Clustering with Twitter Data

by John Baldwin at Aentropico

In this post we will describe how to get started with Twitter data mining in R, and how to use K-Means clustering to analyze the resulting data. The post is split into three sections: getting Twitter authorization for R, data mining with the twitteR package, and K-Means clustering. It should serve as an introduction to a much more in-depth field of study.

Basic experience with R programming will be helpful in this tutorial. For a thorough introduction to the R language, go to:

We're going to use three R packages in this post: twitteR, ROAuth, and ggplot2.

twitteR offers a broad array of functions for Twitter data mining and analysis. ROAuth handles the OAuth authentication required to access the Twitter API. ggplot2 is used for statistical graphics. The documentation for the packages can be found here:

We will also be using K-Means clustering to analyze the data. An excellent introduction to clustering techniques in R can be found here:

The packages may have dependencies, which should be installed automatically as needed. Load the packages with the following commands in R:

require(twitteR)
require(ROAuth)
require(ggplot2)
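
If any of the three packages is missing, a quick way to install them (and their dependencies) from CRAN, before running the require() calls above, is:

## Install the packages from CRAN if they are not already present
install.packages(c("twitteR", "ROAuth", "ggplot2"))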

Step 1: Getting Twitter Authorization for R

In order to download Twitter data, you will have to register yourself as a developer on Twitter. Go to https://dev.twitter.com, log in to your Twitter account, and select 'My Applications' by hovering over your profile picture in the upper-right part of the screen. Select 'Create an Application', fill out the form, and create your application.

In the resulting window for your app, under the 'Details' tab, in the OAuth settings, you should see two values labelled 'Consumer Key' and 'Consumer Secret'. Run the following code in R, first replacing the placeholder values of consumerKey and consumerSecret with your own:

reqURL <- "http://api.twitter.com/oauth/request_token"
accessURL <- "http://api.twitter.com/oauth/access_token"
authURL <- "http://api.twitter.com/oauth/authorize"
consumerKey <- "xxxxxxxxxxxxxxxxxxxx"
consumerSecret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
cred <- OAuthFactory$new(consumerKey=consumerKey,
                             consumerSecret=consumerSecret,
                             requestURL=reqURL,
                             accessURL=accessURL,
                             authURL=authURL)
cred$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

This step can be tricky and error-prone; be prepared to consult Google or Stack Overflow with any error messages you receive. In my experience, I had to include the following line of code at the top of the script to get the handshake to work, although disabling SSL certificate verification is less secure:

options(RCurlOptions = list(capath = system.file("CurlSSL", "cacert.pem", package = "RCurl"), ssl.verifypeer = FALSE))

cred$handshake should prompt you to go to a generated URL and receive your PIN. Go to the site, click 'Authorize App', and enter the PIN in the R console.

You can now check that the OAuth registration was successful with the following command:

registerTwitterOAuth(cred)
##TRUE

You can now save the 'cred' authorization token you have generated for future use:

save(cred, file="cred.RData")

You should be able to reload this certification in future R sessions with full functionality.
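
In a later session, reloading looks something like the sketch below (assuming cred.RData is in your working directory):

## Reload the saved credential and re-register it with twitteR
load("cred.RData")
registerTwitterOAuth(cred)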

Step 2: Data Mining with twitteR

twitteR is a highly versatile package with a variety of ways to extract user information, but for now we will focus on extracting basic information about friends and followers of a given user.
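
As a quick aside, the package also exposes functions such as searchTwitter() and userTimeline(); the calls below are only a sketch, with an arbitrary search term and tweet counts:

## Pull recent tweets matching a search term
someTweets <- searchTwitter("rstats", n = 25)

## Pull a user's recent timeline
someTimeline <- userTimeline("madsmaru", n = 20)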

We will use the Twitter handle of my sister and serial Tweeter Madeleine B for this example (madsmaru), but feel free to use your own handle or that of a Twitter account that interests you. Let's get Madeleine's friends and followers, combine them, and turn them into the data frame userNeighbors.df:

user <- getUser("madsmaru")
userFriends <- user$getFriends()
userFollowers <- user$getFollowers()
userNeighbors <- union(userFollowers, userFriends)
userNeighbors.df = twListToDF(userNeighbors)

The generated data frame, userNeighbors.df, consists of information about Madeleine's friends and followers, including a count of each individual's statuses, followers, and friends.
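
A quick look at the columns we will use (just a sanity check; the full data frame contains many more fields) confirms what came back:

## Inspect the count columns we will cluster on
str(userNeighbors.df[, c("statusesCount", "followersCount", "friendsCount")])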

Step 3: K-Means Clustering

Now that we've got the information we want from Twitter, let's play around with the data. We can start with a simple plot of the number of friends versus the number of followers in ggplot:

##Simple point graph of Madeleine's data, by friends and followers
gg1 <- ggplot(data=userNeighbors.df, aes(x=followersCount, y=friendsCount))
gg1 <- gg1 + geom_point()
gg1 <- gg1 + xlab("Followers Count")
gg1 <- gg1 + ylab("Friends Count")
gg1 <- gg1 + ggtitle("madsmaru's Friends vs Followers")
gg1

Plot of followersCount vs friendsCount

Not particularly useful. A log transformation of the two variables creates a more interesting picture…

##Create two vectors of the log values of friendsCount and followersCount
##and add them to userNeighbors.df, adding 1 to eliminate any undefined
##log(0) values.
userNeighbors.df$logFriendsCount <- log10(userNeighbors.df$friendsCount + 1)
userNeighbors.df$logFollowersCount <- log10(userNeighbors.df$followersCount + 1)

gg1.log <- ggplot(data=userNeighbors.df, aes(x=logFollowersCount, y=logFriendsCount))
gg1.log <- gg1.log + geom_point()
gg1.log <- gg1.log + xlab("Log Followers Count")
gg1.log <- gg1.log + ylab("Log Friends Count")
gg1.log <- gg1.log + ggtitle("madsmaru's Friends vs Followers - Log 10 Scale")
gg1.log

Plot of log of followersCount vs friendsCount

Much more interesting. We can see a variety of patterns in the data. Let's run K-Means clustering with K=2 on the log-transformed data, then color the points by their assigned cluster to get a clearer picture:

##Create an object containing the vectors to be used in the K-Means
kObject.log <- data.frame(userNeighbors.df$logFriendsCount,
                          userNeighbors.df$logFollowersCount)

##Run the K Means algorithm, specifying 2 centers
user2Means.log <- kmeans(kObject.log, centers=2, iter.max=10, nstart=100)

##Add the vector of cluster assignments back to the data frame as a factor
kObject.log$cluster=factor(user2Means.log$cluster)


gg1.log <- gg1.log + aes(color=factor(user2Means.log$cluster))
gg1.log <- gg1.log + labs(color="Cluster")
gg1.log

Plot of log of followersCount vs friendsCount colored by 2 clusters

Cool! We can identify two distinct clusters, separated at roughly a log followers count of 4 (10,000 followers on the original scale). The left cluster appears much denser, with a potential linear trend; the right is more dispersed.
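
To put rough numbers on that split, we can inspect the cluster sizes and back-transform the fitted centers. This is only a sketch; the exact values will vary from run to run and from account to account:

## Number of accounts assigned to each cluster
table(user2Means.log$cluster)

## Cluster centers on the log10 scale, and back on the original scale
user2Means.log$centers
10^user2Means.log$centers - 1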

Let's try clustering with 5 centers and see what happens:

##Run the same K-Means algorithm again but with 5 centers
kObject.log <- data.frame(userNeighbors.df$logFriendsCount,
                          userNeighbors.df$logFollowersCount)

user5Means.log <- kmeans(kObject.log, centers=5, iter.max=10, nstart=100)

kObject.log$cluster=factor(user5Means.log$cluster)

gg1.log <- gg1.log + aes(color=user5Means.log$cluster)
gg1.log <- gg1.log + scale_colour_gradient(low='yellow', high='brown')
gg1.log

Plot of log of followersCount vs friendsCount colored by 5 clusters

The 5-means clustering is also interesting, but it is hard to interpret without deeper insight into the data. In any case, trying different numbers of clusters in K-Means may offer new and different insights.
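
One common way to choose the number of clusters, worth a quick sketch here even though it is not part of the original analysis, is the 'elbow' method: compute the total within-cluster sum of squares for a range of K values and look for the point where adding more clusters stops paying off.

## Elbow method: total within-cluster sum of squares for K = 1 to 10
wss <- sapply(1:10, function(k) {
  kmeans(kObject.log[, 1:2], centers=k, iter.max=10, nstart=100)$tot.withinss
})

elbow.df <- data.frame(k=1:10, wss=wss)
ggplot(elbow.df, aes(x=k, y=wss)) + geom_point() + geom_line() +
  xlab("Number of clusters K") + ylab("Total within-cluster sum of squares")

The K at the bend (the 'elbow') of the resulting curve is usually a reasonable default choice.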

We can also use K-Means clustering on more than two variables. Let's try a three-dimensional clustering, adding a log-transformed count of the statuses posted by each user to our clustering variables. Note that the graph will remain a plot of followers versus friends, since it is two-dimensional and cannot display the third variable; even so, we will see a change in the clustering pattern:

userNeighbors.df$logStatusesCount = log10(userNeighbors.df$statusesCount+1)


##Build the clustering object from all three log-transformed variables
kObject.log <- data.frame(userNeighbors.df$logFriendsCount,
                          userNeighbors.df$logFollowersCount,
                          userNeighbors.df$logStatusesCount)

user5Means.log <- kmeans(kObject.log, centers=5, iter.max=10, nstart=100)

kObject.log$cluster=factor(user5Means.log$cluster)

gg1.log <- gg1.log + aes(color=user5Means.log$cluster)
gg1.log

Plot of log of followersCount vs friendsCount colored by 5 clusters, statusesCount as third cluster variable

This tutorial should serve as an introduction to K-Means clustering of Twitter data, by no means as a comprehensive overview of the subject. More advanced techniques involve clustering based on semantics (i.e., the words used in tweets) and sentiment analysis (i.e., clustering to determine the moods and emotions of users).

I hope that this tutorial has motivated you to learn more about these subjects in the future.