Machine learning is the science of getting computers to identify patterns in data. This post illustrates one such method: clustering social media data into geographic groups using R and Plotly. Clustering is a technique for partitioning objects (people, products, etc.) into similar groups. The method traces its origins to psychologists of the 1930s, who developed it to group individuals by personality traits. This post uses k-means, one of the most common clustering algorithms; a toy illustration appears after the list of use cases below.
How can businesses gain insight into customer and market behavior through their social media data? Clustering is one answer, and it has applications across many domains:
* Business and marketing: in market research, clustering is used to group consumers into market segments.
* Social networks: in statistical analysis of networks, clusters are used to identify communities within large groups of people.
* Media: clustering is used to recommend new items based on a consumer’s past media behavior or the media behavior of people like them (i.e. recommendation systems).
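To make the idea concrete before diving in, here is a minimal k-means sketch on made-up two-dimensional data (this toy snippet is an illustration only and is not part of the analysis below):
# Toy illustration: k-means on two synthetic clouds of points
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2), # 50 points near (0, 0)
matrix(rnorm(100, mean = 5), ncol = 2)) # 50 points near (5, 5)
toy.fit <- kmeans(toy, centers = 2)
table(toy.fit$cluster) # each point is assigned to one of the two groups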
I first pull data on the @netflix account using the Twitter REST API via the twitteR package in R. I then geocode Netflix's Twitter followers using the Google Maps Geocoding API through the ggmap package. Next, I analyze the data using k-means clustering to classify the followers into geographic clusters. Lastly, I show how to bring this data to life by building an interactive web-based visualization dashboard with Plotly. For any questions, you can contact me at mjlacour[at]gmail.com.
Tags: data visualization, machine learning, geospatial, plotly, twitter, kernel density estimation, Internet TV
Load Twitter Library
library(twitteR)
Refer to the Twitter Application Management page to create your own Twitter application. The following credentials are available on the "Application Management" page of your Twitter app:
consumer_key <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
token <- "YOUR_TOKEN"
token_secret <- "YOUR_TOKEN_SECRET" #Access token secret
setup_twitter_oauth(consumer_key, consumer_secret, access_token = token, access_secret = token_secret)
# WARNING: Keep the "Consumer Secret" a secret. This key should never be human-readable in your application.
# The access token can be used to make API requests on your own account's behalf. Do not share your access token secret with anyone.
Pull Netflix data from Twitter:
username <- "netflix"
user <- getUser(username)
Pull data on Twitter followers:
followers <- user$getFollowers() # Note: this call takes ~5 minutes to run
Convert the twitteR list of user objects into a data.frame. This takes ~2 minutes to run:
followers <- twListToDF(followers)
Data housekeeping: remove missing data and retain only complete cases for simplicity
followers$location[followers$location==""] <- NA
followers <- followers[complete.cases(followers),]
The Google Maps Geocoding API limits free users to 2,500 requests per day, so we sample down to approximately that size. Note that if we were conducting social network analysis, we would first want to create an adjacency matrix of nodes and ties, and we would avoid sampling, because sampling would break the edges of the network.
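As an aside, the version of ggmap contemporary with this post provides geocodeQueryCheck(), which reports how many free Google geocoding queries remain for the day; checking it before a large batch is a reasonable precaution (this step is an addition to the original workflow):
library(ggmap)
geocodeQueryCheck() # prints the number of free geocoding queries left today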
set.seed(1992)
followers <- followers[sample(1:nrow(followers), size = 2500),]
dat <- followers
Install and load the ggmap package in R to geocode the data. Convert the location field to character format before geocoding:
install.packages('ggmap')
library('ggmap')
coord <- as.character(dat$location)
coord <- geocode(coord) # Note: Takes ~ 50 minutes to run!
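If you do run the geocoding step yourself, it is worth caching the results so a re-run doesn't consume another day's quota; a minimal sketch, assuming geocode() returned one lon/lat row per follower in the original order (the file name matches the pre-loaded CSV used below):
dat <- cbind(dat, coord) # append lon/lat columns to the follower data
write.csv(dat, "twitter_data_coordinates.csv", row.names = FALSE)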
Since geocoding takes time to run, to simplify things we read in pre-geocoded data:
dat <- read.csv("twitter_data_coordinates.csv")
library(maps)
library(ggplot2)
map.dat <- map_data("world")
fig <- ggplot() +
geom_polygon(aes(long,lat, group=group), fill="grey94", data=map.dat) +
geom_point(aes(x = lon, y = lat),alpha=I(0.2),size=I(2), data = dat) +
theme_minimal() +
guides(fill = guide_legend(override.aes = list(alpha = 1))) +
ggtitle("Netflix Twitter Followers") +
theme(text = element_text(size=20),
legend.position= "none",
axis.text = element_blank(),
axis.title=element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
fig
As of January 2016, Netflix is available globally; for simplicity, let's restrict our analysis to the continental US.
us <- subset(dat, lat>25 & lat<50 & lon< -60 & lon> -130)
us.dat <- map_data("state")
fig <- ggplot() + geom_polygon(aes(long,lat, group=group), fill="grey94", data=us.dat) +
geom_point(aes(x = lon, y = lat),alpha=0.4,size=8, data = us) +
ggtitle("Netflix Twitter Followers in US") +
theme_minimal() +
guides(fill = guide_legend(override.aes = list(alpha = 1))) +
theme(text = element_text(size=20),
legend.position= "none",
axis.text = element_blank(),
axis.title=element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
fig
Estimate the number of clusters using the elbow method. First, define the range of cluster counts to test:
clustRange <- 2:20
Isolate required features
kmeans.df <- data.frame(lat = us$lat,
lon = us$lon)
Remove Missing Data
kmeans.df <- na.omit(kmeans.df)
Compute the total within-cluster sum of squares for each candidate number of clusters:
SS <- c()
for(i in clustRange){
fit <- kmeans(kmeans.df, centers = i, iter.max = 100)
SS <- c(SS,fit$tot.withinss)
}
SS.df <- data.frame(No.Of.Clusters = clustRange,
                    Total.Within.SS = SS,
                    Pct.Change = c(NA, diff(SS) / SS[-length(SS)] * 100))
Use plot to find ‘elbow’ visually
library(plotly)
plot_ly(SS.df, x = ~No.Of.Clusters, y = ~Total.Within.SS, color = ~Pct.Change,
        type = "scatter", mode = "lines+markers",
        marker = list(symbol = "circle-dot", size = 10),
        line = list(dash = "2px")) %>%
  layout(title = "Total Sum of Squares",
         annotations = list(
           list(x = 5, y = SS.df$Total.Within.SS[4], text = "Elbow", ax = 30, ay = -40)
         ))
The elbow is at 5 clusters. Fit the model using the suggested number of clusters:
nClust <- 5
fit <- kmeans(kmeans.df, centers = nClust, nstart = 20)
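Before plotting, it can help to sanity-check the fit; size and centers are standard components of a kmeans object (this inspection is an addition, not part of the original walkthrough):
fit$size # number of followers assigned to each cluster
fit$centers # cluster centroids, i.e. the mean lat/lon of each cluster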
Add cluster information to dataframe
kmeans.df$cluster = fit$cluster
Set colors
library(RColorBrewer)
cols <- brewer.pal(nClust, "Set3")
kmeans.df$color <- cols[kmeans.df$cluster] # look up each point's color by cluster id
Plot Twitter data by cluster.
kmeans.df$Cluster <- as.factor(kmeans.df$cluster)
g <- ggplot() +
geom_polygon(aes(long,lat, group=group), fill="grey97", data=us.dat) +
geom_jitter(aes(x = lon, y = lat, fill=Cluster,
colour=Cluster), alpha=.5, size=3, data = kmeans.df) +
guides(fill = guide_legend( override.aes = list(alpha = 1))) +
theme_minimal() +
theme(text = element_text(size=20),
legend.position="top",
axis.text = element_blank(),
axis.title=element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g
Write a function for two-dimensional kernel density estimation to overlay onto the plot. We redefine ggplot2's stat_density_2d here so that a bins argument can be passed through to the contour computation.
library("MASS")
stat_density_2d <- function(mapping = NULL, data = NULL, geom = "density_2d",
position = "identity", contour = TRUE,
n = 100, h = NULL, na.rm = FALSE,bins=0,
show.legend = NA, inherit.aes = TRUE, ...) {
layer(
data = data,
mapping = mapping,
stat = StatDensity2d,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
na.rm = na.rm,
contour = contour,
n = n,
h = h, # forward the bandwidth so a user-supplied h is not silently ignored
bins = bins,
...
)
)
}
stat_density2d <- stat_density_2d
StatDensity2d <-
ggproto("StatDensity2d", Stat,
default_aes = aes(colour = "#3366FF", size = 0.5),
required_aes = c("x", "y"),
compute_group = function(data, scales, na.rm = FALSE, h = NULL,
contour = TRUE, n = 100,bins=0) {
if (is.null(h)) {
h <- c(MASS::bandwidth.nrd(data$x), MASS::bandwidth.nrd(data$y))
}
dens <- MASS::kde2d(
data$x, data$y, h = h, n = n,
lims = c(scales$x$dimension(), scales$y$dimension())
)
df <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
df$group <- data$group[1]
if (contour) {
if (bins>0){
StatContour$compute_panel(df, scales,bins)
} else {
StatContour$compute_panel(df, scales)
}
} else {
names(df) <- c("x", "y", "density", "group")
df$level <- 1
df$piece <- 1
df
}
}
)
Overlay cluster kernel density estimations.
g <- g +
stat_density2d(
aes(x = lon, y = lat,fill=as.factor(cluster), colour=as.factor(cluster)),
bins = 10, alpha=.2,
size=0, data = kmeans.df, geom = "polygon")
g
Build interactive dashboard.
dashboard <- ggplotly(g) %>%
layout(
hovermode = "closest",
showlegend = T,
visible = TRUE,
title = "Netflix Twitter Followers Clustered By K-Means",
xaxis = list(
gridcolor = "white",
tickfont = list(color = "white"),
title = "",
titlefont = list(color = "rgb(204, 204, 204)"),
hoverformat = ".0f"),
yaxis = list(
gridcolor = "white",
tickfont = list(color = "white"),
title = "",
titlefont = list(color = "rgb(204, 204, 204)")))
Click the legend entries or hover over the dashboard below to see how the machine learning method clustered the Twitter followers geographically.
dashboard
Publish dashboard online. First, set Plotly login details
Sys.setenv("plotly_username" = Sys.getenv("PLOTLY_USER"))
Sys.setenv("plotly_api_key" = Sys.getenv("PLOT_API_KEY"))
plotly_POST(dashboard, world_readable=TRUE)
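One caveat: newer releases of the plotly R package deprecate plotly_POST() in favor of api_create(), so on a current installation the equivalent call would be roughly:
api_create(dashboard, sharing = "public") # replacement for plotly_POST() in plotly >= 4.x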