Machine learning is the science of getting computers to identify patterns in data. This post illustrates one such method: clustering social media data into geographic groups using R and Plotly. Clustering is a technique for partitioning objects (people, products, etc.) into similar groups. The method traces its origins to psychologists of the 1930s, who developed it to group individuals by personality traits. This post uses k-means, one of the most common clustering algorithms; a toy illustration appears after the list of use cases below.
How can businesses gain insight into customer and market behavior through their social media data? Clustering is one answer, and it has applications across many domains:
* Business and marketing: in market research, clustering is used to group consumers into market segments.
* Social networks: in statistical analysis of networks, clusters are used to identify communities within large groups of people.
* Media: clustering is used to recommend new items based on a consumer’s past media behavior or the media behavior of people like them (i.e. recommendation systems).
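To make the idea concrete before diving in, here is a minimal k-means sketch on made-up two-dimensional data (this toy snippet is an illustration only and is not part of the analysis below):
# Toy illustration: k-means on two synthetic clouds of points
set.seed(1)
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2), # 50 points near (0, 0)
matrix(rnorm(100, mean = 5), ncol = 2)) # 50 points near (5, 5)
toy.fit <- kmeans(toy, centers = 2)
table(toy.fit$cluster) # each point is assigned to one of the two groups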
I first pull data on the @netflix account using the Twitter REST API via the twitteR package in R. I then geocode Netflix's Twitter followers using the Google Maps Geocoding API through the ggmap package. Next, I analyze the data using k-means clustering to classify the followers into geographic clusters. Lastly, I show how to bring this data to life by building an interactive web-based visualization dashboard with Plotly. For any questions, you can contact me at mjlacour[at]gmail.com.
Tags: data visualization, machine learning, geospatial, plotly, twitter, kernel density estimation, Internet TV
Load Twitter Library
library(twitteR)
Refer to the Twitter Application Management page to create your own Twitter application. The following credentials are available on the "Application Management" page of your Twitter app:
consumer_key <- "YOUR_CONSUMER_KEY"
consumer_secret <- "YOUR_CONSUMER_SECRET"
token <- "YOUR_TOKEN"
token_secret <- "YOUR_TOKEN_SECRET" #Access token secret
setup_twitter_oauth(consumer_key, consumer_secret, access_token = token, access_secret = token_secret)
# WARNING: Keep the "Consumer Secret" a secret. This key should never be human-readable in your application.
# The access token can be used to make API requests on your own account's behalf. Do not share your access token secret with anyone.
Pull Netflix data from Twitter:
username <- "netflix"
user <- getUser(username)
Pull data on Twitter followers:
followers <- user$getFollowers() # Note: this call takes ~5 minutes to run
Convert the twitteR list of user objects into a data.frame. This takes ~2 minutes to run:
followers <- twListToDF(followers)
Data housekeeping: remove missing data and retain only complete cases for simplicity
followers$location[followers$location==""] <- NA
followers <- followers[complete.cases(followers),]
The Google Maps Geocoding API limits free users to 2,500 requests per day, so we sample down to approximately that size. Note that if we were conducting social network analysis, we would first want to create an adjacency matrix of nodes and ties, and we would avoid sampling, because sampling would break the edges of the network.
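As an aside, the version of ggmap contemporary with this post provides geocodeQueryCheck(), which reports how many free Google geocoding queries remain for the day; checking it before a large batch is a reasonable precaution (this step is an addition to the original workflow):
library(ggmap)
geocodeQueryCheck() # prints the number of free geocoding queries left today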
set.seed(1992)
followers <- followers[sample(1:nrow(followers), size = 2500),]
dat <- followers
Install and load the ggmap package in R to geocode the data. Convert the location field to character format before geocoding:
install.packages('ggmap')
library('ggmap')
coord <- as.character(dat$location)
coord <- geocode(coord) # Note: Takes ~ 50 minutes to run!
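If you do run the geocoding step yourself, it is worth caching the results so a re-run doesn't consume another day's quota; a minimal sketch, assuming geocode() returned one lon/lat row per follower in the original order (the file name matches the pre-loaded CSV used below):
dat <- cbind(dat, coord) # append lon/lat columns to the follower data
write.csv(dat, "twitter_data_coordinates.csv", row.names = FALSE)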
Since geocoding takes time to run, to simplify things we read in pre-geocoded data:
dat <- read.csv("twitter_data_coordinates.csv")
library(maps)
library(ggplot2)
map.dat <- map_data("world")
fig <- ggplot() +
geom_polygon(aes(long,lat, group=group), fill="grey94", data=map.dat) +
geom_point(aes(x = lon, y = lat),alpha=I(0.2),size=I(2), data = dat) +
theme_minimal() +
guides(fill = guide_legend(override.aes = list(alpha = 1))) +
ggtitle("Netflix Twitter Followers") +
theme(text = element_text(size=20),
legend.position= "none",
axis.text = element_blank(),
axis.title=element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
fig
As of January 2016, Netflix is available globally; for simplicity, let's restrict our analysis to the continental US.
us <- subset(dat, lat>25 & lat<50 & lon< -60 & lon> -130)
us.dat <- map_data("state")
fig <- ggplot() + geom_polygon(aes(long,lat, group=group), fill="grey94", data=us.dat) +
geom_point(aes(x = lon, y = lat),alpha=0.4,size=8, data = us) +
ggtitle("Netflix Twitter Followers in US") +
theme_minimal() +
guides(fill = guide_legend(override.aes = list(alpha = 1))) +
theme(text = element_text(size=20),
legend.position= "none",
axis.text = element_blank(),
axis.title=element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
fig
Estimate the number of clusters using the elbow method. First, define the range of cluster counts to test:
clustRange <- 2:20
Isolate required features
kmeans.df <- data.frame(lat = us$lat,
lon = us$lon)
Remove Missing Data
kmeans.df <- na.omit(kmeans.df)
Compute the total within-cluster sum of squares for each candidate number of clusters:
SS <- c()
for(i in clustRange){
fit <- kmeans(kmeans.df, centers = i, iter.max = 100)
SS <- c(SS,fit$tot.withinss)
}
SS.df <- data.frame(No.Of.Clusters = clustRange,
                    Total.Within.SS = SS,
                    Pct.Change = c(NA, diff(SS) / SS[-length(SS)] * 100))
Use plot to find ‘elbow’ visually
library(plotly)
plot_ly(SS.df, x = ~No.Of.Clusters, y = ~Total.Within.SS, color = ~Pct.Change,
        type = "scatter", mode = "lines+markers",
        marker = list(symbol = "circle-dot", size = 10),
        line = list(dash = "2px")) %>%
  layout(title = "Total Sum of Squares",
         annotations = list(
           list(x = 5, y = SS.df$Total.Within.SS[4], text = "Elbow", ax = 30, ay = -40)
         ))
The elbow is at 5 clusters. Fit the model using the suggested number of clusters:
nClust <- 5
fit <- kmeans(kmeans.df, centers = nClust, nstart = 20)
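Before plotting, it can help to sanity-check the fit; size and centers are standard components of a kmeans object (this inspection is an addition, not part of the original walkthrough):
fit$size # number of followers assigned to each cluster
fit$centers # cluster centroids, i.e. the mean lat/lon of each cluster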
Add cluster information to dataframe
kmeans.df$cluster = fit$cluster
Set colors
library(RColorBrewer)
cols <- brewer.pal(nClust, "Set3")
kmeans.df$color <- cols[kmeans.df$cluster] # look up each point's color by cluster id
Plot Twitter data by cluster.
kmeans.df$Cluster <- as.factor(kmeans.df$cluster)
g <- ggplot() +
geom_polygon(aes(long,lat, group=group), fill="grey97", data=us.dat) +
geom_jitter(aes(x = lon, y = lat, fill=Cluster,
colour=Cluster), alpha=.5, size=3, data = kmeans.df) +
guides(fill = guide_legend( override.aes = list(alpha = 1))) +
theme_minimal() +
theme(text = element_text(size=20),
legend.position="top",
axis.text = element_blank(),
axis.title=element_blank(),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.border = element_blank(),
panel.background = element_blank())
g
Write a function for two-dimensional kernel density estimation to overlay onto the plot. We redefine ggplot2's stat_density_2d here so that a bins argument can be passed through to the contour computation.
library("MASS")
stat_density_2d <- function(mapping = NULL, data = NULL, geom = "density_2d",
position = "identity", contour = TRUE,
n = 100, h = NULL, na.rm = FALSE,bins=0,
show.legend = NA, inherit.aes = TRUE, ...) {
layer(
data = data,
mapping = mapping,
stat = StatDensity2d,
geom = geom,
position = position,
show.legend = show.legend,
inherit.aes = inherit.aes,
params = list(
na.rm = na.rm,
contour = contour,
n = n,
h = h, # forward the bandwidth so a user-supplied h is not silently ignored
bins = bins,
...
)
)
}
stat_density2d <- stat_density_2d
StatDensity2d <-
ggproto("StatDensity2d", Stat,
default_aes = aes(colour = "#3366FF", size = 0.5),
required_aes = c("x", "y"),
compute_group = function(data, scales, na.rm = FALSE, h = NULL,
contour = TRUE, n = 100,bins=0) {
if (is.null(h)) {
h <- c(MASS::bandwidth.nrd(data$x), MASS::bandwidth.nrd(data$y))
}
dens <- MASS::kde2d(
data$x, data$y, h = h, n = n,
lims = c(scales$x$dimension(), scales$y$dimension())
)
df <- data.frame(expand.grid(x = dens$x, y = dens$y), z = as.vector(dens$z))
df$group <- data$group[1]
if (contour) {
if (bins>0){
StatContour$compute_panel(df, scales,bins)
} else {
StatContour$compute_panel(df, scales)
}
} else {
names(df) <- c("x", "y", "density", "group")
df$level <- 1
df$piece <- 1
df
}
}
)
Overlay cluster kernel density estimations.
g <- g +
stat_density2d(
aes(x = lon, y = lat,fill=as.factor(cluster), colour=as.factor(cluster)),
bins = 10, alpha=.2,
size=0, data = kmeans.df, geom = "polygon")
g
Build interactive dashboard.
dashboard <- ggplotly(g) %>%
layout(
hovermode = "closest",
showlegend = T,
visible = TRUE,
title = "Netflix Twitter Followers Clustered By K-Means",
xaxis = list(
gridcolor = "white",
tickfont = list(color = "white"),
title = "",
titlefont = list(color = "rgb(204, 204, 204)"),
hoverformat = ".0f"),
yaxis = list(
gridcolor = "white",
tickfont = list(color = "white"),
title = "",
titlefont = list(color = "rgb(204, 204, 204)")))
Click the legend entries or hover over the dashboard below to see how the machine learning method clustered the Twitter followers geographically.
dashboard
Publish dashboard online. First, set Plotly login details
Sys.setenv("plotly_username" = Sys.getenv("PLOTLY_USER"))
Sys.setenv("plotly_api_key" = Sys.getenv("PLOT_API_KEY"))
plotly_POST(dashboard, world_readable=TRUE)
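One caveat: newer releases of the plotly R package deprecate plotly_POST() in favor of api_create(), so on a current installation the equivalent call would be roughly:
api_create(dashboard, sharing = "public") # replacement for plotly_POST() in plotly >= 4.x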