R&D Visual Analytics

Introduction

The purpose of this R&D project is to develop an innovative visual analytics model that conveys insight to end users tackling large-scale data problems in Internet-based applications. The project began a couple of years ago, building on years of experience developing custom machine learning algorithms for border security and law enforcement as well as for the financial and scientific domains.

Research has followed several lines of investigation: a review of machine learning algorithms; large-scale data problems, with a focus on data sparsity and rare or anomalous events; big data infrastructure; and recent developments in visual analytics.

What follows is the application of this R&D to a production-grade Internet dataset. Extensive preliminary analysis has gone into understanding the problem space and developing the engineered features.

Internet Domain

Search advertising has been one of the major revenue sources of the Internet industry for years. A key technology behind search advertising is predicting the click-through rate (CTR) of ads, because the economic model behind search advertising requires CTR values to rank ads and to price clicks.

Training instances derived from session logs of soso.com, the Tencent proprietary search engine, were provided to us. The original purpose of this data set was to accurately predict the CTR of ads in the testing instances. Our objective, however, is to use it for our R&D project, whose goal is to develop an innovative visual analytics model that end users can leverage to derive business insight. Our target domains are Internet-based services, including the Internet of Things.

Data Set

The training data set consists of 150 million query records for 23 million search users, for a total of 10 GB of data. Multiple queries with the same properties and outputs were rolled up into one record in the training data. Expanded to individual queries, this produced 235 million training records.

Each query and its output were described by 10 variables as follows:

adUrlID: This is a property of the ad. The URL was shown together with the title and description of an ad. It was usually the shortened landing page URL of the ad.
adID: The unique ID of an advertisement.
advertiserID: The unique advertiser ID of the adID. It is a property of the ad. Some advertisers might produce more attractive advertisements than others.
depth: The number of ads displayed to the user in a query session. The maximum depth is 3.
position: The position of the adID in a query session. The maximum position is 3.
queryID: The unique ID of the query.
keywordID: A number representing a keyword used in the ad.
titleID: A number representing the ad title.
descriptionID: A number representing a description of the ad.
userID: The unique ID of a user who conducts the query.

The IDs above were all hash-mapped to integers. Each queryID, keywordID, titleID, and descriptionID was associated with a set of keywords, also hash-mapped to integers and provided in four extra data files, which give a detailed description of the query and the advertisement.

If a user could not be identified, 0 was assigned to the userID (30% of the data). In addition, the gender (Male=1, Female=2, Unknown=0) and age group ((0,12]=1, (12,18]=2, (18,24]=3, (24,30]=4, (30,40]=5, and over 40=6) of each userID were provided in another data file. Two variables described the response of a user to an advertisement: the number of clicks and the number of impressions. For each user, query, and resulting ad, the number of impressions was the number of times the ad was displayed to the user, and the number of clicks was the number of times the user clicked the ad. Consequently, the CTR was the ratio of clicks to impressions. The average CTR of the training data was 0.0387.
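The CTR definition above can be illustrated with a small sketch. The data frame below is a hypothetical toy example, not taken from the data set, and the source does not state whether the reported average CTR is a pooled ratio or a mean of per-record ratios, so both are shown.

```r
# Hypothetical toy records; the numbers are illustrative only.
records <- data.frame(
  userID     = c(0, 17, 17, 42),   # 0 marks an unidentified user
  impression = c(4, 10, 1, 5),
  click      = c(0, 1, 0, 1)
)

# Per-record CTR: clicks divided by impressions.
records$ctr <- records$click / records$impression

# Pooled CTR: total clicks over total impressions.
pooled_ctr <- sum(records$click) / sum(records$impression)

# Unweighted mean of the per-record CTRs (generally a different number).
mean_ctr <- mean(records$ctr)

# Share of records that belong to an unidentified user.
unknown_share <- mean(records$userID == 0)
```

The pooled ratio weights each record by its impressions, whereas the unweighted mean treats a record with one impression the same as a record with ten.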

Data Set Transformation

The data set was reduced because the purpose of the project was not to develop an optimal analytical model but to develop an innovative visual analytics data product, which requires running hundreds of simulations quickly.

The data set reduction algorithm was based on least user usage of search, measured by the number of search queries issued by each user. The training data set was then randomly split into training (70%), validation (15%), and test (15%) sets.
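A random 70/15/15 split of this kind can be sketched in a few lines of base R. This is an illustration only: the record count and the seed are hypothetical, and the project performed its transformations in the database rather than in R.

```r
set.seed(1)   # arbitrary seed, only to make the sketch reproducible
n <- 1000     # hypothetical number of records

# Shuffle the row indices once, then cut the shuffled vector at 70% and 85%.
idx     <- sample.int(n)
n_train <- round(0.70 * n)
n_valid <- round(0.15 * n)

train_idx <- idx[seq_len(n_train)]
valid_idx <- idx[n_train + seq_len(n_valid)]
test_idx  <- idx[(n_train + n_valid + 1):n]
```

Because the indices are shuffled once and then sliced, the three sets are disjoint and together cover every record.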

Feature Engineering

16 derived feature tables were created, one for each of the following keys:

adUrlID
adID
advertiserID
depth
position
gender
age
userID
adID_gender
adID_age
advertiserID_gender
advertiserID_age
depth_gender
depth_age
position_gender
position_age

Each derived feature table has the following general format:

Column        Type / definition
ID(s)         numeric
impression    numeric
click         numeric
ctr           click/impression, numeric
low_ctr       ctr = 0
medium_ctr    ctr > 0 & ctr < 2
high_ctr      ctr > 2
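As an illustration of the table format above, the following sketch builds one derived feature table keyed by adID from toy data. The data values are hypothetical. The thresholds 0 and 2 come from the table; the sketch assumes they are expressed in percent, which the source does not state. The real tables were built in PostgreSQL, not in R.

```r
# Hypothetical instance-level data; values are made up for illustration.
instances <- data.frame(
  adID       = c(101, 101, 102, 103, 103),
  impression = c(10, 30, 100, 20, 20),
  click      = c(0, 0, 1, 2, 2)
)

# Roll up impressions and clicks per adID.
ad_table <- aggregate(cbind(impression, click) ~ adID, data = instances, FUN = sum)

# CTR as a ratio, plus a percentage version used only for binning
# (assumption: the thresholds in the table are percentages).
ad_table$ctr <- ad_table$click / ad_table$impression
ctr_pct <- 100 * ad_table$ctr

# Indicator columns following the table definitions literally.
ad_table$low_ctr    <- as.integer(ctr_pct == 0)
ad_table$medium_ctr <- as.integer(ctr_pct > 0 & ctr_pct < 2)
ad_table$high_ctr   <- as.integer(ctr_pct > 2)

print(ad_table)
```

The combined tables such as adID_gender would use two grouping columns in the same way, for example `cbind(impression, click) ~ adID + gender`.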

Machine Learning Algorithms

Over the last 18 months we have evaluated a number of machine learning algorithms based on the type of problem space (clustering, classification, anomaly detection, prediction) and analytics approaches.

Our focus is on ensemble learning, which combines several algorithms based on Self-Organizing Map (SOM), Particle Swarm Optimization, Neural Network, Random Forest, k-means, and Social Network Graph Based Analytics (SNA).

This R code focuses on SOM, k-means, and Neural Network. Other R code will focus on SOM-SNA and on Random Forest.

Codebook

All dataset reduction and transformation steps are available in a separate, extensive codebook file. A PostgreSQL database was used to store all datasets. Almost all data transformations were conducted in the database, and the R programming language was used for the analytics and visualization. Information is also exported to Gephi for graph visualization.

  library(RPostgreSQL)

## Loading required package: DBI

  library(kohonen)

## Loading required package: class
## Loading required package: MASS

  library(neuralnet)

## Loading required package: grid

  library(ROCR)

## Loading required package: gplots
## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009
## 
## Attaching package: 'gplots'
## 
## The following object is masked from 'package:stats':
## 
##     lowess
## 
## 
## Attaching package: 'ROCR'
## 
## The following object is masked from 'package:neuralnet':
## 
##     prediction

  library(RColorBrewer)
  library(png)

  ## Function to draw the polygon for one hexagon of the custom SOM.
  ## (x, y) is the lower-left reference point of the cell; col has no default
  ## because a self-referential default (col = col) fails when col is omitted.
  Hexagon <- function(x, y, unitcell = 1, col) {
    polygon(c(x, x, x + unitcell/2, x + unitcell, x + unitcell, x + unitcell/2),
            c(y + unitcell * 0.125, y + unitcell * 0.875, y + unitcell * 1.125,
              y + unitcell * 0.875, y + unitcell * 0.125, y - unitcell * 0.125),
            col = col, border = NA)
  }

  coolBlueHotRed <- function(n, alpha = 1) {
    rainbow(n, end=4/6, alpha=alpha)[n:1]
  }

  pretty_palette <- c("#1f77b4","#ff7f0e","#2ca02c",
                      "#d62728","#9467bd","#8c564b","#e377c2")

The next chunk loads the database driver and connects to the database with credentials. Its R code is hidden (echo=FALSE).
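Since the connection chunk is hidden, the following is only a sketch of what such a connection might look like with RPostgreSQL. The helper name `connect_search_db` and the use of the `PG*` environment variables are assumptions made for illustration; the actual connection settings are not shown in the source.

```r
# Hypothetical helper: the function name and the environment variables are
# assumptions, not the connection code used in this report.
connect_search_db <- function() {
  DBI::dbConnect(
    RPostgreSQL::PostgreSQL(),                      # PostgreSQL DBI driver
    dbname   = Sys.getenv("PGDATABASE"),
    host     = Sys.getenv("PGHOST", "localhost"),
    port     = as.integer(Sys.getenv("PGPORT", "5432")),
    user     = Sys.getenv("PGUSER"),
    password = Sys.getenv("PGPASSWORD")
  )
}

## Usage (requires a reachable database):
## con <- connect_search_db()
## ... run queries ...
## DBI::dbDisconnect(con)
```

Reading credentials from the environment keeps them out of the rendered report, which is presumably also why the original chunk is not echoed.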

  ## Training: Size can vary from 500K to 5.5M 
  advertisingtable <- paste("select * from search_instance LIMIT 750000 OFFSET 0")
  ## Validation
  ##advertisingtable <- paste("select * from search_instance LIMIT 1500000 OFFSET 5500000")

  ## Testing
  ##advertisingtable <- paste("select * from search_instance LIMIT 1500000 OFFSET 7000000")

  rs <- dbSendQuery(con, advertisingtable)

  ## fetch all elements from the result set
  result_set <- fetch(rs,n=-1)

  ## We started with a large number of input features and have reduced them based on our analysis.
  data_train <- result_set[,c("position","adid_low","adid_medium","adid_high","adid_age_low","adid_age_medium","adid_age_high","adid_gender_low","adid_gender_medium","adid_gender_high","advertiserid_low","advertiserid_medium","advertiserid_high","advertiserid_age_low","advertiserid_age_medium","advertiserid_age_high","advertiserid_gender_low","advertiserid_gender_medium","advertiserid_gender_high", "male","female","unknowngender","age0to12","age12to18","age18to24","age24to30","age30to40","age40plus","clicked", "notclicked")]

  ## Scaling is not required as this has already been done in the database
  #data_train_matrix <- as.matrix(scale(data_train))

  data_train_matrix <- as.matrix(data_train)

Several hundred simulations were conducted using different SOM parameters, including custom parameters.
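One way to organise such a parameter sweep is a grid of candidate settings, one row per simulation. The sketch below is illustrative only: the candidate values are hypothetical, since the source does not list the settings that were actually tried.

```r
# Hypothetical candidate values; the source does not list the settings it tried.
param_grid <- expand.grid(
  xdim        = c(5, 7, 10),
  rlen        = c(7, 50, 100),
  alpha_start = c(0.05, 0.10),
  topo        = c("hexagonal", "rectangular"),
  stringsAsFactors = FALSE
)
param_grid$ydim <- param_grid$xdim   # square maps only, as in the 7 x 7 map below

# Number of simulations implied by the grid.
n_runs <- nrow(param_grid)

## Each row p <- param_grid[i, ] would then drive one run, for example:
## som(data, grid = somgrid(p$xdim, p$ydim, p$topo), rlen = p$rlen,
##     alpha = c(p$alpha_start, 0.01))
```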

  ## Create the SOM Grid - you generally have to specify the size of the 
  ## training grid prior to training the SOM.
  som_grid <- somgrid(xdim = 7, ydim=7, topo="hexagonal")
  som1total <- 49
  som1row <-7
  som1col <-7

  ## Finally, train the SOM, options for the number of iterations,
  ## the learning rates, and the neighbourhood are available
  set.seed(21121962)
  som_model <- som(subset(data_train_matrix, select=-c(clicked,notclicked)), 
                   grid=som_grid, 
                   rlen=7, 
                   alpha=c(0.05,0.01), 
                   keep.data = TRUE, 
                   n.hood="circular")

  ## Training progress. This shows the variation between the weights of the nodes and 
  ## the cases presented to them. Over time, each node's weights should closely 
  ## match its winning cases. This also shows how many iterations are required before 
  ## the mean distance is minimized. This can be used to determine the optimal size
  ## of the SOM. If it is too small, it may have a hard time converging to a minimum.
  plot(som_model, type="changes",main="Training Progress")

plot of chunk SOMTrainingPlot
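To make the quantity in the training progress plot concrete, here is a minimal base-R sketch of online SOM training that records the mean distance between each case and its winning node per iteration. It is not the kohonen implementation: it uses toy data, a rectangular 3 x 3 grid instead of the hexagonal 7 x 7 grid, and an assumed linear schedule for the neighbourhood radius. Only the learning-rate range (0.05 to 0.01) and rlen = 7 mirror the call above.

```r
set.seed(1)
X     <- matrix(runif(200), ncol = 2)           # toy data: 100 cases, 2 features
grid  <- expand.grid(x = 1:3, y = 1:3)          # 3 x 3 rectangular map (assumption)
codes <- X[sample(nrow(X), nrow(grid)), ]       # initialise codebook vectors from the data

rlen   <- 7
alphas <- seq(0.05, 0.01, length.out = rlen)    # decreasing learning rate
radii  <- seq(2, 0.5, length.out = rlen)        # shrinking neighbourhood (assumed values)
mean_dist <- numeric(rlen)

for (epoch in seq_len(rlen)) {
  d_epoch <- numeric(nrow(X))
  for (i in sample(nrow(X))) {
    diff <- sweep(codes, 2, X[i, ])             # codes minus case i, row by row
    d    <- rowSums(diff^2)                     # squared distance to every node
    bmu  <- which.min(d)                        # winning node
    d_epoch[i] <- sqrt(d[bmu])
    # nodes within the current radius of the winner on the map
    gd   <- sqrt((grid$x - grid$x[bmu])^2 + (grid$y - grid$y[bmu])^2)
    hood <- gd <= radii[epoch]
    # move those nodes toward the case: w <- w + alpha * (x - w)
    codes[hood, ] <- codes[hood, , drop = FALSE] - alphas[epoch] * diff[hood, , drop = FALSE]
  }
  mean_dist[epoch] <- mean(d_epoch)             # one point of the progress curve
}
```

Plotting `mean_dist` against the iteration number gives a curve of the same kind as the training progress plot.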

  ## The SOM lets us visualise how many cases are mapped to each 
  ## node on the map. This metric can be used as a measure of map quality: ideally the 
  ## sample distribution is relatively uniform. Large values in some map areas suggest 
  ## that a larger map would be beneficial. If increasing the map size does not change this,
  ## it may indicate a large cluster of cases. 
  plot(som_model, type="count", palette.name= coolBlueHotRed,main="Counts Plot")

plot of chunk SOMNodeCountPlot
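The counts behind this plot are simply the number of cases assigned to each node. The sketch below uses a hypothetical vector of winning-node indices. In a fitted kohonen model the corresponding vector is `som_model$unit.classif`, available when the model is trained with keep.data = TRUE.

```r
set.seed(1)
n_units <- 49                                   # 7 x 7 map, as above
# Hypothetical winning node for each of 1000 cases.
unit_of_case <- sample.int(n_units, 1000, replace = TRUE)

# Cases per node; nodes that win no case get a count of 0.
counts <- tabulate(unit_of_case, nbins = n_units)

# A rough uniformity indicator: largest count relative to the mean count.
peak_ratio <- max(counts) / mean(counts)
```

A peak ratio far above 1 corresponds to the hot spots that the comment above describes.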

  ## Often referred to as the “U-Matrix”, this visualisation is of the distance 
  ## between each node and its neighbours. Typically viewed with a grayscale palette, 
  ## areas of low neighbour distance indicate groups of nodes that are similar. 
  ## Areas with large distances indicate the nodes are much more dissimilar – 
  ## and indicate natural boundaries between node clusters. 
  ## The U-Matrix can be used to identify clusters within the SOM map.
  plot(som_model, type="dist.neighbours",palette.name= coolBlueHotRed, main="Neighbour Distance Plot")

plot of chunk SOMUPlot
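The neighbour distance idea can be sketched in base R as follows. This is an illustration under stated assumptions: a rectangular 4 x 4 map with four-connected neighbours, random toy codebook vectors, and the mean distance to the neighbours as the summary. The kohonen plot may aggregate the distances differently.

```r
set.seed(1)
grid  <- expand.grid(x = 1:4, y = 1:4)             # node positions on the map
codes <- matrix(runif(nrow(grid) * 3), ncol = 3)   # toy codebook vectors, 3 variables

code_dist <- as.matrix(dist(codes))                # distances in feature space
grid_dist <- as.matrix(dist(grid))                 # distances on the map

# Immediate neighbours on a rectangular grid: map distance of exactly 1.
neighbours <- grid_dist > 0 & grid_dist <= 1

# Mean feature-space distance from each node to its neighbours.
u_values <- rowSums(code_dist * neighbours) / rowSums(neighbours)
u_matrix <- matrix(u_values, nrow = 4)
```

High entries of `u_matrix` mark nodes that differ strongly from their neighbours, which is where cluster boundaries are expected.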

  ## Shows the mean distance of objects mapped to a unit to the codebook vector of that unit. 
  ## The smaller the distances, the better the objects are represented by the codebook vectors.
  plot(som_model, type="quality",palette.name= coolBlueHotRed, main="Winning Node Inter Distance Plot")

  ## The node weight vectors, or “codes”, are made up of normalised values of the 
  ## original variables used to generate the SOM. Each node’s weight vector is 
  ## representative of the samples mapped to that node. 
  ## By visualising the weight vectors across the map, we can see patterns 
  ## in the distribution of samples and variables. 
  ## The default visualisation of the weight vectors is a “fan diagram”, 
  ## in which the magnitude of each variable in the weight vector is shown 
  ## as one segment of a fan for each node.
  ##plot(som_model, type="codes",palette.name = rainbow)

We plot the potential clusters using KMeans at this point because there is little point in doing further detailed analysis if the map is not of good quality. The previous plots may have provided some insight, but this plot provides the final validation.

  ## Look at the elbow point, which tells you the number of clusters
  mydata <- som_model$codes 
  wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var)) 
  for (i in 2:15) {
    wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
  }
  plot(wss)

plot of chunk SOMNumberofClusterPlot

  ## Plot cluster boundaries based on results found from previous plot(wss)
  ## use hierarchical clustering to cluster the codebook vectors
  som_cluster <- cutree(hclust(dist(som_model$codes)), 2)
  # plot these results:
  plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters") 
  add.cluster.boundaries(som_model, som_cluster)

plot of chunk SOMClustersPlot12

  ## Plot cluster boundaries based on results found from previous plot(wss)
  ## use hierarchical clustering to cluster the codebook vectors
  som_cluster <- cutree(hclust(dist(som_model$codes)), 3)
  ## plot these results:
  plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters") 
  add.cluster.boundaries(som_model, som_cluster)

plot of chunk SOMClustersPlot13

  ## Plot cluster boundaries based on results found from previous plot(wss)
  ## use hierarchical clustering to cluster the codebook vectors
  som_cluster <- cutree(hclust(dist(som_model$codes)), 4)
  ## plot these results:
  plot(som_model, type="mapping", bgcol = pretty_palette[som_cluster], main = "Clusters") 
  add.cluster.boundaries(som_model, som_cluster)

plot of chunk SOMClustersPlot14
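The three cluster plots above differ only in the number of clusters passed to `cutree`. As a side note, `cutree` also accepts a vector of cluster counts and returns one membership column per value, so the tree only needs to be built once. The sketch below uses toy codebook vectors, not the SOM codes from this report.

```r
set.seed(1)
# Toy codebook vectors: 40 nodes in two well-separated groups (hypothetical data).
codes <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
               matrix(rnorm(40, mean = 5), ncol = 2))

# Build the tree once, then cut it at 2, 3 and 4 clusters.
hc <- hclust(dist(codes))
memberships <- cutree(hc, k = 2:4)    # matrix: one row per node, one column per k

# Number of distinct clusters found in each column.
sizes <- apply(memberships, 2, function(m) length(unique(m)))
```

Each column of `memberships` could be passed in turn as the cluster vector to the mapping plot and to `add.cluster.boundaries`.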

  ## Heatmaps are perhaps the most important visualisation possible for Self-Organising Maps.
  ## A weight space view that tries to show all dimensions on 
  ## one diagram is unsuitable for a high-dimensional (>7 variable) SOM. 
  ## A SOM heatmap allows the visualisation of the distribution of a single variable 
  ## across the map. Typically, a SOM investigative process involves the creation of 
  ## multiple heatmaps, and then the comparison of these heatmaps to identify 
  ## interesting areas on the map. 
  ## It is important to remember that the individual sample positions do not move 
  ## from one visualisation to another, the map is simply coloured by different variables.
  ## The default Kohonen heatmap is created by using the type “property”, and then providing
  ## one of the variables from the set of node weights. 
  plot(som_model, type = "property", property = som_model$codes[,1], palette.name=coolBlueHotRed, main=names(data_train)[1])