Creating Data

The dataset for this machine learning task is created with the Java program below. The program writes two datasets, one built from Graphlet search and one built from SimpleGraphVector:

public class mlCreateFeatures {

    private static String filePath = "Input/";

    public static void main(String[] args) {

        String fileName = "MUTAG"; // alternatives: IMDB-BINARY, COLLAB, ENZYMES
        outPutFeatures(fileName);
        outPutFeaturesGraphVector(fileName);
        // Files are available in the project folder /Output
    }

    // outPutFeatures and outPutFeaturesGraphVector are not shown in this excerpt
}

Description

  • GraphID: identifies the graph and matches it to the input file. This column is ignored during the machine learning task.
  • Class: the class label. Our task is to predict it from the training data.
  • Components: integer giving the number of connected components in each graph.
  • AvgDegree: average outgoing degree over all nodes.
  • Density: density of the graph (see https://en.wikipedia.org/wiki/Dense_graph).
  • t-0…t-38: the array output of Graphlet search (in the ML-.csv datasets, e.g. ML-MUTAG.csv). These columns appear as d0…d38 in the R output further below.
  • t-0…t-16: the array output of SimpleGraphVector.toArray (in the ML-GV-.csv datasets, e.g. ML-GV-MUTAG.csv). R reads these names as t.0…t.16.
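
To make the three structural features concrete, here is a minimal R sketch on a made-up five-vertex graph. It assumes an undirected simple graph, so the average degree is 2|E|/|V| and the density is 2|E|/(|V|(|V|-1)). The Java program may follow a directed convention (the description speaks of outgoing degree), in which case the factor 2 drops out. All variable names are illustrative and not taken from the Java code.

```r
# Toy undirected graph (hypothetical): path 1-2-3-4 plus the isolated vertex 5
n.vertices <- 5
edges <- matrix(c(1, 2,
                  2, 3,
                  3, 4), ncol = 2, byrow = TRUE)
n.edges <- nrow(edges)

# Average degree: every undirected edge adds one to the degree of two vertices
avg.degree <- 2 * n.edges / n.vertices

# Density: edges present divided by the maximum possible n(n-1)/2
density <- 2 * n.edges / (n.vertices * (n.vertices - 1))

# Connected components by label propagation: both endpoints of an edge
# take the smaller of their two labels until nothing changes any more
comp <- seq_len(n.vertices)
repeat {
  changed <- FALSE
  for (i in seq_len(n.edges)) {
    ends <- edges[i, ]
    m <- min(comp[ends])
    if (any(comp[ends] != m)) {
      comp[ends] <- m
      changed <- TRUE
    }
  }
  if (!changed) break
}
n.components <- length(unique(comp))

print(c(AvgDegree = avg.degree, Density = density, Components = n.components))
```

For this toy graph the sketch gives an average degree of 1.2, a density of 0.3 and 2 components.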

Using Data

The machine learning process consists of the following steps:
0. Prepare Environment
1. Import data
2. Data cleansing
3. Train Model
4. Predict
5. Analyse performance

0) Prepare Environment

To do the analysis, we need to load the required packages.

rm(list = ls(all = TRUE)) # Remove any existing variables
require(dplyr)
require(tidyr)
require(caret)
require(nnet)
require(pROC)
require(e1071)
require(rpart)
## For reproducible results we need to start with the same seed
set.seed(1333) 
if(.Platform$OS.type =="windows")
{
  setwd("D:/gitRepos/PerfAnalysis/")
  
}else {
  setwd("~/gitRepos/PerfAnalysis/")
  ##library(doMC)
  ##registerDoMC(cores = 4)
}
source('myMultiClassSummary.R')
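
Because require() only returns FALSE when a package is missing, a quick up-front check can save a failed run later. The sketch below is an optional addition, not part of the original script. It also lists kernlab and randomForest, which caret uses as back-ends for the svmLinear/svmPoly/svmRadial and rf methods trained in step 3; the variable names are illustrative.

```r
# Packages loaded above, plus the back-ends caret needs for the models in step 3
pkgs <- c("dplyr", "tidyr", "caret", "nnet", "pROC", "e1071", "rpart",
          "kernlab", "randomForest")

# requireNamespace() returns FALSE instead of stopping when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
not.installed <- pkgs[!installed]

if (length(not.installed) > 0) {
  message("Install before running: ", paste(not.installed, collapse = ", "))
}
```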

1) Import data

We first read the CSV files ML-MUTAG.csv and ML-GV-MUTAG.csv. Here GS is short for GraphletSearch and GV is short for GraphVector.

df.input.GS<-read.csv2(file="./input/ML-MUTAG.csv",
                       sep=";",
                       header = TRUE,
                       dec = ".")
df.input.GV<-read.csv2(file="./input/ML-GV-MUTAG.csv",
                       sep=";",
                       header = TRUE,
                       dec = ".")
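
The effect of the sep and dec arguments is easier to see on a tiny file. The sketch below writes a made-up three-row sample in the same layout (semicolon-separated, "." as decimal mark) to a temporary file and reads it back. The rows are invented for illustration and are not taken from MUTAG. It also shows how R renames a header such as t-0 and how make.names, used in the data cleansing step, turns the labels 1 and -1 into valid R names.

```r
# Hypothetical sample in the same layout as the ML-*.csv files
tmp <- tempfile(fileext = ".csv")
writeLines(c("GraphID;Class;Density;t-0",
             "1;1;0.25;3",
             "2;-1;0.5;0",
             "3;1;0.125;7"), tmp)

# read.csv2 defaults to dec = ",", so dec = "." must be set explicitly
df.sample <- read.csv2(file = tmp, sep = ";", header = TRUE, dec = ".")

str(df.sample)           # Density is numeric; the header t-0 has become t.0
class.names <- make.names(df.sample$Class)
print(class.names)       # labels 1 and -1 become X1 and X.1

unlink(tmp)
```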

Let's look at the dimensions of each data frame:

   Rows Columns
GS  188      47
GV  188      21

2) Data cleansing

  1. First we need to correct the column data types so that the classification algorithms work: GraphID and Class are nominal (called factor in R).
  2. Then we find the columns with zero variance and remove them from the datasets.

df.input.GV$GraphID <- as.factor(df.input.GV$GraphID)
df.input.GV$Class <- as.factor(make.names(df.input.GV$Class))
df.input.GS$GraphID <- as.factor(df.input.GS$GraphID)
df.input.GS$Class <- as.factor(make.names(df.input.GS$Class))
## Finding columns with zero (or near-zero) variance
nzv.GS <- nearZeroVar(df.input.GS)
nzv.GV <- nearZeroVar(df.input.GV)
## Removal step: not in the original listing, assumed from the output that follows
if (length(nzv.GS) > 0) df.input.GS <- df.input.GS[, -nzv.GS]
if (length(nzv.GV) > 0) df.input.GV <- df.input.GV[, -nzv.GV]

The following columns/features have zero (or near-zero) variance and can be removed:

Columns for GS to be removed:  Components; d0; d1; d2; d3; d5; d6; d7; d8; d9; d10; d13; d14; d15; d16; d17; d20; d21; d22; d23; d25; d26; d27; d28; d29; d30; d31; d32; d33; d34; d35; d36; d37; d38; X
Columns for GV to be removed:  Components; t.0; t.11; t.12; t.13; t.16; X
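
As a rough illustration of what "zero variance" means here, the following base R sketch drops every column that holds a single distinct value. It runs on a made-up data frame and is not a replacement for caret's nearZeroVar, which by default also flags columns that are merely close to constant. The column names mirror the real data, but the values are invented.

```r
# Hypothetical miniature of the input: Components is constant, X is all NA
df.toy <- data.frame(GraphID    = 1:4,
                     Components = c(1, 1, 1, 1),
                     Density    = c(0.2, 0.3, 0.25, 0.4),
                     X          = NA)

# A column has zero variance when it contains at most one distinct value
zero.var <- vapply(df.toy, function(col) length(unique(col)) <= 1, logical(1))

df.clean <- df.toy[, !zero.var, drop = FALSE]
print(names(df.toy)[zero.var])   # columns dropped
print(names(df.clean))           # columns kept
```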

Now let's look at the shape of our datasets after removing the zero-variance columns:

   Rows Columns
GS  188      12
GV  188      14

Columns for GS:  GraphID; Class; Verticies; Edges; AvgDegree; Density; d4; d11; d12; d18; d19; d24
Columns for GV:  GraphID; Class; t.1; t.2; t.3; t.4; t.5; t.6; t.7; t.8; t.9; t.10; t.14; t.15

The dimensions correspond to the following graphlet types:

[Figure: All type 3, 4, 5 graphlets.]

  3. Partition each dataset into training (70%) and test (30%) sets so that the models can be evaluated on held-out data.

## Randomly partition the data into two datasets, training and testing
#GraphletSearch
inTrain.GS<-createDataPartition(y=df.input.GS$Class,p = .7,list=FALSE)
train.GS<-df.input.GS[inTrain.GS,]   
test.GS<- df.input.GS[-inTrain.GS,] 
#GraphVector
inTrain.GV<-createDataPartition(y=df.input.GV$Class,p = .7,list=FALSE)
train.GV<-df.input.GV[inTrain.GV,]   
test.GV<- df.input.GV[-inTrain.GV,] 
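
When y is a factor, createDataPartition samples within each class, so the training set keeps roughly the same class proportions as the full data. The base R sketch below imitates that idea on made-up labels. It is only an approximation of the concept, not caret's implementation, and the 60/40 class counts are invented.

```r
set.seed(1333)

# Hypothetical labels: 60 graphs of class X1 and 40 of class X.1
y <- factor(c(rep("X1", 60), rep("X.1", 40)))

# Draw 70% of the row indices separately inside each class
idx.by.class <- split(seq_along(y), y)
in.train <- unlist(lapply(idx.by.class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
}), use.names = FALSE)

train.y <- y[in.train]
test.y  <- y[-in.train]

# Both parts keep the 60/40 class ratio of the full data
print(prop.table(table(train.y)))
print(prop.table(table(test.y)))
```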

3) Train Model

Now we train the following models:
* SVM with linear kernel
* SVM with polynomial kernel
* SVM with radial kernel
* Random forest

## Tuning grid for the radial kernel: all combinations of sigma and C
grid <- expand.grid(sigma = c(.01, .015, 0.2),
                    C = c(0.75, 0.9, 1, 1.1, 1.25))
## Tuning grid for the polynomial kernel: all combinations of C, scale and degree
grid.poly <- expand.grid(C = c(0.75, 0.9, 1, 1.1, 1.25),
                         scale = c(.0001),
                         degree = 1:2)
## Check whether this is binary or multiclass classification,
## because each requires a different summary function.
if(length(unique(df.input.GS$Class)) > 2) 
{
  ctrl <-trainControl(method="repeatedcv",
                      repeats=10,
                      classProbs=TRUE,
                      summaryFunction=myMultiClassSummary)
} else
{
  ctrl <- trainControl(method="repeatedcv",
                       repeats=10,
                       classProbs=TRUE,
                       summaryFunction=twoClassSummary)
}
## Parameters for Random Forest algorithm.
RF.control <- trainControl(method="repeatedcv", number=10, repeats=3)
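
To get a feeling for the cost of this setup, the sketch below counts the candidate parameter combinations and the resulting number of resampled model fits per SVM. It assumes 10 folds, which is the default of trainControl's number argument for cross-validation methods, combined with the repeats = 10 set above. The count variables are illustrative.

```r
# Same grids as above
grid <- expand.grid(sigma = c(.01, .015, 0.2),
                    C = c(0.75, 0.9, 1, 1.1, 1.25))
grid.poly <- expand.grid(C = c(0.75, 0.9, 1, 1.1, 1.25),
                         scale = c(.0001),
                         degree = 1:2)

n.folds   <- 10   # trainControl default for "repeatedcv" (assumed, not set explicitly above)
n.repeats <- 10   # repeats = 10 in ctrl

# Every candidate row is fitted once per fold and repeat
fits.radial <- nrow(grid) * n.folds * n.repeats
fits.poly   <- nrow(grid.poly) * n.folds * n.repeats

print(c(radial.candidates = nrow(grid), radial.fits = fits.radial,
        poly.candidates = nrow(grid.poly), poly.fits = fits.poly))
```

With these grids the radial kernel has 15 candidates (1500 resampled fits) and the polynomial kernel has 10 candidates (1000 resampled fits), before the final fit on the full training set.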

Before training, we center and scale the features (the preProc argument in the calls below), which generally helps scale-sensitive models such as SVMs.
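
Centering and scaling means subtracting each feature's mean and dividing by its standard deviation, with both statistics estimated on the training data and then reused for new data. The base R sketch below shows this on made-up numbers. caret performs the equivalent step internally when preProc = c("center","scale") is given, so this is only an illustration of the idea.

```r
# Hypothetical feature values
train.x <- data.frame(Edges = c(10, 20, 30), Density = c(0.1, 0.2, 0.3))
test.x  <- data.frame(Edges = 40,            Density = 0.2)

# Statistics come from the training set only
mu    <- colMeans(train.x)
sigma <- apply(train.x, 2, sd)

train.scaled <- scale(train.x, center = mu, scale = sigma)
test.scaled  <- scale(test.x,  center = mu, scale = sigma)

print(train.scaled[, "Edges"])   # -1, 0, 1
print(test.scaled)               # Edges = 2, Density = 0 (up to rounding)
```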

SVM linear

svmModel.linear.GS <- train(x=train.GS[,-c(1,2)],
                            y= train.GS[,2],
                            method = "svmLinear",
                            preProc = c("center","scale"),
                            trControl=ctrl)  
svmModel.linear.GV <- train(x=train.GV[,-c(1,2)],
                            y= train.GV[,2],
                            method = "svmLinear",
                            preProc = c("center","scale"),
                            trControl=ctrl)  

SVM with polynomial kernel

svmModel.Poly.GS <- train(x=train.GS[,-c(1,2)],
                          y= train.GS[,2], 
                          method = "svmPoly",
                          preProc = c("center","scale"),
                          tuneGrid = grid.poly,trControl=ctrl)  
svmModel.Poly.GV <- train(x=train.GV[,-c(1,2)],
                          y= train.GV[,2],
                          method = "svmPoly",
                          preProc = c("center","scale"),
                          tuneGrid = grid.poly,
                          trControl=ctrl)  

SVM with radial kernel

svmModel.Radial.GS <- train(x=train.GS[,-c(1,2)],
                            y= train.GS[,2],
                            method = "svmRadial",
                            preProc = c("center","scale"),
                            tuneGrid = grid,
                            trControl=ctrl)  
svmModel.Radial.GV <- train(x=train.GV[,-c(1,2)],
                            y= train.GV[,2],
                            method = "svmRadial",
                            preProc = c("center","scale"),
                            tuneGrid = grid,
                            trControl=ctrl)  

Random forest

# Bind the class as column y so that the formula interface (y ~ .) can be used
#GraphletSearch dataset
RfTrainGS<-cbind(y=train.GS[,2],train.GS[,-c(1,2)]) 
RfModelGS <- train(y~.,data=RfTrainGS,method = "rf",
                   preProcess=c("center","scale"),
                   trControl=RF.control,
                   prox=TRUE, 
                   tuneGrid=expand.grid(mtry = 5),
                   number=10,
                   ntree=500) 
#GraphVector dataset
RfTrainGV<-cbind(y=train.GV[,2],train.GV[,-c(1,2)]) 
RfModelGV <- train(y~.,data=RfTrainGV,method = "rf",
                   preProcess=c("center","scale"),
                   trControl=RF.control,
                   prox=TRUE,
                   tuneGrid=expand.grid(mtry = 5),
                   number=10,
                   ntree=500) 

4) Predict

We predict the Class value for each graph in the test dataset, so that we can later compare the predicted class with the actual class and calculate the performance.

## Predicting for different models for GraphletSearch test dataset
## -c(1,2) means removing GraphID,Class
prediction.y.linear.GS<-predict(svmModel.linear.GS,test.GS[,-c(1,2)]) 
prediction.y.Radial.GS<-predict(svmModel.Radial.GS,test.GS[,-c(1,2)])
prediction.y.Poly.GS  <-predict(svmModel.Poly.GS,test.GS[,-c(1,2)])
prediction.y.RF.GS<-predict(RfModelGS,test.GS[,-c(1,2)])
## Predicting for different models for GraphVector test dataset
prediction.y.linear.GV<-predict(svmModel.linear.GV,test.GV[,-c(1,2)])
prediction.y.Radial.GV<-predict(svmModel.Radial.GV,test.GV[,-c(1,2)])
prediction.y.Poly.GV  <-predict(svmModel.Poly.GV,test.GV[,-c(1,2)])
prediction.y.RF.GV<-predict(RfModelGV,test.GV[,-c(1,2)])

5) Analyse performance

First we calculate performance for each model, then we will put them in a table for each dataset:

confMatrix.linear.GS<-confusionMatrix(test.GS[,2],prediction.y.linear.GS)
confMatrix.linear.GV<-confusionMatrix(test.GV[,2],prediction.y.linear.GV)
confMatrix.radial.GS<-confusionMatrix(test.GS[,2],prediction.y.Radial.GS)
confMatrix.radial.GV<-confusionMatrix(test.GV[,2],prediction.y.Radial.GV)
confMatrix.poly.GS<-confusionMatrix(test.GS[,2],prediction.y.Poly.GS)
confMatrix.poly.GV<-confusionMatrix(test.GV[,2],prediction.y.Poly.GV)
confMatrix.rf.GS<-confusionMatrix(test.GS[,2],prediction.y.RF.GS)
confMatrix.rf.GV<-confusionMatrix(test.GV[,2],prediction.y.RF.GV)
acc.GS<-cbind(
        Statistic='Accuracy',
        SVM_Linear_GS=confMatrix.linear.GS$overall['Accuracy'],
        SVM_Radial_GS=confMatrix.radial.GS$overall['Accuracy'],
        SVM_Poly_GS=confMatrix.poly.GS$overall['Accuracy'],
        RandForest_GS=confMatrix.rf.GS$overall['Accuracy'])
kappa.GS<-cbind(
        Statistic='Kappa',
        SVM_Linear_GS=confMatrix.linear.GS$overall['Kappa'],
        SVM_Radial_GS=confMatrix.radial.GS$overall['Kappa'],
        SVM_Poly_GS=confMatrix.poly.GS$overall['Kappa'],
        RandForest_GS=confMatrix.rf.GS$overall['Kappa'])
acc.GV<-cbind(
        Statistic='Accuracy',
        SVM_Linear_GV=confMatrix.linear.GV$overall['Accuracy'],
        SVM_Radial_GV=confMatrix.radial.GV$overall['Accuracy'],
        SVM_Poly_GV=confMatrix.poly.GV$overall['Accuracy'],
        RandForest_GV=confMatrix.rf.GV$overall['Accuracy'])
kappa.GV<-cbind(
        Statistic='Kappa',
        SVM_Linear_GV=confMatrix.linear.GV$overall['Kappa'],
        SVM_Radial_GV=confMatrix.radial.GV$overall['Kappa'],
        SVM_Poly_GV=confMatrix.poly.GV$overall['Kappa'],
        RandForest_GV=confMatrix.rf.GV$overall['Kappa'])
comparison.GS<-as.data.frame(rbind(acc.GS,kappa.GS))
comparison.GV<-as.data.frame(rbind(acc.GV,kappa.GV))
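
Mixing a character column such as Statistic with numbers in cbind coerces every value to character. One possible alternative, sketched below, keeps the values numeric. The list overall.GS is a hypothetical stand-in for the confMatrix.*.GS$overall vectors, filled with the GraphletSearch results reported in this document rounded to four decimals.

```r
# Stand-ins for confMatrix.<model>.GS$overall (rounded values from this document)
overall.GS <- list(
  SVM_Linear_GS = c(Accuracy = 0.8545, Kappa = 0.6697),
  SVM_Radial_GS = c(Accuracy = 0.8727, Kappa = 0.7150),
  SVM_Poly_GS   = c(Accuracy = 0.8727, Kappa = 0.7150),
  RandForest_GS = c(Accuracy = 0.8182, Kappa = 0.6094)
)

# One column per model, one row per statistic, all values numeric
comparison.GS <- as.data.frame(
  sapply(overall.GS, function(o) o[c("Accuracy", "Kappa")])
)
print(comparison.GS)

# which.max returns the first maximum, so ties go to the earlier column
best.by.kappa <- names(which.max(unlist(comparison.GS["Kappa", ])))
print(best.by.kappa)
```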

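The confusion matrix provides more detailed information about the performance of a model. As an example, for the GraphletSearch dataset using random forest:

Note: X1 means Class label “1” and X.1 refers to the Class label “-1” of the graph. caret's confusionMatrix expects the predictions as its first argument and the actual classes as its second. The calls above pass them in the opposite order, so in this output the rows labelled Prediction hold the actual classes and the columns labelled Reference hold the predictions. Accuracy and Kappa are unaffected by this, but Sensitivity and Pos Pred Value trade places, as do Specificity and Neg Pred Value.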

Confusion Matrix and Statistics

          Reference
Prediction X1 X.1
       X1  30   7
       X.1  3  15
                                         
               Accuracy : 0.8182         
                 95% CI : (0.691, 0.9092)
    No Information Rate : 0.6            
    P-Value [Acc > NIR] : 0.000462       
                                         
                  Kappa : 0.6094         
 Mcnemar's Test P-Value : 0.342782       
                                         
            Sensitivity : 0.9091         
            Specificity : 0.6818         
         Pos Pred Value : 0.8108         
         Neg Pred Value : 0.8333         
             Prevalence : 0.6000         
         Detection Rate : 0.5455         
   Detection Prevalence : 0.6727         
      Balanced Accuracy : 0.7955         
                                         
       'Positive' Class : X1             
                                         

In general, we can compare the models on each dataset by their accuracy and kappa values. Among models with similar accuracy, the one with the higher kappa is preferable, because kappa corrects the observed agreement for the agreement expected by chance.
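
Both statistics can be recomputed by hand from the four cells of the random forest confusion matrix printed above, which makes the chance correction in kappa explicit. The sketch uses only base R; the variable names are illustrative.

```r
# Cells of the random forest confusion matrix shown above
cm <- matrix(c(30, 3,
               7, 15), nrow = 2,
             dimnames = list(Prediction = c("X1", "X.1"),
                             Reference  = c("X1", "X.1")))
n <- sum(cm)

# Observed agreement: share of graphs on the diagonal
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: observed agreement beyond chance, rescaled to at most 1
kappa <- (accuracy - expected) / (1 - expected)

print(round(c(Accuracy = accuracy, Expected = expected, Kappa = kappa), 4))
```

The result reproduces the Accuracy of 0.8182 and the Kappa of 0.6094 from the caret output.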

The performance on the GraphletSearch and GraphVector datasets is shown below:

For GraphletSearch:

         SVM_Linear_GS SVM_Radial_GS SVM_Poly_GS RandForest_GS
Accuracy        0.8545        0.8727      0.8727        0.8182
Kappa           0.6697        0.7150      0.7150        0.6094


For GraphVector:

         SVM_Linear_GV SVM_Radial_GV SVM_Poly_GV RandForest_GV
Accuracy        0.8909        0.8364      0.8364        0.8909
Kappa           0.7591        0.6118      0.6336        0.7523

