The support vector machine (SVM), also called a support vector network, is one of the most popular supervised learning algorithms in machine learning. In real-world applications, SVM can be used to solve both classification and regression problems. The kernel method enables SVM to adapt to patterns in the data by nonlinearly mapping the data from the original space into a higher-dimensional space. The radial basis function (RBF), one of the most broadly used kernels, enhances SVM's flexibility and robustness in fitting a given data distribution. However, combining SVM with the RBF kernel makes it technically difficult for data scientists to find the optimal parameters (Gamma and C) and thereby arrive at well-optimized models. Using an SVM-RBF regression model on the Boston data (an R data set) as an example, I would like to demonstrate how to tune the SVM-RBF model parameters efficiently and effectively.
This project is based on the Boston dataset, which can be downloaded from GitHub. The dataset consists of 506 observations and 14 variables, defined as follows:
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town
4. CHAS - Charles River dummy variable (1 if tract bounds river; else 0)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. BLACK - 1000(BK - 0.63)^2 where BK is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - median value of owner-occupied homes in $1000's
Download the data and read it into the R environment
## load required packages: e1071 provides svm(), GA provides ga()
library( e1071 )
library( GA )
boston_data <- read.csv( "boston.csv" )
View( boston_data )

Explore Boston Dataset
summary(boston_data)
X crim zn indus chas
Min. : 1.0 Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.:127.2 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median :253.5 Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean :253.5 Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.:379.8 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :506.0 Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis rad
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
tax ptratio black lstat medv
Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00
1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
Median :330.0 Median :19.05 Median :391.44 Median :11.36 Median :21.20
Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65 Mean :22.53
3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00

All the variables are plotted against one another (including the target variable MEDV) in a scatterplot matrix:
plot(boston_data)
Gamma: this parameter controls the width of the RBF kernel and thereby reshapes the decision boundary; a larger Gamma lets each training point influence only its close neighbours, grouping similar data points more tightly (see the kernel sketch below).
C: this parameter, called the penalty parameter, controls how heavily errors on the training data are penalized, trading training accuracy against model simplicity.
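To make Gamma's role concrete, here is a minimal sketch of the RBF kernel computation (the helper name rbf_kernel is illustrative, not part of e1071):
## RBF kernel: K(x1, x2) = exp( -gamma * ||x1 - x2||^2 )
## larger gamma -> faster decay -> each point influences a smaller neighbourhood
rbf_kernel <- function( x1, x2, gamma )
{
  exp( -gamma * sum( ( x1 - x2 ) ^ 2 ) )
}
## example: kernel similarity shrinks as gamma grows
rbf_kernel( c( 1, 2 ), c( 2, 3 ), gamma = 0.1 )   ## ~0.82
rbf_kernel( c( 1, 2 ), c( 2, 3 ), gamma = 2 )     ## ~0.018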
A genetic algorithm (GA) is a heuristic, parallel optimization process inspired by natural selection: individuals with high fitness survive while individuals with low fitness are eliminated. In a GA, simulated genetic operators, namely mutation, crossover and selection, are applied iteratively to generate promising offspring. After multiple iterations, high-fitness chromosomes (also called individuals) are obtained. John Holland proposed the GA in the 1960s and was later recognised as the father of the genetic algorithm; GAs are now widely used in engineering optimization and operational research (https://en.wikipedia.org/wiki/Genetic_algorithm).
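For intuition only, one generation of a real-valued GA might be sketched as follows; this is a simplified illustration, not how the GA package implements its operators:
## illustrative single GA generation: selection, crossover, mutation
evolve_one_generation <- function( population, fitness_vals, mutation_rate = 0.1 )
{
  n <- nrow( population )
  ## selection: sample parents with probability proportional to fitness rank
  probs   <- rank( fitness_vals ) / sum( rank( fitness_vals ) )
  parents <- population[ sample( n, n, replace = TRUE, prob = probs ), , drop = FALSE ]
  ## crossover: blend each parent with a randomly chosen mate
  mates     <- parents[ sample( n ), , drop = FALSE ]
  offspring <- ( parents + mates ) / 2
  ## mutation: perturb a small fraction of genes with Gaussian noise
  mutate <- matrix( runif( length( offspring ) ) < mutation_rate, nrow = n )
  offspring[ mutate ] <- offspring[ mutate ] + rnorm( sum( mutate ), sd = 0.1 )
  offspring
}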
Split data into training & testing partitions for cross validation
## 5-fold cross validation
K <- 5
fold_inds <- sample( 1 : K, nrow( boston_data ), replace = TRUE )
## split boston data into training & testing partitions
cv_data <- lapply(
  1 : K,
  function( index )
    list(
      train_data = boston_data[ fold_inds != index, , drop = FALSE ],
      test_data  = boston_data[ fold_inds == index, , drop = FALSE ]
    )
)

Define fitness function for GA iteration
rmsd <- function( train_data, test_data, c, gamma )
{
  ## train SVM-RBF regression model on the training partition
  model <- svm(
    medv ~ .,
    data = train_data,
    cost = c,
    gamma = gamma,
    type = "eps-regression",
    kernel = "radial"
  )
  ## predict on the testing partition and calculate RMSD
  ## (note the sqrt: without it the value would be the mean squared error)
  rmsd <- sqrt( mean(
    ( predict( model, newdata = test_data ) - test_data$medv ) ^ 2
  ) )
  ## return calculated RMSD
  return ( rmsd )
}
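As a quick sanity check before launching the GA (the parameter values here are arbitrary trial values), the function can be evaluated on a single fold:
## held-out RMSD on the first fold with arbitrary trial parameters
with( cv_data[[ 1 ]], rmsd( train_data, test_data, c = 1, gamma = 0.1 ) )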
fitness_func <- function( x, cv_data )
{
  ## fetch SVM parameters from the GA chromosome
  gamma_val <- x[ 1 ]
  c_val <- x[ 2 ]
  ## use cross validation to estimate RMSD on each partition of the data set
  rmsd_vals <- sapply(
    cv_data,
    function( input_data ) with(
      input_data,
      rmsd( train_data, test_data, c_val, gamma_val )
    )
  )
  ## return negative mean RMSD, since ga() maximizes the fitness function
  return ( -mean( rmsd_vals ) )
}

Execute GA to achieve optimal SVM-RBF model parameters
## set value range for the parameters: Gamma & C
para_value_min <- c( gamma = 1e-3, c = 1e-4 )
para_value_max <- c( gamma = 2, c = 10 )
## run genetic algorithm
results <- ga( type = "real-valued",
  fitness = fitness_func,
  cv_data,                            ## passed through ... to the fitness function
  names = names( para_value_min ),
  lower = para_value_min,             ## lower/upper replace the deprecated min/max
  upper = para_value_max,             ## arguments in recent versions of the GA package
  popSize = 50,
  maxiter = 100
)

The GA determines the optimal SVM-RBF model parameters Gamma and C by minimizing RMSD. In other words, the GA-optimized parameters yield a model that is more robust under cross validation, generalizes better across randomly split test samples, and shows less bias between model predictions and actual observations. The optimized Gamma and C are averaged from a group of high-fitness individuals, giving Gamma = 0.0889 and C = 1.8631.
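The tuned values can be extracted from the ga object and used to refit a final model on the full data set. This is a sketch assuming the GA package's @solution slot, which may hold several equally fit individuals:
## average the best individuals to obtain the tuned parameters
opt_params <- colMeans( results@solution )
## refit SVM-RBF on all observations with the GA-tuned Gamma and C
final_model <- svm(
  medv ~ .,
  data = boston_data,
  cost = opt_params[ "c" ],
  gamma = opt_params[ "gamma" ],
  type = "eps-regression",
  kernel = "radial"
)
## in-sample RMSD of the tuned model
sqrt( mean( ( predict( final_model, newdata = boston_data ) - boston_data$medv ) ^ 2 ) )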
The fitness of the overall population increases with each iteration, which corresponds to a gradual decrease in RMSD as the model evolves and improves. The following figure shows six summary statistics of population fitness (min, Q1, median, mean, Q3, max) and their dynamics across iterations.
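One way to reproduce such a figure is to plot the per-iteration fitness statistics that the GA package stores in the @summary slot of the result object; the column names noted below are an assumption about that slot:
## results@summary stores per-iteration fitness statistics (max, mean, q3, median, q1, min)
fitness_stats <- as.data.frame( results@summary )
matplot(
  seq_len( nrow( fitness_stats ) ), fitness_stats,
  type = "l", lty = 1, col = seq_len( ncol( fitness_stats ) ),
  xlab = "Iteration", ylab = "Fitness (negative RMSD)"
)
legend( "bottomright", legend = colnames( fitness_stats ),
  lty = 1, col = seq_len( ncol( fitness_stats ) ) )
Alternatively, the GA package's built-in plot( results ) method draws the best and mean fitness per iteration directly.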