Overview

Support vector machine (SVM), also called support vector network, is one of the most popular supervised learning algorithms in machine learning. Depending on the application, SVM can be used to solve classification and regression problems. The kernel method enables SVM to adapt to patterns in the data by nonlinearly mapping the data from the original space into a higher-dimensional space. The radial basis function (RBF), one of the most widely used kernels, makes SVM flexible enough to fit a wide range of data distributions. However, combining SVM with RBF makes it harder for data scientists to find the optimal parameters (Gamma and C) and hence an optimized model. Using an SVM-RBF regression model on the Boston data (an R data set) as an example, I would like to demonstrate how to tune SVM-RBF model parameters efficiently and effectively.

Data

This project is based on the Boston dataset. The dataset can be downloaded from GitHub. The dataset consists of 506 observations and 14 variables (the CSV file used here additionally carries a row-index column, X). The variables are defined as follows:

1.  CRIM    - per capita crime rate by town
2.  ZN      - proportion of residential land zoned for lots over 25,000 sq. ft.
3.  INDUS   - proportion of non-retail business acres per town
4.  CHAS    - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
5.  NOX     - nitric oxides concentration (parts per 10 million)
6.  RM      - average number of rooms per dwelling
7.  AGE     - proportion of owner-occupied units built prior to 1940
8.  DIS     - weighted distances to five Boston employment centres
9.  RAD     - index of accessibility to radial highways
10. TAX     - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. BLACK   - 1000(BK - 0.63)^2, where BK is the proportion of Black residents by town
13. LSTAT   - % lower status of the population
14. MEDV    - median value of owner-occupied homes in $1000's

Load Dataset

  • Download the data and read it into the R environment

     boston_data <- read.csv("boston.csv")
     View(boston_data)

Explore dataset

  • Explore Boston Dataset

     summary(boston_data)
           X              crim                zn             indus            chas        
     Min.   :  1.0   Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
     1st Qu.:127.2   1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
     Median :253.5   Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
     Mean   :253.5   Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
     3rd Qu.:379.8   3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
     Max.   :506.0   Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
        nox               rm             age              dis              rad        
     Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000  
     1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000  
     Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000  
     Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549  
     3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000  
     Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000  
        tax           ptratio          black            lstat            medv      
     Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73   Min.   : 5.00  
     1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95   1st Qu.:17.02  
     Median :330.0   Median :19.05   Median :391.44   Median :11.36   Median :21.20  
     Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65   Mean   :22.53  
     3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95   3rd Qu.:25.00  
     Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97   Max.   :50.00 

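Plot all independent variables against the target variable

Calling plot() on a data frame draws a scatterplot matrix, so every independent variable is plotted against the target variable MEDV (and against every other variable):

     plot(boston_data)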

Parameters Tuning

RBF Parameters: Gamma and C

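  • Gamma: the RBF kernel parameter. It controls how far the influence of a single training point reaches and therefore reshapes the fitted function: with a large Gamma the model follows only nearby (similar) data points, while a small Gamma gives a smoother fit.

  • C: the penalty (cost) parameter. It controls how strongly errors on the training data are penalized (misclassifications in classification, deviations outside the epsilon-insensitive tube in regression), trading off training error against model complexity.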
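
To make the role of Gamma more concrete, the sketch below evaluates the RBF kernel in the form documented for e1071, exp( -gamma * |u - v|^2 ). The helper name rbf_kernel and the sample points are illustrative assumptions, not part of the original analysis.

```r
## RBF kernel between two numeric vectors (illustrative helper, not part of e1071)
rbf_kernel <- function( u, v, gamma ) 
{
    exp( -gamma * sum( ( u - v ) ^ 2 ) )
}

u <- c( 1, 2 )
v <- c( 2, 4 )    # squared distance to u is 1 + 4 = 5

## similarity of the same two points for a small and a large gamma
k_small <- rbf_kernel( u, v, gamma = 0.01 )   # exp( -0.05 ), close to 1
k_large <- rbf_kernel( u, v, gamma = 2 )      # exp( -10 ), close to 0
```

With a small Gamma the two points still look similar to the model; with a large Gamma they are treated as almost unrelated, which is why Gamma has such a strong effect on the fitted function.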

Parameter Tuning Method: Genetic Algorithm

Genetic algorithm (GA) is a heuristic, parallel optimization process inspired by natural selection: high-fitness individuals survive while low-fitness individuals are eliminated. In GA, simulated genetic operators (mutation, crossover and selection) are applied iteratively to generate promising offspring. After multiple iterations, high-fitness chromosomes (also called individuals) are obtained. John Holland proposed GA in the 1960s and was later recognised as the father of the genetic algorithm. GA is now widely used in engineering optimization and operational research (https://en.wikipedia.org/wiki/Genetic_algorithm).
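
The loop described above can be sketched in a few lines of base R. This is a toy version under simplifying assumptions (one real-valued gene, tournament selection, blend crossover, Gaussian mutation, one elite individual); all names and settings are illustrative, and the GA package used later is considerably more complete.

```r
## toy genetic algorithm in base R (illustrative only)
set.seed( 1 )

fitness <- function( x ) -( x - 3 ) ^ 2     # maximum at x = 3
lower <- 0
upper <- 10
pop_size <- 20
n_iter <- 50

## random initial population
population <- runif( pop_size, lower, upper )
best_history <- numeric( n_iter )

for ( iter in 1 : n_iter ) 
{
    fit_vals <- fitness( population )
    elite <- population[ which.max( fit_vals ) ]

    ## selection: binary tournament, the fitter of two random individuals wins
    pick <- function() 
    {
        cand <- sample( pop_size, 2 )
        population[ cand[ which.max( fit_vals[ cand ] ) ] ]
    }
    parents_a <- replicate( pop_size, pick() )
    parents_b <- replicate( pop_size, pick() )

    ## crossover: random convex combination of the two parents
    w <- runif( pop_size )
    children <- w * parents_a + ( 1 - w ) * parents_b

    ## mutation: add Gaussian noise to about 10% of the children, keep them in range
    mutate <- runif( pop_size ) < 0.1
    children[ mutate ] <- children[ mutate ] + rnorm( sum( mutate ), sd = 0.5 )
    children <- pmin( pmax( children, lower ), upper )

    ## elitism: the best individual of this generation survives unchanged
    children[ 1 ] <- elite
    population <- children
    best_history[ iter ] <- max( fitness( population ) )
}

best_x <- population[ which.max( fitness( population ) ) ]
```

Because the elite individual is always carried over, the best fitness recorded in best_history never decreases from one generation to the next.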

R Packages for SVM and GA

  • SVM package: e1071

  • GA package: GA

How to Optimize Gamma and C

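  • Split data into training & testing partitions for cross validation

    ## drop the row-index column X (values 1..506 in the summary above), 
    ## so that it is not used as a predictor by the formula medv ~ .
    boston_data$X <- NULL

    ## 5-fold cross validation: assign each row a random fold label
    K <- 5 
    fold_inds <- sample( 1 : K, nrow( boston_data ), replace = TRUE )
    
    ## split boston data into training & testing partitions; 
    ## each fold serves once as the test set
    cv_data <- lapply( 
        1 : K, 
        function( index ) 
        list( 
            train_data = boston_data[ fold_inds != index, , drop = FALSE ], 
            test_data = boston_data[ fold_inds == index, , drop = FALSE ] 
        )
    )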
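
Sampling fold labels with replacement, as above, generally produces folds of unequal size. If equally sized folds are preferred, one possible alternative is to shuffle a repeated 1..K label vector. The sketch below is an assumption-laden illustration: the names n_obs and balanced_fold_inds are hypothetical, and n_obs is set to the 506 observations stated in the data description.

```r
## balanced alternative to random fold labels (hypothetical variable names)
set.seed( 123 )
K <- 5
n_obs <- 506    # number of observations in the Boston data

## repeat the labels 1..K up to length n_obs, then shuffle them
balanced_fold_inds <- sample( rep( 1 : K, length.out = n_obs ) )

## fold sizes now differ by at most one observation
fold_sizes <- table( balanced_fold_inds )
```

The resulting vector could be used in place of fold_inds when building cv_data.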
  • Define fitness function for GA iteration

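    • Calculate the root-mean-square deviation (RMSD) of the model over the test data. RMSD quantifies the difference between the values predicted by the model and the observed values. It is defined as:

      $$ \mathrm{RMSD} = \sqrt{ \frac{1}{n} \sum_{i=1}^{n} ( \hat{y}_i - y_i )^2 } $$

      where n is the number of test observations, \hat{y}_i is the predicted value and y_i is the observed value of observation i.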
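
As a quick sanity check of the usual RMSD definition, the square root of the mean squared difference between predictions and observations, here is a small worked example on made-up numbers (the vectors are purely illustrative):

```r
## worked RMSD example on made-up numbers
predicted <- c( 2, 4, 6 )
observed  <- c( 1, 4, 8 )

## errors are 1, 0, -2; squared errors are 1, 0, 4; their mean is 5 / 3
rmsd_val <- sqrt( mean( ( predicted - observed ) ^ 2 ) )
## rmsd_val equals sqrt( 5 / 3 ), about 1.29
```

Note that the square root matters: without it the quantity is the mean squared error, which is on a different scale from the target variable.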

    • Based on the values of parameters Gamma and C, the function to calculate RMSD is defined below:

    library( e1071 )

    rmsd <- function( train_data, test_data, cost, gamma ) 
    {
    
        ## train SVM regression model with RBF kernel
        model <- svm( 
            medv ~ ., 
            data = train_data, 
            cost = cost, 
            gamma = gamma, 
            type = "eps-regression", 
            kernel = "radial"
        )
    
        ## predict on the test data and calculate RMSD
        ## (square root of the mean squared prediction error)
        rmsd_val <- sqrt( mean( 
            ( predict( model, newdata = test_data ) - test_data$medv ) ^ 2 
        ) )
    
        ## return calculated RMSD
        return ( rmsd_val )
    }

    • Based on the obtained RMSD, the fitness function for the GA iterations is defined below. Since the R package GA can only maximize fitness values, the negative RMSD is used as the fitness: maximizing the negative RMSD corresponds to minimizing the RMSD.

    fitness_func <- function( x, cv_data ) 
    {
    
        ## fetch SVM parameters
        gamma_val <- x[ 1 ]
        c_val <- x[ 2 ]
    
        ## use cross validation to estimate RMSD for each partition of data set
        rmsd_vals <- sapply(
            cv_data, 
            function( input_data ) with( 
                input_data, 
                rmsd( train_data, test_data, c_val, gamma_val ) 
            )
        )
    
        ## return negative RMSD 
        return ( -mean( rmsd_vals ) )
    }

  • Execute GA to obtain the optimal SVM-RBF model parameters

    library( GA )

    ## set value range for the parameters: Gamma & C 
    para_value_min <- c( gamma = 1e-3, c = 1e-4 )
    para_value_max <- c( gamma = 2, c = 10 )
    
    ## run genetic algorithm; cv_data is passed on to fitness_func, 
    ## lower / upper replace the deprecated min / max arguments
    results <- ga( type = "real-valued", 
                   fitness = fitness_func, 
                   cv_data = cv_data, 
                   names = names( para_value_min ), 
                   lower = para_value_min, 
                   upper = para_value_max,
                   popSize = 50, 
                   maxiter = 100
    )
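
Once the run has finished, the tuned parameters still have to be read out of the returned object. The sketch below assumes that the object returned by ga() is an S4 object whose solution slot is a matrix with columns named after the names argument ("gamma" and "c" here) and whose fitnessValue slot holds the best fitness. The helper extract_best is hypothetical and not part of the GA package.

```r
## hypothetical helper: pull the best parameters out of a finished GA run
extract_best <- function( ga_result ) 
{
    ## 'solution' holds one row per best solution found; take the first row
    best <- ga_result@solution[ 1, ]

    list( 
        gamma = unname( best[ "gamma" ] ), 
        cost = unname( best[ "c" ] ), 
        ## fitness was defined as negative RMSD, so flip the sign back
        rmsd = -ga_result@fitnessValue 
    )
}
```

A call such as best <- extract_best( results ) would then give the values to pass as gamma and cost when refitting the final svm() model, with best$rmsd being the cross-validated RMSD that the GA minimized.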

Results and Conclusion