The support vector machine (SVM), also called a support vector network, is one of the most popular supervised learning algorithms in machine learning. In real-world applications, SVM can be used to solve both classification and regression problems. The kernel method enables SVM to adapt to patterns in the data by nonlinearly mapping the data from the original space into a higher-dimensional space. The radial basis function (RBF), one of the most broadly used kernels, enhances SVM's flexibility and robustness in fitting a given data distribution. However, combining SVM with the RBF kernel makes it technically difficult for data scientists to find the optimal parameters (Gamma and C) and thereby arrive at well-optimized models. Using an SVM-RBF regression model on the Boston data (an R data set) as an example, I would like to demonstrate how to tune the SVM-RBF model parameters efficiently and effectively.
This project is based on the Boston dataset, which can be downloaded from GitHub. The dataset consists of 506 observations and 14 variables, defined as follows:
1. CRIM - per capita crime rate by town
2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS - proportion of non-retail business acres per town
4. CHAS - Charles River dummy variable (1 if tract bounds river; else 0)
5. NOX - nitric oxides concentration (parts per 10 million)
6. RM - average number of rooms per dwelling
7. AGE - proportion of owner-occupied units built prior to 1940
8. DIS - weighted distances to five Boston employment centres
9. RAD - index of accessibility to radial highways
10. TAX - full-value property-tax rate per $10,000
11. PTRATIO - pupil-teacher ratio by town
12. BLACK - 1000(BK - 0.63)^2 where BK is the proportion of blacks by town
13. LSTAT - % lower status of the population
14. MEDV - median value of owner-occupied homes in $1000's
Download the data and read it into the R environment
## load required packages: e1071 provides svm(), GA provides ga()
library( e1071 )
library( GA )
boston_data <- read.csv( "boston.csv" )
View( boston_data )

Explore Boston Dataset
summary(boston_data)
X crim zn indus chas
Min. : 1.0 Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
1st Qu.:127.2 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
Median :253.5 Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
Mean :253.5 Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
3rd Qu.:379.8 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
Max. :506.0 Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
nox rm age dis rad
Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130 Min. : 1.000
1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 1st Qu.: 4.000
Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207 Median : 5.000
Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795 Mean : 9.549
3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 3rd Qu.:24.000
Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.000
tax ptratio black lstat medv
Min. :187.0 Min. :12.60 Min. : 0.32 Min. : 1.73 Min. : 5.00
1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 1st Qu.: 6.95 1st Qu.:17.02
Median :330.0 Median :19.05 Median :391.44 Median :11.36 Median :21.20
Mean :408.2 Mean :18.46 Mean :356.67 Mean :12.65 Mean :22.53
3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 3rd Qu.:16.95 3rd Qu.:25.00
Max. :711.0 Max. :22.00 Max. :396.90 Max. :37.97 Max. :50.00

All the variables are plotted against one another (including the target variable MEDV) in a scatterplot matrix:
plot(boston_data)
Gamma: this parameter controls the width of the RBF kernel and thereby reshapes the decision boundary; a larger Gamma lets each training point influence only its close neighbours, grouping similar data points more tightly (see the kernel sketch below).
C: this parameter, called the penalty parameter, controls how heavily errors on the training data are penalized, trading training accuracy against model simplicity.
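To make Gamma's role concrete, here is a minimal sketch of the RBF kernel computation (the helper name rbf_kernel is illustrative, not part of e1071):
## RBF kernel: K(x1, x2) = exp( -gamma * ||x1 - x2||^2 )
## larger gamma -> faster decay -> each point influences a smaller neighbourhood
rbf_kernel <- function( x1, x2, gamma )
{
  exp( -gamma * sum( ( x1 - x2 ) ^ 2 ) )
}
## example: kernel similarity shrinks as gamma grows
rbf_kernel( c( 1, 2 ), c( 2, 3 ), gamma = 0.1 )   ## ~0.82
rbf_kernel( c( 1, 2 ), c( 2, 3 ), gamma = 2 )     ## ~0.018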
A genetic algorithm (GA) is a heuristic, parallel optimization process inspired by natural selection: individuals with high fitness survive while individuals with low fitness are eliminated. In a GA, simulated genetic operators, namely mutation, crossover and selection, are applied iteratively to generate promising offspring. After multiple iterations, high-fitness chromosomes (also called individuals) are obtained. John Holland proposed the GA in the 1960s and was later recognised as the father of the genetic algorithm; GAs are now widely used in engineering optimization and operational research (https://en.wikipedia.org/wiki/Genetic_algorithm).
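For intuition only, one generation of a real-valued GA might be sketched as follows; this is a simplified illustration, not how the GA package implements its operators:
## illustrative single GA generation: selection, crossover, mutation
evolve_one_generation <- function( population, fitness_vals, mutation_rate = 0.1 )
{
  n <- nrow( population )
  ## selection: sample parents with probability proportional to fitness rank
  probs   <- rank( fitness_vals ) / sum( rank( fitness_vals ) )
  parents <- population[ sample( n, n, replace = TRUE, prob = probs ), , drop = FALSE ]
  ## crossover: blend each parent with a randomly chosen mate
  mates     <- parents[ sample( n ), , drop = FALSE ]
  offspring <- ( parents + mates ) / 2
  ## mutation: perturb a small fraction of genes with Gaussian noise
  mutate <- matrix( runif( length( offspring ) ) < mutation_rate, nrow = n )
  offspring[ mutate ] <- offspring[ mutate ] + rnorm( sum( mutate ), sd = 0.1 )
  offspring
}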
Split data into training & testing partitions for cross validation
## 5-fold cross validation
K <- 5
fold_inds <- sample( 1 : K, nrow( boston_data ), replace = TRUE )
## split boston data into training & testing partitions
cv_data <- lapply(
  1 : K,
  function( index )
    list(
      train_data = boston_data[ fold_inds != index, , drop = FALSE ],
      test_data  = boston_data[ fold_inds == index, , drop = FALSE ]
    )
)

Define fitness function for GA iteration
rmsd <- function( train_data, test_data, c, gamma )
{
  ## train SVM-RBF regression model on the training partition
  model <- svm(
    medv ~ .,
    data = train_data,
    cost = c,
    gamma = gamma,
    type = "eps-regression",
    kernel = "radial"
  )
  ## predict on the testing partition and calculate RMSD
  ## (note the sqrt: without it the value would be the mean squared error)
  rmsd <- sqrt( mean(
    ( predict( model, newdata = test_data ) - test_data$medv ) ^ 2
  ) )
  ## return calculated RMSD
  return ( rmsd )
}
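As a quick sanity check before launching the GA (the parameter values here are arbitrary trial values), the function can be evaluated on a single fold:
## held-out RMSD on the first fold with arbitrary trial parameters
with( cv_data[[ 1 ]], rmsd( train_data, test_data, c = 1, gamma = 0.1 ) )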
fitness_func <- function( x, cv_data )
{
  ## fetch SVM parameters from the GA chromosome
  gamma_val <- x[ 1 ]
  c_val <- x[ 2 ]
  ## use cross validation to estimate RMSD on each partition of the data set
  rmsd_vals <- sapply(
    cv_data,
    function( input_data ) with(
      input_data,
      rmsd( train_data, test_data, c_val, gamma_val )
    )
  )
  ## return negative mean RMSD, since ga() maximizes the fitness function
  return ( -mean( rmsd_vals ) )
}

Execute GA to achieve optimal SVM-RBF model parameters
## set value range for the parameters: Gamma & C
para_value_min <- c( gamma = 1e-3, c = 1e-4 )
para_value_max <- c( gamma = 2, c = 10 )
## run genetic algorithm
results <- ga( type = "real-valued",
  fitness = fitness_func,
  cv_data,                            ## passed through ... to the fitness function
  names = names( para_value_min ),
  lower = para_value_min,             ## lower/upper replace the deprecated min/max
  upper = para_value_max,             ## arguments in recent versions of the GA package
  popSize = 50,
  maxiter = 100
)

The GA determines the optimal SVM-RBF model parameters Gamma and C by minimizing RMSD. In other words, the GA-optimized parameters yield a model that is more robust under cross validation, generalizes better across randomly split test samples, and shows less bias between model predictions and actual observations. The optimized Gamma and C are averaged from a group of high-fitness individuals, giving Gamma = 0.0889 and C = 1.8631.
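The tuned values can be extracted from the ga object and used to refit a final model on the full data set. This is a sketch assuming the GA package's @solution slot, which may hold several equally fit individuals:
## average the best individuals to obtain the tuned parameters
opt_params <- colMeans( results@solution )
## refit SVM-RBF on all observations with the GA-tuned Gamma and C
final_model <- svm(
  medv ~ .,
  data = boston_data,
  cost = opt_params[ "c" ],
  gamma = opt_params[ "gamma" ],
  type = "eps-regression",
  kernel = "radial"
)
## in-sample RMSD of the tuned model
sqrt( mean( ( predict( final_model, newdata = boston_data ) - boston_data$medv ) ^ 2 ) )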
The fitness of the overall population increases with each iteration, which corresponds to a gradual decrease in RMSD as the model evolves and improves. The following figure shows six summary statistics of population fitness (min, Q1, median, mean, Q3, max) and their dynamics across iterations.
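One way to reproduce such a figure is to plot the per-iteration fitness statistics that the GA package stores in the @summary slot of the result object; the column names noted below are an assumption about that slot:
## results@summary stores per-iteration fitness statistics (max, mean, q3, median, q1, min)
fitness_stats <- as.data.frame( results@summary )
matplot(
  seq_len( nrow( fitness_stats ) ), fitness_stats,
  type = "l", lty = 1, col = seq_len( ncol( fitness_stats ) ),
  xlab = "Iteration", ylab = "Fitness (negative RMSD)"
)
legend( "bottomright", legend = colnames( fitness_stats ),
  lty = 1, col = seq_len( ncol( fitness_stats ) ) )
Alternatively, the GA package's built-in plot( results ) method draws the best and mean fitness per iteration directly.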