This simulation demonstrates how each parameter of SVM regression affects the fit. We fit the model with different values of one parameter while holding the other parameters constant, and visualize each fit to make the effect easier to see. The data is simulated as follows.
library(kernlab)   # for ksvm()
library(caret)     # for RMSE(), used below

set.seed(100)
x <- runif(100, min = 2, max = 10)
y <- sin(x) + rnorm(length(x)) * 0.25
sinData <- data.frame(x = x, y = y)
dataGrid <- data.frame(x = seq(2, 10, length = 100))
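Before fitting anything, it can help to see the target we are trying to recover; a quick sketch overlaying the noise-free sin(x) curve on the simulated points:

plot(x, y)
lines(dataGrid$x, sin(dataGrid$x), lty = 2)  # true function without the noise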
First, we vary the Cost parameter, trying each value from 1 to 20. Cost is the penalty attached to large residuals and can be thought of as 1/lambda. If the cost is large, the coefficients become very sensitive to the residuals, so the model may overfit; if the cost is small, the coefficients are less sensitive to the residuals, so the model may underfit.
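For reference, a standard way to write the epsilon-insensitive SVM regression objective is (here C is the Cost above, and the second term is the coefficient penalty, which is why Cost behaves like 1/lambda):

$$
\min_{\beta_0,\,\beta}\;\; C \sum_{i=1}^{n} L_{\epsilon}\!\left(y_i - \hat{y}_i\right) \;+\; \sum_{j=1}^{P} \beta_j^2,
\qquad L_{\epsilon}(r) = \max\!\left(0,\ |r| - \epsilon\right)
$$

The loop below fits the model for each cost value and overlays the resulting prediction line: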
plot(x, y)
for (i in 1:20) {
  ksvm.fit <- ksvm(x = x, y = y, data = sinData,
                   kernel = "rbfdot", kpar = "automatic",
                   C = i, epsilon = 0.1)
  pred <- predict(ksvm.fit, dataGrid)
  points(x = dataGrid$x, y = pred, type = 'l', col = i)
}
As you can see, some of the lines wiggle through the noise. Those wiggly lines overfit the data: they are chasing the random error around the underlying function, and they correspond to the largest cost values on the plot.
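To isolate that contrast, here is a small sketch that refits only the two extreme cost values and overlays them (the colors are arbitrary):

# Refit only the extremes: a low cost (smooth) versus a high cost (wiggly)
plot(x, y)
fitLow  <- ksvm(x = x, y = y, data = sinData,
                kernel = "rbfdot", kpar = "automatic",
                C = 1, epsilon = 0.1)
fitHigh <- ksvm(x = x, y = y, data = sinData,
                kernel = "rbfdot", kpar = "automatic",
                C = 20, epsilon = 0.1)
lines(dataGrid$x, predict(fitLow, dataGrid), col = "blue")
lines(dataGrid$x, predict(fitHigh, dataGrid), col = "red")
legend("topright", legend = c("C = 1", "C = 20"),
       col = c("blue", "red"), lty = 1)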
The best fit among these models can be identified by its RMSE. Note that because only a single parameter is varied and there is no validation step, this is not a proper modeling procedure - just a toy example! The RMSE below is computed on the training data itself.
rmseValues <- rep(0, 20)   # renamed so we don't shadow caret's RMSE() function
for (i in 1:20) {
  ksvm.fit <- ksvm(x = x, y = y, data = sinData,
                   kernel = "rbfdot", kpar = "automatic",
                   C = i, epsilon = 0.1)
  pred <- predict(ksvm.fit)                # fitted values on the training data
  rmseValues[i] <- RMSE(pred, sinData$y)   # caret::RMSE
}
which.min(rmseValues)
## [1] 18
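As a sketch of what a proper validation could look like, caret's train() can cross-validate the cost instead of scoring on the training data (the fixed sigma = 0.5 in the grid is purely illustrative):

library(caret)
set.seed(100)
# 10-fold cross-validated tuning over the same cost grid
svmTuned <- train(y ~ x, data = sinData,
                  method = "svmRadial",
                  tuneGrid = expand.grid(C = 1:20, sigma = 0.5),
                  trControl = trainControl(method = "cv", number = 10))
svmTuned$bestTune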
The training RMSE is lowest at a cost of 18, which is expected: training error keeps shrinking as the cost grows, precisely because a large cost lets the model chase the noise. A moderate value such as 4 gives a much smoother fit; the graph with a cost of 4 is as follows.
plot(x, y)
ksvm.fit <- ksvm(x = x, y = y, data = sinData,
                 kernel = "rbfdot", kpar = "automatic",
                 C = 4, epsilon = 0.1)
pred <- predict(ksvm.fit, dataGrid)
points(x = dataGrid$x, y = pred, type = 'l', col = 4)
It looks much nicer than the wiggly high-cost lines.
Sigma (called gamma in some books) is the parameter controlling the reach of the radial basis kernel, k(x, x') = exp(-sigma * (x - x')^2) in kernlab's parameterization. If sigma is low, each point's influence extends far and the fit is smooth, tending toward underfitting; if sigma is high, the influence is local and the fit tends toward overfitting.
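To see this reach effect directly, here is a minimal sketch that evaluates kernlab's rbfdot kernel on two points one unit apart: the similarity decays faster as sigma grows.

# k(0, 1) = exp(-sigma * 1): larger sigma => influence dies off sooner
for (s in c(0.1, 1, 10)) {
  rbf <- rbfdot(sigma = s)
  cat("sigma =", s, " k(0, 1) =", rbf(0, 1), "\n")
}

With that intuition, we try sigma values 1 through 20, as we did for the cost, holding the cost fixed at 4: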
plot(x, y)
for (i in 1:20) {
  ksvm.fit <- ksvm(x = x, y = y, data = sinData,
                   kernel = "rbfdot", kpar = list(sigma = i),
                   C = 4, epsilon = 0.1)
  pred <- predict(ksvm.fit, dataGrid)
  points(x = dataGrid$x, y = pred, type = 'l', col = i)
}
Compared with the effect of increasing the cost, increasing sigma produces less dramatic wiggling in the fitted line.
The next parameter is epsilon. Since the noise was generated from normal random values scaled by 0.25, we vary epsilon from 0.05 to 1 in steps of 0.05, matching the loop below.
plot(x, y)
for (i in 1:20) {
  ksvm.fit <- ksvm(x = x, y = y, data = sinData,
                   kernel = "rbfdot", kpar = "automatic",
                   C = 4, epsilon = i * 0.05)
  pred <- predict(ksvm.fit, dataGrid)
  points(x = dataGrid$x, y = pred, type = 'l', col = i)
}
There is no particular wiggling pattern as the parameter increases, but an interesting pattern does appear: as epsilon approaches 1, the fitted line goes flat. This makes sense, because epsilon sets the width of the insensitive tube around the fit; as the tube widens, more points fall inside it and contribute nothing to the loss, so the fit is pulled toward a flat line.
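One way to check this explanation is to count the support vectors as epsilon grows; a sketch assuming kernlab's nSV() accessor (SVindex() would also work):

# Wider tube => more points inside it => fewer support vectors => flatter fit
nSVs <- rep(0, 20)
for (i in 1:20) {
  fit <- ksvm(x = x, y = y, data = sinData,
              kernel = "rbfdot", kpar = "automatic",
              C = 4, epsilon = i * 0.05)
  nSVs[i] <- nSV(fit)  # number of support vectors
}
plot(seq(0.05, 1, by = 0.05), nSVs, type = "b",
     xlab = "epsilon", ylab = "number of support vectors")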
From this simulation we can see that, among the SVM regression parameters, the cost has the strongest influence on overfitting, while sigma has comparatively little effect on the degree of overfitting here. Epsilon has even less influence on overfitting, but it flattens the fitted line as its value increases.