Title: Support Vector Machine - Summary
Reference:

Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. Vol. 1. No. 10. New York: Springer series in statistics, 2001.

1. Support vector classifier - seperable

[1] 문제 정의

\(\small x_i \in \mathbb{R}^p, y_i \in \{-1, 1\}\)인 \(\small N\)개의 훈련 데이터 \(\small (x_1, y_1), (x_2, y_2), \cdots , (x_N, y_N)\)에 대해서, class를 나누는 hyperplane을 찾으려고 한다
이러한 문제에 대해서, Vapnik(The nature of statistical learning theory, 1996)은 \[\small y\mathbf{(x) = w^T \phi (x)} +b\] 위와 같은 hyperplane이 support vector인 점들에 대해 maximum margin(수직 거리)을 갖도록 하는 Support Vector Machine을 제안했다
- Margin
  :어떤 객체 \(\small (\mathbf{x_n}, t_n)\) (여기서 \(\small t_n \in \{-1, 1\}\)은 target value)가 존재할 때, 위의 hyperplane과의 수직 거리(perpendicular distance)를 마진이라고 한다. \[\small margin = \frac {t_n y(\mathbf{x_n})}{|| \mathbf{w} ||} = \frac {t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) }{|| \mathbf{w} ||}\]

[2] 목적함수와 제약식 정의

Maximum margin을 가지는 hyperplane을 찾기 위해서는 hyperplane과 가장 가까운 객체에 대해서 가장 큰 margin을 갖도록 하는 목적함수를 만족하면 된다 \[\small \operatorname{argmax_{\mathbf{w}, b}} \frac {1}{||\mathbf{w}||} \min_n [t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b)] \]
- 문제를 좀 더 풀기 쉽게 만들기 위해, \(\small \mathbf{w} \to \kappa \mathbf{w}, b \to \kappa b\)로 바꿔주면
  - hyperplane과 가장 가까운 객체 \(\small \mathbf{x_n}\)에 대해 \(\small t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b)=1\)이라 쓸 수 있다
  - margin이 1인 객체애 대해 \(\small ||\mathbf{w}||^{-1}\)을 최대화하는 문제이므로 \[\small \operatorname{argmin_{\mathbf{w}, b}} \frac {1}{2} {||\mathbf{w}||^2} \]와 같이 최소화하는 문제로 바꿀 수 있다
  - 가장 가까운 객체가 \(\small t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b)=1\)을 만족하므로, 모든 객체들은 \[\small t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) \geq 1, n=1, \cdots, N\] 와 같은 제약식을 갖게 된다

따라서, maximum margin을 갖는 hyperplane을 찾기 위해서는 아래와 같은 선형 제약식을 갖는 quadratic programming problem을 풀면 된다
\[\begin{gather*} \operatorname{argmin_{\mathbf{w}, b}} \quad \frac {1}{2} {||\mathbf{w}||^2} \\ \text{subject to} \quad t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) \geq 1, \ n=1, \cdots, N \end{gather*}\]

[3] Lagrangian multiplier를 이용해 해 찾기

제약식을 가지는 이차 계획법 문제는 라그랑지안 함수를 이용하여 그 해를 찾을 수 있다 \[\small L(\mathbf{w}, b, \mathbf{a}) = \frac 1 2 ||\mathbf{w}||^2 - \sum_{n=1}^N a_n\{t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b)-1\} \quad (\mathbf{a}=(a_1, \cdots, a_N)^T)\]
- \(\small \mathbf{w}, b\)에 대해 미분하면, \[\small \frac {\partial L}{\partial \mathbf{w}} = 0 \quad \to \quad \mathbf{w}= \sum_{n=1}^N a_nt_n\phi(\mathbf{x_n}) \] \[\small \frac {\partial L}{\partial b} = 0 \quad \to \quad 0 = \sum_{n=1}^N a_nt_n \]
- 미분한 결과를 다시 라그랑지안 함수에 대입하여 라그랑지안 쌍대 함수를 구하면 \[\small \tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac 1 2 \sum_{n=1}^N \sum_{m=1}^N a_n a_m t_n t_m k(\mathbf{x}_n, \mathbf{x}_m) \quad (k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x}) \phi(\mathbf{x}')) \] \[\small \begin{gather*} \text{subject to} \quad a_n \geq 0, \ n = 1, \cdots, N \\ \sum_{n=1}^N a_n t_n = 0 \end{gather*}\]
- 이때, Karush-Kuhn-Tucher(KKT) 조건도 만족해야 하므로, 제약식은 아래와 같다 \[\small \begin{gather*} a_n \geq 0 \\ t_ny(\mathbf{x}_n)-1 \geq 0 \\ a_n\{t_ny(\mathbf{x}_n)-1\} = 0 \end{gather*}\]

Support vectors
- \(\small \tilde{L}(\mathbf{a})\)에 대해 결정변수 \(\small a_n\)를 구하고 나면 두 가지 경우가 생긴다
  - 첫번째로, \(\small a_n=0\)인 경우: \(\small t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) \gt 1\)일때, KKT 조건에 따라 \(\small a_n\)값이 0이 된 경우로, 이에 해당하는 객체는 support vector가 되지 않고, margin 바깥에 위치한다
  - 두번째로, \(\small a_n \gt 0\)인 경우: \(\small a_n \gt 0\) 일때, KKT 조건에 따라 반드시 \(\small t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b)=1\)이어야 하므로, margin위에 위치한 점이며, 이 객체가 support vector가 된다

새로운 데이터가 들어왔을 때 예측은?
- 우리가 구하려던 hyperplane \(\small y\mathbf{(x) = w^T \phi (x)} +b\)에 대하여, \(\small \mathbf{w}= \sum_{n=1}^N a_nt_n\phi(\mathbf{x_n})\)을 대입하면, 새로운 데이터 \(\small \mathbf{x}\)에 대한 \(\small y\)은 아래와 같다 \[\small y(\mathbf{x}) = \sum_{n=1}^N a_nt_n\phi(\mathbf{x_n})\phi(\mathbf{x}) + b = \sum_{n=1}^N a_nt_nk(\mathbf{x}, \mathbf{x}_n) + b\]
\(\small \mathbf{w}, b\) 구하기
- \(\small \mathbf{w}\) 구하기
  - 구한 \(\small \mathbf{a}=(a_1, \cdots, a_N)^T\)을 이용하여, \(\small \mathbf{w}= \sum_{n=1}^N a_nt_n\phi(\mathbf{x_n})\)에 대입하여 \(\small \mathbf{w}\)를 얻는다
- \(\small b\) 구하기
  - support vector에 해당하는 객체들의 집합을 \(\small S\)라 하자
  - \(\small S\)에 속하는 어떤 객체 \(\small (\mathbf{x}_n, t_n)\)은 \(\small t_ny(\mathbf{x}_n)=1\)이므로 \[\small t_n(\sum_{m \in S} a_mt_mk(\mathbf{x}_n, \mathbf{x}_m) + b) = 1 \]을 만족한다
  - 위의 식에서 양변에 \(\small t_n\)을 곱하면, \(\small t_n^2=1\)이므로 아래와 같이 \(\small b\)에 대해 정리할 수 있다 \[\small b = \frac {1}{N_S} \sum_{n \in S} (t_n - \sum_{m \in S} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) ) \]

2. Support vector classifier - nonseperable

[1] 문제 정의

class간 overlapping이 존재하는 상황이라면, 위의 최적화 문제의 해를 찾을 수 없으므로, 잉여변수(slack variable)를 정의하여 문제를 완화시킬 수 있다
margin내에 오분류되는 객체도 존재할 수 있도록
- 잉여변수 \(\small \xi_n=|t_n-y(\mathbf{x}_n)|\) 정의
- 분리 가능한 SVM에서의 제약식 \(\small t_ny_n = t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) \geq 1\)을 아래와 같이 변형 \[\small t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) \geq 1 - \xi_n, \quad \xi_n \geq 0 \]

[2] 목적함수와 제약식 정의

이전 장에서 SVM의 목적함수와 제약식은 \[\small \begin{gather*} \operatorname{argmin_{\mathbf{w}, b}} \quad \frac {1}{2} {||\mathbf{w}||^2} \\ \text{subject to} \quad t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) \geq 1, \ n=1, \cdots, N \end{gather*}\]
잉여변수를 적용한 overlapping class distribution 상황의 SVM의 목적함수와 제약식은 \[\small \begin{gather*} \operatorname{argmin_{\mathbf{w}, b}} \quad \frac {1}{2} {||\mathbf{w}||^2} + C\sum_{n=1}^N \xi_n \\ \text{subject to} \quad t_n(\mathbf{w}^T \phi(\mathbf{x_n}) + b) \geq 1-\xi_n, \ n=1, \cdots, N \\ \ \xi_n \ \geq 0 , \ n=1, \cdots, N \end{gather*}\]
- 여기서, \(\small C \to \infty\)가 되면 분리 가능한 SVM의 경우와 같아진다

[3] Lagrangian multiplier를 이용해 해 찾기

위의 최적화 문제를 라그랑지안 함수로 쓰면 \[\small L(\mathbf{w}, b, \mathbf{a})= \frac {1}{2} {||\mathbf{w}||^2} + C\sum_{n=1}^N \xi_n - \sum_{n=1}^N a_n\{t_n y(\mathbf{x}_n)-1 +\xi_n\} - \sum_{n=1}^N \mu_n\xi_n \]
- \(\small \mathbf{w}, b, \mathbf{a}\)에 대해 미분하면, \[\small \begin{gather*} \frac {\partial L}{\partial \mathbf{w}} = 0 \quad \to \quad \mathbf{w} = \sum_{n=1}^N a_nt_n\phi(\mathbf{x}_n)\\ \frac {\partial L}{\partial b} = 0 \quad \to \quad 0 = \sum_{n=1}^N a_nt_n\\ \frac {\partial L}{\partial \xi_n} = 0 \quad \to \quad a_n = C - \mu_n \end{gather*}\]
위의 미분 결과를 다시 라그랑지안 함수에 대입하여 라그랑지안 쌍대 함수를 구하면, \[\small \tilde{L}(\mathbf{a}) = \sum_{n=1}^N a_n - \frac 1 2 \sum_{n=1}^N \sum_{m=1}^N a_na_mt_nt_mk(\mathbf{x}_n, \mathbf{x}_m)\]
이때, KKT 조건(제약식)은 \[\small \begin{aligned} a_n \geq 0 \\ t_ny(\mathbf{x}_n) - 1 + \xi_n \geq 0 \\ a_n(t_ny(\mathbf{x}_n) - 1 + \xi_n) = 0 \\ \mu_n \geq 0 \\ \xi_n \geq 0 \\ \mu_n\xi_n = 0 \end{aligned}\]

Support vectors
- KKT 조건 중 \(\small a_n \geq 0, \mu_n \geq 0\), 그리고 \(\small \xi_n\)에 대해 미분하여 나온 식 \(\small a_n = C - \mu_n\)에 따라, 아래와 같이 정리할 수 있다 \[\small 0 \leq a_n \leq C\]
- \(\small a_n\)의 값에 따라 경우를 나누면 다음과 같다
  - ㄱ. \(\small a_n=0\)인 경우: \(\small t_ny_n \gt 1 - \xi_n\)일때, KKT 조건에 따라 \(\small a_n\)값이 0이 된 경우로,
    이에 해당하는 객체는 \(\small a_n = C-\mu_n\)의 식에 따라, \(\small \mu_n=C\)이므로 \(\small \xi_n=0\)이 되며,
    따라서 \(\small t_ny_n \gt 1\)을 만족하여 support vector가 되지 않고, margin 바깥에 위치한다
  - ㄴ. \(\small 0 \lt a_n \lt C\)인 경우: KKT 조건에 따라 \(\small t_ny_n = 1 - \xi_n\)을 만족해야 하며,
    \(\small a_n = C-\mu_n\)의 식에 따라 \(\small \mu_n \gt 0, \xi_n=0\)이므로,
    \(\small t_ny_n = 1\)인 객체로 margin 위에 존재하는 support vector가 된다
  - ㄷ. \(\small a_n = C\)인 경우: KKT 조건에 따라 \(\small t_ny_n \gt 1 - \xi_n\)을 만족해야 하며,
    \(\small a_n = C-\mu_n\)의 식에 따라 \(\small \mu_n = 0, \xi_n \gt 0\)이므로,
    오분류가 허용된 객체로, support vector에 해당하며 margin 내에 존재하는 객체이다.

새로운 데이터가 들어왔을 때 예측은?
- 1에서 분리 가능한 SVM과 마찬가지로 아래와 같은 식을 통해 예측할 수 있다 \[\small y(\mathbf{x}) = \sum_{n=1}^N a_nt_n\phi(\mathbf{x_n})\phi(\mathbf{x}) + b = \sum_{n=1}^N a_nt_nk(\mathbf{x}, \mathbf{x}_n) + b\]
\(\small \mathbf{w}, b\) 구하기
- \(\small \mathbf{w}\) 구하기
  - 구한 \(\small \mathbf{a}=(a_1, \cdots, a_N)^T\)을 이용하여, \(\small \mathbf{w}= \sum_{n=1}^N a_nt_n\phi(\mathbf{x_n})\)에 대입하여 \(\small \mathbf{w}\)를 얻는다
- \(\small b\) 구하기
  - support vector 중 \(\small 0 \lt a_n \lt C\)인 경우의 객체들의 집합을 \(\small S\)라 하자
  - \(\small 0 \lt a_n \lt C\)인 속하는 객체들만 \(\small t_ny(\mathbf{x}_n)=1\)을 만족하므로(이 객체들만 margin 위에 존재하는 객체들) \[\small t_n(\sum_{m \in S} a_mt_mk(\mathbf{x}_n, \mathbf{x}_m) + b) = 1 \]을 만족한다
  - 위의 식에서 양변에 \(\small t_n\)을 곱하면, \(\small t_n^2=1\)이므로 아래와 같이 \(\small b\)에 대해 정리할 수 있다 \[\small b = \frac {1}{N_S} \sum_{n \in S} (t_n - \sum_{m \in S} a_m t_m k(\mathbf{x}_n, \mathbf{x}_m) ) \]

3. Support vector regression

[1] 문제 정의

앞에서 본 분류 문제를 위한 SVM을 회귀 문제에도 적용해보자
regularized linear regression의 목적함수는 아래와 같이 쓸 수 있다 \[\small \operatorname{min_w} \frac 12 \sum_{n=1}^N \{y_n-t_n\}^2 + \frac {\lambda}{2} ||\mathbf{w}||^2 \]
sparse solution을 얻기 위해, 위의 목적함수에서 quadratic error function 대신 \(\small \epsilon\)-insensitive error function을 사용할 것이다
- \(\small \epsilon\)-insensitive error function \[\small E_{\epsilon}(y(\mathbf{x})-t) = \begin{cases} 0 & \text{if} \quad |y(\mathbf{x})-t| \lt \epsilon \\ |y(\mathbf{x})-t| - \epsilon & \text{otherwise} \end{cases}\]
  - quadratic error function와 \(\small \epsilon\)-insensitive error function 비교(빨간선이 \(\small \epsilon\)-insensitive)
- \(\small \epsilon\)-insensitive error function을 사용한 SVR 목적함수 \[\small \min_w \quad C\sum_{n=1}^N E_{\epsilon}(y(\mathbf{x})-t) + \frac 12 ||\mathbf{w}||^2 \]
  - \(\epsilon\)-insensitive error function을 사용하므로, \(\small \pm \ \epsilon\)의 오차를 허용하는 것이다

[2] 목적함수와 제약식 정의

2장에서 분리 불가능한 경우의 SVM처럼, 여기서도 오차를 허용할 수 있도록 \(\small \xi_n,\ \hat{\xi_n}\)이라는 잉여변수를 적용하여 목적함수와 제약식을 설정한다
- 첫번째(그림에서 회귀식 위쪽에 객체가 존재하는 경우), \(\small t_n - y(\mathbf{x}_n) \gt \epsilon\)인 경우 \[\small t_n - y(\mathbf{x}_n) \leq \epsilon + \xi_n, \quad \xi_n \gt 0\]의 조건으로 오차가 존재할 수 있도록 한다
- 두번째(그림에서 회귀식 아래쪽에 객체가 존재하는 경우), \(\small t_n - y(\mathbf{x}_n) \lt \epsilon\)인 경우 \[\small t_n - y(\mathbf{x}_n) \geq \epsilon + \hat{\xi_n}, \quad \hat{\xi_n} \gt 0\]의 조건으로 오차가 존재할 수 있도록 한다

따라서, SVR은 아래와 같은 목적함수와 제약식을 갖는다 \[\small \begin{gather*} \min_{\mathbf{w}} \quad C\sum_{n=1}^N (\xi_n + \hat{\xi_n}) + \frac 12 ||\mathbf{w}||^2 \\ \text{subject to} \quad t_n \leq y(\mathbf{x}_n) + \epsilon + \xi_n \\ t_n \geq y(\mathbf{x}_n) - \epsilon - \hat{\xi_n} \\ \xi_n \gt 0 \\ \hat{\xi_n} \gt 0 \end{gather*}\]

[3] Lagrangian multiplier를 이용해 해 찾기

라그랑지안 함수로 바꾸면 \[\small L = C\sum_{n=1}^N (\xi_n + \hat{\xi_n}) + \frac 12 ||\mathbf{w}||^2 - \sum_{n=1}^N (\mu_n\xi_n + \hat{\mu_n}\hat{\xi_n}) - \sum_{n=1}^N a_n(\epsilon + \xi_n + y_n -t_n) - \sum_{n=1}^N \hat{a_n}(\epsilon + \hat{\xi_n} - y_n + t_n)\]
- \(\small \mathbf{w}, b, \xi_n, \hat{\xi_n}\)에 대해 미분하면 \[\small \begin{gather*} \frac {\partial L}{\partial \mathbf{w}} = 0 \quad \to \quad \mathbf{w} = \sum_{n=1}^N (a_n - \hat{a_n})\phi(\mathbf{x}_n)\\ \frac {\partial L}{\partial b} = 0 \quad \to \quad 0 = \sum_{n=1}^N (a_n - \hat{a_n})\\ \frac {\partial L}{\partial \xi_n} = 0 \quad \to \quad a_n + \mu_n = C \\ \frac {\partial L}{\partial \hat{\xi_n}} = 0 \quad \to \quad \hat{a_n} + \hat{\mu_n} = C \end{gather*} \]
- 미분한 결과를 다시 라그랑지안 함수에 대입하여 라그랑지안 쌍대 함수를 구하면 \[\small \begin{gather*} \tilde{L}(\mathbf{a}, \mathbf{\hat{a}}) = -\frac 12 \sum_{n=1}^N \sum_{m=1}N (a_n - \hat{a_n})(a_m - \hat{a_m})k(\mathbf{x}_n, \mathbf{x}_m) - \epsilon\sum_{n=1}^N (a_n + \hat{a_n})+\sum_{n=1}^N (a_n - \hat{a_n})t_n \\ \text{subject to} \quad a_n \geq 0 \\ \hat{a_n} \geq 0 \\ \mu_n \geq 0 \\ \hat{\mu_n} \geq 0 \end{gather*} \]
- 이때, KKT 조건(제약식)은 \[\small \begin{gather*} a_n(\epsilon + \xi_n +y_n - t_n) = 0 \\ \hat{a_n}(\epsilon + \hat{\xi_n}-y_n + t_n) = 0 \\ (C-a_n)\xi_n = 0 \\ (C-\hat{a_n})\hat{\xi_n} = 0 \end{gather*} \]

새로운 데이터가 들어왔을 때 예측은?
- \(\small y\mathbf{(x) = w^T \phi (x)} +b\)에 \(\small \mathbf{w}\)에 대해 미분했을 때 얻은 \(\small \mathbf{w} = \sum_{n=1}^N (a_n - \hat{a_n})\phi(\mathbf{x}_n)\)을 대입하면 아래와 같다 \[\small y(\mathbf{x}) = \sum_{n=1}^N (a_n - \hat{a_n})k(\mathbf{x}, \mathbf{x}_n) + b\]
\(\small \mathbf{w}, b\) 구하기
- \(\small \mathbf{w}\) 구하기
  - 구한 \(\small \mathbf{a}, \hat{\mathbf{a}}\)을 이용하여, \(\small \mathbf{w} = \sum_{n=1}^N (a_n - \hat{a_n})\phi(\mathbf{x}_n)\)에 대입하여 \(\small \mathbf{w}\)를 얻는다
- \(\small b\) 구하기
  - \(\small 0 \lt a_n \lt C\)(또는 \(\small 0 \lt \hat{a_n} \lt C\))인 객체가 \(\small \mu_n \gt 0\)(또는 \(\small \hat{\mu_n} \gt 0\))를 만족하여 \(\small \xi_n = 0\)(또는 \(\small \hat{\xi_n} = 0\))을 가지므로, KKT 조건에 따라 \(\small \epsilon + \xi_n +y_n - t_n = 0\)(또는 \(\small \epsilon - y_n + t_n = 0\))이 된다
  - 따라서, support vector인 \(\small \mathbf{x}_n\)들에 대하여, 아래 식과 같이 \(\small b\)를 구할 수 있다 \[\small b = t_n - \epsilon - \mathbf{w}^T\phi(\mathbf{x}_n) = t_n - \epsilon - \sum_{n=1}^N (a_n - \hat{a_n})k(\mathbf{x}, \mathbf{x}_n)\]

4. 참고

[1] \(\small \nu\) - SVM (Scholkopf et al., 2000)

분류 문제에서의 \(\small \nu\) - SVM
- 목적함수와 제약식 \[\small \begin{gather*} \tilde{L}(\mathbf{a}) = -\frac 12 \sum_{n=1}^N \sum_{m=1}^N a_na_mt_nt_mk(\mathbf{x}_n, \mathbf{x}_m) \\ \text{subject to} \quad 0 \leq a_n \leq \frac 1N \\ \sum_{n=1}^Na_nt_n=0 \\ \sum_{n=1}^N a_n \geq \nu \end{gather*}\]
- 위의 목적함수와 제약식에서 기존 SVM과의 차이를 보면 \(\small C\) 대신에 \(\small \nu\)라는 parameter를 사용한 것이다
- 그러면, \(\small \nu\)의 의미는?
  - 제약식 중 \(\small \sum_{n=1}^N a_n \geq \nu\) 이 식에서 볼 수 있듯이 \(\small \nu\)는 \(\small \xi_n \gt 0\)인 객체들의 비율을 의미하므로, support vector의 비율에 대한 하한선이라고 할 수 있다

회귀 문제에서의 \(\small \nu\) - SVM
- 목적함수와 제약식 \[\small \begin{gather*} \tilde{L}(\mathbf{a}, \hat{\mathbf{a}}) = -\frac 12 \sum_{n=1}^N \sum_{m=1}^N (a_n-\hat{a_n})(a_m-\hat{a_m})k(\mathbf{x}_n, \mathbf{x}_m)+ \sum_{n=1}^N (a_n-\hat{a_n})t_n \\ \text{subject to} \quad 0 \leq a_n \leq \frac CN \\ 0 \leq \hat{a_n} \leq \frac CN \\ \sum_{n=1}^N(a_n- \hat{a_n})=0 \\ \sum_{n=1}^N(a_n + \hat{a_n}) \leq \nu C \end{gather*}\]
- \(\small \epsilon\)-insensitive error function을 사용했던 SVR에서 \(\small \epsilon\)만큼의 너비(\(\small \epsilon\)-tube)에 대해서 에러가 0이 되도록 한 것에 더해, \(\small \nu\) - SVR에서는 튜브 바깥에 존재하는 객체의 비율을 parameter \(\small \nu\)로 정하였다

[2] Computational learning theory

Probably Approximately Correct(PAC) learning
- PAC framework에서는 good generalization을 가지기 위해서는 얼마나 많은 데이터가 필요한지에 대해 알아본다
- 좀 더 자세히 보면,
  - \(\small p(\mathbf{x}, t)\)를 분포로 가지는 크기 \(\small N\)인 데이터셋 \(\small D\)에 대해, \(\small t = g(\mathbf{x})\)가 존재한다고 하자
  - 이때, 데이터셋 \(\small D\)를 기반으로 하는 space \(\small \mathcal{F}\)에서 나온 함수 \(\small f(\mathbf{x}; D)\)는 \[\small P[E_{\mathbf{x}, t} [I(f(\mathbf{x}; D) \neq t)] \lt \epsilon] \geq 1 - \delta\] 을 만족하면 good generalization을 갖는다
  - 따라서, PAC learning에서는 \(\small \mathcal{F},\epsilon, \delta\)가 주어졌을 때, \(\small f(\mathbf{x}; D)\) good generalization을 가지려면 필요한 최소한의 크기 \(\small N\)의 경계를 찾으려고 하는 것이 그 목표이다
SVM은 이러한 PAC learning에서 시작된 computational learning theory을 기반으로 한다

Support Vector Machine - Summary

by Stoney

2020-06-16

1. Support vector classifier - seperable

2. Support vector classifier - nonseperable

3. Support vector regression

4. 참고