
Support Vector Machine (SVM)

Last summer I worked through Professor Ling Qing's convex optimization course (reference 1), did a few of the exercises, and then mostly let it sit. Professor Ling said in class that the core of all of convex optimization is the KKT conditions. I also helped my advisor put together some slides on support vector machines, mostly copied straight from the textbook, so I knew the SVM rests on convex optimization. Now that winter break is here, it's time to review, so this post takes a careful look at the SVM as a convex optimization refresher.

Optimization Problems

The general form of an optimization problem: minimize a function \(f_0(x)\) subject to inequality and equality constraints (as the formulation below shows, the more constraints there are, the harder the problem tends to become):

\(\text{minimize } f_0(x)\)
\(\text{s.t. } f_i(x) \leq b_i, \ i = 1, \ldots, m\)
\(\quad\ \ \ h_i(x) = 0, \ i = 1, \ldots, p\)

A convex optimization problem must satisfy three conditions:
1. the objective function is convex;
2. the inequality constraint functions are convex;
3. the equality constraints are affine.

Intuitively, a convex function is a wok: whatever chord (line segment) you lay across it sits above the wok. Affine, roughly speaking, means linear (a linear map plus a constant offset).

Convex problems are the well-behaved ones among optimization problems; a particularly useful property is that any local optimum is also a global optimum.
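As a concrete, purely illustrative instance of the general form above (made-up numbers; one convex inequality, one affine equality), a minimal cvxpy sketch:

import cvxpy as cp

# toy convex problem: objective f_0(x) = ||x||^2
x = cp.Variable(2)
objective = cp.Minimize(cp.sum_squares(x))   # convex objective
constraints = [x[0] + x[1] <= 1,             # f_1(x) <= b_1, convex inequality
               x[0] - x[1] == 0.5]           # h_1(x) = 0, affine equality
prob = cp.Problem(objective, constraints)
prob.solve()
print(prob.value, x.value)                   # the local optimum found is global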

Complement: for the SVM, the supporting hyperplane is a core concept; for the theory, see the convex optimization textbook (reference 2, which has a Chinese translation), or reference 3 for a short, plain explanation.

The dual problem converts the primal min problem into a max problem. The dual function is always concave, and the optimal value of the dual problem is less than or equal to the optimal value of the primal (weak duality), so when the primal is hard to attack directly (for example, nonconvex), the dual can be solved instead, yielding at least a lower bound, and the exact primal optimum whenever strong duality holds.
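Why weak duality holds (absorbing \(b_i\) into \(f_i\) so the inequality constraints read \(f_i(x) \leq 0\)): for any feasible \(\tilde{x}\) and any \(\lambda \geq 0\),
\(g(\lambda, v) = \inf_x L(x, \lambda, v) \leq L(\tilde{x}, \lambda, v) = f_0(\tilde{x}) + \sum_i \lambda_i f_i(\tilde{x}) + \sum_i v_i h_i(\tilde{x}) \leq f_0(\tilde{x})\)
since \(f_i(\tilde{x}) \leq 0\) and \(h_i(\tilde{x}) = 0\). Maximizing the left side over \((\lambda, v)\) and minimizing the right side over feasible \(\tilde{x}\) gives \(d^{\star} \leq p^{\star}\). Concavity of \(g\) follows because it is a pointwise infimum of functions affine in \((\lambda, v)\).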

KKT Conditions
Let \(x^{\star}, \lambda^{\star}, v^{\star}\) be optimal for the primal and dual problems (again with \(b_i\) absorbed into \(f_i\)):
\(f_i(x^{\star}) \leq 0\) (primal feasibility)
\(h_i(x^{\star}) = 0\) (primal feasibility)
\(\lambda_i^{\star} \geq 0\) (dual feasibility)
\(\lambda_i^{\star} f_i(x^{\star}) = 0\) (complementary slackness)
\(\nabla f_0(x^{\star}) + \sum_{i=1}^{m} \lambda_i^{\star} \nabla f_i(x^{\star}) + \sum_{i=1}^{p} v_i^{\star} \nabla h_i(x^{\star}) = 0\) (stationarity)
The KKT conditions are in general only necessary; if the primal problem is convex, all the functions are differentiable, and the duality gap is zero, then the KKT conditions are necessary and sufficient.
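A one-variable sanity check: for \(\min x^2\) s.t. \(1 - x \leq 0\), the Lagrangian is \(L = x^2 + \lambda(1 - x)\). Stationarity gives \(2x - \lambda = 0\) and complementary slackness gives \(\lambda(1 - x) = 0\); taking \(\lambda > 0\) forces \(x^{\star} = 1\) and then \(\lambda^{\star} = 2 \geq 0\), and primal feasibility \(1 - x^{\star} = 0 \leq 0\) holds, so all the conditions are met and \(x = 1\) is indeed the constrained minimum.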

hard margin SVM

Suppose we face a binary classification problem that is linearly separable. A perceptron can already solve it, but the separating hyperplane it finds is one of infinitely many; the SVM instead finds the optimal separating hyperplane by maximizing the margin, and that hyperplane is unique.

The usual description of the SVM: maximize the distance from the hyperplane to the nearest point.
In math: \(\max_{w, b} \min_{x_i} \frac{y_i (w^T x_i + b)}{\|w\|}, \ \text{s.t. } y_i (w^T x_i + b) > 0\) (as with the perceptron, a prediction is correct exactly when the predicted and true labels agree, i.e., their product is positive, which is what lets us drop the absolute value).

The factor \(\frac{1}{\|w\|}\) does not depend on which point attains the inner minimum, so it can be pulled out:
\(\max_{w, b} \frac{1}{\|w\|} \min_{x_i} y_i (w^T x_i + b), \ \text{s.t. } y_i (w^T x_i + b) > 0\)

\(y_i(w^T x_i + b)\) is the functional margin; it measures the distance from a point to the hyperplane only up to scaling. The point \((0, 0)\) has functional margin 1 and 2 with respect to \(x_1 - x_2 + 1 = 0\) and \(2x_1 - 2x_2 + 2 = 0\), two representations of the same hyperplane. \(\frac{y_i(w^T x_i + b)}{\|w\|}\) is the geometric margin, under which the distance from the point to the hyperplane is \(\frac{1}{\sqrt{2}}\) in both cases.
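A quick numeric check of this example (a small sketch; functional_margin is a throwaway helper defined here, not a library function):

import numpy as np

# functional margin of point x w.r.t. the hyperplane w^T x + b = 0, given label y
def functional_margin(w, b, x, y=1.0):
    return y * (np.dot(w, x) + b)

x0 = np.array([0., 0.])
w1, b1 = np.array([1., -1.]), 1.   # x_1 - x_2 + 1 = 0
w2, b2 = np.array([2., -2.]), 2.   # 2x_1 - 2x_2 + 2 = 0, the same hyperplane

print(functional_margin(w1, b1, x0))                       # 1.0
print(functional_margin(w2, b2, x0))                       # 2.0
print(functional_margin(w1, b1, x0) / np.linalg.norm(w1))  # geometric: 1/sqrt(2)
print(functional_margin(w2, b2, x0) / np.linalg.norm(w2))  # geometric: 1/sqrt(2)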

Bound the functional margin from below, \(y_i (w^T x_i + b) \geq \gamma\). Scaling \((w, b)\) by any constant \(\kappa > 0\) describes the same hyperplane while scaling \(\gamma\) by \(\kappa\), so the particular value of \(\gamma\) has no effect on the max problem; for convenience take \(\gamma = 1\).

The problem then becomes a convex quadratic program:
\(\min \frac{1}{2}\|w\|^2\)
\(\text{s.t. } y_i (w^T x_i + b) \geq 1\)

Is this easy to solve as it stands? We optimize \(w, b\) to minimize the objective, yet \(b\) appears only in the constraints, not in the objective. My idea was to treat \(w, b\) as one stacked variable:
\(\min \frac{1}{2}\|w\|^2\)
\(\text{s.t. } y_i([x_i^T, 1][w^T, b]^T) \geq 1\)
Various books state that the primal can be computed directly without spelling out how, and converting it into a form cvxpy or cvxopt accepts took me a while; my first attempt returned no usable solution because I had left the labels \(y_i\) out of the constraint matrix, which lets the solver settle on the degenerate \(w = 0\). With the labels multiplied in, as in the corrected code below, the primal QP solves directly.

import numpy as np
import cvxpy as cp
import matplotlib.pyplot as plt

# dataset
class_a = np.around(np.array([np.random.normal(1.1, 0.1, (1, 10)).squeeze(), np.random.normal(1.5, 0.1, (1, 10)).squeeze()]).T, 2)
class_b = np.around(np.array([np.random.normal(1.5, 0.1, (1, 10)).squeeze(), np.random.normal(1.1, 0.1, (1, 10)).squeeze()]).T, 2)
label_a = np.array(10*[1.])
label_b = np.array(10*[-1.])

# concatenate: each row of coefficient is [x_1, x_2, 1] so that [w_1, w_2, b] enters linearly
one_vector = np.ones((20, 1))
class_concat = np.concatenate((class_a, class_b), axis=0)
label_concat = np.concatenate((label_a, label_b), axis=0)
coefficient = np.concatenate((class_concat, one_vector), axis=1)

# solve the quadratic program over [w_1, w_2, b]
# constraint y_i * ([x_i, 1] @ [w; b]) >= 1, rewritten as G @ w <= h
P = np.array([[1., 0., 0.], [0., 1., 0.], [0., 0., 0.]])  # penalize only ||w||^2; b stays free
q = np.array([0., 0., 0.])
G = -(label_concat.reshape(-1, 1) * coefficient)  # the labels must multiply the rows
h = -np.ones(20)

w = cp.Variable(3)  # stacked variable [w_1, w_2, b]
prob = cp.Problem(cp.Minimize((1/2)*cp.quad_form(w, P) + q.T@w), [G@w <= h])
prob.solve()
print(prob.value, w.value)

w_1, w_2, b = w.value

print('class a:')
for item in class_a:
    label = np.sign(w_1 * item[0] + w_2 * item[1] + b)
    print(label)

print('class b:')
for item in class_b:
    label = np.sign(w_1 * item[0] + w_2 * item[1] + b)
    print(label)

x_1 = np.arange(1, 2, 0.01)
x_2 = - (w_1 * x_1 + b) / w_2
print(x_1, '\n', x_2)

plt.figure()

plt.scatter(class_a[:, 0], class_a[:, 1], c='red', label='class 1')
plt.scatter(class_b[:, 0], class_b[:, 1], c='blue', label='class -1')

plt.plot(x_1, x_2, label='hyperplane')

plt.legend()
plt.show()

Getting the primal right took some effort; the standard route, and the one that leads to the dual, is the method of Lagrange multipliers.

primal problem:
\(\min \frac{1}{2}\|w\|^2\)
\(\text{s.t. } y_i (w^T x_i + b) \geq 1\)

Applying the method of Lagrange multipliers gives the Lagrangian:
\(L(w, b, \lambda) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{N} \lambda_i (1 - y_i(w^T x_i + b))\)

Then, with \(\lambda\) held fixed, minimize over \(w, b\) (set the partial derivatives to zero) to obtain the dual function of \(\lambda\):
\(\min_{w, b} L(w, b, \lambda)\)
\(\nabla_w L(w, b, \lambda) = w - \sum_{i=1}^{N}\lambda_i y_i x_i = 0\)
\(\nabla_b L(w, b, \lambda) = -\sum_{i=1}^{N} \lambda_i y_i = 0\)
Substituting the results back into the Lagrangian gives the dual function (note that \(y\) is a scalar while \(x\) is a vector): \(g(\lambda) = -\frac{1}{2}\sum_{i=1}^{N} \sum_{j=1}^{N}\lambda_i \lambda_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \lambda_i\)
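Spelling the substitution out (using \(w = \sum_{i=1}^{N} \lambda_i y_i x_i\) and \(\sum_{i=1}^{N} \lambda_i y_i = 0\), which makes the \(b\) term vanish):
\(\frac{1}{2}\|w\|^2 = \frac{1}{2}\sum_{i=1}^{N} \sum_{j=1}^{N}\lambda_i \lambda_j y_i y_j (x_i \cdot x_j)\)
\(\sum_{i=1}^{N} \lambda_i y_i w^T x_i = \sum_{i=1}^{N} \sum_{j=1}^{N}\lambda_i \lambda_j y_i y_j (x_i \cdot x_j)\)
so the quadratic terms combine to \(\frac{1}{2}\sum\sum - \sum\sum = -\frac{1}{2}\sum\sum\), leaving exactly the \(g(\lambda)\) above.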

dual problem (maximize the dual function):
\(\max_{\lambda} -\frac{1}{2}\sum_{i=1}^{N} \sum_{j=1}^{N}\lambda_i \lambda_j y_i y_j (x_i \cdot x_j) + \sum_{i=1}^{N} \lambda_i\)
\(\text{s.t. } \sum_{i=1}^{N} \lambda_i y_i = 0, \ \lambda_i \geq 0\)

Rewrite this in an equivalent standard form. Let \(P_{ij} = y_i y_j (x_i \cdot x_j)\):
\(\min_{\lambda} \frac{1}{2}\lambda^T P \lambda - 1^T \lambda\)
\(\text{s.t. } -\lambda_i \leq 0\)
\(\quad\ \ \ y^T \lambda = 0\)
This is exactly cvxopt's standard QP form \(\min \frac{1}{2}x^T P x + q^T x\) s.t. \(Gx \leq h, Ax = b\), which the code at the end uses.

Finally, from the solved \(\lambda\) we recover \(w, b\).

Write out the KKT conditions for this problem:
primal feasibility: \(y_i (w^T x_i + b) - 1 \geq 0\)
dual feasibility: \(\lambda_i \geq 0\)
complementary slackness: \(\lambda_i (1 - y_i(w^T x_i + b)) = 0\)
stationarity: \(\nabla_w L(w, b, \lambda) = w - \sum_{i=1}^{N}\lambda_i y_i x_i = 0\), \(\nabla_b L(w, b, \lambda) = -\sum_{i=1}^{N} \lambda_i y_i = 0\)

From stationarity, \(w\) follows at once: \(w = \sum_{i=1}^{N}\lambda_i y_i x_i\)

For the bias \(b\), look at complementary slackness. \(\lambda\) cannot be the zero vector (stationarity would then force \(w = 0\), which satisfies no constraint), so there exists some \(\lambda_j > 0\), and hence \(y_j (w^T x_j + b) - 1 = 0\). Solving for \(b\):
\(y_j (w^T x_j + b) - 1 = 0\)
\(y_j^2 (w^T x_j + b) = y_j\) (multiply both sides by \(y_j\), using \(y_j^2 = 1\))
\(b = y_j - w^T x_j\)

This yields the final decision function: \(f(x) = \text{sign}\left(\sum_{i=1}^{N}\lambda_i y_i (x_i \cdot x) + b\right)\)

import numpy as np
import matplotlib.pyplot as plt
from cvxopt import matrix, solvers

# dataset
class_a = np.around(np.array([np.random.normal(1.1, 0.1, (1, 2)).squeeze(), np.random.normal(1.5, 0.1, (1, 2)).squeeze()]).T, 2)
class_b = np.around(np.array([np.random.normal(1.5, 0.1, (1, 2)).squeeze(), np.random.normal(1.1, 0.1, (1, 2)).squeeze()]).T, 2)
label_a = np.array(2*[1.])
label_b = np.array(2*[-1.])

# concatenate
X_ori = np.concatenate((class_a, class_b), axis=0)
y = np.concatenate((label_a, label_b), axis=0)
y = y.reshape(-1, 1)

# solve the dual problem
# build P with P_ij = y_i * y_j * (x_i . x_j)
X = y * X_ori
P = np.dot(X, X.T)

# solve the quadratic program: lambdas
P = matrix(P)
q = matrix(-np.ones((4, 1)))
G = matrix(-np.eye(4))
h = matrix(np.zeros(4))
A = matrix(y.reshape(1, -1))
b = matrix(np.zeros(1))

sol = solvers.qp(P, q, G, h, A, b)
lambdas = np.array(sol['x'])
print('lambdas = ', lambdas)

# get hyperplane parameters: w, b
w = ((y * lambdas).T @ X_ori).reshape(-1, 1)
print('w = ', w)
# support vectors are the points with lambda > 0 (small tolerance for numerics)
support = (lambdas > 1e-4).reshape(-1)
# b = y_j - w^T x_j for any support vector; use the original X_ori, not the
# label-scaled X, and average over all support vectors for numerical stability
b = np.mean(y[support] - X_ori[support] @ w)
print('b = ', b)
w_1, w_2 = w[0, 0], w[1, 0]

# visualize
x_1 = np.arange(1, 2, 0.01)
x_2 = - (w_1 * x_1 + b) / w_2

plt.figure()

plt.scatter(class_a[:, 0], class_a[:, 1], c='red', label='class 1')
plt.scatter(class_b[:, 0], class_b[:, 1], c='blue', label='class -1')

plt.plot(x_1, x_2, label='hyperplane')

plt.legend()
plt.show()
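As an optional cross-check (my own addition, not from the references: scikit-learn's SVC with a linear kernel and a very large C approximates the hard-margin solution), the recovered \(w, b\) can be compared against a library solver, reusing X_ori and y from above:

from sklearn.svm import SVC

# a very large C makes the soft margin effectively hard
clf = SVC(kernel='linear', C=1e6)
clf.fit(X_ori, y.ravel())
print('sklearn w = ', clf.coef_, ' b = ', clf.intercept_)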

reference
1. https://www.bilibili.com/video/av40868517?p=1
2. Boyd S., Vandenberghe L. Convex Optimization. Cambridge University Press, 2004.
3. https://www.jianshu.com/p/ba02b92baaaf
4. On understanding the method of Lagrange multipliers: https://zhuanlan.zhihu.com/p/39354973
5. https://blog.csdn.net/jclian91/article/details/79321407
6. https://www.cvxpy.org/examples/basic/quadratic_program.html