
Almeida–Pineda algorithm

The Almeida–Pineda algorithm, a.k.a. recurrent backpropagation (RBP), comes from reference 1 and reference 2. Since they date back quite a while, I could not find reference 1; I did read reference 2, but personally found it somewhat obscure (the first coupled differential equation lost me), so I gathered other material to read in order to understand it.

From reference 2 we can learn that the RBP algorithm finds a fixed point of the network state by iteration, but it requires the network connections to be symmetric.

Reference 3 takes the RNN model as its frame of reference to revive the RBP algorithm. From the mechanics of an RNN we know that the hidden-state transition is \(h_{t+1} = F(x, w_F, h_t)\). The hope is that, after the transitions over the time steps, the hidden state settles so that it can be passed through the output mapping to obtain a prediction; in other words, training yields a stable model. That is:
\(h^* = F(x, w_F, h^*)\)
In other words, the state transition eventually reaches a fixed point \(h^*\).
The output prediction is then made:
\(y = G(x, w_G, h^*)\)
Finally, the loss function \(L = l(\hat{y}, y)\) measures how well the model has been trained.
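To make the forward pass concrete, here is a minimal sketch in Python/NumPy. The specific forms \(F(x, w_F, h) = \tanh(W_F h + U x)\), \(G(x, w_G, h) = W_G h\) and the squared-error loss are illustrative assumptions of mine, not choices prescribed by the references; any contraction \(F\) and readout \(G\) would do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions and parameters (all names here are my own, for illustration only).
d_h, d_x, d_y = 4, 3, 2
W_F = rng.normal(scale=0.1, size=(d_h, d_h))  # small scale keeps this F a contraction
U   = rng.normal(size=(d_h, d_x))
W_G = rng.normal(size=(d_y, d_h))

def F(x, h):
    """State transition h_{t+1} = F(x, w_F, h_t); here w_F = (W_F, U)."""
    return np.tanh(W_F @ h + U @ x)

def G(x, h):
    """Readout y = G(x, w_G, h); here simply a linear map of h."""
    return W_G @ h

def forward_fixed_point(x, tol=1e-10, max_iter=1000):
    """Iterate h <- F(x, h) until it stops changing, i.e. until h* = F(x, h*)."""
    h = np.zeros(d_h)
    for _ in range(max_iter):
        h_next = F(x, h)
        if np.linalg.norm(h_next - h) < tol:
            break
        h = h_next
    return h_next

x = rng.normal(size=d_x)
y_true = rng.normal(size=d_y)
h_star = forward_fixed_point(x)                # the fixed point h*
y_pred = G(x, h_star)                          # prediction from the settled state
loss = 0.5 * np.sum((y_pred - y_true) ** 2)    # l(y_hat, y): assumed squared error
```

The small scale on \(W_F\) is what makes this particular \(F\) a contraction, so the iteration actually settles; the contraction requirement comes up again at the end of this post.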

So how are the parameters updated?

Assume the input \(x\) is fixed. Treat the state transition as defining \(h\) implicitly; since \(h\) and \(w_F\) are functionally related, the effect of a change in \(w_F\) on the hidden state (i.e. \(\frac{\partial h}{\partial w_F}\)) is obtained by differentiating the fixed-point equation:
\(h^* = F(x, w_F, h^*)\)
\(\frac{\partial h^*}{\partial w_F} - \frac{\partial F(x, w_F, h^*)}{\partial w_F} - \frac{\partial F(x, w_F, h^*)}{\partial h^*}\frac{\partial h^*}{\partial w_F} = 0\)
\(\frac{\partial h^*}{\partial w_F} = \left(I - \frac{\partial F(x, w_F, h^*)}{\partial h^*}\right)^{-1}\frac{\partial F(x, w_F, h^*)}{\partial w_F}\)

The presentation in reference 3 instead constructs a function \(\Psi(w_F, h) = h - F(x, w_F, h)\)
and differentiates it:
\(\frac{\partial \Psi(w_F, h^*)}{\partial w_F} = \frac{\partial h^*}{\partial w_F} - \frac{dF(x, w_F, h^*)}{dw_F} = (I - J_{F, h^*})\frac{\partial h^*}{\partial w_F} - \frac{\partial F(x, w_F, h^*)}{\partial w_F} = 0\)
where \(J_{F, h^*} = \frac{\partial F(x, w_F, h^*)}{\partial h}\) is the Jacobian matrix.
\(\Longrightarrow \frac{\partial h^*}{\partial w_F} = (I - J_{F, h^*})^{-1}\frac{\partial F(x, w_F, h^*)}{\partial w_F}\)

Here it is assumed that \(I - J^T_{F, h^*}\) is invertible.
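Continuing the toy sketch from above, the implicit-gradient formula can be sanity-checked numerically: for a single entry \((W_F)_{jk}\), compare \((I - J_{F, h^*})^{-1}\frac{\partial F}{\partial (W_F)_{jk}}\) against a finite-difference estimate of how the fixed point moves when that entry is perturbed. The helper functions below are specific to the assumed \(\tanh\) form of \(F\).

```python
def jacobian_F_h(x, h):
    """J_{F,h} = dF/dh at (x, h) for the tanh example: diag(1 - F^2) @ W_F."""
    a = np.tanh(W_F @ h + U @ x)
    return np.diag(1.0 - a ** 2) @ W_F

def dF_dWjk(x, h, j, k):
    """dF/d(W_F)_{jk} at (x, h), holding h fixed: only component j is affected."""
    a = np.tanh(W_F @ h + U @ x)
    g = np.zeros(d_h)
    g[j] = (1.0 - a[j] ** 2) * h[k]
    return g

j, k = 1, 2
J = jacobian_F_h(x, h_star)
dh_implicit = np.linalg.solve(np.eye(d_h) - J, dF_dWjk(x, h_star, j, k))

# Finite-difference check: perturb (W_F)_{jk} and re-solve the fixed point.
eps = 1e-5
W_F[j, k] += eps
h_plus = forward_fixed_point(x)
W_F[j, k] -= 2 * eps
h_minus = forward_fixed_point(x)
W_F[j, k] += eps                         # restore the original weight
dh_numeric = (h_plus - h_minus) / (2 * eps)

print(np.max(np.abs(dh_implicit - dh_numeric)))  # should be tiny (finite-difference noise only)
```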

The gradients of the loss function are:
\(\frac{\partial L}{\partial w_G} = \frac{\partial L}{\partial y}\frac{\partial G(x, w_G, h^*)}{\partial w_G}\)
\(\frac{\partial L}{\partial w_F} = \frac{\partial L}{\partial y}\frac{\partial y}{\partial h^*}(I - J_{F, h^*})^{-1}\frac{\partial F(x, w_F, h^*)}{\partial w_F}\)

The gradient with respect to \(w_G\) is easy to compute, but the one with respect to \(w_F\) is not: \(\frac{\partial F(x, w_F, h^*)}{\partial w_F}\) itself is readily available, so the difficulty lies in the inverse \((I - J_{F, h^*})^{-1}\). Let:
\(z = (I - J^T_{F, h^*})^{-1}\left(\frac{\partial L}{\partial y}\frac{\partial y}{\partial h^*}\right)^T\)
and compute \(z\) by fixed-point iteration on this equation:
\((I - J^T_{F, h^*})z = \left(\frac{\partial L}{\partial y}\frac{\partial y}{\partial h^*}\right)^T\)
\(z = J^T_{F, h^*}z + \left(\frac{\partial L}{\partial y}\frac{\partial y}{\partial h^*}\right)^T\)
Once \(z\) has converged, the gradient is simply \(\frac{\partial L}{\partial w_F} = z^T\frac{\partial F(x, w_F, h^*)}{\partial w_F}\), so the inverse never has to be formed explicitly.
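Continuing the same toy sketch (and again only as an illustrative sketch under my assumed \(F\), \(G\), and loss, not code from reference 3), the RBP backward pass then looks like this: iterate \(z \leftarrow J^T_{F,h^*} z + \left(\frac{\partial L}{\partial y}\frac{\partial y}{\partial h^*}\right)^T\) until it stops changing, then read off the gradients.

```python
# Backward pass: for the squared-error loss, dL/dy = (y_pred - y_true),
# and with y = W_G h* we get dL/dh* = (dL/dy)(dy/dh*) = (y_pred - y_true) @ W_G.
dL_dh = (y_pred - y_true) @ W_G

# Fixed-point iteration z <- J^T z + (dL/dh*)^T.
J = jacobian_F_h(x, h_star)
z = np.zeros(d_h)
for _ in range(1000):
    z_next = J.T @ z + dL_dh
    if np.linalg.norm(z_next - z) < 1e-10:
        break
    z = z_next
z = z_next

# dL/d(W_F)_{jk} = z^T dF/d(W_F)_{jk} = z_j * (1 - (h*_j)^2) * h*_k  (tanh example)
grad_W_F = np.outer(z * (1.0 - h_star ** 2), h_star)

# The w_G gradient is direct: dL/dW_G = (y_pred - y_true) outer h*.
grad_W_G = np.outer(y_pred - y_true, h_star)
```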

Unlike the BPTT algorithm in common use today, which propagates backward step by step through the time steps and cannot guarantee that the final hidden state has become stable, the RBP algorithm drives the hidden state to a stable, unchanging state via fixed-point iteration; the price is stronger requirements on the setup: the state-transition function \(F\) needs to be a contraction mapping to guarantee that the fixed point \(h^*\) is unique.
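As a quick local sanity check under the toy setup above (a local check only, not a proof that \(F\) is globally contractive): if the spectral radius of \(J_{F, h^*}\) is below 1, both the forward iteration near \(h^*\) and the backward iteration for \(z\) converge.

```python
# Local convergence check at the fixed point of the toy example above:
# both fixed-point iterations converge (locally) when the spectral radius of J is < 1.
rho = np.max(np.abs(np.linalg.eigvals(jacobian_F_h(x, h_star))))
print(f"spectral radius of J at h*: {rho:.3f}  (iterations converge locally if < 1)")
```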

References
1. Almeida L B. A learning rule for asynchronous perceptrons with feedback in a combinatorial environment[C]//Proceedings, 1st International Conference on Neural Networks. IEEE, 1987, 2: 609-618.
2. Pineda F J. Generalization of back-propagation to recurrent neural networks[J]. Physical review letters, 1987, 59(19): 2229.
3. Liao R, Xiong Y, Fetaya E, et al. Reviving and improving recurrent back-propagation[J]. arXiv preprint arXiv:1803.06396, 2018.