Optimization Algorithms
Type | Formula |
---|---|
Gradient descent | $\pmb{g}_{t} \leftarrow\nabla f_{\mathcal{B}_{t}}\left(\pmb{x}_{t-1}\right)=\frac{1}{\mid\mathcal{B}\mid}\sum_{i\in\mathcal{B}_{t}}\nabla f_i\left(\pmb{x}_{t-1}\right),$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\eta_t\pmb{g}_t,$ $\mid\mathcal{B}\mid=\mid\mathcal{B}_t\mid=n.$ |
Stochastic gradient descent (SGD) | $\pmb{x}\leftarrow \pmb{x}-\eta\nabla f_i\left(\pmb{x}\right),$ with $i$ drawn uniformly at random from $\lbrace 1,\dots,n\rbrace.$ |
Mini-batch stochastic gradient descent | $\pmb{g}_{t} \leftarrow\nabla f_{\mathcal{B}_{t}}\left(\pmb{x}_{t-1}\right)=\frac{1}{\mid\mathcal{B}\mid}\sum_{i\in\mathcal{B}_{t}}\nabla f_i\left(\pmb{x}_{t-1}\right),$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\eta_t\pmb{g}_t,$ $\mid\mathcal{B}\mid=\mid\mathcal{B}_t\mid<n.$ |
Note on the three methods above | If you compare their implementations, you will find the code is written identically for all three; the only difference is the value of the batch size, because a batch of $\mid\mathcal{B}\mid$ examples has already been drawn at random beforehand (see the sketch right after this table). |
Momentum | $\pmb{v}_t\leftarrow\gamma\pmb{v}_{t-1}+\eta_t\pmb{g}_t=\gamma\pmb{v}_{t-1}+\left(1-\gamma\right)\left(\frac{\eta_t\pmb{g}_t}{1-\gamma}\right),$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\pmb{v}_t,$ $0\leq\gamma<1.$ |
AdaGrad | $\pmb{s}_t\leftarrow\pmb{s}_{t-1}+\pmb{g}_t\bigodot\pmb{g}_t,$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\frac{\eta}{\sqrt{\pmb{s}_t+\epsilon}}\bigodot\pmb{g}_t.$ |
RMSProp | $\pmb{s}_t\leftarrow\gamma\pmb{s}_{t-1}+\left(1-\gamma\right)\pmb{g}_t\bigodot\pmb{g}_t,$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\frac{\eta}{\sqrt{\pmb{s}_t+\epsilon}}\bigodot\pmb{g}_t,$ $0\leq\gamma<1.$ |
AdaDelta | $\pmb{s}_t\leftarrow\rho\pmb{s}_{t-1}+\left(1-\rho\right)\pmb{g}_t\bigodot\pmb{g}_t,$ $\pmb{g}_t^{\prime}\leftarrow\sqrt{\frac{\Delta\pmb{x}_{t-1}+\epsilon}{\pmb{s}_t+\epsilon}}\bigodot\pmb{g}_t,$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\pmb{g}_t^{\prime},$ $\Delta\pmb{x}_t\leftarrow\rho\Delta\pmb{x}_{t-1}+\left(1-\rho\right)\pmb{g}_t^{\prime}\bigodot\pmb{g}_t^{\prime},$ $0\leq\rho<1.$ |
Adam | $\pmb{v}_t\leftarrow\beta_1\pmb{v}_{t-1}+\left(1-\beta_1\right)\pmb{g}_t,$ $\pmb{s}_t\leftarrow\beta_{2}\pmb{s}_{t-1}+\left(1-\beta_{2}\right)\pmb{g}_t\bigodot\pmb{g}_t,$ $\hat{\pmb{v}}_t\leftarrow\frac{\pmb{v}_t}{1-\beta_1^t},$ $\hat{\pmb{s}}_t\leftarrow\frac{\pmb{s}_t}{1-\beta_2^t},$ $\pmb{g}_t^{\prime}\leftarrow\frac{\eta\hat{\pmb{v}}_t}{\sqrt{\hat{\pmb{s}}_t+\epsilon}},$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\pmb{g}_t^{\prime},$ $0\leq\beta_1,\beta_2<1$ (recommended: $\beta_1=0.9$, $\beta_2=0.999$). |
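To make the note in that row concrete, here is a minimal NumPy sketch (synthetic least-squares data; every name and hyperparameter is illustrative, not from the source) in which one routine implements all three methods, differing only in the batch size passed to it:

```python
import numpy as np

# Synthetic least-squares problem: f_i(x) = 0.5 * (a_i . x - b_i)**2   (illustrative)
np.random.seed(0)
A = np.random.randn(1000, 2)
b = A @ np.array([2.0, -3.4]) + 0.01 * np.random.randn(1000)

def train(batch_size, eta=0.05, num_epochs=100):
    """The same routine is GD (batch_size = n), SGD (1), or mini-batch SGD (1 < |B| < n)."""
    x, n = np.zeros(2), len(b)
    for _ in range(num_epochs):
        idx = np.random.permutation(n)            # the batches are drawn at random
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            g = A[batch].T @ (A[batch] @ x - b[batch]) / len(batch)
            x -= eta * g                          # identical update in all three cases
    return x

print(train(batch_size=len(b)))   # gradient descent
print(train(batch_size=1))        # stochastic gradient descent
print(train(batch_size=10))       # mini-batch stochastic gradient descent
```

All three calls end up near the underlying coefficients $[2, -3.4]$; only the batch size changes.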
Optimization and Deep Learning

```python
from mpl_toolkits import mplot3d
```
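The `mplot3d` import above is matplotlib's 3-D plotting toolkit. As an illustration (the choice of surface is an assumption, not from the source), the sketch below draws the saddle surface $f(x_1,x_2)=x_1^2-x_2^2$, whose stationary point at the origin is neither a minimum nor a maximum; saddle points and local minima are the kind of obstacle that makes optimizing deep-learning objectives hard.

```python
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits import mplot3d  # registers the 3-D projection

# Saddle surface: at the origin f is minimized along x1 but maximized along x2,
# so the gradient vanishes there without the point being a minimum.
x1, x2 = np.meshgrid(np.linspace(-2, 2, 101), np.linspace(-2, 2, 101))
z = x1 ** 2 - x2 ** 2

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_wireframe(x1, x2, z, rstride=10, cstride=10)
ax.set_xlabel('x1')
ax.set_ylabel('x2')
plt.show()
```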
Gradient Descent and Stochastic Gradient Descent

One-Dimensional Gradient Descent

```python
from mpl_toolkits import mplot3d
```
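A minimal sketch of one-dimensional gradient descent, assuming the textbook objective $f(x)=x^2$ with derivative $f'(x)=2x$; the starting point, learning rate, and iteration count are illustrative.

```python
def gd_1d(eta=0.2, x=10.0, num_iters=10):
    """Gradient descent on f(x) = x**2, whose derivative is f'(x) = 2*x."""
    trajectory = [x]
    for _ in range(num_iters):
        x -= eta * 2 * x            # x <- x - eta * f'(x)
        trajectory.append(x)
    return trajectory

print(gd_1d())                      # the iterates shrink toward the minimizer x = 0
```

With $\eta=0.2$, each iteration multiplies $x$ by $1-2\eta=0.6$, so the error shrinks geometrically; an $\eta$ larger than $1$ would make the iterates diverge.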
Multidimensional Gradient Descent

```python
from mpl_toolkits import mplot3d
```
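A sketch of gradient descent in two dimensions, assuming the objective $f(x_1,x_2)=x_1^2+2x_2^2$ with gradient $(2x_1,\,4x_2)$; starting point, learning rate, and epoch count are illustrative.

```python
def gd_2d(eta=0.1, num_epochs=20):
    x1, x2 = -5.0, -2.0                      # illustrative starting point
    for epoch in range(1, num_epochs + 1):
        g1, g2 = 2 * x1, 4 * x2              # gradient of x1**2 + 2*x2**2
        x1, x2 = x1 - eta * g1, x2 - eta * g2
    print('epoch %d, x1 %f, x2 %f' % (num_epochs, x1, x2))

gd_2d()
```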
Stochastic Gradient Descent

```python
from mpl_toolkits import mplot3d
```

Outputs:

```
epoch 1, x1 -4.063163, x2 -1.156916
```
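A sketch of the stochastic counterpart on the same objective: the exact gradient $(2x_1,\,4x_2)$ is perturbed with Gaussian noise to mimic the variance of a single-example gradient, so the trajectory wanders around the minimum instead of settling exactly on it. Noise scale, learning rate, and seed are illustrative, so the printed numbers will not match the output above.

```python
import numpy as np

def sgd_2d(eta=0.1, num_epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    x1, x2 = -5.0, -2.0
    for epoch in range(1, num_epochs + 1):
        # noisy gradient estimates stand in for single-example gradients
        g1 = 2 * x1 + rng.normal()
        g2 = 4 * x2 + rng.normal()
        x1, x2 = x1 - eta * g1, x2 - eta * g2
        print('epoch %d, x1 %f, x2 %f' % (epoch, x1, x2))

sgd_2d()
```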
Mini-Batch Stochastic Gradient Descent

From-Scratch Implementation

```python
from mxnet import autograd, gluon, init, nd
```
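A self-contained sketch of a from-scratch mini-batch SGD training loop in MXNet, trained on synthetic linear-regression data; the dataset, shapes, and hyperparameters are assumptions for illustration.

```python
from mxnet import autograd, nd
from mxnet.gluon import data as gdata

# Synthetic linear-regression data (illustrative)
true_w, true_b = nd.array([[2], [-3.4]]), 4.2
features = nd.random.normal(shape=(1000, 2))
labels = nd.dot(features, true_w) + true_b + 0.01 * nd.random.normal(shape=(1000, 1))

batch_size, lr, num_epochs = 10, 0.05, 3
data_iter = gdata.DataLoader(gdata.ArrayDataset(features, labels),
                             batch_size, shuffle=True)

w = nd.random.normal(scale=0.01, shape=(2, 1))
b = nd.zeros((1,))
for param in (w, b):
    param.attach_grad()

def net(X):
    return nd.dot(X, w) + b                    # linear regression

def sgd(params, lr, batch_size):
    # MXNet stores the gradient of the sum over the batch, so divide by |B|
    for p in params:
        p[:] = p - lr * p.grad / batch_size

for epoch in range(num_epochs):
    for X, y in data_iter:
        with autograd.record():
            l = (net(X) - y) ** 2 / 2          # squared loss per example
        l.backward()                           # sums gradients over the batch
        sgd([w, b], lr, batch_size)
    train_l = ((net(features) - labels) ** 2 / 2).mean().asscalar()
    print('epoch %d, loss %f' % (epoch + 1, train_l))
```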
Concise Implementation

```python
from mxnet import autograd, gluon, init, nd
```
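A concise version can be sketched with Gluon's built-in layers, loss, and `Trainer`; it reuses `data_iter`, `features`, `labels`, and `batch_size` from the previous sketch, and the hyperparameters are again illustrative.

```python
from mxnet import autograd, gluon, init
from mxnet.gluon import nn, loss as gloss

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))
loss = gloss.L2Loss()

# 'sgd' with only a learning rate is plain mini-batch SGD
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.05})

for epoch in range(3):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(batch_size)               # rescales the update by 1/batch_size
    print('epoch %d, loss %f'
          % (epoch + 1, loss(net(features), labels).mean().asscalar()))
```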
Momentum

The Problem with Gradient Descent

```python
from mxnet import nd
```
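To see the problem, consider (as an assumed example) the ill-conditioned objective $f(x_1,x_2)=0.1x_1^2+2x_2^2$: a learning rate small enough to keep $x_2$ stable makes progress along $x_1$ painfully slow, while a larger one makes $x_2$ oscillate and finally diverge. A minimal sketch:

```python
def gd_anisotropic(eta, num_epochs=20):
    x1, x2 = -5.0, -2.0
    for _ in range(num_epochs):
        x1 -= eta * 0.2 * x1        # df/dx1 = 0.2 * x1
        x2 -= eta * 4.0 * x2        # df/dx2 = 4 * x2
    print('eta %.1f: x1 %f, x2 %f' % (eta, x1, x2))

gd_anisotropic(0.4)   # x2 oscillates but shrinks (|1 - 0.4*4| = 0.6 < 1); x1 creeps along
gd_anisotropic(0.6)   # x2 blows up (|1 - 0.6*4| = 1.4 > 1) before x1 gets anywhere
```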
Test

```python
from mxnet import nd
```
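A sketch of the momentum update on the same toy objective: the velocity $v$ is an exponentially weighted accumulation of past gradients, which damps the oscillation along $x_2$ (the values of $\eta$ and $\gamma$ are illustrative).

```python
def momentum_2d(eta=0.6, gamma=0.5, num_epochs=20):
    x1, x2, v1, v2 = -5.0, -2.0, 0.0, 0.0
    for _ in range(num_epochs):
        # v_t <- gamma * v_{t-1} + eta_t * g_t,  then  x_t <- x_{t-1} - v_t
        v1 = gamma * v1 + eta * 0.2 * x1
        v2 = gamma * v2 + eta * 4.0 * x2
        x1, x2 = x1 - v1, x2 - v2
    print('x1 %f, x2 %f' % (x1, x2))

momentum_2d()   # converges even at eta = 0.6, where plain gradient descent diverged
```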
Linear Regression, From-Scratch Implementation (Momentum)

```python
from mxnet import nd, autograd
```
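Relative to the from-scratch mini-batch SGD loop sketched earlier, only the optimizer state and the update step change; here is a sketch of those two pieces (function names and signatures are my own).

```python
from mxnet import nd

def init_momentum_states(params):
    # one velocity buffer per parameter, same shape, initialized to zero
    return [nd.zeros(p.shape) for p in params]

def sgd_momentum(params, states, lr, momentum, batch_size):
    for p, v in zip(params, states):
        v[:] = momentum * v + lr * p.grad / batch_size   # v_t <- gamma*v_{t-1} + eta*g_t
        p[:] = p - v                                     # x_t <- x_{t-1} - v_t
```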
Linear Regression, Concise Implementation (Momentum)

```python
from mxnet import nd, autograd, init, gluon
```
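In the Gluon version, only the `Trainer` line changes: momentum is a hyperparameter of the built-in `'sgd'` optimizer (a sketch reusing `net` from the concise loop above; values illustrative).

```python
from mxnet import gluon

trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.02, 'momentum': 0.9})
```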
AdaGrad

Test

```python
from mxnet import nd
```
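A sketch of AdaGrad on the same 2-D toy objective $f(x_1,x_2)=0.1x_1^2+2x_2^2$: each coordinate accumulates its squared gradients in $s$ and takes steps scaled by $\eta/\sqrt{s+\epsilon}$, so the steep $x_2$ direction is damped automatically (hyperparameters illustrative).

```python
import math

def adagrad_2d(eta=0.4, eps=1e-6, num_epochs=20):
    x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
    for _ in range(num_epochs):
        g1, g2 = 0.2 * x1, 4.0 * x2
        s1 += g1 * g1                              # s_t <- s_{t-1} + g_t (.) g_t
        s2 += g2 * g2
        x1 -= eta / math.sqrt(s1 + eps) * g1       # per-coordinate learning rate
        x2 -= eta / math.sqrt(s2 + eps) * g2
    print('x1 %f, x2 %f' % (x1, x2))

adagrad_2d()
```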
Linear Regression, From-Scratch Implementation (AdaGrad)

```python
from mxnet import nd, autograd
```
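As before, only the state initialization and the update step need to change in the from-scratch training loop (a sketch, names assumed).

```python
from mxnet import nd

def init_adagrad_states(params):
    return [nd.zeros(p.shape) for p in params]

def adagrad(params, states, lr, batch_size, eps=1e-6):
    for p, s in zip(params, states):
        g = p.grad / batch_size
        s[:] = s + g * g                           # accumulate squared gradients
        p[:] = p - lr * g / nd.sqrt(s + eps)       # element-wise adaptive step
```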
Linear Regression, Concise Implementation (AdaGrad)

```python
from mxnet import nd, autograd, init, gluon
```
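In Gluon, AdaGrad is selected by name; only the `Trainer` line of the concise loop changes (learning rate illustrative).

```python
from mxnet import gluon

trainer = gluon.Trainer(net.collect_params(), 'adagrad', {'learning_rate': 0.1})
```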
RMSProp

Test

```python
from mxnet import nd
```
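A sketch of RMSProp on the 2-D toy objective: unlike AdaGrad, the squared gradients enter an exponentially weighted moving average, so the effective step size no longer shrinks monotonically ($\eta$ and $\gamma$ illustrative).

```python
import math

def rmsprop_2d(eta=0.4, gamma=0.9, eps=1e-6, num_epochs=20):
    x1, x2, s1, s2 = -5.0, -2.0, 0.0, 0.0
    for _ in range(num_epochs):
        g1, g2 = 0.2 * x1, 4.0 * x2
        s1 = gamma * s1 + (1 - gamma) * g1 * g1    # EWMA of squared gradients
        s2 = gamma * s2 + (1 - gamma) * g2 * g2
        x1 -= eta / math.sqrt(s1 + eps) * g1
        x2 -= eta / math.sqrt(s2 + eps) * g2
    print('x1 %f, x2 %f' % (x1, x2))

rmsprop_2d()
```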
Linear Regression, From-Scratch Implementation (RMSProp)

```python
from mxnet import nd, autograd
```
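The corresponding pieces for the from-scratch loop (a sketch, names assumed).

```python
from mxnet import nd

def init_rmsprop_states(params):
    return [nd.zeros(p.shape) for p in params]

def rmsprop(params, states, lr, gamma, batch_size, eps=1e-6):
    for p, s in zip(params, states):
        g = p.grad / batch_size
        s[:] = gamma * s + (1 - gamma) * g * g     # EWMA of squared gradients
        p[:] = p - lr * g / nd.sqrt(s + eps)
```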
Linear Regression, Concise Implementation (RMSProp)

```python
from mxnet import nd, autograd, init, gluon
```
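For the concise version, MXNet's `'rmsprop'` optimizer exposes the decay factor as `gamma1` (values illustrative).

```python
from mxnet import gluon

trainer = gluon.Trainer(net.collect_params(), 'rmsprop',
                        {'learning_rate': 0.01, 'gamma1': 0.9})
```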
AdaDelta

Linear Regression, From-Scratch Implementation (AdaDelta)

```python
from mxnet import nd, autograd
```
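A sketch of the AdaDelta update for the from-scratch loop: it keeps two buffers per parameter, the moving average $s$ of squared gradients and the moving average $\Delta x$ of squared rescaled updates, and needs no learning rate (names assumed).

```python
from mxnet import nd

def init_adadelta_states(params):
    # (s, delta_x) per parameter: EWMAs of g*g and of the rescaled updates g'*g'
    return [(nd.zeros(p.shape), nd.zeros(p.shape)) for p in params]

def adadelta(params, states, rho, batch_size, eps=1e-5):
    for p, (s, delta) in zip(params, states):
        g = p.grad / batch_size
        s[:] = rho * s + (1 - rho) * g * g
        g_prime = nd.sqrt((delta + eps) / (s + eps)) * g     # rescaled gradient
        p[:] = p - g_prime
        delta[:] = rho * delta + (1 - rho) * g_prime * g_prime
```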
Linear Regression, Concise Implementation (AdaDelta)

```python
from mxnet import nd, autograd, init, gluon
```
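The concise version selects `'adadelta'` and passes only `rho` (value illustrative).

```python
from mxnet import gluon

# AdaDelta takes no learning rate; rho is the decay factor for both running averages
trainer = gluon.Trainer(net.collect_params(), 'adadelta', {'rho': 0.9999})
```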
Adam

Linear Regression, From-Scratch Implementation (Adam)

```python
from mxnet import nd, autograd
```
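A sketch of the Adam update for the from-scratch loop, including the bias corrections $\hat{\pmb{v}}_t=\pmb{v}_t/(1-\beta_1^t)$ and $\hat{\pmb{s}}_t=\pmb{s}_t/(1-\beta_2^t)$; the caller has to pass the global step count $t$, starting from 1 (names assumed).

```python
from mxnet import nd

def init_adam_states(params):
    # (v, s) per parameter: EWMAs of the gradient and of the squared gradient
    return [(nd.zeros(p.shape), nd.zeros(p.shape)) for p in params]

def adam(params, states, lr, t, batch_size, beta1=0.9, beta2=0.999, eps=1e-6):
    for p, (v, s) in zip(params, states):
        g = p.grad / batch_size
        v[:] = beta1 * v + (1 - beta1) * g
        s[:] = beta2 * s + (1 - beta2) * g * g
        v_hat = v / (1 - beta1 ** t)               # bias correction
        s_hat = s / (1 - beta2 ** t)
        p[:] = p - lr * v_hat / nd.sqrt(s_hat + eps)
```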
Linear Regression, Concise Implementation (Adam)

```python
from mxnet import nd, autograd, init, gluon
```
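The concise version selects `'adam'` (learning rate illustrative).

```python
from mxnet import gluon

trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.01})
```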