Optimization Algorithms
| Type | Formula |
|---|---|
| Gradient descent | $\pmb{g}_{t} \leftarrow\nabla f_{\mathcal{B}_{t}}\left(\pmb{x}_{t-1}\right)=\frac{1}{\mid\mathcal{B}\mid}\sum_{i\in\mathcal{B}_{t}}\nabla f_i\left(\pmb{x}_{t-1}\right),$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\eta_t\pmb{g}_t,$ $\mid\mathcal{B}\mid=\mid\mathcal{B}_t\mid=n.$ |
| Stochastic gradient descent (SGD) | $\pmb{x}\leftarrow \pmb{x}-\eta\nabla f_i\left(\pmb{x}\right),$ with $i$ drawn uniformly at random from $\lbrace 1,\ldots,n\rbrace$. |
| Mini-batch stochastic gradient descent | $\pmb{g}_{t} \leftarrow\nabla f_{\mathcal{B}_{t}}\left(\pmb{x}_{t-1}\right)=\frac{1}{\mid\mathcal{B}\mid}\sum_{i\in\mathcal{B}_{t}}\nabla f_i\left(\pmb{x}_{t-1}\right),$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\eta_t\pmb{g}_t,$ $\mid\mathcal{B}\mid=\mid\mathcal{B}_t\mid<n.$ |
| Note on the three methods above | Comparing their code implementations (see the sketch after this table), the code is identical for all three; the only difference is the batch size $\mid\mathcal{B}\mid$, because a batch of $\mid\mathcal{B}\mid$ examples has already been drawn at random beforehand. |
| Momentum | $\pmb{v}_t\leftarrow\gamma\pmb{v}_{t-1}+\eta_t\pmb{g}_t=\gamma\pmb{v}_{t-1}+\left(1-\gamma\right)\left(\frac{\eta_t\pmb{g}_t}{1-\gamma}\right),$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\pmb{v}_t,$ $0\leq\gamma<1.$ |
| AdaGrad | $\pmb{s}_t\leftarrow\pmb{s}_{t-1}+\pmb{g}_t\bigodot\pmb{g}_t,$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\frac{\eta}{\sqrt{\pmb{s}_t+\epsilon}}\bigodot\pmb{g}_t.$ |
| RMSProp | $\pmb{s}_t\leftarrow\gamma\pmb{s}_{t-1}+\left(1-\gamma\right)\pmb{g}_t\bigodot\pmb{g}_t,$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\frac{\eta}{\sqrt{\pmb{s}_t+\epsilon}}\bigodot\pmb{g}_t,$ $0\leq\gamma<1.$ |
| AdaDelta | $\pmb{s}_t\leftarrow\rho\pmb{s}_{t-1}+\left(1-\rho\right)\pmb{g}_t\bigodot\pmb{g}_t,$ $\pmb{g}_t^{\prime}\leftarrow\sqrt{\frac{\Delta\pmb{x}_{t-1}+\epsilon}{\pmb{s}_t+\epsilon}}\bigodot\pmb{g}_t,$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\pmb{g}_t^{\prime},$ $\Delta\pmb{x}_t\leftarrow\rho\Delta\pmb{x}_{t-1}+\left(1-\rho\right)\pmb{g}_t^{\prime}\bigodot\pmb{g}_t^{\prime},$ $0\leq\rho<1.$ |
| Adam | $\pmb{v}_t\leftarrow\beta_1\pmb{v}_{t-1}+\left(1-\beta_1\right)\pmb{g}_t,$ $\pmb{s}_t\leftarrow\beta_{2}\pmb{s}_{t-1}+\left(1-\beta_{2}\right)\pmb{g}_t\bigodot\pmb{g}_t,$ $\hat{\pmb{v}}_t\leftarrow\frac{\pmb{v}_t}{1-\beta_1^t},$ $\hat{\pmb{s}}_t\leftarrow\frac{\pmb{s}_t}{1-\beta_2^t},$ $\pmb{g}_t^{\prime}\leftarrow\frac{\eta\hat{\pmb{v}}_t}{\sqrt{\hat{\pmb{s}}_t+\epsilon}},$ $\pmb{x}_t\leftarrow\pmb{x}_{t-1}-\pmb{g}_t^{\prime},$ $0\leq\beta_1,\beta_2<1$ (recommended: $\beta_1=0.9$, $\beta_2=0.999$). |
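A minimal NumPy sketch of the note row above (toy linear data and a hypothetical `train` helper, not taken from the original notebooks): the same routine implements batch gradient descent, SGD, and mini-batch SGD, and only `batch_size` changes.

```python
import numpy as np

np.random.seed(0)
n = 1000
X = np.random.normal(size=(n, 2))        # toy features
y = X @ np.array([2.0, -3.4]) + 4.2      # toy linear labels

def train(batch_size, lr=0.05, num_iters=300):
    """One routine covers all three variants; only batch_size differs:
    batch_size == n      -> (batch) gradient descent
    batch_size == 1      -> stochastic gradient descent
    1 < batch_size < n   -> mini-batch stochastic gradient descent
    """
    w, b = np.zeros(2), 0.0
    for t in range(num_iters):
        idx = np.random.choice(n, batch_size, replace=False)  # random batch B_t
        err = X[idx] @ w + b - y[idx]
        gw = X[idx].T @ err / batch_size                       # averaged gradient
        gb = err.mean()
        w -= lr * gw
        b -= lr * gb
    return w, b

for bs in (n, 1, 10):
    print(bs, train(batch_size=bs))
```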
Optimization and Deep Learning

```python
from mpl_toolkits import mplot3d
```
Gradient Descent and Stochastic Gradient Descent

One-Dimensional Gradient Descent

```python
from mpl_toolkits import mplot3d
```

Outputs:
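A minimal sketch of one-dimensional gradient descent, assuming the toy objective f(x) = x² (the objective, starting point, and learning rates are illustrative, not from the original listing):

```python
def gd_1d(eta, x0=10.0, steps=10):
    """Gradient descent on f(x) = x**2, whose gradient is 2*x."""
    x, trace = x0, [x0]
    for _ in range(steps):
        x -= eta * 2 * x          # x <- x - eta * f'(x)
        trace.append(x)
    return trace

print(gd_1d(eta=0.2))   # converges toward the minimum at x = 0
print(gd_1d(eta=1.1))   # learning rate too large: the iterates diverge
```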
Multi-Dimensional Gradient Descent

```python
from mpl_toolkits import mplot3d
```

Outputs:
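A similar sketch in two dimensions, assuming the toy objective f(x1, x2) = x1² + 2·x2² and starting point (-5, -2); both are illustrative assumptions:

```python
def gd_2d(eta, steps=20):
    """Gradient descent on f(x1, x2) = x1**2 + 2 * x2**2."""
    x1, x2 = -5.0, -2.0
    for _ in range(steps):
        g1, g2 = 2 * x1, 4 * x2                       # partial derivatives
        x1, x2 = x1 - eta * g1, x2 - eta * g2
    return x1, x2

print('x1 %f, x2 %f' % gd_2d(eta=0.1))   # approaches the minimum at (0, 0)
```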
Stochastic Gradient Descent

```python
from mpl_toolkits import mplot3d
```

```
epoch 1, x1 -4.063163, x2 -1.156916
```
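A sketch of stochastic gradient descent on the same assumed toy objective as above, with Gaussian noise added to each gradient to mimic evaluating it on a single randomly drawn example (objective, starting point, and noise scale are assumptions):

```python
import random

def sgd_2d(eta, steps=20):
    """Noisy gradient descent on f(x1, x2) = x1**2 + 2 * x2**2."""
    x1, x2 = -5.0, -2.0
    for _ in range(steps):
        g1 = 2 * x1 + random.normalvariate(0, 1)      # noisy partial derivatives
        g2 = 4 * x2 + random.normalvariate(0, 1)
        x1, x2 = x1 - eta * g1, x2 - eta * g2
    return x1, x2

print('x1 %f, x2 %f' % sgd_2d(eta=0.1))  # hovers near (0, 0) instead of settling exactly
```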
Mini-Batch Stochastic Gradient Descent

Implementation from Scratch

```python
from mxnet import autograd, gluon, init, nd
```

Outputs:
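A from-scratch sketch of mini-batch SGD for linear regression in MXNet; the synthetic data, hyperparameters, and helper names (`linreg`, `squared_loss`, `sgd`) are illustrative assumptions, not the original code:

```python
from mxnet import autograd, nd
from mxnet.gluon import data as gdata

# Synthetic linear-regression data (illustrative values)
true_w, true_b = nd.array([2, -3.4]), 4.2
features = nd.random.normal(scale=1, shape=(1000, 2))
labels = nd.dot(features, true_w) + true_b + nd.random.normal(scale=0.01, shape=(1000,))

batch_size = 10
data_iter = gdata.DataLoader(gdata.ArrayDataset(features, labels),
                             batch_size, shuffle=True)

def linreg(X, w, b):
    return nd.dot(X, w) + b

def squared_loss(y_hat, y):
    return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2

def sgd(params, lr, batch_size):
    """Mini-batch SGD update, applied in place to each parameter."""
    for param in params:
        param[:] = param - lr * param.grad / batch_size

w = nd.random.normal(scale=0.01, shape=(2, 1))
b = nd.zeros(1)
for param in (w, b):
    param.attach_grad()

for epoch in range(3):
    for X, y in data_iter:
        with autograd.record():
            l = squared_loss(linreg(X, w, b), y)
        l.backward()                                   # gradients are summed over the batch
        sgd([w, b], lr=0.03, batch_size=batch_size)
    epoch_loss = squared_loss(linreg(features, w, b), labels).mean().asscalar()
    print('epoch %d, loss %f' % (epoch + 1, epoch_loss))
```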
Concise Implementation

```python
from mxnet import autograd, gluon, init, nd
```

Outputs:
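A sketch of the concise version using Gluon's built-in `Trainer` with the `'sgd'` optimizer; the network, data, and hyperparameters are illustrative assumptions:

```python
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import nn, data as gdata, loss as gloss

features = nd.random.normal(scale=1, shape=(1000, 2))
labels = nd.dot(features, nd.array([2, -3.4])) + 4.2
data_iter = gdata.DataLoader(gdata.ArrayDataset(features, labels),
                             batch_size=10, shuffle=True)

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))
loss = gloss.L2Loss()

# Mini-batch SGD via Gluon's Trainer
trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.03})

for epoch in range(3):
    for X, y in data_iter:
        with autograd.record():
            l = loss(net(X), y)
        l.backward()
        trainer.step(10)        # step() divides the summed gradient by the batch size
    print('epoch %d, loss %f'
          % (epoch + 1, loss(net(features), labels).mean().asscalar()))
```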
Momentum

Problems with Gradient Descent

```python
from mxnet import nd
```

Outputs:
Test

```python
from mxnet import nd
```

```
training:
```
Linear Regression from Scratch (Momentum)

```python
from mxnet import nd, autograd
```

```
training:
```
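A from-scratch sketch of the momentum update from the table; the function names are illustrative, `p.grad` is assumed to already hold the (averaged) mini-batch gradient, and the functions would be called inside a training loop like the mini-batch SGD one above:

```python
from mxnet import nd

def init_momentum_states(params):
    """One velocity buffer per parameter, initialized to zero."""
    return [nd.zeros(p.shape) for p in params]

def sgd_momentum(params, states, lr, gamma):
    """v <- gamma * v + lr * grad;  x <- x - v  (applied in place)."""
    for p, v in zip(params, states):
        v[:] = gamma * v + lr * p.grad
        p[:] = p - v
```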
Concise Linear Regression (Momentum)

```python
from mxnet import nd, autograd, init, gluon
```

```
training:
```
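A sketch of the concise version, assuming a Gluon setup like the one in the mini-batch SGD section above (network and hyperparameters are illustrative). Gluon's `'sgd'` optimizer accepts a `momentum` argument, so only the `Trainer` construction changes; `trainer.step(batch_size)` is then called in the usual training loop.

```python
from mxnet import gluon, init
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))

# Momentum is a hyperparameter of Gluon's built-in 'sgd' optimizer
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.02, 'momentum': 0.9})
```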
AdaGrad

Test

```python
from mxnet import nd
```

```
training:
```
Linear Regression from Scratch (AdaGrad)

```python
from mxnet import nd, autograd
```

```
training:
```
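A from-scratch sketch of the AdaGrad update from the table (function names illustrative; `p.grad` is assumed to hold the mini-batch gradient):

```python
from mxnet import nd

def init_adagrad_states(params):
    """One accumulator s per parameter, initialized to zero."""
    return [nd.zeros(p.shape) for p in params]

def adagrad(params, states, lr, eps=1e-6):
    """s <- s + g*g;  x <- x - lr * g / sqrt(s + eps), element-wise."""
    for p, s in zip(params, states):
        s[:] = s + p.grad * p.grad
        p[:] = p - lr * p.grad / (s + eps).sqrt()
```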
Concise Linear Regression (AdaGrad)

```python
from mxnet import nd, autograd, init, gluon
```

```
training:
```
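Concise-version sketch: Gluon registers AdaGrad under the name `'adagrad'`; the learning rate here is illustrative, and the training loop is the same as before.

```python
from mxnet import gluon, init
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))

trainer = gluon.Trainer(net.collect_params(), 'adagrad', {'learning_rate': 0.1})
```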
RMSProp

Test

```python
from mxnet import nd
```

```
training:
```
Linear Regression from Scratch (RMSProp)

```python
from mxnet import nd, autograd
```

```
training:
```
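A from-scratch sketch of the RMSProp update from the table (function names illustrative):

```python
from mxnet import nd

def init_rmsprop_states(params):
    """One moving-average buffer s per parameter."""
    return [nd.zeros(p.shape) for p in params]

def rmsprop(params, states, lr, gamma, eps=1e-6):
    """s <- gamma*s + (1-gamma)*g*g;  x <- x - lr * g / sqrt(s + eps)."""
    for p, s in zip(params, states):
        s[:] = gamma * s + (1 - gamma) * p.grad * p.grad
        p[:] = p - lr * p.grad / (s + eps).sqrt()
```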
Concise Linear Regression (RMSProp)

```python
from mxnet import nd, autograd, init, gluon
```

```
training:
```
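Concise-version sketch: in Gluon's `'rmsprop'` optimizer the decay rate γ is passed as `gamma1` (values illustrative):

```python
from mxnet import gluon, init
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))

trainer = gluon.Trainer(net.collect_params(), 'rmsprop',
                        {'learning_rate': 0.01, 'gamma1': 0.9})
```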
AdaDelta

Linear Regression from Scratch (AdaDelta)

```python
from mxnet import nd, autograd
```

```
training:
```
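A from-scratch sketch of the AdaDelta update from the table; note that there is no learning rate, only the decay rate ρ (function names illustrative):

```python
from mxnet import nd

def init_adadelta_states(params):
    """Two buffers per parameter: s for squared gradients, delta for squared updates."""
    return [(nd.zeros(p.shape), nd.zeros(p.shape)) for p in params]

def adadelta(params, states, rho, eps=1e-5):
    for p, (s, delta) in zip(params, states):
        s[:] = rho * s + (1 - rho) * p.grad * p.grad
        g_prime = ((delta + eps) / (s + eps)).sqrt() * p.grad   # rescaled gradient
        p[:] = p - g_prime
        delta[:] = rho * delta + (1 - rho) * g_prime * g_prime
```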
Concise Linear Regression (AdaDelta)

```python
from mxnet import nd, autograd, init, gluon
```

```
training:
```
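Concise-version sketch: Gluon's `'adadelta'` optimizer takes no learning rate, only the decay rate `rho` (the value here is illustrative):

```python
from mxnet import gluon, init
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))

trainer = gluon.Trainer(net.collect_params(), 'adadelta', {'rho': 0.9999})
```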
Adam

Linear Regression from Scratch (Adam)

```python
from mxnet import nd, autograd
```

```
training:
```
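A from-scratch sketch of the Adam update from the table; `t` is the 1-based time step used for bias correction (function names illustrative):

```python
from mxnet import nd

def init_adam_states(params):
    """Buffers v (first moment) and s (second moment) per parameter."""
    return [(nd.zeros(p.shape), nd.zeros(p.shape)) for p in params]

def adam(params, states, lr, t, beta1=0.9, beta2=0.999, eps=1e-8):
    for p, (v, s) in zip(params, states):
        v[:] = beta1 * v + (1 - beta1) * p.grad
        s[:] = beta2 * s + (1 - beta2) * p.grad * p.grad
        v_hat = v / (1 - beta1 ** t)          # bias-corrected first moment
        s_hat = s / (1 - beta2 ** t)          # bias-corrected second moment
        p[:] = p - lr * v_hat / (s_hat + eps).sqrt()
```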
Concise Linear Regression (Adam)

```python
from mxnet import nd, autograd, init, gluon
```

```
training:
```
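Concise-version sketch: Gluon's `'adam'` optimizer defaults to β1 = 0.9 and β2 = 0.999, so only the learning rate needs to be passed (the value here is illustrative):

```python
from mxnet import gluon, init
from mxnet.gluon import nn

net = nn.Sequential()
net.add(nn.Dense(1))
net.initialize(init.Normal(sigma=0.01))

trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 0.01})
```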
