Anyone studying DL probably knows this paper: it is one of the three 2006 breakthrough papers of deep learning, and its main idea is the greedy layer-wise training method. Below are excerpts from the paper together with translations and notes.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Problem: To train deep networks, gradient-based optimization starting from random initialization appears to often get stuck in poor solutions.

1. Introduction
For shallow-architecture models such as SVMs, with d inputs one may need on the order of 2^d examples to train the model adequately; as d grows this runs into the curse of dimensionality. Multi-layer neural networks can avoid this problem:
boolean functions (such as the function that computes the multiplication of two numbers from their d-bit representation) expressible by O(log d) layers of combinatorial logic with O(d) elements in each layer may require O(2^d) elements when expressed with only 2 layers.
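As a concrete instance of this depth/width trade-off (a standard illustration, not the example used in the paper): the parity of d bits can be computed by a tree of 2-input XOR gates of depth O(log d) with O(d) gates in total, whereas any 2-layer representation such as a disjunctive normal form needs on the order of 2^(d-1) terms.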

Three important aspects:
1. Pre-training one layer at a time in a greedy way.
2. Using unsupervised learning at each layer in order to preserve information from the inputs.
3. Fine-tuning the whole network with respect to the ultimate criterion of interest.
2. DBN
2.1 RBM
2.2 Gibbs Markov Chain and log-likelihood gradient in an RBM
RBMUpdate Algorithm
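As a rough illustration (not the paper's exact pseudocode), here is a minimal numpy sketch of one RBMupdate step, i.e. one step of contrastive divergence (CD-1) with a single Gibbs transition; the parameter names W (weights), b (visible biases), c (hidden biases) and the learning rate eps are my own choices:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rbm_update(v0, W, b, c, eps=0.1, rng=np.random):
    """One CD-1 update. v0: binary (or [0,1]-valued) visible vector."""
    # Positive phase: hidden units given the data
    p_h0 = sigmoid(c + W @ v0)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Negative phase: one Gibbs step back down and up again
    p_v1 = sigmoid(b + W.T @ h0)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(c + W @ v1)
    # Stochastic approximation of the log-likelihood gradient
    W += eps * (np.outer(p_h0, v0) - np.outer(p_h1, v1))
    b += eps * (v0 - v1)
    c += eps * (p_h0 - p_h1)
    return W, b, c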
2.3 Greedy layer-wise training of a DBN
Each time one RBM finishes training, another RBM is stacked on top of it, and this new RBM uses the output of the RBM below as its input. The posterior distribution over the hidden layer of the lower RBM serves as the posterior distribution over the visible layer of the next level of the DBN. The motivation for greedy learning is that a partial DBN represents the lowest layer better than a single RBM does.
TrainUnsupervisedDBN
where i indexes the layers.
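Reusing the rbm_update and sigmoid sketches above, the greedy loop of TrainUnsupervisedDBN might look roughly like this (the layer sizes, number of epochs, and mean-field propagation between layers are illustrative assumptions, not the paper's exact pseudocode):

def train_unsupervised_dbn(data, layer_sizes, n_epochs=10, eps=0.1, rng=np.random):
    """Greedily train a stack of RBMs. data: array of shape (N, n_visible).
    Layer i is trained on the representation produced by layers 0..i-1."""
    params = []
    reps = data                      # input to the current layer (raw data at first)
    n_in = data.shape[1]
    for n_hid in layer_sizes:        # i indexes the layers, bottom to top
        W = 0.01 * rng.standard_normal((n_hid, n_in))
        b, c = np.zeros(n_in), np.zeros(n_hid)
        for _ in range(n_epochs):
            for v in reps:
                W, b, c = rbm_update(v, W, b, c, eps, rng)
        params.append((W, b, c))
        # Mean-field hidden activations become the next layer's "visible" data
        reps = sigmoid(reps @ W.T + c)
        n_in = n_hid
    return params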
2.4 Fine-tuning
Fine-tuning can be done with the wake-sleep algorithm or with a mean-field approximation of the hidden unit activations.
TrainSupervisedDBN
Here C is the supervised training criterion, i.e. the squared error or the cross entropy.
DBNSupervisedFineTuning
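The following is only a simplified illustration of the mean-field style of fine-tuning: propagate deterministic sigmoid activations up through the pre-trained stack, attach an output layer, and minimize C by ordinary gradient descent. The output-layer parameters V, d and the softmax + cross-entropy choice for C are my assumptions, not the paper's exact algorithm:

def mean_field_forward(x, params, V, d):
    """Deterministic (mean-field) pass through the pre-trained RBM stack,
    followed by a hypothetical softmax output layer with weights V, biases d."""
    h = x
    for W, b, c in params:           # visible biases b are not used here
        h = sigmoid(h @ W.T + c)
    logits = h @ V.T + d
    e = np.exp(logits - logits.max())
    return e / e.sum()

def criterion_C(p, y):
    # Cross-entropy for a class label y; squared error would be the alternative
    return -np.log(p[y])

The gradients of C with respect to V, d and all the pre-trained (W, c) are then obtained by standard backpropagation, which fine-tunes the whole network.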

3. Extension to continuous-valued inputs
Normalize the input vector so that its values fall in the (0,1) interval, treat each value as the probability that the corresponding binary unit is 1, and then train with the usual RBM procedure. This works for gray-scale pixel intensities, but may not work for other kinds of inputs.
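A minimal sketch of this recipe, assuming simple min-max normalization (the scaling choice is mine, not specified in the note): rescale each input dimension into (0,1) and either feed the values to rbm_update directly as probabilities, or sample binary visibles from them.

def to_unit_interval(x, tiny=1e-8):
    """Min-max normalize continuous inputs into (0,1), so each value can be read
    as the probability that the corresponding binary visible unit equals 1."""
    lo, hi = x.min(axis=0), x.max(axis=0)
    return (x - lo) / (hi - lo + tiny)

# p = to_unit_interval(raw_data)[i]
# either: rbm_update(p, W, b, c)                      # use probabilities directly
# or:     v0 = (np.random.random(p.shape) < p).astype(float); rbm_update(v0, W, b, c)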
4. Understanding why the layer-wise strategy works
TrainGreedyAutoEncodingDeepNet
where n gives the number of units in each layer.
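For the auto-encoding variant (TrainGreedyAutoEncodingDeepNet), each layer is trained as a one-hidden-layer auto-associator that reconstructs its own input, and its code then feeds the next layer, just as in the RBM case. The tied weights (decoder = W.T) and squared reconstruction error below are my assumptions about the details:

def train_autoencoder_layer(data, n_hid, n_epochs=10, eps=0.01, rng=np.random):
    """Train one auto-associator with tied weights; returns (W, b, c)."""
    n_in = data.shape[1]
    W = 0.01 * rng.standard_normal((n_hid, n_in))
    b, c = np.zeros(n_in), np.zeros(n_hid)    # reconstruction bias, code bias
    for _ in range(n_epochs):
        for x in data:
            h = sigmoid(c + W @ x)            # encode
            r = sigmoid(b + W.T @ h)          # decode (reconstruct)
            dr = (r - x) * r * (1 - r)        # grad of 0.5*||r - x||^2 at the output pre-activation
            dh = (W @ dr) * h * (1 - h)       # backpropagated into the code
            W -= eps * (np.outer(h, dr) + np.outer(dh, x))   # tied-weight gradient
            b -= eps * dr
            c -= eps * dh
    return W, b, c

Stacking then works as before: the codes sigmoid(data @ W.T + c) become the training data for the next layer.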
TrainGreedySupervisedDeepNet
Experiment 2 shows that greedy unsupervised layer-wise pre-training gives much better results than the standard way to train a deep network (with no greedy pre-training) or a shallow network, and that, without pre-training, deep networks tend to perform worse than shallow networks.
Likewise, supervised greedy pre-training performs worse than unsupervised pre-training because it is too greedy: a possible explanation is that the hidden representation it learns discards some of the information about the target.

Experiment 3 restricts the top hidden layer to only 20 units. In Experiment 2 the training errors were all very small, so the benefit of pre-training for optimization was hard to see. The reason is that even without a good initialization, the bottom layer together with the top (output) layer still forms a standard shallow network; these two layers can preserve enough information about the input to fit the training set, but this does not help generalization. The experimental results support this hypothesis.

Continuous training of all layers of a DBN
We would like to train all the layers of a DBN continuously, rather than adding one layer at a time and deciding separately how long to train each one. To achieve this it is sufficient to insert a line in TrainUnsupervisedDBN, so that RBMupdate is called on all the layers and the stochastic hidden values are propagated all the way up. The advantage is that we can now have a single stopping criterion (for the whole network).
The paper does not go into detail about exactly how this is done.
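A plausible reading of that one-line change (my interpretation only, since the paper does not spell it out) is that every training example triggers an RBMupdate at every level, with the stochastically sampled hidden values propagated upward as the visible input of the next level:

def train_dbn_continuously(data_stream, params, eps=0.1, rng=np.random):
    """params: list of (W, b, c), one per level, already initialized.
    All levels are updated on every example instead of one level at a time."""
    for v in data_stream:                 # a single stopping criterion wraps this loop
        for i, (W, b, c) in enumerate(params):
            params[i] = rbm_update(v, W, b, c, eps, rng)
            # Sample hidden units and pass them up as the next level's input
            p_h = sigmoid(c + W @ v)
            v = (rng.random(p_h.shape) < p_h).astype(float)
    return params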
5. Dealing with uncooperative input distributions
When the input distribution carries little information about the target, e.g. x ~ p(x) with p a Gaussian and the target y = f(x) + noise where f is a sinus function, there is no particular relation between p and f, and unsupervised greedy pre-training does not help. In that case, a mixed training rule that combines unsupervised and supervised updates can be used when training each layer; see the sketch after TrainPartiallySupervisedLayer below.
TrainPartiallySupervisedLayer
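A hedged sketch of the idea behind TrainPartiallySupervisedLayer: each step combines the unsupervised CD-1 update with a supervised gradient step obtained through a temporary output layer that predicts y from the hidden units. The temporary parameters V, d, the squared-error criterion, and the simple addition of the two updates are my assumptions about how the two signals are mixed:

def partially_supervised_update(v0, y, W, b, c, V, d, eps=0.1, rng=np.random):
    """Mix unsupervised (CD-1) and supervised gradients for one layer.
    V, d: temporary linear output layer mapping hidden units to the target y."""
    # Unsupervised part: the usual CD-1 update on (W, b, c)
    W, b, c = rbm_update(v0, W, b, c, eps, rng)
    # Supervised part: squared error of a linear predictor on the mean-field hiddens
    h = sigmoid(c + W @ v0)
    err = (V @ h + d) - y                 # gradient of 0.5*||y_hat - y||^2 w.r.t. y_hat
    dh = (V.T @ err) * h * (1 - h)        # backpropagate into the hidden layer
    V -= eps * np.outer(err, h)
    d -= eps * err
    W -= eps * np.outer(dh, v0)
    c -= eps * dh
    return W, b, c, V, d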