CS229 Machine Learning
Gradient Descent
-
Batch gradient descent
At each parameter update step, look at all the training examples: the cost function is a sum over every example, so a single update requires a full pass over the data. A minimal sketch follows.
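A minimal NumPy sketch of batch gradient descent, assuming a linear model h(x) = theta^T x with a squared-error cost; the function name, learning rate lr, and iteration count are illustrative choices, not part of the notes.

import numpy as np

def batch_gradient_descent(X, y, lr=0.01, n_iters=100):
    # One parameter update per full pass: the gradient is a mean
    # over ALL m training examples.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / m  # uses every example
        theta -= lr * grad
    return theta

-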
Stochastic gradient descent
At each update step, look at only one training example, iterating over all the examples in turn.
pro: gets the parameters close to the minimum faster than batch gradient descent (each step avoids the sum over all examples)
con: may never converge to the minimum; the parameters keep oscillating around it. A minimal sketch follows.
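A minimal NumPy sketch of stochastic gradient descent under the same assumed linear model with squared-error cost; names and hyperparameters are illustrative.

import numpy as np

def stochastic_gradient_descent(X, y, lr=0.01, n_epochs=10):
    # One parameter update per single example; no sum over the dataset.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        for i in np.random.permutation(m):  # iterate over all examples
            grad = (X[i] @ theta - y[i]) * X[i]  # gradient from one example
            theta -= lr * grad
    return theta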
-
Mini-batch gradient descent
It seeks a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent: split the training examples into small batches and perform one update per batch. A minimal sketch follows.
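A minimal NumPy sketch of mini-batch gradient descent, again assuming a linear model with squared-error cost; the batch_size default of 32 matches the tip below, and the other names and values are illustrative.

import numpy as np

def minibatch_gradient_descent(X, y, lr=0.01, n_epochs=10, batch_size=32):
    # One parameter update per mini-batch of `batch_size` examples.
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        idx = np.random.permutation(m)  # shuffle, then split into batches
        for start in range(0, m, batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ theta - y[b]) / len(b)
            theta -= lr * grad
    return theta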
Mini-batch sizes, commonly called “batch sizes” for brevity, are often tuned to the computational architecture the implementation runs on, such as a power of two that fits the memory of the GPU or CPU hardware: 32, 64, 128, 256, and so on.
Tip: 32 may be a good default batch size (Practical recommendations for gradient-based training of deep architectures, 2012).
The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4.
Revisiting Small Batch Training for Deep Neural Networks, 2018