Gradient Descent

  • Batch gradient descent
    At each parameter update step, look at all the training examples (the cost function is a sum over all the examples)
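
    A minimal NumPy sketch of this full-batch update on a synthetic least-squares problem (the data, learning rate, and step count are illustrative assumptions):

      # Batch gradient descent sketch: every update uses the gradient
      # averaged over ALL training examples (synthetic data, for illustration only).
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
      true_w = np.array([1.0, -2.0, 0.5])
      y = X @ true_w + 0.1 * rng.normal(size=100)    # noisy linear targets

      w = np.zeros(3)
      lr = 0.1
      for step in range(200):
          grad = X.T @ (X @ w - y) / len(X)          # mean squared error gradient over all examples
          w -= lr * grad                             # one parameter update per full pass over the data
      print(w)                                       # close to true_w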

  • Stochastic gradient descent
    At each update step, look at only one training example (iterate through all the examples, one per update)

    pro: gets the parameters close to the minimum faster than batch gradient descent (avoids the sum over all examples at every step)

    con: may never converge to the minimum (the noisy updates keep oscillating around it)
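
    The same least-squares problem trained with stochastic gradient descent, one randomly chosen example per update (again a sketch with illustrative data and learning rate):

      # Stochastic gradient descent sketch: each update uses ONE example,
      # so there is no sum over the whole dataset per step.
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))                  # synthetic data, for illustration only
      true_w = np.array([1.0, -2.0, 0.5])
      y = X @ true_w + 0.1 * rng.normal(size=100)

      w = np.zeros(3)
      lr = 0.01
      for epoch in range(20):
          for i in rng.permutation(len(X)):          # iterate over all examples in random order
              grad = (X[i] @ w - y[i]) * X[i]        # gradient from a single example
              w -= lr * grad                         # noisy update: fast early progress, oscillates near the minimum
      print(w)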

  • Mini-batch gradient descent
    It seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

    Split the training examples into batches and perform one parameter update per batch (see the sketch at the end of this section).

    Mini-batch sizes, commonly called “batch sizes” for brevity, are often tuned to the computational architecture on which the implementation runs, such as a power of two that fits the memory of the GPU or CPU hardware: 32, 64, 128, 256, and so on.

    Tip: 32 is often a good default batch size.

    Practical recommendations for gradient-based training of deep architectures, 2012

    “The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4.”

    Revisiting Small Batch Training for Deep Neural Networks, 2018
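
    As referenced above, a minimal sketch of mini-batch gradient descent on the same kind of synthetic least-squares problem, one update per batch of 32 (the data, learning rate, and batch size are illustrative assumptions):

      # Mini-batch gradient descent sketch: shuffle, split into batches of 32,
      # and make one parameter update per batch.
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))                  # synthetic data, for illustration only
      true_w = np.array([1.0, -2.0, 0.5])
      y = X @ true_w + 0.1 * rng.normal(size=100)

      w = np.zeros(3)
      lr = 0.1
      batch_size = 32
      for epoch in range(50):
          order = rng.permutation(len(X))            # reshuffle each epoch
          for start in range(0, len(X), batch_size):
              idx = order[start:start + batch_size]  # one mini-batch (up to 32 examples)
              grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # average gradient over the batch only
              w -= lr * grad
      print(w)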