Gradient Descent

  • Batch gradient descent
    At each parameter update step, look at all the training examples (the cost function is a sum over all the examples)
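
    A minimal NumPy sketch of this full-batch update on a synthetic least-squares problem (the data, learning rate, and step count are illustrative assumptions):

      # Batch gradient descent sketch: every update uses the gradient
      # averaged over ALL training examples (synthetic data, for illustration only).
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))                  # 100 examples, 3 features
      true_w = np.array([1.0, -2.0, 0.5])
      y = X @ true_w + 0.1 * rng.normal(size=100)    # noisy linear targets

      w = np.zeros(3)
      lr = 0.1
      for step in range(200):
          grad = X.T @ (X @ w - y) / len(X)          # mean squared error gradient over all examples
          w -= lr * grad                             # one parameter update per full pass over the data
      print(w)                                       # close to true_w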

  • Stochastic gradient descent
    At each update step, look at only one training example (iterate through all the examples, one per update)

    pro: gets the parameters close to the minimum faster than batch gradient descent (avoids the sum over all examples at every step)

    con: may never converge to the minimum (the noisy updates keep oscillating around it)
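
    The same least-squares problem trained with stochastic gradient descent, one randomly chosen example per update (again a sketch with illustrative data and learning rate):

      # Stochastic gradient descent sketch: each update uses ONE example,
      # so there is no sum over the whole dataset per step.
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))                  # synthetic data, for illustration only
      true_w = np.array([1.0, -2.0, 0.5])
      y = X @ true_w + 0.1 * rng.normal(size=100)

      w = np.zeros(3)
      lr = 0.01
      for epoch in range(20):
          for i in rng.permutation(len(X)):          # iterate over all examples in random order
              grad = (X[i] @ w - y[i]) * X[i]        # gradient from a single example
              w -= lr * grad                         # noisy update: fast early progress, oscillates near the minimum
      print(w)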

  • Mini-batch gradient descent
    It seeks to find a balance between the robustness of stochastic gradient descent and the efficiency of batch gradient descent.

    Split the training examples into batches and perform one parameter update per batch (see the sketch at the end of this section).

    Mini-batch sizes, commonly called “batch sizes” for brevity, are often tuned to the computational architecture on which the implementation runs, such as a power of two that fits the memory of the GPU or CPU hardware: 32, 64, 128, 256, and so on.

    Tip: 32 is often a good default batch size.

    Practical recommendations for gradient-based training of deep architectures, 2012

    “The presented results confirm that using small batch sizes achieves the best training stability and generalization performance, for a given computational cost, across a wide range of experiments. In all cases the best results have been obtained with batch sizes m = 32 or smaller, often as small as m = 2 or m = 4.”

    Revisiting Small Batch Training for Deep Neural Networks, 2018
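
    As referenced above, a minimal sketch of mini-batch gradient descent on the same kind of synthetic least-squares problem, one update per batch of 32 (the data, learning rate, and batch size are illustrative assumptions):

      # Mini-batch gradient descent sketch: shuffle, split into batches of 32,
      # and make one parameter update per batch.
      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(100, 3))                  # synthetic data, for illustration only
      true_w = np.array([1.0, -2.0, 0.5])
      y = X @ true_w + 0.1 * rng.normal(size=100)

      w = np.zeros(3)
      lr = 0.1
      batch_size = 32
      for epoch in range(50):
          order = rng.permutation(len(X))            # reshuffle each epoch
          for start in range(0, len(X), batch_size):
              idx = order[start:start + batch_size]  # one mini-batch (up to 32 examples)
              grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # average gradient over the batch only
              w -= lr * grad
      print(w)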