Alternatives to gradient descent


Improvements

First, let's review improvements to Gradient Descent. These are still gradient descent by nature; they only change how the update step is computed (a minimal sketch of two of these update rules follows the list).

  1. Momentum gradient descent (MGD)
  2. Nesterov accelerated gradient descent (NAG)
  3. Adaptive gradient (Adagrad)
  4. Root-mean-square gradient propagation (RMSprop)
  5. Adaptive moment estimation (Adam)

Source: the Backpropagation Neural Tree paper (an alternative ANN architecture).
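
To make these concrete, here is a minimal sketch of the momentum and Adam update rules (plain NumPy; the function names and hyperparameter values are typical defaults of mine, not taken from the paper above):

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta=0.9):
    """Momentum GD: accumulate an exponentially decaying sum of past gradients."""
    velocity = beta * velocity + grad
    return w - lr * velocity, velocity

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: per-parameter step sizes from first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad ** 2   # second moment (uncentred variance)
    m_hat = m / (1 - beta1 ** t)              # bias correction for the warm-up phase
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# usage: minimise f(w) = ||w||^2, whose gradient is 2w
w = np.array([3.0, -2.0])
velocity = np.zeros_like(w)
for _ in range(500):
    w, velocity = momentum_step(w, 2 * w, velocity)
print(w)  # converges towards [0, 0]; adam_step is used the same way, with (m, v, t) as state
```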

Alternatives

When it comes to Global Optimisation tasks (i.e. attempting to find the global minimum of an objective function), you might want to take a look at:

  1. Pattern Search (also known as direct search, derivative-free search, or black-box search), which uses a pattern (a set of vectors $\{v_i\}$) to determine the points to search at the next iteration.
  2. Genetic Algorithm, which uses the concepts of mutation, crossover and selection to define the population of points to be evaluated at the next iteration of the optimisation.
  3. Particle Swarm Optimisation that defines a set of particles that "walk" through the space searching for the minimum.
  4. Surrogate Optimisation that uses a surrogate model to approximate the objective function. This method can be used when the objective function is expensive to evaluate.
  5. Multi-objective Optimisation (also known as Pareto optimisation), which can be used for problems that cannot be expressed with a single objective function (but rather with a vector of objectives).
  6. Simulated Annealing, which uses the concept of annealing (or temperature) to trade off exploration and exploitation. It proposes new points for evaluation at each iteration, but as the number of iterations increases, the "temperature" drops and the algorithm becomes less and less likely to explore the space, thus "converging" towards its current best candidate (a minimal sketch follows this list).
  7. Backpropagation Neural Tree (BNeuralT)
  8. Target Propagation
  9. Alternating Direction Method of Multipliers (ADMM) (you need 7000 CPU cores on a supercomputer, while I can have 1280 GPU cores on my old laptop with a GeForce GTX 1060)
  10. Zeroth-Order Relaxed Backpropagation (ZORB)
  11. Finito https://arxiv.org/abs/1407.2710
  12. Stochastic Dual Coordinate Ascent (SDCA) https://arxiv.org/abs/1209.1873
  13. Stochastic Optimization with Variance Reduction https://hal.inria.fr/hal-01375816v1/document
  14. Spike timing-dependent plasticity (STDP) for Spiking Neural Networks
  15. Feedback alignment (FA)
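
As promised in item 6, here is a minimal simulated annealing sketch (the 1-D objective, cooling schedule and Gaussian proposal are illustrative choices of mine):

```python
import math
import random

def simulated_annealing(objective, x0, n_iter=5000, step=0.5, t0=1.0):
    """Minimise `objective` without gradients: worse points are accepted with a
    probability that shrinks as the temperature drops."""
    current = best = x0
    for i in range(1, n_iter + 1):
        temperature = t0 / i                        # simple cooling schedule
        candidate = current + random.gauss(0.0, step)
        delta = objective(candidate) - objective(current)
        # always accept improvements; accept worse points with probability e^(-delta/T)
        if delta < 0 or random.random() < math.exp(-delta / temperature):
            current = candidate
        if objective(current) < objective(best):
            best = current
    return best

# a black-box objective with several local minima
f = lambda x: x * x + 10 * math.sin(3 * x)
x = simulated_annealing(f, x0=4.0)
print(x, f(x))
```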

As mentioned above, Simulated Annealing, Particle Swarm Optimisation and Genetic Algorithms are good global optimisation algorithms that navigate huge search spaces well. Unlike Gradient Descent, they do not need any information about the gradient, so they can be used with black-box objective functions and with problems that require running simulations.
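
For instance, a bare-bones particle swarm optimiser only ever asks the objective for function values, never for gradients (the coefficients below are common textbook values, not taken from the sources cited here):

```python
import numpy as np

def pso(objective, dim, n_particles=30, n_iter=200, w=0.7, c1=1.5, c2=1.5):
    """Each particle tracks its own best position and is pulled both towards it
    and towards the swarm's global best; no gradients are ever computed."""
    rng = np.random.default_rng(0)
    pos = rng.uniform(-5, 5, (n_particles, dim))
    vel = np.zeros((n_particles, dim))
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(n_iter):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest

# treat the Rosenbrock function as a black box: only evaluations, no gradient
rosenbrock = lambda p: (1 - p[0]) ** 2 + 100 * (p[1] - p[0] ** 2) ** 2
print(pso(rosenbrock, dim=2))  # should approach [1, 1]
```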

Source: 1 2

Extreme Learning
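
This heading presumably refers to Extreme Learning Machines (ELMs), which fix the hidden-layer weights at random and solve only the output weights in closed form (e.g. by least squares), so no gradient descent is involved. A minimal single-hidden-layer sketch, assuming a toy regression task of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy regression data: y = sin(x) plus noise
X = rng.uniform(-3, 3, (200, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(200)

# hidden layer: random, fixed, never trained
n_hidden = 100
W = rng.standard_normal((1, n_hidden))
b = rng.standard_normal(n_hidden)
H = np.tanh(X @ W + b)

# only the output weights are learned, in one closed-form least-squares solve
beta, *_ = np.linalg.lstsq(H, y, rcond=None)

# prediction for x = 1.0; should be close to sin(1.0) ≈ 0.84
H_new = np.tanh(np.array([[1.0]]) @ W + b)
print(H_new @ beta)
```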

Towards a more biologically plausible learning algorithm

Main article: Biologically-Plausible Learning Algorithms Can Scale to Large Datasets (link)

Nash’s equilibrium

Why is gradient descent the most popular, practically the only technique in use? Because it’s stable: no matter what is in your training data, and in whatever order it is presented, you get a predictable outcome. Things like Adaptive Resonance Theory aren’t stable in that regard: they learn fast, but the order of the data matters a lot. But what if we applied Nash’s equilibrium here? Every system and component in the brain fights for its own truth and shifts the overall system; they play against each other, but we could determine the point at which they must stop overwriting each other. Would that fix the problem with data order?
