# The idea of the gradient descent algorithm

In many optimisation problems one needs to find the absolute minimum of a continuous scalar function $f(x_1, \dots, x_n)$, which can be given by a formula, a rule, or a procedure. Sometimes $f$ can be defined piecewise.
Of course, if the number $n$ is small and the function is differentiable and given by a simple expression, then we can do the standard analytical study of the extrema:
• to find all the critical points, i.e. the solutions of the system $\dfrac{\partial f}{\partial x_i} = 0$, $i = 1, \dots, n$;
• to classify the candidates for the extrema (e.g. by the second-derivative test);
• to choose the absolute minimum among the extrema.
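For concreteness, here is how the three analytical steps play out on a simple stand-in function; this $f$ is a hypothetical example chosen for illustration, not the function discussed later in the text:

```python
# Analytical study of f(x, y) = x**2 + y**2 - 2*x (a hypothetical example).
# Step 1: critical points solve df/dx = 2x - 2 = 0 and df/dy = 2y = 0,
#         giving the single critical point (1, 0).
# Step 2: the Hessian [[2, 0], [0, 2]] is positive definite, so (1, 0)
#         is a local minimum.
# Step 3: it is the only critical point, hence the absolute minimum.

def f(x, y):
    return x**2 + y**2 - 2 * x

critical_point = (1.0, 0.0)
print(f(*critical_point))  # value at the absolute minimum: -1.0
```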
Now imagine that the number $n$ of the variables is large and, moreover, the function $f$ itself is defined by a complicated procedure. Then it is more natural to apply a direct gradient descent method:
• choose an initial point $x^{(0)}$ and calculate a numerical approximation of the vector $\operatorname{grad} f(x^{(0)})$;
• make the descent iterations
$$x^{(k+1)} = x^{(k)} - \gamma_k \operatorname{grad} f(x^{(k)}),$$
where $\gamma_k > 0$ is a step-size parameter, which can vary from step to step in such a way that at each step the new value of $f$ is smaller than at the previous one;
• stop the iterations when the norm of $\operatorname{grad} f(x^{(k)})$ is sufficiently small; this means that the last obtained point is sufficiently close to a critical point of $f$ (which can be a local minimum or a saddle point). Calculate the final value of $f$;
• repeat the procedure for different initial points to improve the chance of reaching the absolute minimum of $f$.
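The four steps above can be sketched in plain Python. This is a minimal sketch, not a prescribed implementation: the central-difference gradient, the step-halving rule for choosing $\gamma_k$, and the random restart box are all illustrative choices.

```python
import math
import random

def grad_numeric(f, x, h=1e-6):
    """Central-difference approximation of grad f at the point x."""
    g = []
    for i in range(len(x)):
        xp, xm = list(x), list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp) - f(xm)) / (2 * h))
    return g

def descend(f, x0, gamma=0.1, tol=1e-8, max_iter=10_000):
    """Gradient descent; halves the step until f actually decreases."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad_numeric(f, x)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break  # close to a critical point of f
        step = gamma
        while True:
            x_new = [xi - step * gi for xi, gi in zip(x, g)]
            if f(x_new) < f(x):
                break
            step /= 2  # shrink the step until the new value of f is smaller
            if step < 1e-15:
                return x  # no further decrease possible
        x = x_new
    return x

def multistart(f, n, restarts=20, box=3.0, seed=0):
    """Repeat the descent from random initial points, keep the best result."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        x0 = [rng.uniform(-box, box) for _ in range(n)]
        x = descend(f, x0)
        if best is None or f(x) < f(best):
            best = x
    return best

# Usage on a hypothetical smooth function with its minimum at (1, -2):
f = lambda p: (p[0] - 1) ** 2 + (p[1] + 2) ** 2
best = multistart(f, n=2)
print(best)  # close to [1.0, -2.0]
```

For a function with several local minima, the restarts matter: a single descent only converges to the critical point "downhill" from its initial point.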
We illustrate this procedure on an example of a function $f(x, y)$ of just two variables. Suppose we can visualise its graph in 3D, a family of contour lines on the graph and the level curves in the plane $z = 0$, as well as the local extrema and saddle points (if any), as shown below (the extrema are marked with white points and the saddle points with white crosses):
So, our function has 4 local minima (one of them is the absolute minimum), one local maximum, and 4 saddle points.
Choosing an initial point $(x_0, y_0)$, we perform a series of descents: at each step we move in the direction of the fastest descent, which is opposite to the gradient vector at the current point. Notice that at each point the gradient vector is orthogonal to the level curve passing through that point; see the field of directions of the descents:
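The orthogonality of the gradient to the level curves can be checked numerically. The function below and the chosen point are hypothetical, picked only because its level curves are ellipses with an easy parametrisation:

```python
import math

def f(x, y):
    return x**2 + 2 * y**2  # hypothetical example; level curves are ellipses

def grad(x, y, h=1e-6):
    """Central-difference approximation of grad f."""
    return ((f(x + h, y) - f(x - h, y)) / (2 * h),
            (f(x, y + h) - f(x, y - h)) / (2 * h))

# The level curve f = 3 is the ellipse x = sqrt(3) cos t, y = sqrt(1.5) sin t.
# The point (1, 1) lies on it; find its parameter t and the tangent vector there.
t = math.atan2(1 / math.sqrt(1.5), 1 / math.sqrt(3))
tangent = (-math.sqrt(3) * math.sin(t), math.sqrt(1.5) * math.cos(t))

gx, gy = grad(1.0, 1.0)
dot = gx * tangent[0] + gy * tangent[1]
print(abs(dot) < 1e-6)  # True: the gradient is orthogonal to the level curve
```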