The idea of the gradient descent algorithm

In many optimisation problems one needs to find an absolute minimum of a continuous scalar function $f(x_1, \dots, x_n)$ of $n$ variables, which can be given by a formula, a rule, or a procedure. Sometimes $f$ can be defined piecewise.
Of course, if the number $n$ is small and the function is differentiable and given by a simple expression, then we can carry out the standard analytical study of the extrema: solve the system $\partial f / \partial x_i = 0$ for $i = 1, \dots, n$ and classify the resulting critical points, for instance by the second-derivative test.
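As a toy illustration of this analytical study (our own example, not the function pictured below), take $f(x, y) = x^2 + y^2 - 2x$:

```latex
\frac{\partial f}{\partial x} = 2x - 2 = 0, \qquad
\frac{\partial f}{\partial y} = 2y = 0
\quad\Longrightarrow\quad (x, y) = (1, 0),
```

and since the Hessian $\begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}$ is positive definite, the single critical point $(1, 0)$ is a minimum.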
Now imagine that the number $n$ of variables is large and, moreover, the function $f$ itself is defined by a complicated procedure. Then it is more natural to apply a direct gradient descent method: starting from some point $\mathbf{x}^{(0)}$, iterate
$$\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} - \lambda_k \, \nabla f\bigl(\mathbf{x}^{(k)}\bigr), \qquad k = 0, 1, 2, \dots,$$
where $\lambda_k > 0$ is a step size parameter, which can vary from step to step in such a way that at each step the new value of $f$ is smaller than at the previous one: $f\bigl(\mathbf{x}^{(k+1)}\bigr) < f\bigl(\mathbf{x}^{(k)}\bigr)$.
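The descent step with a step size that shrinks whenever $f$ would not decrease can be sketched in Python as follows (the quadratic test function, its gradient, and all parameter values here are our own illustrative choices, not taken from the text):

```python
import numpy as np

def gradient_descent(f, grad_f, x0, lam=0.1, tol=1e-8, max_iter=10_000):
    """Minimise f starting from x0, halving the step whenever f would not decrease."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:  # gradient has (almost) vanished: stop
            break
        step = lam
        # shrink the step until the new value of f is smaller than the old one
        while f(x - step * g) >= fx and step > 1e-15:
            step /= 2
        x = x - step * g
        fx = f(x)
    return x

# A toy quadratic of our own choosing, with its minimum at (1, -3)
f = lambda p: (p[0] - 1)**2 + 2 * (p[1] + 3)**2
grad_f = lambda p: np.array([2 * (p[0] - 1), 4 * (p[1] + 3)])

x_min = gradient_descent(f, grad_f, x0=[5.0, 5.0])
print(x_min)  # approximately [1, -3]
```

Note that this backtracking on the step size only guarantees convergence to a *local* minimum (or a stationary point); which one is found depends on the initial point.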
We illustrate this procedure with an example of a function $f(x, y)$ of just two variables. Suppose we can visualise its graph in 3D, a family of contour lines on the graph, and the level curves in the $(x, y)$ plane, as well as the local extrema and saddle points (if any), as shown below (the extrema are marked with white points and the saddle points with white crosses):
So, our function has 4 local minima (one of them is the absolute minimum), one local maximum, and 4 saddle points.
Choosing an initial point $\mathbf{x}^{(0)}$, we perform a series of descents: at each step we move in the direction of fastest descent, which is opposite to the gradient vector at the corresponding point. Notice that at each point the gradient vector is orthogonal to the level curve passing through that point; see the field of directions of the descents:
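The orthogonality of the gradient to the level curves can be checked numerically. Here is a small sketch for $f(x, y) = x^2 + 2y^2$ (an illustrative function of our own, not the one pictured above), whose level curve $f = c$ is an ellipse parametrised as $\bigl(\sqrt{c}\cos t, \sqrt{c/2}\sin t\bigr)$:

```python
import numpy as np

def grad_f(x, y):
    # gradient of f(x, y) = x^2 + 2*y^2
    return np.array([2 * x, 4 * y])

c = 3.0
for t in np.linspace(0.0, 2 * np.pi, 12, endpoint=False):
    # a point on the level curve f = c and the tangent vector there
    point = np.array([np.sqrt(c) * np.cos(t), np.sqrt(c / 2) * np.sin(t)])
    tangent = np.array([-np.sqrt(c) * np.sin(t), np.sqrt(c / 2) * np.cos(t)])
    # the gradient is orthogonal to the tangent of the level curve
    assert abs(grad_f(*point) @ tangent) < 1e-9

print("gradient orthogonal to the level-curve tangent at every point tested")
```

This is exactly why the descent directions in the figure cross the level curves at right angles.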