The idea of the gradient descent algorithm

In many optimisation problems one needs to find an absolute minimum of a continuous scalar function f : ℝⁿ → ℝ, which can be given by a formula, a rule, or a procedure. Sometimes f can be defined piecewise.
Of course, if the number n is small (say n = 1, 2 or 3) and the function is differentiable and given by a simple expression, then we can do the standard analytical study of the extrema: solve the equation ∇f = 0 and classify the resulting critical points.
Now imagine that the number n of variables is large and, moreover, the function f itself is defined by a complicated procedure. Then it is more natural to apply a direct gradient descent method: starting from an initial point x₀, iterate
x_{k+1} = x_k − ε·∇f(x_k),
where ε is a step size parameter, which can vary from step to step in such a way that at each step the new value of f is smaller than at the previous one.
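This iteration can be sketched in a few lines. Here is a minimal Python version (the quadratic f used in the check is a placeholder example, not the function shown in the figures below):

```python
import numpy as np

def gradient_descent(grad_f, x0, eps=0.1, num_steps=1000):
    """Minimal gradient descent: x_{k+1} = x_k - eps * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eps * grad_f(x)
    return x

# Placeholder example: f(x, y) = (x - 1)^2 + 2*(y + 3)^2, minimum at (1, -3)
grad_f = lambda v: np.array([2*(v[0] - 1), 4*(v[1] + 3)])
print(gradient_descent(grad_f, [5.0, 5.0]))  # approaches [1, -3]
```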
We illustrate this procedure on an example of a function f(x, y) of just two variables. Suppose we can visualise its graph in 3D, a family of contour lines on the graph, and the level curves on the (x, y) plane, together with the local extrema and saddle points (if any), as shown below (the extrema are marked with white points and the saddle points with white crosses):
So, our function has 4 local minima (one of them is the absolute minimum), one local maximum, and 4 saddle points.
Choosing an initial point, we do a series of descents: at each step we move in the direction of fastest descent, which is opposite to the gradient vector ∇f at the corresponding point. Notice that at each point the gradient vector is orthogonal to the level curve passing through that point; see the field of directions of the descents:
grad_field.png
Below several descent paths for different initial points (red dots) are shown:
As we see, almost any descent path necessarily ends in a neighbourhood of a local minimum (where ∇f is almost zero). Some rare paths may end near a saddle point, where the gradient is also close to zero (see the red path), but the probability of this is very small. Moreover, when the number n of variables is large, this probability is practically zero, so we do not take this case into account.
So, running a sufficiently large number of descent paths with different initial points, we can be almost sure to reach the absolute minimum of f, provided that we have chosen a suitable step size ε: if it is too big we may constantly jump over the extrema and never actually descend (see the red path in the Figure below); if it is too small we risk never approaching a minimum sufficiently closely.
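A simple safeguard against a too-big step is to shrink a trial step size until the function value actually decreases. A Python sketch of the idea (the shrinking factor 2.5 matches the reduction used in the MATLAB codes later; f and the starting data are toy values):

```python
def safe_step(f, x, g, eps0=0.001, shrink=2.5):
    """Reduce the trial step size eps0 until f decreases along -g.
    Assumes g is a genuine descent direction at x."""
    eps = eps0
    while f(x - eps * g) > f(x):
        eps = eps / shrink  # step was too big: shrink and retry
    return x - eps * g, eps

# Toy check on f(x) = x^2 with a deliberately huge initial step
f = lambda x: x * x
x_new, eps = safe_step(f, 1.0, 2.0, eps0=10.0)
print(x_new, eps)
```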

Example 1: Simple linear regression and the gradient descent in ℝ²

Suppose we have a set of experimental data describing a relation between a property X and a property Y (for example, the height of a father and the height of his child at a certain age). We represent the data as a set of points P_i = (x_i, y_i) on the plane and we want to approximate this relation by a linear function y = a·x + b, where a is the slope (of the graph) and b is the bias. This approximation problem is known as simple linear regression.
To obtain the best possible approximation we choose the pair (a, b) which minimises an error function, which can be either the mean of the squared vertical deviations (a·x_i + b − y_i)² or the mean of the squared distances d_i², where
d_i = |a·x_i − y_i + b| / √(a² + 1)
is the distance between the point P_i and the line; as we know from Geometry, d_i is given by the formula above. In this exercise we choose the second option,
S(a, b) = (1/N) Σ_{i=1}^{N} d_i²,
and, for a given set P, our task will be to find the absolute minimum of S(a, b) by doing a gradient descent.
Consider a concrete case of 36 points on the plane, their coordinates being the rows of the matrix
P = [23.05376671, 2.083554005; 23.774115313,2.14486932; 25.031876523,1.7269246;...
25.956640, 2.83704978; 27.357839,4.2577748; 27.865011,4.813969; ...
29.07254042, 4.0247780; 30.0714742,4.41801357; 30.9875855,5.54587904; ...
32.1409034, 5.96687696; 33.06714971,5.36700523; 34.0717238,6.95209411;...
35.04888937,7.16387720; 36.072688,7.07862363; 37.0293871,7.335086878;...
38.08883956,7.641171957; 38.893112954,8.226200522; 39.7055715,9.57535211;...
41.0325190, 9.14802867; 42.1370298,9.215393; 42.9897757553914,10.2534211833571;...
44.0319206739166, 10.925143438655; 44.9135120082676,11.2379794815215;...
45.9835120980791, 11.9510829150115; 47.1093265669039,12.5937093190458;...
47.9136347178011, 12.6309436364522; 48.8785882956385,12.6045997034053;...
49.9993150671897, 14.1130521233139; 50.9230334086246,14.098551525104;...
51.9774415597729, 14.8469424555258; 52.8910935704948,14.863022985666;...
54.0552527021112, 15.7402440871523; 55.1544211895504,15.7843724532702;...
55.8508409689362, 15.9030792650961; 56.893841826668,17.5901828896008;...
57.9384398118533, 17.3992307134816];
First observe that we can calculate the error function S analytically using the code:
syms a b % a, b will be symbolic variables in this file
% Below we calculate the error function Ferr(a,b) symbolically
N=size(P,1);
Ferr=0;
for i=1:N
Ferr=Ferr+(a*P(i,1)-P(i,2)+b)^2/(a^2+1);
end
Ferr = Ferr/N;
Ferr = simplify(Ferr) % no semicolon: display the simplified result
The output is the error function Ferr(a, b) in explicit symbolic form.
It has two critical points (candidates for extrema). An additional study shows that the first point is a minimum, while the second one is a saddle point; we conclude that the first one, (a, b) = (0.4475439589, -8.73456329), is the absolute minimum of Ferr.
minima.png
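The "additional study" amounts to solving ∇Ferr = 0 and checking the Hessian at each critical point. A sketch of that procedure in Python/SymPy, run on a small toy function rather than the full Ferr (the analogous MATLAB tools are in the Symbolic Math Toolbox):

```python
import sympy as sp

a, b = sp.symbols('a b', real=True)
# Toy error function with two minima and one saddle (NOT the real Ferr):
F = a**4 - 2*a**2 + b**2

crit = sp.solve([sp.diff(F, a), sp.diff(F, b)], [a, b], dict=True)
H = sp.hessian(F, (a, b))
for pt in crit:
    ev = H.subs(pt).eigenvals()  # eigenvalues of the Hessian at pt
    kind = 'minimum' if all(e > 0 for e in ev) else 'saddle or maximum'
    print(pt, kind)
```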

Compute the result using the gradient descent method:

Now we try to recover this result and make some plots by applying a gradient descent.
Using the function costlin.m
function S=costlin(P,a,b) % For a given set of points P_1,...,P_N and a pair (a,b)
% we calculate the error function
N=size(P,1);
S=0;
for i=1:N
S=S+(a*P(i,1)-P(i,2)+b)^2/(a^2+1)/N;
end
end
and the main code LinearRegr.m (see below),
which, after 35000 gradient steps (!), produces the final parameters (a, b) of the line approximating the data set, namely (a = 0.4496516538, b = -8.8253832022), and the final value of the error function S = 0.182979175523. Compare this with the exact coordinates of the absolute minimum of Ferr and its value min(Ferr) computed before.
The code also produces the following plots (with some comments added):
allFigures00.png
Now here is the code LinearRegr.m (note that it uses the matrix P and the function costlin.m!)
% LinearRegr.m
a0 = 1.0; b0 = -10.5; % initial point (a,b) and h
h = 0.002;
% initialize the parameters for iterations
a = a0; b = b0;
numMaxSteps = 35000;
% create list of values of the cost function, etc.
CostFun = zeros(numMaxSteps,1); NormGrad = zeros(numMaxSteps,1);
Bval = zeros(numMaxSteps,1); Aval = zeros(numMaxSteps,1); Eps = zeros(numMaxSteps,1);
% start iterations
for k = 1:numMaxSteps % k = present number of descent step
eps = 0.001; % the initial step size which we try to apply at each step
S = costlin(P,a,b); % calculate the error function (defined at the end of this script)
CostFun(k) = S;
gradS(1) = (costlin(P,a+h,b)-costlin(P,a-h,b))/(2*h); % calculate the approximation of partial derivatives
gradS(2) = (costlin(P,a,b+h)-costlin(P,a,b-h))/(2*h);
NormGrad(k) = norm(gradS);
Bval(k) = b;
Aval(k) = a;
while costlin(P,a-eps*gradS(1),b-eps*gradS(2))>S % we check if the next
% error function S is bigger than in the previous step. This happens when
% 'eps' is too big.
eps=eps/2.5; % if this happens, we reduce the step size 'eps' at the current step
end
Eps(k) = eps;
a=a-eps*gradS(1); % calculate values of (a,b) for the next step
b=b-eps*gradS(2);
end
 
% Plot the results:
figure()
plot(P(:,1),P(:,2),'ro',Color='red') % we plot the points P_1,...,P_N on the plane (x,y)
hold on
t = 17:5:62;
plot(t,a0*t+b0,Color='red',LineWidth=1.8) % we plot the line with the initial parameters (a,b)
for k = 2:5
plot(t,Aval(k)*t+Bval(k),'--',LineWidth=0.5) % we plot a few more lines
end
plot(t,a*t+b,Color='blue',LineWidth=1.8) % we plot the approximating line with the final values (a,b)
title('line approximation')
figure()
numS = 60;
plot(CostFun(1:numS),'.-')
axis([0,numS,0,250])
title('error function (first 60 steps)')
figure()
axis equal
plot(Aval(1),Bval(1),'b*')
axis([0,1.2,-10.7,-8.6])
hold on
plot(Aval,Bval,'b-')
plot(0.4475439589,-8.73456329,'ro',Color='red') % plot the absolute minimum computed above
hold off
legend('ini point','the path on (a,b)','target point = abs. min',...
'Location','northoutside')
figure()
ha = 0.01; hb = 0.1;
aa = a-0.3:ha:a+0.6; bb = b-2:hb:b+2;
[A,B]=meshgrid(aa,bb);
N=size(P,1);
[m,n] = size(A);
Z = zeros(m,n); % preallocate the grid of values of the error function
for i =1:m
for j=1:n
S=0;
for k=1:N
S=S+(A(i,j)*P(k,1)-P(k,2)+B(i,j))^2/(A(i,j)^2+1)/N;
end
Z(i,j) = S;
end
end
mesh(A,B,Z)
view(8,23) %point of view
hold on
plot3(Aval(1:10), Bval(1:10), CostFun(1:10),'ro',MarkerSize=5,MarkerFaceColor='r')
plot3(Aval(1:10), Bval(1:10), CostFun(1:10),'y--')
plot3(Aval(11:500:end), Bval(11:500:end), CostFun(11:500:end),'ro',...
MarkerSize=5,MarkerFaceColor='r')
title('a-b iteration steps in 3D')
 
An observation: As we see from the graph of b against the step number, the closer we are to the target value of b, the more slowly we approach it:
stepNumber.png
This is because in the neighbourhood of the absolute minimum, the b-component of the gradient vector is very small, so it requires many steps to reach the minimum.
A math hack: In the above algorithm, instead of choosing the initial values of the parameters randomly, do the following:
1) From the list choose any 2 points, say P_i and P_j, which are not too close to each other;
2) Calculate the equation y = a₀·x + b₀ of the unique line which passes through P_i and P_j;
3) Take the coefficients (a₀, b₀) as the initial values of (a, b) and repeat the algorithm of the gradient descent to get a better approximating line.
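A sketch of this initialisation in Python (the helper name is ours; the two sample points are rounded from the first and last rows of the matrix P above):

```python
def line_through(Pi, Pj):
    """Slope/intercept (a0, b0) of the line through (xi, yi) and (xj, yj).
    Assumes xi != xj (the points must not be too close, nor vertically aligned)."""
    (xi, yi), (xj, yj) = Pi, Pj
    a0 = (yj - yi) / (xj - xi)
    b0 = yi - a0 * xi
    return a0, b0

# Rounded first and last data points of the set P
a0, b0 = line_through((23.05, 2.08), (57.94, 17.40))
print(a0, b0)  # slope ~ 0.44, already of the right order for the descent
```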

Example 2: Recognising the letter 'O' by the Gradient Descent

Suppose we are given a set of points P₁, …, P_N which, approximately, form an oval. We want to get the best possible approximation of this set by a continuous curve, an ellipse E. How do we then choose the best approximating ellipse?
Ofitting1.png
For a given ellipse E, it is natural to introduce the cost function (or error function) S(E) = (1/N) Σ_{i=1}^{N} dist(P_i, E)², where dist(P_i, E) is the shortest distance between P_i and E, and then find an ellipse which minimises S. That is, S is the average squared distance between the ellipse E and all the points P_i.
ellipse0.png
On the other hand, any ellipse can be uniquely described by 5 parameters: the centre (x_c, y_c), the semi-axes a ≥ b > 0, and the rotation angle α of the a-axis.
(Naturally, if a = b, the ellipse is just a circumference and the angle α is not determined!) In the general case we assume that α ∈ [0, π), since α and α + π define the same ellipse.
The points of a given ellipse E admit a trigonometrical parameterisation with a parameter t ∈ [0, 2π):
x(t) = x_c + a·cos(α)·cos(t) − b·sin(α)·sin(t),
y(t) = y_c + a·sin(α)·cos(t) + b·cos(α)·sin(t).
So, in view of the above parameterisation of E, for each point P_i = (x_i, y_i) of the set, we have
dist(P_i, E)² = min_t [ (x(t) − x_i)² + (y(t) − y_i)² ].
An explicit formula for the above minimum is quite complicated; instead of using it, we will calculate a numerical approximation of dist(P_i, E)² using the MATLAB function cost.m, see below.
We conclude that the cost function S depends on the 5 parameters of E, and we want to find (approximately!) the ellipse E for which S reaches the global minimum.
Note that, in contrast to the previous example, where we knew an analytic expression of the error function Ferr(a, b), now we do not know such a formula for S and, therefore, cannot know the exact position of the extrema of S!
We can only apply the following gradient descent algorithm:
1) Download the set of points as a matrix P and plot it on the plane;
2) Choose a set p = (x_c, y_c, a, b, α) of the 5 parameters of an initial ellipse E and plot it in the same frame. We can choose the parameters randomly or using other approaches (see below);
3) For each point P_i calculate dist(P_i, E)², and then the value of the cost function S;
4) For a given small h, calculate the numerical approximation of the 5-dimensional centred gradient vector ∇S (see Practica PCII-1);
5) Replace the parameters p by p − ε·∇S, where, as in the previous example, ε is a step size parameter (which can vary from step to step). Plot the new ellipse E in the same frame.
6) Repeat the steps 3), 4), 5) M times, where M is big enough, or repeat them until the norm of the new gradient is less than a small number. Then stop the procedure and assume the latest obtained set p to be a numerical approximation of a local minimum of the function S.
7) Notice that for some sets of points the function S may have several local minima. To try to reach the global minimum of the cost function S (that is, to obtain the best approximation of the set by an ellipse), take another initial ellipse (a new set p) and apply the above algorithm from step 2). Do this several (m) times.
8) Finally compare the values of S at all the local minima obtained (if there are more than one) and choose the global one. Note that m can be quite big, but still there is no guarantee that this algorithm will necessarily detect the global minimum!
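Step 4), the centred-difference gradient, is the same recipe in any number of parameters. A short Python sketch (here cost stands for any scalar function of a parameter vector, such as the ellipse cost; the quadratic used in the check is a toy example):

```python
import numpy as np

def centred_gradient(cost, p, h=0.005):
    """Centred finite-difference approximation of grad cost at the point p."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h  # perturb only the i-th parameter
        g[i] = (cost(p + e) - cost(p - e)) / (2*h)
    return g

# Toy check on a quadratic with known gradient (2*p1, 4*p2):
g = centred_gradient(lambda p: p[0]**2 + 2*p[1]**2, [3.0, -1.0])
print(g)  # close to [6, -4]
```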

The MATLAB code:

First we will need the function cost.m, with the input variables: the matrix P of the points and the 5 parameters xc, yc, a, b, alpha of the ellipse; and the output variables: the value S of the cost function and the vectors Xt, Yt of coordinates of points on the ellipse.
function [S,Xt,Yt]=cost(P,xc,yc,a,b,alpha)
t = 0: 2*pi/400 :2*pi;
n = size(t,2);
Xt = xc + cos(alpha)*a*cos(t) - sin(alpha)*b*sin(t);
Yt = yc + sin(alpha)*a*cos(t) + cos(alpha)*b*sin(t);
S = 0;
dist = zeros(1,size(P,1)); % preallocate the list of squared distances
for i=1:size(P,1)
DPi=zeros(1,n);
for j=1:n
DPi(j)=(Xt(j)-P(i,1))^2+(Yt(j)-P(i,2))^2; % the squared distance between the point (Xt(j),Yt(j)) of the ellipse and the point P_i of the set
end
dist(i)=min(DPi); % dist(i) = the squared distance between the ellipse and the point P_i
S = S+dist(i)/size(P,1); % accumulate the cost function for the given ellipse
end
end
 
Now choose a set of points P_i:
% 2D Points
P=[226.98, 681.63; 232.86, 692.24; 233.97, 702.84; 234.02, 711.39; 244.26, 724.88;...
258.64, 708.33; 244.26, 724.88; 244.85, 734.19 ;246.94, 748.99; 262.28, 746.91;...
269.71, 759.26;290.51, 768.915;304.33,784.773; 327.881, 755.707; 348.513,761.045;...
362.37,738.539;367.863,722.165; 343.84, 713.01; 344.0, 685.99;353.68,667.17;...
333.49,665.9;316.58,648.41;308.76, 622.39; 284.845, 622.275;260.88 617.84; ...
261.284, 644.716; 233.422, 634.545;226.903,642.539;221.35, 651.76;220.16, ...
660.27;218.67,670.42];
Next, below you find the code circapprox.m which calculates (approximately) a local minimum of the cost function S. It also plots a sequence of ellipses approximating the set of points, the graph of the cost function against the number of steps, the graph of the norm ‖∇S‖, and the graph of the step size ε applied at each step.
For example, after 2400 steps of the gradient descent we have the plots:
allFigures02.png
steps.png
Here the initial approximating ellipse is drawn in thick red and the final (after 2400 iterations) ellipse in thick blue. (We did not plot all 2400 ellipses to avoid a mess on the plot.)
The parameters of the final ellipse E are (x_c, y_c, a, b, α) = (291.5469, 697.6231, 81.3316, 56.6900, -2.1785), and the final value of the cost function is S = 77.6994.
% circapprox.m
%
% initial point (xc,yc,a,b,alpha) and step
xc=300; yc=710; a=55; b=35; alpha=pi/3;
h=0.005;
[Sini,Xtini,Ytini]=cost(P,xc,yc,a,b,alpha); %initial ellipse
% initialize iterations
N = size(P,1);
numMaxSteps = 2400;
p(1) = xc; p(2) = yc; p(3) = a; p(4) = b; p(5) = alpha; % p = the list of the 5 parameters of the ellipse
CostF = zeros(numMaxSteps,1); NormGr = zeros(numMaxSteps,1); % initialize result variables to zero
Ep = zeros(numMaxSteps,1); Ellipse_Param = zeros(numMaxSteps,5);
 
for k = 1:numMaxSteps % initiate the descent iterations
Ellipse_Param(k,:) = p;
eps = 0.015; % the initial step size which we try to apply at each step
[S,Xt,Yt] = cost(P,p(1),p(2),p(3),p(4),p(5));
% calculate the 5 components of the gradient vector Grad (S):
gradS(1)=(cost(P,p(1)+h,p(2),p(3),p(4),p(5))-cost(P,p(1)-h,p(2),p(3),p(4),p(5)))/(2*h);
gradS(2)=(cost(P,p(1),p(2)+h,p(3),p(4),p(5))-cost(P,p(1),p(2)-h,p(3),p(4),p(5)))/(2*h);
gradS(3)=(cost(P,p(1),p(2),p(3)+h,p(4),p(5))-cost(P,p(1),p(2),p(3)-h,p(4),p(5)))/(2*h);
gradS(4)=(cost(P,p(1),p(2),p(3),p(4)+h,p(5))-cost(P,p(1),p(2),p(3),p(4)-h,p(5)))/(2*h);
gradS(5)=(cost(P,p(1),p(2),p(3),p(4),p(5)+h)-cost(P,p(1),p(2),p(3),p(4),p(5)-h))/(2*h);
while ( cost(P,p(1)-eps*gradS(1),p(2)-eps*gradS(2),p(3)-eps*gradS(3),p(4)-eps*gradS(4),p(5)-eps*gradS(5)) > S) % we check if the next
% error function S is bigger than in the previous step. This happens when
% 'eps' is too big.
eps=eps/2.5; % if this happens, we reduce the step size 'eps' at the current step
end
CostF(k) = S; NormGr(k) = norm(gradS); Ep(k) = eps;
p=p-eps*gradS; % change the vector of the 5 parameters of the ellipse
end
 
% plot results
figure()
plot(P(:,1),P(:,2),'ro')
hold on;
plot(Xtini,Ytini, Color='red',LineWidth=2) % plotting the initial ellipse
for r = [3,7,30,50,90,120,180,240,320,380,500,1000] % we plot only few ellipses
p = Ellipse_Param(r,:);
[S,Xt,Yt]=cost(P,p(1),p(2),p(3),p(4),p(5));
plot(Xt,Yt);
end
plot(Xt,Yt,Color='blue',LineWidth=2) % plotting the final ellipse
axis([200,380,600,800])
title('Ellipse evolution')
hold off;
figure()
plot(P(:,1),P(:,2),'ro')
hold on;
plot(Xt,Yt,Color='blue',LineWidth=2) % plotting the final ellipse
axis([200,380,600,800])
title('Final Ellipse')
hold off;
figure()
plot(CostF) % plotting the graph of S against the number of iterations
title('Cost Function')
figure()
plot(NormGr(1:1500)) % plotting the graph of norm(grad(S)) against the number of iterations
axis([1,1500,0,300])
title('norm of Grad(S) (1500 steps)')
figure()
plot(Ep(1:1200))
title('step size eps for the first 1200 steps')
hold off
 

Exercise 1. Apply the noise filter

As we already know, S is the average squared distance between an ellipse and all the points P_i.
Now, given the final ellipse E_f with the parameters p = (x_c, y_c, a, b, α) and the final cost value S_f, from the set you should remove the points P_i with dist(P_i, E_f)² ≥ S_f and create a new list of the points which satisfy dist(P_i, E_f)² < S_f. In other words, we discard the renegade points which are too far from E_f (see an example in the Figure below).
filter.png
The corresponding code is:
% SQdistance.m
Sfin=77.6994; % Specify the final value S_f of the cost function
p=[291.5469,697.6231,81.3316,56.6900,-2.1785]; % and the final values of the parameters of the final ellipse E_f
Dist=[]; Pnew=[]; % create the list of the squared distances between all the points P_i and E_f
for i=1:size(P,1)
Di=cost(P(i,:), p(1),p(2),p(3),p(4),p(5)); % calculating the squared distance between the point P_i and E_f
Dist=[Dist,Di];
if Di < Sfin
Pnew=[Pnew;P(i,:)]; % create a new list of the points P_i which are not far from the ellipse E_f
end
end
figure()
plot(Dist,'ro',Color='red')
axis([0,35,-20,600])
hold on
plot([1,32],[Sfin,Sfin],LineWidth=2,Color='blue')
hold off
Then, for the modified list of points, repeat the algorithm of the gradient descent taking E_f as the initial ellipse, and obtain a new ellipse (with new parameters p) which minimises the cost function S for the modified list. An example (with 800 gradient steps) is illustrated below:
Exrecise1.png
This method is known as a noise filter because some points in the initial list might have appeared due to errors (noise) during the experiments.
You are invited to develop the code and recover the above results yourself!
© Numerical Factory 2024