The idea of the gradient descent algorithm

In many optimisation problems one needs to find an absolute minimum of a continuous scalar function f : ℝⁿ → ℝ, which can be given by a formula, a rule, or a procedure. Sometimes f can be defined piecewise.
Of course, if the number n is small (say n = 1, 2 or 3) and the function is differentiable and given by a simple expression, then we can do the standard analytical study of the extrema: solve the equation ∇f = 0 and classify the resulting critical points.
Now imagine that the number n of variables is large and, moreover, the function f itself is defined by a complicated procedure. Then it is more natural to apply a direct gradient descent method: starting from an initial point x₀, iterate
x_{k+1} = x_k − ε·∇f(x_k),
where ε is a step size parameter, which can vary from step to step in such a way that at each step the new value of f is smaller than at the previous one.
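This iteration can be sketched in a few lines. Here is a minimal Python version (the quadratic f used in the check is a placeholder example, not the function shown in the figures below):

```python
import numpy as np

def gradient_descent(grad_f, x0, eps=0.1, num_steps=1000):
    """Minimal gradient descent: x_{k+1} = x_k - eps * grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eps * grad_f(x)
    return x

# Placeholder example: f(x, y) = (x - 1)^2 + 2*(y + 3)^2, minimum at (1, -3)
grad_f = lambda v: np.array([2*(v[0] - 1), 4*(v[1] + 3)])
print(gradient_descent(grad_f, [5.0, 5.0]))  # approaches [1, -3]
```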
We illustrate this procedure on an example of a function f(x, y) of just two variables. Suppose we can visualise its graph in 3D, a family of contour lines on the graph, and the level curves on the (x, y) plane, together with the local extrema and saddle points (if any), as shown below (the extrema are marked with white points and the saddle points with white crosses):
So, our function has 4 local minima (one of them is the absolute minimum), one local maximum, and 4 saddle points.
Choosing an initial point, we do a series of descents: at each step we move in the direction of fastest descent, which is opposite to the gradient vector ∇f at the corresponding point. Notice that at each point the gradient vector is orthogonal to the level curve passing through that point; see the field of directions of the descents:
grad_field.png
Below several descent paths for different initial points (red dots) are shown:
As we see, almost any descent path necessarily ends in a neighbourhood of a local minimum (where ∇f is almost zero). Some rare paths may end near a saddle point, where the gradient is also close to zero (see the red path), but the probability of this is very small. Moreover, when the number n of variables is large, this probability is practically zero, so we do not take this case into account.
So, running a sufficiently large number of descent paths with different initial points, we can be almost sure to reach the absolute minimum of f, provided that we have chosen a suitable step size ε: if it is too big we may constantly jump over the extrema and never actually descend (see the red path in the Figure below); if it is too small we risk never approaching a minimum sufficiently closely.
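A simple safeguard against a too-big step is to shrink a trial step size until the function value actually decreases. A Python sketch of the idea (the shrinking factor 2.5 matches the reduction used in the MATLAB codes later; f and the starting data are toy values):

```python
def safe_step(f, x, g, eps0=0.001, shrink=2.5):
    """Reduce the trial step size eps0 until f decreases along -g.
    Assumes g is a genuine descent direction at x."""
    eps = eps0
    while f(x - eps * g) > f(x):
        eps = eps / shrink  # step was too big: shrink and retry
    return x - eps * g, eps

# Toy check on f(x) = x^2 with a deliberately huge initial step
f = lambda x: x * x
x_new, eps = safe_step(f, 1.0, 2.0, eps0=10.0)
print(x_new, eps)
```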

Example 1: Simple linear regression and the gradient descent in ℝ²

Suppose we have a set of experimental data describing a relation between a property X and a property Y (for example, the height of a father and the height of his child at a certain age). We represent the data as a set of points P_i = (x_i, y_i) on the plane and we want to approximate this relation by a linear function y = a·x + b, where a is the slope (of the graph) and b is the bias. This approximation problem is known as simple linear regression.
To obtain the best possible approximation we choose the pair (a, b) which minimises an error function, which can be either the mean of the squared vertical deviations (a·x_i + b − y_i)² or the mean of the squared distances d_i², where
d_i = |a·x_i − y_i + b| / √(a² + 1)
is the distance between the point P_i and the line; as we know from Geometry, d_i is given by the formula above. In this exercise we choose the second option,
S(a, b) = (1/N) Σ_{i=1}^{N} d_i²,
and, for a given set P, our task will be to find the absolute minimum of S(a, b) by doing a gradient descent.
Consider a concrete case of 36 points on the plane, their coordinates being the rows of the matrix
P = [23.05376671, 2.083554005; 23.774115313,2.14486932; 25.031876523,1.7269246;...
25.956640, 2.83704978; 27.357839,4.2577748; 27.865011,4.813969; ...
29.07254042, 4.0247780; 30.0714742,4.41801357; 30.9875855,5.54587904; ...
32.1409034, 5.96687696; 33.06714971,5.36700523; 34.0717238,6.95209411;...
35.04888937,7.16387720; 36.072688,7.07862363; 37.0293871,7.335086878;...
38.08883956,7.641171957; 38.893112954,8.226200522; 39.7055715,9.57535211;...
41.0325190, 9.14802867; 42.1370298,9.215393; 42.9897757553914,10.2534211833571;...
44.0319206739166, 10.925143438655; 44.9135120082676,11.2379794815215;...
45.9835120980791, 11.9510829150115; 47.1093265669039,12.5937093190458;...
47.9136347178011, 12.6309436364522; 48.8785882956385,12.6045997034053;...
49.9993150671897, 14.1130521233139; 50.9230334086246,14.098551525104;...
51.9774415597729, 14.8469424555258; 52.8910935704948,14.863022985666;...
54.0552527021112, 15.7402440871523; 55.1544211895504,15.7843724532702;...
55.8508409689362, 15.9030792650961; 56.893841826668,17.5901828896008;...
57.9384398118533, 17.3992307134816];
First observe that we can calculate the error function S analytically using the code:
syms a b % a, b will be symbolic variables in this file
% Below we calculate the error function Ferr(a,b) symbolically
N=size(P,1);
Ferr=0;
for i=1:N
Ferr=Ferr+(a*P(i,1)-P(i,2)+b)^2/(a^2+1);
end
Ferr = Ferr/N;
Ferr = simplify(Ferr) % no semicolon: display the simplified result
The output is the error function Ferr(a, b) in explicit symbolic form.
It has two critical points (candidates for extrema). An additional study shows that the first point is a minimum, while the second one is a saddle point; we conclude that the first one, (a, b) = (0.4475439589, -8.73456329), is the absolute minimum of Ferr.
minima.png
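The "additional study" amounts to solving ∇Ferr = 0 and checking the Hessian at each critical point. A sketch of that procedure in Python/SymPy, run on a small toy function rather than the full Ferr (the analogous MATLAB tools are in the Symbolic Math Toolbox):

```python
import sympy as sp

a, b = sp.symbols('a b', real=True)
# Toy error function with two minima and one saddle (NOT the real Ferr):
F = a**4 - 2*a**2 + b**2

crit = sp.solve([sp.diff(F, a), sp.diff(F, b)], [a, b], dict=True)
H = sp.hessian(F, (a, b))
for pt in crit:
    ev = H.subs(pt).eigenvals()  # eigenvalues of the Hessian at pt
    kind = 'minimum' if all(e > 0 for e in ev) else 'saddle or maximum'
    print(pt, kind)
```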

Compute the result using the gradient descent method:

Now we try to recover this result and make some plots by applying a gradient descent.
Using the function costlin.m
function S=costlin(P,a,b) % For a given set of points P_1,...,P_N and a pair (a,b)
% we calculate the error function
N=size(P,1);
S=0;
for i=1:N
S=S+(a*P(i,1)-P(i,2)+b)^2/(a^2+1)/N;
end
end
and the main code LinearRegr.m (see below),
which, after 35000 gradient steps (!), produces the final parameters (a, b) of the line approximating the data set, namely (a = 0.4496516538, b = -8.8253832022), and the final value of the error function S = 0.182979175523. Compare this with the exact coordinates of the absolute minimum of Ferr and its value min(Ferr) computed before.
The code also produces the following plots (with some comments added):
allFigures00.png
Now here is the code LinearRegr.m (note that it uses the matrix P and the function costlin.m!)
% LinearRegr.m
a0 = 1.0; b0 = -10.5; % initial point (a,b) and h
h = 0.002;
% initialize the parameters for iterations
a = a0; b = b0;
numMaxSteps = 35000;
% create list of values of the cost function, etc.
CostFun = zeros(numMaxSteps,1); NormGrad = zeros(numMaxSteps,1);
Bval = zeros(numMaxSteps,1); Aval = zeros(numMaxSteps,1); Eps = zeros(numMaxSteps,1);
% start iterations
for k = 1:numMaxSteps % k = present number of descent step
eps = 0.001; % the initial step size which we try to apply at each step
S = costlin(P,a,b); % calculate the error function (defined at the end of this script)
CostFun(k) = S;
gradS(1) = (costlin(P,a+h,b)-costlin(P,a-h,b))/(2*h); % calculate the approximation of partial derivatives
gradS(2) = (costlin(P,a,b+h)-costlin(P,a,b-h))/(2*h);
NormGrad(k) = norm(gradS);
Bval(k) = b;
Aval(k) = a;
while costlin(P,a-eps*gradS(1),b-eps*gradS(2))>S % we check if the next
% error function S is bigger than in the previous step. This happens when
% 'eps' is too big.
eps=eps/2.5; % if this happens, we reduce the step size 'eps' at the current step
end
Eps(k) = eps;
a=a-eps*gradS(1); % calculate values of (a,b) for the next step
b=b-eps*gradS(2);
end
 
% Plot the results:
figure()
plot(P(:,1),P(:,2),'ro',Color='red') % we plot the points P_1,...,P_N on the plane (x,y)
hold on
t = 17:5:62;
plot(t,a0*t+b0,Color='red',LineWidth=1.8) % we plot the line with the initial parameters (a,b)
for k = 2:5
plot(t,Aval(k)*t+Bval(k),'--',LineWidth=0.5) % we plot a few more lines
end
plot(t,a*t+b,Color='blue',LineWidth=1.8) % we plot the approximating line with the final values (a,b)
title('line approximation')
figure()
numS = 60;
plot(CostFun(1:numS),'.-')
axis([0,numS,0,250])
title('error function (first 60 steps)')
figure()
axis equal
plot(Aval(1),Bval(1),'b*')
axis([0,1.2,-10.7,-8.6])
hold on
plot(Aval,Bval,'b-')
plot(0.4475439589,-8.73456329,'ro',Color='red') % plot the absolute minimum computed above
hold off
legend('ini point','the path on (a,b)','target point = abs. min',...
'Location','northoutside')
figure()
ha = 0.01; hb = 0.1;
aa = a-0.3:ha:a+0.6; bb = b-2:hb:b+2;
[A,B]=meshgrid(aa,bb);
N=size(P,1);
[m,n] = size(A);
Z = zeros(m,n); % preallocate the grid of values of the error function
for i =1:m
for j=1:n
S=0;
for k=1:N
S=S+(A(i,j)*P(k,1)-P(k,2)+B(i,j))^2/(A(i,j)^2+1)/N;
end
Z(i,j) = S;
end
end
mesh(A,B,Z)
view(8,23) %point of view
hold on
plot3(Aval(1:10), Bval(1:10), CostFun(1:10),'ro',MarkerSize=5,MarkerFaceColor='r')
plot3(Aval(1:10), Bval(1:10), CostFun(1:10),'y--')
plot3(Aval(11:500:end), Bval(11:500:end), CostFun(11:500:end),'ro',...
MarkerSize=5,MarkerFaceColor='r')
title('a-b iteration steps in 3D')
 
An observation: As we see from the graph of b against the step number, the closer we are to the target value of b, the more slowly we approach it:
stepNumber.png
This is because in the neighbourhood of the absolute minimum, the b-component of the gradient vector is very small, so it requires many steps to reach the minimum.
A math hack: In the above algorithm, instead of choosing the initial values of the parameters randomly, do the following:
1) From the list choose any 2 points, say P_i and P_j, which are not too close to each other;
2) Calculate the equation y = a₀·x + b₀ of the unique line which passes through P_i and P_j;
3) Take the coefficients (a₀, b₀) as the initial values of (a, b) and repeat the algorithm of the gradient descent to get a better approximating line.
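A sketch of this initialisation in Python (the helper name is ours; the two sample points are rounded from the first and last rows of the matrix P above):

```python
def line_through(Pi, Pj):
    """Slope/intercept (a0, b0) of the line through (xi, yi) and (xj, yj).
    Assumes xi != xj (the points must not be too close, nor vertically aligned)."""
    (xi, yi), (xj, yj) = Pi, Pj
    a0 = (yj - yi) / (xj - xi)
    b0 = yi - a0 * xi
    return a0, b0

# Rounded first and last data points of the set P
a0, b0 = line_through((23.05, 2.08), (57.94, 17.40))
print(a0, b0)  # slope ~ 0.44, already of the right order for the descent
```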

Example 2: Recognising the letter 'O' by the Gradient Descent

Suppose we are given a set of points P₁, …, P_N which, approximately, form an oval. We want to get the best possible approximation of this set by a continuous curve, an ellipse E. How do we then choose the best approximating ellipse?
Ofitting1.png
For a given ellipse E, it is natural to introduce the cost function (or error function) S(E) = (1/N) Σ_{i=1}^{N} dist(P_i, E)², where dist(P_i, E) is the shortest distance between P_i and E, and then find an ellipse which minimises S. That is, S is the average squared distance between the ellipse E and all the points P_i.
ellipse0.png
On the other hand, any ellipse can be uniquely described by 5 parameters: the centre (x_c, y_c), the semi-axes a ≥ b > 0, and the rotation angle α of the a-axis.
(Naturally, if a = b, the ellipse is just a circumference and the angle α is not determined!) In the general case we assume that α ∈ [0, π), since α and α + π define the same ellipse.
The points of a given ellipse E admit a trigonometrical parameterisation with a parameter t ∈ [0, 2π):
x(t) = x_c + a·cos(α)·cos(t) − b·sin(α)·sin(t),
y(t) = y_c + a·sin(α)·cos(t) + b·cos(α)·sin(t).
So, in view of the above parameterisation of E, for each point P_i = (x_i, y_i) of the set, we have
dist(P_i, E)² = min_t [ (x(t) − x_i)² + (y(t) − y_i)² ].
An explicit formula for the above minimum is quite complicated; instead of using it, we will calculate a numerical approximation of dist(P_i, E)² using the MATLAB function cost.m, see below.
We conclude that the cost function S depends on the 5 parameters of E, and we want to find (approximately!) the ellipse E for which S reaches the global minimum.
Note that, in contrast to the previous example, where we knew an analytic expression of the error function Ferr(a, b), now we do not know such a formula for S and, therefore, cannot know the exact position of the extrema of S!
We can only apply the following gradient descent algorithm:
1) Download the set of points as a matrix P and plot it on the plane;
2) Choose a set p = (x_c, y_c, a, b, α) of the 5 parameters of an initial ellipse E and plot it in the same frame. We can choose the parameters randomly or using other approaches (see below);
3) For each point P_i calculate dist(P_i, E)², and then the value of the cost function S;
4) For a given small h, calculate the numerical approximation of the 5-dimensional centred gradient vector ∇S (see Practica PCII-1);
5) Replace the parameters p by p − ε·∇S, where, as in the previous example, ε is a step size parameter (which can vary from step to step). Plot the new ellipse E in the same frame.
6) Repeat the steps 3), 4), 5) M times, where M is big enough, or repeat them until the norm of the new gradient is less than a small number. Then stop the procedure and assume the latest obtained set p to be a numerical approximation of a local minimum of the function S.
7) Notice that for some sets of points the function S may have several local minima. To try to reach the global minimum of the cost function S (that is, to obtain the best approximation of the set by an ellipse), take another initial ellipse (a new set p) and apply the above algorithm from step 2). Do this several (m) times.
8) Finally compare the values of S at all the local minima obtained (if there are more than one) and choose the global one. Note that m can be quite big, but still there is no guarantee that this algorithm will necessarily detect the global minimum!
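Step 4), the centred-difference gradient, is the same recipe in any number of parameters. A short Python sketch (here cost stands for any scalar function of a parameter vector, such as the ellipse cost; the quadratic used in the check is a toy example):

```python
import numpy as np

def centred_gradient(cost, p, h=0.005):
    """Centred finite-difference approximation of grad cost at the point p."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h  # perturb only the i-th parameter
        g[i] = (cost(p + e) - cost(p - e)) / (2*h)
    return g

# Toy check on a quadratic with known gradient (2*p1, 4*p2):
g = centred_gradient(lambda p: p[0]**2 + 2*p[1]**2, [3.0, -1.0])
print(g)  # close to [6, -4]
```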

The MATLAB code:

First we will need the function cost.m, with the input variables: the matrix P of the points and the 5 parameters xc, yc, a, b, alpha of the ellipse; and the output variables: the value S of the cost function and the vectors Xt, Yt of coordinates of points on the ellipse.
function [S,Xt,Yt]=cost(P,xc,yc,a,b,alpha)
t = 0: 2*pi/400 :2*pi;
n = size(t,2);
Xt = xc + cos(alpha)*a*cos(t) - sin(alpha)*b*sin(t);
Yt = yc + sin(alpha)*a*cos(t) + cos(alpha)*b*sin(t);
S = 0;
dist = zeros(1,size(P,1)); % preallocate the list of squared distances
for i=1:size(P,1)
DPi=zeros(1,n);
for j=1:n
DPi(j)=(Xt(j)-P(i,1))^2+(Yt(j)-P(i,2))^2; % the squared distance between the point (Xt(j),Yt(j)) of the ellipse and the point P_i of the set
end
dist(i)=min(DPi); % dist(i) = the squared distance between the ellipse and the point P_i
S = S+dist(i)/size(P,1); % accumulate the cost function for the given ellipse
end
end
 
Now choose a set of points P_i:
% 2D Points
P=[226.98, 681.63; 232.86, 692.24; 233.97, 702.84; 234.02, 711.39; 244.26, 724.88;...
258.64, 708.33; 244.26, 724.88; 244.85, 734.19 ;246.94, 748.99; 262.28, 746.91;...
269.71, 759.26;290.51, 768.915;304.33,784.773; 327.881, 755.707; 348.513,761.045;...
362.37,738.539;367.863,722.165; 343.84, 713.01; 344.0, 685.99;353.68,667.17;...
333.49,665.9;316.58,648.41;308.76, 622.39; 284.845, 622.275;260.88 617.84; ...
261.284, 644.716; 233.422, 634.545;226.903,642.539;221.35, 651.76;220.16, ...
660.27;218.67,670.42];
Next, below you find the code circapprox.m which calculates (approximately) a local minimum of the cost function S. It also plots a sequence of ellipses approximating the set of points, the graph of the cost function against the number of steps, the graph of the norm ‖∇S‖, and the graph of the step size ε applied at each step.
For example, after 2400 steps of the gradient descent we have the plots:
allFigures02.png
steps.png
Here the initial approximating ellipse is drawn in thick red and the final (after 2400 iterations) ellipse in thick blue. (We did not plot all 2400 ellipses to avoid a mess on the plot.)
The parameters of the final ellipse E are (x_c, y_c, a, b, α) = (291.5469, 697.6231, 81.3316, 56.6900, -2.1785), and the final value of the cost function is S = 77.6994.
% circapprox.m
%
% initial point (xc,yc,a,b,alpha) and step
xc=300; yc=710; a=55; b=35; alpha=pi/3;
h=0.005;
[Sini,Xtini,Ytini]=cost(P,xc,yc,a,b,alpha); %initial ellipse
% initialize iterations
N = size(P,1);
numMaxSteps = 2400;
p(1) = xc; p(2) = yc; p(3) = a; p(4) = b; p(5) = alpha; % p = the list of the 5 parameters of the ellipse
CostF = zeros(numMaxSteps,1); NormGr = zeros(numMaxSteps,1); % initialize result variables to zero
Ep = zeros(numMaxSteps,1); Ellipse_Param = zeros(numMaxSteps,5);
 
for k = 1:numMaxSteps % initiate the descent iterations
Ellipse_Param(k,:) = p;
eps = 0.015; % the initial step size which we try to apply at each step
[S,Xt,Yt] = cost(P,p(1),p(2),p(3),p(4),p(5));
% calculate the 5 components of the gradient vector Grad (S):
gradS(1)=(cost(P,p(1)+h,p(2),p(3),p(4),p(5))-cost(P,p(1)-h,p(2),p(3),p(4),p(5)))/(2*h);
gradS(2)=(cost(P,p(1),p(2)+h,p(3),p(4),p(5))-cost(P,p(1),p(2)-h,p(3),p(4),p(5)))/(2*h);
gradS(3)=(cost(P,p(1),p(2),p(3)+h,p(4),p(5))-cost(P,p(1),p(2),p(3)-h,p(4),p(5)))/(2*h);
gradS(4)=(cost(P,p(1),p(2),p(3),p(4)+h,p(5))-cost(P,p(1),p(2),p(3),p(4)-h,p(5)))/(2*h);
gradS(5)=(cost(P,p(1),p(2),p(3),p(4),p(5)+h)-cost(P,p(1),p(2),p(3),p(4),p(5)-h))/(2*h);
while ( cost(P,p(1)-eps*gradS(1),p(2)-eps*gradS(2),p(3)-eps*gradS(3),p(4)-eps*gradS(4),p(5)-eps*gradS(5)) > S) % we check if the next
% error function S is bigger than in the previous step. This happens when
% 'eps' is too big.
eps=eps/2.5; % if this happens, we reduce the step size 'eps' at the current step
end
CostF(k) = S; NormGr(k) = norm(gradS); Ep(k) = eps;
p=p-eps*gradS; % change the vector of the 5 parameters of the ellipse
end
 
% plot results
figure()
plot(P(:,1),P(:,2),'ro')
hold on;
plot(Xtini,Ytini, Color='red',LineWidth=2) % plotting the initial ellipse
for r = [3,7,30,50,90,120,180,240,320,380,500,1000] % we plot only few ellipses
p = Ellipse_Param(r,:);
[S,Xt,Yt]=cost(P,p(1),p(2),p(3),p(4),p(5));
plot(Xt,Yt);
end
plot(Xt,Yt,Color='blue',LineWidth=2) % plotting the final ellipse
axis([200,380,600,800])
title('Ellipse evolution')
hold off;
figure()
plot(P(:,1),P(:,2),'ro')
hold on;
plot(Xt,Yt,Color='blue',LineWidth=2) % plotting the final ellipse
axis([200,380,600,800])
title('Final Ellipse')
hold off;
figure()
plot(CostF) % plotting the graph of S against the number of iterations
title('Cost Function')
figure()
plot(NormGr(1:1500)) % plotting the graph of norm(grad(S)) against the number of iterations
axis([1,1500,0,300])
title('norm of Grad(S) (1500 steps)')
figure()
plot(Ep(1:1200))
title('step size eps for the first 1200 steps')
hold off
 

Exercise 1. Apply the noise filter

As we already know, S is the average squared distance between an ellipse and all the points P_i.
Now, given the final ellipse E_f with the parameters p = (x_c, y_c, a, b, α) and the final cost value S_f, from the set you should remove the points P_i with dist(P_i, E_f)² ≥ S_f and create a new list of the points which satisfy dist(P_i, E_f)² < S_f. In other words, we discard the renegade points which are too far from E_f (see an example in the Figure below).
filter.png
The corresponding code is:
% SQdistance.m
Sfin=77.6994; % Specify the final value S_f of the cost function
p=[291.5469,697.6231,81.3316,56.6900,-2.1785]; % and the final values of the parameters of the final ellipse E_f
Dist=[]; Pnew=[]; % create the list of the squared distances between all the points P_i and E_f
for i=1:size(P,1)
Di=cost(P(i,:), p(1),p(2),p(3),p(4),p(5)); % calculating the squared distance between the point P_i and E_f
Dist=[Dist,Di];
if Di < Sfin
Pnew=[Pnew;P(i,:)]; % create a new list of the points P_i which are not far from the ellipse E_f
end
end
figure()
plot(Dist,'ro',Color='red')
axis([0,35,-20,600])
hold on
plot([1,32],[Sfin,Sfin],LineWidth=2,Color='blue')
hold off
Then, for the modified list of points, repeat the algorithm of the gradient descent taking E_f as the initial ellipse, and obtain a new ellipse (with new parameters p) which minimises the cost function S for the modified list. An example (with 800 gradient steps) is illustrated below:
Exrecise1.png
This method is known as a noise filter because some points in the initial list might have appeared due to errors (noise) during the experiments.
You are invited to develop the code and recover the above results yourself!
© Numerical Factory 2024