A level set of a function $f: \R^n \to \R $ is the set of points $\b{x} $ satisfying $f(\b{x}) = c $ for some constant $c $.
The gradient $\nabla f(\b{x}_0)$, if it is not the zero vector, is orthogonal to the tangent vector of any smooth curve passing through $\b{x}_0$ on the level set $f(\b{x}) = c$.
Thus, the direction of maximum rate of increase of a real-valued differentiable function at a point is orthogonal to the level set of the function through the point.
In other words, for a given small displacement, the function $f $ increases more in the direction of the gradient than in any other direction.
Thus, the direction in which $\nabla f(\b{x})$ points is the direction of maximum rate of increase of $f$ at $\b{x}$, and the direction in which $- \nabla f(\b{x})$ points is the direction of maximum rate of decrease of $f$ at $\b{x}$. Hence, the direction of the negative gradient is a good direction to search in if we want to find a minimizer of $f$.
This follows from the Cauchy-Schwarz inequality: for any unit vector $\b{d}$, the rate of increase of $f$ at $\b{x}$ in the direction $\b{d}$ is $\langle \nabla f(\b{x}), \b{d} \rangle \leq \norm{\nabla f(\b{x})}$, with equality precisely when $\b{d}$ points in the direction of $\nabla f(\b{x})$.
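As a quick sanity check, the steepest-ascent property can be verified numerically. The sketch below is only illustrative: it assumes a concrete quadratic $f(\b{x}) = x_1^2 + 3x_2^2$ (not from the text) and NumPy, samples random unit directions, and confirms that the sampled directional derivatives never exceed $\norm{\nabla f(\b{x})}$.

```python
import numpy as np

# Illustrative quadratic f(x) = x1^2 + 3*x2^2 (an assumed example) and its gradient.
def grad_f(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

rng = np.random.default_rng(0)
x = np.array([1.0, -2.0])
g = grad_f(x)

# Directional derivative along a unit vector d is <grad f(x), d>.
# By Cauchy-Schwarz it is at most ||grad f(x)||, attained when d is parallel to the gradient.
best = -np.inf
for _ in range(10_000):
    d = rng.standard_normal(2)
    d /= np.linalg.norm(d)
    best = max(best, float(g @ d))

print("largest sampled directional derivative:", best)
print("||grad f(x)|| (Cauchy-Schwarz bound):  ", float(np.linalg.norm(g)))
```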
Let $\b{x}^{(0)} $ be a starting point, and consider the point $\b{x}^{(0)} - \alpha \nabla f(\b{x}^{(0)}) $. Then by Taylor’s theorem we obtain
$$f(\b{x}^{(0)} - \alpha \nabla f(\b{x}^{(0)})) = f(\b{x}^{(0)}) - \alpha \norm{ \nabla f(\b{x}^{(0)})}^2 + o(\alpha)$$
Thus, if $\nabla f(\b{x}^{(0)}) \neq \b{0}$, then for sufficiently small $\alpha > 0 $, we have
$$f(\b{x}^{(0)} - \alpha \nabla f(\b{x}^{(0)})) < f(\b{x}^{(0)})$$
This means that the point $\b{x}^{(0)} - \alpha \nabla f(\b{x}^{(0)})$ is an improvement over the point $\b{x}^{(0)}$ if we are searching for a minimizer.
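To see the Taylor-expansion argument in action, here is a minimal numerical check, again using the illustrative quadratic $f(\b{x}) = x_1^2 + 3x_2^2$ (an assumption for the example, not part of the text). The value decreases for sufficiently small $\alpha$, while a step that is too large can overshoot and increase $f$.

```python
import numpy as np

# Same assumed quadratic f(x) = x1^2 + 3*x2^2 as above.
def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2

def grad_f(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

x0 = np.array([1.0, -2.0])
for alpha in (0.5, 0.1, 0.01):
    x1 = x0 - alpha * grad_f(x0)
    # f decreases for sufficiently small alpha; a step that is too large may overshoot.
    print(f"alpha={alpha:<5} f(x0)={f(x0):.4f}  f(x0 - alpha*grad)={f(x1):.4f}")
```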
Suppose that we are given a point $\b{x}^{(k)} $. To find the next point $\b{x}^{(k+1)} $, we start at $\b{x}^{(k)} $ and move by an amount $- \alpha_k \nabla f(\b{x}^{(k)}) $, where $\alpha_k $ is a positive scalar called the step size. The above procedure leads to the following iterative algorithm:
$$\b{x}^{(k+1)} = \b{x}^{(k)} - \alpha_k \nabla f(\b{x}^{(k)})$$
We refer to the above as a gradient descent algorithm.
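A minimal sketch of the iteration in code might look as follows. The fixed step size (rather than an iteration-dependent $\alpha_k$), the quadratic objective, and the iteration count are all illustrative choices, not prescribed by the text.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha=0.1, num_iters=200):
    """Fixed-step gradient descent: x^{(k+1)} = x^{(k)} - alpha * grad_f(x^{(k)})."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Assumed quadratic f(x) = x1^2 + 3*x2^2, whose minimizer is the origin.
grad_f = lambda x: np.array([2.0 * x[0], 6.0 * x[1]])
print(gradient_descent(grad_f, x0=[1.0, -2.0]))  # tends toward [0, 0]
```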