The method of steepest descent is a gradient algorithm where the step size $\alpha_k $ is chosen to achieve the maximum amount of decrease of the objective function at each individual step. Specifically, $\alpha_k$ is chosen to minimize $\phi_k (\alpha) \triangleq f(\b{x}^{(k)} - \alpha \nabla f(\b{x}^{(k)}))$. In other words,
$$\alpha_k = \argmin{ \alpha \geq 0} f(\b{x}^{(k)} - \alpha \nabla f(\b{x}^{(k)}))$$
To summarize, the steepest descent algorithm proceeds as follows:
At each step starting from the point $\b{x}^{(k)} $, we conduct a line search in the direction $- \nabla f(\b{x}^{(k)}) $ until a minimizer $\b{x}^{(k+1)}$ is found.
If $\set{ \b{x}^{(k)}}^ \infty _ {k=0} $ is a steepest descent sequence for a given function $f: \R^n \to \R $, then for each $k$ the vector $\b{x}^{(k+1)} - \b{x}^{(k)}$ is orthogonal to the vector $\b{x}^{(k + 2)} - \b{x}^{(k+1)}$.
If $\set{ \b{x}^{(k)}}^ \infty _ {k = 0} $ is a steepest descent sequence for $f: \R^n \to \R $ and if $\nabla f(\b{x}^{(k)}) \neq \b{0} $, then $f(\b{x}^{(k+1)}) < f (\b{x}^{(k)})$.
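These two properties can be checked numerically. The sketch below (Python with NumPy; the quadratic $f(x_1, x_2) = x_1^2 + 4x_2^2$ and the ternary line search are illustrative assumptions, not part of the text) takes two steepest descent steps and confirms that successive steps are orthogonal and that $f$ strictly decreases:

```python
import numpy as np

# illustrative quadratic: f(x) = x1^2 + 4 x2^2 (an assumption for this demo)
f = lambda x: x[0] ** 2 + 4 * x[1] ** 2
grad = lambda x: np.array([2 * x[0], 8 * x[1]])

def exact_line_search(phi, lo=0.0, hi=1.0, iters=200):
    # ternary search: shrinks the bracket until the minimizer of the
    # unimodal phi is located to machine precision
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        lo, hi = (lo, m2) if phi(m1) < phi(m2) else (m1, hi)
    return 0.5 * (lo + hi)

# generate three consecutive iterates x^(0), x^(1), x^(2)
xs = [np.array([1.0, 1.0])]
for _ in range(2):
    x = xs[-1]
    g = grad(x)
    alpha = exact_line_search(lambda a: f(x - a * g))
    xs.append(x - alpha * g)

d1, d2 = xs[1] - xs[0], xs[2] - xs[1]
# d1 and d2 are orthogonal, and f decreases strictly at each step
```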
If for some $k $ we have $\nabla f(\b{x}^{(k)}) = \b{0} $, then the point $\b{x}^{(k)} $ satisfies the FONC, and $\b{x}^{(k+1)} = \b{x}^{(k)}$. This equality can be used as a stopping criterion for the algorithm. However, this criterion is rarely met exactly in practice.
A practical stopping criterion is to check if $\norm{ \nabla f(\b{x}^{(k)})}$ is less than a prespecified threshold, in which case we stop. Alternatively, we may use the following criterion
$$\abs{ f(\b{x}^{(k+1)}) - f(\b{x}^{(k)})} < \epsilon$$
where $\epsilon > 0 $ is a prespecified threshold.
Yet another alternative is to compute the norm $\norm{ \b{x}^{(k+1)} - \b{x}^{(k)}}$ of the difference between two successive iterates, and we stop if the norm is less than a prespecified threshold:
$$\norm{ \b{x}^{(k+1)} - \b{x}^{(k)}} < \epsilon $$
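The three stopping criteria can be combined in one loop. Below is a minimal Python sketch, assuming NumPy and a simple ternary line search over $[0, 1]$ (an arbitrary bracket chosen for this demo); the example function and thresholds are illustrative:

```python
import numpy as np

def steepest_descent(f, grad, x0, eps=1e-8, max_iter=10_000):
    """Steepest descent with the three stopping criteria from the text."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:               # ||grad f(x^(k))|| < eps
            break
        # ternary line search over alpha in [0, 1] (bracket is an assumption)
        lo, hi = 0.0, 1.0
        for _ in range(100):
            m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
            lo, hi = (lo, m2) if f(x - m1 * g) < f(x - m2 * g) else (m1, hi)
        x_new = x - 0.5 * (lo + hi) * g
        if abs(f(x_new) - f(x)) < eps:            # |f(x^(k+1)) - f(x^(k))| < eps
            return x_new
        if np.linalg.norm(x_new - x) < eps:       # ||x^(k+1) - x^(k)|| < eps
            return x_new
        x = x_new
    return x

# illustrative example: the minimizer of f is at (1, 0)
f = lambda x: (x[0] - 1) ** 2 + 2 * x[1] ** 2
grad = lambda x: np.array([2 * (x[0] - 1), 4 * x[1]])
x_star = steepest_descent(f, grad, [0.0, 1.0])
```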
Now consider the quadratic function
$$f(\b{x}) = \frac{ 1 }{ 2 } \transpose{ \b{x}} \b{Q} \b{x} - \transpose{ \b{b}} \b{x}$$
where $\b{Q} \in \R^{n \times n}, \b{Q} = \transpose{\b{Q}}, \b{Q} > 0$, $\b{b} \in \R ^n$, and $\b{x} \in \R^n $.
The unique minimizer of $f $ can be found by setting the gradient $\nabla f(\b{x}) $ to zero, where
$$\nabla f(\b{x}) = \b{Q} \b{x} - \b{b}$$
since $$D (\transpose{ \b{x}} \b{Q} \b{x}) = \transpose{ \b{x}} (\b{Q} + \transpose{ \b{Q}}) = 2 \transpose{ \b{x}} \b{Q} \br D (\transpose{ \b{b}} \b{x}) = \transpose{ \b{b}}$$
Also, we have $\b{F}(\b{x}) = \b{Q} = \transpose{ \b{Q}} > 0 $.
In fact, we can generalize by allowing the matrix in the quadratic term to be a non-symmetric matrix $\b{A} $. In that case, we just need to replace $\b{A}$ with its symmetric part
$$\b{Q} = \frac{ 1 }{ 2 } \pare{\b{A} + \transpose{ \b{A}}} $$
since $\transpose{ \b{x}} \b{A} \b{x} = \transpose{ \b{x}} \b{Q} \b{x} $ for all $\b{x} $, which reduces the problem to the symmetric case above.
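A quick numerical check of this substitution (the random $3 \times 3$ matrix is an arbitrary choice for the demo):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))   # non-symmetric matrix (arbitrary example)
Q = 0.5 * (A + A.T)               # symmetric part of A
x = rng.standard_normal(3)

# the quadratic form depends only on the symmetric part of A
assert np.isclose(x @ A @ x, x @ Q @ x)
assert np.allclose(Q, Q.T)
```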
To simplify the notation we write $\b{g}^{(k)} = \nabla f(\b{x}^{(k)})$. Then, the steepest descent algorithm for the quadratic function can be represented as
$$\b{x}^{(k+1)} = \b{x}^{(k)} - \alpha_k \b{g}^{(k)}$$
where
$$\begin{align*} \alpha_k &= \argmin{ \alpha \geq 0 } f(\b{x}^{(k)} - \alpha \b{g}^{(k)}) \br &= \argmin{ \alpha \geq 0 } (\frac{ 1 }{ 2 } \transpose{(\b{x}^{(k)} - \alpha \b{g}^{(k)})} \b{Q} (\b{x}^{(k)} - \alpha \b{g}^{(k)}) - \transpose{(\b{x}^{(k)} - \alpha \b{g}^{(k)})}\b{b}) \end{align*}$$
In the quadratic case, we can derive an explicit formula for $\alpha_k $.
Assume $\b{g}^{(k)} \neq \b{0} $, for if $\b{g}^{(k)} = \b{0} $, then $\b{x}^{(k)} $ satisfies the FONC and the algorithm stops. Because $\alpha_k \geq 0 $ is a minimizer of $\phi_k (\alpha) = f(\b{x}^{(k)} - \alpha \b{g}^{(k)}) $, we apply the FONC to $\phi_k(\alpha)$ to obtain
$$\phi'_k(\alpha) = \transpose{(\b{x}^{(k)} - \alpha \b{g}^{(k)})} \b{Q} (- \b{g}^{(k)}) - \transpose{ \b{b}}(-\b{g}^{(k)}) = 0 $$
Hence,
$$\alpha_k = \frac{ \transpose{ \b{g}^{(k)}} \b{g}^{(k)}}{ \transpose{ \b{g}^{(k)}} \b{Q} \b{g}^{(k)}}$$
The method of steepest descent for the quadratic function takes the form
$$\b{x}^{(k+1)} = \b{x}^{(k)} - \pare{\frac{ \transpose{ \b{g}^{(k)}} \b{g}^{(k)}}{ \transpose{ \b{g}^{(k)}} \b{Q} \b{g}^{(k)}}} \b{g}^{(k)} $$
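The closed-form step size makes the quadratic case easy to implement. A minimal sketch, assuming NumPy and an example symmetric positive definite $\b{Q}$ of our own choosing:

```python
import numpy as np

def quad_steepest_descent(Q, b, x0, eps=1e-10, max_iter=500):
    """Steepest descent for f(x) = (1/2) x^T Q x - b^T x with Q = Q^T > 0,
    using the closed-form step alpha_k = (g^T g) / (g^T Q g)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = Q @ x - b                      # gradient of the quadratic
        if np.linalg.norm(g) < eps:        # stopping criterion
            break
        alpha = (g @ g) / (g @ (Q @ g))    # exact minimizing step size
        x = x - alpha * g
    return x

# example data: Q symmetric positive definite (our own choice for the demo)
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = quad_steepest_descent(Q, b, np.zeros(2))
# the result should agree with the unique minimizer, the solution of Q x = b
```

Since the unique minimizer satisfies $\nabla f(\b{x}) = \b{Q}\b{x} - \b{b} = \b{0}$, the iterates can be compared against a direct linear solve.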