Convergence
An iterative algorithm possesses the descent property if the value of the objective function decreases at each new point the algorithm generates.
An iterative algorithm is globally convergent if for any arbitrary starting point the algorithm is guaranteed to generate a sequence of points converging to a point that satisfies the FONC for a minimizer.
When the algorithm is not globally convergent, it may still generate a sequence that converges to a point satisfying the FONC, provided the initial point is sufficiently close to that point. In this case we say that the algorithm is locally convergent.
The rate of convergence measures how fast the algorithm converges to a solution point.
For the quadratic $f(\b{x}) = \frac{ 1 }{ 2 } \transpose{\b{x}} \b{Q} \b{x} - \transpose{\b{b}} \b{x}$ with minimizer $\b{x}^* = \inv{\b{Q}} \b{b}$, define
$$V(\b{x}) = f(\b{x}) + \frac{ 1 }{ 2 } \transpose{(\b{x}^*)} \b{Q} \b{x}^* = \frac{ 1 }{ 2 } \transpose{(\b{x} - \b{x}^*)} \b{Q} (\b{x} - \b{x}^*) $$
$V$ differs from $f$ only by a constant, so the two have the same minimizer; $V$ measures the squared distance from $\b{x}$ to $\b{x}^*$ in the norm induced by $\b{Q}$, and $V(\b{x}^{(k)}) \to 0$ if and only if $\b{x}^{(k)} \to \b{x}^*$.
The iterative algorithm
$$\b{x}^{(k+1)} = \b{x}^{(k)} - \alpha_k \b{g}^{(k)} $$
with $\b{g}^{(k)} = \b{Q}\b{x}^{(k)} - \b{b}$ satisfies
$$V(\b{x}^{(k+1)}) = (1 - \gamma_k) V(\b{x}^{(k)})$$
where, if $\b{g}^{(k)} = 0$ then $\gamma_k = 1 $, and if $\b{g}^{(k)} \neq 0 $ then
$$\gamma_k = \alpha_k \frac{ \transpose{{\b{g}^{(k)}}} \b{Q} \b{g}^{(k)}}{ \transpose{{\b{g}^{(k)}}} \inv{ \b{Q}} \b{g}^{(k)}} \pare{2 \frac{ \transpose{{\b{g}^{(k)}}} \b{g}^{(k)}}{ \transpose{{\b{g}^{(k)}}} \b{Q} \b{g}^{(k)}} - \alpha_k} $$
Let $\set{ \b{x}^{(k)}} $ be the sequence resulting from a gradient algorithm $\b{x}^{(k+1)} = \b{x}^{(k)} - \alpha_k \b{g}^{(k)} $. Let $\gamma_k $ be as defined in the previous lemma, and suppose that $\gamma_k > 0 $ for all $k $. Then, $\set{ \b{x}^{(k)}} $ converges to $\b{x}^{*} $ for any initial condition $\b{x}^{(0)}$ if and only if
$$\sum_{ k =0 }^{ \infty } \gamma_k = \infty $$
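The identity $V(\b{x}^{(k+1)}) = (1 - \gamma_k) V(\b{x}^{(k)})$ can be checked numerically. A minimal sketch (the matrix $\b{Q}$, vector $\b{b}$, and starting point are my own illustrative choices, with $\alpha_k$ chosen by exact line search):

```python
import numpy as np

# Illustrative quadratic f(x) = 1/2 x^T Q x - b^T x (Q, b are my choices).
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(Q, b)        # minimizer satisfies Q x* = b
Qinv = np.linalg.inv(Q)

def V(x):
    d = x - x_star
    return 0.5 * d @ Q @ d            # V(x) = 1/2 (x - x*)^T Q (x - x*)

x = np.array([10.0, -7.0])            # arbitrary starting point
for k in range(5):
    g = Q @ x - b                     # gradient g^(k)
    alpha = (g @ g) / (g @ Q @ g)     # exact line search (steepest descent)
    gamma = alpha * (g @ Q @ g) / (g @ Qinv @ g) \
        * (2 * (g @ g) / (g @ Q @ g) - alpha)
    x_next = x - alpha * g
    # The lemma predicts V(x^(k+1)) = (1 - gamma_k) V(x^(k)):
    assert np.isclose(V(x_next), (1 - gamma) * V(x))
    x = x_next
```

The same identity holds for any positive step sizes $\alpha_k$, not just the exact-line-search choice used here.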
Let $\b{Q} = \transpose{ \b{Q}} > 0 $ be an $n \times n $ real symmetric positive definite matrix. Then, for any $\b{x} \in \R^n $, we have
$$\frac{ \lambda_{\text{min}}(\b{Q})}{ \lambda_{\text{max}}(\b{Q})} \leq \frac{(\transpose{ \b{x}} \b{x})^2 }{(\transpose{ \b{x}} \b{Q} \b{x})(\transpose{ \b{x}} \inv{ \b{Q}} \b{x})} \leq \frac{ \lambda_{\text{max}}(\b{Q})}{ \lambda_{\text{min}}(\b{Q})}$$
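These bounds can be sanity-checked on random vectors; a sketch using a randomly generated positive definite matrix of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q = A @ A.T + 4 * np.eye(4)              # random symmetric positive definite
Qinv = np.linalg.inv(Q)
# eigvalsh returns eigenvalues in ascending order
lmin, lmax = np.linalg.eigvalsh(Q)[[0, -1]]

for _ in range(1000):
    x = rng.standard_normal(4)
    r = (x @ x) ** 2 / ((x @ Q @ x) * (x @ Qinv @ x))
    assert lmin / lmax <= r <= lmax / lmin
```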
In the steepest descent algorithm, we have $\b{x}^{(k)} \to \b{x}^* $ for any $\b{x}^{(0)} $.
For the fixed step size gradient algorithm, $\b{x}^{(k)} \to \b{x}^*$ for any $\b{x}^{(0)} $ if and only if
$$0 < \alpha < \frac{ 2 }{ \lambda_{\text{max}}(\b{Q})} $$
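Both directions of this condition are easy to observe numerically: a step size inside the interval contracts the error, while one outside makes the component along the largest eigenvector grow. A sketch on an illustrative quadratic of my own choosing:

```python
import numpy as np

Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative Q and b
b = np.array([1.0, 2.0])
x_star = np.linalg.solve(Q, b)
lmax = np.linalg.eigvalsh(Q)[-1]         # largest eigenvalue of Q

def run(alpha, steps=200):
    x = np.array([5.0, -5.0])
    for _ in range(steps):
        x = x - alpha * (Q @ x - b)      # fixed step size gradient iteration
    return np.linalg.norm(x - x_star)

err_in = run(1.0 / lmax)                 # inside (0, 2/lmax): converges
err_out = run(2.1 / lmax)                # outside the interval: diverges
```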
Convergence Rate
In the method of steepest descent applied to the quadratic function, at every step $k$, we have
$$ V(\b{x}^{(k + 1)}) \leq \pare{ \frac{ \lambda_\text{max}(\b{Q}) - \lambda_\text{min}(\b{Q})}{ \lambda_\text{max}(\b{Q})}} V(\b{x}^{(k)}) $$
Sorry, the note for this chapter stops here. I will probably come back to it in the future.