8 The Hessian
In Chapter 6 we exploited the second derivative of a one variable real function to analyze convexity along with local minima and maxima. In this chapter we introduce an analogue of the second derivative for real functions of several variables. This will be an $n \times n$ matrix. The important notion of a matrix being positive (semi-)definite introduced in Section 3.7 will now make its appearance.

8.1 Introduction
In Section 6.4 the Taylor expansion for a one variable differentiable function $f$ centered at $x_0$ with step size $h$ was introduced as
$$f(x_0 + h) = f(x_0) + f'(x_0)\,h + \tfrac{1}{2}\,f''(x_0)\,h^2 + \epsilon(h)\,h^2. \tag{8.1}$$
Recall that the second derivative contains a wealth of information about the function. Especially if $f'(x_0) = 0$, then we might glean from $f''(x_0)$ if $x_0$ is a local maximum or minimum or none of these (see Theorem 6.43 and review Exercise 6.46). We also noticed that gradient descent did not work so well when only descending along the gradient. We need to take the second derivative into account to get a more detailed picture of the function.

8.2 Several variables
Our main character is a differentiable function $f\colon \mathbb{R}^n \to \mathbb{R}$ in several variables. We already know that
$$f(x + h) = f(x) + \nabla f(x) \cdot h + \epsilon(h)\,|h|,$$
where $x$ and $h$ are vectors in $\mathbb{R}^n$ (as opposed to the good old numbers in (8.1)). Take a look back at Definition 7.5 for the general definition of differentiability. We wish to have an analogue of the Taylor expansion in (8.1) for such a function of several variables. To this end we introduce the function $g\colon \mathbb{R} \to \mathbb{R}$ given by
$$g(t) = f(x + t h). \tag{8.2}$$
Notice that $g = f \circ \gamma$, where $\gamma\colon \mathbb{R} \to \mathbb{R}^n$ is the function given by $\gamma(t) = x + t h$. In particular we get
$$g'(t) = \nabla f(x + t h) \cdot h \tag{8.3}$$
by using the chain rule (see Theorem 7.24).
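Here is a quick symbolic sketch checking (8.3) in Sage. The function, base point and step are chosen arbitrarily for illustration; none of them come from the text.

    x1, x2, t = var('x1 x2 t')
    f = x1^2*x2 + sin(x1)                            # a hypothetical differentiable function
    p = vector([1, 2]); h = vector([3, -1])          # base point x and step h, chosen arbitrarily
    g = f.subs(x1=p[0] + t*h[0], x2=p[1] + t*h[1])   # g(t) = f(x + t h) as in (8.2)
    grad = f.gradient()                              # (df/dx1, df/dx2)
    rhs = sum(grad[i].subs(x1=p[0] + t*h[0], x2=p[1] + t*h[1])*h[i] for i in range(2))
    print((diff(g, t) - rhs).simplify_full())        # prints 0, in accordance with (8.3)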
Explain how the chain rule is applied to get (8.3).
The derivative $g'$ is itself composed of several functions and again we may compute $g''(t)$ by using the chain rule:
$$g''(t) = \nabla F(x + t h) \cdot h,$$
where $F\colon \mathbb{R}^n \to \mathbb{R}$ is defined by
$$F(z) = \nabla f(z) \cdot h$$
and $\gamma\colon \mathbb{R} \to \mathbb{R}^n$ by
$$\gamma(t) = x + t h,$$
so that $g' = F \circ \gamma$.
The Hessian matrix $Hf(x)$ of $f$ at the point $x \in \mathbb{R}^n$ is defined by
$$Hf(x) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1 \partial x_1}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n}(x) \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1}(x) & \cdots & \dfrac{\partial^2 f}{\partial x_n \partial x_n}(x) \end{pmatrix}.$$
Why is the Hessian matrix symmetric?
Suppose that $f$ is given by
Then the gradient $\nabla f$ and the Hessian $Hf$ of $f$ are computed in the Sage window below. See the further documentation for Calculus functions in Sage.
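The original Sage window is not reproduced here; the following sketch shows the relevant Sage commands for a stand-in function (not the function from Example 8.4):

    x, y = var('x y')
    f = x^2*y + exp(x*y)          # a stand-in for the function in Example 8.4
    print(f.gradient())           # the gradient of f
    print(f.hessian())            # the Hessian matrix of f; note that it is symmetric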
Verify (just this once) by hand the computations done by Sage in Example 8.4. Also, experiment with a few other functions in the Sage window and compute their Hessians.
By applying Proposition 7.11 it is not too hard to see that the Hessian matrix fits nicely into the framework above, since
$$\nabla F(z) = \nabla\bigl(\nabla f(z) \cdot h\bigr) = Hf(z)\, h. \tag{8.5}$$
The full application of the chain rule then gives
$$g''(t) = h^T\, Hf(x + t h)\, h. \tag{8.6}$$
Give a detailed explanation as to why (8.5) holds.
8.3 Newton's method for finding critical points
We may use Newton's method for computing critical points for a function of several variables. Recall that a critical point is a point $x$ with $\nabla f(x) = 0$. By (7.8) and (8.5) the computation in Newton's method becomes
$$x_{k+1} = x_k - Hf(x_k)^{-1}\, \nabla f(x_k). \tag{8.7}$$
In practice the (inverse) Hessian appearing in (8.7) is often a heavy computational burden. This leads to the so-called quasi-Newton methods, where the inverse Hessian in (8.7) is replaced by other matrices.
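A minimal Sage sketch of the iteration (8.7) is given below. The function, the starting point and the number of iterations are chosen arbitrarily for illustration and do not come from the text.

    x, y = var('x y')
    f = x^4 + y^4 - 4*x*y                    # hypothetical function with critical points (0,0), (1,1), (-1,-1)
    grad = f.gradient()                      # the gradient of f
    H = f.hessian()                          # the Hessian matrix of f

    pt = vector(RDF, [0.8, 0.9])             # starting guess
    for k in range(10):
        gv = vector(RDF, [gi.subs(x=pt[0], y=pt[1]) for gi in grad])
        Hm = matrix(RDF, [[hij.subs(x=pt[0], y=pt[1]) for hij in row] for row in H])
        pt = pt - Hm.inverse()*gv            # the Newton step (8.7)
    print(pt)                                # approaches the critical point (1, 1) from this starting guess

In a serious implementation one would solve the linear system $Hf(x_k)\,d = \nabla f(x_k)$ rather than forming the inverse explicitly.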
We will return to the logistic regression in Example 7.34 about the
Challenger disaster. Here we sought to maximize the function
In order to employ Newton's method we compute the gradient and the Hessian of (8.8), where
$$s(z) = \frac{1}{1 + e^{-z}}$$
is the sigmoid function. Notice the potential problem in using Newton's method here: the formula for the second order derivatives in (8.9) shows that if the input data are just mildly large, then the Hessian is extremely close to the zero matrix; Sage therefore considers it non-invertible and (8.7) fails. In the code below we have nudged the initial vector so that it works, but you can easily set other values and see it fail. Optimization is not just mathematics; it also calls for some good (engineering) implementation skills (see for example details on the quasi-Newton algorithms). In the instance below we do, however, get a gradient that is practically zero.

Code for Newton's method
8.3.1 Transforming data for better numerical performance
The numerical problems with Newton's method in Example 8.7 can be prevented by transforming the input data. It makes sense to transform data from large numbers to smaller numbers around $0$. There is a rather standard way of doing this. Suppose in logistic regression we have a set of data $t_1, \dots, t_N$ associated with outcomes $y_1, \dots, y_N \in \{0, 1\}$. Then the function $\ell$ from Example 7.32 becomes much more manageable if we first transform the data according to
$$s_i = \frac{t_i - \mu}{\sigma} \tag{8.10}$$
and instead optimize the function $\tilde{\ell}$ obtained from $\ell$ by replacing $t_i$ by $s_i$. Here $\mu$ is the mean value and $\sigma^2$ the variance of the data in (8.10). Now if $(\alpha, \beta)$ is an optimum for $\tilde{\ell}$, then $(\alpha/\sigma, \beta - \alpha\mu/\sigma)$ is an optimum for $\ell$, since
$$\alpha\,\frac{t - \mu}{\sigma} + \beta = \frac{\alpha}{\sigma}\,t + \Bigl(\beta - \frac{\alpha\mu}{\sigma}\Bigr). \tag{8.11}$$
Why is the claim/trick alluded to in (8.11) true?

Below is a snippet of Sage code implementing the trick in (8.11).
The function test takes as input x0 (an initial vector like [0,0]) and
noofits (the number of iterations of Newton's method). You execute this in the Sage
window by adding for example
test([0,0], 10)
and then pressing Compute. Experiment and compare with the official output from Example 7.34. Also, compute the gradient of the output below for the original non-transformed problem.

Transformed code
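The original Sage window is not reproduced here. Below is a minimal sketch, in the spirit of the description above, of what such a test function could look like. The data values, the variable names and the concrete form of the log-likelihood are assumptions for illustration; they are not the actual Challenger data or the code from Example 7.34.

    a, b = var('a b')
    tdata = [53, 57, 63, 66, 70, 75, 81]          # placeholder data (not the Challenger data)
    ydata = [1, 1, 0, 1, 0, 0, 0]                 # placeholder outcomes in {0, 1}

    mu = sum(tdata)/len(tdata)                    # mean value of the data
    sigma = sqrt(sum((t - mu)^2 for t in tdata)/len(tdata)).n()   # standard deviation
    sdata = [(t - mu)/sigma for t in tdata]       # transformed data as in (8.10)

    def s(z):                                     # the sigmoid function
        return 1/(1 + exp(-z))

    # assumed form of the (transformed) log-likelihood from (8.8)
    l = sum(y*log(s(a*t + b)) + (1 - y)*log(1 - s(a*t + b)) for t, y in zip(sdata, ydata))
    grad = l.gradient()
    H = l.hessian()

    def test(x0, noofits):
        pt = vector(RDF, x0)
        for k in range(noofits):
            gv = vector(RDF, [gi.subs(a=pt[0], b=pt[1]) for gi in grad])
            Hm = matrix(RDF, [[hij.subs(a=pt[0], b=pt[1]) for hij in row] for row in H])
            pt = pt - Hm.inverse()*gv             # Newton step (8.7) on the transformed problem
        return (pt[0]/sigma, pt[1] - pt[0]*mu/sigma)   # transform back as in (8.11)

    print(test([0, 0], 10))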
8.4 The Taylor series in several variables

Now we are in a position to state at least the first terms in the Taylor expansion for a differentiable function $f\colon \mathbb{R}^n \to \mathbb{R}$. The angle of the proof is to reduce to the one-dimensional case through the function $g$ defined in (8.2). Here one may prove that
$$g(t) = g(0) + g'(0)\,t + \tfrac{1}{2}\,g''(0)\,t^2 + \epsilon(t)\,t^2, \tag{8.12}$$
where $\epsilon$ is a function with $\epsilon(0) = 0$ and continuous at $0$, much like in the definition of differentiability except that we also include the second derivative. Now (8.12) translates into
$$f(x + h) = f(x) + \nabla f(x)\cdot h + \tfrac{1}{2}\, h^T\, Hf(x)\, h + \epsilon(h)\,|h|^2 \tag{8.13}$$
by using (8.3) and (8.6). From (8.13) one reads the following nice criterion, which may be viewed as a several variable generalization of Theorem 6.43.
Let $x_0$ be a critical point for $f$. Then
- $x_0$ is a local minimum if $Hf(x_0)$ is positive definite.
- $x_0$ is a local maximum if $-Hf(x_0)$ is positive definite (here we call $Hf(x_0)$ negative definite).
- $x_0$ is a saddle point if $Hf(x_0)$ is indefinite.
We need to clarify two things concerning the above theorem.
- An indefinite matrix $A$ is a symmetric matrix with the property that there exist vectors $v$ and $w$ with $v^T A v > 0$ and $w^T A w < 0$, i.e., neither $A$ nor $-A$ is positive semidefinite.
- A saddle point $x_0$ for $f$ is defined by the existence of two vectors $v$ and $w$, such that $t \mapsto f(x_0 + t v)$ has a local minimum and $t \mapsto f(x_0 + t w)$ has a local maximum at $t = 0$, as illustrated in the graphics below.
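In Sage one can apply Theorem 8.9 by computing the Hessian at each critical point and checking its definiteness. The sketch below uses a hypothetical function (not one from the text) and the determinant/trace test, which decides definiteness for symmetric 2 x 2 matrices:

    x, y = var('x y')
    f = x^3 - 3*x + y^2                      # hypothetical function with critical points (1,0) and (-1,0)
    for sol in solve(list(f.gradient()), x, y):
        p = [eq.rhs() for eq in sol]         # a critical point
        Hm = matrix(QQ, [[hij.subs(x=p[0], y=p[1]) for hij in row] for row in f.hessian()])
        d = Hm.det(); tr = Hm.trace()
        if d > 0 and tr > 0:
            print(p, 'local minimum')        # Hessian positive definite
        elif d > 0 and tr < 0:
            print(p, 'local maximum')        # -Hessian positive definite
        elif d < 0:
            print(p, 'saddle point')         # Hessian indefinite
        else:
            print(p, 'Theorem 8.9 gives no conclusion')

For a symmetric 2 x 2 matrix the determinant is the product and the trace the sum of the two eigenvalues, which is why these sign tests detect the positive definite, negative definite and indefinite cases.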
Consider, with our new technology in Theorem 8.9, Exercise 7.22 once again.
Here we analyzed the point for the function
and showed (by a trick) that it is neither a local maximum nor a local minimum for the function. The Hessian matrix of the function at this point is
Now Theorem 8.9(iii) shows that the point is a saddle point, since
and
Try plotting the graph for different values of a (a = 4 shows the saddle point clearly) in the Sage window in Example 8.11. What do you observe for the point with respect to the function? Does a have to be a number? Could it be a symbolic expression in the variables x and y like a = -10*cos(x)*sin(y)?
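A minimal plotting sketch for this exercise (the function from Example 8.11 is not reproduced above, so a placeholder with a parameter a is used here):

    x, y = var('x y')
    a = 4                                    # try other values, or a symbolic expression such as -10*cos(x)*sin(y)
    f = x^2 + y^2 + a*x*y                    # placeholder for the function in Example 8.11
    plot3d(f, (x, -2, 2), (y, -2, 2))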
Check the computation of the Hessian matrix in Example 8.11 by showing
that the Hessian matrix for at the point is
Give an example of a function having a local minimum at a point where its Hessian matrix is not positive definite.
The following exercise is a sci2u exercise from the Calculus book.
Consider the function
Compute its critical points and decide on their types according to Theorem 8.9.
Try to convince yourself that
for every .
Look at the minimization problem
subject to
where is a big number.
Give an example of a function $f$ that has a local maximum, but where there exists $x$ with $f(x) > M$ for any given (large) number $M$.
8.5 Convex functions of several variables
Below is the generalization of Theorem 6.53 to several variables. You have already seen this in Exercise 6.54, right?
Let $f\colon U \to \mathbb{R}$ be a differentiable function, where $U \subseteq \mathbb{R}^n$ is an open convex subset. Then $f$ is convex if and only if
$$f(y) \ge f(x) + \nabla f(x) \cdot (y - x) \tag{8.14}$$
for every $x, y \in U$.
Suppose that (8.14) holds and let $x, y \in U$ along with $t$, where $0 \le t \le 1$. To prove that $f$ is convex we must verify the inequality
$$f((1 - t)x + t y) \le (1 - t) f(x) + t f(y). \tag{8.15}$$
Let $z = (1 - t)x + t y$. Then
$$f(x) \ge f(z) + \nabla f(z) \cdot (x - z) \qquad\text{and}\qquad f(y) \ge f(z) + \nabla f(z) \cdot (y - z)$$
by (8.14). If you multiply the first inequality by $1 - t$, the second by $t$ and then add the two, you get (8.15).

Suppose on the other hand that $f$ is a convex function. Let $x, y \in U$. Since $U$ is an open subset, it follows that $x + t(y - x) \in U$ for $t \in (-\delta, 1 + \delta)$, where $\delta > 0$ is sufficiently small. Now define the function $g\colon (-\delta, 1 + \delta) \to \mathbb{R}$ by
$$g(t) = f(x + t(y - x)).$$
Being the composition of two differentiable functions, $g$ is differentiable. Suppose that $t_1, t_2 \in (-\delta, 1 + \delta)$ and $0 \le \lambda \le 1$. Then
$$g((1 - \lambda)t_1 + \lambda t_2) = f\bigl((1 - \lambda)(x + t_1(y - x)) + \lambda(x + t_2(y - x))\bigr) \le (1 - \lambda) g(t_1) + \lambda g(t_2),$$
showing that $g$ is a convex function. By Theorem 6.53,
$$g(1) \ge g(0) + g'(0),$$
which translates into
$$f(y) \ge f(x) + \nabla f(x) \cdot (y - x)$$
by using the chain rule in computing $g'(0)$.
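A quick numerical illustration of the inequality (8.14) in Sage, for a hypothetical convex function and a few arbitrarily chosen points:

    x, y = var('x y')
    f = exp(x) + x^2 + y^2                   # a convex function (assumed example)
    grad = f.gradient()
    pts = [(-1, 2), (0.5, -0.3), (2, 1)]     # sample points, chosen arbitrarily
    for p in pts:
        for q in pts:
            lhs = RDF(f.subs(x=q[0], y=q[1]))
            rhs = RDF(f.subs(x=p[0], y=p[1])) + sum(RDF(grad[i].subs(x=p[0], y=p[1]))*(q[i] - p[i]) for i in range(2))
            print(lhs >= rhs - 1e-12)        # True in every case, as (8.14) predicts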
Prove that a bounded convex differentiable function is
constant.
The following is the generalization of Corollary 6.48.
Let $f\colon U \to \mathbb{R}$ be a differentiable function with continuous second order partial derivatives, where $U \subseteq \mathbb{R}^n$ is a convex open subset. Then $f$ is convex if and only if the Hessian $Hf(x)$ is positive semidefinite for every $x \in U$. If $Hf(x)$ is positive definite for every $x \in U$, then $f$ is strictly convex.
We have done all the work for a convenient reduction to the one variable case. Suppose that $f$ is convex. Then the same reasoning as in the proof of Theorem 8.19 shows that
$$g(t) = f(x + t h)$$
is a convex function for every $x \in U$ and every $h \in \mathbb{R}^n$, from an open interval $(-\delta, \delta)$ to $\mathbb{R}$ for suitable $\delta > 0$. Therefore $g''(0) = h^T\, Hf(x)\, h \ge 0$ by Theorem 6.47. This proves that the matrix $Hf(x)$ is positive semidefinite for every $x \in U$.

Suppose on the other hand that $Hf(x)$ is positive semidefinite for every $x \in U$. Then Theorem 6.47 shows that
$$g(t) = f(x + t(y - x))$$
is a convex function from $(-\delta, 1 + \delta)$ to $\mathbb{R}$ for small $\delta > 0$ and $x, y \in U$, since
$$g''(t) = (y - x)^T\, Hf(x + t(y - x))\,(y - x) \ge 0$$
for $t \in (-\delta, 1 + \delta)$. Therefore $f$ is a convex function, since
$$f((1 - t)x + t y) = g(t) \le (1 - t) g(0) + t g(1) = (1 - t) f(x) + t f(y).$$
The same argument (using the last part of Theorem 6.47 on strict convexity) shows that $g$ is strictly convex if $Hf(x + t(y - x))$ is positive definite. It follows that $f$ is strictly convex if $Hf(x)$ is positive definite for every $x \in U$.
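Here is a small symbolic check of Theorem 8.21 in Sage for a hypothetical convex function: the quadratic form of the Hessian of $\log(e^x + e^y)$ is non-negative at every point, so the Hessian is positive semidefinite everywhere.

    x, y, v1, v2 = var('x y v1 v2')
    f = log(exp(x) + exp(y))                 # a convex function (assumed example)
    H = f.hessian()
    v = vector([v1, v2])
    q = (v*H*v).simplify_full()              # the quadratic form v^T Hf(x,y) v
    print(q.factor())                        # e^(x+y)*(v1 - v2)^2/(e^x + e^y)^2, which is >= 0

Note that the quadratic form vanishes for $v_1 = v_2$, so the Hessian is positive semidefinite but not positive definite; accordingly Theorem 8.21 only asserts convexity, not strict convexity, for this function.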
Prove that
is a strictly convex function from to . Also, prove that
is a convex subset of .
Is strictly convex on some non-empty
open convex subset of the plane?
Show that given by
is a convex function. Is strictly convex?
Let be given by
where .
- Show that is a strictly convex function if and only if
and . This is a hint for the only if part. If is the Hessian for , then where - this is seen by a matrix multiplication computation. We know that is positive semidefinite. If was not positive definite, there would exist with . Now use to complete the proof that is positive definite by looking at .
- Suppose now that and . Show that has a unique global minimum and give a formula for this minimum in terms of and .
8.6 How to decide the definiteness of a matrix
In this section we will outline a straightforward method for deciding if a matrix is positive definite, positive semidefinite, negative definite or indefinite. Before proceeding it is a must that you do the following exercise.
Show that a diagonal matrix
$$\begin{pmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_n \end{pmatrix}$$
is positive definite if and only if $\lambda_1 > 0, \dots, \lambda_n > 0$ and positive semidefinite if and only if $\lambda_1 \ge 0, \dots, \lambda_n \ge 0$. Also, show that if $A$ is a symmetric matrix, then $A$ is positive definite if and only if
$$B^T A B$$
is positive definite, where $B$ is an invertible matrix.
The crucial ingredient is the following result.
Let $A$ be a real symmetric $n \times n$ matrix. Then there exists an invertible matrix $B$, such that $B^T A B$ is a diagonal matrix.
Suppose that $A$ is an $n \times n$ matrix. If $A$ has a non-zero entry in the upper left hand corner, i.e., $a_{11} \neq 0$, then
$$A = \begin{pmatrix} a_{11} & b^T \\ b & C \end{pmatrix}
\qquad\text{and}\qquad
E^T A E = \begin{pmatrix} a_{11} & 0 \\ 0 & A_1 \end{pmatrix},$$
where $A_1 = C - \frac{1}{a_{11}}\, b\, b^T$ is a real symmetric $(n-1) \times (n-1)$ matrix and $E$ is the invertible matrix
$$E = \begin{pmatrix} 1 & -\frac{1}{a_{11}}\, b^T \\ 0 & I_{n-1} \end{pmatrix}.$$
By induction on $n$ we may find an invertible matrix $B_1$ such that
$$B_1^T A_1 B_1 \ \text{is a diagonal matrix.}$$
Putting
$$B = E \begin{pmatrix} 1 & 0 \\ 0 & B_1 \end{pmatrix}$$
it follows that
$$B^T A B = \begin{pmatrix} a_{11} & 0 \\ 0 & B_1^T A_1 B_1 \end{pmatrix}$$
is a diagonal matrix.
We now treat the case of a zero entry in the upper left hand corner, i.e., $a_{11} = 0$. Suppose first that $a_{jj} \neq 0$ for some $j$. Let $P$ denote the identity matrix with the first and $j$-th rows interchanged. The operation $A P$ amounts to interchanging the first and $j$-th columns in $A$. Similarly $P^T (A P)$ is interchanging the first and $j$-th rows in $A P$. The matrix $P$ is invertible and $P^T A P$ is a symmetric matrix with $(P^T A P)_{11} = a_{jj} \neq 0$, and we have reduced to the case of a non-zero entry in the upper left hand corner.

If $a_{jj} = 0$ for every $j$, we may assume that $a_{1j} \neq 0$ for some $j > 1$. Let $S$ denote the identity matrix where the entry in the first column and $j$-th row is $1$. The operation $A S$ amounts to adding the $j$-th column to the first column in $A$. Similarly $S^T (A S)$ is adding the $j$-th row to the first row in $A S$. All in all we get $(S^T A S)_{11} = 2\, a_{1j} \neq 0$, where we have used that $a_{11} = a_{jj} = 0$. Again we have reduced to the case of a non-zero entry in the upper left hand corner.
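The proof of Theorem 8.27 is constructive, and its steps translate directly into code. Below is a sketch (not the book's code) of a Sage function computing an invertible $B$ with $B^T A B$ diagonal for a symmetric matrix with rational entries; the signs of the resulting diagonal entries then decide the definiteness, as in the exercise above. The example matrix at the end is a hypothetical one.

    def diagonalize_symmetric(A):
        n = A.nrows()
        A = matrix(QQ, A)                    # work on a copy over the rationals
        B = identity_matrix(QQ, n)
        for k in range(n):
            if A[k, k] == 0:
                j = next((i for i in range(k + 1, n) if A[i, i] != 0), None)
                if j is not None:
                    E = copy(identity_matrix(QQ, n))
                    E.swap_columns(k, j)     # second step of the proof: interchange columns/rows k and j
                else:
                    j = next((i for i in range(k + 1, n) if A[k, i] != 0), None)
                    if j is None:
                        continue             # row and column k are already zero
                    E = copy(identity_matrix(QQ, n))
                    E[j, k] = 1              # third step of the proof: add column/row j to column/row k
                A = E.transpose()*A*E
                B = B*E
            E = copy(identity_matrix(QQ, n))
            for j in range(k + 1, n):
                E[k, j] = -A[k, j]/A[k, k]   # fundamental step of the proof: clear the entries next to the pivot
            A = E.transpose()*A*E
            B = B*E
        return A, B                          # A is now diagonal and equals B^T times the original matrix times B

    A0 = matrix(QQ, [[0, 1, 2], [1, 0, 3], [2, 3, 0]])   # a hypothetical symmetric matrix
    D, B = diagonalize_symmetric(A0)
    print(D)
    print(B.transpose()*A0*B == D)           # True
    # mixed signs on the diagonal of D show that A0 is indefinite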
Consider the real symmetric matrix.
Here . Therefore the fundamental step in the
proof of Theorem 8.27 applies and
and again
Summing up we get
You are invited to check that
Let
Here , but the diagonal element . So we are in the second step of the proof of
Theorem 8.27. Using the matrix
we get
As argued in the proof, this corresponds to interchanging the first
and third columns and then interchanging the first and third
rows. In total you move the non-zero diagonal entry to the upper left corner of the matrix.
Consider the symmetric matrix
We have zero entries in the diagonal. As in the third step in the
proof of Theorem 8.27 we must find an invertible matrix $S$, such that the upper left corner in $S^T A S$ is non-zero. In the proof it is used that every diagonal element is zero: if we locate a non-zero element in the $j$-th column in the first row, we can add the $j$-th column to the first column and then the $j$-th row to the first row, obtaining a non-zero element in the
upper left corner. For above we choose and the matrix
becomes
so that
Let $A$ be any matrix. Show that $A^T A$ is positive semidefinite.
Find inequalities defining the set
Same question with positive semidefinite. Sketch and compare the two
subsets of the plane .
Let denote the function given by
where . Let denote the Hessian of
in a point .
- Compute .
- Show that for and .
- Compute a non-zero vector , such that in the case, where . Is invertible in this case?
- Show that is strictly convex if .
- Is strictly convex if ? Hint: Consider the line segment between and a suitable vector , where .
Why is the subset given by the inequalities
a convex subset of ?