The answer is: yes, it is always quasi-convex! I'll show this by first proving a stronger characterization, from which the other facts follow. Please bear with me as I first make a few definitions.
Let a convex set $S \subseteq \mathbb{R}$ and a function $f : S \to \mathbb{R}$ be given. To avoid assuming the existence of derivatives, let $f'(v)$ denote any subgradient of $f$ at $v$, and say $f$ is convex if for any $x, v \in S$,
$$f(x) \geq f(v) + f'(v)(x - v).$$
(This is an equivalent formulation of convexity, and when $f$ is differentiable, it gives the 'first-order' definition of convexity.) Note critically that for $u, v \in S$ with $u \leq v$, it follows that
$$f'(u) \leq f'(v).$$
(This is sort of like the mean value theorem, though not exactly, since subgradients are technically sets; I think everything I've said so far appears in the thesis of Shai Shalev-Shwartz.) Define
$$b_x(v) = f(x) - f(v) - f'(v)(x - v)$$
to be the Bregman divergence of $f$ at the point $x$, taking the linear approximation at $v$. By the definition of convexity, it follows that $b_x(v) \geq 0$ for all $x, v \in S$.
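(A quick sanity check of the definition, my own example rather than anything needed below: taking $f(v) = v^2$ with $f'(v) = 2v$ gives
$$b_x(v) = x^2 - v^2 - 2v(x - v) = (x - v)^2,$$
the familiar squared-distance Bregman divergence, which in this particular case happens to be convex in $v$ as well.)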
Fact: $b_x(\cdot)$ is decreasing up to $x$, exactly zero at $x$, and increasing after $x$.
Proof. $b_x(x) = f(x) - f(x) - f'(x)(0) = 0$. Now consider $u \leq v \leq x$; we'd like to show $b_x(u) \geq b_x(v)$. To start, write
$$b_x(u) - b_x(v) = f(v) + f'(v)(x - v) - f(u) - f'(u)(x - u).$$
Now, using $f(v) \geq f(u) + f'(u)(v - u)$ yields
$$b_x(u) - b_x(v) \geq f'(v)(x - v) + f'(u)(v - x) = (f'(v) - f'(u))(x - v),$$
and $b_x(u) - b_x(v) \geq 0$ follows since $f'(v) \geq f'(u)$ and $x - v \geq 0$. The last case, that $x \leq v \leq u$ gives $b_x(u) \geq b_x(v)$, is proved analogously. QED.
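As a quick sanity check of my own (assuming for the moment that $f$ is twice differentiable, which the argument above does not need): differentiating in $v$ gives
$$\frac{d}{dv}\, b_x(v) = -f'(v) - f''(v)(x - v) + f'(v) = -f''(v)(x - v),$$
which is $\leq 0$ for $v \leq x$ and $\geq 0$ for $v \geq x$ since $f''(v) \geq 0$, matching the Fact.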
Some remarks:
- To see that this means $b_x(\cdot)$ is quasi-convex, take any $y \leq z$ and any $\lambda \in [0, 1]$. Then the point $w := \lambda y + (1 - \lambda) z$ lies on the line segment $[y, z]$, and $b_x(\cdot)$ must be increasing in the direction of at least one of these endpoints, so $b_x(w) \leq \max\{b_x(y), b_x(z)\}$, which is exactly quasi-convexity.
- This also gives a strong idea of how convexity breaks down for $b_x(\cdot)$. In particular, let $f(v) = \max\{0, |v| - 1\}$ (a 1-insensitive loss for regression). Then the function $b_0(\cdot)$ is $0$ on $(-1, 1)$ and $1$ everywhere else except at $-1$ and $+1$ (those points are different since, by using subgradients, these functions have sets as output; but if you took a differentiable analog of this loss, something like a Huber loss, you'd get basically the same effect, and $b_0(\cdot)$ would be a vanilla continuous, non-convex function). A small numerical sketch of this example appears below.
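To make that last example concrete, here is a minimal numerical sketch (my own illustration, not part of the argument above), assuming NumPy and one particular subgradient choice for the 1-insensitive loss: $f'(v) = \mathrm{sign}(v)$ for $|v| > 1$ and $f'(v) = 0$ for $|v| \leq 1$.

```python
import numpy as np

def f(v):
    # 1-insensitive loss: max(0, |v| - 1)
    return np.maximum(0.0, np.abs(v) - 1.0)

def fprime(v):
    # one valid subgradient at every point (0 is chosen at the kinks v = +/-1)
    return np.where(np.abs(v) <= 1.0, 0.0, np.sign(v))

def bregman(x, v):
    # b_x(v) = f(x) - f(v) - f'(v) (x - v)
    return f(x) - f(v) - fprime(v) * (x - v)

vs = np.linspace(-3.0, 3.0, 601)
b0 = bregman(0.0, vs)
print(b0[np.abs(vs) < 0.99].max())   # 0.0 inside (-1, 1)
print(b0[np.abs(vs) > 1.01].min())   # 1.0 outside [-1, 1]

# Convexity fails: the midpoint of v = 1 and v = 3 lies above the chord ...
y, z = 1.0, 3.0
mid = bregman(0.0, (y + z) / 2)                      # 1.0
chord = 0.5 * (bregman(0.0, y) + bregman(0.0, z))    # 0.5
print(mid, chord)
# ... yet quasi-convexity holds: mid <= max(bregman(0, y), bregman(0, z)) = 1.
```

With a different subgradient choice at the kinks (any value in $[0, 1]$ is a valid subgradient at $v = 1$), the value $b_0(1)$ shifts anywhere in $[0, 1]$, which is exactly the point the parenthetical above is making about those two points.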