By: Nick Greenquist

Introduction

In this post we will derive the math necessary to implement Pegasos SVM. Pegasos is stochastic subgradient descent for the SVM with a special schedule for the step-size.

After understanding the math, we will be able to implement Pegasos SVM with confidence in Python and see it in action.

Calculating Subgradients

Before we can implement SVM, we need to understand what a subgradient is.

Recall that a vector $g\in R^{d}$ is a subgradient of $f:R^{d}\to R$ at $x$ if for all $z$,

\[\begin{eqnarray} f(z)\ge f(x)+g^{T}(z-x). \end{eqnarray}\]

There may be $0$, $1$, or infinitely many subgradients at any point. The subdifferential of $f$ at a point $x$, denoted $\partial f(x)$, is the set of all subgradients of $f$ at $x$.
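To make this definition concrete, here is a tiny numerical check (a sketch using NumPy; the function and test points are arbitrary illustrative choices). For $f(x) = |x|$, any $g \in [-1, 1]$ is a subgradient at $x = 0$, and we can verify the subgradient inequality on a grid of points:

```python
import numpy as np

# f(x) = |x| is convex but not differentiable at x = 0.
# Any g in [-1, 1] is a subgradient of f at x = 0; try g = 0.5.
f = np.abs
x, g = 0.0, 0.5

# Check f(z) >= f(x) + g * (z - x) on a grid of candidate points z.
z = np.linspace(-5, 5, 101)
assert np.all(f(z) >= f(x) + g * (z - x))
print("g = 0.5 satisfies the subgradient inequality at x = 0")
```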

First we will derive a property that will make our life easier for finding a subgradient of the hinge loss (used in SVM).

Subgradients for pointwise maximum of functions

Suppose $f_{1},\ldots,f_{m}:R^{d}\to R$ are convex functions, and

\[\begin{eqnarray} f(x)=\max_{i=1,\ldots,m}f_{i}(x). \end{eqnarray}\]

Let $k$ be any index for which $f_{k}(x)=f(x)$, and choose $g\in\partial f_{k}(x)$.

We are using the fact that a convex function on $R^{d}$ has a non-empty subdifferential at all points.

We now prove that $g\in\partial f(x)$.

To find a subgradient of the pointwise maximum at a point, we can choose any one of the functions that achieves the maximum at that point and take any subgradient of that function there.

We can use the definition of the subgradient to show:

\[\begin{eqnarray} f(z) \geq f_k(z) \geq f_k(x) + g^T(z - x) = f(x) + g^T(z - x) \end{eqnarray}\]

We can also state this result using active-set notation. Define

\[\begin{eqnarray} I(x) = \{\, i \mid f_i(x) = f(x) \,\} \end{eqnarray}\]

This defines the set of ‘active’ functions at $x$.

Weak result: to compute a subgradient of $f$ at $x$, we can choose any $k \in I(x)$ and take any subgradient of $f_k$ at $x$.

Strong result: the full subdifferential is the convex hull of the union of the subdifferentials of the active functions at $x$:

\[\begin{eqnarray} \partial f(x) = \mathrm{conv}\left( \bigcup_{i \in I(x)} \partial f_i(x) \right) \end{eqnarray}\]
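Here is a quick numerical illustration of the weak result (a sketch; the two affine functions are arbitrary examples). For $f(x) = \max(x, -x) = |x|$, the active function at $x = 2$ is $f_1(x) = x$, and its gradient $g = 1$ is indeed a subgradient of $f$ there:

```python
import numpy as np

# f(x) = max(f1(x), f2(x)) with f1(x) = x and f2(x) = -x, i.e. f(x) = |x|.
f1 = lambda x: x
f2 = lambda x: -x
f = lambda x: np.maximum(f1(x), f2(x))

x = 2.0   # at x = 2, f1 is the active function (f1(2) = 2 = f(2))
g = 1.0   # gradient of the active function f1

# The subgradient inequality f(z) >= f(x) + g * (z - x) should hold for all z.
z = np.linspace(-5, 5, 101)
assert np.all(f(z) >= f(x) + g * (z - x))
print("The gradient of the active function is a subgradient of the max")
```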

Hinge Loss

\[\begin{eqnarray} J(w)=\max\left\{ 0,1-yw^{T}x\right\} . \end{eqnarray}\]

This function is not differentiable everywhere, but it does have a subgradient with respect to the parameters $w$ of a linear SVM with score function $w^{T}x$ and margin $yw^{T}x$.

Subgradient of hinge loss for linear prediction

Subgradient of

\[\begin{eqnarray} J(w)=\max\left\{ 0,1-yw^{T}x\right\} . \end{eqnarray}\]

We can compute the subgradient as follows:

\[\begin{eqnarray} \partial_w \max\left\{ 0,1-yw^{T}x\right\} \end{eqnarray}\] \[\begin{eqnarray} \partial_w \begin{cases} 0 & yw^Tx \geq 1 \\ 1 - yw^Tx & yw^Tx < 1 \\ \end{cases} \end{eqnarray}\] \[\begin{eqnarray} \begin{cases} \partial_w 0 & yw^Tx \geq 1 \\ \partial_w(1 - yw^Tx) & yw^Tx < 1 \\ \end{cases} \end{eqnarray}\]

We get the final subgradient as:

\[\begin{eqnarray} g = \begin{cases} 0 & yw^Tx \geq 1 \\ -yx & yw^Tx < 1 \\ \end{cases} \end{eqnarray}\]
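Translated into code, this case analysis becomes a small helper (a hedged sketch in NumPy; the function name is mine, and at the kink $yw^Tx = 1$, where any choice between $0$ and $-yx$ is valid, I return $0$):

```python
import numpy as np

def hinge_subgradient(w, x, y):
    """A subgradient of max(0, 1 - y * w^T x) with respect to w.

    At the kink y * w^T x == 1 any convex combination of 0 and -y*x
    is a valid subgradient; we pick 0.
    """
    margin = y * np.dot(w, x)
    if margin >= 1:
        return np.zeros_like(w)
    return -y * x

# Example: a point on the wrong side of the margin.
w = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
y = 1.0
print(hinge_subgradient(w, x, y))   # -> [-0.5 -0.5], since y * w^T x = -0.5 < 1
```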

Support Vector Machine via Pegasos

SVM objective function:

\[\begin{eqnarray} \min_{w\in R^{d}}\frac{\lambda}{2}\|w\|^{2}+\frac{1}{m}\sum_{i=1}^{m}\max\left\{ 0,1-y_{i}w^{T}x_{i}\right\} \end{eqnarray}\]

Note that, for simplicity, we are leaving off the unregularized bias term $b$.
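For reference, the objective is easy to write down in NumPy (a sketch; `X` is assumed to be an $m \times d$ matrix of inputs, `y` a vector of $\pm 1$ labels, and the function name is my own):

```python
import numpy as np

def svm_objective(w, X, y, lam):
    """Regularized average hinge loss: lam/2 * ||w||^2 + mean(max(0, 1 - y * Xw))."""
    margins = y * (X @ w)
    hinge = np.maximum(0.0, 1.0 - margins)
    return 0.5 * lam * np.dot(w, w) + np.mean(hinge)
```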

Pegasos is stochastic subgradient descent using a step size rule $\eta_{t}=1/\left(\lambda t\right)$.

The Pegasos pseudocode is sketched below.
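This is not a verbatim copy of the published pseudocode but a minimal Python sketch of it (the function signature, the number of epochs, and the random shuffling are my own choices); the per-example update rules are exactly the ones derived in the rest of this post:

```python
import numpy as np

def pegasos(X, y, lam, max_epochs=10):
    """Pegasos: stochastic subgradient descent on the SVM objective
    with step size eta_t = 1 / (lam * t)."""
    m, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(max_epochs):
        for j in np.random.permutation(m):
            t += 1
            eta = 1.0 / (lam * t)
            if y[j] * np.dot(w, X[j]) < 1:
                w = (1 - eta * lam) * w + eta * y[j] * X[j]
            else:
                w = (1 - eta * lam) * w
    return w
```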

Stochastic SVM objective function

Let’s derive the SVM objective function with a single training point.

If $i$ is selected uniformly from the set $\{1,\ldots,m\}$, then this stochastic objective function has the same expected value as the full SVM objective function:

\[\begin{eqnarray} J_{i}(w) = \frac{\lambda}{2}\|w\|^{2} + \max\left\{ 0,1-y_{i}w^{T}x_{i}\right\} \end{eqnarray}\]
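We can sanity-check this claim numerically: averaging $J_i(w)$ over all $i$ recovers the full objective exactly (a sketch with randomly generated, purely illustrative data):

```python
import numpy as np

def J_i(w, x_i, y_i, lam):
    """Stochastic objective for a single training point."""
    return 0.5 * lam * np.dot(w, w) + max(0.0, 1.0 - y_i * np.dot(w, x_i))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.choice([-1.0, 1.0], size=50)
w = rng.normal(size=3)
lam = 0.1

full = 0.5 * lam * np.dot(w, w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))
avg = np.mean([J_i(w, X[i], y[i], lam) for i in range(len(y))])
print(np.isclose(avg, full))   # True
```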

The function $J_{i}(w)$ is not differentiable everywhere. Let’s find the gradient of $J_{i}(w)$ where it’s defined, and also show where it is not defined.

Gradient of SVM Objective Function

First, let’s give an expression for the gradient of $J_i(w)$:

\[\begin{eqnarray} \partial_w (\frac{\lambda}{2}\|w\|^{2}+\max\left\{ 0,1-y_{i}w^{T}x_{i}\right\}) \end{eqnarray}\]

The gradient is undefined at $y_iw^Tx_i = 1$:

\[\begin{eqnarray} \partial_w \begin{cases} \frac{\lambda}{2}\|w\|^{2} & y_iw^Tx_i > 1 \\ \frac{\lambda}{2}\|w\|^{2} + 1 - y_iw^Tx_i & y_iw^Tx_i < 1 \\ \text{undefined} & y_iw^Tx_i = 1 \end{cases} \end{eqnarray}\]

Next, compute the gradient of each piece:

\[\begin{eqnarray} \begin{cases} \partial_w \frac{\lambda}{2}\|w\|^{2} & y_iw^Tx_i > 1 \\ \partial_w \left(\frac{\lambda}{2}\|w\|^{2} + 1 - y_iw^Tx_i\right) & y_iw^Tx_i < 1 \\ \text{undefined} & y_iw^Tx_i = 1 \end{cases} \end{eqnarray}\]

This gives the final gradient for $J_i(w)$:

\[\begin{eqnarray} \partial_w J_i(w) = \begin{cases} \lambda w & y_iw^Tx_i > 1 \\ \lambda w - y_ix_i & y_iw^Tx_i < 1 \\ \text{undefined} & y_iw^Tx_i = 1 \end{cases} \end{eqnarray}\]
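In code, this piecewise gradient becomes the following helper (a sketch; at the non-differentiable point $y_iw^Tx_i = 1$ it returns $\lambda w$, which the next subsection shows is a valid subgradient):

```python
import numpy as np

def J_i_subgradient(w, x_i, y_i, lam):
    """A subgradient of J_i(w) = lam/2 * ||w||^2 + max(0, 1 - y_i * w^T x_i)."""
    if y_i * np.dot(w, x_i) < 1:
        return lam * w - y_i * x_i
    return lam * w   # also covers the kink at y_i * w^T x_i == 1
```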

Subgradient of $J_{i}(w)$

Let’s show that the subgradient of $J_{i}(w)$ is given by:

\[\begin{eqnarray} g & = & \begin{cases} \lambda w-y_{i}x_{i} & \mbox{for }y_{i}w^{T}x_{i}<1\\ \lambda w & \mbox{for }y_{i}w^{T}x_{i}\ge1. \end{cases} \end{eqnarray}\]

Let’s define a few RULES that we can use to prove the subgradient is correct:

1) If $f_{1},\ldots,f_{m}:R^{d} \to R$ are convex functions and $f=f_{1}+\cdots+f_{m}$, then $\partial f(x)=\partial f_{1}(x)+\cdots+\partial f_{m}(x)$.

2) For $\alpha\ge0$, $\partial(\alpha f)(x)=\alpha\partial f(x)$.

We are also going to reuse what we proved in the beginning of this post.

The Proof

First we want to show that $J_i(w)$ is a sum of convex functions. The first function is:

\[\begin{eqnarray} \frac{\lambda}{2}\|w\|^2 \end{eqnarray}\]

This function is convex, as it is a bowl-shaped quadratic. We can also invoke RULE 2 to pull the constant out of the subdifferential:

\[\begin{eqnarray} \partial_w \frac{\lambda}{2}\|w\|^2 = \frac{\lambda}{2} \partial_w \|w\|^2 \end{eqnarray}\]

This function is differentiable at all points, and its subdifferential is:

\[\begin{eqnarray} \partial_w \frac{\lambda}{2}\|w\|^2 = \left\{ \lambda w \right\} \end{eqnarray}\]

The next function is:

\[\begin{eqnarray} \max\left\{ 0,1-y_{i}w^{T}x_{i}\right\} \end{eqnarray}\]

Note that this function is convex: it is the pointwise maximum of two affine (hence convex) functions, and a pointwise maximum of convex functions is convex.

The subdifferential of this is:

\[\begin{eqnarray} \begin{cases} 0 & y_iw^Tx_i \geq 1 \\ -y_ix_i & y_iw^Tx_i < 1 \end{cases} \end{eqnarray}\]

Using RULE 1 we can combine these two subdifferentials:

\[\begin{eqnarray} \begin{cases} \lambda w & y_iw^Tx_i \geq 1 \\ \lambda w - y_ix_i & y_iw^Tx_i < 1 \end{cases} \end{eqnarray}\]

This matches the subgradient we set out to show:

\[\begin{eqnarray} g = \begin{cases} \lambda w-y_{i}x_{i} & \mbox{for }y_{i}w^{T}x_{i}<1\\ \lambda w & \mbox{for }y_{i}w^{T}x_{i}\ge1. \end{cases} \end{eqnarray}\]

The Step Size

If our step size rule is $\eta_{t}=1/(\lambda t)$, then doing SGD with the subgradient direction from the above proof is the same as the Pegasos pseudocode.

Let’s show this is true:

\[\begin{eqnarray} w_{t+1} = w_{t} - \eta_{t}\,\partial_w J_i(w_t) \end{eqnarray}\]

If we use the step size $\eta_{t} = \frac{1}{\lambda t}$, we get the following update when $y_iw_t^Tx_i < 1$:

\[\begin{eqnarray} w_{t+1} = w_{t} - \eta_{t}(\lambda w_t - y_ix_i) \end{eqnarray}\] \[\begin{eqnarray} w_{t+1} = w_{t} - \eta_{t}\lambda w_t + \eta_{t}y_ix_i \end{eqnarray}\] \[\begin{eqnarray} w_{t+1} = w_t(1 - \eta_t \lambda) + \eta_{t}y_ix_i \end{eqnarray}\] \[\begin{eqnarray} w_{t+1} = (1 - \eta_t \lambda)w_t + \eta_{t}y_ix_i \end{eqnarray}\]

This is equivalent to this pseudocode:

If $y_{j}w_{t}^{T}x_{j}<1$
$w_{t+1}=(1-\eta_{t}\lambda)w_{t}+\eta_{t}y_{j}x_{j}$

Similarly, with $\eta_{t} = \frac{1}{\lambda t}$, we get the following update when $y_iw_t^Tx_i \geq 1$:

\[\begin{eqnarray} w_{t+1} = w_{t} - \eta_{t}(\lambda w_t) \end{eqnarray}\] \[\begin{eqnarray} w_{t+1} = w_{t} - \eta_{t} \lambda w_t \end{eqnarray}\] \[\begin{eqnarray} w_{t+1} = w_{t}(1 - \eta_{t} \lambda) \end{eqnarray}\] \[\begin{eqnarray} w_{t+1} = (1 - \eta_{t} \lambda)w_{t} \end{eqnarray}\]

This is equivalent to this pseudocode:

Else
$w_{t+1}=(1-\eta_{t}\lambda)w_{t}$
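As a final check, both pseudocode branches are just the generic SGD step $w_{t+1} = w_t - \eta_t g$ with the subgradient $g$ derived above. Here is a quick numerical confirmation (a sketch with arbitrary illustrative values):

```python
import numpy as np

w, x, y, lam, t = np.array([0.3, -0.7]), np.array([1.0, 2.0]), 1.0, 0.1, 5
eta = 1.0 / (lam * t)

g = lam * w - y * x if y * np.dot(w, x) < 1 else lam * w   # subgradient of J_i
sgd_step = w - eta * g

# Pegasos branch: margin y * w^T x = -1.1 < 1, so the "if" update applies.
pegasos_step = (1 - eta * lam) * w + eta * y * x
print(np.allclose(sgd_step, pegasos_step))   # True
```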

Conclusion

In this post we derived all the math necessary to feel confident that implementing Pegasos SVM will work. In my next post, I will step-by-step show how to turn this math into working Python code. Stay tuned!