By: Nick Greenquist


After taking a break from posting for over a year, in order to enjoy my first year at Google, I’ve decided to get back into writing some blog posts on Machine Learning. In this post, I will briefly introduce Backprop, and the math of how you would compute the partial derivate of Softmax (which would then be used in Backprop for a system that use a Softmax layer).


Backpropagation is how we calculate the gradient of the loss function of a neural network (with respect to its weights).

Chain Rule

The chain rule is the foundation for backpropagation.

If we are given an input $x$ and output $y$ both in $R^2$ and the error to the output is $\frac{∂L}{∂y}$.


\[\begin{eqnarray} y = W x + b \end{eqnarray}\]

where $W ∈ R^{mxn}$ and $x,b ∈ R^n$

Let’s compute the expression for $\frac{∂L}{∂W}$ and $\frac{∂L}{∂b}$ in terms of $\frac{∂L}{∂y}$ and $x$ using the chain rule.

Here is $\frac{\partial L}{\partial W}$

\[\begin{eqnarray} \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial W} \end{eqnarray}\]


\[\begin{eqnarray} \frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}x \end{eqnarray}\]

Here is $\frac{\partial L}{\partial B}$

\[\begin{eqnarray} \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y} \frac{\partial y}{\partial b} \end{eqnarray}\]


\[\begin{eqnarray} \frac{\partial L}{\partial b} = \frac{\partial L}{\partial y}1 \end{eqnarray}\]


softmax Source: Medium

Multinomial logistic regression takes logistic regression and generalzing it for multiple classes. The softmax expression normalizes n unconstrained values to n values that all sum to 1. Because the n values all sum to 1, you can view these values as probabilities.

Softmax is used as a final layer in many neural networks.

Can we backpropogate error through something more complex than the equation we saw above?

Here is the softmax expression which indicates the probability of the %j%th class:

\[\begin{eqnarray} P(z = j | x) = y_j = \frac{exp(βx_j)}{\sum_{i}exp(βx_i)} \end{eqnarray}\]

Ok so we have an equation, we have a label ($y), input ($x$), and weights ($β$). If we were to backprop error through this thing, we’d need to find partial derivates and use them in the chain rule.

Let’s find this partial derivaitve: $\frac{βy_j}{βx_i}$, which is basically ‘how much is our prediction for $y_j$ affected by the $i$th value of input $x$’

There will be two partial derivatives for this equation, one for when $i==j$ and one for when $i != j$

Case: i == j

First, let’s do the case when i == j. We need to remember the quotient rule for take derivatives:

\[\begin{eqnarray} \frac{d}{dx} (\frac{f(x)}{g(x)}) = \frac{\frac{d}{dx}f(x)g(x) - f(x)\frac{d}{dx}g(x)}{g^2(x)} \end{eqnarray}\]

I’m also going to rewrite the sum of the denominator to be with respect to k. This does not change the equation at all but makes the notation easier to understand. Also, that means k == i for one value which will make the derivative of the entire expression easier:

\[\begin{eqnarray} P(z = j|x) = y_j = \frac{exp(\beta x_j)}{\sum_{i=k}^{} exp(\beta x_k)} \end{eqnarray}\] \[\begin{eqnarray} f(x) = exp(\beta x_j) \end{eqnarray}\] \[\begin{eqnarray} \frac{d}f(x) = \beta exp(\beta x_j) \end{eqnarray}\] \[\begin{eqnarray} g(x) = \sum_{i=k}^{} exp(\beta x_k) \end{eqnarray}\]

Because only one term of the sum over k will be k=i, the derivative of the denominator is:

\[\begin{eqnarray} \frac{d}g(x) = \beta exp(\beta x_i) \end{eqnarray}\]

Now that we have all the terms for the quotient rule formula, we can plug them and reduce:

\[\begin{eqnarray} \frac{\beta exp(\beta x_j)\sum_{k}^{}exp(\beta x_k) - exp(\beta x_j)\beta exp(x_i)}{g(x)^2} \end{eqnarray}\]

Let’s split up the denominator

\[\begin{eqnarray} \frac{\beta exp(\beta x_j)\sum_{k}^{}exp(\beta x_k)}{g(x)^2} - \frac{exp(\beta x_j)\beta exp(x_i)}{g(x)^2} \end{eqnarray}\]

Expand the denominator on both sides

\[\begin{eqnarray} \frac{\beta exp(\beta x_j)\sum_{k}^{}exp(\beta x_k)}{\sum_{k}^{}exp(\beta x_k) \sum_{k}^{}exp(\beta x_k)} - \frac{exp(\beta x_j)\beta exp(x_i)}{\sum_{k}^{}exp(\beta x_k) \sum_{k}^{}exp(\beta x_k)} \end{eqnarray}\]

Simply out any like terms

\[\begin{eqnarray} \frac{\beta exp(\beta x_j))}{\sum_{k}^{}exp(\beta x_k)} - \frac{exp(\beta x_j)\beta exp(x_i)}{\sum_{k}^{}exp(\beta x_k) \sum_{k}^{}exp(\beta x_k)} \end{eqnarray}\]

We now see that we have similar terms from the problem we started with. Let’s create some useful names for expressions and plug those in.

\[\begin{eqnarray} y_{j} = \frac{exp(\beta x_j))}{\sum_{k}^{}exp(\beta x_k)} \end{eqnarray}\] \[\begin{eqnarray} y_{i} = \frac{exp(\beta x_i))}{\sum_{k}^{}exp(\beta x_k)} \end{eqnarray}\] \[\begin{eqnarray} \frac{\partial y_j}{\partial x_i} = \beta y_{j} - y_{j} \beta y_{i} \end{eqnarray}\] \[\begin{eqnarray} \frac{\partial y_j}{\partial x_i} = \beta y_{j} (1 - y_{i}) \end{eqnarray}\]

NOTE: i == j so we can replace the $y_{j}$ with $y_{i}$

\[\begin{eqnarray} \frac{\partial y_j}{\partial x_i} = \beta y_{i} (1 - y_{i}) \end{eqnarray}\]

Case: i != j

\[\begin{eqnarray} f(x) = exp(\beta x_j) \end{eqnarray}\] \[\begin{eqnarray} \frac{d}f(x) = 0 \end{eqnarray}\] \[\begin{eqnarray} g(x) = \sum_{i=k}^{} exp(\beta x_k) \end{eqnarray}\]

Because only one term of the sum over k will be k=i, the derivative of the denominator is:

\[\begin{eqnarray} \frac{d}g(x) = \beta exp(\beta x_i) \end{eqnarray}\]

Fill in the pieces to quotient rule formula:

\[\begin{eqnarray} \frac{0 - exp(\beta x_j)\beta exp(\beta x_i)}{h(x)^2} \end{eqnarray}\] \[\begin{eqnarray} \frac{-\beta exp(\beta x_j) exp(\beta x_i)}{\sum_{k}^{}exp(\beta x_k) \sum_{k}^{}exp(\beta x_k)} \end{eqnarray}\]

Using the same formula definitions as from the first problem, we get:

\[\begin{eqnarray} -\beta y_{j}y_{i} \end{eqnarray}\]

Therefore, we get the final answer:

\[\begin{eqnarray} \frac{\partial y_j}{\partial x_i} = -\beta y_{j}y_{i} \end{eqnarray}\]

Final Computed Partial Derivatives

\[\begin{eqnarray} i == j: \frac{\partial y_j}{\partial x_i} = \beta y_{i} (1 - y_{i}) \end{eqnarray}\] \[\begin{eqnarray} i \neq j: \frac{\partial y_j}{\partial x_i} = -\beta y_{j}y_{i} \end{eqnarray}\]