By: Nick Greenquist

Introduction

Welcome back to another fun-filled blog post on machine learning! Following this post on Gradient Descent for Linear Regression and this post on Linear Regression with L2 Regularization: Ridge Regression, we will now be using a different kind of optimization algorithm for Ridge Regression (which, remember, is just Linear Regression with L2 Regularization). This algorithm is called Stochastic Gradient Descent (stochastic just means random…these math people and their confusing lingo, am I right?), and while it is an EXTREMELY simple variant of normal Batch Gradient Descent, it has some very interesting theoretical properties that I will discuss as I show you how to implement it.

Breaking Down the Objective Function

Let’s return to the Objective Function for Ridge Regression:

\[\begin{eqnarray} J(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left(h_{\theta}(x_{i})-y_{i}\right)^{2} + \lambda\theta^{T}\theta \end{eqnarray}\]

Notice how we compute the loss on every training point and then sum all of those losses together. Let's define the loss on a single training example and call it $s_i(\theta)$, meaning the loss on the $i^{th}$ training point from our Design Matrix $X$.

The Objective Function could be simplified by just using $s_i(\theta)$ inside the summation:

\[\begin{eqnarray} J(\theta)=\frac{1}{n}\sum_{i=1}^{n}s_i(\theta) \end{eqnarray}\]

So, here is the actual equation for the loss on a single training example:

\[\begin{eqnarray} s_i(\theta) = (h_{\theta}(x_i) - y_i)^{2} + \lambda\theta^{T}\theta \end{eqnarray}\]

But hold on! Aren't we now summing up $n$ copies of $\lambda\theta^{T}\theta$? Well yes, but notice the $\frac{1}{n}$ outside the summation. That will cancel out the summed version of the penalty term as follows:

\[\begin{eqnarray} J(\theta)=\frac{1}{n}\sum_{i=1}^{n}\left[(h_{\theta}(x_{i}) - y_i)^{2} + \lambda\theta^{T}\theta\right] \end{eqnarray}\]

When we sum the $n$ regularization terms, we get

\[\begin{eqnarray} \frac{1}{n}\left(n\,\lambda\theta^{T}\theta\right) \end{eqnarray}\]

Which is just:

\[\begin{eqnarray} \lambda\theta^{T}\theta \end{eqnarray}\]

So, we have now derived an expression for the loss on a single training point of $X$ that still uses the same exact Objective Function we have been dealing with for normal Ridge Regression.
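
If you want to convince yourself of that cancellation numerically, here is a minimal sanity check (using numpy and tiny synthetic data, not the dataset from the previous posts): the average of the per-point losses $s_i(\theta)$ matches the full Objective Function exactly.

import numpy as np

# Tiny synthetic problem, purely for illustration
np.random.seed(0)
n, d = 50, 3
X = np.random.randn(n, d)
y = np.random.randn(n)
theta = np.random.randn(d)
lambda_reg = 0.1

# Full objective: (1/n) * ||X theta - y||^2 + lambda * theta' theta
J = (1.0/n) * np.sum((X @ theta - y)**2) + lambda_reg * theta @ theta

# Per-point losses s_i(theta), each carrying its own copy of the penalty term
s = [(X[i] @ theta - y[i])**2 + lambda_reg * theta @ theta for i in range(n)]

print(np.isclose(J, np.mean(s)))  # True: the 1/n cancels the n copies of the penalty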

Proving Stochastic Gradient Descent Works

For Stochastic Gradient Descent, we only want to use the gradient we compute using this SINGLE training point (the gradient of $s_i(\theta)$). However, will this work? Well, we can prove it using some math!

So, here is our Objective Function but with $s_i(\theta)$ inside the summation.

\[\begin{eqnarray} J(\theta)=\frac{1}{n}\sum_{i=1}^{n}s_{i}(\theta) \end{eqnarray}\]

We can write the gradient of this thing as follows:

\[\begin{eqnarray} \nabla J(\theta)=\nabla \frac{1}{n}\sum_{i=1}^{n}s_{i}(\theta) \end{eqnarray}\]

And we can move the gradient inside the sum because the gradient is a linear operation:

\[\begin{eqnarray} \nabla J(\theta)=\frac{1}{n}\sum_{i=1}^{n} \nabla s_{i}(\theta) \end{eqnarray}\]

We need to find the Expected Value of the Stochastic Gradient. We don't actually need to compute it yet (we will once we've proven this thing works and know we won't be wasting our time!), but we just want to show that its Expected Value is the same thing as the Full Batch Gradient ($\nabla J(\theta)$).

Why do we care about this thing’s Expected Value? Well, here is the definition of the Expected Value: ‘In probability theory, the expected value of a random variable, intuitively, is the long-run average value of repetitions of the same experiment it represents’ - Wikipedia.

This is a pretty powerful statement. This means that if we use the Stochastic Gradient repeatedly, over time it will give us the same results as using the full Batch Gradient we have been using! Ok, so let's see what this thing's expected value is!

\[\begin{eqnarray} E[\nabla s_i(\theta)] \end{eqnarray}\]

The Expected Value of that gradient is simply the probability-weighted sum over all possible choices of $i$, where $i$ is sampled uniformly from $\{1,\ldots,n\}$.

\[\begin{eqnarray} E[\nabla s_i(\theta)] = \nabla \sum_{j=1}^{n}P(i = j)\,s_j(\theta) \end{eqnarray}\]

We can move the gradient inside the summation (and to the right of the probability since that is just a constant) because the gradient is a linear operation:

\[\begin{eqnarray} E[\nabla s_i(\theta)] = \sum_{j=1}^{n}P(i = j)\,\nabla s_j(\theta) \end{eqnarray}\]

What is the chance that $i = j$ if we have $n$ possible values for $j$ and each has the same chance of being chosen? Well, it's simply $\frac{1}{n}$!

\[\begin{eqnarray} E[\nabla s_i(\theta)] = \frac{1}{n}\sum_{j=1}^{n}\nabla s_j(\theta) \end{eqnarray}\]

Woah! Haven't we seen this expression before? Oh yeah, it's literally the gradient of the full Objective Function!

\[\begin{eqnarray} \nabla J(\theta)=\frac{1}{n}\sum_{i=1}^{n} \nabla s_{i}(\theta) \end{eqnarray}\]

Nice! So we just proved that:

\[\begin{eqnarray} E[\nabla s_i(\theta)] = \nabla J(\theta) \end{eqnarray}\]
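
As a quick numerical check of that proof (again with tiny synthetic data, not the real dataset), you can verify that averaging the single-point gradients over every possible choice of $i$ gives back the full Batch Gradient:

import numpy as np

# Tiny synthetic problem, purely for illustration
np.random.seed(1)
n, d = 50, 3
X = np.random.randn(n, d)
y = np.random.randn(n)
theta = np.random.randn(d)
lambda_reg = 0.1

# Full Batch Gradient of J(theta) = (1/n)||X theta - y||^2 + lambda * theta' theta
batch_grad = (2.0/n) * X.T @ (X @ theta - y) + 2*lambda_reg*theta

# E[grad s_i(theta)] under a uniform choice of i is the plain average of the
# single-point gradients 2[(x_i . theta - y_i) x_i + lambda * theta]
stochastic_grads = [2*((X[i] @ theta - y[i])*X[i] + lambda_reg*theta) for i in range(n)]

print(np.allclose(batch_grad, np.mean(stochastic_grads, axis=0)))  # True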

Stochastic Gradient

Now that we have proven that the Stochastic Gradient is a valid thing to use in our descent algorithm, let's actually compute it:

\[\begin{eqnarray} \nabla s_i(\theta) = \nabla[(h_{\theta}(x_i) - y_i)^{2} + \lambda\theta^{T}\theta] \end{eqnarray}\]

We can use the chain rule on the $(h_{\theta}(x_i) - y_i)^{2}$ part, and the regularization term is a pretty simple derivative. For the chain rule, we need to find $\nabla h_{\theta}(x_i)$ (the gradient with respect to $\theta$).

\[\begin{eqnarray} h_{\theta}(x_i) = \theta^T x_i \end{eqnarray}\] \[\begin{eqnarray} \nabla h_{\theta}(x_i) = x_i \end{eqnarray}\]

We can now use this back in the main gradient:

\[\begin{eqnarray} \nabla s_i(\theta) = 2(h_{\theta}(x_i) - y_i)\nabla h_{\theta}(x_i) + 2\lambda\theta \end{eqnarray}\] \[\begin{eqnarray} \nabla s_i(\theta) = 2(h_{\theta}(x_i) - y_i)x_i + 2\lambda\theta \end{eqnarray}\] \[\begin{eqnarray} \nabla s_i(\theta) = 2[(h_{\theta}(x_i) - y_i)x_i + \lambda\theta] \end{eqnarray}\]

Math -> Code

Now that we have proven SGD is theoretically sound and have computed the Stochastic Gradient that we can use inside of SGD, let’s code it up!

Square Loss Function

Our loss function for Stochastic Gradient Descent is the same as for Batch Gradient Descent. The only thing that will change when using SGD versus Batch Gradient Descent is that we only use a single training point to compute the gradient. Nothing changes for evaluating how well $\theta$ is doing at any given time step.

Remember, here is our normal vectorized square loss function: \(\begin{eqnarray} J(\theta)=\frac{1}{m}\|X\theta - y\|_2^2 \end{eqnarray}\)

import numpy as np

def compute_square_loss(X, y, theta):
    # Average square loss over all m training points: (1/m) * ||X theta - y||^2
    m = len(y)
    loss = (1.0/m)*(np.linalg.norm((X.dot(theta) - y), 2) ** 2)
    return loss
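
Here is a quick usage example on a made-up 3-point dataset, just to show the expected shapes (this is not the data from the previous posts):

X_demo = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y_demo = np.array([1.0, 2.0, 3.0])
theta_demo = np.zeros(2)

print(compute_square_loss(X_demo, y_demo, theta_demo))  # ~4.667: average squared error with theta = 0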

Stochastic Gradient

Next, we will create a function that computes the Gradient vector at a given time step using only ONE point. We are going to convert this equation:

\[\begin{eqnarray} \nabla s_i(\theta) = 2[(h_{\theta}(x_i) - y_i)x_i + \lambda\theta] \end{eqnarray}\]

into Python code:

def compute_stochastic_gradient(X_i, y_i, theta, lambda_reg):
    # Gradient of s_i(theta): 2 * [(x_i . theta - y_i) * x_i + lambda * theta]
    return 2*((theta.T @ X_i) - y_i)*X_i + 2*lambda_reg*theta
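
If you want to double-check that gradient against the math, here is an optional finite-difference sanity check on a single made-up training point (the helper s_i below exists only for this check; it is not part of the training code):

import numpy as np

def s_i(X_i, y_i, theta, lambda_reg):
    # Loss on a single point, including the penalty term
    return (theta @ X_i - y_i)**2 + lambda_reg * theta @ theta

np.random.seed(2)
X_i = np.random.randn(4)
y_i = 0.5
theta = np.random.randn(4)
lambda_reg = 0.1
eps = 1e-6

# Central-difference approximation of the gradient, one coordinate at a time
numeric = np.array([
    (s_i(X_i, y_i, theta + eps*e, lambda_reg) - s_i(X_i, y_i, theta - eps*e, lambda_reg)) / (2*eps)
    for e in np.eye(4)
])
print(np.allclose(numeric, compute_stochastic_gradient(X_i, y_i, theta, lambda_reg)))  # True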

Stochastic Gradient Descent

We are now ready to create our iterative training function, but this time we will be updating $\theta$ at each step using our Stochastic Gradient, not the full Gradient. Also note, instead of generating a random number every iteration, we instead shuffle all the indices at the beginning of each epoch and then just loop through those shuffled indices. This is a more efficient way to make the optimization 'stochastic' without having to generate thousands of random numbers (millions for bigger datasets), because shuffling algorithms are actually pretty fast.

def stochastic_grad_descent(X, y, alpha=0.01, lambda_reg=0.1, num_epoch=1000):
    num_instances, num_features = X.shape[0], X.shape[1]
    theta = np.ones(num_features) #Initialize theta

    loss_hist = np.zeros((num_epoch, num_instances)) #Initialize loss_hist
    
    for e in range(num_epoch):
        
        # Reshuffle the indices for the new epoch, so we get new random points at each iteration
        shuffled_indices = np.random.permutation(num_instances)
        
        for i in shuffled_indices:
            
            X_i = X[i]
            y_i = y[i]
            
            # perform update
            theta = theta - alpha*compute_stochastic_gradient(X_i, y_i, theta, lambda_reg)
            
            loss_hist[e, i] = compute_square_loss(X, y, theta) + lambda_reg*(theta.T @ theta)
    
    return loss_hist

Using the same data as in the last two blog posts, let’s do ‘machine’ learning using SGD instead of Batch Gradient Descent!

SGD Behaviour

Before we run our code and examine the plots, let's take a look at how SGD behaves as it gets near the minima of the Objective Function. Notice how the Batch Gradient Descent optimizer takes a nice, straight walk towards it. This is a pretty picture, but remember that Batch Gradient Descent can be much, much, MUCH slower than SGD at even getting close in the first place. While SGD flails around wildly on its way to the minima, it does get there over time, and very, very, VERY fast.

Plot4 (Source: UCLA CS260)

NOTE: Minibatch Gradient Descent just refers to Gradient Descent using $m$ training points to compute the gradient, where $1 < m < n$. Stochastic Gradient Descent is actually a special case of Minibatch Gradient Descent, where $m=1$.
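
For reference, a minibatch version of our gradient function could look something like the sketch below (this is not part of the code we run in this post; it is just the same $2[(h_{\theta}(x_i) - y_i)x_i + \lambda\theta]$ formula averaged over a batch of $m$ rows):

import numpy as np

def compute_minibatch_gradient(X_batch, y_batch, theta, lambda_reg):
    # Average of the m single-point gradients for the rows in this batch
    m = X_batch.shape[0]
    return (2.0/m) * X_batch.T @ (X_batch @ theta - y_batch) + 2*lambda_reg*theta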

Finding Optimal Learning Rate

Let’s now find the best learning rate for Stochastic Gradient Descent, and also observe how SGD is affected by the learning rate.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

def main():
    #Loading the dataset into a Pandas DataFrame
    df = pd.read_csv('data.csv', delimiter=',')
    X = df.values[:,:-1]
    y = df.values[:,-1]

    # Split into Train and Test
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=100)

    # Scaling all features to [0, 1] and adding bias term
    # (feature_normalization is the helper from the previous blog posts)
    X_train, X_test = feature_normalization(X_train, X_test)
    X_train = np.hstack((X_train, np.ones((X_train.shape[0], 1))))  # Add bias term
    X_test = np.hstack((X_test, np.ones((X_test.shape[0], 1))))  # Add bias term

    # Run SGD with different learning rates
    lamb = 0.015 # we are using the optimal regularization parameter found in the last blog post

    # try a few step sizes
    alphas = [0.01, 0.005, 0.001]
    for alpha in alphas:
        loss_hist = stochastic_grad_descent(X_train, y_train, alpha=alpha, lambda_reg=lamb)

        plt.xlabel('epochs')
        plt.ylabel('Objective Function Loss')
        plt.ylim(bottom=2, top=5)
        plt.title('SGD, alpha=%s, lambda=%s' % (str(alpha), str(lamb)))
        plt.plot([i for i in range(1, loss_hist.shape[0] + 1)], loss_hist.mean(axis=1))
        plt.show()
        print("Final Objective Loss: %s \n" % (str(loss_hist.mean(axis=1)[-1])))

Here are the plots of the Objective Loss (I’m including the regularization penalty) by the epoch for each learning rate we tried.

Plot1

Plot2

Plot3

Notice how the optimizer is indeed flailing around as it gets close to the minima, but it does get there. Also, notice how the 'flailing' is dampened the lower we set our learning rate.

One thing to note is that SGD does not achieve quite as good a final loss as Batch Gradient Descent. However, in practice, this doesn't matter: the difference in loss is such a small percentage that it is rarely worth worrying about. A better explanation of why requires a deep dive into Bayes Decision Functions and Bayes Risk, but for now, you can just believe me (and pretty much every Machine Learning practitioner) that the massive speed boost SGD offers outweighs the small hit in the final performance of the model.

If you really wanted to, you could use SGD to get you 95% of the way there and then switch to using Batch Gradient Descent, but usually this isn't even worth it. Most people instead just use learning rate update rules that lower the learning rate as the optimizer gets closer and closer to the minima (or lower it if it's making no progress at all).
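
As a rough sketch of what such a learning rate update rule might look like (this is just one common choice, not something from the previous posts), you could decay the step size as the number of updates grows:

def decayed_alpha(alpha_0, t, decay=0.001):
    # 1/t-style decay: the step size shrinks as training progresses,
    # which damps the 'flailing' near the minima
    return alpha_0 / (1.0 + decay * t)

# Inside the SGD loop you would use decayed_alpha(alpha, total_steps) in place of a
# fixed alpha, where total_steps counts every single-point update made so far.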

Conclusion

In this post, we learned what Stochastic Gradient Descent is, proved that it works in relation to Full Batch Gradient Descent, derived the Stochastic Gradient for Ridge Regression, and finally coded SGD in Python.

For a deeper dive into this area, you can explore the following topics:

  1. Use an adaptive learning rate to get closer to the minima or learn faster in the beginning
  2. Learn about Bayes Decision Functions and Bayes Risk (very theoretical)
  3. Implement Minibatch Gradient Descent
  4. Use SGD on different loss functions (or different models)