## Introduction

In my previous post, we derived and proved all the math that is foundational to implementing an SVM from scratch (namely Pegasos SVM). In this post, I will show you how to implement Pegasos in Python, optimize it (while still proving the math holds), and then analyzing the results.

## Code

Now that we have reasoned through the Objective Function for SVM and shown how a certain step size gives us Pegasos SVM, we can now code it up!

### Data Prep

We are using movie review data from: Polarity Dataset v2.0.

It has the full text from 2000 movies reviews: 1000 reviews are classified as positive and 1000 as negative.

Our goal is to predict whether a review has positive or negative sentiment from the text of the review.

#### Load data and randomly split into train/validation

def folder_list(path,label):
filelist = os.listdir(path)
review = []
for infile in filelist:
file = os.path.join(path,infile)
r = list(r)
r.append(label)
review.append(r)
return review

f = open(file)
symbols = '${}()[].,:;+-*/&|<>=~" ' words = map(lambda Element: Element.translate(str.maketrans("", "", symbols)).strip(), lines) words = filter(None, words) return words def shuffle_data(): ''' pos_path is where you save positive review data. neg_path is where you save negative review data. ''' pos_path = "data/pos" neg_path = "data/neg" pos_review = folder_list(pos_path,1) neg_review = folder_list(neg_path,-1) reviews = pos_review + neg_review random.shuffle(reviews) return reviews # load and shuffle data reviews = shuffle_data() train = reviews[0:1500] val = reviews[1500:]  ### Sparse Representations We will represent our text document features using bag of words. Here every possible word is a feature, and the value of a word feature is the number of times that word appears in the document. Most words will not appear in many documents, so the count values will be 0. Sparse Represendation: Rather than store many values of zeros, we only store the counts that are nonzero. The counts are stored in a dictionary in Python (hashmap in other languages will work). For example, Shrek and Shrek II would be represented as the following Python dictionary: x = {'Shrek':2, 'and':1, 'II':1}  We will use a linear classifier:$f(x)=w^{T}x$. We can store the$w$vector in a sparse format, such as: w = {'maximum':1.3, 'Shrek':-1.1, 'bad':-4.2, 'and':2.2, 'good':9.1}  The inner product between$w$and$x$would only involve the features that appear in both, since whatever doesn’t appear is 0. For this example, the inner product would be: inner_product = (x['Shrek'] * w['Shrek']) + (x['and'] * w['and']) = 2*(-1.1) + 1*(2.2)}  #### Convert example (list of words) into a sparse bag-of-words representation def to_sparse(example): example_sparse = {} for word in example: if word in example_sparse: example_sparse[word] += 1 else: example_sparse[word] = 1 return example_sparse # convert data to sparse reps X_train = [] y_train = [] X_val = [] y_val = [] for example in train: X_train.append(to_sparse(example[0:-1])) y_train.append(example[-1:][0]) for example in val: X_val.append(to_sparse(example[0:-1])) y_val.append(example[-1:][0])  ### Implement the Pegasos Algorithm Let’s implement the Pegasos algorithm to run on our sparse data. The output will be a sparse, trained weight vector$w$. def scale(d, scale): for f, v in d.items(): d[f] = v * scale def pegasus(X, y, lamb=1, epochs=1000, verbose=False): if verbose: print("Pegasus using lamb={} and epochs={}".format(lamb, epochs)) m = len(y) t = 0.0 w = {} loss = 0 for i in range(epochs): for j in range(m): t += 1 learning_rate = 1.0 / (t * lamb) x_j = X[j] y_j = y[j] pred = y_j*dotProduct(w,x_j) loss += max(0, 1 - pred) if pred < 1: scale(w, 1.0 - (learning_rate * lamb)) increment(w, learning_rate * y_j, x_j) else: scale(w, 1.0 - (learning_rate * lamb)) if verbose: loss = (1.0/m) * loss + (lamb/2) * dotProduct(w, w) print("Current loss at epoch {}/{}: {}".format(i+1, epochs, loss)) loss = 0 return w  ### Optimizations In every step of the Pegasos algorithm, we rescale every entry of$w_{t}$by the factor$(1-\eta_{t}\lambda)$. Doing this on a dictionary is very very slow. We can make things faster by representing$w$as$w=sW$, where$s\in R$and$W\in R^{d}$. Let’s start with$s=1$and$W$all zeros ( empty dictionary). Both updates (i.e. whether or not we have a margin error) start with rescaling$w_{t}$, which we do by setting$s_{t+1}= (1-\eta_{t}\lambda)s_{t}$. If the update is$w_{t+1}=(1-\eta_{t}\lambda)w_{t}+\eta_{t}y_{j}x_{j}$, we can verify that the Pegasos update step is equivalent to: $\begin{eqnarray} s_{t+1} & = & \left(1-\eta_{t}\lambda\right)s_{t}\\ W_{t+1} & = & W_{t}+\frac{1}{s_{t+1}}\eta_{t}y_{j}x_{j}. \end{eqnarray}$ Let’s prove the above is equivalent to standard Pegasos update: $\begin{eqnarray} w_{t+1} = s_{t+1}W_{t+1} \end{eqnarray}$ $\begin{eqnarray} W_{t+1} = \frac{w_{t+1}}{s_{t+1}} \end{eqnarray}$ $\begin{eqnarray} W_{t} = \frac{w_{t}}{s_{t}} \end{eqnarray}$ Sub this into the optimal function: $\begin{eqnarray} W_{t+1} = W_{t}+\frac{1}{s_{t+1}}\eta_{t}y_{j}x_{j} \end{eqnarray}$ $\begin{eqnarray} \frac{w_{t+1}}{s_{t+1}} = \frac{w_{t}}{s_{t}} + \frac{\eta_{t}y_{j}x_{j}}{s_{t+1}} \end{eqnarray}$ Multiply both sides by$s_{t+1}$$\begin{eqnarray} w_{t+1} = \frac{s_{t+1} w_{t}}{s_{t}} + \eta_{t}y_{j}x_{j} \end{eqnarray}$ Expand definition of$s_{t+1}$$\begin{eqnarray} w_{t+1} = \frac{\left(1-\eta_{t}\lambda\right)s_{t} w_{t}}{s_{t}} + \eta_{t}y_{j}x_{j} \end{eqnarray}$ Cancel the$s_t$$\begin{eqnarray} w_{t+1} = \left(1-\eta_{t}\lambda\right) w_{t} + \eta_{t}y_{j}x_{j} \end{eqnarray}$ We have now proven the optimal update is equal to the original update ### Optimal Pegasos Let’s implement the Pegasos algorithm with the optimal$(s,W)$representation from above. def scale(d, scale): for f, v in d.items(): d[f] = v * scale def pegasus_optimal(X, y, lamb=1, epochs=1000, verbose=False): if verbose: print("PegasusOptimal using lamb={} and epochs={}".format(lamb, epochs)) m = len(y) t = 1.0 # to prevent t=1 inside the inner loop w = {} s = 1.0 loss = 0 for i in range(epochs): for j in range(m): t += 1 learning_rate = 1.0 / (t * lamb) x_j = X[j] y_j = y[j] pred = s*y_j*dotProduct(w,x_j) loss += max(0, 1 - pred) if pred < 1: s = (1.0 - (learning_rate * lamb)) * s increment(w, (learning_rate * y_j) / s, x_j) else: s = (1.0 - (learning_rate * lamb)) * s if verbose: loss = (1.0/m) * loss + (lamb/2) * dotProduct(w, w) * (s*s) print("Current loss at epoch {}/{}: {}".format(i+1, epochs, loss)) loss = 0 scale(w, s) return w  ### Verify Pegasos Implementations We will now run both implemtations of Pegasos on the training data for a few epochs using the bag-of-words representation. We will verify each implementation is correct by comparing the results and also measuring how long each takes. We expect to see a massive speedup from the Optimal implementation. epochs = 3 # couple of epochs lamb = 1 t0 = time() w_normal = pegasus(X_train, y_train, lamb=lamb, epochs=epochs) t1 = time() pegasus_time = t1 - t0 ''' Pegasus using lamb=1 and epochs=3 Current loss at epoch 1/3: 5.802513611688971 Current loss at epoch 2/3: 0.9373906607201646 Current loss at epoch 3/3: 0.769923916295324 ''' t0 = time() w_optimal = pegasus_optimal(X_train, y_train, lamb=lamb, epochs=epochs) t1 = time() pegasus_optimal_time = t1 - t0 ''' PegasusOptimal using lamb=1 and epochs=3 Current loss at epoch 1/3: 5.104274351207599 Current loss at epoch 2/3: 0.9371642829272218 Current loss at epoch 3/3: 0.7698404446775569 '''  We can see from above that at each epoch, both implementations achieve the same loss. Let’s now measure how long each takes. print('for {} epochs: pegasus={}sec, pegasus_optimal={}sec'.format(epochs, pegasus_time, pegasus_optimal_time)) ''' for 3 epochs: pegasus=32.19710898399353sec, pegasus_optimal=1.454498529434204sec '''  Wow! We achieve a 22x speedup with the optimal implementaion!s Let’s wrap up and make sure both implementations produce a trained$w$vector that gives the same loss result on validation data. def calculate_loss(X, y, w, lamb=1): m = len(y) loss = 0 for j in range(m): x_j = X[j] y_j = y[j] margin = dotProduct(w, x_j) * y_j loss += max(0, 1 - margin) return (1/m)*loss + (lamb/2) * dotProduct(w, w) # Make sure your implementations arecorrect by verifying that the two approaches give essentially the same resul print("Pegasos validation loss: {}".format(calculate_loss(X_val, y_val, w_normal, lamb=lamb))) print("Pegasos Optimal validation loss: {}".format(calculate_loss(X_val, y_val, w_optimal, lamb=lamb))) ''' Pegasos validation loss: 0.6840473333334376 Pegasos Optimal validation loss: 0.683987954391298 '''  Everything looks pretty good! Our optimal implementation gives the same result as the basic Pegasos implementation and is much much faster. ### 0-1 Loss Let’s write a function that takes a sparse weight vector$w$and a collection of$(x,y)$pairs, and returns the percent error when predicting$y$using$sign(w^{T}x)$. This function will give us the 0-1 loss of the linear predictor$x\mapsto w^{T}x$def calculate_accuracy(X, y, w): correct = 0 for i in range(len(y)): x_j = X[i] margin = dotProduct(w, x_j)*y[i] if margin > 0: correct += 1 return correct / len(y) def calculate_percent_error(X, y, w): return 1.0 - calculate_accuracy(X, y, w) percent_error_normal = calculate_percent_error(X_val, y_val, w_normal) print("Pegosos Percent Error: {}".format(percent_error_normal)) percent_error_optimal = calculate_percent_error(X_val, y_val, w_optimal) print("Pegosos Optimal Percent Error: {}".format(percent_error_optimal)) ''' Pegosos Percent Error: 0.30200000000000005 Pegosos Optimal Percent Error: 0.30200000000000005 '''  Both implementations achieve about 30% error, or 70% accuracy on the moview reviews dataset. ### Hyperparameter Tuning Next let’s search for the regularization parameter that gives the minimal percent error on our test set. The plan will be to use the optimal Pegasos function and run it unitl it converges. epochs = 10 lambdas = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0] test_errors = [] for lamb in lambdas: w = pegasus_optimal(X_train, y_train, lamb=lamb, epochs=epochs) test_errors.append(calculate_percent_error(X_val, y_val, w)) plt.xlabel('L2Reg') plt.ylabel('Percent Error') plt.title('Validation Percent Error after {} Epochs'.format(epochs)) plt.plot(lambdas, test_errors) plt.show()  The best Lambda we can get is 0.8 with 0.19 percent error (81% accuracy) ### SVM Score as ‘Confidence’ Remember that the score is the value of the prediction$f(x)=w^{T}x$. The magnitude of the score represents the confidence of the prediction. This is something we measure to see if it makes sense. We will break the predictions into groups based on the score. For each group, we will look at the percentage error. Let’s see if there a correlation between higher magnitude scores and accuracy. def split_by_score(X, y, w): X_high = [] y_high = [] X_low = [] y_low = [] for i in range(len(y)): margin = dotProduct(w, X[i]) if abs(margin) > 1: # what makes a 'high' score X_high.append(X[i]) y_high.append(y[i]) elif abs(margin) > 0: X_low.append(X[i]) y_low.append(y[i]) return X_high, y_high, X_low, y_low lamb = 1 epochs = 50 w = pegasus_optimal(X_train, y_train, lamb=lamb, epochs=epochs) X_high, y_high, X_low, y_low = split_by_score(X_val, y_val, w) print(len(X_high)) print(len(X_low)) ''' 136 364 ''' acc_high = calculate_accuracy(X_high, y_high, w) print("High Score Accuracy: {}".format(acc_high)) acc_low = calculate_accuracy(X_low, y_low, w) print("Low Score Accuracy: {}".format(acc_low)) acc_total = calculate_accuracy(X_val, y_val, w) print("Total Accuracy: {}".format(acc_total))  group | accuracy high-score | 0.9705882352941176 low-score | 0.739010989010989 total | 0.802  For high-score predictions, the accuracy is much higher than the average accuracy. For low score predictions, the accuracy is lower than average. This makes sense. For predictions that are ‘high confidence’, we expect the model to have actually made the right decision. For predictions that the model is ‘unsure’ about, we expect the accuracy to be lower. There is a very high positive correlation between the magnitude of the scores the accuracy ### Is a Non-differentiable Objective Function dangerous? Our objective is not differentiable when$y_{i}w^{T}x_{i}=1$. Let’s see how often and when we have$y_{i}w^{T}x_{i}=1$(or perhaps within a small distance of$1$Does it make sense just skipping the update when$yw^{T}x_{i}=1\$? What about shortening the step size by a small percentage?

def pegasus_bad_margins(X, y, lamb=1, epochs=1000, e=0):
m = len(y)
t = 1.0 # to prevent t=1 inside the inner loop
w = {}
s = 1.0
count = 0
for i in range(epochs):
for j in range(m):
t += 1
learning_rate = 1.0 / (t * lamb)
x_j = X[j]
y_j = y[j]
pred = s*y_j*dotProduct(w,x_j)
if abs(pred - 0) < e:
count += 1
if pred < 1:
s = (1.0 - (learning_rate * lamb)) * s
increment(w, (learning_rate * y_j) / s, x_j)
else:
s = (1.0 - (learning_rate * lamb)) * s
scale(w, s)
return w, count

lamb = 0.5
epochs = 5
w, count = pegasus_bad_margins(X_train, y_train, lamb=lamb, epochs=epochs, e=0.01)
'''
count = 44
'''

w, count = pegasus_bad_margins(X_train, y_train, lamb=lamb, epochs=epochs, e=1.0e-5)
'''
count = 0
'''


With a small threshold, there are quite a few examples that are close to 1. However, with 1.0e-5, there is no example that is close to 1. With no threshold, no margin ever equals 1 exactly. The advice to ignore the non-diff area of the gradient might make sense.

If we shorten the step size, we get the following numbers:

t = 10.0
...
learning_rate = 1.0 / (t * lamb)*10
...


Only 2 examples are within 0.01 threshold of 1 and 0 are within 1.0e-5 threshold of 1. Reducing the step size is a valid approach to avoid the dangerous region of 1.

## Conclusion

I hope you enjoyed following along on this 2 post series where we learned the deep math behind SMV and how to implement Pegasos SVM in code. We also saw how to optimize the code and stil prove the math would hold, and also how to analyze the results of the code.