SVM Loss Function
For classification problems, one commonly used loss function is the multi-class SVM (Support Vector Machine) loss. The SVM loss encodes the requirement that the correct class for each input should have a higher score than the incorrect classes by some fixed margin δ. It turns out that the exact value of δ does not matter: rescaling the weights W rescales the score differences accordingly, so any positive margin works. We therefore set δ = 1.
In CIFAR-10, we have a dataset of training images $x_i$ and their corresponding labels $y_i$. The linear score function $f(x_i; W, b)$ computes a score vector for each training image $x_i$, where $f(x_i; W, b) \in \mathbb{R}^{10}$. Denote the score vector for $x_i$ by $s$, where $s = f(x_i; W, b)$. Then each element $s_j$ of the score vector $s$ is the score for the $j$-th class. That is, $s_j = f(x_i; W, b)_j$ for $j = 1, \dots, 10$.
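To make the notation concrete, here is a minimal NumPy sketch of such a linear score function. The shapes (a 10×3072 weight matrix $W$ acting on flattened 32×32×3 CIFAR-10 images) follow the setup above, but the random parameter values are assumptions for illustration only:

```python
import numpy as np

def scores(x_i, W, b):
    """Linear score function f(x_i; W, b) = W @ x_i + b.
    Returns a 10-dimensional score vector, one score per class."""
    return W.dot(x_i) + b

# Illustrative parameters only (not trained values):
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 3072)) * 0.001  # 10 classes x 3072 pixels (32*32*3)
b = np.zeros(10)                             # one bias per class
x_i = rng.standard_normal(3072)              # a single flattened image
s = scores(x_i, W, b)                        # s[j] is the score s_j for class j
```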
Based on this notation, the multi-class SVM loss (hinge loss) is defined as follows:
$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + 1)$$

Let's compute the SVM/hinge loss for each input image:
| | image 1 (class = cat) | image 2 (class = car) | image 3 (class = frog) |
| --- | --- | --- | --- |
| cat | 3.2 | 1.3 | 2.2 |
| car | 5.1 | 4.9 | 2.5 |
| frog | -1.7 | 2.0 | -3.1 |
For image 1, where the ground-truth class is cat,

$$L_1 = \sum_{j \neq y_1} \max(0, s_j - s_{y_1} + 1) = \max(0, 5.1 - 3.2 + 1) + \max(0, -1.7 - 3.2 + 1) = \max(0, 2.9) + \max(0, -3.9) = 2.9 + 0 = 2.9$$

For image 2, where the ground-truth class is car,
$$L_2 = \sum_{j \neq y_2} \max(0, s_j - s_{y_2} + 1) = \max(0, 1.3 - 4.9 + 1) + \max(0, 2.0 - 4.9 + 1) = \max(0, -2.6) + \max(0, -1.9) = 0 + 0 = 0$$

For image 3, where the ground-truth class is frog,
$$L_3 = \sum_{j \neq y_3} \max(0, s_j - s_{y_3} + 1) = \max(0, 2.2 - (-3.1) + 1) + \max(0, 2.5 - (-3.1) + 1) = \max(0, 6.3) + \max(0, 6.6) = 6.3 + 6.6 = 12.9$$

Overall, the loss over these three images is the average of the per-image losses:

$$L = \frac{1}{N} \sum_{i=1}^{N} L_i = \frac{2.9 + 0 + 12.9}{3} \approx 5.27$$
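To check these numbers, here is a short NumPy sketch of the per-example hinge loss; the `svm_loss` helper below is written just for this example:

```python
import numpy as np

def svm_loss(s, y, delta=1.0):
    """Multi-class SVM (hinge) loss for one score vector s with true class y:
    L_i = sum over j != y of max(0, s[j] - s[y] + delta)."""
    margins = np.maximum(0.0, s - s[y] + delta)
    margins[y] = 0.0  # the correct class is excluded from the sum
    return margins.sum()

# Score table from above; rows are classes (cat, car, frog), columns are images.
scores = np.array([[3.2, 1.3, 2.2],
                   [5.1, 4.9, 2.5],
                   [-1.7, 2.0, -3.1]])
labels = [0, 1, 2]  # image 1 = cat, image 2 = car, image 3 = frog

losses = [svm_loss(scores[:, i], y) for i, y in enumerate(labels)]
print(losses)                     # [2.9, 0.0, 12.9], up to floating-point rounding
print(sum(losses) / len(losses))  # ≈ 5.27
```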
Ultimately, we want to minimize this loss by tuning our parameters $W, b$: the SVM wants the score of the correct class $y_i$ to be larger than each incorrect class score by at least the margin δ.
A last piece of terminology: the threshold at zero, $\max(0, -)$, is often called the hinge loss. In practice we sometimes use the squared hinge loss instead, $\max(0, -)^2$, which penalizes violated margins more strongly because the penalty grows quadratically rather than linearly. On some datasets the squared hinge loss can work better. In any case, it is critical to pick a suitable loss function in machine learning and to understand why we pick it.
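As a quick sketch of the difference, reusing the image-3 scores from the table above (the `squared_hinge_loss` helper is hypothetical, written only for this comparison):

```python
import numpy as np

def squared_hinge_loss(s, y, delta=1.0):
    """Squared hinge loss: violated margins are penalized quadratically,
    so large violations cost disproportionately more."""
    margins = np.maximum(0.0, s - s[y] + delta)
    margins[y] = 0.0
    return np.sum(margins ** 2)

s3 = np.array([2.2, 2.5, -3.1])     # image 3 scores (cat, car, frog)
print(squared_hinge_loss(s3, y=2))  # 6.3^2 + 6.6^2 ≈ 83.25, vs. 12.9 for plain hinge
```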