Linear Classifiers

Linear classifiers form the foundation of neural networks. Before diving into deep learning, it is worth understanding these simple models, which create linear decision boundaries in feature space.

Support Vector Machines (SVM)

The SVM loss function (hinge loss) penalizes predictions that fall on the wrong side of the decision boundary.

For a datapoint $(x_i, y_i)$, the loss is:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

where $s_j = f(x_i; W)_j$ are the class scores and $\Delta$ is a margin (typically 1).

Intuition: The loss is zero when the correct class score is at least $\Delta$ higher than all incorrect class scores. Otherwise, we accumulate a penalty proportional to how far we are from the desired margin.
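
To make the formula concrete, here is a minimal NumPy sketch of the per-example hinge loss (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def svm_loss_single(scores, correct_class, delta=1.0):
    """Multiclass hinge loss for one example, given its raw class scores."""
    margins = np.maximum(0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0   # the correct class does not contribute to the sum
    return margins.sum()

# Margin already satisfied -> loss 0; margin violated -> penalty grows with the violation
print(svm_loss_single(np.array([5.0, 1.0, -2.0]), correct_class=0))  # 0.0
print(svm_loss_single(np.array([2.0, 4.0, 1.5]), correct_class=0))   # 3.0 + 0.5 = 3.5
```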

Key Properties:

  • Creates a margin between classes
  • Only cares about points near the decision boundary
  • Robust to outliers on the correct side
  • Loss saturates once margin is satisfied

Softmax Classifier

The softmax function converts raw scores into probabilities:

$$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

The loss is the negative log-likelihood (cross-entropy):

$$L_i = -\log P(Y = y_i \mid X = x_i)$$
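
A matching NumPy sketch of the per-example softmax cross-entropy; subtracting the maximum score before exponentiating is a standard trick to avoid numerical overflow (names are again illustrative):

```python
import numpy as np

def softmax_cross_entropy_single(scores, correct_class):
    """Cross-entropy loss for one example from raw class scores."""
    shifted = scores - scores.max()                       # stability: largest score becomes 0
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log softmax
    return -log_probs[correct_class]

print(softmax_cross_entropy_single(np.array([3.0, 1.0, 0.2]), correct_class=0))  # ~0.18
```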

Why Cross-Entropy?

Cross-entropy has several desirable properties:

  • Penalizes confident wrong predictions heavily
  • Has smooth gradients everywhere for optimization
  • Connects to maximum likelihood estimation
  • Never fully saturates (always encourages improvement)

Comparison with Hinge Loss:

  • Softmax never stops optimizing (it always wants a higher probability for the correct class)
  • SVM stops once the margin is achieved (a numeric contrast follows this list)
  • Softmax outputs have probabilistic interpretation
  • Both work well in practice
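
A quick numeric check of this contrast, with illustrative scores, correct class 0, and margin 1: the hinge loss is already zero, while the cross-entropy loss is tiny but still positive:

```python
import numpy as np

scores = np.array([10.0, -2.0, 3.0])        # correct class is index 0, margin delta = 1
hinge = np.maximum(0, scores - scores[0] + 1.0)
hinge[0] = 0.0
print(hinge.sum())                          # 0.0 -> the SVM has nothing left to optimize
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(-np.log(probs[0]))                    # ~0.0009 -> small, but softmax still pushes higher
```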

Decision Boundaries

Linear classifiers create hyperplane decision boundaries in the input space:

$$w^T x + b = 0$$
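
Classifying a point in the binary case amounts to checking which side of the hyperplane it falls on, i.e. the sign of $w^T x + b$; a small sketch with made-up weights:

```python
import numpy as np

w, b = np.array([1.0, -2.0]), 0.5                    # made-up weights for a 2-D input
for x in (np.array([3.0, 1.0]), np.array([0.0, 2.0])):
    print(x, np.sign(w @ x + b))                     # +1.0 or -1.0: side of the hyperplane
```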

Limitations:

  • Can only separate linearly separable data
  • Cannot solve the XOR problem with a single linear classifier
  • Limited expressiveness for complex patterns

Why This Matters: Neural networks overcome these limitations by stacking multiple linear transformations with nonlinearities, creating arbitrarily complex decision boundaries.
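
As a minimal illustration of this point, a two-layer network with a ReLU nonlinearity and hand-picked (not learned) weights already computes XOR:

```python
import numpy as np

# Hand-picked weights for a tiny two-layer ReLU network that computes XOR.
W1, b1 = np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([0.0, -1.0])
W2, b2 = np.array([1.0, -2.0]), 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = np.maximum(0, W1 @ np.array(x, dtype=float) + b1)  # linear map, then ReLU
    print(x, W2 @ h + b2)                                   # 0.0, 1.0, 1.0, 0.0 -> XOR
```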

Mathematical Formulation

For a linear classifier with weights $W \in \mathbb{R}^{C \times D}$ and bias $b \in \mathbb{R}^C$:

$$f(x; W, b) = Wx + b$$

where:

  • $x \in \mathbb{R}^D$ is the input
  • $C$ is the number of classes
  • Output is a vector of class scores

The classifier predicts: $\hat{y} = \arg\max_k f(x)_k$
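
A direct NumPy transcription of this score function, with random weights purely to check shapes (the values of D, C, and the seed are arbitrary):

```python
import numpy as np

D, C = 4, 3                                   # input dimension and number of classes (arbitrary)
rng = np.random.default_rng(0)
W, b = rng.normal(size=(C, D)), np.zeros(C)   # W maps R^D -> R^C

x = rng.normal(size=D)
scores = W @ x + b                            # vector of C class scores
print(scores, int(np.argmax(scores)))         # predicted class = index of the highest score
```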

Next Steps

After understanding linear classifiers:

  1. Learn about the Perceptron algorithm
  2. Understand why multiple layers are needed
  3. Study activation functions that enable non-linearity