Linear Classifiers

Linear classifiers form the foundation of neural networks. Before diving into deep learning, it is worth understanding these simple models, which create linear decision boundaries in feature space.

Support Vector Machines (SVM)

The SVM loss function (hinge loss) penalizes predictions that fall on the wrong side of the decision boundary.

For a datapoint $(x_i, y_i)$, the loss is:

$$L_i = \sum_{j \neq y_i} \max(0, s_j - s_{y_i} + \Delta)$$

where $s_j = f(x_i; W)_j$ are the class scores and $\Delta$ is a margin (typically 1).

Intuition: The loss is zero when the correct class score is at least $\Delta$ higher than all incorrect class scores. Otherwise, we accumulate a penalty proportional to how far we are from the desired margin.
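
To make the formula concrete, here is a minimal NumPy sketch of the per-example hinge loss (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def svm_loss_single(scores, correct_class, delta=1.0):
    """Multiclass hinge loss for one example, given its raw class scores."""
    margins = np.maximum(0, scores - scores[correct_class] + delta)
    margins[correct_class] = 0.0   # the correct class does not contribute to the sum
    return margins.sum()

# Margin already satisfied -> loss 0; margin violated -> penalty grows with the violation
print(svm_loss_single(np.array([5.0, 1.0, -2.0]), correct_class=0))  # 0.0
print(svm_loss_single(np.array([2.0, 4.0, 1.5]), correct_class=0))   # 3.0 + 0.5 = 3.5
```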

Key Properties:

  • Creates a margin between classes
  • Only cares about points near the decision boundary
  • Robust to outliers on the correct side
  • Loss saturates once margin is satisfied

Softmax Classifier

The softmax function converts raw scores into probabilities:

$$P(Y = k \mid X = x_i) = \frac{e^{s_k}}{\sum_j e^{s_j}}$$

The loss is the negative log-likelihood (cross-entropy):

$$L_i = -\log P(Y = y_i \mid X = x_i)$$
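
A matching NumPy sketch of the per-example softmax cross-entropy; subtracting the maximum score before exponentiating is a standard trick to avoid numerical overflow (names are again illustrative):

```python
import numpy as np

def softmax_cross_entropy_single(scores, correct_class):
    """Cross-entropy loss for one example from raw class scores."""
    shifted = scores - scores.max()                       # stability: largest score becomes 0
    log_probs = shifted - np.log(np.exp(shifted).sum())   # log softmax
    return -log_probs[correct_class]

print(softmax_cross_entropy_single(np.array([3.0, 1.0, 0.2]), correct_class=0))  # ~0.18
```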

Why Cross-Entropy?

Cross-entropy has several desirable properties:

  • Penalizes confident wrong predictions heavily
  • Has smooth gradients everywhere for optimization
  • Connects to maximum likelihood estimation
  • Never fully saturates (always encourages improvement)

Comparison with Hinge Loss:

  • Softmax never stops optimizing (it always wants a higher probability for the correct class)
  • SVM stops once the margin is achieved (a numeric contrast follows this list)
  • Softmax outputs have probabilistic interpretation
  • Both work well in practice
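
A quick numeric check of this contrast, with illustrative scores, correct class 0, and margin 1: the hinge loss is already zero, while the cross-entropy loss is tiny but still positive:

```python
import numpy as np

scores = np.array([10.0, -2.0, 3.0])        # correct class is index 0, margin delta = 1
hinge = np.maximum(0, scores - scores[0] + 1.0)
hinge[0] = 0.0
print(hinge.sum())                          # 0.0 -> the SVM has nothing left to optimize
probs = np.exp(scores - scores.max())
probs /= probs.sum()
print(-np.log(probs[0]))                    # ~0.0009 -> small, but softmax still pushes higher
```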

Decision Boundaries

Linear classifiers create hyperplane decision boundaries in the input space:

$$w^T x + b = 0$$
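
Classifying a point in the binary case amounts to checking which side of the hyperplane it falls on, i.e. the sign of $w^T x + b$; a small sketch with made-up weights:

```python
import numpy as np

w, b = np.array([1.0, -2.0]), 0.5                    # made-up weights for a 2-D input
for x in (np.array([3.0, 1.0]), np.array([0.0, 2.0])):
    print(x, np.sign(w @ x + b))                     # +1.0 or -1.0: side of the hyperplane
```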

Limitations:

  • Can only separate linearly separable data
  • Cannot solve the XOR problem with a single linear classifier
  • Limited expressiveness for complex patterns

Why This Matters: Neural networks overcome these limitations by stacking multiple linear transformations with nonlinearities, creating arbitrarily complex decision boundaries.
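
As a minimal illustration of this point, a two-layer network with a ReLU nonlinearity and hand-picked (not learned) weights already computes XOR:

```python
import numpy as np

# Hand-picked weights for a tiny two-layer ReLU network that computes XOR.
W1, b1 = np.array([[1.0, 1.0], [1.0, 1.0]]), np.array([0.0, -1.0])
W2, b2 = np.array([1.0, -2.0]), 0.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = np.maximum(0, W1 @ np.array(x, dtype=float) + b1)  # linear map, then ReLU
    print(x, W2 @ h + b2)                                   # 0.0, 1.0, 1.0, 0.0 -> XOR
```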

Mathematical Formulation

For a linear classifier with weights $W \in \mathbb{R}^{C \times D}$ and bias $b \in \mathbb{R}^C$:

$$f(x; W, b) = Wx + b$$

where:

  • $x \in \mathbb{R}^D$ is the input
  • $C$ is the number of classes
  • Output is a vector of class scores

The classifier predicts: $\hat{y} = \arg\max_k f(x)_k$
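
A direct NumPy transcription of this score function, with random weights purely to check shapes (the values of D, C, and the seed are arbitrary):

```python
import numpy as np

D, C = 4, 3                                   # input dimension and number of classes (arbitrary)
rng = np.random.default_rng(0)
W, b = rng.normal(size=(C, D)), np.zeros(C)   # W maps R^D -> R^C

x = rng.normal(size=D)
scores = W @ x + b                            # vector of C class scores
print(scores, int(np.argmax(scores)))         # predicted class = index of the highest score
```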

Next Steps

After understanding linear classifiers:

  1. Learn about the Perceptron algorithm
  2. Understand why multiple layers are needed
  3. Study activation functions that enable non-linearity