Perceptron

The perceptron is the simplest neural network: a single neuron that applies a linear transformation followed by an activation function. Understanding the perceptron is essential for understanding deeper networks.

Architecture

Components:

  • Input features: $x \in \mathbb{R}^D$
  • Weights: $w \in \mathbb{R}^D$
  • Bias: $b \in \mathbb{R}$
  • Activation function: $f$
  • Output: $y = f(w^T x + b)$

The perceptron computes:

  1. Linear transformation: $z = w^T x + b$
  2. Nonlinear activation: $y = f(z)$
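
For a concrete sense of these two steps, here is a minimal numeric sketch (the input, weight, and bias values are arbitrary illustrations, and sigmoid is used as the activation):

```python
import numpy as np

# Arbitrary example values
x = np.array([0.5, -1.0, 2.0])   # input features, D = 3
w = np.array([0.2, 0.4, -0.1])   # weights
b = 0.3                          # bias

# 1. Linear transformation
z = w @ x + b                    # 0.1 - 0.4 - 0.2 + 0.3 = -0.2

# 2. Nonlinear activation (sigmoid)
y = 1 / (1 + np.exp(-z))         # ~0.45
print(z, y)
```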

Activation Functions

The choice of activation function determines the perceptron’s behavior.

Step Function

$$f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
  • Properties: Simple binary output
  • Problem: Non-differentiable (can’t use gradient descent)
  • Historical: Used in original perceptron (1958)

Sigmoid

$$f(x) = \frac{1}{1 + e^{-x}}$$
  • Properties: Smooth, outputs in $(0, 1)$, interpretable as probability
  • Advantages: Differentiable everywhere, smooth gradients
  • Problem: Saturates (gradients near zero for large $|x|$), causing vanishing gradients

ReLU (Rectified Linear Unit)

$$f(x) = \max(0, x)$$
  • Properties: Piecewise linear, zero for negative inputs; the most common activation in modern networks
  • Advantages: Fast to compute, addresses vanishing gradients, sparse activation
  • Problem: “Dead ReLU”: a neuron whose pre-activation is always negative outputs 0, its gradient is always 0, and it stops learning

Modern Practice: ReLU and its variants (Leaky ReLU, ELU) are preferred for hidden layers, while sigmoid/softmax are used for output layers.
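
The saturation and dead-ReLU issues above are easy to see numerically. A minimal sketch (activation definitions and sample points chosen for illustration) that evaluates the gradients of sigmoid and ReLU at a few inputs:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# Sigmoid gradient: f'(z) = f(z) * (1 - f(z)); it shrinks toward 0 for large |z| (saturation)
s = sigmoid(z)
print(s * (1 - s))             # [~4.5e-05, 0.197, 0.25, 0.197, ~4.5e-05]

# ReLU gradient: 0 for z < 0 and 1 for z > 0; a neuron stuck at z < 0 is "dead"
print((z > 0).astype(float))   # [0. 0. 0. 1. 1.]
```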

Decision Boundaries

A perceptron creates a linear decision boundary in the input space:

$$w^T x + b = 0$$

This is a hyperplane that divides the space into two regions.
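
A minimal sketch of checking which side of the boundary a point falls on (the 2-D weights, bias, and points below are arbitrary illustrations):

```python
import numpy as np

# Arbitrary 2-D example: the boundary is the line x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

points = np.array([[0.0, 0.0],    # below the boundary
                   [1.0, 1.0],    # above the boundary
                   [0.5, 0.5]])   # exactly on the boundary

z = points @ w + b                # the sign tells you which half-space each point is in
print(z)                          # [-1.  1.  0.]
print((z >= 0).astype(int))       # [0 1 1]
```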

Linear Separability

A perceptron can only correctly classify data that is linearly separable: data that can be separated by a single hyperplane.

Classic Example: XOR Problem

The XOR function cannot be solved by a single perceptron:

| $x_1$ | $x_2$ | XOR |
|-------|-------|-----|
| 0     | 0     | 0   |
| 0     | 1     | 1   |
| 1     | 0     | 1   |
| 1     | 1     | 0   |

No single line can separate the positive (1) and negative (0) examples. This limitation led to the development of multi-layer perceptrons.
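
As a quick illustration of why stacking layers helps, XOR can be decomposed as (x1 OR x2) AND NOT (x1 AND x2), and each of those pieces is linearly separable. A hand-weighted sketch (the weights below are chosen by inspection, not learned) of two step-activation layers computing XOR:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: two perceptrons computing OR and AND (weights picked by hand)
h_or  = step(X @ np.array([1.0, 1.0]) - 0.5)   # x1 OR x2
h_and = step(X @ np.array([1.0, 1.0]) - 1.5)   # x1 AND x2

# Output layer: OR AND (NOT AND)  ->  XOR
xor = step(1.0 * h_or - 2.0 * h_and - 0.5)
print(xor)  # [0. 1. 1. 0.]
```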

Perceptron Learning Rule

The original perceptron algorithm (for a sign/step activation, with labels $y_i \in \{-1, +1\}$):

  1. Initialize weights randomly: $w \leftarrow$ random values
  2. For each training example $(x_i, y_i)$:
    • Compute prediction: $\hat{y} = \text{sign}(w^T x_i + b)$
    • If incorrect, update:
      • $w \leftarrow w + \eta \cdot y_i \cdot x_i$
      • $b \leftarrow b + \eta \cdot y_i$
  3. Repeat until convergence

where $\eta$ is the learning rate.

Convergence Guarantee: If the data is linearly separable, the perceptron algorithm converges to a separating hyperplane after a finite number of updates.
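
A minimal sketch of this update loop (assuming ±1 labels; the AND-style toy dataset below is an arbitrary linearly separable example):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Classic perceptron learning rule; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            z = w @ x_i + b
            y_hat = 1.0 if z >= 0 else -1.0
            if y_hat != y_i:               # update only on mistakes
                w += lr * y_i * x_i
                b += lr * y_i
                errors += 1
        if errors == 0:                    # no mistakes in a full pass: converged
            break
    return w, b

# Toy linearly separable data: logical AND with ±1 labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(np.where(X @ w + b >= 0, 1, -1))    # matches y: [-1 -1 -1  1]
```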

Implementation Example

```python
import numpy as np


class Perceptron:
    def __init__(self, input_dim, activation='relu'):
        """
        Initialize perceptron.

        Args:
            input_dim: Number of input features
            activation: 'relu', 'sigmoid', or 'step'
        """
        self.weights = np.random.randn(input_dim) * 0.01
        self.bias = 0
        self.activation = activation

    def forward(self, x):
        """
        Forward pass: compute output.

        Args:
            x: Input data (N, D) or (D,)

        Returns:
            Activation output
        """
        # Linear transformation
        z = x @ self.weights + self.bias

        # Apply activation
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-z))
        elif self.activation == 'step':
            return (z >= 0).astype(float)

    def predict(self, x):
        """Make binary predictions."""
        output = self.forward(x)
        return (output >= 0.5).astype(int)
```
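
A quick usage sketch of the class above (the batch of random inputs is arbitrary):

```python
import numpy as np

p = Perceptron(input_dim=3, activation='sigmoid')
x = np.random.randn(5, 3)     # batch of 5 examples with 3 features each
print(p.forward(x).shape)     # (5,)
print(p.predict(x))           # array of 0/1 predictions
```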

Historical Significance

  • 1958: Frank Rosenblatt invents the perceptron
  • 1969: Minsky and Papert show limitations (XOR problem)
  • 1986: Backpropagation revives neural networks
  • 2012+: Deep learning revolution with multi-layer networks

The perceptron’s limitations drove the development of deeper architectures.

Next Steps

  1. Understand why multiple layers are necessary
  2. Learn backpropagation for training
  3. Explore modern activation functions