Perceptron
The perceptron is the simplest neural network: a single neuron that applies a linear transformation followed by an activation function. Understanding the perceptron is essential for understanding deeper networks.
Architecture
Components:
- Input features: $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
- Weights: $\mathbf{w} = (w_1, w_2, \ldots, w_n)$
- Bias: $b$
- Activation function: $f$
- Output: $\hat{y}$
The perceptron computes two steps (see the sketch below):
- Linear transformation: $z = \mathbf{w}^\top \mathbf{x} + b$
- Nonlinear activation: $\hat{y} = f(z)$
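A minimal NumPy sketch of this computation; the feature and weight values are arbitrary, and sigmoid stands in for $f$:

```python
import numpy as np

x = np.array([0.5, -1.2, 2.0])   # input features
w = np.array([0.4, 0.1, -0.3])   # weights
b = 0.2                          # bias

z = w @ x + b                    # linear transformation
y_hat = 1 / (1 + np.exp(-z))     # nonlinear activation (sigmoid as an example f)
print(z, y_hat)
```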
Activation Functions
The choice of activation function determines the perceptron’s behavior.
Step Function
- Properties: Binary output, $f(z) = 1$ if $z \ge 0$, else $0$
- Problem: Non-differentiable (can’t use gradient descent)
- Historical: Used in the original perceptron (1958)
Sigmoid
- Properties: Smooth, $\sigma(z) = \frac{1}{1 + e^{-z}}$, outputs in $(0, 1)$, interpretable as a probability
- Advantages: Differentiable everywhere, smooth gradients
- Problem: Saturates (gradients near zero for large $|z|$), causing vanishing gradients
ReLU (Rectified Linear Unit)
- Properties: $f(z) = \max(0, z)$; the most common activation in modern networks
- Advantages: Fast to compute, mitigates vanishing gradients, produces sparse activations
- Problem: “Dead ReLU” - a neuron that always outputs 0 also receives zero gradient and stops learning
Modern Practice: ReLU and its variants (Leaky ReLU, ELU) are preferred for hidden layers, while sigmoid/softmax are used for output layers.
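A small numerical comparison of these activations, with the sigmoid gradient included to make saturation concrete (the sample values of `z` are arbitrary):

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def relu(z):
    return np.maximum(0, z)

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(step(z))                        # [0. 0. 1. 1. 1.]
print(sigmoid(z))                     # squashes into (0, 1), flat at both extremes
print(sigmoid(z) * (1 - sigmoid(z)))  # sigmoid gradient: ~0 at z = ±10 (vanishing gradient)
print(relu(z))                        # [ 0.  0.  0.  1. 10.]
```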
Decision Boundaries
A perceptron creates a linear decision boundary in the input space, defined by
$$\mathbf{w}^\top \mathbf{x} + b = 0$$
This is a hyperplane that divides the space into two regions.
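A quick illustration of the two regions, using arbitrary weights and points in 2D; the sign of $\mathbf{w}^\top \mathbf{x} + b$ tells which side of the hyperplane each point lies on:

```python
import numpy as np

w = np.array([1.0, -2.0])   # illustrative weights
b = 0.5                     # illustrative bias

points = np.array([[0.0, 0.0],
                   [3.0, 1.0],
                   [-1.0, 2.0]])

# Sign of w.x + b determines the side of the hyperplane (and hence the predicted class)
print(np.sign(points @ w + b))   # [ 1.  1. -1.]
```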
Linear Separability
A perceptron can only classify data that is linearly separable - data that can be separated by a hyperplane.
Classic Example: XOR Problem
The XOR function cannot be solved by a single perceptron:
| $x_1$ | $x_2$ | XOR |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
No single line can separate the positive (1) and negative (0) examples. This limitation led to the development of multi-layer perceptrons.
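To see why, suppose a single perceptron with step activation computed XOR: its output would be 1 exactly when $w_1 x_1 + w_2 x_2 + b \ge 0$, so the four rows of the table would require

$$
\begin{aligned}
(0,0) \mapsto 0: &\quad b < 0 \\
(0,1) \mapsto 1: &\quad w_2 + b \ge 0 \\
(1,0) \mapsto 1: &\quad w_1 + b \ge 0 \\
(1,1) \mapsto 0: &\quad w_1 + w_2 + b < 0
\end{aligned}
$$

Adding the second and third conditions gives $w_1 + w_2 + 2b \ge 0$, while the first and fourth give $w_1 + w_2 + 2b < 0$ - a contradiction, so no such weights exist.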
Perceptron Learning Rule
The original perceptron algorithm (for step activation):
- Initialize weights to small random values
- For each training example $(\mathbf{x}_i, y_i)$:
  - Compute the prediction: $\hat{y}_i = f(\mathbf{w}^\top \mathbf{x}_i + b)$
  - If the prediction is incorrect, update: $\mathbf{w} \leftarrow \mathbf{w} + \eta\,(y_i - \hat{y}_i)\,\mathbf{x}_i$ and $b \leftarrow b + \eta\,(y_i - \hat{y}_i)$
- Repeat until convergence
where $\eta$ is the learning rate.
Convergence Guarantee: If the data is linearly separable, the perceptron algorithm will converge in finite time.
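A minimal sketch of this update loop in NumPy, assuming a step activation; the AND dataset and hyperparameters are only illustrative, and a class-based implementation follows in the next section:

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    """Classic perceptron learning rule with a step activation."""
    w = np.random.randn(X.shape[1]) * 0.01
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            y_hat = float(x_i @ w + b >= 0)    # step activation
            if y_hat != y_i:
                w += lr * (y_i - y_hat) * x_i  # weight update
                b += lr * (y_i - y_hat)        # bias update
                errors += 1
        if errors == 0:                        # converged: every example classified correctly
            break
    return w, b

# Example: the linearly separable AND function
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0.0, 0.0, 0.0, 1.0])
w, b = train_perceptron(X, y)
print((X @ w + b >= 0).astype(int))   # expected: [0 0 0 1]
```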
Implementation Example
```python
import numpy as np


class Perceptron:
    def __init__(self, input_dim, activation='relu'):
        """
        Initialize perceptron.

        Args:
            input_dim: Number of input features
            activation: 'relu', 'sigmoid', or 'step'
        """
        self.weights = np.random.randn(input_dim) * 0.01
        self.bias = 0
        self.activation = activation

    def forward(self, x):
        """
        Forward pass: compute output.

        Args:
            x: Input data (N, D) or (D,)

        Returns:
            Activation output
        """
        # Linear transformation
        z = x @ self.weights + self.bias

        # Apply activation
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-z))
        elif self.activation == 'step':
            return (z >= 0).astype(float)
        else:
            raise ValueError(f"Unknown activation: {self.activation}")

    def predict(self, x):
        """Make binary predictions."""
        output = self.forward(x)
        return (output >= 0.5).astype(int)
```
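A brief usage sketch of the class above; since it has no training method, the weights here are set by hand (to values that happen to implement AND) purely for illustration:

```python
import numpy as np

p = Perceptron(input_dim=2, activation='step')
p.weights = np.array([1.0, 1.0])   # hand-picked weights implementing AND
p.bias = -1.5

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(p.forward(X))   # [0. 0. 0. 1.]
print(p.predict(X))   # [0 0 0 1]
```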
Historical Significance
- 1958: Frank Rosenblatt invents the perceptron
- 1969: Minsky and Papert show limitations (XOR problem)
- 1986: Backpropagation revives neural networks
- 2012+: Deep learning revolution with multi-layer networks
The perceptron’s limitations drove the development of deeper architectures.
Learning Resources
Videos
- Welch Labs: Neural Networks Demystified Part 1 (15 min)
- 3Blue1Brown: But what is a neural network? (19 min)
Related Concepts
- Linear Classifiers - Foundation for perceptron
- Multi-Layer Perceptrons - Overcoming perceptron limitations
- Activation Functions - Detailed comparison
- Backpropagation - Training multi-layer networks
Next Steps
- Understand why multiple layers are necessary
- Learn backpropagation for training
- Explore modern activation functions