Perceptron

The perceptron is the simplest neural network: a single neuron that applies a linear transformation followed by an activation function. Understanding the perceptron is essential for understanding deeper networks.

Architecture

Components:

  • Input features: $x \in \mathbb{R}^D$
  • Weights: $w \in \mathbb{R}^D$
  • Bias: $b \in \mathbb{R}$
  • Activation function: $f$
  • Output: $y = f(w^T x + b)$

The perceptron computes:

  1. Linear transformation: $z = w^T x + b$
  2. Nonlinear activation: $y = f(z)$
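
For a concrete sense of these two steps, here is a minimal numeric sketch (the input, weight, and bias values are arbitrary illustrations, and sigmoid is used as the activation):

```python
import numpy as np

# Arbitrary example values
x = np.array([0.5, -1.0, 2.0])   # input features, D = 3
w = np.array([0.2, 0.4, -0.1])   # weights
b = 0.3                          # bias

# 1. Linear transformation
z = w @ x + b                    # 0.1 - 0.4 - 0.2 + 0.3 = -0.2

# 2. Nonlinear activation (sigmoid)
y = 1 / (1 + np.exp(-z))         # ~0.45
print(z, y)
```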

Activation Functions

The choice of activation function determines the perceptron’s behavior.

Step Function

$$f(x) = \begin{cases} 1 & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$$
  • Properties: Simple binary output
  • Problem: Non-differentiable (can’t use gradient descent)
  • Historical: Used in original perceptron (1958)

Sigmoid

$$f(x) = \frac{1}{1 + e^{-x}}$$
  • Properties: Smooth, outputs in $(0, 1)$, interpretable as probability
  • Advantages: Differentiable everywhere, smooth gradients
  • Problem: Saturates (gradients near zero for large $|x|$), causing vanishing gradients

ReLU (Rectified Linear Unit)

$$f(x) = \max(0, x)$$
  • Properties: Piecewise linear, zero for negative inputs; the most common activation in modern networks
  • Advantages: Fast to compute, addresses vanishing gradients, sparse activation
  • Problem: “Dead ReLU”: a neuron whose pre-activation is always negative outputs 0, its gradient is always 0, and it stops learning

Modern Practice: ReLU and its variants (Leaky ReLU, ELU) are preferred for hidden layers, while sigmoid/softmax are used for output layers.
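
The saturation and dead-ReLU issues above are easy to see numerically. A minimal sketch (activation definitions and sample points chosen for illustration) that evaluates the gradients of sigmoid and ReLU at a few inputs:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

# Sigmoid gradient: f'(z) = f(z) * (1 - f(z)); it shrinks toward 0 for large |z| (saturation)
s = sigmoid(z)
print(s * (1 - s))             # [~4.5e-05, 0.197, 0.25, 0.197, ~4.5e-05]

# ReLU gradient: 0 for z < 0 and 1 for z > 0; a neuron stuck at z < 0 is "dead"
print((z > 0).astype(float))   # [0. 0. 0. 1. 1.]
```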

Decision Boundaries

A perceptron creates a linear decision boundary in the input space:

$$w^T x + b = 0$$

This is a hyperplane that divides the space into two regions.
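
A minimal sketch of checking which side of the boundary a point falls on (the 2-D weights, bias, and points below are arbitrary illustrations):

```python
import numpy as np

# Arbitrary 2-D example: the boundary is the line x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0

points = np.array([[0.0, 0.0],    # below the boundary
                   [1.0, 1.0],    # above the boundary
                   [0.5, 0.5]])   # exactly on the boundary

z = points @ w + b                # the sign tells you which half-space each point is in
print(z)                          # [-1.  1.  0.]
print((z >= 0).astype(int))       # [0 1 1]
```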

Linear Separability

A perceptron can only correctly classify data that is linearly separable: data that can be separated by a single hyperplane.

Classic Example: XOR Problem

The XOR function cannot be solved by a single perceptron:

| $x_1$ | $x_2$ | XOR |
|-------|-------|-----|
| 0     | 0     | 0   |
| 0     | 1     | 1   |
| 1     | 0     | 1   |
| 1     | 1     | 0   |

No single line can separate the positive (1) and negative (0) examples. This limitation led to the development of multi-layer perceptrons.
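
As a quick illustration of why stacking layers helps, XOR can be decomposed as (x1 OR x2) AND NOT (x1 AND x2), and each of those pieces is linearly separable. A hand-weighted sketch (the weights below are chosen by inspection, not learned) of two step-activation layers computing XOR:

```python
import numpy as np

def step(z):
    return (z >= 0).astype(float)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

# Hidden layer: two perceptrons computing OR and AND (weights picked by hand)
h_or  = step(X @ np.array([1.0, 1.0]) - 0.5)   # x1 OR x2
h_and = step(X @ np.array([1.0, 1.0]) - 1.5)   # x1 AND x2

# Output layer: OR AND (NOT AND)  ->  XOR
xor = step(1.0 * h_or - 2.0 * h_and - 0.5)
print(xor)  # [0. 1. 1. 0.]
```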

Perceptron Learning Rule

The original perceptron algorithm (for a sign/step activation, with labels $y_i \in \{-1, +1\}$):

  1. Initialize weights randomly: $w \leftarrow$ random values
  2. For each training example $(x_i, y_i)$:
    • Compute prediction: $\hat{y} = \text{sign}(w^T x_i + b)$
    • If incorrect, update:
      • $w \leftarrow w + \eta \cdot y_i \cdot x_i$
      • $b \leftarrow b + \eta \cdot y_i$
  3. Repeat until convergence

where $\eta$ is the learning rate.

Convergence Guarantee: If the data is linearly separable, the perceptron algorithm converges to a separating hyperplane after a finite number of updates.
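
A minimal sketch of this update loop (assuming ±1 labels; the AND-style toy dataset below is an arbitrary linearly separable example):

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=100):
    """Classic perceptron learning rule; labels y must be in {-1, +1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x_i, y_i in zip(X, y):
            z = w @ x_i + b
            y_hat = 1.0 if z >= 0 else -1.0
            if y_hat != y_i:               # update only on mistakes
                w += lr * y_i * x_i
                b += lr * y_i
                errors += 1
        if errors == 0:                    # no mistakes in a full pass: converged
            break
    return w, b

# Toy linearly separable data: logical AND with ±1 labels
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print(np.where(X @ w + b >= 0, 1, -1))    # matches y: [-1 -1 -1  1]
```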

Implementation Example

```python
import numpy as np


class Perceptron:
    def __init__(self, input_dim, activation='relu'):
        """
        Initialize perceptron.

        Args:
            input_dim: Number of input features
            activation: 'relu', 'sigmoid', or 'step'
        """
        self.weights = np.random.randn(input_dim) * 0.01
        self.bias = 0
        self.activation = activation

    def forward(self, x):
        """
        Forward pass: compute output.

        Args:
            x: Input data (N, D) or (D,)

        Returns:
            Activation output
        """
        # Linear transformation
        z = x @ self.weights + self.bias

        # Apply activation
        if self.activation == 'relu':
            return np.maximum(0, z)
        elif self.activation == 'sigmoid':
            return 1 / (1 + np.exp(-z))
        elif self.activation == 'step':
            return (z >= 0).astype(float)

    def predict(self, x):
        """Make binary predictions."""
        output = self.forward(x)
        return (output >= 0.5).astype(int)
```
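
A quick usage sketch of the class above (the batch of random inputs is arbitrary):

```python
import numpy as np

p = Perceptron(input_dim=3, activation='sigmoid')
x = np.random.randn(5, 3)     # batch of 5 examples with 3 features each
print(p.forward(x).shape)     # (5,)
print(p.predict(x))           # array of 0/1 predictions
```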

Historical Significance

  • 1958: Frank Rosenblatt invents the perceptron
  • 1969: Minsky and Papert show limitations (XOR problem)
  • 1986: Backpropagation revives neural networks
  • 2012+: Deep learning revolution with multi-layer networks

The perceptron’s limitations drove the development of deeper architectures.

Next Steps

  1. Understand why multiple layers are necessary
  2. Learn backpropagation for training
  3. Explore modern activation functions