Token Bucket

Architecture

A rate limiting algorithm where tokens are added to a bucket at a fixed rate. Each request consumes a token. Allows controlled bursts up to the bucket capacity while enforcing a sustained rate limit.

Updated Mar 17, 2026

Full Explanation

Token bucket is one of the most widely used rate limiting algorithms in networking and CDN systems. The concept is simple: imagine a bucket that holds tokens. Tokens are added at a steady rate (say, 10 per second). Each request costs one token. If the bucket has tokens, the request proceeds and a token is removed. If the bucket is empty, the request is rejected (or queued). The bucket has a maximum capacity, which determines the maximum burst size.

Two parameters define the behavior: the fill rate (tokens per second, your sustained rate limit) and the bucket size (maximum burst). If you set a fill rate of 100 requests/second and a bucket size of 200, a client can burst up to 200 requests instantly (draining the bucket), then sustain 100 requests/second as tokens refill. This burst allowance is what makes token bucket practical: real traffic is bursty, and rigid per-second limits create unnecessary rejections.
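The arithmetic above can be sketched in a few lines (the 100/200 figures are the ones from this paragraph, not a recommended configuration):

```python
# Sketch of the fill-rate/bucket-size arithmetic:
# rate = 100 tokens/s (sustained limit), capacity = 200 (max burst).
rate, capacity = 100, 200
tokens = capacity

# An instant burst drains the bucket: only the first 200 of 250
# back-to-back requests get through.
allowed = 0
for _ in range(250):
    if tokens >= 1:
        tokens -= 1
        allowed += 1
print(allowed)  # 200

# One second later the bucket has refilled rate * 1s = 100 tokens,
# so the client can sustain 100 requests/second from here on.
tokens = min(capacity, tokens + rate * 1.0)
print(int(tokens))  # 100
```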

CDN edge servers use token bucket extensively. Cloudflare's rate limiting, AWS WAF rate rules, and Fastly's rate limiting all implement variants of token bucket. At the edge, each client (identified by IP, API key, or other signal) gets its own virtual bucket. The CDN tracks token state in memory or a distributed store and makes allow/deny decisions in microseconds.
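A minimal sketch of those per-client virtual buckets, assuming a single in-memory process (the bucket logic mirrors the Python implementation in Examples; the key values are illustrative):

```python
import time
from collections import defaultdict

class TokenBucket:
    """Refill-on-read bucket: state is one count and one timestamp."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# One virtual bucket per client key (IP, API key, ...), created lazily.
buckets = defaultdict(lambda: TokenBucket(rate=10, capacity=20))

def allow_request(client_key):
    return buckets[client_key].allow()
```

A production edge server would also need to evict idle entries; dropping an idle bucket is safe, since a recreated bucket starts full and only errs in the client's favor.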

Compared to other rate limiting algorithms: a fixed window counter (100 requests per minute) has a boundary problem where a client can send 100 requests at 0:59 and 100 more at 1:01, getting 200 through in about 2 seconds. A sliding window fixes this but requires tracking individual request timestamps. A leaky bucket processes requests at a constant rate (no bursts allowed), which is simpler but less flexible. Token bucket gives you burst tolerance with a constant-space check: just a token count and a last-refill timestamp per client.
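The fixed-window boundary problem is easy to demonstrate; a small sketch, with the limit and timestamps chosen to match the example above:

```python
# Fixed-window counter: 100 requests per 60-second window, counted by
# window index. Shows how the window boundary doubles the burst.
LIMIT, WINDOW = 100, 60
counts = {}

def fixed_window_allow(t):
    w = int(t // WINDOW)               # which window this instant falls in
    counts[w] = counts.get(w, 0) + 1
    return counts[w] <= LIMIT

# 100 requests at t=59s land in window 0 and are all allowed...
at_59 = sum(fixed_window_allow(59.0) for _ in range(100))
# ...and 100 more at t=61s land in window 1, also all allowed.
at_61 = sum(fixed_window_allow(61.0) for _ in range(100))
print(at_59 + at_61)  # 200 requests accepted within ~2 seconds
```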

Distributed token bucket is the hard part. When rate limiting runs across hundreds of CDN edge PoPs, keeping token counts perfectly synchronized is impractical. Most CDN implementations use approximate approaches: each PoP tracks its own bucket and periodically synchronizes, or they use eventual consistency with a central store. This means your rate limit of 1000 requests/minute might actually allow 1200 across all PoPs. For most use cases, this approximation is acceptable.
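The overshoot can be made concrete with back-of-envelope numbers (the 20% slack factor here is illustrative, not a property of any particular CDN):

```python
# Hypothetical sketch: a nominal global limit split across PoPs that
# only synchronize periodically. Each PoP gets local headroom so a
# client isn't rejected just because its traffic skews to one PoP;
# the cost is that worst-case admissions exceed the nominal limit
# between syncs.
global_limit = 1000        # requests per minute, nominal
n_pops = 5
slack = 1.2                # 20% local headroom (illustrative)

local_limit = global_limit * slack / n_pops   # per-PoP budget
worst_case = int(local_limit * n_pops)        # if every PoP maxes out
print(local_limit, worst_case)  # 240.0 1200
```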

Nginx's limit_req module uses a variant called leaky bucket with burst, which behaves similarly to token bucket. The zone tracks tokens per key (usually client IP), the rate sets the refill speed, and the burst parameter sets the bucket capacity. Requests beyond the burst are rejected with 503.

Examples

# Python: simple token bucket implementation
import time

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # max burst
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        elapsed = now - self.last_refill
        self.tokens = min(
            self.capacity,
            self.tokens + elapsed * self.rate
        )
        self.last_refill = now

        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Usage: 10 req/sec, burst of 20
bucket = TokenBucket(rate=10, capacity=20)
for i in range(25):
    print(f"Request {i}: {'allowed' if bucket.allow() else 'denied'}")

# Nginx: rate limiting (leaky bucket with burst)
http {
    # 10 requests/sec per IP, bucket size 20
    limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;

    server {
        location /api/ {
            limit_req zone=api burst=20 nodelay;
            # nodelay: process burst immediately
            # without nodelay: queue burst requests
        }
    }
}

# Redis: distributed token bucket
# Lua script for an atomic check; run via EVAL/EVALSHA
# (multi-field HSET requires Redis >= 4.0)
local key = KEYS[1]
local rate = tonumber(ARGV[1])      -- tokens/sec
local capacity = tonumber(ARGV[2])  -- max tokens
local now = tonumber(ARGV[3])
local tokens = tonumber(redis.call('hget', key, 'tokens') or capacity)
local last = tonumber(redis.call('hget', key, 'last') or now)
local elapsed = now - last
tokens = math.min(capacity, tokens + elapsed * rate)
if tokens >= 1 then
    redis.call('hset', key, 'tokens', tokens - 1, 'last', now)
    -- expire idle keys so state doesn't grow unboundedly; a recreated
    -- bucket starts full, which only errs in the client's favor
    redis.call('expire', key, math.ceil(capacity / rate) * 2)
    return 1  -- allowed
end
return 0  -- denied


Related CDN concepts include:

  • Rate Limiting — Restricting the number of requests a client can make within a time window. Protects origins …
  • WAF — Web Application Firewall. Inspects HTTP requests at the CDN edge and blocks malicious traffic: SQL …
  • API Gateway — A server that acts as the single entry point for API requests, handling authentication, rate …