Pragusga

Rate limiting at scale: algorithms, tradeoffs, and implementation

How I think about rate limiting in production: which algorithm to choose, how to implement it, and what changes when you have more than one server.

This post is for anyone who has to add or understand rate limiting: juniors who’ve never built one, mid-level engineers who’ve used a library and want to know what’s under the hood, and seniors who want a clear framework for choosing and implementing it. I’ll explain the ideas in plain language first, then show code and diagrams. By the end you’ll know what rate limiting is, why it matters, which algorithm to pick, and how to implement and run it.

What is rate limiting?

Rate limiting means: “Only allow so much activity per person (or per API key, or per IP) in a given time.” For example: “Each user can call this API at most 100 times per minute.” If they go over that, you don’t process the request; you return an error (like HTTP 429 Too Many Requests) or ask them to wait.

Why do it? Three main reasons:

  1. Protect your server. If one client sends a million requests per second, your service can crash or become unusable for everyone. A limit keeps any single client from doing that.
  2. Enforce quotas. You might offer 100 free requests per day and more for paying customers. Rate limiting is how you enforce that.
  3. Smooth traffic. Sometimes the system downstream of you (a database, another API) can’t handle big spikes. Limiting how fast requests come in keeps things steady.

So the goal is: count how much each “user” (identifier) has done recently, and allow or block the next request based on that.

What do you need to decide before picking an algorithm?

I always write this down first so the rest of the design is clear.

What are you limiting?
“Per identifier” usually means per user ID, per API key, or per IP address. You pick one (or combine them). Each identifier gets its own counter or “bucket.”

What happens when the limit is hit?
Most of the time you reject the request and return something like HTTP 429. Sometimes you queue the request and process it later. You need a consistent policy so clients know what to expect.

One server or many?
If you have a single server, you can keep all state in memory. If you have many servers (e.g. 10 copies of your API behind a load balancer), they don’t share memory. So “this user already made 50 requests” has to live somewhere everyone can see (e.g. Redis). I’ll come back to that in the distributed section.

How strict do you need to be?
Some systems are okay with “about 100 per minute” (a bit of wiggle room at the boundary between one minute and the next). Others need “never more than 100 in any rolling 60-second window.” That choice affects which algorithm you use.

How fast and how much memory?
The rate limiter runs on every request, so it should be cheap (constant time, or close to it). And you don’t want to store unbounded data per user, or memory will grow without limit.

Once I have those answers, I choose an algorithm.

The three main algorithms (in plain language)

There are three common ways to implement “only N requests per time window.” The diagram below shows the idea behind each.

flowchart LR
  subgraph A[Fixed window]
    A1[Window 1] --> A2[Window 2]
  end
  subgraph B[Sliding window]
    B1[Now - 1s ... Now]
  end
  subgraph C[Token bucket]
    C1[Tokens refill] --> C2[Consume per request]
  end

1. Fixed window

You split time into fixed chunks (e.g. “minute 1,” “minute 2”). For each chunk, you count how many requests that user made. If they’re under the limit (e.g. 100), you allow; if not, you reject. When the next chunk starts, the counter resets to zero.

Upside: Very simple. You only need one number per user (the count) and when the window started.

Downside: boundary burst. Imagine the limit is 100 per minute. A user can send 100 requests in the last second of minute 1, then 100 more in the first second of minute 2. So in 2 seconds they sent 200 requests, but your rule was “100 per minute.” For strict “never more than 100 in any 60 seconds,” fixed window is the wrong tool.
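The idea above fits in a few lines. Here's a minimal sketch of a fixed-window counter (the names `FixedWindowState` and `fixedWindowAllow` are my own, not from any library):

```typescript
interface FixedWindowState {
  windowStartMs: number; // when the current window began
  count: number;         // requests seen in this window
}

// Allow at most `limit` requests per `windowMs`. The counter resets
// whenever a new window begins — which is exactly where the boundary
// burst comes from.
function fixedWindowAllow(
  state: FixedWindowState,
  limit: number,
  windowMs: number,
  nowMs: number
): { allowed: boolean; newState: FixedWindowState } {
  // Crossed into a new window: reset the counter.
  if (nowMs - state.windowStartMs >= windowMs) {
    state = { windowStartMs: nowMs, count: 0 };
  }
  if (state.count >= limit) {
    return { allowed: false, newState: state };
  }
  return { allowed: true, newState: { ...state, count: state.count + 1 } };
}
```

Note that the state really is just two numbers per user, which is why this variant is so cheap.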

2. Sliding window

Here the “window” is always “the last 60 seconds from right now.” So at 10:00:30, you count requests between 09:59:30 and 10:00:30. At 10:00:31, you count between 09:59:31 and 10:00:31. The window slides with time. There’s no hard reset, so you don’t get that double burst at the boundary.

Upside: Matches the intuition “at most N in any window of length W.” No boundary burst.

Downside: You need to store something per request (e.g. timestamps) or an approximation of them, and you have to clean up old data. So a bit more state and a bit more work per request than fixed window.
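The simplest exact version is a "sliding window log": store a timestamp per request and drop the old ones on each check. A sketch (function name is my own; for brevity it mutates the array in place):

```typescript
// Sliding window log: keep one timestamp per request, drop entries older
// than the window, then compare the remaining count against the limit.
function slidingWindowAllow(
  timestampsMs: number[], // past request times for this key, oldest first
  limit: number,
  windowMs: number,
  nowMs: number
): boolean {
  // Cleanup step: drop timestamps that have slid out of the window.
  while (timestampsMs.length > 0 && timestampsMs[0] <= nowMs - windowMs) {
    timestampsMs.shift();
  }
  if (timestampsMs.length >= limit) {
    return false;
  }
  timestampsMs.push(nowMs);
  return true;
}
```

This is exact but stores up to `limit` timestamps per key; production systems often use an approximation (e.g. weighting two adjacent fixed windows) to get the same effect with two counters.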

3. Token bucket

Think of a bucket that can hold at most C tokens. Tokens are added to the bucket at a steady rate (e.g. R tokens per second) up to the cap. Each request costs one token (or more). If there are enough tokens, you take them and allow the request; if not, you reject. So over the long run you can’t exceed R requests per second, but you can “save up” tokens and burst up to C in a short time.

Upside: Simple to implement with two numbers per user (current token count, last time you refilled). Constant time per request. Lets you allow small bursts without violating the long-term rate.

Downside: You have two knobs (capacity C and refill rate R). If you only care about “strictly N per second with no burst,” sliding window is a closer fit; token bucket is for when you’re okay with some burst.

Which one do I use?

  • Fixed window: Only when a rough “per minute” or “per day” limit is enough and boundary burst is acceptable (e.g. “1000 requests per day”).
  • Sliding window: When you need “at most N in any window of W” and no burst at the seam (e.g. “100 per minute” in the strict sense).
  • Token bucket: When you want a steady rate but also allow short bursts, and you want minimal state and O(1) per request. This is what I use most for APIs.

Token bucket: how it works and the code

I’ll walk through token bucket in a bit more detail, then show code.

Each user (identifier) has a bucket. The bucket has:

  • A capacity (max tokens it can hold).
  • A refill rate (how many tokens per second are added).
  • A current token count (how many tokens are in the bucket right now).
  • A last refill time (when we last updated the count).

When a request comes in:

  1. Figure out how much time passed since the last refill. Add tokens for that time (rate × time), but don’t go above the capacity.
  2. If the current token count is at least 1 (or the “cost” of the request), subtract the cost and allow the request. Save the new count and time.
  3. If not, don’t subtract anything, return 429 (or your policy), and still save the updated count (so the next request sees the refilled bucket).

We don’t need a background timer. We refill on demand whenever we see a request: “How much time passed? Add that many tokens. Then try to consume one.”

Here’s a minimal version in TypeScript. Each key (user id, API key, etc.) has its own bucket. The state is just tokens and lastRefillMs.

interface TokenBucketConfig {
  capacity: number;    // max tokens in the bucket
  refillRate: number;  // tokens added per second
}

interface BucketState {
  tokens: number;        // how many tokens right now
  lastRefillMs: number;  // last time we updated (milliseconds)
}

// Refill: add tokens for elapsed time, cap at capacity.
function refill(config: TokenBucketConfig, state: BucketState, nowMs: number): BucketState {
  const elapsedSec = (nowMs - state.lastRefillMs) / 1000;
  const added = elapsedSec * config.refillRate;
  const tokens = Math.min(config.capacity, state.tokens + added);
  return { tokens, lastRefillMs: nowMs };
}

// Try to consume one token (or more). Returns allowed + new state.
function tryConsume(
  config: TokenBucketConfig,
  state: BucketState,
  nowMs: number,
  cost: number = 1
): { allowed: boolean; newState: BucketState } {
  const afterRefill = refill(config, state, nowMs);
  if (afterRefill.tokens < cost) {
    return { allowed: false, newState: afterRefill };
  }
  return {
    allowed: true,
    newState: {
      tokens: afterRefill.tokens - cost,
      lastRefillMs: afterRefill.lastRefillMs,
    },
  };
}

How to use it: On each request, load the bucket state for that user (or create it if first time). Call tryConsume(config, state, Date.now()). If allowed is true, process the request and save newState. If false, return 429 and still save newState (so the bucket keeps refilling for the next request).
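As a sketch of that wiring, here is a single-server version with an in-memory `Map` as the store (the helper `checkRateLimit` and the numbers are illustrative; the token-bucket definitions are condensed from the snippet above so this runs standalone):

```typescript
// Condensed from the snippet above, repeated so this sketch runs standalone.
interface TokenBucketConfig { capacity: number; refillRate: number; }
interface BucketState { tokens: number; lastRefillMs: number; }

function tryConsume(
  config: TokenBucketConfig, state: BucketState, nowMs: number, cost = 1
): { allowed: boolean; newState: BucketState } {
  const elapsedSec = (nowMs - state.lastRefillMs) / 1000;
  const tokens = Math.min(config.capacity, state.tokens + elapsedSec * config.refillRate);
  if (tokens < cost) return { allowed: false, newState: { tokens, lastRefillMs: nowMs } };
  return { allowed: true, newState: { tokens: tokens - cost, lastRefillMs: nowMs } };
}

const config: TokenBucketConfig = { capacity: 10, refillRate: 5 }; // illustrative numbers
const buckets = new Map<string, BucketState>();

// Returns true if the request for `key` should be processed.
function checkRateLimit(key: string, nowMs: number = Date.now()): boolean {
  // First request for this key: start with a full bucket.
  const state = buckets.get(key) ?? { tokens: config.capacity, lastRefillMs: nowMs };
  const { allowed, newState } = tryConsume(config, state, nowMs);
  buckets.set(key, newState); // save state whether allowed or not
  return allowed;
}
```

In a real server, `checkRateLimit` would sit in middleware before the handler; the key point is that state is saved on both the allow and the deny path.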

One important detail: Use the same time source everywhere (e.g. Date.now() or your server’s clock). Don’t mix clocks or you’ll get wrong refill amounts.

What happens on each request (the flow)

This sequence diagram shows the path of a single request through the rate limiter.

sequenceDiagram
  participant C as Client
  participant L as Rate limiter
  participant S as Service

  C->>L: Request(key)
  L->>L: Load or init bucket state
  L->>L: Refill tokens(now)
  alt tokens >= cost
    L->>L: Subtract cost, save state
    L->>S: Forward request
    S->>C: Response
  else tokens < cost
    L->>L: Save state (no consume)
    L->>C: 429 Too Many Requests
  end

In words: the client sends a request with some key (user id, API key, etc.). The rate limiter loads that key’s bucket (or creates an empty one). It refills tokens based on how much time has passed. If there are enough tokens, it subtracts the cost, saves the new state, and forwards the request to your service. If not, it still saves the state (so refill is correct next time) and returns 429. We never block or wait on a timer; we always decide immediately based on the current state.

When you have more than one server (distributed rate limiting)

So far everything assumed a single server with one in-memory store. In production you often have many servers behind a load balancer. Request 1 might hit server A, request 2 server B. If each server has its own bucket for the same user, that user effectively has one bucket per server and can send roughly (number of servers) × more traffic than you intended. So you need a single place that decides “this user is over limit” for all servers.

Two common approaches:

Option 1: Central store (e.g. Redis)
Every server asks the store on every request: “For key X, try to consume one token.” The store holds the bucket state and runs the same refill-and-consume logic. It returns “allowed” or “denied.” All servers see the same state, so the limit is global. The cost is an extra network round-trip per request. If the store is down, you need a policy (e.g. allow everyone or deny everyone) and possibly a circuit breaker. I use this when I need a strict global limit and can afford the latency.
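To make the idea concrete, here is a sketch of the central store as a small class. In production this would be Redis, with the refill-and-consume logic in a Lua script (run via `EVAL`) so it executes atomically per key; the class below is an in-memory stand-in for that role, and its name and shape are my own:

```typescript
// Stand-in for a central store: every server calls the same instance,
// so the limit is global. In production the same logic would live in a
// Redis Lua script so the read-modify-write is atomic per key.
class CentralStore {
  private buckets = new Map<string, { tokens: number; lastRefillMs: number }>();

  constructor(private capacity: number, private refillRate: number) {}

  // The single decision point: refill for elapsed time, then try to consume.
  tryConsume(key: string, nowMs: number): boolean {
    const state = this.buckets.get(key) ?? { tokens: this.capacity, lastRefillMs: nowMs };
    const elapsedSec = (nowMs - state.lastRefillMs) / 1000;
    state.tokens = Math.min(this.capacity, state.tokens + elapsedSec * this.refillRate);
    state.lastRefillMs = nowMs;
    const allowed = state.tokens >= 1;
    if (allowed) state.tokens -= 1;
    this.buckets.set(key, state); // save either way, so refill carries forward
    return allowed;
  }
}

const store = new CentralStore(5, 1); // illustrative: capacity 5, 1 token/sec
```

Requests arriving at any server drain the same bucket, which is exactly the property the per-server version lacks.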

Option 2: Local buckets with periodic sync
Each server keeps a local bucket per key. Periodically (e.g. every few seconds) servers report usage to a central place and get back an updated “quota” or “share” of the global limit. Requests are checked against the local bucket only, so no extra round-trip per request. The limit is only approximate: right after a sync, multiple servers might still allow a bit more than the global cap until the next sync. I use this when I care more about latency and can tolerate a small overshoot.
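A very naive version of this scheme, just to show the moving parts (class and function names are my own, and the even split is the simplest possible sharing policy):

```typescript
// Each node checks only its local allowance per request (no network hop);
// a periodic sync redistributes the global limit across nodes.
class LocalQuotaNode {
  private remaining = 0;

  grant(quota: number) {          // called at each sync
    this.remaining = quota;
  }

  allow(): boolean {              // per-request check, purely local
    if (this.remaining < 1) return false;
    this.remaining -= 1;
    return true;
  }
}

// Naive sync policy: split a global per-interval limit evenly across nodes.
// Real systems weight the shares by each node's recent usage instead.
function syncQuotas(nodes: LocalQuotaNode[], globalLimit: number) {
  const share = Math.floor(globalLimit / nodes.length);
  for (const node of nodes) node.grant(share);
}
```

The overshoot the text mentions shows up here too: between syncs, a user whose traffic all lands on one node can be blocked early, while spread-out traffic can slightly exceed a share-based cap.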

The diagram below sketches both.

flowchart TB
  subgraph Centralized
    R1[Req] --> Redis[(Redis)]
    Redis --> R2[Allow/Deny]
  end
  subgraph Distributed
    D1[Node 1] --> D2[Local bucket]
    D3[Node 2] --> D4[Local bucket]
    D2 --> D5[Periodic sync]
    D4 --> D5
  end

Operations: metrics, tuning, testing

Metrics: I track how many requests are allowed vs denied per key (or per key prefix if there are too many keys). I also track the latency of the rate limiter itself and, if I use Redis, its latency and errors. If the store is down, I need to know so I can trigger the fallback (allow or deny by default).

Tuning: For token bucket I often start with capacity = 2 × refillRate so a user can burst to about 2× the rate for a short time. Then I adjust based on real traffic and how much the downstream can handle.

Testing: I unit-test the refill math (e.g. after 1 second with rate 10, the bucket gains 10 tokens, capped at capacity). I integration-test that a client that keeps sending eventually gets 429, and that after waiting a bit they can send again.
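As a sketch, that refill unit test might look like this (the refill logic is repeated from the snippet above, with capacity and rate passed as plain arguments so it runs standalone):

```typescript
interface BucketState { tokens: number; lastRefillMs: number; }

// Same refill logic as in the token bucket snippet above.
function refill(capacity: number, refillRate: number, state: BucketState, nowMs: number): BucketState {
  const elapsedSec = (nowMs - state.lastRefillMs) / 1000;
  return {
    tokens: Math.min(capacity, state.tokens + elapsedSec * refillRate),
    lastRefillMs: nowMs,
  };
}

// After 1 second at rate 10, a bucket holding 2 tokens should hold 12.
const after1s = refill(15, 10, { tokens: 2, lastRefillMs: 0 }, 1000);
if (after1s.tokens !== 12) throw new Error("expected 12 tokens after 1s");

// After another second it would gain 10 more, but is capped at capacity 15.
const after2s = refill(15, 10, after1s, 2000);
if (after2s.tokens !== 15) throw new Error("expected refill to cap at capacity 15");
```

The integration test (keep sending until 429, wait, send again) exercises the same math end to end, just through the HTTP layer.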

Summary table and how to choose

| Aspect | Fixed window | Sliding window | Token bucket |
| --- | --- | --- | --- |
| Boundary burst | Yes (2× at window seam) | No | No (bounded by capacity) |
| State size | Small (counter + window start) | Larger (timestamps or approximation) | Small (tokens + last refill time) |
| Per-request cost | O(1) | O(log n) or O(k) | O(1) |
| Typical use | Coarse quotas (e.g. per day) | Strict “N in any window W” | APIs, backpressure, allow small bursts |

In one sentence: Use fixed window only when boundary burst is okay. Use sliding window when you need “at most N in any rolling window W” with no burst. Use token bucket when you want a steady rate with controlled burst and minimal state. Implement refill-on-read, keep state updates atomic when you go distributed, and add metrics so you can tune and debug in production.