Custom LLM Rate Limits
Set custom rate limits for model provider API calls. Control usage by request count, cost, or custom properties to manage expenses and prevent unintended overuse.
Who can use this feature: Anyone on any plan.
Introduction
Rate limits are an important feature that allows you to control the number of requests made with your API key within a specific time window. For example, you can limit users to 1000 requests per day or 60 requests per minute. By implementing rate limits, you can prevent abuse while protecting your resources from being overwhelmed by excessive traffic.
Why Rate Limit
- Prevent abuse of the API: Cap the total requests any user can make in a given period.
- Protect resources from excessive traffic: Maintain availability for all users.
- Control operational cost: Limit the total number of requests sent and the total cost incurred.
- Comply with third-party API usage policies: Each model provider enforces its own rate limits on your key, so Helicone’s rate limit is bounded by your provider’s policy.
How Rate Limits Work
To set up rate limiting, simply add the `Helicone-RateLimit-Policy` header to your request. This will rate limit all requests made with the specified API key.
The header follows this format:
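```
Helicone-RateLimit-Policy: [quota];w=[time_window];u=[unit];s=[segment]
```

where `u` and `s` are optional, as described in the parameter table below.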
Rate Limit Parameters
| Parameter | Description |
| --- | --- |
| `quota` (required) | The maximum number of requests allowed within the specified time window. |
| `time_window` (required) | The unit is seconds. For example, use `w=86400` (60 * 60 * 24 = 86400) to set the time window to a single day. The minimum is 60 seconds. |
| `unit` (optional) | Must be `request` or `cents`. If left blank, unit is set to `request` by default. |
| `segment` (optional) | Must be `user` or a custom property. If left blank, segment is set to `global` by default. We’ll explain the difference in the Filtering By Segments section. |
Example: A policy that allows 10 cents of requests per 1000 seconds per user.
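In header form, that policy would look like this (following the format above):

```
Helicone-RateLimit-Policy: 10;w=1000;u=cents;s=user
```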
Fun fact: this policy format is an IETF standard for specifying rate limits! The one exception is the segment field, which is a Helicone special twist 🍬
Filtering By Segments
The `s=[segment]` parameter specifies the scope in which rate limits are applied to requests made with an API key. You can apply rate limits globally, by user, or by a custom property.
- Global rate limiting: leave the segment field empty (i.e. `1000;w=60`).
- Rate limiting by user: set the segment field to `user` (i.e. `1000;w=60;s=user`).
  - In addition, the user ID must be included as a parameter in the request or in the `Helicone-User-Id` header (i.e. `"Helicone-User-Id": "username"`); see the example after this list.
  - See User Metrics for more details.
- Rate limiting by a custom property: set the segment field to the desired property name (i.e. `1000;w=60;s=[property_name]`).
  - In addition, include a corresponding header in the request (i.e. `"Helicone-Property-{property_name}": "some label"`).
  - See Custom Properties for more details.
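For instance, a per-user policy pairs the policy header with the user ID header (values are illustrative):

```
"Helicone-RateLimit-Policy": "1000;w=60;s=user",
"Helicone-User-Id": "username"
```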
Examples: Here is a list of example policies to use for the `Helicone-RateLimit-Policy` header (illustrative values assembled from the parameters above):
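```
1000;w=60                  # 1000 requests per minute, applied globally
500;w=86400;s=user         # 500 requests per day, per user
10;w=1000;u=cents;s=user   # 10 cents of requests per 1000 seconds, per user
100;w=3600;s=organization  # 100 requests per hour, per value of the custom property "organization"
```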
Extracting Rate Limit Response Headers
Extracting the headers allows you to test your rate limit policy in a local environment before deploying to production.
If your rate limit policy is active, the following headers will be returned:
- `Helicone-RateLimit-Limit`: The quota for the number of requests allowed in the time window.
- `Helicone-RateLimit-Policy`: The active rate limit policy.
- `Helicone-RateLimit-Remaining`: The remaining quota in the current window.
If a request is rate-limited, a 429 rate limit error will be returned.
Example: Extracting headers in Python with OpenAI
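A minimal sketch using the OpenAI Python SDK’s raw-response interface, assuming requests are routed through Helicone’s OpenAI-compatible endpoint; the model name, policy, and user ID are illustrative:

```python
import os
from openai import OpenAI

# Route OpenAI traffic through Helicone and attach a rate limit policy
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": f"Bearer {os.environ['HELICONE_API_KEY']}",
        # 1000 requests per minute, segmented per user (illustrative policy)
        "Helicone-RateLimit-Policy": "1000;w=60;s=user",
        "Helicone-User-Id": "username",
    },
)

# with_raw_response exposes the underlying HTTP response, including headers
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(raw.headers.get("Helicone-RateLimit-Limit"))
print(raw.headers.get("Helicone-RateLimit-Policy"))
print(raw.headers.get("Helicone-RateLimit-Remaining"))

# Parse into the usual ChatCompletion object
completion = raw.parse()
print(completion.choices[0].message.content)

# If the policy is exceeded, the request fails with a 429 and the SDK raises openai.RateLimitError.
```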
Latency Considerations
Using rate limits adds a small amount of latency to your requests. This feature is built on Cloudflare’s key-value data store, a low-latency service that stores data in a small number of centralized data centers and caches it in Cloudflare’s edge data centers after access. The added latency is minimal compared to multi-second OpenAI requests.
Coming Soon
Soon, we will support rate limiting by tokens and by cost. Additionally, you will be able to see how close your requests, users, and properties are to hitting their rate limits in the web UI.