
Chat Completions API

Create message completions using Claude models through Korad.AI's optimized gateway.

Endpoint

POST https://api.korad.ai/v1/messages

Request Body

{
  model: string;
  max_tokens: number;
  messages: Message[];
  system?: string;
  temperature?: number;
  top_p?: number;
  top_k?: number;
  stream?: boolean;
  stop_sequences?: string[];
  tools?: Tool[];
}

Parameters

Required Parameters

model (string)

The model to use for completion.

Available Models:

  • claude-sonnet-4-20250514 — Best for most tasks
  • claude-sonnet-4-20250514-optim — Cost-optimized version
  • claude-haiku-4-20250514 — Fast, for simple tasks
  • claude-opus-4-20250514 — Most capable, for complex tasks

max_tokens (integer)

Maximum tokens to generate. Must be ≥ 1.

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "messages": [{"role": "user", "content": "Hello"}]
}

messages (array)

Array of message objects representing the conversation.

type Message = {
  role: "user" | "assistant";
  content: string | ContentBlock[];
};
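The content field also accepts an array of content blocks rather than a plain string; this is the form used for structured input such as tool results. A minimal sketch of the block form, following Anthropic's Messages content-block shape:

```
{
  "role": "user",
  "content": [
    {"type": "text", "text": "Summarize this document."}
  ]
}
```

A plain string is shorthand for a single text block like the one above.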

Optional Parameters

system (string)

System prompt to guide behavior.

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "system": "You are a helpful coding assistant.",
  "messages": [{"role": "user", "content": "Write a function"}]
}

temperature (number, default: 0.7)

Sampling temperature. Lower = more focused, higher = more creative.

top_p (number, default: 0.9)

Nucleus sampling threshold.

top_k (number)

Top-k sampling: only the k most likely tokens are considered at each step.

stream (boolean, default: false)

Enable streaming responses.

stop_sequences (array[string])

Sequences that will stop generation.
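For example, to cut generation off at a custom delimiter (the sequence values here are illustrative):

```
{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "stop_sequences": ["END"],
  "messages": [{"role": "user", "content": "List three colors, then write END"}]
}
```

When a stop sequence fires, the response's stop_reason is "stop_sequence" and stop_sequence names the matched string.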

tools (array)

Function calling tools.

Example with Tools

{
  "model": "claude-sonnet-4-20250514",
  "max_tokens": 1024,
  "tools": [
    {
      "name": "get_weather",
      "description": "Get current weather",
      "input_schema": {
        "type": "object",
        "properties": {
          "location": {"type": "string"}
        },
        "required": ["location"]
      }
    }
  ],
  "messages": [
    {"role": "user", "content": "What's the weather in Tokyo?"}
  ]
}
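When the model decides to call a tool, the response's stop_reason is "tool_use" and the content array contains tool_use blocks; you execute the tool and send the results back as tool_result blocks in a follow-up user message. A minimal sketch of that plumbing, operating on plain dicts (the run_tool dispatcher is a hypothetical callable you supply):

```python
def build_tool_result_message(content_blocks, run_tool):
    """Turn tool_use blocks from an assistant response into the
    follow-up user message carrying matching tool_result blocks."""
    results = []
    for block in content_blocks:
        if block["type"] == "tool_use":
            # Execute the requested tool with the model-provided input.
            output = run_tool(block["name"], block["input"])
            results.append({
                "type": "tool_result",
                "tool_use_id": block["id"],  # must echo the tool_use block's id
                "content": str(output),
            })
    return {"role": "user", "content": results}
```

Append this message to the conversation and call the endpoint again so the model can use the tool output.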

Response

Non-Streaming Response

{
  "id": "msg_1a2b3c4d",
  "type": "message",
  "role": "assistant",
  "content": [
    {
      "type": "text",
      "text": "Hello! How can I help you today?"
    }
  ],
  "model": "claude-sonnet-4-20250514",
  "stop_reason": "end_turn",
  "stop_sequence": null,
  "usage": {
    "input_tokens": 10,
    "output_tokens": 25,
    "total_tokens": 35
  },
  "korad_metrics": {
    "savings_percent": 45,
    "original_cost": 0.0003,
    "your_cost": 0.000165
  }
}

Streaming Response

With "stream": true, responses are sent as Server-Sent Events.
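Assuming the stream follows Anthropic's Messages streaming event format, a typical sequence looks roughly like this (payloads abbreviated):

```
event: message_start
data: {"type": "message_start", "message": {...}}

event: content_block_delta
data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hello"}}

event: message_stop
data: {"type": "message_stop"}
```

Text arrives incrementally in content_block_delta events; the SDK's streaming helper (shown in the Examples section) handles this parsing for you.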

Usage Tracking

Every response includes detailed usage metrics:

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[...]
)

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Output tokens: {response.usage.output_tokens}")
print(f"Savings: {response.korad_metrics.savings_percent}%")
print(f"Your cost: ${response.korad_metrics.your_cost}")

Examples

Basic Chat

import anthropic

client = anthropic.Anthropic(
    base_url="https://api.korad.ai/v1",
    api_key="sk-korad-YOUR-KEY"
)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain recursion"}
    ]
)

print(message.content[0].text)

Streaming

with client.messages.stream(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Count to 10"}]
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)

With System Prompt

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system="You are an expert Python developer. Provide concise, working code.",
    messages=[
        {"role": "user", "content": "Write a binary search function"}
    ]
)

Multi-turn Conversation

conversation = [
    {"role": "user", "content": "What's 2+2?"},
    {"role": "assistant", "content": "2+2 equals 4."},
    {"role": "user", "content": "What about 5+5?"}
]

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=conversation
)

Error Handling

from anthropic import APIStatusError, APITimeoutError

try:
    response = client.messages.create(...)
except APIStatusError as e:
    print(f"API Error: {e.status_code} - {e.message}")
except APITimeoutError:
    print("Request timed out")
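Timeouts and transient server errors are usually worth retrying with exponential backoff (the SDK's client constructor also accepts a max_retries argument for built-in retries). If you roll your own, the delay schedule can be as simple as this sketch (the base and cap values here are arbitrary choices):

```python
def backoff_delays(retries, base=0.5, cap=8.0):
    """Exponential backoff schedule in seconds: base * 2**attempt, capped."""
    return [min(base * (2 ** attempt), cap) for attempt in range(retries)]
```

Sleep for each delay between attempts, and give up once the schedule is exhausted.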

Best Practices

  1. Set appropriate max_tokens — prevents unexpected costs
  2. Use temperature carefully — 0.7 for creative, 0.3 for focused
  3. Include system prompts — guide behavior without using message tokens
  4. Monitor korad_metrics — track your savings in real-time
  5. Use streaming for UX — users see responses faster

Error Codes →