Hybrid Summarization

Maintain conversation coherence across 128K+ token contexts with intelligent summarization.

Overview

Hybrid Summarization automatically compresses long conversation histories while preserving:

  • Key facts and entities
  • Code blocks and technical details
  • Conversation thread structure
  • Context for ongoing tasks

How It Works

graph LR
A[Long Conversation] --> B{Token Threshold}
B -->|< 50K tokens| C[No Compression]
B -->|> 50K tokens| D[Summarize Oldest]
D --> E[Preserve Code/Entities]
E --> F[Maintain Thread Structure]
F --> G[Inject Summary]
G --> H[Reduce to 128K]

Automatic Triggers

Summarization triggers at these thresholds:

Context Size        | Action
< 50K tokens        | No compression
50K - 80K tokens    | Light compression
80K - 100K tokens   | Moderate compression
> 100K tokens       | Aggressive compression
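The tier selection above amounts to a simple threshold lookup. As a minimal sketch (the function name and tier labels are illustrative, not part of the Korad API):

```python
def compression_tier(token_count: int) -> str:
    """Map a context size in tokens to a compression tier.

    Thresholds mirror the table above; this helper is an
    illustration, not Korad's implementation.
    """
    if token_count < 50_000:
        return "none"
    elif token_count <= 80_000:
        return "light"
    elif token_count <= 100_000:
        return "moderate"
    return "aggressive"
```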

What Gets Preserved

1. Code Blocks

All code is preserved exactly:

# This code is never summarized
def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:
        mid = (left + right) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            left = mid + 1
        else:
            right = mid - 1
    return -1  # target not found

2. Key Entities

Names, dates, technical terms are extracted:

User: John Smith
Project: Quantum Computing
Date: 2026-01-31
Framework: React + TypeScript
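As a rough illustration of this step, simple patterns can pull out ISO dates and "Key: Value" entities like the ones above (a production extractor would use an NER model; this toy function is an assumption, not Korad's internals):

```python
import re

def extract_entities(text: str) -> dict:
    """Toy entity extractor: collects 'Key: Value' pairs and ISO dates.

    Illustrative sketch only -- real systems use named-entity
    recognition rather than regular expressions.
    """
    entities = {}
    for line in text.splitlines():
        match = re.match(r"\s*([A-Z][\w ]+):\s*(.+)", line)
        if match:
            entities[match.group(1).strip()] = match.group(2).strip()
    entities["dates"] = re.findall(r"\d{4}-\d{2}-\d{2}", text)
    return entities
```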

3. Thread Structure

Conversation flow is maintained:

Thread 1: API authentication setup (messages 1-15)
→ Summary: User implemented JWT auth, middleware created
Thread 2: Database schema design (messages 16-30)
→ Summary: PostgreSQL schema with users, transactions tables
Thread 3: Stripe integration (messages 31-45) ← Active
→ Full context preserved

Example Compression

Before (15K tokens)

[100+ messages about implementing a feature]
User: Can you help me implement OAuth?
Assistant: Sure, let's start with the providers...
[50 messages of implementation details]
User: How do I handle refresh tokens?
Assistant: Here's the refresh token logic...
[50 more messages]
User: What about error handling?

After (3K tokens)

Thread: OAuth Implementation (messages 1-75)
Summary: Implemented OAuth with Google and GitHub providers.
Created AuthController with login/logout/refresh endpoints.
JWT middleware for protected routes. Refresh token rotation enabled.
Code preserved: auth_controller.py (450 lines), middleware.py (120 lines)

[Recent messages fully preserved]
User: What about error handling?
Assistant: For error handling, you should...

Quality Metrics

Metric                 | Target | Actual
Factual retention      | > 95%  | 98%
Code preservation      | 100%   | 100%
Entity retention       | > 90%  | 94%
Conversation coherence | > 90%  | 93%

Configuration

Via Dashboard

  1. Go to korad.ai/dashboard
  2. Settings → Conversation Management
  3. Adjust summarization threshold

Via API (Coming Soon)

client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[...],
    korad_settings={
        "summarization": {
            "threshold": 80000,  # tokens before compression
            "preserve_code": True,
            "preserve_entities": True
        }
    }
)

Best Practices

1. For Coding Projects

Code is automatically preserved, so:

# Safe: Long coding conversations
# All code blocks stay intact
# Only prose explanations are summarized

2. For Multi-Turn Tasks

Related messages are grouped:

# Thread 1: Implement feature A
# Thread 2: Debug feature B
# Thread 3: Add tests ← Active
# Only Threads 1-2 get summarized

3. For Document Analysis

Reference documents are preserved:

# Upload: 100-page technical spec
# Summarized: Key requirements extracted
# Preserved: Full spec available for reference

Monitoring

Check summarization activity:

response = client.messages.create(...)

# Check if context was compressed
if hasattr(response, 'korad_context'):
    print(f"Original tokens: {response.korad_context.original_tokens}")
    print(f"Compressed tokens: {response.korad_context.compressed_tokens}")
    print(f"Compression ratio: {response.korad_context.compression_ratio}")

Technical Details

Algorithm

  1. Thread Detection — Group related messages
  2. Code Extraction — Preserve all code blocks
  3. Entity Recognition — Extract names, dates, terms
  4. Abstractive Summary — AI-powered summarization
  5. Structure Preservation — Maintain thread hierarchy
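Step 2 can be sketched in miniature: lift fenced code blocks out of a message so that only the surrounding prose reaches the summarizer (an illustrative assumption about the approach, not Korad's implementation):

```python
import re

def split_code_and_prose(message: str) -> tuple[list[str], str]:
    """Separate fenced code blocks from prose in a message.

    Returns the preserved code blocks and the prose with each block
    replaced by a placeholder, ready for summarization.
    """
    code_blocks = re.findall(r"```.*?```", message, flags=re.DOTALL)
    prose = re.sub(r"```.*?```", "[code preserved]", message, flags=re.DOTALL)
    return code_blocks, prose
```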

Performance

  • Latency: < 100ms for compression
  • Throughput: 100K tokens/second
  • Memory: O(n) where n = context size
