Organisation: CloudCIX
Version: March 2026 — v1.0
Related documents:
This document explains the why behind the design patterns used throughout the CloudCIX ML Services. Each concept is explained independently — you do not need to read the System Design first. Code references point to the relevant implementation sections.
Category A — Identity & Authentication
Category B — Rate Limiting
Category C — Resilience & Reliability
Category D — API & Pipeline Design
An authentication model where a single secret string (the API key) is the only identity primitive. There are no user accounts, no sessions, no tokens with claims, and no tenant hierarchy. Valid key = full access.
Many production SaaS systems start here and stay here for years. It has two genuine advantages:
Simplicity. There is no token lifecycle, no refresh flow, no session management, and no JWT parsing. Every service validates via a single POST /auth call to the Membership service. The implementation is a dozen lines.
Instant revocation. Revoking an API key in Membership is honoured by every service within 60 seconds (the cache TTL). There is no "token expiry" window during which a revoked credential still works. This matters for a SaaS where a customer churns and you need to lock them out reliably.
It cannot distinguish who is making a call. Two users with different API keys are both "valid callers" — there is no way to say "user A can only access corpus X" or "user B is on a free tier." These constraints require Membership to return identity claims (a caller_id, a tier, a set of scopes) alongside the validity check.
The CallerContext abstraction (see A2) is designed so that when Membership starts returning identity claims, every downstream consumer — rate limiter, audit logger, endpoint guards — gains that information automatically. No call sites change.
See also: System Design §3.1
A typed object that encapsulates everything known about the caller of a request. Today it holds only the API key hash. In future it will hold tier, scopes, a stable user ID from Membership, and anything else the identity model grows to include.
```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CallerContext:
    caller_id: str                                   # sha256(api_key)[:16] today
    caller_type: str = "unknown"                     # "user" | "service" | "unknown"
    tier: Optional[str] = None
    scopes: list[str] = field(default_factory=list)
    raw_claims: dict = field(default_factory=dict)
```
A raw dict is opaque — callers cannot know what keys might exist without reading the code that populates it. It also provides no IDE support and no type safety.
More importantly: a raw dict is either full (when Membership returns claims) or empty (today). Code that uses it has to handle both cases with dict.get() calls scattered everywhere. When Membership evolves, all those scattered call sites need to change.
CallerContext solves this by providing a stable interface regardless of what Membership returns. The single _build_context() function maps from whatever Membership gives into the typed object. Every downstream consumer only knows the object — it does not know or care what Membership returns. Adding a new field to CallerContext is a one-line change in _build_context() and all consumers benefit automatically.
The raw API key must never appear in logs, database columns, Redis keys, or Celery task arguments. The hash is:
```python
import hashlib

def _key_hash(api_key: str) -> str:
    return hashlib.sha256(api_key.encode()).hexdigest()[:16]
```
It is stable (same key always produces the same hash), opaque (cannot reverse it to the original key), and safe to log. When Membership returns a real caller_id, the hash is replaced seamlessly in _build_context().
See also: Implementation Guide §3.3
Rather than calling Membership on every request, each service caches the validation result in Redis for 60 seconds, keyed on the hashed API key.
Without caching, Membership is a synchronous hard dependency on every request. If Membership is slow (even 200ms latency), every endpoint across every service is 200ms slower. If Membership is down, every endpoint across every service returns an error.
A 60-second TTL means a revoked key keeps working for up to 60 seconds after revocation. This is the deliberate tradeoff — 60 seconds of revocation lag in exchange for making Membership a soft dependency rather than a hard SPOF.
For most SaaS use cases this is acceptable. If your security requirements demand faster revocation (e.g. an active fraud scenario), lower the TTL. At 10 seconds the load on Membership increases roughly 6×, which is still manageable.
API keys are sometimes rotated or revoked for security reasons. An infinite cache means a revoked key continues to work until the service restarts — which could be days or weeks. The short TTL is the mechanism that makes revocation eventually consistent.
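As an illustration of the caching pattern, here is an in-process sketch. It uses a plain dict where the real services use Redis, and a `check_membership` callable standing in for the `POST /auth` round-trip; both names are illustrative.

```python
import hashlib
import time
from typing import Callable

CACHE_TTL = 60  # seconds; this is the revocation lag described above
_cache: dict[str, tuple[bool, float]] = {}  # key hash -> (is_valid, expires_at)

def validate_cached(api_key: str, check_membership: Callable[[str], bool]) -> bool:
    """Serve from cache while fresh; otherwise call Membership and cache the result."""
    key_hash = hashlib.sha256(api_key.encode()).hexdigest()[:16]
    entry = _cache.get(key_hash)
    now = time.monotonic()
    if entry is not None and entry[1] > now:
        return entry[0]                   # warm cache: no Membership call
    is_valid = check_membership(api_key)  # the real POST /auth call
    _cache[key_hash] = (is_valid, now + CACHE_TTL)
    return is_valid
```

Note that the cache is keyed on the hash, never the raw key, consistent with A2.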
See also: System Design §3.3
A stability pattern that monitors calls to an external dependency (Membership) and, after a threshold of consecutive failures, stops making calls to it for a recovery period, returning a fast error instead of waiting for a slow timeout.
CLOSED (normal)
Every request passes through to Membership.
Failures are counted silently.
↓ after 3 consecutive failures
OPEN (tripped)
No calls to Membership are made for 30 seconds.
All requests immediately get 503 Service Unavailable.
↓ after 30 seconds
HALF-OPEN (probing)
One request is allowed through as a probe.
Success → circuit closes.
Failure → circuit stays open for another 30 seconds.
This is a subtle but important distinction. 401 Unauthorized means the API key is invalid — the caller did something wrong. 503 Service Unavailable means the server has a problem. When Membership is unreachable, the caller's API key might be perfectly valid — we simply cannot confirm it. Returning 401 would be wrong and harmful (the caller might invalidate their key thinking it was revoked).
Retrying a failing dependency without a circuit breaker creates a retry storm: every service hammers Membership with retries during an outage, making the outage worse and delaying recovery. The circuit breaker absorbs this by stopping calls completely, giving Membership time to recover without being overwhelmed.
Combined with the 60-second auth cache, the system degrades gracefully: callers with warm cache entries keep working normally, new callers (cold cache, open circuit) get 503.
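The three-state cycle above can be sketched in a few lines. This is a simplified in-process illustration; a production version would additionally ensure that only one probe request is admitted while HALF-OPEN.

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after N consecutive failures; probe again after recovery."""

    def __init__(self, failure_threshold: int = 3, recovery_seconds: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_seconds = recovery_seconds
        self.failures = 0
        self.opened_at: float | None = None  # None means CLOSED

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True   # CLOSED: pass through to Membership
        if time.monotonic() - self.opened_at >= self.recovery_seconds:
            return True   # HALF-OPEN: let a probe through
        return False      # OPEN: fail fast (the caller returns 503)

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None  # probe succeeded: close the circuit

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()  # trip open, or restart the 30s clock
```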
See also: System Design §3.3
A mechanism to ensure that sensitive internal endpoints — specifically the callback POST /jobs/{id}/complete — are only reachable by other services within the system, not by external callers.
The callback endpoint marks a job as complete and advances the pipeline state machine. If an external caller could reach it, they could mark any job as done, corrupting the Orchestrator's state.
Since Membership currently returns no caller_type, we cannot distinguish a user API key from the Orchestrator's service key at validation time. The workaround is a configured set of known internal key hashes:
```python
INTERNAL_CALLER_IDS: set[str] = {settings.ORCHESTRATOR_KEY_HASH}

def _is_internal(ctx: CallerContext) -> bool:
    return ctx.caller_type == "service" or ctx.caller_id in INTERNAL_CALLER_IDS
```
The allowlist is injected at deploy time via the ORCHESTRATOR_KEY_HASH environment variable. It never appears in code.
Once Membership marks service keys with "caller_type": "service" in its response, the ctx.caller_type == "service" check takes over. The allowlist becomes belt-and-suspenders rather than the primary guard.
Mutual TLS between services would eliminate the need for this at the application layer entirely — the network layer would enforce service identity. This is the right long-term answer at scale (a service mesh like Istio handles it). For a VM + Docker Compose deployment, it is significant operational overhead relative to the problem it solves.
Rate limiting applied at two independent layers: the gateway (nginx) and the application (per-service SlidingWindowEnforcer). Each layer stops a different class of problem.
| Layer | Stops | Identity available | Cost of rejection |
|---|---|---|---|
| Gateway (nginx) | IP floods, unauthenticated request storms | IP address + raw header | Near-zero — no app code runs |
| Application | Per-key fair share exhaustion, capacity limits | Validated CallerContext | Auth cache hit + Redis pipeline |
The gateway has no business context — it cannot tell if a key belongs to a paying customer or an abuser, and it has no knowledge of per-service capacity. It can only count raw request volume.
The application has full context — it knows the caller's type, the service's capacity, and the queue depth. But it is more expensive to reach: a rejected request at the application layer has already gone through nginx, auth validation, and a Redis lookup.
Together they form a defence in depth: the gateway cheaply absorbs floods before they saturate the application tier, while the application enforces business rules that the gateway cannot express.
A request that passes the gateway may still be rejected at the application layer (quota exhausted). A request rejected at the gateway never reaches the application. Both return 429 to the caller but they are fundamentally different events and should never be aggregated in metrics.
See also: System Design §4
Two approaches to counting requests within a time period:
Fixed window divides time into discrete buckets. "100 requests per hour" means 100 per clock hour — resetting at :00 every hour. Simple to implement, but has a well-known vulnerability.
Sliding window counts requests within a rolling period ending at the current moment. "100 requests per hour" means 100 requests in the last 60 minutes, evaluated at every request.
A caller can exploit a fixed window by exhausting their quota in the last seconds of one window, then immediately consuming a full new quota at the boundary:
Fixed window — 100 req/hour:
11:59:50 ████████████████████ 100 requests (quota exhausted)
12:00:00 ████████████████████ 100 requests (new window, full quota)
└──────── 10s ───────┘
200 requests in 10 seconds
against a service sized for 100/hour
This is a 2× burst at the boundary — doubling the effective rate limit for anyone who times their requests carefully.
With a sliding window there is no boundary to exploit. At any moment the question is "how many requests in the last N seconds?" — and those 100 requests at 11:59 are still counted until they age out at 12:59.
On every request for caller_id:
1. ZREMRANGEBYSCORE — remove entries older than (now - window_seconds)
2. ZADD — add this request's timestamp
3. ZCARD — count entries in window
4. EXPIRE — TTL cleanup if caller goes quiet
All in one pipeline — atomic from the application's perspective.
Each active caller maintains a sorted set of timestamps. Memory cost: up to limit.requests × 8 bytes per caller. At 100 req/hour × 1,000 callers ≈ 800 KB. Negligible.
See also: System Design §3.6
A rate limiting strategy that divides a service's total declared capacity equally among all active callers, rather than assigning a fixed limit per caller.
A fixed limit (e.g. "50 requests/hour per caller") requires knowing in advance how many callers you will have. If you have 5 callers and a service that handles 2,000 requests/hour, a fixed 50/hour limit wastes 1,750 requests/hour of capacity. If you have 200 callers, 50/hour is fine. But you don't know which scenario you're in.
service_capacity = 2000 requests/hour
active_callers = 10 (HyperLogLog count of distinct caller_ids)
per_caller_limit = max(service_floor, service_capacity / active_callers)
= max(50, 2000 / 10)
= 200 requests/hour
As more callers become active, the per-caller limit shrinks (subject to a floor). This ensures no single caller can starve others, and unused capacity is available to whoever needs it.
The active caller count uses Redis HyperLogLog (PFADD / PFCOUNT). This provides an approximate distinct count with ~0.8% error using only ~12 KB of memory, regardless of how many callers are tracked. An exact count would require a full set of all caller IDs, which grows unboundedly.
service_floor is the minimum guaranteed per-caller limit regardless of how many callers are active. It prevents legitimate callers from being starved by a large number of concurrent callers.
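The calculation above, written out as a small helper (the function and parameter names are illustrative):

```python
def per_caller_limit(service_capacity: int, active_callers: int,
                     service_floor: int = 50) -> int:
    """Split declared capacity evenly across active callers, subject to a floor."""
    if active_callers == 0:
        return service_capacity  # nobody to share with yet
    return max(service_floor, service_capacity // active_callers)
```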
See also: System Design §3.6
A design pattern where the rate limiting algorithm is separated from the rate limiting enforcement. The enforcer applies whatever limit the strategy provides — it does not know or care how that limit was calculated.
```python
from typing import Protocol

class RateLimitStrategy(Protocol):
    def get_limit(self, ctx: CallerContext) -> RateLimit: ...

class SlidingWindowEnforcer:
    def __init__(self, strategy: RateLimitStrategy):
        self.strategy = strategy

    async def check(self, ctx: CallerContext) -> None:
        limit = self.strategy.get_limit(ctx)  # enforcer asks strategy for the limit
        # ... enforce it using Redis sliding window ...
```
Today Membership returns no tier data — FairShareStrategy is the correct behaviour. When Membership starts returning tier: "pro", MembershipTierStrategy becomes the correct behaviour.
Without the strategy pattern, changing the rate limiting algorithm requires modifying every service's endpoint code. With it, you swap the strategy object at startup — endpoint code is untouched.
CompositeStrategy wraps both strategies and selects between them based on what CallerContext contains:
```python
def get_limit(self, ctx: CallerContext) -> RateLimit:
    if ctx.tier or ctx.raw_claims.get("rate_limits"):
        return self.membership_tier.get_limit(ctx)  # Membership has data → use it
    return self.fair_share.get_limit(ctx)           # no data → fall back to fair share
```
Deployed today, it behaves as FairShareStrategy. The moment Membership adds a tier field, it silently upgrades — zero service code changes, zero redeployment.
See also: System Design §3.6
A second, independent rate limiting mechanism that rejects new work when a service's internal queue is too deep — regardless of the caller's identity, tier, or remaining quota.
Per-caller rate limits are a business policy: "this caller has used their fair share." Queue depth backpressure is infrastructure self-protection: "this service cannot accept more work right now, from anyone."
A caller with unused quota should still be rejected if the queue is saturated. Adding their work to a full queue degrades performance for everyone already waiting. The two concerns belong at different levels and should be implemented independently.
```python
@router.post("/enqueue")
async def scrape_async(payload, ctx=Depends(validate_api_key)):
    await enforcer.check(ctx)  # per-caller fair share check

    depth = await redis.llen("scrape_jobs")
    if depth > settings.MAX_QUEUE_DEPTH:
        raise HTTPException(
            status_code=429,  # same HTTP status, different meaning
            detail=f"Queue saturated ({depth} pending). Try again shortly.",
            headers={"Retry-After": "30"},
        )
    # proceed to enqueue
```
Both return 429 but for fundamentally different reasons. Monitoring should distinguish them — application 429s indicate quota exhaustion (possibly need to increase capacity or limits), while queue depth 429s indicate the pipeline is saturated (need to scale workers or reduce submission rate).
A Celery / message broker pattern where a task is not removed from the queue when a worker picks it up. It is only removed (acknowledged) after the worker successfully completes the task. Until the ACK, the task remains "in-flight" — invisible to other workers but not yet gone.
By default, Celery acknowledges tasks at the moment they are received — before execution. If the worker crashes during execution, the task is gone. This is fine for workflows where losing a task is acceptable. It is not fine for a pipeline processing jobs that users are waiting on.
Setting task_acks_late = True delays acknowledgement until the task function returns (or raises an unhandled exception). Combined with the visibility timeout, this gives you crash recovery:
Worker picks up task T
↓
T moves to "unacknowledged" state in Redis
↓
Worker crashes mid-execution
↓
Visibility timeout expires (e.g. 3600 seconds)
↓
Broker returns T to the queue
↓
Another worker (or the same one, restarted) picks it up
No data loss, no manual intervention.
Because a task can be executed more than once (on retry after crash), any side effects must be safe to repeat. See C2.
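A hedged sketch of the corresponding Celery configuration: the app name, broker URL, and `task_reject_on_worker_lost` choice here are assumptions, but `task_acks_late` and the broker `visibility_timeout` transport option are the real knobs this section describes.

```python
from celery import Celery

app = Celery("scraper", broker="redis://localhost:6379/0")
app.conf.update(
    task_acks_late=True,              # ACK after the task returns, not on receipt
    task_reject_on_worker_lost=True,  # requeue if the worker process dies hard
    broker_transport_options={"visibility_timeout": 3600},  # seconds; must exceed the longest task
)
```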
See also: System Design §5.2
The property that an operation can be performed multiple times and produce the same result as performing it once. An idempotent operation is safe to retry.
Workers use task_acks_late = True and retry on failure. The Orchestrator's callback endpoint handles duplicate callbacks. Both mechanisms require that running the same operation twice is safe.
External writes — upsert, never insert:
```python
# WRONG — second execution creates a duplicate
vector_db.insert(chunk_id=chunk_id, vector=embedding)

# CORRECT — second execution overwrites the first with identical data
vector_db.upsert(chunk_id=chunk_id, vector=embedding)
```
Callback endpoint — guard against duplicate advancement:
```python
if job["status"] != body.stage:
    return {"status": "already_advanced"}  # 200, not 409 — this is correct
```
Returning 200 on a duplicate is important. Returning 409 would cause Celery to treat the callback as a failure and retry it, creating an infinite retry loop. The callback succeeding with a no-op is the correct semantic.
Any operation that could be retried must answer the question: "If this runs twice, is the second run harmless?" If yes, it is idempotent. If no, it needs to be made idempotent — typically via upsert semantics or a deduplicated record keyed on a stable ID.
The duration a broker waits for a worker to acknowledge a task before assuming the worker is dead and returning the task to the queue. Configured per service.
The visibility timeout must be longer than your longest expected task execution time. If a task legitimately takes 45 minutes and the visibility timeout is 30 minutes, the broker will re-enqueue the task while it is still running — creating a duplicate execution.
Per-service guidance:
| Service | Visibility timeout | Rationale |
|---|---|---|
| Scraper | 3600s (1 hour) | Slow sites, retry delays, large pages |
| Embedding | 7200s (2 hours) | Large document batches, OpenAI rate limit delays |
| Chunking | N/A | No Celery queue |
Setting task_soft_time_limit (e.g. 3500s) causes the task to raise SoftTimeLimitExceeded before the hard kill at task_time_limit. This gives the task a chance to clean up and report failure to the Orchestrator before it is forcibly killed. The visibility timeout should be slightly larger than task_time_limit so the task has a chance to complete (or fail cleanly) before the broker requeues it.
A retry strategy where the delay between retries grows exponentially (doubling each time) and includes a random component (jitter). Used both in Celery task autoretry and in the callback notification loop.
A fixed retry interval (e.g. retry every 5 seconds) can cause thundering herd problems: if 100 tasks all fail at the same moment (e.g. OpenAI returns 429), they will all retry at the same time, creating a spike of load. Exponential backoff spreads retries out over time, reducing contention.
Attempt 1 fails → wait 1s
Attempt 2 fails → wait 2s
Attempt 3 fails → wait 4s
Attempt 4 fails → wait 8s
...up to max_backoff
Without jitter, all callers that started retrying at the same time will still retry in synchrony (1s, 2s, 4s — all together). Jitter adds a random offset so retries spread across the window:
Without jitter: all retry at t=1, t=2, t=4 (synchronized)
With jitter: each retries at t=0.7, t=1.3, t=1.9, t=3.2... (spread)
Celery's retry_jitter=True handles this automatically.
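To make the arithmetic concrete, here is a minimal sketch of the delay calculation using the "full jitter" variant (drawing the delay uniformly from zero up to the exponential cap). Celery's own jitter implementation may differ in detail; the function name is illustrative.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, max_backoff: float = 60.0) -> float:
    """Exponential backoff with full jitter.

    The cap doubles each attempt (base * 2**attempt, bounded by max_backoff),
    and the actual delay is drawn uniformly from [0, cap) so that callers
    which failed at the same moment retry at different times.
    """
    cap = min(max_backoff, base * (2 ** attempt))
    return random.uniform(0, cap)
```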
A scheduled background query that finds jobs that have been in a non-terminal state for longer than expected, indicating the pipeline has stalled for them.
Even with task_acks_late = True and visibility timeout re-queuing, a job can still get permanently stuck (for example, if the completion callback is lost after its retries are exhausted).
The Orchestrator's state machine cannot detect this on its own — it only moves forward when callbacks arrive. The stuck job query is the out-of-band recovery mechanism.
```sql
SELECT id, caller_id, status, url, updated_at,
       now() - updated_at AS stuck_for
FROM jobs
WHERE status NOT IN ('done', 'failed')
  AND updated_at < now() - interval '30 minutes'
ORDER BY updated_at;
```
The Orchestrator's scheduled handler re-enqueues or fails stuck jobs based on their current stage. Because all worker writes are idempotent (C2), re-enqueuing a job that partially completed is safe — the second execution overwrites without creating duplicates.
Storing data in the appropriate store based on its durability requirements. Durable data (must survive restarts and failures) lives in Postgres. Ephemeral data (needed only while a task is in-flight) lives in Redis.
Postgres: job records, pipeline state (status, failed_stage), and structured error details (anything that must survive restarts and failures).
Redis: auth cache entries, rate limit counters, and Celery queue / in-flight task state (anything that can be rebuilt or safely re-enqueued).
If Redis is lost completely, no job history is lost. The Orchestrator reconnects to Postgres, finds jobs stuck in non-terminal states via the stuck job query, and re-enqueues them. The rate limit and auth cache states are reconstructed naturally as requests arrive.
This separation is why redis-orch does not need AOF persistence for job data — Postgres already has it. Redis should still be configured with maxmemory-policy noeviction to prevent silent task loss, but a Redis restart does not mean job data loss.
A centralised coordination service that owns the workflow — it knows the pipeline topology, drives each stage, and tracks overall state. Worker services are pure executors: they receive work, do it, and report back. They have no knowledge of the broader pipeline.
In a choreographed pipeline, each service calls the next:
Client → Orchestrator → Scraper → Chunking → Embedding → done
Each service is responsible for knowing what comes after it. The Orchestrator only hears back at the very end.
Tight coupling. Adding a stage between scraping and chunking requires modifying the Scraper. The Scraper must know that Chunking exists. If Chunking's API changes, Scraper must change too. Two services that should be independent become coupled.
Distributed state. When something goes wrong, the failure is wherever the chain broke. Understanding the full state of a pipeline job requires querying every service.
No single source of truth. Where is the canonical record of what happened to job X? Spread across every service's logs.
Every stage reports back to the Orchestrator. Adding a stage means adding a handler in the Orchestrator — Scraper and Embedding are unchanged. The Orchestrator's Postgres job table is the single source of truth. Failures are visible in one place.
The cost is a single point of failure at the control plane. This is mitigated by making the Orchestrator stateless at the process level (all state in Postgres) so it can restart quickly without data loss.
See also: System Design §1, §2
Two ways for a worker to inform the Orchestrator it has finished:
Polling: The Orchestrator repeatedly asks each worker "are you done yet?" on a timer.
Push callbacks: The worker calls a specific endpoint on the Orchestrator when it finishes: POST /jobs/{id}/complete.
Polling requires the Orchestrator to maintain a list of in-flight jobs and query each worker service on a schedule. The polling frequency determines your reaction time — polling every 5 seconds means up to 5 seconds of unnecessary delay per stage transition.
More seriously: workers do not expose an "are you done?" endpoint. The Celery task completes internally — the worker itself is the thing that knows when it is done. There is nothing useful to poll.
The worker is the single authority on its own completion. It fires the callback at the exact moment it finishes — zero delay. The Orchestrator is purely reactive: it waits for callbacks and acts on them.
Push callbacks also carry the completion payload — the content reference after scraping, the chunk count after chunking. This information flows naturally in the callback body. With polling, the Orchestrator would need to request this data separately.
Because workers retry on failure, callbacks can arrive more than once. The idempotency guard in the Orchestrator handles this: a duplicate callback that finds the job already advanced returns 200 and is ignored. See C2.
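Tying this to C2, the duplicate guard can be sketched as a plain function. The `jobs` dict and `NEXT_STAGE` map below are illustrative stand-ins for the Postgres job table and the pipeline topology.

```python
# Illustrative stand-ins for the Postgres job table and the stage topology.
NEXT_STAGE = {"scraping": "chunking", "chunking": "embedding", "embedding": "done"}
jobs: dict[str, dict] = {}

def complete_stage(job_id: str, stage: str) -> dict:
    job = jobs[job_id]
    if job["status"] != stage:
        # Duplicate (or late) callback: the job already advanced past this
        # stage. Succeed with a no-op so the worker's retry loop stops.
        return {"status": "already_advanced"}
    job["status"] = NEXT_STAGE[stage]  # advance the state machine
    return {"status": job["status"]}
```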
A design pattern where the same business logic function is called by both a synchronous HTTP endpoint and an asynchronous Celery task. The logic lives in neither the route nor the task — it lives in a separate core/logic.py file.
```
POST /scrape (sync) ───┐
POST /enqueue (async) ─┼──→ execute_scrape(url, job_id) ──→ result
Celery task ───────────┘
```
Without this separation, the same logic appears in two places — the sync endpoint and the Celery task. When the business logic changes (e.g. the scraping library is updated, error handling improves), both places must change. They will inevitably drift.
With execute_scrape() as the canonical implementation, there is one place to change and both modes benefit automatically.
Both the sync endpoint and the Celery task are thin wrappers:
# Sync endpoint — just calls core + handles timeout
return await asyncio.wait_for(execute_scrape(...), timeout=30)
# Celery task — just calls core + sends callback
result = await execute_scrape(...)
await notify_orchestrator(job_id, stage="scraping", result=result)
Neither contains business logic. Business logic does not contain transport concerns (HTTP, Celery, timeout).
See also: Implementation Guide §3
A pattern for managing distributed transactions across multiple services without a single global transaction. Each step of the pipeline is a local operation, and failures at any step trigger a compensating action rather than a rollback.
The ML pipeline is a saga with three steps: scraping, chunking, embedding. Each step is a local operation — the Scraper fetches a URL, the Embedding service writes vectors. There is no global database transaction spanning all of them.
When a step fails, the job is marked failed, with failed_stage identifying where it stopped.
A full saga would define explicit compensation actions — "if embedding fails, delete the vectors written so far." This system does not implement compensations because partial data in the vector DB is harmless: a re-run will overwrite it. The "compensation" is simply idempotent re-execution.
The saga concept is relevant here primarily as a mental model: each stage is a local commit, and the overall pipeline completes when all stages have committed.
Organising a service so that its read path and write path are explicitly separated — separate routers, separate metric labels, separate scaling considerations.
The Embedding DB API handles both reads (latency-sensitive similarity search queries) and writes (throughput-oriented batch vector upserts from the Celery worker tasks).
If these are mixed in the same code with no separation, a slow batch embedding job will inflate the read latency P99. A deployment to change Celery worker concurrency (a write path concern) redeploys the read API too.
Separate FastAPI routers (different Python files, different metric labels):
```python
# embedding/routers/read.py
with embedding_request_duration.labels(operation="read").time():
    results = vector_db.search(...)

# embedding/worker/tasks.py
with embedding_request_duration.labels(operation="write").time():
    vector_db.upsert(...)
```
The operation label is the critical piece. Without it, a slow write corrupts the read SLO alert. With it, the two paths are independently observable and alertable.
If write and read workloads diverge significantly in their machine profiles (write = GPU for embedding, read = high network throughput for result streaming), the natural next step is to split them into separate services. The separate router structure makes this migration straightforward — split the router files into separate FastAPI apps.
See also: System Design §6
Returning a failure response that contains not just "it failed" but specifically where it failed and why, in a structured format that UI consumers can act on.
MLWorkbench shows a job list. When a job fails, the user needs to know which stage failed and why.
A bare "status": "failed" answers neither question. A structured error object does:
```json
{
  "status": "failed",
  "failed_stage": "scraping",
  "error": {
    "stage": "scraping",
    "code": "http_403",
    "message": "Domain returned 403 — likely blocked or robots.txt restricted",
    "at": "2026-03-14T10:22:05Z"
  }
}
```
MLWorkbench maps failed_stage + error.code to UI copy:
| failed_stage | error.code | User-facing message |
|---|---|---|
| scraping | http_403 | "The URL is blocked. Check that it is publicly accessible." |
| scraping | connection_timeout | "Could not reach the URL. Check your network and try again." |
| embedding | openai_rate_limit | "Indexing failed temporarily. Try again in a few minutes." |
| embedding | token_limit_exceeded | "Document is too large to index. Try a smaller section." |
failed_stage is a column, not a status variant. Separate status values like failed_scraping, failed_chunking would bloat the CHECK constraint and complicate every transition check in the state machine. failed_stage as a separate column keeps the state machine simple (one failed terminal state) while providing the context MLWorkbench needs for filtering and display.
See also: System Design §2.3