Organisation: CloudCIX
Version: March 2026 — v1.0
Related documents:
This document explains why each technology was chosen, what alternatives were considered, and what would cause us to revisit the decision.
Chosen: Redis
The broker holds Celery tasks between the moment the Orchestrator enqueues them and the moment a worker picks them up. It is also the Celery result backend, holding task completion state.
Redis is primarily a key-value store that happens to support list and sorted set operations well enough to serve as a Celery broker. For this use case it is the right choice:
Pros:
- The `redis` transport is mature and well-tested

Cons:
The critical constraint: `maxmemory-policy noeviction` is mandatory. Any policy that evicts keys (`allkeys-lru`, `volatile-lru`) can silently discard queued tasks. There is no "message eviction alert" — the task simply disappears.
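Because nothing alerts on eviction, it is worth failing fast at startup. A minimal sketch of the check (the startup hook and connection details are illustrative, not the project's actual code):

```python
def broker_config_is_safe(config: dict) -> bool:
    """True only when Redis cannot silently evict queued Celery tasks."""
    return config.get("maxmemory-policy") == "noeviction"

# At service startup, fetch the live policy with redis-py and refuse to boot otherwise:
#   r = redis.Redis.from_url(BROKER_URL)
#   assert broker_config_is_safe(r.config_get("maxmemory-policy"))
```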
RabbitMQ is a purpose-built message broker with first-class support for the patterns we use: acknowledgements, dead letter exchanges, message TTL, consumer groups.
Pros:
Cons:
When RabbitMQ would be the right choice: when DLQ complexity grows to the point where manual dead letter handling in application code becomes a maintenance burden. RabbitMQ's native DLQ routing and per-message TTL simplify this significantly.
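For contrast, RabbitMQ's native dead-lettering and TTL are declared as queue arguments at declaration time. A sketch — the exchange names, routing key, and TTL value below are illustrative:

```python
# Arguments passed when declaring the work queue (e.g. via kombu's queue_arguments).
# Exchange names and the TTL are hypothetical values for illustration.
DLQ_QUEUE_ARGS = {
    "x-dead-letter-exchange": "ml_tasks.dlx",  # rejected/expired messages route here
    "x-dead-letter-routing-key": "dead",
    "x-message-ttl": 3_600_000,                # ms: expire unconsumed tasks after 1 hour
}
```

With this in place, dead-letter handling moves out of application code and into the broker.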
Kafka is a distributed log — fundamentally different from a traditional message queue. It stores messages durably on disk and allows consumers to replay them.
Pros:
Cons:
When Kafka would be the right choice: if the system needed to replay historical jobs for debugging or reprocessing (e.g. the embedding model changes and all historical content needs re-embedding). The append-only log with configurable retention would make this trivial. With Redis or RabbitMQ it requires separate tooling.
SQS is a managed queue service — no servers to operate.
Pros:
Cons:
When SQS would be the right choice: when operating on AWS and wanting to eliminate one more stateful service to manage. The operational savings are real if you are already managing other AWS infrastructure.
| | Redis | RabbitMQ | Kafka | SQS |
|---|---|---|---|---|
| Ops overhead | Low (already present) | Medium | High | None (managed) |
| DLQ | Manual (application code) | Native | Via separate topic | Native |
| Message replay | No | No | Yes | No |
| Throughput ceiling | ~100K msg/s per node | ~50K msg/s | Millions/s | ~3K msg/s (standard) |
| Celery support | Excellent | Excellent | Limited | Good |
| Best for | Small-medium queue workloads already using Redis | Complex routing, native DLQ | Stream processing, replay | AWS-native deployments |
Revisit trigger: DLQ management complexity grows — manual dead letter handling becomes a maintenance burden, or replay capability is needed for model retraining.
Chosen: PostgreSQL
The jobs table is updated frequently (one status transition per pipeline stage) and queried in structured ways: filter by caller_id, filter by status, find jobs stuck for > 30 minutes, count failures by failed_stage. These are relational queries.
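As a sketch, the queries above in SQL — status values and exact column names beyond those named in the text are assumptions:

```python
# Illustrative SQL for the structured queries described above.
JOB_QUERIES = {
    "jobs_for_caller": "SELECT * FROM jobs WHERE caller_id = $1",
    "stuck_jobs": """
        SELECT * FROM jobs
        WHERE status NOT IN ('completed', 'failed')
          AND updated_at < now() - interval '30 minutes'
    """,
    "failures_by_stage": """
        SELECT failed_stage, count(*) AS failures
        FROM jobs
        WHERE status = 'failed'
        GROUP BY failed_stage
    """,
}
```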
Pros:
- A transactional `FOR UPDATE` lock in the callback handler prevents duplicate state advancement. This correctness guarantee is difficult to replicate in non-transactional stores.
- The `error` field is stored as JSONB, allowing both structured access (`error->>'code'`) and unstructured extension
- Partial indexes: `CREATE INDEX ON jobs (failed_stage) WHERE status = 'failed'` indexes only the rows that matter
- `asyncpg` — a fully async Python driver with excellent performance and connection pooling

Cons:
- The `jobs` table is UPDATE-heavy — each stage transition creates a dead tuple, so autovacuum must be monitored.

MySQL is functionally similar to Postgres for this use case.
Where MySQL falls short:
- `error` would need to be TEXT with manual JSON parsing, or a separate table
- The asyncio driver ecosystem is smaller (`aiomysql` is less mature than `asyncpg`)
- `FOR UPDATE SKIP LOCKED` (useful for job queuing patterns) is supported but less commonly used in the MySQL ecosystem

When MySQL would be acceptable: if the organisation already operates MySQL and has expertise there. The functional difference for this use case is real but not disqualifying.
MongoDB is a document store — schemaless, horizontally scalable, JSON-native.
Pros:
Cons:
- A `FOR UPDATE` equivalent requires explicit session-based transactions — more complex to implement correctly
- The jobs data is inherently relational (one job, many stage transitions, one caller) — a document model does not improve on a relational model here

When MongoDB would be the right choice: if job data were deeply nested, schemaless, or document-oriented in nature. For a flat table with predictable columns and relational queries, it adds complexity without benefit.
Redis is already used for caching and queuing. Why not also use it for job state?
The core problem is queryability. Redis does not support SQL-style queries. "Find all jobs with status='failed' in the last hour for caller_id='abc'" requires either maintaining hand-built secondary indexes (a set per status, a set per caller, a sorted set per timestamp) updated on every write, or scanning every job key and filtering client-side.
Every query the Orchestrator needs (stuck job detection, per-caller job list, failure analysis) is trivial in SQL and a significant engineering effort in Redis. The operational cost of running Postgres alongside Redis is minimal — a single additional container with a volume mount.
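To make the asymmetry concrete, here is one such query in SQL next to a hand-rolled Redis equivalent — all Redis key names below are hypothetical:

```python
# One SQL statement...
FAILED_JOBS_FOR_CALLER_SQL = """
    SELECT id FROM jobs
    WHERE status = 'failed'
      AND caller_id = $1
      AND updated_at > now() - interval '1 hour'
"""

# ...versus a Redis plan that only works if every index below was maintained by
# hand on every single write to a job:
def redis_query_plan(caller_id: str) -> list[str]:
    return [
        "SMEMBERS jobs:status:failed",               # secondary index: status
        f"SMEMBERS jobs:caller:{caller_id}",         # secondary index: caller
        "ZRANGEBYSCORE jobs:updated <1h-ago> +inf",  # secondary index: update time
        "intersect all three result sets client-side",
    ]
```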
Redis remains the right choice for: Celery result backend (in-flight task state), rate limit sorted sets, auth claim cache. These do not require queries — only key lookups.
| | PostgreSQL | MySQL | MongoDB | Redis |
|---|---|---|---|---|
| ACID transactions | Yes | Yes | Partial | No |
| Complex queries | Excellent | Good | Limited | None |
| JSONB column | Native | No | Native | No |
| Async Python driver | asyncpg (excellent) | aiomysql (good) | motor (good) | redis-py asyncio (excellent) |
| UPDATE-heavy workload | Requires autovacuum monitoring | Similar | Better (no MVCC) | No writes to disk |
| Best for | Structured, queryable, transactional job data | Same, if org prefers MySQL | Schemaless, document-oriented data | Ephemeral fast-access data |
Revisit trigger: job data volume grows to the point where a single Postgres node is a bottleneck. At that scale, read replicas are the first step; sharding is rarely needed for a jobs table.
Chosen: Celery
Celery is the dominant Python task queue with a large ecosystem, extensive documentation, and broad production deployment history.
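The reliability-relevant settings discussed in this section can be sketched as a config dict — the values are illustrative, not the project's actual tuning:

```python
# Illustrative Celery reliability configuration, applied via app.conf.update(**CELERY_CONFIG).
CELERY_CONFIG = {
    "task_acks_late": True,            # ack after the task finishes, not on receipt
    "worker_prefetch_multiplier": 1,   # stop one worker hoarding queued tasks
    "broker_transport_options": {
        "visibility_timeout": 3600,    # seconds before an unacked task is redelivered
    },
}
```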
Pros:
- `task_acks_late`, `visibility_timeout`, `worker_prefetch_multiplier` — all the reliability primitives needed are present and configurable
- `autoretry_for` with `retry_backoff` and `retry_jitter` — retry semantics built in
- `CeleryInstrumentor` for OTel — trace context propagation across queue boundaries is one line
- `beat` scheduler for periodic tasks (stuck job detection)

Cons:
- Unsafe defaults (`task_acks_late = False` by default)
- Asyncio support (`celery[asyncio]`) is newer and less battle-tested than synchronous Celery

Dramatiq is a modern Python task queue designed to fix Celery's rough edges.
Pros:
- `django-dramatiq` and `flask-dramatiq` integrations available

Cons:
- No `autoretry_for` equivalent (requires middleware)
- No OTel integration comparable to `CeleryInstrumentor`

When Dramatiq would be the right choice: a greenfield project where you want better defaults without Celery's legacy configuration surface.
ARQ is a Redis-backed async task queue built specifically for asyncio. Lightweight, no worker subprocess — the worker runs in the same asyncio event loop as the FastAPI app.
Pros:
Cons:
When ARQ would be the right choice: a lightweight async-first Python service with simple queue requirements and no need for broker flexibility.
Temporal is a workflow orchestration platform — a durable execution engine that persists workflow state and provides first-class saga, retry, and compensation primitives.
Pros:
Cons:
When Temporal would be the right choice: when pipeline complexity grows significantly — more stages, complex branching logic (e.g. different chunking strategies based on document type), human-in-the-loop approval steps, or long-running workflows measured in hours or days. Temporal's durable execution would eliminate the need for stuck job detection, visibility timeout configuration, and most of the resilience machinery described in this codebase.
| | Celery | Dramatiq | ARQ | Temporal |
|---|---|---|---|---|
| Maturity | Excellent | Good | Early | Excellent |
| Python async support | Partial | Partial | Native | Native |
| OTel integration | CeleryInstrumentor | Manual | Manual | SDK built-in |
| Broker flexibility | Redis, RabbitMQ, SQS | Redis, RabbitMQ | Redis only | Postgres/Cassandra backend |
| Saga / compensation | Manual | Manual | Manual | First-class |
| Ops overhead | Low | Low | Low | High |
| Best for | Most Python async workloads | Celery replacement with better defaults | Simple async-first services | Complex multi-step workflows |
Revisit trigger: pipeline branching logic grows beyond simple linear stages; long-running workflows with human-in-the-loop steps are needed; stuck job handling becomes complex enough to justify Temporal's durable execution model.
Chosen: FastAPI
Pros:
- `async def` by default, which is correct for a service that makes external HTTP calls (Membership, the H100 inference server, downstream services) on every request
- `Depends()` injection — auth, rate limiting, and audit logging compose cleanly as dependencies rather than decorators

Cons:
Flask is a minimal WSGI framework.
Pros:
Cons:
- Async support requires `asyncio.run()` or `flask[async]`, which has subtle gotchas

When Flask would be the right choice: simple synchronous services, or teams with deep Flask expertise where the retraining cost of FastAPI is not justified.
Django is a batteries-included full-stack framework.
Pros:
Cons:
- Async support (`async_to_sync`, ASGI) is improving but still secondary
- Cannot use `asyncpg` directly — requires `django-channels` or similar

When Django would be the right choice: a service with complex database models, admin interfaces, or ORM-heavy operations.
| | FastAPI | Flask | Django |
|---|---|---|---|
| Async support | Native | Optional | Partial |
| Request validation | Pydantic | Manual / Marshmallow | DRF Serializers |
| Dependency injection | `Depends()` | Manual | Manual |
| Auto-docs | Yes | No (flask-apispec) | DRF browsable API |
| Best for | API services with external calls | Simple sync services | Full-stack with ORM |
Chosen: nginx (with Traefik as the preferred alternative for Docker Compose)
See Design Concepts §B1 for the rationale behind having a gateway layer at all.
nginx is a mature, high-performance HTTP server and reverse proxy.
Pros:
- `limit_req_zone` — per-IP and per-header rate limiting built in, no plugins needed

Cons:
- Service changes require editing `nginx.conf` and reloading
- No native metrics — requires a `nginx-prometheus-exporter` sidecar

Traefik is a modern reverse proxy designed for dynamic containerised environments.
Pros:
- Auto-discovery via labels in `docker-compose.yml` — no nginx config change needed

Cons:
- Rate limiting is less flexible than `limit_req_zone` — Traefik's rate limiting is per-router, nginx's is per-zone with multiple zones per location

Recommendation: Traefik is preferred for the Docker Compose deployment because auto-discovery eliminates the most common operations task (updating `nginx.conf` when services change or are added). For any infrastructure not running Docker, nginx is the better choice.
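For reference, the nginx primitive mentioned above looks like this — the zone name, rate, burst, and upstream are illustrative values, not the deployed config:

```nginx
# Shared-memory zone keyed by client IP: 10 req/s steady state.
limit_req_zone $binary_remote_addr zone=per_ip:10m rate=10r/s;

server {
    location /api/ {
        # Allow short bursts; reject excess with 429 instead of queueing.
        limit_req zone=per_ip burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://orchestrator:8000;
    }
}
```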
Kong is a full API gateway with a plugin ecosystem.
Pros:
Cons:
When Kong would be the right choice: when CloudCIX ML Services becomes a public API product with multiple developer teams, tier-based rate limits displayed in a developer portal, and API key management self-service.
| Option | Best for | Trade-offs |
|---|---|---|
| Cloudflare | DDoS at DNS level | Vendor lock-in; API key visible to Cloudflare |
| AWS ALB + WAF | AWS-native deployments | AWS ecosystem; Cognito for auth offload |
| GCP Cloud Armor | GCP-native deployments | GCP ecosystem |
These are relevant when operating on managed infrastructure (ECS, GKE) where a VM-based gateway is not the natural choice.
| | nginx | Traefik | Kong | Cloud |
|---|---|---|---|---|
| Service discovery | Manual | Docker labels | Admin API | Cloud metadata |
| Rate limiting | Built-in (excellent) | Middleware (good) | Plugin (excellent) | WAF rules |
| Native metrics | No (exporter) | Yes | Yes | CloudWatch |
| Native OTel | No | Yes | Plugin | CloudWatch Traces |
| Ops overhead | Low | Low | High | None (managed) |
| Best for | Static infra, fine control | Docker Compose dynamic | API product scale | Managed cloud infra |
Chosen: LGTM+P (Mimir · Loki · Tempo · Pyroscope) + Grafana
Loki for logs, Tempo for traces, Mimir for metrics, Pyroscope for profiles, all unified in Grafana with correlated navigation.
Pros:
- Logs are correlated with traces via `trace_id` automatically

Cons:
The ELK stack (Elasticsearch, Logstash, Kibana) is the traditional self-hosted log analytics platform.

Pros:
Cons:
Datadog is a fully managed observability SaaS.
Pros:
Cons:
When Datadog would be the right choice: when the team wants zero observability infrastructure overhead and cost is not the primary constraint. Common in early-stage startups or teams without dedicated infrastructure engineers.
CloudWatch is AWS's native monitoring and logging service.

Pros:
Cons:
When CloudWatch would be the right choice: AWS-native deployment where you want to minimise non-AWS infrastructure. Often used in combination with a third-party tool (Datadog, Grafana Cloud) for the observability surfaces CloudWatch lacks.
| | LGTM+P | ELK | Datadog | CloudWatch |
|---|---|---|---|---|
| Metrics | Mimir (PromQL) | Elasticsearch | Full | CloudWatch Metrics |
| Logs | Loki (LogQL) | Elasticsearch | Full | CloudWatch Logs |
| Traces | Tempo (TraceQL) | APM (separate) | Full | X-Ray |
| Profiles | Pyroscope | No | Full (APM) | No |
| Trace↔Log correlation | Yes (native) | No (manual) | Yes | No |
| Trace↔Profile correlation | Yes (native) | No | Yes | No |
| Local dev support | Yes (single binary) | Partial | No | No |
| Ops overhead | Medium | High | None (managed) | None (managed) |
| Cost | Open source + hosting | Open source + hosting | High at scale | Moderate at scale |
Revisit trigger: operational overhead of running LGTM+P becomes significant relative to team size. At that point, Grafana Cloud (managed LGTM+P) or Datadog are natural transitions with minimal instrumentation changes (both support OTLP).
A vector database stores and queries high-dimensional embedding vectors. The specific choice depends on your corpus size, query latency requirements, and infrastructure constraints. We do not prescribe a single choice here — the system is designed so that the Embedding DB API abstracts the vector store, and the underlying implementation can change.
| Dimension | Considerations |
|---|---|
| Scale | How many vectors? 100K is trivially handled by many options; 100M requires careful selection |
| Query latency | P99 latency target for MLWorkbench corpus search (system design: < 200ms) |
| Hosting model | Managed cloud service (Pinecone, Weaviate Cloud) vs self-hosted (Qdrant, Milvus, pgvector) |
| Metadata filtering | Can you filter by corpus_name before ANN search? All serious options support this. |
| Upsert semantics | Does it support upsert by ID? Required for our idempotency model. |
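The upsert-by-ID requirement implies vector IDs must be deterministic across retries. A minimal sketch — the ID scheme and field names are hypothetical, not the system's actual ones:

```python
import hashlib

def chunk_vector_id(corpus_name: str, source_url: str, chunk_index: int) -> str:
    """Deterministic ID: re-running the same job upserts rather than duplicating."""
    key = f"{corpus_name}:{source_url}:{chunk_index}".encode()
    return hashlib.sha256(key).hexdigest()
```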
| | pgvector | Qdrant | Pinecone | Weaviate | Milvus |
|---|---|---|---|---|---|
| Hosting | Self (Postgres extension) | Self or managed | Managed only | Self or managed | Self or managed |
| Scale | Good to ~10M vectors | Excellent | Excellent | Excellent | Excellent |
| Query P99 | ~10-50ms at moderate scale | ~1-10ms | ~10-50ms | ~5-20ms | ~1-10ms |
| Ops overhead | Low (uses existing Postgres) | Low | None | Medium | High |
| Upsert by ID | Yes | Yes | Yes | Yes | Yes |
pgvector deserves a mention because it runs inside the existing Postgres instance — no new service to operate. For corpus sizes up to ~5-10M vectors with moderate query rates, it is a practical choice that eliminates infrastructure complexity.
Chosen: Self-hosted inference server (OpenAI-compatible API) on NVIDIA H100
CloudCIX operates its own H100 GPU server running an OpenAI-compatible API endpoint. The Embedding DB API calls it identically to the OpenAI API — same openai Python client, different base_url.
```python
# embedding/core/client.py
from openai import AsyncOpenAI

# `settings` is the service's environment-driven configuration object
client = AsyncOpenAI(
    base_url=settings.EMBEDDING_API_BASE_URL,  # points to internal H100 server
    api_key=settings.EMBEDDING_API_KEY,        # internal auth key, not OpenAI
)
```
No code changes are needed to switch models or swap the inference server — only environment variables (EMBEDDING_MODEL, EMBEDDING_API_BASE_URL) change.
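For example, pointing the service at a different model on a different endpoint is a deployment-time change only — the hostname and model below are illustrative values:

```bash
EMBEDDING_MODEL=bge-m3
EMBEDDING_API_BASE_URL=http://h100-inference.internal:8000/v1
```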
The NVIDIA H100 80GB SXM5 is the current generation flagship for inference workloads. For embedding models its throughput is substantial:
| Model | Dimensions | Approx throughput (H100) | Chunks/sec (512 tokens) |
|---|---|---|---|
| `bge-large-en-v1.5` | 1024 | ~80,000 tokens/sec batched | ~156/sec |
| `bge-m3` | 1024 | ~60,000 tokens/sec batched | ~117/sec |
| `e5-mistral-7b` | 4096 | ~20,000 tokens/sec batched | ~39/sec |
| `all-MiniLM-L6-v2` | 384 | ~200,000 tokens/sec batched | ~390/sec |
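The chunks/sec column is simply token throughput divided by chunk size; a quick check against the first row:

```python
def chunks_per_sec(tokens_per_sec: float, chunk_tokens: int = 512) -> float:
    """Embedding chunk throughput implied by raw token throughput."""
    return tokens_per_sec / chunk_tokens

# 80,000 tokens/sec at 512-token chunks is 156.25, matching the table's ~156/sec.
```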
At these rates, the embedding inference step is not a bottleneck in the pipeline. Scraping network I/O dominates total pipeline time. The inference server would need to be saturated with hundreds of concurrent Embedding workers before it becomes a constraint.
| Concern | Cloud OpenAI | Self-hosted H100 |
|---|---|---|
| Cost per token | ~$0.02 per 1M tokens | Near-zero (amortised hardware) |
| Rate limits | Hard tier limits (tokens/min) | None — bounded by GPU throughput |
| Data egress | Content leaves CloudCIX | Data stays within infrastructure |
| Latency | ~50-200ms per batch (network) | ~5-20ms per batch (LAN) |
| Availability | OpenAI SLA | Depends on H100 server uptime |
| Budget alerts | Required (`openai_tokens_used_total`) | Not needed for cost — still useful for volume tracking |
| Circuit breaker | Required | Still required — H100 server can fail |
The circuit breaker on the embedding client remains necessary — the H100 server can become unavailable (hardware failure, maintenance, OOM on the GPU). The 503 response behaviour and the retry-with-backoff pattern are identical regardless of whether the backend is OpenAI or self-hosted.
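A minimal illustration of the breaker behaviour described above — thresholds, timing, and the class shape are illustrative, not the codebase's actual implementation:

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; allow a probe after a cooldown."""

    def __init__(self, max_failures: int = 5, reset_after: float = 30.0) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.reset_after:
            # Half-open: permit one probe request through.
            self.opened_at = None
            self.failures = 0
            return True
        return False  # fail fast: caller returns 503 without touching the backend

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()

    def record_success(self) -> None:
        self.failures = 0
```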
The openai_tokens_used_total metric is still worth keeping for volume tracking — knowing how many tokens are being embedded is useful for capacity planning even without cost implications.
Choosing an embedding model on your own hardware is primarily a quality vs throughput tradeoff. The Embedding DB API is designed so that swapping models is a change to settings.EMBEDDING_MODEL — the API contract, Celery task, and vector DB writes are unchanged. Changing the model on an existing corpus requires re-embedding all previously indexed content.