Organisation: CloudCIX
Version: March 2026 — v1.0
Related documents:
┌─────────────────────────────────────────────────────────────────────────┐
│ Current production topology │
├──────────────┬──────────────┬──────────────┬──────────────┬────────────┤
│ Gateway VM │ Orchestrator │ Scraper VM │ Chunking VM │ Embedding │
│ │ VM │ │ │ VM │
│ nginx │ FastAPI │ FastAPI │ FastAPI │ FastAPI │
│ Alloy │ Celery │ Celery │ (no Celery) │ Celery │
│ │ Postgres │ redis-scrape │ │ redis-embed│
│ │ redis-orch │ Alloy │ Alloy │ Alloy │
│ │ Alloy │ │ │ │
├──────────────┴──────────────┴──────────────┴──────────────┴────────────┤
│ External │
│ H100 Server — self-hosted inference (OpenAI-compatible API) │
│ K8s Cluster (4×4c8g nodes) — LGTM+P monitoring stack │
└─────────────────────────────────────────────────────────────────────────┘
Total VM count: 5 microservice VMs + 1 gateway VM
GPU infrastructure: existing H100 server (not sized here — already deployed)
Monitoring: existing 4-node K8s cluster (not sized here — already deployed)
All throughput calculations in this document are derived from the following baseline assumptions. These are stated explicitly so they can be adjusted — different assumptions produce different capacity numbers.
| Parameter | Assumed value | Basis |
|---|---|---|
| Average document word count | 5,000 words | Mid-length web article / documentation page |
| Characters per word (English) | ~6 chars | Including spaces |
| Average document size | ~30,000 characters | 5,000 × 6 |
| Tokens per character | ~0.25 tokens | GPT tokeniser, English prose |
| Average document token count | ~7,500 tokens | 30,000 × 0.25 |
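These conversions can be checked directly; a minimal Python sketch of the table's arithmetic (the constants are the stated assumptions, not measured values):

```python
# Baseline assumptions from the table above (assumed, not measured)
CHARS_PER_WORD = 6       # English, including spaces
TOKENS_PER_CHAR = 0.25   # GPT-style tokeniser, English prose

def doc_tokens(words: int) -> int:
    """Estimated token count for a document of the given word count."""
    return round(words * CHARS_PER_WORD * TOKENS_PER_CHAR)

print(doc_tokens(5_000))  # -> 7500
```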
| Parameter | Assumed value | Basis |
|---|---|---|
| Chunk size | 512 tokens | Common default for dense retrieval |
| Chunk overlap | 64 tokens | ~12.5% overlap for context continuity |
| Effective chunk stride | 448 tokens | 512 - 64 |
| Chunks per document | ~17 | ceil(7,500 / 448) |
| Characters per chunk | ~2,048 characters | 512 tokens × 4 chars/token |
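The chunks-per-document figure follows from the effective stride; a small sketch (the function name `chunks_per_doc` is illustrative):

```python
import math

CHUNK_SIZE = 512
CHUNK_OVERLAP = 64
STRIDE = CHUNK_SIZE - CHUNK_OVERLAP  # 448-token effective stride

def chunks_per_doc(doc_tokens: int) -> int:
    """Number of 512-token chunks covering a document at 448-token stride."""
    return math.ceil(doc_tokens / STRIDE)

print(chunks_per_doc(7_500))  # -> 17
```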
| Stage | Best case | Typical | Worst case | Bottleneck |
|---|---|---|---|---|
| Job submission (Orchestrator DB write) | 5ms | 15ms | 100ms | Postgres latency |
| Scraping (full HTTP fetch) | 500ms | 2s | 10s | Target site speed, robots.txt delays |
| Chunking (CPU tokenisation + split) | 20ms | 80ms | 500ms | Document size, tokeniser CPU |
| Embedding (17 chunks, H100 LAN batch) | 5ms | 30ms | 200ms | LAN latency, H100 load |
| Vector DB write (17 upserts) | 10ms | 50ms | 300ms | Vector DB write throughput |
| Total pipeline | ~550ms | ~2.2s | ~11s | Scraping dominates |
Scraping is the pipeline bottleneck in virtually all scenarios. The H100 self-hosted inference is a near-zero contributor to pipeline time at this scale — 17 chunks at 512 tokens = 8,704 tokens per document, and the H100 processes >60,000 tokens/second even for large models. The embedding step is effectively free compared to a network fetch.
With one Celery worker per stage and assuming typical case times:
| Stage | Time per document | Max docs/hour (1 worker) |
|---|---|---|
| Scraping | 2s | 1,800 |
| Chunking | — (sync, blocks the Orchestrator) | Not queued |
| Embedding | 30ms | 120,000 |
The effective throughput with 1 scraper worker: ~1,800 documents/hour assuming no rate limiting on target sites.
In practice, most deployments see 500–1,200 docs/hour per scraper worker because of per-domain politeness delays, retries on failed fetches, slow target sites, and robots.txt Crawl-delay directives.

Conservative realistic estimate: 600 docs/hour per scraper worker.
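A rough throughput model under these assumptions (the efficiency factor is a stand-in guess for rate limits and retries, not a measured value):

```python
def docs_per_hour(workers: int, seconds_per_doc: float, efficiency: float = 1.0) -> int:
    """Upper-bound ingest rate for I/O-bound scraper workers, scaled by an efficiency guess."""
    return round(workers * 3600 / seconds_per_doc * efficiency)

print(docs_per_hour(1, 2.0))                  # ideal: -> 1800
print(docs_per_hour(1, 2.0, efficiency=1/3))  # realistic: -> 600
```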
The scraper is the only stage that meaningfully benefits from additional workers, because it is I/O-bound — workers spend most of their time waiting on network responses, not consuming CPU.
| Scraper workers | Realistic docs/hour | Notes |
|---|---|---|
| 2 | ~1,200 | Fits comfortably on 4c8g |
| 4 | ~2,400 | Still fits 4c8g (mostly I/O wait) |
| 8 | ~4,800 | May need more RAM for Redis + worker processes |
| 16 | ~9,600 | Saturates most target sites — domain rate limits become dominant |
| Documents/hour | Chunks/hour | Chunk characters/hour | Embedding tokens/hour |
|---|---|---|---|
| 600 | 10,200 | 20.9M chars | 5.2M tokens |
| 1,200 | 20,400 | 41.8M chars | 10.5M tokens |
| 2,400 | 40,800 | 83.5M chars | 20.9M tokens |
| 9,600 | 163,200 | 334M chars | 83.5M tokens |
H100 perspective: 83.5M tokens/hour = ~23,000 tokens/second average. The H100 handles >60,000 tokens/second for a standard embedding model. Even at maximum scraper throughput (16 workers, ~9,600 docs/hour), the H100 is running at less than 40% utilisation for embedding alone.
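The utilisation claim can be checked with the same constants (60,000 tokens/second is the conservative floor quoted above, not a benchmark result):

```python
CHUNKS_PER_DOC = 17
TOKENS_PER_CHUNK = 512
H100_TOKENS_PER_SEC = 60_000  # conservative floor from the text

def h100_utilisation(docs_per_hour: int) -> float:
    """Fraction of H100 embedding throughput consumed at a given ingest rate."""
    tokens_per_sec = docs_per_hour * CHUNKS_PER_DOC * TOKENS_PER_CHUNK / 3600
    return tokens_per_sec / H100_TOKENS_PER_SEC

print(f"{h100_utilisation(9_600):.0%}")  # -> 39%
```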
Corpus search is independent of ingestion and bounded by vector DB query performance.
| Operation | Typical latency | Notes |
|---|---|---|
| Query embedding (H100) | ~5ms | Single 50-token query, trivial for H100 |
| Vector DB ANN search | 10–50ms | Depends on index size and DB choice |
| Response serialisation | 2–5ms | Typically 10 results × ~500 chars each |
| Total search P50 | ~20–60ms | End-to-end including nginx + FastAPI |
| Total search P99 | ~80–200ms | Under concurrent load |
At these latencies, a single Embedding VM with 4 FastAPI workers can handle 200–500 concurrent search requests before P99 degrades past 500ms.
Each specification includes the rationale for every component. The current 4c8g baseline is assessed per service.
Recommended: 2 vCPU · 4 GB RAM
nginx is almost entirely network I/O — it proxies bytes between clients and upstream services. CPU and memory requirements are minimal.
| Resource | Requirement | Rationale |
|---|---|---|
| vCPU | 2 | nginx is single-threaded per worker process; 2 processes covers the load comfortably |
| RAM | 4 GB | nginx itself uses ~10–50 MB; 4 GB leaves room for OS, Alloy sidecar (~100 MB), and buffer |
| Network | 1 Gbps | All traffic passes through here; bandwidth is the real constraint at high load |
| Storage | 20 GB | OS + nginx access logs (rotated); Alloy log buffer |
nginx connection capacity on 2c4g:
- worker_processes 2 (one per vCPU)
- worker_connections 4096 per process

At typical HTTP/1.1 keep-alive with short requests, this handles 3,000–5,000 requests/second comfortably. For MLWorkbench at any reasonable user count this is not the bottleneck.
Scale trigger: nginx_connections_active / (worker_processes × worker_connections) > 0.7. In practice this VM will not need scaling before every other component.
Recommended: 4 vCPU · 16 GB RAM (upgrade from 4c8g)
The Orchestrator runs three components that have different resource profiles:
Low compute — mostly DB reads/writes and HTTP calls to downstream services. 2 vCPUs is sufficient for the application tier.
Postgres is the main reason to upgrade from 4c8g to 4c16g. Postgres performance is strongly correlated with how much of the working set fits in shared_buffers (typically set to 25% of RAM).
| RAM | shared_buffers | jobs table rows before hot data evicts from buffer |
|---|---|---|
| 8 GB | 2 GB | ~10M rows (rough estimate at ~200 bytes/row) |
| 16 GB | 4 GB | ~20M rows |
| 32 GB | 8 GB | ~40M rows |
At 600 docs/hour, the jobs table grows by 14,400 rows/day. At 16 GB, the entire jobs table fits in shared_buffers for over 3 years of operation before performance degrades. At 8 GB, you hit buffer pressure much sooner.
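A sketch of the headroom arithmetic (decimal GB to match the table's round numbers):

```python
ROW_BYTES = 200                    # rough per-row estimate from the table
SHARED_BUFFERS_BYTES = 4 * 10**9   # 4 GB shared_buffers on the 16 GB VM
ROWS_PER_DAY = 600 * 24            # 14,400 new job rows/day at 600 docs/hour

rows_capacity = SHARED_BUFFERS_BYTES // ROW_BYTES
years = rows_capacity / ROWS_PER_DAY / 365
print(f"~{rows_capacity // 10**6}M rows, ~{years:.1f} years before buffer pressure")
```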
The result backend only holds in-flight task state — not job history. Peak memory is bounded by the number of concurrently executing tasks, not total job volume. Even at maximum throughput (32 concurrent tasks), this is under 50 MB.
| Resource | Requirement | Rationale |
|---|---|---|
| vCPU | 4 | 2 for FastAPI/Celery, 2 for Postgres background work |
| RAM | 16 GB | Postgres shared_buffers: 4 GB; OS + Alloy: 2 GB; headroom: 10 GB |
| Storage | 100 GB SSD | Postgres data volume; assume 200 bytes/job row × 5M jobs + WAL + indexes |
| Network | 1 Gbps | Handles callbacks from all worker VMs |
Postgres storage estimate: 5M job rows × ~200 bytes = ~1 GB of raw row data; indexes, WAL, and table bloat typically multiply this several times over, which the 100 GB volume covers comfortably.
Recommended postgresql.conf settings (add to Postgres container env or config mount):
shared_buffers = 4GB # 25% of 16 GB VM RAM
effective_cache_size = 12GB # guide query planner
work_mem = 16MB # conservative — multiply by max_connections for worst-case heap
maintenance_work_mem = 512MB # for VACUUM and index builds
max_connections = 50 # matched to asyncpg connection pool size
OOM note:
work_mem is allocated per sort or hash operation, not per connection; a single complex query can consume several multiples of it. With max_connections = 50 and one concurrent ORDER BY per connection, peak memory is at least work_mem × 50 = 800 MB. Keep work_mem at 16 MB or lower on the 16 GB VM. Increasing it without bounds risks Postgres consuming all available RAM under concurrent query load.
Recommended: 4 vCPU · 8 GB RAM (4c8g is correct)
Scraping is almost entirely network I/O. Workers spend >90% of their time waiting for HTTP responses — they are not consuming CPU or burning through RAM.
| Resource | Requirement | Rationale |
|---|---|---|
| vCPU | 4 | Workers are async and I/O-bound, but the Celery worker processes, redis-scrape, and Alloy still benefit from multiple cores |
| RAM | 8 GB | 4 scraper workers × ~200 MB RSS each = 800 MB; redis-scrape = 256 MB max; Alloy = 100 MB; headroom |
| Storage | 40 GB | OS, logs, temporary content storage if content_ref is a local file before handoff |
| Network | 1 Gbps | HTTP fetching — this is the real bottleneck, not the VM itself |
Worker count recommendation: Start with 4 workers (--concurrency=4). This is conservative but avoids hammering target sites and triggering rate limits. Increase to 8 if throughput needs to improve and target sites permit.
Memory per worker estimate:
An httpx.AsyncClient connection pool adds roughly 20 MB on top of the base worker process; budgeting ~100 MB per worker, 4 workers × 100 MB = 400 MB. 8 GB has substantial headroom.
Scale trigger: queue depth on redis-scrape consistently > 200 tasks while workers are healthy → add workers or add a second scraper VM.
Recommended: 4 vCPU · 8 GB RAM (4c8g is correct, but CPU matters)
Chunking is the only synchronous-call stage in the pipeline. The Orchestrator calls it and waits for the response. It is CPU-bound — text tokenisation using HuggingFace tokenisers or similar libraries is compute-intensive for large documents.
| Resource | Requirement | Rationale |
|---|---|---|
| vCPU | 4 | Each concurrent chunking request occupies a CPU core for its duration. 4 concurrent requests = 4 cores. |
| RAM | 8 GB | Tokeniser model loaded once (~500 MB for a full tokeniser vocab); FastAPI workers: ~200 MB each × 4; Alloy: 100 MB |
| Storage | 20 GB | OS, logs |
| Network | 1 Gbps | Receives document content from Orchestrator, returns chunk list |
Chunking time sensitivity:
Because chunking is synchronous, a single very large document can block an Orchestrator stage handler for 1.5 seconds. With 4 Uvicorn workers on the Chunking VM, you can handle 4 concurrent chunk requests simultaneously.
Scale trigger: chunking_duration_seconds P99 > 2s consistently, or Orchestrator _start_chunking calls are queuing. Add Uvicorn workers first (--workers 8); add RAM if tokeniser model reloading is observed.
CPU note: if you need to process many large documents concurrently, consider 8 vCPU for this VM specifically. The cost is minimal and it directly increases chunking concurrency without code changes.
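The Chunking service's actual splitter isn't shown in this document; a minimal sketch of stride-based splitting over token IDs, assuming the 512/64 parameters above (a trailing chunk shorter than the overlap is possible and left unhandled here):

```python
def split_tokens(tokens: list[int], size: int = 512, overlap: int = 64) -> list[list[int]]:
    """Sliding-window split: each chunk shares `overlap` tokens with the previous one."""
    stride = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), stride)]

chunks = split_tokens(list(range(7_500)))
print(len(chunks), len(chunks[0]))  # -> 17 512
```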
Recommended: 4 vCPU · 8 GB RAM (4c8g is correct)
The Embedding VM does not perform inference — that runs on the H100 server. This VM is a coordinator: receive chunks from the Orchestrator, call the H100 inference API, write vectors to the vector DB. The workload is almost entirely network I/O.
| Resource | Requirement | Rationale |
|---|---|---|
| vCPU | 4 | 2 Embedding Celery workers + FastAPI API + redis-embed management |
| RAM | 8 GB | 2 workers × ~300 MB each; redis-embed: up to 1 GB (maxmemory 1gb); FastAPI: ~200 MB; Alloy: 100 MB |
| Storage | 40 GB | OS, logs |
| Network | 1 Gbps | LAN calls to H100 server (low latency, moderate bandwidth); calls to vector DB |
Worker concurrency recommendation: Start with 2 workers (--concurrency=2). Each worker handles one document batch concurrently — sending 17 chunks to H100 and writing 17 vectors to the DB. Increasing to 4 workers is safe if H100 has spare capacity.
redis-embed memory: The embed queue holds task payloads, which include the full chunk text. At 2,048 chars per chunk × 17 chunks per document × 100 queued documents = ~3.5 MB. Even with 1,000 queued documents the queue is only 35 MB. The 1 GB maxmemory on redis-embed is very conservative — it is there as a hard ceiling, not an expected usage.
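The queue-memory estimate as arithmetic (serialisation overhead beyond the raw chunk text is ignored):

```python
CHARS_PER_CHUNK = 2_048
CHUNKS_PER_DOC = 17

def queue_bytes(queued_docs: int) -> int:
    """Approximate redis-embed payload bytes: raw chunk text only."""
    return queued_docs * CHUNKS_PER_DOC * CHARS_PER_CHUNK

print(queue_bytes(100) / 1e6, "MB")    # ~3.5 MB for 100 queued documents
print(queue_bytes(1_000) / 1e6, "MB")  # ~35 MB for 1,000
```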
Scale trigger: embed_batch_size or embedding throughput shows H100 is consistently at >80% utilisation → add more Embedding workers to increase H100 utilisation. The VM RAM and CPU are not the constraint here — the H100's throughput ceiling is.
| VM | vCPU | RAM | Storage | Key constraint | Scale by |
|---|---|---|---|---|---|
| Gateway | 2 | 4 GB | 20 GB | Network bandwidth | Rarely needed |
| Orchestrator | 4 | 16 GB | 100 GB SSD | Postgres buffer pool | More RAM first, then read replicas |
| Scraper | 4 | 8 GB | 40 GB | Network I/O, target site rate limits | More Celery workers first |
| Chunking | 4 | 8 GB | 20 GB | CPU (tokenisation) | More vCPU or more Uvicorn workers |
| Embedding | 4 | 8 GB | 40 GB | H100 throughput | More Celery workers (H100 has capacity) |
These tiers represent different usage profiles for MLWorkbench. Each tier is characterised by concurrent active users (users with jobs currently processing), plus read traffic (users searching their corpus).
Profile: Small team, 10–20 users. MLWorkbench used for ad-hoc corpus building and occasional search.
| Metric | Value |
|---|---|
| Concurrent ingestion jobs | 4–8 |
| Documents ingested / hour | 600–1,200 |
| Chunks produced / hour | 10,200–20,400 |
| Characters ingested / hour | 20.9M–41.8M |
| Tokens embedded / hour | 5.2M–10.5M |
| Concurrent search users | 5–20 |
| Search requests / hour | 100–500 |
| Scraper workers | 4 |
| Embedding workers | 2 |
| VM spec (all services) | 4c8g except Orchestrator (4c16g) |
H100 utilisation at this tier: ~2–5% of throughput capacity. Effectively idle.
Expected API response times (P95):
- POST /jobs (submit): <50ms
- GET /jobs/{id} (poll): <20ms
- POST /search: <100ms

Profile: Organisation-wide rollout, 50–100 users. Regular batch imports, active search usage.
| Metric | Value |
|---|---|
| Concurrent ingestion jobs | 20–40 |
| Documents ingested / hour | 2,400–4,800 |
| Chunks produced / hour | 40,800–81,600 |
| Characters ingested / hour | 83.5M–167M |
| Tokens embedded / hour | 20.9M–41.8M |
| Concurrent search users | 20–80 |
| Search requests / hour | 1,000–5,000 |
| Scraper workers | 8 |
| Embedding workers | 4 |
| Changes from Tier 1 | Scraper: increase workers to 8; Embedding: increase workers to 4 |
H100 utilisation at this tier: ~10–20%. The inference server has substantial spare capacity for other workloads.
Postgres load: ~4,800 jobs/hour = 80 jobs/minute. At ~4 UPDATE operations per job lifecycle (pending→scraping→chunking→embedding→done) = 320 writes/minute to the jobs table. Trivially handled by Postgres on a 4c16g VM.
Profile: Enterprise scale, 200+ active users. Bulk imports, heavy search traffic.
| Metric | Value |
|---|---|
| Concurrent ingestion jobs | 80–160 |
| Documents ingested / hour | 9,600–19,200 |
| Chunks produced / hour | 163,200–326,400 |
| Characters ingested / hour | 334M–669M |
| Tokens embedded / hour | 83.5M–167M |
| Concurrent search users | 100–300 |
| Search requests / hour | 10,000–50,000 |
| Scraper workers | 16 |
| Embedding workers | 8 |
| Changes from Tier 2 | Consider second Scraper VM; Orchestrator upgrade to 8c32g |
H100 utilisation at this tier:
- bge-large-en-v1.5: ~80,000 tokens/second

Postgres load at Tier 3:

- GET /jobs/ list queries become slow under concurrent search load

The real constraint at Tier 3: target site scraping rate limits. At 16 workers, you are aggressively crawling target sites. Many sites will start returning 429s or 403s. Per-domain rate limiting in the Scraper becomes increasingly important.
The calculations above assume a 5,000-word average document. Actual usage may differ:
| Document type | Avg words | Avg chunks | Impact on throughput |
|---|---|---|---|
| Short blog post | 800 | 3 | Higher docs/hour; lower chunks/doc |
| Standard article | 5,000 | 17 | Baseline (this document's assumption) |
| Long report / whitepaper | 20,000 | 64 | Lower docs/hour (scraping takes longer for large pages); more chunks = more H100 work |
| Full book / large PDF | 80,000 | 244 | Chunking may take 1–2s; scraping may timeout; consider page-splitting pre-scrape |
For large documents (>20,000 words), the chunking step becomes a meaningful contributor to pipeline latency. The chunking_duration_seconds histogram is your signal — if P99 exceeds 1s regularly, increase Chunking VM vCPU.
The current K8s cluster runs the LGTM+P stack. This is a sensible starting point. Below is an assessment of fit.
| Component | Typical RAM | Typical CPU (idle) | Storage |
|---|---|---|---|
| Mimir (metrics) | 2–4 GB | 0.2–0.5 cores | Depends on retention + series count |
| Loki (logs) | 1–2 GB | 0.1–0.3 cores | High — log volume scales with requests |
| Tempo (traces) | 1–2 GB | 0.1–0.3 cores | Moderate — 15% sampling helps significantly |
| Pyroscope (profiles) | 512 MB–1 GB | 0.1 cores | Low |
| Grafana | 512 MB–1 GB | 0.1–0.2 cores | Minimal (dashboards only) |
| K8s overhead | ~1–2 GB per node | ~0.2 cores per node | — |
Total estimated usage at Tier 1: ~8–12 GB RAM, 1–2 cores. Fits comfortably in 32 GB / 16 cores.
The constraint at higher tiers is storage, not compute. Loki log retention and Mimir metric series count grow with request volume. The cluster nodes should have at least 100 GB SSD per node, with persistent volume claims for Loki and Mimir. At Tier 3, consider:
At Tier 1–2, the existing 4-node cluster is sufficient.
At each tier, one component limits overall throughput. This map shows what limits you first and how to address it without over-provisioning everything upfront.
Documents per hour →
100 500 1,000 5,000 10,000 50,000+
│ │ │ │ │ │
└── Any VM ─────┘ │ │ │
│ │ │ │
└── Scraper ───┘ │ │
workers │ │
(add workers) │ │
│ │
└─ Scraper VM RAM ─────┘
(scale VM or add │
second scraper VM) │
│
└─ H100 utilisation
+ Vector DB writes
(heavier model →
lighter model or
second GPU)
Step-by-step actions ordered from cheapest/fastest to most expensive, per bottleneck signal.
Signal: queue depth rising on redis-scrape while workers are healthy.

Action 1 (free, 5 minutes): Increase Celery worker concurrency on the Scraper VM.
# Edit docker-compose.yml on scraper VM
command: celery -A app.worker.celery_app worker --concurrency=8 -Q scrape_jobs
docker compose up -d scraper-worker
Action 2 (free, 10 minutes): If RAM is becoming tight with more workers, increase Scraper VM redis-scrape maxmemory limit and watch process_resident_memory_bytes.
Action 3 (VM cost): If 8+ workers still insufficient and you've verified target sites aren't rate-limiting you, add a second Scraper VM with its own redis-scrape. Update nginx upstream block to round-robin between them.
Signal: chunking latency rising, or Orchestrator _start_chunking calls queuing.

Action 1 (free, 5 minutes): Increase Uvicorn workers on Chunking VM.
command: uvicorn app.main:app --workers 8 --host 0.0.0.0 --port 8000
Action 2 (VM resize): If CPU is saturated (>80% sustained), resize Chunking VM from 4c to 8c. Chunking is the most CPU-bound service and directly benefits from more cores.
Signal: asyncpg_pool_acquire_duration_seconds rising.

Action 1 (config, 10 minutes): Increase max_connections and connection pool size.
Action 2 (free, varies): Run VACUUM ANALYZE jobs manually if dead tuple percentage is high.
Action 3 (VM resize): Upgrade Orchestrator from 4c16g to 8c32g. Postgres scales well with both more RAM (larger buffer pool) and more CPUs (parallel query, background workers).
Action 4 (architecture): Add a Postgres read replica. Route GET /jobs/ list queries and stuck-job scans to the replica, preserving primary capacity for writes.
Action 1 (free): Verify batching is working correctly — all 17 chunks per document should be sent in a single API call, not 17 individual calls.
Action 2 (model swap): Switch to a smaller, faster embedding model (e.g. all-MiniLM-L6-v2 at ~390 chunks/second vs bge-large-en-v1.5 at ~156 chunks/second). Re-embedding the corpus is required.
Action 3 (hardware): Add a second GPU to the inference server, or deploy a second inference server and load-balance between them from the Embedding VM.
Signal: corpus_read_duration_seconds{operation="read"} P99 > 500ms.

Action 1: Check vector DB query plan — ANN search on large corpora without proper indexing degrades to exact search.
Action 2 (free): Add result caching at the Orchestrator or Embedding level for identical queries.
Action 3: Scale Embedding VM horizontally (add a second VM for the read API only) — put a load balancer in front of both.
Signal: RedisBrokerHighMemory or RedisBrokerCriticalMemory alert fires.

Redis is memory-bound. The noeviction policy means Redis refuses new writes when full rather than silently discarding tasks — but new job submissions will fail with HTTP 500 until pressure is relieved.
Action 1 (immediate, no restart): Check current memory state and queue depth.
redis-cli INFO memory | grep -E 'used_memory_human|maxmemory_human'
redis-cli LLEN scrape_jobs   # queue depth ('celery' if the default queue name is used)
redis-cli HLEN unacked       # in-flight tasks (Celery's Redis transport keeps unacked messages in a hash)
Action 2 (temporary relief, no restart): Increase maxmemory at runtime.
redis-cli CONFIG SET maxmemory 6gb
Action 3 (permanent fix): Update redis.conf and apply via docker-compose.
# In the redis service command in docker-compose.yml, change:
# redis-server --maxmemory 512mb → redis-server --maxmemory 2gb
docker compose up -d redis-orch # or redis-scrape / redis-embed
Action 4 (sustained high load): Resize the VM to a RAM tier that gives the Redis instance at least 2× the expected peak queue depth in bytes.
Signal: WorkerOOMKilled alert fires (container_oom_events_total rising).

A Celery worker was killed by the Linux OOM killer — likely caused by an unusually large document.
Action 1 (diagnose): Confirm OOM kill and identify the document size.
dmesg | grep -i "oom\|killed" # confirm kernel OOM kill
# Then in Grafana: check scrape_response_size_bytes P99 for the time window
Action 2 (free, immediate): Add a document size gate in the Scraper — reject or truncate pages above a threshold before they enter the pipeline.
# scraper/core/logic.py
MAX_CONTENT_BYTES = 5 * 1024 * 1024 # 5 MB
if len(content) > MAX_CONTENT_BYTES:
raise ValueError(f"Document too large: {len(content)} bytes")
Action 3 (config): Increase mem_limit in docker-compose for the affected worker. Current defaults: scraper-worker: 2g, embedding-worker: 3g. Increase by 1–2 GB and monitor.
Action 4 (worker hygiene): Ensure --max-tasks-per-child=200 is set on both workers. This recycles worker processes periodically, preventing heap fragmentation from accumulating over time.
Signal: embedding circuit breaker open (embedding_circuit_breaker_state == 1).

The embedding inference server is returning errors. All embedding jobs are failing immediately.
Action 1 (diagnose): Check if the H100 server is up and whether GPU OOM is the cause.
# On the H100 host
nvidia-smi # check VRAM usage and GPU processes
journalctl -u inference-server -n 100 # or equivalent log for your inference server
Action 2 (recover GPU OOM): If VRAM is exhausted, restart the inference process.
# Kill and restart the inference server process
systemctl restart inference-server # or docker compose restart on H100 host
Action 3 (prevent recurrence): Reduce batch size sent to the inference server. A 244-chunk batch from an 80,000-word document is the worst case — add a batch size cap in the Embedding worker.
# embedding/core/logic.py
MAX_EMBED_BATCH = 64  # split large batches into multiple requests
embeddings = []
for i in range(0, len(chunks), MAX_EMBED_BATCH):
    batch = chunks[i:i + MAX_EMBED_BATCH]
    resp = await client.embeddings.create(input=batch, model=settings.EMBEDDING_MODEL)
    embeddings.extend(item.embedding for item in resp.data)  # keep results, in order