Organisation: CloudCIX
Version: March 2026 — v1.0
Related documents:
| Pillar | Tool | Answers | Critical signals |
|---|---|---|---|
| Metrics | Mimir (prod) · Prometheus (dev) | Is the system healthy right now? | Queue depth, token burn, P99 by operation |
| Logs | Loki | What happened for a specific job? | Every line must carry trace_id + job_id |
| Traces | Tempo | Where did latency go? | Full waterfall: Orchestrator → Scraper → Chunking → Embedding |
| Profiles | Pyroscope | Why is CPU or memory elevated? | Chunking CPU on large docs; Embedding worker memory during batches |
Do you need standalone Prometheus if you use Alloy?
In production — no. Alloy scrapes /metrics endpoints and remote-writes directly to Mimir. No Prometheus process is needed.
In the local dev stack — yes, as the simplest local storage backend for Alloy to write to. There is no benefit to running Mimir locally at dev scale. The local Prometheus is correct and intentional.
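For the dev stack this means the local Prometheus must accept remote_write pushes from Alloy. A sketch of the Compose service, assuming a recent Prometheus image (the remote-write receiver flag has existed since v2.33; the image tag and config path are assumptions):

```yaml
prometheus:
  image: prom/prometheus:v2.53.0
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --web.enable-remote-write-receiver   # lets Alloy push to /api/v1/write
  ports:
    - "9090:9090"
```

Without the receiver flag, Alloy's remote_write requests to `http://prometheus:9090/api/v1/write` are rejected.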
All services share a common attribute vocabulary. This file is the source of truth — any attribute not listed here is not an approved attribute.
```yaml
# docs/otel-attributes.yaml
resource_attributes:
  service.name: orchestrator | scraper | chunking | embedding-api
  service.version: $GIT_SHA
  deployment.environment: production | staging | dev

span_and_log_attributes:
  job.id: string           # required on all pipeline spans
  job.url: string          # required on all pipeline spans
  job.stage: string        # values: scrape | chunk | embed | read
  queue.name: string       # set by Celery tasks
  operation: string        # values: read | write (required on all embedding metrics)
  request.mode: string     # values: sync | async
  corpus.name: string      # corpus_name — not an ID, no tenant scope
  embedding.model: string
  embedding.tokens: int
  search.top_k: int
  caller.type: string      # values: user | service | unknown
  http.request_id: string  # from gateway X-Request-Id header
  failed.stage: string     # on failure spans — values: scraping | chunking | embedding
```
Naming convention note:
failed.stage uses OTel dot notation (spans and log records). The corresponding Postgres column and JSON API field use an underscore: failed_stage. These are the same concept in different namespaces — OTel attributes conventionally use dots; Python/SQL/JSON conventionally use underscores. Both refer to the pipeline stage where a job failed.
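The mapping between the two namespaces is mechanical, which keeps them consistent. An illustrative one-liner (the function name is ours, not part of the codebase):

```python
def otel_to_flat(attr: str) -> str:
    """Map an OTel dot-notation attribute to its SQL/JSON underscore form."""
    return attr.replace(".", "_")

print(otel_to_flat("failed.stage"))  # failed_stage
print(otel_to_flat("job.id"))        # job_id
```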
Cross-service propagation is automatic:
- HTTPXClientInstrumentor injects traceparent on all outgoing HTTP calls
- FastAPIInstrumentor extracts it from incoming requests
- CeleryInstrumentor carries context across the queue boundary internal to each service
- Stage callbacks carry trace context automatically — the callback POST /jobs/{id}/complete is an instrumented HTTP call, so the Orchestrator's state transition appears as a child span of the worker span

The entire pipeline shows as a single trace in Tempo.
```python
# In every Celery task — add business context to the auto-created span
from opentelemetry import trace

# `app` is the service's Celery instance

@app.task(bind=True)
def scrape_task(self, job_id: str, url: str):
    span = trace.get_current_span()
    span.set_attribute("job.id", job_id)
    span.set_attribute("job.url", url)
    span.set_attribute("job.stage", "scrape")
    span.set_attribute("request.mode", "async")
    span.set_attribute("queue.name", self.request.delivery_info["routing_key"])
```
The application always exports to localhost:4317 (traces) and localhost:4040 (profiles). The routing to the remote LGTM+P cluster is entirely Alloy's concern — the application never has remote LGTM credentials.
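In practice this can live in each service's env file. A sketch using the standard OTel SDK variable names (Pyroscope's client address is configured separately, in code):

```ini
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_SERVICE_NAME=orchestrator
OTEL_RESOURCE_ATTRIBUTES=service.version=${GIT_SHA},deployment.environment=production
```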
| Metric | Type | Labels | Why |
|---|---|---|---|
| jobs_submitted_total | counter | — | Entry rate |
| jobs_completed_total | counter | status (done/failed) | Gap between submitted and completed = data loss |
| jobs_failed_total | counter | failed_stage | Primary signal for failure analysis — split by stage for MLWorkbench and Grafana |
| job_stage_duration_seconds | histogram | stage | Per-stage latency — reveals the bottleneck |
| job_end_to_end_duration_seconds | histogram | — | P99 pipeline latency — SLO anchor |
| queue_depth | gauge | queue, stage | Most critical operational metric |
| stage_callbacks_total | counter | stage, status | Callback volume and failure rate |
| dlq_depth | gauge | queue | Non-zero = always alert |
| rate_limit_hits_total | counter | endpoint | Spikes = capacity issue or abuse |
| Metric | Type | Labels | Why |
|---|---|---|---|
| scrape_requests_total | counter | status, domain, mode | Domain-level success rate; mode separates sync from async |
| scrape_duration_seconds | histogram | mode | P95/P99 by mode |
| scrape_response_size_bytes | histogram | — | Catch unexpectedly large pages |
| scrape_retry_total | counter | domain | Distinguish transient vs structural failures |
| Metric | Type | Labels | Why |
|---|---|---|---|
| chunks_produced_total | counter | strategy | Throughput — determines embed queue load |
| chunk_size_tokens | histogram | strategy | Alert on P95 > max_tokens |
| chunking_duration_seconds | histogram | mode | P99 trend > 2s = add a queue |
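The P95 > max_tokens condition can be expressed as an alert rule expression. A sketch, where 512 stands in for the configured max_tokens value:

```promql
histogram_quantile(0.95,
  sum by (le, strategy) (rate(chunk_size_tokens_bucket[5m]))
) > 512
```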
| Metric | Type | Labels | Why |
|---|---|---|---|
| embedding_inference_requests_total | counter | model, status | Error breakdown — includes H100 server errors |
| embedding_tokens_processed_total | counter | model | Volume tracking for capacity planning (no cost implication with self-hosted H100) |
| embedding_inference_duration_seconds | histogram | model | H100 inference latency in isolation |
| embedding_inference_errors_total | counter | model, error_type | Server unavailable, connection error, token limit exceeded |
| embeddings_stored_total | counter | — | Confirms vectors actually persisted |
| embed_batch_size | histogram | — | Undersized batches waste round-trips |
| embedding_request_duration_seconds | histogram | operation="write" | Must carry the operation label |
| Metric | Type | Labels | Why |
|---|---|---|---|
| corpus_read_requests_total | counter | status | Read error rate separate from write |
| embedding_request_duration_seconds | histogram | operation="read" | User-facing latency — P99 SLO |
| vector_db_search_duration_seconds | histogram | — | Isolates DB query time |
| search_results_returned | histogram | — | Large result sets slow serialisation |
The operation label is mandatory on any metric that touches both read and write paths. Without it, a slow embed batch corrupts the read P99 and the wrong SLO alert fires.
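Concretely, the two P99s are always computed from separate series. A sketch of the query pair:

```promql
# Read SLO — never mixed with writes
histogram_quantile(0.99,
  sum by (le) (rate(embedding_request_duration_seconds_bucket{operation="read"}[5m])))

# Write path — tracked separately
histogram_quantile(0.99,
  sum by (le) (rate(embedding_request_duration_seconds_bucket{operation="write"}[5m])))
```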
| Signal | Type | Alert |
|---|---|---|
| http_requests_total{method,path,status_code} | counter | 5xx rate above baseline |
| http_request_duration_seconds{path} | histogram | P99 > SLO |
| http_requests_in_flight | gauge | Spike + rising latency = event loop saturation |
Path cardinality warning: always use route templates as labels (/jobs/{job_id}, not /jobs/abc-123). High-cardinality raw URLs will bloat Mimir. Both FastAPIInstrumentor and starlette-prometheus handle this correctly by default — verify it in local Prometheus before deploying.
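To see why templating matters, here is a crude illustration of the idea: collapsing ID-like segments into one placeholder. The instrumentors use the actual route template, which is more reliable; this function is purely a sketch.

```python
import re

def normalize_path(path: str) -> str:
    # Collapse any path segment containing a digit into a template placeholder.
    # Illustration only — real instrumentors label with the route template itself.
    return re.sub(r"/[^/]*\d[^/]*", "/{id}", path)

print(normalize_path("/jobs/abc-123"))    # /jobs/{id}
print(normalize_path("/corpora/search"))  # /corpora/search
```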
| Signal | What it tells you | Alert |
|---|---|---|
| process_resident_memory_bytes | RSS memory | Unbounded growth over time |
| process_open_fds | Open file descriptors | Approaching OS limit |
| python_gc_collections_total{generation="2"} | GC pressure | Frequent gen2 collections = excessive object allocation |
FastAPI and Celery must run as separate processes. Co-location causes unpredictable memory growth and prevents independent scaling. Each has its own CMD in the Dockerfile and its own process in Docker Compose.
| Signal | Type | Alert |
|---|---|---|
| celery_tasks_total{task,state} | counter | failed rising; retried without failed = swallowed errors |
| celery_task_duration_seconds{task} | histogram | P99 > visibility timeout = dangerous requeue loop |
| celery_workers_online | gauge | Drop to 0 |
| celery_queue_length{queue} | gauge | Unstarted tasks — primary backpressure signal |
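The "P99 > visibility timeout" condition could be written as follows, where 3600 s assumes a one-hour visibility timeout — substitute the broker's actual setting:

```promql
histogram_quantile(0.99,
  sum by (le, task) (rate(celery_task_duration_seconds_bucket[10m]))
) > 3600
```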
Celery task lifecycle events to log via signals:
```python
import structlog  # assumed: keyword-style logger calls below match structlog's API
from celery.signals import task_prerun, task_postrun, task_failure, task_retry

logger = structlog.get_logger()

@task_prerun.connect
def on_start(task_id, task, kwargs, **_):
    logger.info("task_started", task=task.name, task_id=task_id,
                job_id=kwargs.get("job_id"))

@task_failure.connect
def on_failure(task_id, exception, **_):
    logger.error("task_failed", task_id=task_id,
                 error=str(exception), exc_info=True)

@task_retry.connect
def on_retry(request, reason, **_):
    logger.warning("task_retrying", task_id=request.id,
                   retries=request.retries, reason=str(reason))
```
| Signal | Alert threshold |
|---|---|
| redis_evicted_keys_total | > 0 on any broker Redis — evictions silently corrupt the queue |
| redis_memory_used_bytes | > 80% of maxmemory |
| redis_connected_clients | Approaching maxclients (default 10,000) |
| Queue depth: redis_list_length{key="scrape_jobs"} | Service-specific threshold |
| Unacked: redis_list_length{key="*.unacked"} | Rising without task_started = worker freeze |
Auth cache hit rate:
```python
from prometheus_client import Counter

auth_cache_hits = Counter("auth_cache_hits_total", "Auth cache hits", ["service"])
auth_cache_misses = Counter("auth_cache_misses_total", "Auth cache misses", ["service"])
```
| Signal | Alert |
|---|---|
| pg_stat_activity_count{state="idle in transaction"} | > 0 for > 30s — transaction leak |
| pg_stat_activity_count{state="waiting"} | > 5 — lock contention |
| Dead tuple % on jobs table | > 20% — autovacuum not keeping up with UPDATE-heavy workload |
| asyncpg_pool_acquire_duration_seconds P99 | > 100ms — connection pool exhausted |
The jobs table is update-heavy (status transitions on every stage). Each UPDATE leaves a dead tuple. Monitor autovacuum activity.
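One way to watch the dead-tuple ratio directly is a query against the standard pg_stat_user_tables view — a sketch, suitable for a cron check or a postgres_exporter custom query:

```sql
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / greatest(n_live_tup + n_dead_tup, 1), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'jobs';
```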
```python
# Application-side query duration
from prometheus_client import Histogram

db_query_duration = Histogram(
    "db_query_duration_seconds", "DB query duration",
    ["query_name"],  # "job_insert", "status_update", "job_lookup", "stuck_jobs_scan"
)
```
Context: CloudCIX uses a self-hosted H100 GPU running an OpenAI-compatible inference API. There are no external rate limits or token costs. Monitoring focuses on server availability, circuit breaker state, and volume tracking for capacity planning.
| Signal | Alert |
|---|---|
| embedding_inference_errors_total | Any server-side errors (connection refused, 503) |
| embedding_tokens_processed_total | increase([1h]) > capacity_threshold — plan H100 upgrade |
| embedding_inference_retry_delay_seconds P99 | Rising = workers idle waiting on backoff — H100 may be under load |
| embed_batch_size | Undersized batches waste LAN round-trips |
| embedding_circuit_breaker_state | == 1 (OPEN) → critical alert — all embedding jobs failing |
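A minimal sketch of the gauge semantics behind embedding_circuit_breaker_state (illustrative only: the class name and threshold are assumptions, not the production implementation):

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures.
    `state` maps to the embedding_circuit_breaker_state gauge: 0 = CLOSED, 1 = OPEN."""

    def __init__(self, threshold: int = 5):
        self.threshold = threshold
        self.failures = 0

    @property
    def state(self) -> int:
        return 1 if self.failures >= self.threshold else 0

    def record_success(self):
        self.failures = 0  # any success fully closes the breaker

    def record_failure(self):
        self.failures += 1

cb = CircuitBreaker(threshold=3)
for _ in range(3):
    cb.record_failure()
print(cb.state)  # 1 — OPEN: critical alert fires
```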
Structured volume log per call (no cost field — self-hosted):
```json
{
  "event": "inference_embed_complete",
  "model": "bge-large-en-v1.5",
  "prompt_tokens": 1842,
  "batch_size": 12,
  "inference_latency_ms": 18,
  "job_id": "job-abc-123",
  "trace_id": "..."
}
```
| Signal | Source | Alert |
|---|---|---|
| nginx_connections_active | nginx-prometheus-exporter (stub_status) | > 80% of worker_connections × worker_processes |
| nginx_rate_limit_rejections_total | Alloy log parse | Rate > 10/s |
| nginx_upstream_response_time_seconds | Alloy log parse | P99 > 2s |
| nginx_http_requests_by_status_total{status="502"} | Alloy log parse | Any — upstream unreachable |
Distinguishing nginx 429 from application 429:
```logql
{job="nginx"} | json | limit_req_status = "REJECTED"                 # gateway flood
{job="nginx"} | json | status = "429" | limit_req_status = "PASSED"  # app quota
```
| Layer | Tool | Key signals | Alert on |
|---|---|---|---|
| Gateway | exporter + Alloy log parse | Rejection rate, upstream latency, 502 | Rejection spike, 502, P99 > SLO |
| HTTP | FastAPIInstrumentor | Rate, P99, in-flight | 5xx rate, P99 > SLO |
| FastAPI runtime | prometheus_client | RSS, FDs, GC | Unbounded growth, FD limit |
| Celery | celery-exporter | Task states, duration, queue depth, workers | Workers = 0, failure rate, P99 > timeout |
| Redis | redis_exporter | Memory, evictions, queue depth, unacked | Any eviction on broker |
| Postgres | postgres_exporter | Connections, locks, dead tuples | Idle-in-transaction, lock waits |
| H100 inference | Manual | Token volume, inference errors, circuit breaker state | Circuit breaker open, error rate spike |
Every service emits JSON-structured logs. trace_id and span_id are injected automatically by LoggingInstrumentor.
```json
{
  "timestamp": "2026-03-14T10:22:01.341Z",
  "level": "INFO",
  "service": "scraper",
  "trace_id": "3f2a1b9cde...",
  "span_id": "8e4d2a...",
  "job_id": "job-abc-123",
  "operation": "write",
  "request_mode": "async",
  "event": "scrape_completed",
  "url": "https://example.com/article",
  "duration_ms": 412,
  "status_code": 200
}
```
Failure log entry:
```json
{
  "level": "ERROR",
  "service": "scraper",
  "trace_id": "3f2a1b9cde...",
  "job_id": "job-abc-123",
  "event": "scrape_failed",
  "failed_stage": "scraping",
  "error_code": "http_403",
  "url": "https://example.com/article",
  "retries": 3
}
```
Never log full scraped HTML content. Cap all string field lengths to 512 characters. One verbose log stream saturates Loki ingestion and displaces logs from other services.
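The 512-character cap can be enforced centrally rather than per call site, e.g. as a processor applied to every event dict before serialisation (a sketch; the structlog-style processor signature is an assumption):

```python
MAX_FIELD_LEN = 512

def truncate_long_fields(logger, method_name, event_dict):
    # Cap every string field so one verbose event cannot flood Loki ingestion.
    for key, value in event_dict.items():
        if isinstance(value, str) and len(value) > MAX_FIELD_LEN:
            event_dict[key] = value[:MAX_FIELD_LEN] + "...[truncated]"
    return event_dict

event = {"event": "scrape_completed", "body": "x" * 10_000}
out = truncate_long_fields(None, "info", event)
print(len(out["body"]))  # 526 — 512 chars plus the 14-char truncation marker
```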
```alloy
// config.alloy — production (env var values differ per VM)

otelcol.receiver.otlp "app" {
  grpc { endpoint = "0.0.0.0:4317" }

  // otelcol receivers do not forward implicitly — wire receiver to sampler
  output {
    traces = [otelcol.processor.tail_sampling.default.input]
  }
}

otelcol.processor.tail_sampling "default" {
  decision_wait = "10s"

  policy {
    name = "errors"
    type = "status_code"
    status_code { status_codes = ["ERROR"] }
  }

  policy {
    name = "sample"
    type = "probabilistic"
    probabilistic { sampling_percentage = 15 }
  }

  output {
    traces = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.otlphttp "tempo" {
  client { endpoint = env("TEMPO_REMOTE_ENDPOINT") }
}

prometheus.scrape "app_metrics" {
  targets    = [{ "__address__" = "app:8000" }]
  forward_to = [prometheus.remote_write.mimir.receiver]
}

prometheus.remote_write "mimir" {
  endpoint { url = env("MIMIR_REMOTE_ENDPOINT") }
}

loki.source.docker "containers" {
  host       = "unix:///var/run/docker.sock"
  forward_to = [loki.process.parse.receiver]
}

loki.process "parse" {
  stage.json {
    expressions = {
      level    = "level",
      service  = "service",
      trace_id = "trace_id",
      job_id   = "job_id",
    }
  }

  stage.labels {
    values = { level = null, service = null }
  }

  forward_to = [loki.write.remote.receiver]
}

loki.write "remote" {
  endpoint { url = env("LOKI_REMOTE_ENDPOINT") }
}

pyroscope.receive_http "profiles" {
  http { listen_address = "0.0.0.0:4040" }
  forward_to = [pyroscope.write.remote.receiver]
}

pyroscope.write "remote" {
  endpoint { url = env("PYROSCOPE_REMOTE_ENDPOINT") }
}
```
```alloy
// Appended to the gateway VM's config.alloy

loki.source.file "nginx_logs" {
  targets    = [{ "__path__" = "/var/log/nginx/access.log" }]
  forward_to = [loki.process.nginx_parse.receiver]
}

loki.process "nginx_parse" {
  stage.json {
    expressions = {
      status                 = "status",
      request_id             = "request_id",
      limit_req_status       = "limit_req_status",
      upstream_response_time = "upstream_response_time",
      upstream_addr          = "upstream_addr",
    }
  }

  stage.labels {
    values = { status = null, limit_req_status = null, upstream_addr = null }
  }

  stage.metrics {
    metric.counter {
      name        = "nginx_rate_limit_rejections_total"
      description = "Requests rejected by nginx rate limiter"
      source      = "limit_req_status"
      value       = "REJECTED"
      action      = "inc"
    }
    metric.histogram {
      name        = "nginx_upstream_response_time_seconds"
      description = "Time waiting for upstream FastAPI response"
      source      = "upstream_response_time"
      buckets     = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0]
    }
    metric.counter {
      name        = "nginx_http_requests_by_status_total"
      description = "nginx requests by HTTP status code"
      source      = "status"
      match_all   = true
      action      = "inc"
    }
  }

  forward_to = [loki.write.remote.receiver]
}
```
```alloy
// alloy-local.alloy — 100% sampling, local endpoints, no auth

otelcol.receiver.otlp "default" {
  grpc { endpoint = "0.0.0.0:4317" }

  // no sampler locally — wire the receiver straight to the exporter
  output {
    traces = [otelcol.exporter.otlphttp.tempo.input]
  }
}

otelcol.exporter.otlphttp "tempo" {
  client { endpoint = "http://tempo:3200" }
}

prometheus.scrape "app" {
  targets    = [{ "__address__" = "app:8000" }]
  forward_to = [prometheus.remote_write.local.receiver]
}

prometheus.remote_write "local" {
  endpoint { url = "http://prometheus:9090/api/v1/write" }
}

loki.source.docker "all" {
  host       = "unix:///var/run/docker.sock"
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint { url = "http://loki:3100/loki/api/v1/push" }
}

pyroscope.receive_http "profiles" {
  http { listen_address = "0.0.0.0:4040" }
  forward_to = [pyroscope.write.local.receiver]
}

pyroscope.write "local" {
  endpoint { url = "http://pyroscope:4100" }
}
```
| Alert | Condition | Severity | Runbook |
|---|---|---|---|
| Queue stall | queue_depth rising, jobs_completed_total flat 10 min | Critical | Check Celery worker logs, Redis connectivity |
| Service down | up == 0 for 2 min | Critical | Check VM, Docker Compose, OOM in dmesg |
| Gateway upstream unreachable | nginx_http_requests_by_status_total{status="502"} > 0 | Critical | Check upstream VM and Docker Compose |
| Stuck job | Non-terminal status, updated_at < now() - 30min | Warning | Run stuck-job query; check Celery logs for that job_id |
| Gateway rejection spike | rate(nginx_rate_limit_rejections_total[5m]) > 10/s | Warning | Investigate source IP in nginx Loki logs |
| Stage failure rate | rate(jobs_failed_total{failed_stage="scraping"}[5m]) > 1/s | Warning | Check scraper domain health dashboard |
| H100 circuit breaker open | embedding_circuit_breaker_state == 1 for 30s | Critical | H100 inference server returning errors. Run nvidia-smi on the H100 host. Check inference server logs for CUDA OOM. Restart the inference process if confirmed. |
| Inference error rate | rate(embedding_inference_errors_total[5m]) > 0.05/s | Warning | Check H100 server connectivity. Reduce embed worker concurrency if the H100 is under load. |
| Token volume spike | increase(embedding_tokens_processed_total[1h]) > capacity_threshold | Warning | Audit recent jobs for runaway re-embeds or unusually large documents |
| DLQ non-empty | dlq_depth > 0 for 5 min | Warning | Inspect failed task payloads; check for schema changes |
| Read latency breach | histogram_quantile(0.99, embedding_request_duration_seconds{operation="read"}) > 500ms | Warning | Profile the read path in Pyroscope; check the vector DB query plan |
| Redis broker memory high | redis_memory_used_bytes / redis_memory_max_bytes > 0.80 for 2 min | Warning | Queue depth rising — drain the backlog or increase maxmemory in redis.conf. Check redis-cli INFO memory. |
| Redis broker memory critical | redis_memory_used_bytes / redis_memory_max_bytes > 0.95 for 30s | Critical | New task enqueues failing with ENOMEM. Immediate action: scale VM RAM or reduce worker concurrency to drain the queue. |
| Worker OOM killed | increase(container_oom_events_total{name=~"scraper-worker\|embedding-worker"}[5m]) > 0 | Warning | Worker killed by the OOM killer. Check scrape_response_size_bytes for large documents. Consider increasing mem_limit in docker-compose or adding a document-size pre-filter. |
| Postgres memory high | process_resident_memory_bytes{job="postgres"} / node_memory_MemTotal_bytes > 0.75 | Warning | Run VACUUM ANALYZE jobs. Check for long-running queries via pg_stat_activity. Reduce work_mem if sort-heavy queries are the cause. |
| Worker restart loop | increase(container_restarts_total{name=~".*-worker"}[15m]) > 3 | Warning | Container restarting repeatedly — check dmesg for OOM kill events and worker logs for unhandled exceptions. |
Key panels: queue depth per stage (time series), job throughput rate, P50/P95/P99 end-to-end latency, stage-level latency breakdown, jobs_failed by stage (bar chart — sourced from jobs_failed_total{failed_stage} counter, not Postgres).
Key panels: request rate by status code, nginx vs application 429 split (side by side — this immediately shows whether a surge is flood/abuse or quota exhaustion), upstream response time P99 by upstream, active connections vs capacity.
Key panels: failure rate by stage over time, failure breakdown by error_code (from logs), top failing domains (scraper stage), recent failed jobs table linked to Loki.
Key panels: tokens processed per hour by model (bar), cumulative daily volume (stat with capacity threshold line), inference error rate (time series), inference latency P99, batch size distribution. There is no cost panel — the H100 is self-hosted. The volume panel is for capacity planning: when sustained tokens/hour approaches H100 throughput ceiling, it is time to evaluate a lighter model or second GPU.
Key panels: operation="read" P99 vs operation="write" P99 (side by side — never aggregated), read throughput, search result size distribution.
Key panels: DLQ depth per queue, recent failed job_ids with Loki deep-link, time-since-last-success per queue.
Key panels: Pyroscope flame graph per service, embedded alongside the corresponding Tempo span panel. One click from a slow span to its flame graph.
Key panels: error rate filtered to service.version = $GIT_SHA (template variable). Used post-deploy to confirm error pattern is gone before closing an incident.
| Step | Tool | Action | Time |
|---|---|---|---|
| 1 | Grafana overview | Which metric deviated first? Rising queue before errors = upstream cause. Check stage failure breakdown. | < 2 min |
| 2 | Postgres | SELECT * FROM jobs WHERE status='failed' AND updated_at > now() - '1h'::interval ORDER BY updated_at DESC | < 1 min |
| 3 | Tempo | Filter suspect service + status=error. Read span attributes: job.id, failed.stage, error_code. | < 5 min |
| 4 | Loki | Click Logs in the Tempo span — all services for that trace_id in chronological order. | < 5 min |
| 5 | Pyroscope | Click the profile link in the Tempo span — flame graph for that exact execution window. For latency, not errors. | < 5 min |
| 6 | Tempo | Filter to the new service.version SHA — confirm the error pattern is gone after deploy. | < 2 min |
```logql
# All log lines for a specific job across all services
{service_name=~"orchestrator|scraper|chunking|embedding-api"}
  | json
  | job_id = "job-abc-123"
  | line_format "{{.timestamp}} [{{.level}}] {{.service}} — {{.event}}"

# All log lines for a trace_id (includes gateway)
{job=~"nginx|orchestrator|scraper|chunking|embedding-api"}
  | json
  | trace_id = "3f2a1b9cde..."

# All scraping failures in the last hour with domain
{service_name="scraper"}
  | json
  | event = "scrape_failed"
  | line_format "{{.timestamp}} {{.url}} — {{.error_code}} (retries={{.retries}})"
```
```sql
-- Find stuck jobs and their last known stage
SELECT id, caller_id, status, url, corpus_name,
       updated_at, now() - updated_at AS stuck_for
FROM jobs
WHERE status NOT IN ('done', 'failed')
  AND updated_at < now() - interval '30 minutes'
ORDER BY updated_at;

-- Cross-reference with Loki: find the last log line for a stuck job
-- {service_name=~"orchestrator|scraper|chunking|embedding-api"} | json | job_id = "<id>"
```
| Service | Look for in flame graph |
|---|---|
| Chunking | CPU time in text splitting library on large documents — justifies async if dominant |
| Embedding worker | Memory allocation in OpenAI HTTP client during large batches — GC pressure |
| Embedding read API | Vector DB client serialisation under read load — cache candidate |
| Scraper | HTTP connection pool management under high concurrency |
When a WorkerOOMKilled or RedisBrokerCriticalMemory alert fires:
| Step | Action |
|---|---|
| 1 | Check dmesg on the affected VM: dmesg \| grep -i "oom\|killed" — confirms kernel OOM kill |
| 2 | For worker OOM: check scrape_response_size_bytes histogram P99 — unusually large pages are the most common cause |
| 3 | For Redis OOM: run redis-cli INFO memory — compare used_memory_human vs maxmemory_human. Run redis-cli LLEN celery for queue depth. |
| 4 | For H100 GPU OOM: SSH to H100 host, run nvidia-smi. Look for VRAM usage near 80 GB. Restart the inference process if confirmed. |
| 5 | For Postgres OOM: run SELECT pid, query, state, query_start FROM pg_stat_activity WHERE state != 'idle' — identify runaway queries. |
| 6 | Check Pyroscope: navigate to the affected service, filter by the time window of the OOM event. Flame graph shows which function was allocating the most memory. |
| 7 | After remediation: deploy with mem_limit set in docker-compose (see Implementation Guide §6) and confirm the alert clears in Grafana. |