Organisation: CloudCIX
Version: March 2026 — v1.0
This guide tells you exactly which documents to read, in what order, and what to focus on — based on your role. You do not need to read everything.
| Document | What it covers |
| --- | --- |
| System Design | Architecture, data flows, state machine, auth, API contracts |
| Design Concepts | The why behind every design decision — idempotency, circuit breakers, rate limiting patterns |
| Technology Choices | Why Redis over RabbitMQ, why Postgres over MongoDB, why FastAPI — with full alternatives |
| Infrastructure | VM specs, load model, throughput analysis, scaling playbook, OOM runbooks |
| Observability | Metrics, alerts, dashboards, structured logging, debugging workflows |
| Implementation Guide | Code patterns, Docker Compose, nginx config, environment variables, PR checklist |
| QA & Test Strategy | Test categories, acceptance criteria, contract tests, chaos scenarios |
| Project Charter | Goals, milestones, risks, success metrics, stakeholder summary |
| Vibe Coding & Setup Guide | Two-developer AI-assisted build guide — work split, build order, AI prompt templates, go-live checklist |
## Developers

Goal: Understand the architecture well enough to implement features, fix bugs, and extend the system correctly.
- Vibe Coding & Setup Guide §1 & §2 — Read the "three things your AI will always get wrong" and complete the local environment setup. Get your stack running before reading anything else.
- System Design §1 — Service responsibilities and data flow diagrams. Understand the Orchestrator-callback pattern before anything else.
- Design Concepts — Category A (A1–A5) — Auth model, CallerContext, circuit breaker. Every endpoint uses these.
- Design Concepts — Category C (C1–C4) — Idempotency, visibility timeout, exponential backoff. Every worker task uses these.
- Implementation Guide §3 — FastAPI service structure. This is the canonical file layout to follow.
- Implementation Guide §5 — Celery configuration. The default config is wrong for production — this explains what to set and why.
- Design Concepts — D3 — Dual transport mode (shared core pattern). New logic must go in `core/logic.py`, not in route handlers.
- Implementation Guide §10 — PR checklist. Run through this before every PR.
- Observability §2 — Semantic attribute convention. New spans must use these attribute names.
- Design Concepts — Category B (B1–B5) — Rate limiting patterns and strategy composition.
- System Design §3 — Current auth model and CallerContext.
- Vibe Coding & Setup Guide §5 — Copy-paste AI prompt templates for auth middleware, Celery tasks, the state machine callback handler, and tests. Use these whenever you're asking an AI to generate code for this codebase.
- Vibe Coding & Setup Guide §8 — The custom instructions block to paste into your AI assistant's system prompt so it knows the architectural rules upfront.
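The D3 shared-core rule above is worth seeing once as code. The sketch below is illustrative only: `process_document`, `submit_route_handler`, and `submit_task` are hypothetical stand-ins, not this codebase's real functions. In the actual system the core function would live in `core/logic.py`, with a FastAPI route handler and a Celery task as the two thin transports.

```python
# Sketch of the D3 shared-core rule: business logic lives in one place and
# both transports delegate to it. Names here are hypothetical.

def process_document(url: str, caller_id: str) -> dict:
    """Core logic: transport-agnostic, so it can be unit-tested directly."""
    # ... real validation / enqueue logic would live here ...
    return {"url": url, "caller": caller_id, "status": "queued"}

# HTTP transport: a thin handler that only unpacks the request.
def submit_route_handler(request: dict) -> dict:
    return process_document(request["url"], request["caller_id"])

# Queue transport: a thin task wrapper around the same core function.
def submit_task(payload: dict) -> dict:
    return process_document(payload["url"], payload["caller_id"])

# Both transports produce identical results because neither contains logic.
print(submit_route_handler({"url": "https://example.com/doc", "caller_id": "abc"}))
```

Because both wrappers are trivial, tests exercise `process_document` once instead of duplicating coverage per transport.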
### Read when needed

- Technology Choices — when you need to understand why a technology was chosen or want to evaluate an alternative.
- Implementation Guide §4 — OpenTelemetry instrumentation patterns.
- Observability §7 — Structured logging contract. All log fields must conform to this schema.
### Can skip

- Infrastructure §5 (capacity planning tiers) — not relevant to day-to-day feature work.
- Project Charter — business context, not technical.
- Vibe Coding & Setup Guide §10 (go-live checklist) — that's for the SRE handoff, not feature development.
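One more pattern from this section's reading list (Design Concepts Category C) that is easier to grasp as code: exponential backoff. The sketch below uses full jitter; the base, cap, and growth factor are illustrative defaults, not this project's configured values.

```python
import random

def backoff_schedule(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff with full jitter: the ceiling doubles per attempt
    up to a cap, then a uniform random delay below the ceiling is chosen so
    retrying workers don't stampede at the same instant.
    Parameter values are illustrative, not this project's settings."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(0, ceiling)

# The deterministic ceiling doubles each retry until it hits the cap:
for attempt in range(7):
    print(attempt, min(60.0, 1.0 * 2 ** attempt))  # attempt 6 caps at 60.0
```

The jitter matters as much as the exponent: without it, every worker that failed together retries together, recreating the original overload.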
## SRE/DevOps

Goal: Deploy, operate, monitor, and scale the system. Respond to alerts and incidents.
- System Design §1.2 — Queue topology. Understand Option C (three separate Redis instances) and why it's the production choice.
- Infrastructure §1 — Deployment overview. Understand the VM topology and what runs where.
- Infrastructure §4 — VM specifications. RAM and CPU rationale for each VM — use this to size and justify.
- Implementation Guide §6 — Docker Compose for all services. The canonical deployment config.
- Implementation Guide §7 — nginx gateway config. Rate limit zones and upstream routing.
- Observability §4 — Alloy architecture. How metrics/logs/traces flow from each VM to the LGTM+P cluster.
- Observability §8 — Alloy configuration. The production `config.alloy` and the gateway nginx log pipeline.
- Observability §9 — Alerting rules. Set these up in Grafana before deploying to production. Pay special attention to the OOM alerts (`RedisBrokerHighMemory`, `WorkerOOMKilled`, `EmbeddingCircuitBreakerOpen`) — these are newly added and cover failure modes not in the original stack.
- Observability §10 — Grafana dashboards. Build these in order — Pipeline Overview first, then Gateway, then the rest.
- Observability §11 — Debugging workflow and OOM triage. The step-by-step decision tree from alert to root cause.
- Infrastructure §8 — Scaling playbook. Ordered actions per bottleneck signal — including the new OOM runbooks at the end.
- Infrastructure §6 — Monitoring cluster sizing and storage considerations at each tier.
- Infrastructure §7 — Bottleneck map. Tells you what limits you at each throughput tier.
- Observability §6 — Redis section — `redis_evicted_keys_total > 0` on a broker Redis is always a critical finding. The `noeviction` policy should prevent this, but monitor it.
- System Design §8 B1–B3 — Resilience Q&A. Covers worker crashes, Orchestrator restarts, embedding failures.
### Can skip

- Design Concepts (conceptual patterns — not operational).
- Technology Choices (decision rationale — not operational).
- QA & Test Strategy (developer/QA concern).
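The broker-Redis eviction rule above can be turned into a quick triage check. This is an illustrative sketch only: the INFO sample is hand-written, and a real check would read `INFO memory` and `INFO stats` from the broker rather than a string.

```python
# Illustrative triage helper for the broker-Redis rule: any eviction on a
# broker (whose maxmemory-policy should be noeviction) means queued tasks
# may have been silently dropped. The sample text below is hand-written,
# not real output from this stack.

def check_broker_redis(info_text: str) -> list[str]:
    # Parse Redis INFO-style "key:value" lines, skipping section comments.
    info = dict(
        line.split(":", 1)
        for line in info_text.splitlines()
        if ":" in line and not line.startswith("#")
    )
    findings = []
    if info.get("maxmemory_policy") != "noeviction":
        findings.append("CRITICAL: broker maxmemory-policy is not noeviction")
    if int(info.get("evicted_keys", 0)) > 0:
        findings.append("CRITICAL: broker Redis has evicted keys")
    return findings

sample = """# Memory
maxmemory_policy:allkeys-lru
evicted_keys:12"""
print(check_broker_redis(sample))
```

A healthy broker (`noeviction`, zero evictions) returns an empty findings list; either finding on its own warrants the OOM triage workflow in Observability §11.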
## QA/Testers

Goal: Understand system behaviour well enough to write meaningful tests, define acceptance criteria, identify edge cases, and validate that the system behaves correctly under failure conditions.
- System Design §2 — Job state machine. This is the core contract to test. Every status transition, every terminal state, the idempotency guard.
- System Design §5 — Service contracts (request/response schemas). These are the API contracts your integration tests must validate.
- System Design §7 — MLWorkbench API contract. The end-to-end user-facing flow.
- Design Concepts — C2 — Idempotency. Duplicate callbacks must be safe no-ops. This is a critical test case.
- Design Concepts — D6 — Structured error with stage context. The `failed_stage` + `error.code` contract drives UI behaviour. Test all error code paths.
### For integration and contract testing
- Design Concepts — C1 — Acknowledgement-based delivery. Test the scenario where a worker crashes mid-task — the job must be retried, not lost.
- Design Concepts — C3 — Visibility timeout. Test that a task that exceeds the timeout is correctly requeued.
- Design Concepts — C5 — Stuck job detection. Test that a job stuck in a non-terminal state for 30+ minutes is detected and actioned.
- Design Concepts — A4 — Circuit breaker. Test the three states: CLOSED → OPEN (3 failures) → HALF-OPEN → CLOSED (recovery).
- Observability §2 — Semantic attribute convention. Span attribute tests (`test_instrumentation.py`) must assert every attribute in this list is present on relevant spans.
- Implementation Guide §4 — Instrumentation test patterns. Use `InMemorySpanExporter` for span assertions.
- QA & Test Strategy — Full test plan, acceptance criteria checklist, chaos test scenarios, and performance baselines. This is your primary working document.
### Can skip

- Technology Choices (decision rationale — not test-relevant).
- Infrastructure §4–5 (VM sizing — not test-relevant).
- Observability §8 (Alloy config — infrastructure concern).
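The A4 circuit-breaker cycle above is small enough to drive through every transition in one test. The class below is a minimal illustrative model, not the project's real implementation: the threshold of 3 matches the description, but the cooldown value and method names are arbitrary.

```python
import time

# Minimal model of the A4 state machine: CLOSED → OPEN (after 3 failures)
# → HALF-OPEN (after a cooldown) → CLOSED (on a successful trial call).
# Illustrative only; thresholds and names are not the project's real code.

class CircuitBreaker:
    def __init__(self, threshold: int = 3, cooldown: float = 0.05):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if time.monotonic() - self.opened_at >= self.cooldown:
            return "HALF-OPEN"   # cooldown elapsed: allow one trial call
        return "OPEN"

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = time.monotonic()  # trip the breaker

    def record_success(self):
        self.failures, self.opened_at = 0, None  # full reset to CLOSED

# Walk the full cycle a contract test should assert:
cb = CircuitBreaker()
assert cb.state == "CLOSED"
for _ in range(3):
    cb.record_failure()
assert cb.state == "OPEN"
time.sleep(0.06)               # wait out the cooldown
assert cb.state == "HALF-OPEN"
cb.record_success()            # trial call succeeds
assert cb.state == "CLOSED"
print("full CLOSED -> OPEN -> HALF-OPEN -> CLOSED cycle verified")
```

A chaos variant of the same test would fail the trial call in HALF-OPEN and assert the breaker re-opens rather than closing.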
## Project Manager

Goal: Understand scope, dependencies, milestones, risks, and how to track delivery. Communicate status to stakeholders.
- Project Charter — Goals, milestones, risks, success metrics, and team responsibilities. This is your primary working document.
- System Design — System at a Glance table (top of document) — One-page summary of every architectural decision and when it will be revisited. Useful for status updates.
- System Design §9 — Architecture Decision Log. A record of every significant design decision — useful for understanding what is locked vs. what is still open.
### Understanding scope and dependencies
- System Design §1.1 — Service responsibilities table. Four services, clearly scoped. Use this to assign work and track completion.
- Technology Choices — Revisit triggers — Each technology section ends with a "revisit trigger." These are the conditions under which architectural rework would be needed — important for risk management.
- Design Concepts — A1 (API Key as Flat Identity) and A2 (CallerContext) — The current auth model is intentionally limited. The migration path to tenant/tier identity is designed but not built. This is a known scope item for a future phase.
### Understanding delivery risks
- Infrastructure §7 — Bottleneck map. Shows what limits throughput at each scale tier — useful for capacity conversations with stakeholders.
- Infrastructure §8 — Scaling playbook. Shows how each bottleneck is resolved — helps estimate effort for capacity-related work.
- QA & Test Strategy §5 — Risks and mitigations. Cross-reference with the Project Charter risk register.
### Can skip

- Implementation Guide (technical implementation detail).
- Observability §4, §6, §8 (operational detail).
- Design Concepts categories B–D (implementation patterns).
## Stakeholders

Goal: Understand what the system does, what it promises, what it costs to operate, and what the plan is for growing it.
- Project Charter — Executive Summary — What we are building, why, and what success looks like. Start here.
- Project Charter — Success Metrics — The measurable outcomes we are targeting.
### Understanding capabilities
- System Design — System at a Glance (first table in the document) — Eight decisions, each with a plain-English description. Non-technical readers can understand the shape of the system from this table alone.
- System Design §7.1–7.3 — The MLWorkbench API. What MLWorkbench can do: submit a URL, get a job ID back, check status, search the corpus.
- Technology Choices §8 — Self-Hosted H100 — Why CloudCIX operates its own GPU server rather than paying per token to OpenAI. The cost and data residency rationale is here.
### Understanding scale and cost
- Infrastructure §2 — Load model assumptions. What a "document" means in terms of size and processing time.
- Infrastructure §5 — Capacity planning tiers. Tier 1 (early users), Tier 2 (growth), Tier 3 (scale). Each tier shows what infrastructure is needed and at what cost.
- Infrastructure §3 — Throughput analysis. How many documents per hour the system can process at each configuration.
### Understanding risk
- Project Charter — Risks — The top risks and their mitigations.
- Technology Choices — Revisit triggers — The conditions under which significant architectural investment would be needed in future.
### Can skip

Everything else. The implementation details, observability config, and code patterns are not relevant to stakeholder conversations.
| Document | Developers | SRE/DevOps | QA/Testers | Project Manager | Stakeholders |
| --- | --- | --- | --- | --- | --- |
| System Design | ✅ Core | ✅ §1, §8 | ✅ §2, §5, §7 | ✅ §1 table, §9 | ✅ §1 table, §7 |
| Design Concepts | ✅ All | — | ✅ C1–C5, D6 | ✅ A1–A2 | — |
| Technology Choices | ✅ Reference | — | — | ✅ Revisit triggers | ✅ §8 |
| Infrastructure | — | ✅ All | — | ✅ §5, §7, §8 | ✅ §2, §3, §5 |
| Observability | ✅ §2, §7 | ✅ All | ✅ §2 | — | — |
| Implementation Guide | ✅ All | ✅ §6, §7 | ✅ §4 | — | — |
| QA & Test Strategy | ✅ §4 patterns | — | ✅ All | ✅ §5 risks | — |
| Project Charter | — | — | — | ✅ All | ✅ All |
| Vibe Coding & Setup Guide | ✅ All | ✅ §10 go-live | — | — | — |