Organisation: CloudCIX
Version: March 2026 — v1.0
Status: Active
Owner: CloudCIX Engineering
Related documents:
CloudCIX ML Services is the backend infrastructure that powers MLWorkbench — CloudCIX's document indexing and corpus management product. It gives MLWorkbench the ability to ingest web content, index it semantically, and retrieve relevant passages on demand.
A user submits a URL. The system fetches the page, breaks it into searchable segments, converts those segments into vector embeddings using the self-hosted H100 GPU, and stores the result in a vector database. The user can then search across their indexed corpus and retrieve the most relevant content. Chat and conversational Q&A are handled by a separate service — this system is purely the indexing and retrieval layer.
The pipeline runs on CloudCIX-owned infrastructure. No content leaves CloudCIX infrastructure and there are no per-token costs to external AI providers.
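The four stages above can be sketched as a linear pipeline. This is an illustrative stub, not the real service code: the scraper, embedder, and store are stand-ins, and the function names are assumptions rather than actual CloudCIX APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    text: str
    vector: Optional[list] = None


def scrape(url: str) -> str:
    """Stage 1 — fetch the page (stubbed with placeholder text here)."""
    return f"content of {url}"


def chunk(text: str, size: int = 40) -> list:
    """Stage 2 — split into fixed-size segments (the real service uses tokens)."""
    return [Chunk(text[i:i + size]) for i in range(0, len(text), size)]


def embed(chunks: list) -> list:
    """Stage 3 — attach a vector per chunk (the real service calls the H100)."""
    for c in chunks:
        c.vector = [float(len(c.text))]  # placeholder embedding
    return chunks


def store(corpus: dict, name: str, chunks: list) -> None:
    """Stage 4 — persist vectors, keyed by corpus name (in-memory stand-in)."""
    corpus.setdefault(name, []).extend(chunks)


def index_url(corpus: dict, name: str, url: str) -> int:
    """Run all four stages for one URL; returns the number of chunks stored."""
    chunks = embed(chunk(scrape(url)))
    store(corpus, name, chunks)
    return len(chunks)
```

The real pipeline runs each stage as a separate service with a queue between stages; the chaining here only shows the data flow.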
MLWorkbench users need to build and search a persistent corpus of web content — internal documentation, industry reports, competitor pages, product documentation. Today, MLWorkbench has no way to ingest and index external content. Users must manually paste text, which is fragile, time-consuming, and doesn't scale.
The problem: MLWorkbench cannot connect to external content sources and has no persistent indexed corpus that can be searched across sessions.
The impact: Users cannot maintain a living library of URLs that stays current and searchable. Use cases like "index these 50 URLs and let me search across them" are impossible today.
The solution: Build a content ingestion and retrieval pipeline that accepts URLs, fetches and processes the content, and makes it semantically searchable — so MLWorkbench can surface relevant passages from a user-defined, persistent corpus.
User submits URL via MLWorkbench
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CloudCIX ML Pipeline │
│ │
│ 1. Scraper — fetches the page from the internet │
│ 2. Chunking — splits the page into ~17 searchable │
│ segments (for a typical article) │
│ 3. Embedding — converts each segment into a vector │
│ using the H100 GPU │
│ 4. Vector DB — stores the vectors for search │
└─────────────────────────────────────────────────────────────┘
│
▼
User searches "refund policy" in MLWorkbench corpus browser
│
▼
System returns the most relevant indexed passages from their corpus
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
Chat / Q&A is handled by a separate service.
It may call POST /search to retrieve relevant passages,
but that integration is out of scope for this document.
─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─
| Service | Plain English | Technology |
|---|---|---|
| Orchestrator | Traffic controller — coordinates all pipeline stages and tracks job status | Python / FastAPI / PostgreSQL |
| Scraper | Web fetcher — downloads the page and respects site rules | Python / Celery |
| Chunking | Text splitter — divides content into overlapping 512-token segments | Python / FastAPI |
| Embedding DB API | AI encoder + store — converts text to vectors and handles search queries | Python / Celery / Vector DB |
| Gateway | Front door — rate limits requests and routes traffic | nginx |
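The Chunking service's "overlapping 512-token segments" can be illustrated with a sliding window. A sketch, assuming a pre-tokenised input and a hypothetical 64-token overlap (the overlap size is not specified in this document):

```python
def chunk_tokens(tokens: list, size: int = 512, overlap: int = 64) -> list:
    """Sliding window: each segment shares `overlap` tokens with the previous one."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the final window reached the end of the document
    return chunks
```

With these parameters, a ~7,600-token article yields 17 segments, which matches the "~17 searchable segments" figure in the pipeline overview.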
| Metric | Target |
|---|---|
| All five service VMs deployed and healthy | 100% |
| Full pipeline successfully processes a test URL end-to-end | Pass |
| MLWorkbench search returns results from indexed content | Pass |
| All Grafana dashboards operational with live data | Pass |
| All alerting rules active in production | Pass |
| SLO | Target | How measured |
|---|---|---|
| Pipeline availability | 99.5% | jobs_completed_total / jobs_submitted_total over 7 days |
| End-to-end pipeline P99 latency | < 15s | job_end_to_end_duration_seconds P99 histogram |
| Search P99 latency | < 500ms | embedding_request_duration_seconds{operation="read"} P99 |
| Failed job rate | < 5% | jobs_failed_total / jobs_submitted_total over 24h |
| Zero silent job loss | 0 jobs silently lost | jobs_submitted_total - jobs_completed_total - jobs_failed_total == 0 |
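The zero-silent-loss SLO is a conservation check over the three job counters: every submitted job must end up either completed or failed. A minimal sketch of evaluating it and the availability SLO from scraped counter values (counter names from the table; the evaluation logic is illustrative):

```python
def silently_lost(submitted: int, completed: int, failed: int) -> int:
    """Jobs unaccounted for; must be zero per the SLO table."""
    return submitted - completed - failed


def availability(submitted: int, completed: int) -> float:
    """Pipeline availability: jobs_completed_total / jobs_submitted_total."""
    return completed / submitted if submitted else 1.0
```

For example, 1,000 submitted with 990 completed and 10 failed satisfies the invariant, while 990 completed and 5 failed means 5 jobs vanished and should page someone.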
| Metric | Target |
|---|---|
| Documents indexed per hour | ≥ 600 |
| Concurrent search users supported | ≥ 50 |
| Job history retained | ≥ 90 days |
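Combining the 600 documents/hour target with the "~17 segments per typical article" figure from the pipeline overview gives a rough embedding throughput budget. Back-of-envelope, assuming the 17-chunk figure is representative:

```python
docs_per_hour = 600        # capacity target from the table above
chunks_per_doc = 17        # typical article, per the pipeline overview

embeddings_per_hour = docs_per_hour * chunks_per_doc    # 10,200
embeddings_per_second = embeddings_per_hour / 3600      # ~2.8

print(embeddings_per_hour, round(embeddings_per_second, 1))
```

Roughly 3 embeddings per second of 512-token segments is likely well within a single H100's capacity, so indexing throughput is more likely to be bound by scraping or queueing than by the GPU.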
Shared library (ml-shared) created with auth, rate limiting, and instrumentation; local development environment (make dev-full).

| Role | Responsibility |
|---|---|
| Engineering Lead | Architecture ownership, technical decisions, code review |
| Backend Developers (×2) | Service implementation, shared library, integration tests |
| SRE / DevOps (×1) | VM provisioning, Docker Compose deployment, Grafana dashboards, alerting |
| QA Engineer (×1) | Test strategy, integration tests, chaos tests, acceptance criteria sign-off |
| Product Manager | Requirements, milestone tracking, stakeholder communication |
| Membership Service Team | Auth API dependency — POST /auth endpoint SLA, future identity claims |
| H100 Infrastructure Team | GPU server availability, inference server uptime, OOM mitigation |
| Risk | Likelihood | Impact | Mitigation | Owner |
|---|---|---|---|---|
| H100 server hardware failure | Low | High | Circuit breaker fails gracefully; jobs queue until recovery. Monitor with EmbeddingCircuitBreakerOpen alert. Add second GPU server at Tier 3. | H100 Infra Team |
| Target sites rate-limiting the Scraper | High | Medium | Per-domain rate limiting in Scraper. scrape_retry_total metric per domain. 429 errors surface in jobs_failed_total{failed_stage="scraping"}. | Backend |
| Redis OOM causing silent task drop | Medium | High | noeviction policy prevents silent drops. RedisBrokerCriticalMemory alert. Runbook in Infrastructure §8. | SRE |
| Postgres dead tuple accumulation | Medium | Low | Autovacuum monitoring. VACUUM ANALYZE jobs in operational runbook. Alert on dead tuple %. | SRE |
| Auth model insufficient for multi-tenant needs | High (future) | Medium | CallerContext abstraction is forward-compatible. No tenant_id added until Membership provides it. Scope is locked for v1. | Engineering Lead |
| Worker OOM on large documents | Medium | Medium | mem_limit in docker-compose. Document size pre-filter. WorkerOOMKilled alert. Chaos test Scenario 7. | Backend / SRE |
| Membership service unavailability | Low | High | Circuit breaker (open after 3 failures, 30s recovery). Auth cache (60s TTL reduces dependency). | Backend |
| DLQ implementation gap | Medium | Medium | DLQ is monitored but not yet implemented (v1 gap). Add DLQ implementation in Phase 3 before production. | Backend |
| MLWorkbench polling overhead at scale | Low | Low | Exponential backoff reduces overhead. SSE upgrade path designed and documented. Implement in Phase 4. | Backend |
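The Membership mitigation above specifies a circuit breaker that opens after 3 consecutive failures and probes again after 30 s. A minimal sketch of that policy — the thresholds come from the risk table, but the class shape and method names are illustrative, not the real implementation:

```python
import time
from typing import Optional


class CircuitBreaker:
    """Opens after `threshold` consecutive failures; half-opens after `recovery` seconds."""

    def __init__(self, threshold: int = 3, recovery: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.recovery = recovery
        self.clock = clock               # injectable for testing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                  # closed: normal operation
        if self.clock() - self.opened_at >= self.recovery:
            return True                  # half-open: let one probe through
        return False                     # open: fail fast (caller returns 503)

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None            # close the breaker

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.threshold:
            self.opened_at = self.clock()
```

With the 60 s auth cache in front of this, most requests never reach Membership at all; the breaker only matters for cache misses during an outage.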
| Dependency | Owner | Status | Risk if unavailable |
|---|---|---|---|
| Membership Service POST /auth endpoint | Membership Team | Live | Auth falls back to circuit breaker (503 for all requests) |
| H100 GPU server | H100 Infra Team | Deployed | Embedding stage fails; jobs queue until recovery |
| LGTM+P K8s cluster | SRE | Deployed | No observability — production launch without observability is not acceptable |
| Dependency | Degraded behaviour |
|---|---|
| Vector DB unavailable | Embedding jobs fail with failed_stage: embedding. Scraping and chunking unaffected. |
| Postgres unavailable | Orchestrator cannot accept new jobs or serve status queries. Workers continue processing existing tasks. |
maxmemory-policy noeviction is mandatory on all broker Redis instances. This is a correctness constraint, not a performance preference.

Flat identity model. All API keys grant equal access. There is no way to say "key A can only access corpus X" or "key B is on a restricted tier." This is intentional — the Membership service does not currently provide identity claims. The CallerContext abstraction is ready to accept richer claims when Membership evolves.
No corpus-level access control. Corpus names are plain strings, not namespaced by user or organisation. Any caller with a valid API key can read any corpus by name. This is acceptable for the initial internal product but will need to change as the user base grows.
No scheduled re-indexing. Content is indexed once, on submission. Stale content is not automatically refreshed. Users must re-submit URLs to update indexed content.
Polling only for job progress. MLWorkbench polls GET /jobs/{id} to track pipeline progress. Server-Sent Events (SSE) are the right long-term solution and the architecture is designed to support them, but the implementation is deferred to Phase 4.
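The polling-overhead mitigation in the risk table mentions exponential backoff. A sketch of the delay schedule a client like MLWorkbench might use between successive GET /jobs/{id} polls — the base delay, growth factor, and cap here are assumptions, not values specified in this document:

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 30.0, attempts: int = 6) -> list:
    """Delay (seconds) before each successive poll, doubling up to a cap."""
    delay = base
    out = []
    for _ in range(attempts):
        out.append(min(delay, cap))
        delay *= factor
    return out


print(backoff_delays())  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Early polls stay responsive for short jobs while long-running jobs settle into one request every 30 s, which is what keeps polling cheap until the SSE upgrade lands in Phase 4.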
HTTP/HTTPS URLs only. PDFs, Word documents, internal SharePoint pages, and email content are not supported in v1.
No dead letter queue implementation. The observability layer monitors DLQ depth and has alerts for it, but the application-layer DLQ routing code has not yet been built. This must be completed before production launch (Phase 3).
| Role | Name | Date | Status |
|---|---|---|---|
| Engineering Lead | Pending | ||
| Product Manager | Pending | ||
| SRE Lead | Pending | ||
| QA Lead | Pending | ||
| H100 Infrastructure Lead | Pending |
This document should be reviewed and updated at the start of each phase. Significant architectural changes require a new revision with Engineering Lead approval.