1. 🎯 Sprint Summary
| Field | Detail |
|---|---|
| Sprint | 6.3 (Federation + 2nd Hi-End Server) |
| Duration | 15 Nov - 5 Dec 2027 (3 weeks · 15 working days) |
| Goal | Cyberjaya 2nd Hi-End cluster · cross-region replication · automatic failover · per-tenant region pinning · DR drill (full restore < 4h) · 99.9% multi-region uptime · 35 tenants live |
| Capacity | 5.5 FTE (2 BE + 1 FE + 2 DevOps + 0.5 Compliance) + 0.5 Founder + 0.3 Doc Zam |
| Velocity target | 90 SP |
| Critical risk | No data loss during cutover · cross-region replication lag < 1s p95 |
| Demo date | 5 Dec 2027 |
2. 🏗️ Federation Architecture
┌─────────────────────────┐                 ┌─────────────────────────┐
│ KL DC · Region A        │   ◄──────────►  │ Cyberjaya · Region B    │
│                         │   10G dark      │                         │
│ Hi-End Cluster #1       │   fibre + VPN   │ Hi-End Cluster #2       │
│ 4× L40S GPU             │                 │ 4× L40S GPU             │
│ Inference: vLLM         │                 │ Inference: vLLM         │
│ Storage: NVMe primary   │                 │ Storage: NVMe replica   │
│                         │                 │                         │
│ MariaDB Galera node 1   │   ◄─async────►  │ MariaDB Galera node 2   │
│ pgvector primary        │   cross-region  │ pgvector replica        │
│ MinIO bucket replica    │   < 1s p95      │ MinIO bucket replica    │
└──────────┬──────────────┘                 └──────────┬──────────────┘
           │                                           │
           ▼                                           ▼
┌─────────────────────────────────────────────────────────────────────┐
│ HAProxy Region Router (DNS-aware)                                   │
│ Tenant region tag → primary region · automatic failover on health   │
│ Health check every 10s · failover RTO target < 30s                  │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
                     Tenants · Patients · Doctors
PER-TENANT REGION PINNING
· Tenant created with primary_region tag (kl_dc_1 OR cyberjaya_1)
· Reads and writes go to the primary region first
· When the primary is unhealthy → automatic failover to secondary
· After recovery → automated catch-up + re-pin to primary
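The pinning rules above can be sketched as a small routing function. This is a minimal illustration, not the HAProxy configuration itself: the `Tenant` class, `secondary_of`, and `route` are hypothetical names, and the health map is assumed to come from the router's health checks. Only the region tags `kl_dc_1` and `cyberjaya_1` are taken from the plan.

```python
from dataclasses import dataclass

# Region tags as named in the plan; everything else here is illustrative.
REGIONS = ("kl_dc_1", "cyberjaya_1")


@dataclass(frozen=True)
class Tenant:
    tenant_id: str
    primary_region: str  # expected to be one of REGIONS


def secondary_of(region: str) -> str:
    """Return the other region in the two-region federation."""
    if region not in REGIONS:
        raise ValueError(f"unknown region: {region}")
    return REGIONS[1] if region == REGIONS[0] else REGIONS[0]


def route(tenant: Tenant, healthy: dict[str, bool]) -> str:
    """Pick the region that should serve this tenant right now.

    The primary wins whenever it is healthy; otherwise traffic goes to
    the secondary. A region missing from the health map counts as down.
    """
    primary = tenant.primary_region
    if healthy.get(primary, False):
        return primary
    secondary = secondary_of(primary)
    if healthy.get(secondary, False):
        return secondary
    raise RuntimeError("no healthy region available")


if __name__ == "__main__":
    tenant = Tenant("tenant-001", "kl_dc_1")
    print(route(tenant, {"kl_dc_1": True, "cyberjaya_1": True}))
    print(route(tenant, {"kl_dc_1": False, "cyberjaya_1": True}))
```

The sketch re-pins to the primary as soon as it reports healthy. The plan also requires an automated catch-up before re-pinning, so a real router would presumably hold traffic on the secondary until replication has caught up.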
REPLICATION GUARANTEES
· Galera cluster · synchronous within region · async cross-region
· pgvector · async streaming replica · < 1s lag p95
· MinIO · cross-region bucket replication · 5min interval
· Audit log · WORM with cross-region hash chain
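One way to picture the cross-region hash chain is below. It is a sketch under assumptions: SHA-256 over canonical JSON, an all-zero genesis value, and the function names (`build_chain`, `verify_chain`, `regions_agree`) are illustrative choices, not the format the M9 audit actually uses.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the chain


def entry_hash(prev_hash: str, payload: dict) -> str:
    """Hash an entry together with the hash of the entry before it."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def build_chain(payloads: list[dict]) -> list[dict]:
    """Append-only chain: each entry stores its predecessor's hash."""
    chain, prev = [], GENESIS
    for payload in payloads:
        digest = entry_hash(prev, payload)
        chain.append({"payload": payload, "prev": prev, "hash": digest})
        prev = digest
    return chain


def verify_chain(chain: list[dict]) -> bool:
    """Recompute every hash; any edited or reordered entry breaks the chain."""
    prev = GENESIS
    for entry in chain:
        if entry["prev"] != prev:
            return False
        if entry["hash"] != entry_hash(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True


def regions_agree(region_a: list[dict], region_b: list[dict]) -> bool:
    """Both copies must verify, and the shorter copy (a lagging replica)
    must match the longer one on every entry they have in common."""
    if not (verify_chain(region_a) and verify_chain(region_b)):
        return False
    common = min(len(region_a), len(region_b))
    return all(region_a[i]["hash"] == region_b[i]["hash"] for i in range(common))
```

Treating the shorter chain as a valid prefix lets a replica lag behind without raising a false alarm, while any divergence in the shared entries still fails the check.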
3. 🚦 Pre-Sprint Gate Checklist
- Sprint 6.2 closeout · 30 tenants stable across 4 states
- Cyberjaya colocation contract signed · power audit passed
- 2nd Hi-End hardware delivered · 4× L40S GPUs · same spec as KL DC
- 10G dark fibre + VPN to KL DC provisioned · < 5ms RTT verified
- Galera cluster strategy agreed (synchronous in-region · async cross-region)
- HAProxy region router deployed in test environment
- Per-tenant region tag schema migration drafted
- DR runbook v2 with cross-region procedures
- Pen-test scope updated (multi-region surface)
- Compliance Lead reviewed cross-region data residency (still on Malaysian soil)
4. 🧩 Sprint Scope
- Cyberjaya Cluster Bring-Up: Rack · power · cooling · 4× L40S install · OS + driver · vLLM · Triton · LiteLLM proxy
- Cross-Region Networking: 10G dark fibre · VPN tunnel · IPsec · DNS region-aware
- Galera Cross-Region Cluster: Synchronous in-region · async streaming cross-region · conflict resolution · last-writer-wins with audit
- pgvector Async Replica: Streaming replication · < 1s lag p95 · monitoring
- MinIO Bucket Replication: Cross-region S3 replication · 5min interval · DICOM + audio shared
- Per-Tenant Region Pinning: Tenant migration with region tag · routing rules · admin override
- HAProxy Region Router: DNS-aware · health check 10s · failover RTO < 30s · automatic re-pin
- Audit Log Cross-Region Hash Chain: WORM extended cross-region · M9 audit verifies both sides
- DR Drill v2: Full restore from cold backup · cross-region · target RTO < 4h · documented
- Multi-Region Monitoring: Filament admin extension · per-region health · replication lag · latency
- Per-Tenant Migration: 5 East Malaysia tenants migrated to Cyberjaya as primary (closer to East Malaysia)
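The router targets in the scope list (health check every 10s, failover RTO < 30s) leave a tight detection budget. The arithmetic below is a rough sketch under a simplified model: worst-case failover time is taken as the check interval times the number of consecutive failed checks, plus one check timeout and a switchover time. The 2s timeout and 5s switchover are assumed figures for illustration, not measurements. In HAProxy terms the interval and failure count would correspond to the `inter` and `fall` server options.

```python
def worst_case_failover_seconds(interval_s, fall, check_timeout_s, switchover_s):
    """Simplified worst case: `fall` failed checks spaced `interval_s` apart,
    plus one check timeout and the time to redirect traffic."""
    return interval_s * fall + check_timeout_s + switchover_s


def max_fall_count(rto_s, interval_s, check_timeout_s, switchover_s):
    """Largest number of consecutive failed checks that still fits the RTO."""
    budget = rto_s - check_timeout_s - switchover_s
    return max(0, int(budget // interval_s))


if __name__ == "__main__":
    # 10s interval and 30s RTO are from the plan; 2s and 5s are assumptions.
    print(max_fall_count(30, 10, 2, 5))
    print(worst_case_failover_seconds(10, 2, 2, 5))
```

Under these assumed figures only two consecutive failed checks fit inside the 30s target, which suggests the router cannot wait for a third check at a 10s interval without either shortening the interval or relaxing the RTO.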
5. 📅 Day-by-Day Plan (15 days)
D1 · Mon 15 Nov · Cyberjaya Hardware + OS
Rack power up · OS install · NVIDIA driver · network bring-up.
D2 · Tue 16 Nov · Cross-Region Network
10G fibre + VPN · IPsec · < 5ms RTT verified · DNS region-aware setup.
D3 · Wed 17 Nov · Inference Stack Cyberjaya
vLLM · Triton · LiteLLM · same model versions as KL DC · smoke test.
D4 · Thu 18 Nov · Galera Cross-Region Cluster
Galera node 2 join · synchronous in-region · async streaming cross-region.
D5 · Fri 19 Nov · Mid-Demo + Replication Verification
Live mid-demo · replication lag < 1s p95 verified · MinIO sync confirmed.
D6 · Mon 22 Nov · pgvector Replica
Streaming replica · CPG corpus mirrored · vector search cross-region tested.
D7 · Tue 23 Nov · Per-Tenant Region Tag Migration
tenant_region column · backfill all 30 tenants with kl_dc_1 default · audit.
D8 · Wed 24 Nov · HAProxy Region Router
DNS-aware routing · health check 10s · failover < 30s tested.
D9 · Thu 25 Nov · Tenant Failover Drills
5 failover scenarios per tenant · RTO measured · zero data loss verified.
D10 · Fri 26 Nov · Mid-Demo Round 2 + DR Drill
Full DR drill · cold backup restore cross-region · RTO measured.
D11 · Mon 29 Nov · Audit Log Cross-Region Chain
WORM extended · cross-region hash chain · M9 audit verifies both sides.
D12 · Tue 30 Nov · East Malaysia Tenant Migration
5 East Malaysia tenants migrated to Cyberjaya primary · latency improvement measured.
D13 · Wed 1 Dec · Multi-Region Monitoring Dashboard
Filament dashboard · per-region health · replication lag · latency per tenant.
D14 · Thu 2 Dec · Pen-Test Light + Hardening
Multi-region surface tested · cross-region data leak attempt blocked.
D15 · Fri 3 Dec · Demo Prep + Polish
Demo deck · 35-tenant milestone · federation success metrics.
+ · Mon 5 Dec · Sprint Demo + Retro
9am demo · 11am retro · 2pm 6.4 (MOH Partnership) prep.
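The D7 step (add a `tenant_region` column, backfill existing tenants with `kl_dc_1`, then audit) could look roughly like the sketch below. It runs against an in-memory SQLite database purely so the example is self-contained; the production store is MariaDB, and the `tenants` table name and its columns are assumptions.

```python
import sqlite3

ALLOWED_REGIONS = ("kl_dc_1", "cyberjaya_1")  # region tags from the plan
DEFAULT_REGION = "kl_dc_1"


def add_region_column(conn: sqlite3.Connection) -> int:
    """Add tenant_region and backfill rows that have no region yet.

    Returns the number of tenants that received the default region.
    """
    conn.execute("ALTER TABLE tenants ADD COLUMN tenant_region TEXT")
    cur = conn.execute(
        "UPDATE tenants SET tenant_region = ? WHERE tenant_region IS NULL",
        (DEFAULT_REGION,),
    )
    conn.commit()
    return cur.rowcount


def audit_regions(conn: sqlite3.Connection) -> dict:
    """Count tenants per region and fail on NULL or unknown region tags."""
    rows = conn.execute(
        "SELECT tenant_region, COUNT(*) FROM tenants GROUP BY tenant_region"
    ).fetchall()
    counts = {region: count for region, count in rows}
    invalid = [region for region in counts if region not in ALLOWED_REGIONS]
    if invalid:
        raise ValueError(f"tenants with invalid region tags: {invalid}")
    return counts


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tenants (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO tenants (name) VALUES (?)",
        [(f"tenant-{i}",) for i in range(30)],
    )
    print(add_region_column(conn))
    print(audit_regions(conn))
```

Running the audit again after the D12 migration should show five tenants under `cyberjaya_1`, which gives a quick cross-check against the sign-off item.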
6. 📦 Deliverables
| FR | Item | SP |
|---|---|---|
| FR-6.3.1 | Cyberjaya Hi-End cluster bring-up | 8 |
| FR-6.3.2 | 10G dark fibre + VPN cross-region | 5 |
| FR-6.3.3 | vLLM + Triton + LiteLLM Cyberjaya | 5 |
| FR-6.3.4 | Galera cross-region cluster | 8 |
| FR-6.3.5 | pgvector streaming replica | 5 |
| FR-6.3.6 | MinIO cross-region replication | 5 |
| FR-6.3.7 | Per-tenant region tag schema + migration | 5 |
| FR-6.3.8 | HAProxy region router + failover | 8 |
| FR-6.3.9 | 5 failover drill scenarios + measurements | 5 |
| FR-6.3.10 | DR drill v2 cross-region full restore | 8 |
| FR-6.3.11 | WORM audit cross-region hash chain | 5 |
| FR-6.3.12 | 5 East Malaysia tenants migrated to Cyberjaya | 5 |
| FR-6.3.13 | Multi-region monitoring dashboard | 5 |
| FR-6.3.14 | Pen-test light · multi-region surface | 5 |
| FR-6.3.15 | Compliance docs update · cross-region | 3 |
| TOTAL | | 85 |
7. 👥 Team Capacity
| Role | Allocation | Focus |
|---|---|---|
| Eng Lead / BE | 1.0 FTE | Region tag · Galera · ConsentedFetch cross-region |
| BE Dev 2 | 1.0 FTE | Replication · audit hash chain · failover handler |
| FE Dev | 1.0 FTE | Multi-region monitoring dashboard · admin region switcher |
| DevOps 1 | 1.0 FTE | Cyberjaya cluster · vLLM · network · DR |
| DevOps 2 | 1.0 FTE | Galera cluster · pgvector · MinIO · monitoring |
| Compliance Lead | 0.5 FTE | Cross-region PDPA + data residency · doc pack update |
| Founder | 0.5 FTE | Cyberjaya colo + vendor relations · pen-test coord |
| Doc Zam | 0.3 FTE | Clinical impact assessment · failover oversight |
8. 🔔 Sprint Ceremonies
- Mon 15 Nov 9am — Sprint Planning (90 min)
- Daily 9am — Standup (15 min · DevOps deep-dive Tue/Thu)
- Fri 19 + Fri 26 Nov 4pm — Mid-sprint demos (60 min each)
- Thu 25 Nov 4pm — Failover drill rehearsal (90 min)
- Tue 30 Nov 4pm — DR drill review (60 min)
- Thu 2 Dec 2pm — Pen-test debrief (60 min)
- Mon 5 Dec 9am — Sprint Demo (90 min)
- Mon 5 Dec 11am — Sprint Retro (60 min)
9. 🩺 Sign-off Items
- Cross-region replication lag < 1s p95 verified
- Failover RTO < 30s on 5 different drill scenarios
- DR drill: full restore < 4h · cross-region · audit intact
- Per-tenant region pinning works · admin override functional
- WORM audit cross-region hash chain unbroken
- 5 East Malaysia tenants migrated · latency improvement measured · zero data loss
- Pen-test light: 0 critical · 0 high · ≤ 2 medium with mitigations
- Compliance pack updated · external consultant attestation
- Final demo (5 Dec) — written sign-off · Compliance Lead + Doc Zam
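The first sign-off item depends on how p95 is computed from lag samples. A minimal sketch using the nearest-rank definition is below. The function names and the idea of feeding it a flat list of lag measurements in seconds are assumptions, and the monitoring stack may use a different percentile method.

```python
def p95(samples):
    """Nearest-rank 95th percentile: the smallest sample such that at
    least 95% of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids float rounding issues.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def lag_within_target(samples_s, target_s=1.0):
    """True when p95 replication lag is strictly below the target."""
    return p95(samples_s) < target_s


if __name__ == "__main__":
    # Illustrative samples only: 95 fast replications and 5 slow ones.
    samples = [0.2] * 95 + [3.0] * 5
    print(p95(samples), lag_within_target(samples))
```

With a p95 target, up to 5% of samples may exceed 1s and the check still passes, so it may be worth tracking the maximum lag alongside it during failover drills.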
10. 🎬 Demo Agenda — 5 Dec 9am (90 min)
| Time | Segment |
|---|---|
| 0-5 | Federation narrative · 35-tenant milestone · multi-region rationale |
| 5-15 | Architecture walk · KL DC + Cyberjaya · replication topology |
| 15-30 | Live failover demo · kill KL DC simulator · East Malaysia tenant continues |
| 30-45 | DR drill replay · full restore from cold backup · audit chain verified |
| 45-55 | Per-tenant region pinning · admin region switcher · East Malaysia migration |
| 55-65 | Multi-region monitoring dashboard · per-region health · replication lag |
| 65-80 | Pen-test results · cross-region surface verified · compliance attestation |
| 80-90 | Compliance Lead + Doc Zam sign-off · 6.4 (MOH Partnership) prep |
11. 🛡️ Contingency
| Risk | Trigger | Response |
|---|---|---|
| Hardware delivery slip | 4× L40S vendor late | Cloud burst · use existing KL DC plus rented Cyberjaya cloud GPU temporarily |
| Replication lag > 1s | p95 misses target | Tighten Galera config · add NVMe write cache · escalate to vendor |
| Failover RTO > 30s | Health check too slow | Tune health check interval · reduce DNS TTL · TCP fast-failover |
| Cross-region data leak (pen-test) | High-sev finding | Hot-fix · re-test · slip 6.4 by 1 week if needed |
| Compliance attestation withheld | Cross-region PDPA concern | Engage backup consultant · address findings · iterate |
| Cyberjaya colo issue | Power/cooling/network fail | Failover to KL DC · post-mortem · escalate to colo provider |