Building a Unified Multi-Cloud AI Fabric: Cloud-Native Patterns for Portable and Composable ML Services
Authors: Santosh Pashikanti
DOI: https://doi.org/10.37082/IJIRMPS.v13.i4.232861
Short DOI: https://doi.org/hbf8zk
Country: United States
Abstract:
Enterprises that build AI/ML platforms at scale are increasingly forced into multi-cloud strategies—sometimes by design (best-of-breed services, regulatory constraints, M&A), sometimes by accident. While Kubernetes and containers promise workload portability, the reality for AI/ML workloads is far more complex: data gravity, GPU scarcity, heterogeneous managed AI services, and fragmented MLOps tooling make “build once, run anywhere” difficult to realize in practice.
In this paper, I propose a unified multi-cloud AI fabric: an opinionated but vendor-neutral architecture that standardizes how AI/ML workloads are built, deployed, and operated across AWS, Google Cloud, and Microsoft Azure using containers, Kubernetes, and cloud-native abstraction layers. Building on recent work in cloud-native AI, Kubernetes-based ML platforms (e.g., Kubeflow), and distributed serving frameworks, the fabric defines a layered architecture with consistent patterns for portable training pipelines, composable inference graphs, cross-cloud traffic steering, and policy-driven governance.
I describe system requirements and design principles for such a fabric, including portability, composability, resilience, data locality, GPU efficiency, and security. I then present a reference architecture spanning EKS, AKS, and GKE, and walk through an implementation and case study of a global recommendation and fraud-detection platform. An evaluation compares this fabric against a single-cloud baseline along dimensions of migration effort, time-to-deploy, failover RTO, and cost utilization. Finally, I discuss trade-offs and outline future directions, including AI-native control planes, WASM-based runtimes, and cross-cloud vector databases. My goal is to provide a practical blueprint that other architects and ML platform teams can adapt, rather than yet another theoretical multi-cloud diagram that never survives contact with production.
Keywords: Multi-cloud, Kubernetes, AI/ML, cloud-native, MLOps, portability, composable services, EKS, AKS, GKE, Kubeflow, Ray, KServe, SageMaker, Vertex AI, Azure Machine Learning.
Paper Id: 232861
Published On: 2025-07-16
Published In: Volume 13, Issue 4, July-August 2025
All research papers published in this journal/on this website are openly accessible and licensed under