Managed ML Services Cheatsheet#
Side-by-side reference for managed AI/ML platform services across the major clouds. Use this to compare what each provider offers for model catalogs, MLOps tooling, compute options, and unique differentiators when choosing where to run a workload.
Azure#
Azure AI Platform Overview#
Unified experience: Azure AI Studio vs. Azure Machine Learning studio.
Key pillars: foundational models, machine learning, responsible AI.
Core AI Development & Model Services#
Azure AI Studio#
Hub for generative AI and copilot development.
Prompt Flow for orchestrating LLM workflows.
Evaluation and monitoring tools.
Model Infrastructure#
Azure OpenAI Service: GPT-4o, GPT-4 Turbo, embeddings (
text-embedding-3-large), DALL-E 3, moderation, fine-tuning.Azure AI Model Catalog (via AI Studio): curated catalog of open-source and frontier models (Llama, Mistral, Phi, etc.) with integrated endpoints.
Custom model training: bring your own model (BYOM) and framework to Azure ML.
Applied AI Services (Azure AI Services)#
Cognitive Services: Vision, Speech, Language, Decision — prebuilt APIs.
Azure AI Search: hybrid (keyword + vector) search with semantic ranking as the backbone for production RAG systems.
Azure AI Agents: service for building multi-step, reasoning agents with built-in tools and evaluation.
Azure AI Speech: text-to-speech avatars, real-time speech-to-speech translation.
Machine Learning Lifecycle (Azure Machine Learning)#
Data & compute infrastructure: notebooks, compute clusters/instances, serverless compute for ML.
Model training & experimentation: automated ML (AutoML), MLflow integration.
MLOps: model registry, pipelines, endpoint deployment, monitoring.
Core Cloud Infrastructure for AI/ML#
Compute Options#
Virtual Machines: NCasT4_v3, NC A100 v4, ND H100 v5/H200 series for cutting-edge GPUs. Spot instances for cost-effective training.
Serverless inference: Managed Online Endpoints (with auto-scaling) and Serverless Endpoints (pay-per-execution) in Azure ML.
Kubernetes: Azure Kubernetes Service (AKS) with GPU nodes for scalable inference.
Data & Storage#
Blob Storage & Data Lake Storage Gen2: primary data store for unstructured data.
Vector databases: Azure Cache for Redis (with vector search) or integrated vector search in Azure AI Search.
Databases: Cosmos DB (for operational data), Azure SQL Database.
Responsible AI & Governance#
Toolbox for fairness, interpretability, and transparency.
Content safety filters.
Audit trails and compliance (GDPR, etc.).
Architecture Patterns & Best Practices#
Retrieval-Augmented Generation (RAG) with Azure AI Search.
Fine-tuning vs. prompt engineering.
Cost optimization: spot VMs, managed endpoints, utilization monitoring.
Integration & Ecosystem#
GitHub Copilot & Azure: developer integration.
Power Platform: build AI-powered apps with Copilot Studio.
Microsoft Fabric: unified analytics platform with AI synergy (OneLake, Synapse Data Engineering, Direct Lake).
Azure References#
GCP#
Google Cloud AI/ML Overview#
Google Cloud Platform (GCP) offers a powerful suite of AI/ML services and is particularly strong in AI research and development.
Core Compute Services#
Compute Engine#
Virtual machines with GPU/TPU support
A2, N1 instances for ML workloads
Preemptible VMs for cost savings
Cloud Functions#
Serverless compute
Event-driven architecture
Auto-scaling
Cloud Run#
Containerized applications
Serverless container platform
Pay-per-request pricing
AI/ML Services#
Vertex AI#
Unified ML platform (replaces the legacy AI Platform)
Training with custom code and AutoML
Model deployment and monitoring
Feature Store
Pipelines (Kubeflow)
Generative AI Studio#
Access to Gemini and PaLM model families
Prompt engineering tools
Fine-tuning capabilities
Cloud AI APIs#
Vision AI, Natural Language AI
Speech-to-Text, Text-to-Speech
Translation API
Pre-trained models ready to use
Storage Services#
Cloud Storage#
Object storage (equivalent to AWS S3)
Integration with ML services
Multiple storage classes
Persistent Disk#
Block storage for VMs
SSD and standard options
Database Services#
Cloud SQL#
Managed PostgreSQL, MySQL
High availability
Firestore, Bigtable#
NoSQL options
Real-time and scalable
Unique GCP Advantages#
TPUs (Tensor Processing Units)#
Google’s custom ML accelerators
Optimized for TensorFlow
Fast training for large models
BigQuery ML#
ML directly on data warehouse data
SQL-based ML models
Integrated analytics