Production AI Architecture at Scale
Reference architecture and operating model for production AI across a multi-cloud enterprise — shared entry points, cost attribution, onboarding standards, and EU AI Act evidence built into the delivery path.
AWS Summit Madrid 2025: Scalable AI Adoption at Iberdrola (speaker, ES)
Executive Outcome
- Teams onboarded through a standard path rather than building bespoke platform components, reducing duplication and time to first deployment.
- Model usage and cost attribution made visible across business units, enabling more informed capacity and investment decisions.
- Governed entry points became the default path for new GenAI and agentic workloads, reducing shadow AI and making security enforcement consistent without blocking delivery.
The engagement delivered a reference architecture and operating model for production AI in a federated multi-cloud enterprise: shared controls with distributed ownership.
- Shared platform architecture with explicit plane boundaries and entry points
- Cost attribution, observability, and onboarding standards across business units
- Governance and compliance evidence embedded into delivery gates with progressive adoption
Context
A European energy group had multiple teams running independent AI experiments across several cloud accounts and providers. Each team was solving the same platform problems independently: gateways, identity, logging, cost tracking. The result was duplicated infrastructure, no central visibility into model consumption, and no consistent way to enforce security or governance at scale. The challenge was not to centralize everything, but to define a shared architecture that teams could adopt progressively, with clear ownership boundaries and enough flexibility for local adaptation.
The Challenge
1. Teams reinventing gateways, identity, and observability independently across business units, duplicating effort and fragmenting standards.
2. No cost visibility or consumption attribution; model usage was invisible to finance and platform teams.
3. No consistent delivery standards, making security baselines and governance difficult to enforce without blocking teams.
4. Shadow AI risk increasing as teams accessed models through unmanaged paths with no central telemetry.
Approach
- Reference architecture with explicit plane boundaries separating platform concerns from application delivery: teams own their workloads, the platform owns shared controls.
- Standard entry points for model and tool interactions with shared routing, telemetry, and cost attribution (sketched after this list).
- Onboarding standards and ownership model so new teams adopt the platform through a repeatable path with clear expectations on both sides.
- Governance and compliance evidence embedded into release gate design: governance as part of the delivery architecture, not a separate review layer.
- Progressive adoption model: teams onboard at their own pace, with defaults and exception paths rather than mandates.
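To make the entry-point idea concrete, here is a minimal sketch of what a shared gateway could look like. Every name in it (ModelGateway, ModelRequest, the telemetry fields) is an illustrative assumption, not the platform's actual API; the point it demonstrates is that routing, telemetry, and cost attribution live in one governed path owned by the platform plane, rather than being re-implemented in each team's workload code.

```python
"""Illustrative sketch of a shared model entry point.

All class and field names here are hypothetical. A real deployment
would sit behind an API gateway and an identity provider; this shows
only the shape of the control point.
"""
from dataclasses import dataclass
import time
import uuid


@dataclass
class ModelRequest:
    business_unit: str  # cost attribution key, propagated end to end
    workload: str       # application identity, not a person
    model: str          # logical model name, resolved by the gateway
    prompt: str


@dataclass
class ModelResponse:
    trace_id: str
    output: str
    input_tokens: int
    output_tokens: int


class ModelGateway:
    """Single governed entry point: routing + telemetry + cost attribution.

    The platform plane owns this class; application teams only call
    `invoke`. Swapping providers or tightening controls never touches
    workload code.
    """

    def __init__(self, routes, telemetry_sink):
        self._routes = routes            # logical model -> provider callable
        self._telemetry = telemetry_sink # e.g. an OTLP exporter in production

    def invoke(self, request: ModelRequest) -> ModelResponse:
        trace_id = str(uuid.uuid4())
        provider = self._routes[request.model]  # central routing decision
        started = time.monotonic()
        output, in_tok, out_tok = provider(request.prompt)
        self._telemetry.append({
            "trace_id": trace_id,
            "business_unit": request.business_unit,  # who pays
            "workload": request.workload,            # who built it
            "model": request.model,
            "input_tokens": in_tok,
            "output_tokens": out_tok,
            "latency_ms": round((time.monotonic() - started) * 1000, 1),
        })
        return ModelResponse(trace_id, output, in_tok, out_tok)


# --- usage sketch -----------------------------------------------------------
def fake_provider(prompt: str):
    """Stand-in for a real model call; returns output and token counts."""
    return f"echo: {prompt}", len(prompt.split()), 2


telemetry: list[dict] = []
gateway = ModelGateway(routes={"default-chat": fake_provider},
                       telemetry_sink=telemetry)
resp = gateway.invoke(ModelRequest("retail", "billing-assistant",
                                   "default-chat", "hola"))
print(resp.trace_id, telemetry[0]["business_unit"])
```

Because every call carries a business unit and a workload identity, cost attribution and security telemetry fall out of the same record; that is what made consumption visible to finance without extra instrumentation in each team.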
Key Considerations
- Standardized access paths reduce autonomy in exchange for consistent governance and lower overhead.
- Centralizing model access introduces a platform dependency that requires clear service expectations and incident response.
- Mandatory onboarding adds early friction, but prevents teams from recreating platform controls independently.
- Progressive adoption means some teams operate on legacy patterns during the transition, which requires explicit migration timelines and governed exceptions (sketched below).
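As an illustration of what "defaults with exception paths" can mean in practice, the sketch below models an onboarding record. The field names and the compliance rule are assumptions made for this page, not the actual standard; what it shows is exceptions treated as first-class data that is documented and time-boxed.

```python
"""Illustrative onboarding record: defaults, exceptions, migration deadlines.

All field names and the policy below are hypothetical; they show the shape
of 'progressive adoption with governed exceptions', not the real standard.
"""
from dataclasses import dataclass
from datetime import date


@dataclass
class Onboarding:
    team: str
    entry_point: str = "shared-gateway"  # the default path
    exception: str | None = None         # documented reason, if any
    migrate_by: date | None = None       # legacy patterns need an end date

    def compliant(self, today: date) -> bool:
        """Compliant on the default path, or on a documented,
        time-boxed exception that has not yet expired."""
        if self.entry_point == "shared-gateway":
            return True
        return self.exception is not None and (
            self.migrate_by is not None and today <= self.migrate_by
        )


teams = [
    Onboarding("grid-analytics"),                         # default path
    Onboarding("trading", "direct-provider",
               "latency SLO", date(2025, 9, 1)),          # governed exception
    Onboarding("legacy-chatbot", "direct-provider"),      # undocumented: flagged
]
for t in teams:
    status = "ok" if t.compliant(date(2025, 6, 1)) else "needs migration plan"
    print(t.team, status)
```

The design point: exceptions are visible, reasoned, and expire on a date, rather than living on as unmanaged workarounds.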
Alternatives Considered
- Library-only approach: rejected because it fails to provide central enforcement or visibility into model consumption.
- Single-vendor strategy: rejected because it creates lock-in and limits control over identity and access patterns.
- Central mandate without adoption support: rejected because it generates resistance and shadow workarounds in a federated organization.
Validation
- New teams onboard through the standard path without bespoke platform intervention.
- Telemetry captures interaction traces consistently for security and cost attribution across business units.
- Ownership boundaries are reflected in delivery standards and review checkpoints.
- EU AI Act evidence is generated as part of the standard release gate process (a sketch of such an evidence record follows below).
- Exception paths are documented and governed for teams with legitimate local adaptation needs.
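To show how compliance evidence can be produced by the delivery path itself rather than assembled later, here is a minimal sketch of a release gate that emits an evidence record as it runs. The check names, record fields, and risk tiers are assumptions made for illustration; the EU AI Act defines obligations (for example, technical documentation and logging for high-risk systems) but not this schema.

```python
"""Illustrative release gate that emits an evidence record as it runs.

The checks, field names, and risk tiers are hypothetical; the EU AI Act
defines obligations, not this schema.
"""
import json
from datetime import datetime, timezone


def release_gate(workload: str, risk_tier: str,
                 checks: dict[str, bool]) -> dict:
    """Run the gate and build the evidence record in one step, so passing
    the gate and producing the audit trail are the same action."""
    return {
        "workload": workload,
        "risk_tier": risk_tier,          # e.g. from an internal AI Act triage
        "checks": checks,                # each check: name -> passed?
        "passed": all(checks.values()),
        "evaluated_at": datetime.now(timezone.utc).isoformat(),
    }


evidence = release_gate(
    workload="billing-assistant",
    risk_tier="limited",
    checks={
        "model_card_attached": True,      # technical documentation exists
        "telemetry_enabled": True,        # interaction logging via the gateway
        "human_oversight_defined": True,  # documented escalation path
    },
)
print(json.dumps(evidence, indent=2))  # stored alongside the release artifact
```

Because the record is produced by the same pipeline that ships the workload, audits read from the delivery path instead of triggering a separate review exercise, which is what "evidence built into the delivery path" means above.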