Budgets as code with Flyte, OpenLineage, and Marquez

Authors:

Ashok Prakash, Staff Engineer at Apple
Isan Sahoo, Principal Engineer at Oracle

In multi-tenant GPU clusters, keeping cloud costs under control is a persistent challenge. Traditional retrospective chargeback reports spend only after it has been incurred. As a result, it rarely prevents budget violations and leaves teams trying to control spend after the fact. To solve this, we’re introducing Budgets as Code, an approach that turns spend and capacity limits into versioned policy artifacts evaluated at the moment of workload admission. Enforcing limits at admission time, before resources are allocated, gives platform teams precise control over scope, time windows, and execution metadata for every submitted job.

This article focuses on LF AI & Data projects used as control and evidence components. Flyte provides a workflow execution control plane for ML and data pipelines. OpenLineage provides an event model for emitting run-level metadata and lineage context. Marquez stores and serves OpenLineage events for query and visualization. Budget enforcement is integrated by attaching policy evaluation to workflow submission and by emitting policy decisions as OpenLineage facets persisted in Marquez.

Policy semantics and budget state

A budget policy is stored in Git and applied through a GitOps mechanism. Its specification consists of a scope selector, a time window, a limit, and an enforcement mode. Scope is expressed as a Kubernetes namespace plus required label predicates such as cost-center, team, and environment. Because the same keys are needed to join admission decisions to usage metering, missing attribution is treated as a policy violation.
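Here is a minimal sketch of scope evaluation. It assumes a policy object with a namespace, required label keys, and exact-match label predicates. The `BudgetPolicy` class, the `classify` helper, the status strings, and the limit value are hypothetical and only illustrate the idea. A real deployment would load the policy from the Git-managed artifact. The sketch checks attribution before selector matching. A workload in the policy's namespace that lacks a required label is therefore reported as a violation rather than silently falling out of scope.

```python
from __future__ import annotations

from dataclasses import dataclass, field


# Hypothetical policy shape: field names and mode values are illustrative,
# not a published schema.
@dataclass
class BudgetPolicy:
    policy_id: str
    namespace: str
    required_labels: tuple[str, ...]
    label_selector: dict[str, str] = field(default_factory=dict)  # exact-match predicates
    limit_gpu_hours: float = 0.0
    window: str = "calendar-month"
    mode: str = "deny"  # hypothetical values: "deny" or "delay"


def classify(policy: BudgetPolicy, namespace: str, labels: dict[str, str]) -> tuple[str, list[str]]:
    """Return (status, missing_labels) for a submitted workload."""
    if namespace != policy.namespace:
        return "OUT_OF_SCOPE", []
    # A workload without the required keys cannot be joined to metering,
    # so missing attribution is a violation, not an out-of-scope result.
    missing = sorted(k for k in policy.required_labels if not labels.get(k))
    if missing:
        return "VIOLATION", missing
    if any(labels.get(k) != v for k, v in policy.label_selector.items()):
        return "OUT_OF_SCOPE", []
    return "IN_SCOPE", []


policy = BudgetPolicy(
    policy_id="gpu-budget/team-vision-prod",
    namespace="vision-prod",
    required_labels=("cost-center", "team", "environment"),
    label_selector={"cost-center": "1042"},
    limit_gpu_hours=5000.0,  # placeholder limit
)
print(classify(policy, "vision-prod", {"team": "vision"}))
# ('VIOLATION', ['cost-center', 'environment'])
```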

GPU-hours are the primary unit of consumption, with gpu_hours = gpu_count × runtime_hours. This formula is evaluated on projected runtime at admission and on realized runtime during reconciliation. For example, 8 GPUs requested for 6 hours correspond to 48 GPU-hours. Cost is estimated as showback_cost = gpu_hours × rate_per_gpu_hour, where rate_per_gpu_hour is a configuration mapping keyed by resource flavor and, where required, by region and epoch.
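The two formulas translate directly into code. This sketch assumes a rate table keyed by (flavor, region, epoch), in which `None` acts as a wildcard. A lookup therefore falls back from the most specific key to a flavor-only default. The table layout, the `h100` flavor key, and the prices are placeholders, not published rates.

```python
from __future__ import annotations

# Hypothetical rate table keyed by (flavor, region, epoch); None acts as a
# wildcard. The prices are placeholders, not published rates.
RATES: dict[tuple[str, str | None, str | None], float] = {
    ("h100", "us-east", "2026-01"): 4.00,
    ("h100", None, None): 3.50,
}


def gpu_hours(gpu_count: int, runtime_hours: float) -> float:
    return gpu_count * runtime_hours


def rate_per_gpu_hour(flavor: str, region: str | None = None, epoch: str | None = None) -> float:
    # Try the most specific key first, then fall back to coarser ones.
    for key in ((flavor, region, epoch), (flavor, region, None), (flavor, None, None)):
        if key in RATES:
            return RATES[key]
    raise KeyError(f"no rate configured for flavor {flavor!r}")


def showback_cost(gpu_count: int, runtime_hours: float, flavor: str,
                  region: str | None = None, epoch: str | None = None) -> float:
    return gpu_hours(gpu_count, runtime_hours) * rate_per_gpu_hour(flavor, region, epoch)


print(gpu_hours(8, 6))                                    # 48
print(showback_cost(8, 6, "h100", "us-east", "2026-01"))  # 192.0
```

Failing loudly on an unknown flavor, instead of defaulting to zero, keeps showback from silently under-reporting new hardware.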

Budget ledger

Budget state is represented as a ledger keyed by (scope_id, window_start, window_end). The ledger stores reservations applied at admission and actuals derived from metering. Reservations are computed from projected GPU-hours and applied atomically with admission to prevent concurrent submissions from consuming the same remaining balance. Reconciliation replaces reservations with actuals at completion and updates long-running workloads using periodic metering snapshots.
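The in-memory sketch below illustrates these ledger semantics for a single process guarded by one lock. A production ledger would need the same check-and-reserve step inside a database transaction or an equivalent atomic primitive. The class and method names are hypothetical. The charging rule is also an assumption, one way of keeping long-running workloads conservative, not a rule stated in this article. While a run is active, it is charged the larger of its reservation and its latest metered snapshot. At completion, only realized usage counts.

```python
from __future__ import annotations

import threading
from dataclasses import dataclass, field


@dataclass
class RunCharge:
    projected: float      # reservation taken at admission
    actual: float = 0.0   # latest metered GPU-hours
    final: bool = False   # True once the run has completed

    def charge(self) -> float:
        # While running, hold the larger of reservation and metered usage;
        # after completion, only realized usage counts.
        return self.actual if self.final else max(self.projected, self.actual)


@dataclass
class Window:
    limit_gpu_hours: float
    runs: dict = field(default_factory=dict)  # run_id -> RunCharge

    def remaining(self) -> float:
        return self.limit_gpu_hours - sum(r.charge() for r in self.runs.values())


class BudgetLedger:
    """In-memory sketch; a production ledger needs a transactional store."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._windows: dict = {}  # (scope_id, window_start, window_end) -> Window

    def open_window(self, key, limit_gpu_hours: float) -> None:
        with self._lock:
            self._windows.setdefault(key, Window(limit_gpu_hours))

    def try_reserve(self, key, run_id: str, projected: float) -> tuple[bool, float]:
        # Check and reserve under one lock, so two concurrent admissions
        # cannot both spend the same remaining balance.
        with self._lock:
            window = self._windows[key]
            if run_id in window.runs:  # retry reusing the same stable run id
                return True, window.remaining()
            remaining = window.remaining()
            if projected > remaining:
                return False, remaining
            window.runs[run_id] = RunCharge(projected)
            return True, window.remaining()

    def reconcile(self, key, run_id: str, metered: float, final: bool) -> None:
        # Periodic snapshots pass final=False; completion passes final=True.
        with self._lock:
            charge = self._windows[key].runs[run_id]
            charge.actual = metered
            charge.final = final

    def remaining(self, key) -> float:
        with self._lock:
            return self._windows[key].remaining()
```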

Enforcement using Flyte

Flyte workflow and task specifications include container image references, parameter bindings, and resource requests such as nvidia.com/gpu, cpu, and memory. A preflight stage evaluates the applicable policy version against a projected consumption derived from the resolved execution plan and the active budget window.
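A sketch of the preflight computation follows. It assumes the resolved plan has been flattened into tasks that each carry a GPU request and an upper bound on runtime. Taking that bound from a declared timeout is an assumption. `PlannedTask`, `preflight`, and the `"deny"`/`"delay"` mode strings are hypothetical names, not Flyte APIs.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PlannedTask:
    """One node of a resolved execution plan (hypothetical shape)."""
    name: str
    gpus: int                 # the task's nvidia.com/gpu request
    max_runtime_hours: float  # assumption: taken from the task's declared timeout


def projected_gpu_hours(plan: list[PlannedTask]) -> float:
    # GPU-hours add up across tasks whether they run in sequence or in
    # parallel, so the projection is a plain sum of gpus x runtime bound.
    return sum(t.gpus * t.max_runtime_hours for t in plan)


def preflight(plan: list[PlannedTask], remaining_gpu_hours: float,
              mode: str = "deny") -> tuple[str, float, float]:
    """Return (decision, projected_gpu_hours, deficit_gpu_hours)."""
    projected = projected_gpu_hours(plan)
    if projected <= remaining_gpu_hours:
        return "ALLOW", projected, 0.0
    deficit = projected - remaining_gpu_hours
    return ("DELAY" if mode == "delay" else "DENY"), projected, deficit


plan = [PlannedTask("train", gpus=8, max_runtime_hours=6.0),
        PlannedTask("evaluate", gpus=1, max_runtime_hours=0.5)]
print(preflight(plan, remaining_gpu_hours=40.0, mode="delay"))  # ('DELAY', 48.5, 8.5)
```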

Preflight decision record

The preflight outcome is recorded as a structured decision that includes policy identifier, policy Git SHA, scope_id, window boundaries, projected GPU-hours, remaining GPU-hours, and decision outcome. Attribution validation is part of enforcement. Required labels are validated before admission so that scope membership is well-defined. A stable run identifier is assigned at submission and reused across retries so that a preflight decision correlates with execution attempts and with lineage records.
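Below is a sketch of the decision record and a stable run identifier. It assumes the submission key is the Flyte execution identity (project, domain, execution name). Deriving a name-based UUID (version 5) from that key is one way to get an identifier that stays the same across retries. The namespace URL, the helper names, and the placeholder values are hypothetical.

```python
import json
import uuid
from dataclasses import asdict, dataclass

# Hypothetical fixed namespace for deriving run ids; any constant UUID works.
RUN_ID_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.invalid/budgets-as-code")


def stable_run_id(project: str, domain: str, execution_name: str) -> str:
    # Deterministic UUIDv5: the same submission key always maps to the same
    # run id, so retries correlate with one preflight decision.
    return str(uuid.uuid5(RUN_ID_NAMESPACE, f"{project}/{domain}/{execution_name}"))


@dataclass(frozen=True)
class PreflightDecision:
    run_id: str
    policy_id: str
    policy_git_sha: str
    scope_id: str
    window_start: str
    window_end: str
    projected_gpu_hours: float
    remaining_gpu_hours: float
    decision: str

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


record = PreflightDecision(
    run_id=stable_run_id("vision", "production", "train-2026-01-15"),
    policy_id="gpu-budget/team-vision-prod",
    policy_git_sha="e3b0c44",
    scope_id="ns=vision-prod;cc=1042",
    window_start="2026-01-01T00:00:00Z",
    window_end="2026-02-01T00:00:00Z",
    projected_gpu_hours=48.0,
    remaining_gpu_hours=952.0,  # placeholder balance
    decision="ALLOW",
)
print(record.to_json())
```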

Evidence using OpenLineage and Marquez

OpenLineage represents runs, jobs, and datasets and supports extensibility through facets. Budget enforcement is represented as a run facet that captures evaluation inputs and outcomes for admitted, denied, and delayed attempts. Events are emitted at evaluation time and at completion time to record reconciliation outcomes.
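The following sketch builds an OpenLineage run event carrying a custom `budget` run facet as a plain dictionary, following the RunEvent structure. Custom facets carry `_producer` and `_schemaURL` fields. The producer and facet schema URLs below are placeholders. The RunEvent schema URL should be checked against the spec version in use. The mapping from budget decision to `eventType` is an assumption, not something the OpenLineage spec prescribes.

```python
import json
import uuid
from datetime import datetime, timezone

# Placeholders: use your own producer URI and a schema URL you publish for the facet.
PRODUCER = "https://example.invalid/budget-preflight"
BUDGET_FACET_SCHEMA = "https://example.invalid/schemas/BudgetRunFacet.json"
# Assumed to match the OpenLineage spec version in use; verify before relying on it.
RUN_EVENT_SCHEMA = "https://openlineage.io/spec/2-0-2/OpenLineage.json#/$defs/RunEvent"

# Assumed mapping from budget decision to OpenLineage eventType.
EVENT_TYPE = {"ALLOW": "START", "DENY": "ABORT", "DELAY": "OTHER"}


def budget_run_event(run_id, job_namespace, job_name, budget, event_type=None, now=None):
    """Build an OpenLineage RunEvent dict carrying a custom 'budget' run facet."""
    facet = {"_producer": PRODUCER, "_schemaURL": BUDGET_FACET_SCHEMA, **budget}
    return {
        "eventType": event_type or EVENT_TYPE[budget["decision"]],
        "eventTime": (now or datetime.now(timezone.utc)).isoformat(),
        "producer": PRODUCER,
        "schemaURL": RUN_EVENT_SCHEMA,
        "run": {"runId": run_id, "facets": {"budget": facet}},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [],
        "outputs": [],
    }


event = budget_run_event(
    run_id=str(uuid.uuid4()),
    job_namespace="vision-prod",
    job_name="train_pipeline",
    budget={"policy_id": "gpu-budget/team-vision-prod", "decision": "ALLOW",
            "projected_gpu_hours": 48.0, "admitted_gpu_hours": 48.0},
)
print(json.dumps(event, indent=2))
```

The resulting JSON could be posted to Marquez's lineage endpoint (`POST /api/v1/lineage`) or sent through an OpenLineage client transport. At completion, the same run id would be reused with `event_type="COMPLETE"` and reconciled values.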

Budget facet schema

The facet schema includes policy_id, policy_git_sha, scope_id, window_start, window_end, projected_gpu_hours, admitted_gpu_hours, and decision. For deny and delay outcomes, the facet also includes deficit_gpu_hours and next_eligible_time. A compact example follows, with identifiers and timestamps presented as placeholders.

```json
{
  "budget": {
    "policy_id": "gpu-budget/team-vision-prod",
    "policy_git_sha": "e3b0c44",
    "scope_id": "ns=vision-prod;cc=1042",
    "window_start": "2026-01-01T00:00:00Z",
    "window_end": "2026-02-01T00:00:00Z",
    "projected_gpu_hours": 48.0,
    "admitted_gpu_hours": 48.0,
    "decision": "ALLOW"
  }
}
```
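Deny and delay outcomes carry extra fields, so a producer-side check can catch incomplete facets before emission. This sketch assumes the decision strings `ALLOW`, `DENY`, and `DELAY`, of which only `ALLOW` appears in the example facet. The `validate_budget_facet` helper is a hypothetical name.

```python
REQUIRED_FIELDS = ("policy_id", "policy_git_sha", "scope_id", "window_start", "window_end",
                   "projected_gpu_hours", "admitted_gpu_hours", "decision")
DENY_DELAY_FIELDS = ("deficit_gpu_hours", "next_eligible_time")
DECISIONS = {"ALLOW", "DENY", "DELAY"}  # assumed spellings


def validate_budget_facet(facet: dict) -> list:
    """Return a list of problems; an empty list means the facet is well-formed."""
    problems = [f"missing {k}" for k in REQUIRED_FIELDS if k not in facet]
    decision = facet.get("decision")
    if decision not in DECISIONS:
        problems.append(f"unknown decision {decision!r}")
    if decision in ("DENY", "DELAY"):
        problems += [f"missing {k}" for k in DENY_DELAY_FIELDS if k not in facet]
    return problems


example = {
    "policy_id": "gpu-budget/team-vision-prod", "policy_git_sha": "e3b0c44",
    "scope_id": "ns=vision-prod;cc=1042",
    "window_start": "2026-01-01T00:00:00Z", "window_end": "2026-02-01T00:00:00Z",
    "projected_gpu_hours": 48.0, "admitted_gpu_hours": 48.0, "decision": "ALLOW",
}
print(validate_budget_facet(example))  # []
delayed = {**example, "decision": "DELAY", "admitted_gpu_hours": 0.0}
print(validate_budget_facet(delayed))  # ['missing deficit_gpu_hours', 'missing next_eligible_time']
```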

Marquez as query and audit store

Marquez ingests OpenLineage events and maintains an immutable history of runs and their facets. Budget analysis is expressed as queries over runs grouped by scope and window, including denial rate, reservation-to-actual error, and window exhaustion time derived from reconciled admitted_gpu_hours. Because the policy Git SHA is recorded, changes in enforcement behavior are attributable to explicit policy revisions.
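The three analyses can be sketched over run records that have already been exported from Marquez. Retrieving the records is not shown. The sketch assumes each record is a flattened budget facet with `admitted_gpu_hours` already reconciled to metered usage. It also assumes a hypothetical `completed_at` field (ISO-8601 UTC, so string order is time order) and a per-scope limit table. Reservation-to-actual error is defined here as relative over-reservation, which is one reasonable choice among several.

```python
from collections import defaultdict


def summarize(runs: list, limits: dict) -> dict:
    """Per (scope_id, window_start): denial rate, mean reservation error, exhaustion time."""
    by_window = defaultdict(list)
    for r in runs:
        by_window[(r["scope_id"], r["window_start"])].append(r)

    report = {}
    for key, rs in by_window.items():
        allowed = [r for r in rs if r["decision"] == "ALLOW"]
        # Relative over-reservation: positive means the projection was conservative.
        errors = [(r["projected_gpu_hours"] - r["admitted_gpu_hours"]) / r["projected_gpu_hours"]
                  for r in allowed if r["projected_gpu_hours"] > 0]
        # Window exhaustion: first completion at which cumulative usage reaches the limit.
        used, exhausted_at = 0.0, None
        for r in sorted(allowed, key=lambda r: r["completed_at"]):
            used += r["admitted_gpu_hours"]
            if used >= limits[key[0]]:
                exhausted_at = r["completed_at"]
                break
        report[key] = {
            "denial_rate": sum(r["decision"] == "DENY" for r in rs) / len(rs),
            "mean_reservation_error": sum(errors) / len(errors) if errors else None,
            "exhausted_at": exhausted_at,
        }
    return report
```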

Reconciliation and fault boundaries

Budget correctness depends on deterministic inputs and explicit handling of telemetry lag. Reservations provide a conservative upper bound on consumption at admission, while reconciliation incorporates realized runtime from metering. The decision record is authoritative for enforcement, and the lineage record is authoritative for audit and analysis. Metering outages and ledger unavailability define the fault boundaries for enforcement. Admission requires ledger availability and an atomic reservation. Evidence emission, by contrast, is retried independently of admission and never alters an admission outcome.
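The fault boundary can be sketched as two separate paths. The sketch assumes a ledger client that raises a hypothetical `LedgerUnavailable` error. Admission fails closed whenever no atomic reservation can be made. Lineage events go into an outbox, and their delivery failures never feed back into the decision. Under these assumptions, a metering outage needs no special handling in this code: reconciliation pauses and reservations remain as the conservative bound. `admit` and `EvidenceOutbox` are illustrative names.

```python
from collections import deque


class LedgerUnavailable(Exception):
    """Hypothetical error raised by a ledger client that cannot be reached."""


def admit(try_reserve, key, run_id, projected_gpu_hours):
    """Return (decision, reason). Fails closed when the ledger is unavailable."""
    try:
        ok, _remaining = try_reserve(key, run_id, projected_gpu_hours)
    except LedgerUnavailable:
        # No atomic reservation is possible, so the run is not admitted.
        return "DENY", "ledger_unavailable"
    return ("ALLOW", None) if ok else ("DENY", "insufficient_budget")


class EvidenceOutbox:
    """Queues lineage events and retries delivery; never changes a decision."""

    def __init__(self, send, max_attempts=5):
        self._send = send
        self._max_attempts = max_attempts
        self._pending = deque()  # (event, failed_attempts)
        self.dead_letter = []    # events that exhausted their retries

    def enqueue(self, event):
        self._pending.append((event, 0))

    def flush(self):
        """Try each pending event once; the caller schedules flushes (e.g. with backoff)."""
        retry = deque()
        while self._pending:
            event, attempts = self._pending.popleft()
            try:
                self._send(event)
            except Exception:  # any transport error counts as a failed attempt
                attempts += 1
                if attempts < self._max_attempts:
                    retry.append((event, attempts))
                else:
                    self.dead_letter.append(event)
        self._pending = retry

    def pending(self):
        return len(self._pending)
```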

Conclusion

Budgets as code on GPU clusters is defined by admission-time evaluation, Git-versioned policy, and decision evidence that supports reconstruction. Flyte provides a workflow control plane for evaluating projected consumption and attribution before GPU allocation. OpenLineage provides a standard mechanism for encoding budget decisions as run facets. Marquez stores those events for query and audit, linking policy revisions to enforcement outcomes through recorded policy identifiers and Git SHAs.

The operational context for admission-time budget enforcement is documented in publicly reported adoption and cost baselines. CNCF reported that cloud native adoption reached 89% among surveyed organizations in 202. CNCF reporting published on January 20, 2026 stated that production usage of Kubernetes reached 82% among container users and that 66% of AI adopters use Kubernetes to scale inference workloads. Flexera reported in its 2025 State of the Cloud survey results that 84% of respondents identified managing cloud spend as the top cloud challenge.

To ground the budget math in a concrete reference point, one publicly cited example of accelerator capacity pricing gives an effective hourly rate of $31.464 for a high-end instance with 8 NVIDIA H100 GPUs. Under this public baseline, the implied rate is $31.464 / 8 = $3.933 per H100 GPU-hour. At that rate, a workload using 8 GPUs for 6 hours (48 GPU-hours) corresponds to about $188.78 of reserved instance capacity.
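The arithmetic can be checked with exact decimal math, which avoids binary floating-point rounding in currency values. The inputs are the figures quoted in the previous paragraph, and the variable names are illustrative.

```python
from decimal import ROUND_HALF_UP, Decimal

instance_rate = Decimal("31.464")  # effective $/hour for an 8x H100 instance (cited figure)
gpus_per_instance = 8

per_gpu_hour = instance_rate / gpus_per_instance  # implied $/GPU-hour
gpu_hours = 8 * 6                                 # 8 GPUs for 6 hours
cost = (gpu_hours * per_gpu_hour).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

print(per_gpu_hour, gpu_hours, cost)  # 3.933 48 188.78
```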

Author

LF AI & Data
