Self-Serve Data Platform Architecture for Data Mesh Data Products
A reference architecture blending internal developer platform (IDP) principles with data mesh paradigms to deliver scalable, governed, and self-serve data infrastructure.
Assumptions
Because target scale, cloud provider, and budget were not specified, the recommendations below are designed to be portable across cloud and on‑prem, with “swap‑in” choices for managed vs. self‑hosted components.
- Baseline scenario: A mid-to-large enterprise data mesh: dozens of domains, hundreds of data products, mixed batch + streaming workloads, and a need for strong governance and discoverability without re‑centralizing delivery. This aligns to the “self‑serve data infrastructure as a platform” pillar described in mainstream data mesh guidance. [1]
- Security assumptions: (a) centralized identity via OIDC/OAuth2, (b) least-privilege authorization at every layer, and (c) separation of control plane (platform APIs, policy, metadata) from data plane (compute/storage/stream). OIDC is explicitly defined as an authentication layer built on top of OAuth 2.0. [2]
- Operational assumptions: Platform capabilities must be consumable in “X‑as‑a‑Service” mode (self‑serve), and the internal platform (IDP) should be treated as a product with curated experiences—consistent with platform engineering guidance. [3]
Executive Summary
A self-serve data platform for data mesh succeeds when it behaves like an internal developer platform: it offers a small number of paved paths that domains can adopt quickly (templates + golden paths), while the platform team centrally manages cross-cutting concerns (identity, policy, observability, cost, metadata/lineage). [4]
Control Plane Capabilities
- A Data Product Registry API (resource-oriented) plus contract-first specs using OpenAPI for request/response APIs and AsyncAPI for event-driven interfaces. [5]
- A Data Contract standard (ODCS) stored alongside code and validated in CI/CD, enabling automated governance checks. [6]
- A Provisioning service that turns product manifests into infrastructure via IaC (Terraform/Pulumi) and GitOps workflows (e.g., Argo CD). [7]
Data Plane Capabilities
- Orchestration: Airflow and/or Dagster. Airflow for broad scheduling; Dagster for asset-centric, testable pipelines. [8]
- Streaming: Kafka or Pulsar. Pulsar has explicit multi-tenancy; Kafka relies on ACLs. [9]
- Lakehouse storage: Iceberg or Delta Lake. Both provide schema evolution and time travel/rollback primitives. [10]
- Compute: Spark (batch + micro-batch) and/or Flink (stateful streaming). [11]
Metadata, Lineage, & Observability
Pair a metadata platform (DataHub or Amundsen) with OpenLineage as the interoperability layer for run-level lineage events. [12]
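To make the interoperability point concrete, the sketch below builds a minimal OpenLineage-style run event as a plain dictionary. The field names follow the OpenLineage RunEvent shape (eventType, eventTime, run, job, outputs, producer); the `commerce`/`lakehouse` namespaces and producer URL are illustrative, not prescribed by the spec.

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(job_name: str, state: str, outputs: list) -> dict:
    """Build a minimal OpenLineage-style run event as a plain dict."""
    return {
        "eventType": state,  # START | COMPLETE | FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "commerce", "name": job_name},
        "outputs": [{"namespace": "lakehouse", "name": n} for n in outputs],
        "producer": "https://platform.example.com/orchestrator",
    }

event = make_run_event("orders_daily", "COMPLETE", ["commerce.orders"])
print(json.dumps(event, indent=2))
```

In practice the official OpenLineage client libraries emit these events for you; the value of the standard is that DataHub, Marquez, and other backends can all consume the same payload.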
Key Trade-off: Standardization vs. Autonomy
Too little standardization produces a “distributed swamp,” while too much creates bottlenecks. The practical solution is “federated guardrails”: enforce a small set of machine-checkable contracts while leaving implementation choice to domains where feasible. [13]
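"Machine-checkable" is the operative word: guardrails only scale when CI can evaluate them without human review. The sketch below shows one such check over a product manifest; the required fields, naming pattern, and PII rule are assumed policy choices for illustration, not part of any standard.

```python
import re

# Hypothetical guardrail set: fields, pattern, and the PII rule are
# illustrative policy choices, not part of any standard.
REQUIRED_FIELDS = {"name", "domain", "owner", "description"}
NAME_PATTERN = re.compile(r"^[a-z][a-z0-9_]{2,62}$")

def check_manifest(manifest: dict) -> list:
    """Return a list of guardrail violations; empty means the manifest passes."""
    errors = []
    missing = REQUIRED_FIELDS - manifest.keys()
    if missing:
        errors.append(f"missing required fields: {sorted(missing)}")
    name = manifest.get("name", "")
    if not NAME_PATTERN.match(name):
        errors.append(f"name {name!r} violates naming convention")
    if manifest.get("pii_level") == "restricted" and not manifest.get("owner"):
        errors.append("restricted products must declare an accountable owner")
    return errors

ok = check_manifest({"name": "orders", "domain": "commerce",
                     "owner": "grp-commerce-data", "description": "Order facts"})
print(ok)  # [] -> passes the guardrails
```

Everything else (storage layout, transformation framework, language) stays a domain choice, which is what keeps guardrails from becoming bottlenecks.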
Reference Architecture Overview
A data mesh-aligned platform should map naturally to the four widely cited principles: domain-oriented ownership, data as a product, self-serve platform, and federated computational governance. [14] The key move is to implement “self-serve” primarily as APIs + paved workflows, not a ticket queue.
Separating the control plane from the data plane is what makes self-service safe: the control plane exposes standardized interfaces, while the data plane scales independently with workload. [3]
Platform APIs for Data Products
This section focuses on how the platform exposes “data product as a first-class resource” using stable APIs, consistent auth, and machine-checkable contracts.
Recommended Patterns
- Contract-first specs: Use OpenAPI for synchronous APIs. Use AsyncAPI for event-driven interfaces. [18][19]
- Versioning: Adopt Semantic Versioning for data product interfaces. [20]
- AuthN/AuthZ: Use centralized identity via OpenID Connect (OIDC). Apply least privilege using RBAC at runtime layers. [2][22]
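A minimal sketch of the authorization step, assuming an API gateway has already validated the OIDC token's signature, issuer, and audience, so only the decoded claims need inspecting. The `dpr.write` scope mirrors the registry example below; the subject names are hypothetical.

```python
# Scope-based authorization after OIDC token validation (sketch).
# Assumes the JWT was already verified upstream; we inspect claims only.

def authorize(claims: dict, required_scope: str) -> bool:
    """Grant access only if the token carries the required OAuth scope."""
    granted = set(claims.get("scope", "").split())
    return required_scope in granted

claims = {"sub": "svc-orders-ci", "scope": "dpr.read dpr.write"}
allowed = authorize(claims, "dpr.write")
denied = authorize({"sub": "svc-readonly", "scope": "dpr.read"}, "dpr.write")
print(allowed, denied)  # True False
```

Finer-grained decisions (per-product, per-classification) belong in a policy engine behind this check, not in scope strings.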
Example API Contracts
Data Product Registry (OpenAPI)
```yaml
openapi: 3.1.0
info:
  title: Data Product Registry API
  version: 1.0.0
paths:
  /data-products:
    post:
      summary: Register a new data product
      security: [{ oidc: ["dpr.write"] }]
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/DataProductCreate" }
      responses:
        "201":
          description: Created
          headers:
            Location:
              schema: { type: string }
  /data-products/{productId}/versions:
    post:
      summary: Publish a new version of a data product
      security: [{ oidc: ["dpr.write"] }]
      parameters:
        - name: productId
          in: path
          required: true
          schema: { type: string }
        - name: Idempotency-Key
          in: header
          required: true
          schema: { type: string }
      responses:
        "202":
          description: Accepted for validation + provisioning
components:
  securitySchemes:
    oidc:
      type: openIdConnect
      openIdConnectUrl: https://idp.example.com/.well-known/openid-configuration
  schemas:
    DataProductCreate:
      type: object
      required: [name, domain, owner, description]
      properties:
        name: { type: string }
        domain: { type: string }
        owner: { type: string, description: "Group or team id" }
        description: { type: string }
```
Streaming Interface (AsyncAPI)
```yaml
asyncapi: 3.0.0
info:
  title: Orders Data Product Events
  version: 1.2.0
servers:
  prod:
    host: broker.prod.example:9092
    protocol: kafka
channels:
  orders.created:
    address: orders.created
    messages:
      OrdersCreated:
        payload:
          type: object
          required: [order_id, created_at, customer_id]
          properties:
            order_id: { type: string }
            created_at: { type: string, format: date-time }
            customer_id: { type: string }
operations:
  publishOrdersCreated:
    action: send
    channel:
      $ref: "#/channels/orders.created"
```
Data Contracts as "Product Constitution" (ODCS)
```yaml
apiVersion: v3.1.0
kind: DataContract
id: orders
name: Orders
version: 1.2.0
status: active
schema:
  - name: orders
    columns:
      - name: order_id
        type: string
        required: true
      - name: created_at
        type: timestamp
        required: true
quality:
  - type: freshness
    maxLagMinutes: 15
terms:
  - type: pii
    classification: restricted
```
Runbook: Introducing a breaking change to a data product interface
- Classify the change using SemVer (breaking → MAJOR). [20]
- Update OpenAPI/AsyncAPI/ODCS in the product repo; create a new version folder (e.g., contracts/2.0.0/).
- CI performs contract checks:
  - OpenAPI lint + backward-compatibility check.
  - AsyncAPI validation + schema registry compatibility check. [26]
  - ODCS validation against the standard definition. [27]
- Provision parallel interfaces where feasible (new topic or table namespace).
- Publish deprecation notice in catalog with timelines and consumer impact analysis using lineage. [28]
- Monitor adoption; enforce retirement by policy after the deprecation window.
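The backward-compatibility check in this runbook can be largely automated. Below is a sketch that diffs two contract versions using the ODCS-style schema/columns layout from the example contract; the rule set (dropped columns, type changes, new required columns) is a common baseline, not an exhaustive definition of "breaking."

```python
def breaking_changes(old: dict, new: dict) -> list:
    """Flag contract changes that break consumers: dropped columns,
    type changes, and newly added required columns."""
    def columns(contract):
        return {c["name"]: c
                for obj in contract.get("schema", [])
                for c in obj.get("columns", [])}
    old_cols, new_cols = columns(old), columns(new)
    issues = []
    for name, col in old_cols.items():
        if name not in new_cols:
            issues.append(f"column removed: {name}")
        elif new_cols[name].get("type") != col.get("type"):
            issues.append(f"type changed: {name}")
    for name, col in new_cols.items():
        if col.get("required") and name not in old_cols:
            issues.append(f"new required column: {name}")
    return issues

v1 = {"schema": [{"name": "orders", "columns": [
    {"name": "order_id", "type": "string", "required": True}]}]}
v2 = {"schema": [{"name": "orders", "columns": [
    {"name": "order_id", "type": "int", "required": True}]}]}
print(breaking_changes(v1, v2))  # ['type changed: order_id']
```

A non-empty result maps directly to the SemVer step: the pipeline can fail the build unless the new contract declares a MAJOR version bump.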
Runbook: API authentication incidents (token compromise)
- Revoke client credentials / rotate secrets in the central secrets system.
- Force re-issuance of access tokens; validate OIDC configuration and client registrations. [2]
- Audit platform API access logs; identify affected products and consumers.
- Post-incident: add policy requirements (shorter TTL, mandatory mTLS for service clients, tighter scopes).
Data Product Provisioning
To keep domains autonomous while preventing “snowflake infrastructure,” provisioning should be template-driven, policy-guarded, and declarative.
Example Backstage Scaffolder Template
```yaml
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: data-product-iceberg-batch
spec:
  owner: platform-team
  type: data-product
  parameters:
    - title: Data Product Info
      required: [name, domain, owner]
      properties:
        name: { type: string }
        domain: { type: string }
        owner: { type: string }
  steps:
    - id: fetch-base
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          domain: ${{ parameters.domain }}
          owner: ${{ parameters.owner }}
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?repo=${{ parameters.name }}&owner=${{ parameters.owner }}
```
Runbook: Onboarding a new domain team
- Create domain identity groups in the IdP (OIDC provider) and map claims. [2]
- Allocate Kubernetes namespaces with namespace-scoped RBAC roles and quotas. [35]
- Allocate streaming tenancy (Pulsar tenant/namespace or Kafka ACLs/naming conventions). [36][37]
- Enable catalog write permissions for the domain (limited to its own products).
- Provide domain with Backstage template access and a standard “first product” walkthrough.
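The namespace-allocation step can be expressed as a declarative manifest the provisioning service reconciles. A sketch using standard Kubernetes RBAC objects; the namespace, group name, and resource list are illustrative choices, with the group name expected to match the IdP claim mapping from the first step.

```yaml
# Namespace-scoped RBAC for a newly onboarded domain (illustrative values).
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: commerce-data
  name: data-product-operator
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "jobs", "cronjobs", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: commerce-data
  name: commerce-operators
subjects:
  - kind: Group
    name: grp-commerce-data   # mapped from the IdP group claim
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: data-product-operator
  apiGroup: rbac.authorization.k8s.io
```

Because the Role is namespace-scoped, the domain can operate freely inside its own boundary without any cluster-wide privileges.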
Runbook: Retiring a product
- Mark as Deprecated in registry; include sunset date.
- Use lineage to identify downstream dependencies. [28]
- Remove new access grants; maintain existing access until sunset.
- Archive runtime deployments (DAGs/jobs) and freeze tables/topics.
- Delete or cold-store data per retention policy; remove infra via IaC destroy/reconcile.
Infrastructure-as-Code for Pipelines
Terraform Module Interface (Pseudo-code)
```hcl
module "data_product_orders" {
  source      = "git::ssh://git.example.com/platform/iac-modules.git//data-product"
  product_id  = "orders"
  domain      = "commerce"
  environment = "prod"

  # Streaming interface
  kafka_topics = [
    { name = "orders.created.v1", partitions = 24, retention_hours = 168 }
  ]

  # Lakehouse interface
  lake_tables = [
    { name = "commerce.orders", format = "iceberg", partitioning = ["days(created_at)"] }
  ]

  # Access policy hooks
  owner_group = "grp-commerce-data"
  pii_level   = "restricted"
}
```
Pipeline Definitions
Airflow DAG Pattern
```python
# Sketch: ExtractFromKafka, SparkSubmit, and WriteIcebergTable stand in for
# the platform's vetted operator library (not stock Airflow operators).
with DAG("orders_daily", schedule="@daily",
         start_date=datetime(2024, 1, 1), catchup=False) as dag:
    extract = ExtractFromKafka(task_id="extract", topic="orders.created.v1")
    transform = SparkSubmit(task_id="transform", job="orders_transform.py")
    load = WriteIcebergTable(task_id="load", table="commerce.orders")
    extract >> transform >> load
```
Dagster Asset Pattern
```python
from dagster import asset

@asset
def orders_raw(kafka_resource):
    return kafka_resource.read("orders.created.v1")

@asset
def orders_iceberg(orders_raw, iceberg_io):
    # Dependency on orders_raw is inferred from the parameter name.
    iceberg_io.write("commerce.orders", orders_raw)
```
Internal Developer Platform Capabilities
A self-serve data platform becomes sustainable when it is delivered as an internal developer platform (IDP): portal UX + self-serve actions + guardrails + observability and cost controls.
- Developer portal: Backstage’s software templates scaffold components; catalog processes entities. [59]
- RBAC enforcement: Close to execution surfaces (Kubernetes RBAC, Kafka ACLs). [35][37]
- Observability: OpenTelemetry for traces, metrics, and logs. [60]
- Cost controls: Showback/chargeback via namespace/label attribution (e.g., Kubecost). [61]
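The showback model in the last bullet reduces to a simple aggregation once usage is labeled by domain; tools like Kubecost automate the collection, but the attribution logic looks roughly like this sketch. The usage records, label scheme, and blended CPU rate are all assumed for illustration.

```python
from collections import defaultdict

RATE_PER_CPU_HOUR = 0.04  # assumed blended rate, illustrative only

def showback(usage_records: list) -> dict:
    """Sum cpu-hour cost per "domain" label across usage records."""
    costs = defaultdict(float)
    for rec in usage_records:
        costs[rec["labels"]["domain"]] += rec["cpu_hours"] * RATE_PER_CPU_HOUR
    return dict(costs)

usage = [
    {"labels": {"domain": "commerce"}, "cpu_hours": 100.0},
    {"labels": {"domain": "commerce"}, "cpu_hours": 50.0},
    {"labels": {"domain": "logistics"}, "cpu_hours": 20.0},
]
print(showback(usage))  # {'commerce': 6.0, 'logistics': 0.8}
```

The hard part in practice is enforcing the labeling convention at provisioning time, which is exactly what template-driven scaffolding buys you.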
Runbook: Granting consumer access to a restricted data product
- Consumer requests access in the portal.
- Policy engine evaluates identity claims, product classification, and approvals.
- If approved, provisioning reconciles access controls (Kafka ACLs, Lakehouse RBAC).
- Catalog updates to reflect new access paths.
- Audit logs captured and retained.
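The policy-evaluation step above can be sketched as a pure decision function over identity claims and product classification. The rule table here is an assumed example (a "pii-cleared" group plus an owner approval for restricted products); production platforms typically express such rules in a policy language like Rego rather than application code.

```python
# Hypothetical access-decision rules; classification and group names are
# illustrative, not drawn from any standard.

def evaluate_access(claims: dict, product: dict, approved_by_owner: bool) -> bool:
    """Restricted products need an owner approval on record plus a matching
    clearance claim; internal products need only an authenticated caller."""
    classification = product["classification"]
    if classification == "internal":
        return "groups" in claims
    if classification == "restricted":
        return approved_by_owner and "pii-cleared" in claims.get("groups", [])
    return False  # default-deny for unknown classifications

claims = {"sub": "analyst-1", "groups": ["grp-analytics", "pii-cleared"]}
granted = evaluate_access(claims, {"classification": "restricted"},
                          approved_by_owner=True)
print(granted)  # True
```

Keeping the function pure (no side effects) makes the decision auditable: the same inputs logged with the outcome let you replay any grant later.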
Integration & Reference Stack Options
| Layer | Option | Maturity Signal | Scalability Posture | Cost Posture (Ops) |
|---|---|---|---|---|
| Orchestration | Airflow | Stable REST API, broad adoption | Scales well; ops rise with DB tuning | Medium–High |
| Orchestration | Dagster | Asset-centric, lineage emphasis | Scales well with asset modeling | Medium |
| Streaming | Kafka | Strong security primitives (ACLs) | High throughput | Medium–High |
| Streaming | Pulsar | Explicit multi-tenancy model | Designed for geo patterns | Medium–High |
| Lakehouse | Iceberg | Schema evolution, time travel | Strong for very large tables | Medium |
| Lakehouse | Delta Lake | ACID transactions, scalable metadata | Strong at scale (Spark-centric) | Low–Medium |
| Metadata | DataHub | GraphQL APIs, lineage APIs | Scales as central metadata service | Medium |
| IDP Portal | Backstage | Templates scaffold + catalog | Scales as portal | Medium |
Closing Perspective
A self-serve data platform for data mesh is less about picking “the right tools” and more about achieving a consistent product operating model: machine-readable contracts, automated checks, declarative provisioning, standardized telemetry/lineage, and a portal that makes the paved path the easiest path. The cited ecosystem standards (OpenAPI, AsyncAPI, OpenLineage) and platform primitives (Terraform, Argo CD, Kubernetes RBAC, DataHub) exist precisely because scale demands automation and interoperability, and because manual governance does not scale. [84]
Selected References
- [1] Data Mesh Principles and Logical Architecture - Martin Fowler
- [3] CNCF Platforms White Paper | CNCF TAG App Delivery
- [5] OpenAPI Specification v3.1.0
- [6] Open Data Contract Standard (ODCS)
- [28] OpenLineage
- [29] Backstage Software Templates
- [30] Argo CD - Declarative GitOps CD
- [62] DataHub Components