Self-Serve Data Platform Architecture for Data Mesh Data Products

A reference architecture blending internal developer platform (IDP) principles with data mesh paradigms to deliver scalable, governed, and self-serve data infrastructure.

Assumptions

Because target scale, cloud provider, and budget were not specified, the recommendations below are designed to be portable across cloud and on‑prem, with “swap‑in” choices for managed vs. self‑hosted components.

  • Baseline scenario: A mid-to-large enterprise data mesh: dozens of domains, hundreds of data products, mixed batch + streaming workloads, and a need for strong governance and discoverability without re‑centralizing delivery. This aligns to the “self‑serve data infrastructure as a platform” pillar described in mainstream data mesh guidance. [1]
  • Security assumptions: (a) centralized identity via OIDC/OAuth2, (b) least-privilege authorization at every layer, and (c) separation of control plane (platform APIs, policy, metadata) from data plane (compute/storage/stream). OIDC is explicitly defined as an authentication layer built on top of OAuth 2.0. [2]
  • Operational assumptions: Platform capabilities must be consumable in “X‑as‑a‑Service” mode (self‑serve), and the internal platform (IDP) should be treated as a product with curated experiences—consistent with platform engineering guidance. [3]

Executive Summary

A self-serve data platform for data mesh succeeds when it behaves like an internal developer platform: it offers a small number of paved paths that domains can adopt quickly (templates + golden paths), while the platform team centrally manages cross-cutting concerns (identity, policy, observability, cost, metadata/lineage). [4]

Control Plane Capabilities

  • A Data Product Registry API (resource-oriented) plus contract-first specs using OpenAPI for request/response APIs and AsyncAPI for event-driven interfaces. [5]
  • A Data Contract standard (ODCS) stored alongside code and validated in CI/CD, enabling automated governance checks. [6]
  • A Provisioning service that turns product manifests into infrastructure via IaC (Terraform/Pulumi) and GitOps workflows (e.g., Argo CD). [7]
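The bullets above hinge on a declarative product manifest that the registry and provisioner consume. A minimal sketch of what such a manifest might contain (the schema and field names are illustrative, not a formal standard):

```yaml
# Hypothetical data product manifest; field names are illustrative
apiVersion: platform.example.com/v1
kind: DataProduct
metadata:
  name: orders
  domain: commerce
  owner: grp-commerce-data
spec:
  contracts:
    - type: odcs
      path: contracts/1.2.0/orders.odcs.yaml
  interfaces:
    - type: stream
      protocol: kafka
      address: orders.created.v1
    - type: table
      format: iceberg
      address: commerce.orders
  infrastructure:
    template: data-product-iceberg-batch
```

Everything downstream (policy checks, IaC, catalog registration) can be derived from this single file kept in the product repo.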

Data Plane Capabilities

  • Orchestration: Airflow and/or Dagster. Airflow for broad scheduling; Dagster for asset-centric, testable pipelines. [8]
  • Streaming: Kafka or Pulsar. Pulsar has explicit multi-tenancy; Kafka relies on ACLs. [9]
  • Lakehouse storage: Iceberg or Delta Lake. Both provide schema evolution and time travel/rollback primitives. [10]
  • Compute: Spark (batch + micro-batch) and/or Flink (stateful streaming). [11]

Metadata, Lineage, & Observability

Pair a metadata platform (DataHub or Amundsen) with OpenLineage as the interoperability layer for run-level lineage events. [12]
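Run-level lineage in OpenLineage is carried as structured JSON events emitted at run start and completion. A minimal sketch of the event shape (the producer URI and dataset namespaces are placeholders, and real deployments would use the OpenLineage client library):

```python
import json
import uuid
from datetime import datetime, timezone

def make_run_event(job_name: str, inputs: list[str], outputs: list[str]) -> dict:
    """Build a minimal OpenLineage-style COMPLETE event for a pipeline run."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "commerce", "name": job_name},
        "inputs": [{"namespace": "kafka://broker.prod.example", "name": n} for n in inputs],
        "outputs": [{"namespace": "iceberg://lake", "name": n} for n in outputs],
        "producer": "https://platform.example.com/orchestrator",  # placeholder URI
    }

event = make_run_event("orders_daily", ["orders.created.v1"], ["commerce.orders"])
print(json.dumps(event, indent=2))
```

Because every orchestrator and compute engine emits the same envelope, the catalog can stitch cross-tool lineage without bespoke integrations.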

Key Trade-off: Standardization vs. Autonomy

Too little standardization produces a “distributed swamp,” while too much creates bottlenecks. The practical solution is “federated guardrails”: enforce a small set of machine-checkable contracts while leaving implementation choice to domains where feasible. [13]
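"Machine-checkable" can be as simple as a CI step that validates each product manifest against a small guardrail set. A sketch under illustrative assumptions (the required fields and naming rule are examples, not a prescribed policy):

```python
REQUIRED_FIELDS = {"name", "domain", "owner", "contracts"}  # illustrative guardrails

def guardrail_violations(manifest: dict) -> list[str]:
    """Return a list of guardrail violations for a product manifest."""
    violations = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - set(manifest))]
    # Naming convention: stream addresses must carry a version suffix, e.g. orders.created.v1
    for iface in manifest.get("interfaces", []):
        if iface.get("type") == "stream" and not iface.get("address", "").split(".")[-1].startswith("v"):
            violations.append(f"unversioned topic: {iface.get('address')}")
    return violations

ok = {"name": "orders", "domain": "commerce", "owner": "grp-commerce-data", "contracts": [],
      "interfaces": [{"type": "stream", "address": "orders.created.v1"}]}
bad = {"name": "orders", "interfaces": [{"type": "stream", "address": "orders.created"}]}
print(guardrail_violations(ok))   # []
print(guardrail_violations(bad))  # missing fields + unversioned topic
```

The point is that the guardrail set stays small and executable, so domains get fast, objective feedback instead of review-board queues.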

Reference Architecture Overview

A data mesh-aligned platform should map naturally to the four widely cited principles: domain-oriented ownership, data as a product, self-serve platform, and federated computational governance. [14] The key move is to implement “self-serve” primarily as APIs + paved workflows, not a ticket queue.

flowchart LR
  subgraph DomainTeams[Domain data product teams]
    Dev[Domain developers & analytics engineers]
    Repo["Data Product Repo\n(code + manifest + contracts)"]
  end
  subgraph IDP[Internal Developer Platform]
    Portal["Developer Portal\n(e.g., Backstage)"]
    Templates[Templates & Golden Paths]
  end
  subgraph ControlPlane[Platform control plane]
    DPR[Data Product Registry API]
    Prov["Provisioning Orchestrator\n(IaC + GitOps)"]
    Policy["Policy Engine\n(RBAC/ABAC + approvals)"]
    Secrets[Secrets & Key Mgmt]
    Catalog["Metadata Catalog\n(DataHub/Amundsen)"]
  end
  subgraph DataPlane[Platform data plane]
    Orch["Orchestration\n(Airflow/Dagster)"]
    Stream["Streaming\n(Kafka/Pulsar)"]
    Compute["Compute\n(Spark/Flink)"]
    Lake["Lakehouse Storage\n(Iceberg/Delta)"]
  end
  subgraph ObsPlane[Observability & Lineage]
    OTel["Telemetry\n(OpenTelemetry)"]
    OL["Lineage Events\n(OpenLineage)"]
    Monitor[Monitoring/Alerting]
  end
  Dev --> Portal
  Portal --> Templates
  Templates --> Repo
  Repo --> Prov
  Prov --> Policy
  Policy --> Prov
  Prov --> Orch
  Prov --> Stream
  Prov --> Lake
  Orch --> Compute
  Compute --> Lake
  Orch --> OL
  Compute --> OL
  OL --> Catalog
  OTel --> Monitor
  DPR --> Catalog
  Secrets --> Prov
  DPR --> Prov
  Portal --> DPR

This separation of planes is what makes self-service safe: the control plane exposes standardized interfaces, while the data plane scales independently with workload. [3]

Platform APIs for Data Products

This section focuses on how the platform exposes “data product as a first-class resource” using stable APIs, consistent auth, and machine-checkable contracts.

Recommended Patterns

  • Contract-first specs: Use OpenAPI for synchronous APIs. Use AsyncAPI for event-driven interfaces. [18][19]
  • Versioning: Adopt Semantic Versioning for data product interfaces. [20]
  • AuthN/AuthZ: Use centralized identity via OpenID Connect (OIDC). Apply least privilege using RBAC at runtime layers. [2][22]

Example API Contracts

Data Product Registry (OpenAPI)

openapi: 3.1.0
info:
  title: Data Product Registry API
  version: 1.0.0
paths:
  /data-products:
    post:
      summary: Register a new data product
      security: [{ oidc: [ "dpr.write" ] }]
      requestBody:
        required: true
        content:
          application/json:
            schema: { $ref: "#/components/schemas/DataProductCreate" }
      responses:
        "201":
          description: Created
          headers:
            Location:
              schema: { type: string }
  /data-products/{productId}/versions:
    post:
      summary: Publish a new version of a data product
      security: [{ oidc: [ "dpr.write" ] }]
      parameters:
        - name: productId
          in: path
          required: true
          schema: { type: string }
        - name: Idempotency-Key
          in: header
          required: true
          schema: { type: string }
      responses:
        "202":
          description: Accepted for validation + provisioning
components:
  securitySchemes:
    oidc:
      type: openIdConnect
      openIdConnectUrl: https://idp.example.com/.well-known/openid-configuration
  schemas:
    DataProductCreate:
      type: object
      required: [name, domain, owner, description]
      properties:
        name: { type: string }
        domain: { type: string }
        owner: { type: string, description: "Group or team id" }
        description: { type: string }
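A client call against this contract might look as follows (host, token, and payload values are placeholders; the request is constructed but not sent, and in practice the bearer token would be obtained from the OIDC provider):

```python
import json
import urllib.request

# Placeholder token; in practice obtained via the OIDC provider's token endpoint
token = "eyJ...access-token"
payload = {"name": "orders", "domain": "commerce",
           "owner": "grp-commerce-data", "description": "Order events and tables"}

req = urllib.request.Request(
    "https://platform.example.com/data-products",
    data=json.dumps(payload).encode(),
    method="POST",
    headers={
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    },
)
# urllib.request.urlopen(req) would submit it; the contract promises
# a 201 response with a Location header pointing at the new product.
```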

Streaming Interface (AsyncAPI)

asyncapi: 3.0.0
info:
  title: Orders Data Product Events
  version: 1.2.0
servers:
  prod:
    host: broker.prod.example:9092
    protocol: kafka
channels:
  orders.created:
    address: orders.created
    messages:
      OrdersCreated:
        payload:
          type: object
          required: [order_id, created_at, customer_id]
          properties:
            order_id: { type: string }
            created_at: { type: string, format: date-time }
            customer_id: { type: string }
operations:
  publishOrdersCreated:
    action: send
    channel:
      $ref: "#/channels/orders.created"

Data Contracts as "Product Constitution" (ODCS)

apiVersion: v3.1.0
kind: DataContract
id: orders
name: Orders
version: 1.2.0
status: active
schema:
  - name: orders
    columns:
      - name: order_id
        type: string
        required: true
      - name: created_at
        type: timestamp
        required: true
quality:
  - type: freshness
    maxLagMinutes: 15
terms:
  - type: pii
    classification: restricted

Runbook: Introducing a breaking change to a data product interface
  1. Classify the change using SemVer (breaking → MAJOR). [20]
  2. Update OpenAPI/AsyncAPI/ODCS in the product repo; create a new version folder (e.g., contracts/2.0.0/).
  3. CI performs contract checks:
    • OpenAPI lint + backward compatibility check.
    • AsyncAPI validation + schema registry compatibility check. [26]
    • ODCS validation against the standard definition. [27]
  4. Provision parallel interfaces where feasible (new topic or table namespace).
  5. Publish deprecation notice in catalog with timelines and consumer impact analysis using lineage. [28]
  6. Monitor adoption; enforce retirement by policy after the deprecation window.
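Step 1's classification can itself be automated from a contract diff. A sketch under simplified assumptions (removed or retyped fields are breaking, added fields are backward-compatible; real checks would also consider required/optional status and semantics):

```python
def classify_bump(old: dict[str, str], new: dict[str, str]) -> str:
    """Classify a schema change as a SemVer bump. Dicts map field name -> type."""
    removed = old.keys() - new.keys()
    retyped = {f for f in old.keys() & new.keys() if old[f] != new[f]}
    added = new.keys() - old.keys()
    if removed or retyped:
        return "MAJOR"   # breaking: consumers may rely on removed/retyped fields
    if added:
        return "MINOR"   # backward-compatible addition
    return "PATCH"       # no interface change (docs, quality thresholds, etc.)

old = {"order_id": "string", "created_at": "timestamp"}
print(classify_bump(old, {**old, "customer_id": "string"}))  # MINOR
print(classify_bump(old, {"order_id": "string"}))            # MAJOR
```

Wiring this into CI makes the MAJOR/MINOR/PATCH decision reproducible instead of a matter of reviewer judgment.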

Runbook: API authentication incidents (token compromise)
  1. Revoke client credentials / rotate secrets in the central secrets system.
  2. Force re-issuance of access tokens; validate OIDC configuration and client registrations. [2]
  3. Audit platform API access logs; identify affected products and consumers.
  4. Post-incident: add policy requirements (shorter TTL, mandatory mTLS for service clients, tighter scopes).

Data Product Provisioning

To keep domains autonomous while preventing “snowflake infrastructure,” provisioning should be template-driven, policy-guarded, and declarative.

stateDiagram-v2
  [*] --> Draft
  Draft --> Proposed: Contracts + manifest submitted
  Proposed --> Provisioning: Policy checks pass
  Provisioning --> Active: Infra reconciled + first successful run
  Active --> Active: Minor/Patch updates
  Active --> Deprecated: Deprecation announced
  Deprecated --> Retired: Sunset + access removed
  Proposed --> Draft: Policy or contract rejected
  Provisioning --> Proposed: Provisioning failed (rollback)

Example Backstage Scaffolder Template

apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: data-product-iceberg-batch
spec:
  owner: platform-team
  type: data-product
  parameters:
    - title: Data Product Info
      required: [name, domain, owner]
      properties:
        name: { type: string }
        domain: { type: string }
        owner: { type: string }
  steps:
    - id: fetch-base
      action: fetch:template
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
          domain: ${{ parameters.domain }}
          owner: ${{ parameters.owner }}
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?repo=${{ parameters.name }}&owner=${{ parameters.owner }}

Runbook: Onboarding a new domain team
  1. Create domain identity groups in the IdP (OIDC provider) and map claims. [2]
  2. Allocate Kubernetes namespaces with namespace-scoped RBAC roles and quotas. [35]
  3. Allocate streaming tenancy (Pulsar tenant/namespace or Kafka ACLs/naming conventions). [36][37]
  4. Enable catalog write permissions for the domain (limited to its own products).
  5. Provide domain with Backstage template access and a standard “first product” walkthrough.
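Step 2's namespace-scoped RBAC can be expressed declaratively and reconciled by the provisioner. A minimal sketch (the namespace, group, and resource lists are illustrative):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: commerce-data
  name: domain-developer
rules:
  - apiGroups: ["", "apps", "batch"]
    resources: ["pods", "deployments", "jobs", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  namespace: commerce-data
  name: domain-developer-binding
subjects:
  - kind: Group
    name: grp-commerce-data   # group claim mapped from the OIDC provider
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: domain-developer
  apiGroup: rbac.authorization.k8s.io
```

Binding to an IdP group rather than individual users keeps access reviews aligned with the domain's team membership.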

Runbook: Retiring a product
  1. Mark as Deprecated in registry; include sunset date.
  2. Use lineage to identify downstream dependencies. [28]
  3. Remove new access grants; maintain existing access until sunset.
  4. Archive runtime deployments (DAGs/jobs) and freeze tables/topics.
  5. Delete or cold-store data per retention policy; remove infra via IaC destroy/reconcile.

Infrastructure-as-Code for Pipelines

Terraform Module Interface (Pseudo-code)

module "data_product_orders" {
  source = "git::ssh://git.example.com/platform/iac-modules.git//data-product"
  
  product_id   = "orders"
  domain       = "commerce"
  environment  = "prod"
  
  # Streaming interface
  kafka_topics = [
    { name = "orders.created.v1", partitions = 24, retention_hours = 168 }
  ]
  
  # Lakehouse interface
  lake_tables = [
    { name = "commerce.orders", format = "iceberg", partitioning = ["days(created_at)"] }
  ]
  
  # Access policy hooks
  owner_group = "grp-commerce-data"
  pii_level   = "restricted"
}

Pipeline Definitions

Airflow DAG Pattern

# ExtractFromKafka and WriteIcebergTable are illustrative platform-provided
# operators; SparkSubmit stands in for e.g. the SparkSubmitOperator.
from datetime import datetime

from airflow import DAG

with DAG("orders_daily", schedule="@daily", start_date=datetime(2024, 1, 1)) as dag:
    extract = ExtractFromKafka(topic="orders.created.v1")
    transform = SparkSubmit(job="orders_transform.py")
    load = WriteIcebergTable(table="commerce.orders")

    extract >> transform >> load

Dagster Asset Pattern

from dagster import asset

# kafka_resource and iceberg_io are illustrative platform-provided resources
@asset
def orders_raw(kafka_resource):
    return kafka_resource.read("orders.created.v1")

@asset
def orders_iceberg(orders_raw, iceberg_io):
    # upstream dependency on orders_raw is inferred from the parameter name
    iceberg_io.write("commerce.orders", orders_raw)

Internal Developer Platform Capabilities

A self-serve data platform becomes sustainable when it is delivered as an internal developer platform (IDP): portal UX + self-serve actions + guardrails + observability and cost controls.

  • Developer portal: Backstage’s software templates scaffold new components, and its catalog ingests and tracks entities. [59]
  • RBAC enforcement: Close to execution surfaces (Kubernetes RBAC, Kafka ACLs). [35][37]
  • Observability: OpenTelemetry for traces, metrics, and logs. [60]
  • Cost controls: Showback/chargeback via namespace/label attribution (e.g., Kubecost). [61]
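Showback is ultimately an aggregation over resource usage tagged with namespace and product labels. A toy sketch of the attribution step (the usage records and blended rate are invented for illustration; tools like Kubecost do this against real metering data):

```python
from collections import defaultdict

# Illustrative usage records, e.g. exported from a metering system
usage = [
    {"namespace": "commerce-data", "product": "orders", "cpu_core_hours": 120.0},
    {"namespace": "commerce-data", "product": "returns", "cpu_core_hours": 30.0},
    {"namespace": "logistics-data", "product": "shipments", "cpu_core_hours": 80.0},
]
RATE_PER_CORE_HOUR = 0.04  # assumed blended rate

def showback(records: list[dict]) -> dict[str, float]:
    """Attribute cost per (namespace, product) label pair."""
    costs: dict[str, float] = defaultdict(float)
    for r in records:
        costs[f'{r["namespace"]}/{r["product"]}'] += r["cpu_core_hours"] * RATE_PER_CORE_HOUR
    return dict(costs)

print(showback(usage))
```

Consistent labeling at provisioning time is what makes this attribution trivial later; retrofitting labels is far harder.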

Runbook: Granting consumer access to a restricted data product
  1. Consumer requests access in the portal.
  2. Policy engine evaluates identity claims, product classification, and approvals.
  3. If approved, provisioning reconciles access controls (Kafka ACLs, Lakehouse RBAC).
  4. Catalog updates to reflect new access paths.
  5. Audit logs captured and retained.
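Step 2's evaluation can be sketched as a pure function over identity claims, product classification, and recorded approvals (the attribute names and rules are illustrative; production systems typically encode this in a policy engine):

```python
def evaluate_access(claims: dict, product: dict, approvals: set[str]) -> bool:
    """ABAC-style check: identity claims + product classification + approvals."""
    if product["classification"] == "restricted":
        # Restricted products need an explicit approval and a matching purpose claim
        return (claims["sub"] in approvals
                and product["allowed_purpose"] in claims.get("purposes", []))
    # Non-restricted: any authenticated member of a domain group may read
    return bool(claims.get("groups"))

claims = {"sub": "alice", "groups": ["grp-analytics"], "purposes": ["fraud-detection"]}
product = {"classification": "restricted", "allowed_purpose": "fraud-detection"}
print(evaluate_access(claims, product, approvals={"alice"}))  # True
print(evaluate_access(claims, product, approvals=set()))      # False
```

Keeping the decision a deterministic function of attributes makes every grant auditable and replayable.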

Integration & Reference Stack Options

erDiagram
  DOMAIN ||--o{ DATA_PRODUCT : owns
  DATA_PRODUCT ||--o{ PRODUCT_VERSION : has
  PRODUCT_VERSION ||--o{ INTERFACE : exposes
  INTERFACE ||--o{ DATASET : "table view"
  INTERFACE ||--o{ STREAM : "topic/queue"
  PRODUCT_VERSION ||--|| CONTRACT : governed_by
  DATA_PRODUCT ||--|| TEAM : operated_by
  DATASET ||--o{ LINEAGE_EDGE : participates_in
  STREAM ||--o{ LINEAGE_EDGE : participates_in
  DOMAIN {
    string domain_id
    string name
  }
  TEAM {
    string team_id
    string idp_group
  }
  DATA_PRODUCT {
    string product_id
    string name
    string lifecycle_state
  }
  PRODUCT_VERSION {
    string version_semver
    datetime published_at
  }
  INTERFACE {
    string interface_type
    string address
  }
  CONTRACT {
    string contract_type
    string location
  }
  DATASET {
    string urn
    string format
  }
  STREAM {
    string topic
    string protocol
  }
  LINEAGE_EDGE {
    string upstream
    string downstream
  }
| Layer         | Option     | Maturity Signal                      | Scalability Posture                   | Cost Posture (Ops) |
|---------------|------------|--------------------------------------|---------------------------------------|--------------------|
| Orchestration | Airflow    | Stable REST API, broad adoption      | Scales well; ops rise with DB tuning  | Medium–High        |
| Orchestration | Dagster    | Asset-centric, lineage emphasis      | Scales well with asset modeling       | Medium             |
| Streaming     | Kafka      | Strong security primitives (ACLs)    | High throughput                       | Medium–High        |
| Streaming     | Pulsar     | Explicit multi-tenancy model         | Designed for geo patterns             | Medium–High        |
| Lakehouse     | Iceberg    | Schema evolution, time travel        | Strong for very large tables          | Medium             |
| Lakehouse     | Delta Lake | ACID transactions, scalable metadata | Strong at scale (Spark-centric)       | Low–Medium         |
| Metadata      | DataHub    | GraphQL APIs, lineage APIs           | Scales as central metadata service    | Medium             |
| IDP Portal    | Backstage  | Templates scaffold + catalog         | Scales as portal                      | Medium             |

Closing Perspective

A self-serve data platform for data mesh is less about picking “the right tools” and more about achieving a consistent product operating model: machine-readable contracts, automated checks, declarative provisioning, standardized telemetry/lineage, and a portal that makes the paved path the easiest path. The cited ecosystem standards (OpenAPI, AsyncAPI, OpenLineage) and platform primitives (Terraform, Argo CD, Kubernetes RBAC, DataHub) exist precisely because scale demands automation and interoperability, and because manual governance does not scale. [84]

Selected References

  • [1] Data Mesh Principles and Logical Architecture - Martin Fowler
  • [3] CNCF Platforms White Paper | CNCF TAG App Delivery
  • [5] OpenAPI Specification v3.1.0
  • [6] Open Data Contract Standard (ODCS)
  • [28] OpenLineage
  • [29] Backstage Software Templates
  • [30] Argo CD - Declarative GitOps CD
  • [62] DataHub Components