Data Product Design and Lifecycle
Beyond Datasets to Fully Managed Data Products.
Executive Summary
Organizations that "publish datasets" often operate in an implicit-contract world: the producer changes column names or semantics, consumers discover breakage downstream, and ownership is unclear. Data Mesh reframes this by treating analytical data as a product and data consumers as customers, requiring product-level capabilities like discoverability, trustworthiness, and security even as data ownership decentralizes across domains.
A fully managed data product is best understood as an independently operable unit (an architectural "quantum") that bundles the data itself with its transformation code, infrastructure declarations, metadata, governance policies, and operational guarantees.
A practical, interoperable way to "package" these expectations is via a data product contract: Bitol's Open Data Contract Standard (ODCS) explicitly models fundamentals, schema, data quality, support/communication, pricing, roles/team, infrastructure/servers, and SLAs as first-class contract sections.
Six Capabilities Separating Datasets from Data Products
Contracts & Schemas
Explicit, machine-validatable definitions of interface, semantics, and quality rules.
Versioning & CI/CD
A release discipline that encodes compatibility promises and automated validation gates.
SLAs & SLOs
Reliability engineering applied to data: SLIs, target SLOs, consumer-facing SLAs.
Discoverability
Standardized metadata and lineage so products are findable and impact analysis is possible.
Deprecation & Lifecycle
Explicit states, timelines, communications, and compliance-grade retention controls.
Product Boundaries
Boundaries aligned to business domains (DDD) with clear accountability and cost visibility.
1. Data Product Contracts and Schemas
Recommended Practices
A robust data product contract should specify more than "columns and types." ODCS provides a useful canonical checklist. Structure contracts as layered commitments:
- Interface contract: Names, types, nullability, keys, partitions, and serialization format.
- Semantic contract: Grain, meaning, units, canonical definitions, aligned with Domain-Driven Design (DDD).
- Quality contract: Rules and thresholds (uniqueness, completeness). Tools like Deequ formalize "unit tests for data".
- Operational contract: Freshness, availability, delivery schedule, and support channels.
- Security & governance contract: Classification, access method, authZ model, and terms of use.
- Economics contract: Pricing/showback inputs and cost allocation rules.
Contract Types and Trade-offs
| Contract Approach | What it optimizes | Strengths | Trade-offs / risks |
|---|---|---|---|
| Documentation-only "data dictionary" | Understanding | Low effort; easy to start | Breakage still discovered late; drift between docs and reality |
| Machine-validatable schema contract | Structural correctness | Prevents schema drift; automatable | Doesn't guarantee semantics; may overfit to types |
| Full data product contract (ODCS-style) | Product reliability & governance | Aligns schema, quality, SLA, ownership, support, pricing | Higher up-front investment; requires operating model maturity |
Concrete Example (ODCS-style snippet)
    apiVersion: v3.1.0
    kind: DataContract
    name: orders
    status: active
    domain: commerce.orders
    dataProduct: Orders
    schema:
      - name: orders
        type: table
        description: "One row per customer order (grain = order_id)."
        columns:
          - name: order_id
            dataType: string
            required: true
          - name: total_amount
            dataType: decimal(12,2)
            required: true
    dataQuality:
      rules:
        - type: uniqueness
          column: order_id
    sla:
      freshness:
        maxLagMinutes: 15
    support:
      channel: "#orders-data-product"
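To make the interface and quality layers of the snippet above concrete, here is a minimal Python sketch of how a batch of rows might be checked against it. The `CONTRACT` dictionary and the `validate_rows` helper are hypothetical names introduced for illustration; a real pipeline would parse the YAML and would more likely rely on a dedicated validation tool.

```python
from decimal import Decimal

# Hypothetical in-memory form of the contract snippet above.
# Assumption: a real pipeline would load this from the YAML file instead.
CONTRACT = {
    "columns": [
        {"name": "order_id", "required": True},
        {"name": "total_amount", "required": True},
    ],
    "unique": ["order_id"],
}


def validate_rows(rows, contract):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    required = [c["name"] for c in contract["columns"] if c["required"]]

    # Interface layer: required columns must be present and non-null.
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                violations.append(f"row {i}: required column '{col}' is missing or null")

    # Quality layer: uniqueness rules.
    for col in contract["unique"]:
        seen = set()
        for i, row in enumerate(rows):
            value = row.get(col)
            if value is None:
                continue  # already reported by the required-column check
            if value in seen:
                violations.append(f"row {i}: duplicate value {value!r} in '{col}'")
            seen.add(value)
    return violations


if __name__ == "__main__":
    batch = [
        {"order_id": "o-1", "total_amount": Decimal("19.99")},
        {"order_id": "o-1", "total_amount": None},
    ]
    for problem in validate_rows(batch, CONTRACT):
        print(problem)
```

Returning violations rather than raising on the first one lets a CI job report every problem in a batch at once.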
Implementation Checklist
Case Study: PayPal's open-source data contract template evolved into ODCS, an example of an internal fix for cross-team contract friction growing into a community standard with shared validation tooling.
2. Versioning and CI/CD for Data Products
Data products need versioning at multiple conceptual layers: Contract/Interface (SemVer), Implementation (Git hash), and Data Content (CalVer/Time-based).
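One way to tie the contract layer to SemVer is to derive the required bump from a schema diff. The sketch below is illustrative only: the `required_bump` and `bump` helpers, the `(type, nullable)` schema shape, and the `WIDENINGS` allowlist are assumptions, and the rules are deliberately conservative (any change that is not clearly additive counts as breaking).

```python
# Assumed allowlist of type changes treated as safe widenings.
WIDENINGS = {("int", "bigint"), ("float", "double")}


def required_bump(old, new):
    """Classify a schema change as 'major', 'minor', or 'patch'.

    old/new map column name -> (type, nullable).
    """
    level = "patch"
    for name, (old_type, old_nullable) in old.items():
        if name not in new:
            return "major"  # removed column breaks readers
        new_type, new_nullable = new[name]
        if new_nullable != old_nullable:
            return "major"  # nullability change treated as breaking
        if new_type != old_type:
            if (old_type, new_type) not in WIDENINGS:
                return "major"
            level = "minor"  # allowed widening
    for name, (_type, nullable) in new.items():
        if name not in old:
            if not nullable:
                return "major"  # conservative: only nullable additions are additive
            level = "minor"
    return level


def bump(version, level):
    """Apply a SemVer bump to a 'MAJOR.MINOR.PATCH' string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


if __name__ == "__main__":
    v1 = {"order_id": ("string", False), "qty": ("int", False)}
    v2 = {**v1, "coupon_code": ("string", True)}
    print(bump("1.4.2", required_bump(v1, v2)))
```

The implementation and data-content layers would be tracked separately, for example by recording the Git hash and the snapshot date alongside this contract version.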
Migration Patterns
- Additive evolution: Add nullable fields; widen types.
- Compatibility views: Preserve old interfaces while stabilizing new ones.
- Dual-publish: Publish v1 and v2 concurrently for a migration window.
- Shadow pipelines: Compute v2 in parallel, compare, then flip.
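The "compare, then flip" step of a shadow pipeline can be sketched as a keyed diff between the v1 output and the shadow v2 output. The function names, the report shape, and the default zero-tolerance threshold below are assumptions made for illustration.

```python
def compare_outputs(v1_rows, v2_rows, key, columns):
    """Diff v1 and shadow v2 outputs by `key` over the shared `columns`."""
    v1 = {row[key]: row for row in v1_rows}
    v2 = {row[key]: row for row in v2_rows}
    return {
        "missing_in_v2": sorted(v1.keys() - v2.keys()),
        "extra_in_v2": sorted(v2.keys() - v1.keys()),
        "mismatched": sorted(
            k for k in v1.keys() & v2.keys()
            if any(v1[k].get(c) != v2[k].get(c) for c in columns)
        ),
    }


def safe_to_flip(report, compared_count, max_bad_ratio=0.0):
    """Allow the cutover only if discrepancies stay within tolerance."""
    bad = sum(len(keys) for keys in report.values())
    return compared_count > 0 and bad / compared_count <= max_bad_ratio


if __name__ == "__main__":
    v1 = [{"order_id": "a", "total": 10}, {"order_id": "b", "total": 20}]
    v2 = [{"order_id": "a", "total": 10}, {"order_id": "b", "total": 21}]
    report = compare_outputs(v1, v2, key="order_id", columns=["total"])
    print(report, safe_to_flip(report, compared_count=len(v1)))
```

Only columns that both versions share are compared, so fields that are new in v2 do not count as mismatches.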
Data CI/CD Gates
- Contract linting / validation.
- Schema compatibility checks (registry rules).
- Data unit tests & invariants running in staging.
- Metadata + lineage emission verification.
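The gates above are typically run in order, stopping at the first failure so that later, more expensive checks are skipped. The sketch below assumes a hypothetical release-candidate dictionary and two toy gate functions; real gates would call a contract linter, a schema registry, and a test runner.

```python
from typing import NamedTuple


class GateResult(NamedTuple):
    name: str
    passed: bool
    detail: str


def run_gates(gates, candidate):
    """Run (name, check) pairs in order and stop at the first failing gate."""
    results = []
    for name, check in gates:
        passed, detail = check(candidate)
        results.append(GateResult(name, passed, detail))
        if not passed:
            break
    return results


# Hypothetical gate: the contract must carry a few mandatory keys.
def contract_lint(candidate):
    missing = [k for k in ("name", "version", "schema") if k not in candidate["contract"]]
    return (not missing, f"missing keys: {missing}" if missing else "ok")


# Hypothetical gate: the staging run must have emitted lineage events.
def lineage_emitted(candidate):
    events = candidate.get("lineage_events", [])
    return (bool(events), "ok" if events else "no lineage events emitted")


if __name__ == "__main__":
    candidate = {
        "contract": {"name": "orders", "version": "1.5.0", "schema": []},
        "lineage_events": [],
    }
    gates = [("contract-lint", contract_lint), ("lineage", lineage_emitted)]
    for result in run_gates(gates, candidate):
        print(result)
```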
Implementation Checklist
3. Data Product SLAs and SLOs
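Adopt SRE terminology for data reliability: an SLI (service level indicator) is the measured quantity, an SLO (service level objective) is the target value or range for that SLI, and an SLA (service level agreement) is the consumer-facing agreement that attaches consequences to missing the targets.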
| Metric Category | Example SLI | Target SLO |
|---|---|---|
| Freshness / Timeliness | lag(now, max(event_time)) | p95 ≤ 15 min |
| Availability | % successful reads | ≥ 99.9% over 30 days |
| Validity | % rows passing constraints | ≥ 99.99% |
| Completeness | % expected entities present | ≥ 99.5% |
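As an illustration of the freshness row, the sketch below computes the lag SLI and evaluates it against a p95 target. The helper names and the nearest-rank percentile method are assumptions; a monitoring system would normally compute percentiles itself.

```python
import math
from datetime import datetime, timezone


def freshness_lag_minutes(now, max_event_time):
    """Freshness SLI: minutes between now and the newest event in the product."""
    return (now - max_event_time).total_seconds() / 60


def percentile(values, pct):
    """Nearest-rank percentile (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_freshness_slo(lag_samples, pct=95, max_lag_minutes=15):
    """SLO check: the pct-th percentile of sampled lag stays within target."""
    return percentile(lag_samples, pct) <= max_lag_minutes


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    newest = datetime(2024, 1, 1, 11, 48, tzinfo=timezone.utc)
    print(freshness_lag_minutes(now, newest))
    print(meets_freshness_slo([5] * 19 + [40]))
```

Evaluating a percentile rather than the maximum keeps a single slow run from failing the objective on its own.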
Alerting and Enforcement
Page on sustained SLO burn (error budget consumption) rather than on one-off anomalies. Back the alerts with automated remediation runbooks that roll back, replay, or degrade gracefully.
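"Sustained burn" can be made precise with a burn rate: the observed failure rate divided by the failure rate the SLO allows. The sketch below pages only when both a long and a short window burn fast, which filters out one-off spikes. The function names and the default threshold of 14.4 are assumptions to be tuned per product.

```python
def burn_rate(bad, total, slo):
    """Observed failure ratio divided by the ratio the SLO allows (1 - slo)."""
    if total == 0:
        return 0.0
    return (bad / total) / (1 - slo)


def should_page(long_window, short_window, slo=0.999, threshold=14.4):
    """Page only if both windows exceed the threshold.

    Each window is a (bad, total) pair. The long window shows the burn is
    significant; the short window shows it is still happening.
    """
    return (
        burn_rate(*long_window, slo) >= threshold
        and burn_rate(*short_window, slo) >= threshold
    )


if __name__ == "__main__":
    # 2% of reads failed over the long window, 3% over the short one.
    print(should_page(long_window=(20, 1000), short_window=(3, 100)))
```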
Implementation Checklist
4. Discoverability and Metadata Standards
Discoverability is foundational to "data as a product" at scale. Metadata needs two things: standard models that keep tools interoperable (DCAT for catalog records, PROV-DM for provenance, OpenLineage for lineage events) and an operational catalog that serves it day to day (for example, DataHub or OpenMetadata).
Minimum Viable Metadata Record:
- Identity: name, domain, unique ID, stable address.
- Contract: format, version, compatibility mode.
- Semantics: grain, key fields, units, glossary mappings.
- Ownership: owner, steward, on-call support channel.
- Access: auth method, classification, allowed uses.
- Reliability: SLO definitions + current SLI rollups.
- Lineage: upstream sources, downstream dependents.
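The record above lends itself to an automated completeness check at registration time. The following sketch assumes a nested-dictionary record and hypothetical field names derived from the list; it is not the schema of any particular catalog.

```python
# Assumed required fields per section, mirroring the list above.
REQUIRED_FIELDS = {
    "identity": ["name", "domain", "id", "address"],
    "contract": ["format", "version", "compatibility_mode"],
    "semantics": ["grain", "key_fields"],
    "ownership": ["owner", "support_channel"],
    "access": ["auth_method", "classification"],
    "reliability": ["slo_definitions"],
    "lineage": ["upstream", "downstream"],
}


def missing_metadata(record):
    """Return 'section.field' paths that are absent, None, or empty strings.

    Empty lists are accepted: a new product may have no downstream dependents yet.
    """
    missing = []
    for section, keys in REQUIRED_FIELDS.items():
        values = record.get(section) or {}
        for key in keys:
            value = values.get(key)
            if value is None or value == "":
                missing.append(f"{section}.{key}")
    return missing


if __name__ == "__main__":
    record = {
        "identity": {"name": "orders", "domain": "commerce.orders",
                     "id": "dp-001", "address": "warehouse.commerce.orders"},
        "contract": {"format": "table", "version": "1.5.0",
                     "compatibility_mode": "backward"},
        "semantics": {"grain": "order_id", "key_fields": ["order_id"]},
        "ownership": {"owner": "", "support_channel": "#orders-data-product"},
        "access": {"auth_method": "role-based", "classification": "internal"},
        "reliability": {"slo_definitions": ["freshness p95 <= 15 min"]},
        "lineage": {"upstream": ["orders_raw"], "downstream": []},
    }
    print(missing_metadata(record))
```

A catalog could refuse to list a product, or mark it as draft, until this check returns an empty list.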
Implementation Checklist
5. Deprecation and Lifecycle Management
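Without explicit lifecycle policies, deprecated products linger, consuming cost and creating compliance risk. Lifecycle transitions are most effective when they are enforced by automation (for example, platform checks that block invalid state changes) rather than left to convention.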
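A small state machine is one way to enforce such transitions in code. In the sketch below, the state names other than `active`, the 90-day notice period, and the rule that a product cannot retire while it still has consumers are all illustrative assumptions rather than a prescribed policy.

```python
from datetime import date, timedelta

# Assumed lifecycle states and allowed transitions.
ALLOWED = {
    "draft": {"active"},
    "active": {"deprecated"},
    "deprecated": {"active", "retired"},  # deprecation can be reversed
    "retired": set(),
}
MIN_NOTICE = timedelta(days=90)  # assumed minimum deprecation notice


def transition(current, target, *, today, deprecated_on=None, active_consumers=0):
    """Return the new state, or raise ValueError if policy forbids the change."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"transition {current} -> {target} is not allowed")
    if target == "retired":
        if deprecated_on is None or today - deprecated_on < MIN_NOTICE:
            raise ValueError("deprecation notice period has not elapsed")
        if active_consumers > 0:
            raise ValueError(f"{active_consumers} consumers still read this product")
    return target


if __name__ == "__main__":
    state = transition("active", "deprecated", today=date(2024, 1, 1))
    state = transition(state, "retired", today=date(2024, 4, 15),
                       deprecated_on=date(2024, 1, 1), active_consumers=0)
    print(state)
```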
Data Product Lifecycle Map
Implementation Checklist
6. Product Boundaries and Ownership
Defining "what is a data product" requires balancing cohesion and coupling. Product boundaries align closely with Domain-Driven Design (DDD) bounded contexts and, following Conway's Law, with the teams that own them.