
Data Mesh for Data Product Development with AI and Generative AI

Executive summary

Data mesh is best understood as a sociotechnical paradigm (not “just an architecture diagram”) that intentionally couples organizational design with a distributed data architecture to increase value from analytical data at scale while sustaining agility as the organization grows. Zhamak Dehghani frames it as an approach optimizing both technical excellence and the experience of data providers, users, and owners, and describes it via four interacting principles: domain ownership, data as a product, self-serve data platform, and federated computational governance.

The “data product” is the atomic unit of the mesh: an autonomous, independently managed package that combines data with the code, metadata, policies, and infrastructure declarations needed to serve it reliably—what Dehghani calls a data quantum. Data-as-a-product requires explicit, easy-to-use data sharing contracts and product-like usability characteristics (discoverable, addressable, understandable, trustworthy, natively accessible, interoperable/composable, valuable on its own, secure).

AI and generative AI (GenAI) amplify both the value and the operational risk of data products. Practically, they can:

- Create data products faster (generate pipeline scaffolds, transformations, documentation, tests, contracts, semantic models), but they must be constrained by review and automated checks to avoid hallucinated logic or unsafe access patterns.
- Enhance data products by delivering new "AI-native" output ports (features for ML, embeddings for retrieval, labels and evaluation sets, model-ready datasets), enabling RAG-style experiences and faster experimentation.
- Operate data products more effectively (automated metadata enrichment, anomaly summarization, incident triage copilots, policy drift detection), but only if governance is computational and enforced through the platform.

This report provides a rigorous, implementation-oriented view of how to apply data mesh principles in practice, and how to integrate AI/GenAI into data product development and operations using established building blocks: feature stores (offline + online serving), “RAG stacks” (embedding pipelines + vector search), MLOps/LLMOps (CI/CD/CT, registries, serving), and compliance-grade governance.

Unspecified constraints materially affect the recommendations. This report does not assume a target industry, organization size (teams, domains, data volume), regulatory regime, cloud posture, or budget. The roadmap and cost/effort ranges below are therefore parameterized (small/medium/large) and highlight decision points that should be resolved early.

Definitions and core principles of data mesh

Data mesh definition and scope

Dehghani describes data mesh as a sociotechnical paradigm—an approach that explicitly recognizes interactions between people, architecture, and technology in complex organizations—and positions it as part of an enterprise data strategy: target state of enterprise architecture + organizational operating model, executed iteratively.

Data mesh focuses on analytical data (historical, aggregated, OLAP-oriented) and recognizes the “great divide” between operational and analytical data planes, noting that attempting to connect these planes via complex ETL often yields fragile architectures and “labyrinths” of pipelines.

The four principles and what they mean in practice

Domain ownership (domain-oriented decentralized data ownership and architecture)
Ownership and accountability for analytical data shift to the business domains closest to the data (source or main consumers), aligning business, technology, and analytical data. The aim is to scale analytical data sharing along organizational growth axes (more sources, consumers, use cases) and reduce centralized bottlenecks.

Data as a product
Domain-oriented analytical data is shared as a product directly with data users. It must be discoverable, addressable, understandable, trustworthy/truthful, natively accessible, interoperable/composable, valuable on its own, and secure, supported by explicit sharing contracts and managed life cycles. The "data quantum" encapsulates the data, metadata, code, policy, and infrastructure dependencies needed to serve the product autonomously.
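The data quantum's component list can be made concrete as a deployable manifest. A minimal sketch, assuming illustrative field names (none of these come from a standard):

```python
from dataclasses import dataclass, field

# Illustrative sketch: a "data quantum" bundles data access points, code,
# policy, and infrastructure declarations into one independently managed unit.
# All field names are assumptions for illustration, not a published schema.

@dataclass
class DataQuantum:
    name: str                                            # unique, addressable product name
    domain: str                                          # owning business domain
    output_ports: list = field(default_factory=list)     # e.g. ["sql", "stream", "files"]
    code_repo: str = ""                                  # reference to transformation code
    policies: dict = field(default_factory=dict)         # access/privacy policy references
    infrastructure: dict = field(default_factory=dict)   # declarative infra needs
    slos: dict = field(default_factory=dict)             # e.g. {"freshness_hours": 24}

orders = DataQuantum(
    name="commerce.orders",
    domain="commerce",
    output_ports=["sql", "stream"],
    policies={"classification": "internal"},
    slos={"freshness_hours": 24},
)
```

A platform can then provision, register, and police a product entirely from such a manifest.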

Self-serve data platform
A platform provides enabling services for domains to build, deploy, and maintain data products with reduced friction and cognitive load. Dehghani explicitly calls out mesh-level experiences such as surfacing an emergent knowledge graph and lineage across the mesh and managing end-to-end data product life cycles.

Federated computational governance
Governance is a federated accountability structure (domain representatives + platform + SMEs such as legal/security/compliance), relying heavily on codifying and automating policies at fine-grained levels for each data product via platform services. This counters decentralization risks (incompatibility/disconnection) while enabling cross-cutting requirements (security, privacy, legal).

Common misconceptions and anti-patterns

A recurring failure mode is adopting the vocabulary of decentralization without the platform and governance to support it—leading to “data mess.” The HelloFresh re:Invent session summary explicitly describes an initial attempt at data mesh that resulted in “data mess” due to lacking a proper playbook and insufficient implementation strategy.

Similarly, case studies repeatedly cite centralized data teams becoming bottlenecks. SSENSE describes “cracks” as domains and consumption cases grew, with pipeline breakage, data validity questions, and difficulty finding the right data—drivers for shifting toward data mesh principles.

Data products and lifecycle

Data product concept and boundaries

Thoughtworks’ “Designing data products” defines data products as building blocks of a data mesh serving analytical data, and anchors them in the data-as-a-product characteristics (discoverable, addressable, understandable/self-describing, trustworthy via SLOs/SLIs, natively accessible via persona-oriented ports, interoperable, valuable on its own, secure).

The same article emphasizes a pragmatic design method: work backwards from concrete use cases, then overlay additional use cases to avoid overfitting; assign domain ownership; and define SLOs. It also warns organizations can spend months in design cycles without shipping, and proposes methodical workshops that produce “just enough” boundaries for execution.

A suggested practical boundary test: if you cannot describe a data product concisely in one or two sentences, it is likely not well defined (too large or not cohesive).

Data product lifecycle as an operating loop

A useful operational lifecycle for data products (aligned with the “autonomous lifecycle” framing in Dehghani’s data quantum) is:

Discover → Design → Build → Publish → Operate → Evolve/Deprecate, where “Operate” includes reliability & trust guarantees (SLOs, quality checks, lineage, access control), and “Evolve” includes versioning and consumer-safe change management.
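The loop above can be encoded so the platform rejects invalid transitions (e.g. publishing a product that was never designed). A minimal sketch with stage names taken from the loop; the helper functions are illustrative, not a platform API:

```python
# Lifecycle stages as an explicit state machine. "Operate" may iterate in
# place, "Evolve" loops back into "Operate", and deprecation is terminal.

LIFECYCLE = ["discover", "design", "build", "publish", "operate"]

def next_stages(stage: str) -> set:
    """Stages reachable from `stage` in one step."""
    if stage == "operate":
        return {"operate", "evolve", "deprecate"}
    if stage == "evolve":
        return {"operate", "deprecate"}          # evolve feeds back into operation
    if stage in LIFECYCLE[:-1]:
        return {LIFECYCLE[LIFECYCLE.index(stage) + 1]}  # strictly forward
    return set()                                  # "deprecate" (or unknown) is terminal

def can_transition(current: str, target: str) -> bool:
    return target in next_stages(current)
```

A CI step can call `can_transition` before letting a product change its declared stage.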

Where GenAI fits in the data product lifecycle

GenAI can accelerate development, but it must be constrained within governance and security controls. The table below highlights high-leverage opportunities and the corresponding controls needed to make them safe.

| Lifecycle stage | GenAI opportunity | Safety control |
| --- | --- | --- |
| Design | Draft contracts, semantic models, documentation | Human review; contract schema validation |
| Build | Scaffold pipelines, transformations, tests | Code review; automated checks in CI to catch hallucinated logic |
| Publish | Metadata enrichment, classification suggestions | Catalog validation; steward approval |
| Operate | Anomaly summarization, incident triage copilots | Least-privilege access; audit logging |
| Evolve | Policy drift detection, change-impact summaries | Governance sign-off; consumer-safe versioning |

Roles, operating model, and organizational changes

Core roles in a data-mesh operating model

A practical “mesh” needs both domain autonomy and a platform + governance backbone.

Domain data product team (within each business domain)
Owns the domain's analytical data products end-to-end, including quality and lifecycle. This is aligned with Dehghani's "decentralize ownership to business domains closest to the data" principle.

Data product owner (per critical data product)
Dehghani and Fowler emphasize shifting accountability for data quality upstream toward the source and treating consumers as customers. This typically requires an explicit product-owner-like role accountable for SLOs, adoption, and consumer satisfaction.

Self-serve data platform team
Builds the paved roads: templates, provisioning, CI/CD for data products, catalog/lineage integration, policy enforcement. Dehghani explicitly frames this as reducing the total cost of decentralization, lowering cognitive load, and enabling a larger population of generalist developers to build data products.

Federated governance group (domain reps + platform + SMEs)
Sets global interoperability standards and codifies them as policies executed via the platform (computational governance).

AI-specific role extensions

Data mesh does not eliminate centralized AI capability; it changes its shape:

ML/LLM platform (MLOps/LLMOps) team
Owns shared AI platform components (model registry, pipelines, serving, evaluation harnesses). Tools like MLflow provide model registry functions including lifecycle management, lineage, versioning, metadata tagging, and annotation.

AI governance & risk (cross-functional)
Maps AI practices to risk controls. NIST AI RMF provides a voluntary framework to manage AI risks and promote trustworthy AI, and NIST's GenAI Profile extends that specifically for generative AI systems.

Prompt/agent engineering + evaluation
Prompting guidance from major model providers stresses establishing clear success criteria and empirical evaluation methods for prompts before iterating (Claude) and treating prompting as a disciplined practice (OpenAI).

Architecture patterns and technology stack options

Reference architecture for a mesh with AI/GenAI

The platform should be thought of as multi-plane: user-facing onboarding and automation, control functions (orchestration, identities, policies), and the data plane (storage/compute/serving). This aligns with the “self-serve data platform” framing in data mesh and echoes large platforms described in industry talks (e.g., management/control/data planes).

A generalized reference architecture is:

flowchart TB
 subgraph Domains["Business domains (cross-functional teams)"]
 D1["Domain A team\n(owns Product A1, A2)"]
 D2["Domain B team\n(owns Product B1)"]
 D3["Domain C team\n(owns Product C1)"]
 end

 subgraph SSP["Self-serve data platform (paved roads)"]
 CI["CI/CD for data products\n(build, test, publish)"]
 Prov["Provisioning & templates\n(IaC, pipelines, connectors)"]
 Catalog["Catalog / metadata graph"]
 Lineage["Lineage collection"]
 Policy["Policy enforcement\n(policy-as-code)"]
 Obs["Observability\n(SLOs/SLIs, alerts)"]
 end

 subgraph DataPlane["Data plane (storage + compute + serving)"]
 Lake["Lakehouse / warehouse storage\n(table formats, partitions, time travel)"]
 Stream["Event streaming / CDC"]
 Compute["Batch + stream compute\n(Spark/Flink/dbt/etc.)"]
 Query["Query federation\n(Trino/warehouse SQL)"]
 APIs["Serving ports\n(SQL, API, files, streams)"]
 end

 subgraph AIPlane["AI/GenAI plane"]
 FS["Feature store\n(offline + online)"]
 Emb["Embedding pipeline + vector index"]
 Train["Training pipelines + evaluation"]
 Registry["Model registry"]
 Serve["Model/LLM serving\n(KServe/Triton/etc.)"]
 end

 D1 --> CI
 D2 --> CI
 D3 --> CI

 CI --> Compute
 Prov --> DataPlane
 Compute --> Lake
 Stream --> Lake
 Lake --> Query
 Query --> APIs

 Catalog --> D1
 Catalog --> D2
 Catalog --> D3
 Lineage --> Catalog
 Policy --> APIs
 Obs --> D1
 Obs --> D2
 Obs --> D3

 Lake --> FS
 Lake --> Emb
 FS --> Train
 Emb --> Serve
 Registry --> Serve
 Train --> Registry

This architecture is consistent with Dehghani’s emphasis that platform services manage data product life cycles, provide lineage and emergent knowledge graph experiences, and automate governance policies.

Architecture choices and trade-offs

Open table format options

Open table formats matter because they strongly influence interoperability, time travel, schema evolution, and compute decoupling. The most common options are Apache Iceberg, Delta Lake, and Apache Hudi; all three support ACID writes, schema evolution, and time travel, so the practical choice usually hinges on ecosystem fit (engines, catalogs, managed services) rather than headline features.

Governance, metadata, lineage, quality, security, and compliance

Governance model: make it computational or it won’t scale

Federated computational governance is explicitly about codifying and automating policies (security, privacy, legal compliance) at fine-grained levels across distributed products.

In practice, this means governance is enforced through:

- Data contracts (schema + semantics + SLOs + access model + quality gates)
- Policy-as-code (consistent evaluation and enforcement at build time and run time)
- Automated checks in CI/CD and platform workflows

The Open Data Contract Standard (ODCS) provides an open, structured specification for data contracts (Apache 2.0 licensed) and is explicitly used in data mesh contexts.
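A contract only helps if it is checked mechanically at build time. A minimal, framework-free sketch of such a gate; the contract layout below is a simplified illustration (declared columns with types and nullability), not the full ODCS schema:

```python
# Illustrative contract: real deployments would load this from a versioned
# ODCS-style YAML file rather than defining it inline.
contract = {
    "dataset": "commerce.orders",
    "columns": {
        "order_id": {"type": str, "nullable": False},
        "amount":   {"type": float, "nullable": False},
        "coupon":   {"type": str, "nullable": True},
    },
}

def check_contract(rows, contract):
    """Return a list of violation messages; an empty list means the gate passes."""
    violations = []
    cols = contract["columns"]
    for i, row in enumerate(rows):
        missing = set(cols) - set(row)
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
            continue
        for name, spec in cols.items():
            value = row[name]
            if value is None:
                if not spec["nullable"]:
                    violations.append(f"row {i}: {name} must not be null")
            elif not isinstance(value, spec["type"]):
                violations.append(f"row {i}: {name} has wrong type")
    return violations

good = [{"order_id": "o1", "amount": 10.0, "coupon": None}]
bad  = [{"order_id": None, "amount": "ten", "coupon": None}]
```

In CI, a non-empty violation list fails the build before the product version is published.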

Metadata catalog and lineage: the mesh’s “nervous system”

A mesh without metadata becomes undiscoverable and unsafe. That’s why self-serve platform responsibilities include mesh-level discovery and lineage experiences.

Catalog / metadata graph options

- DataHub is positioned as an open-source data catalog for the modern data stack, with official docs referencing it as LinkedIn's open-sourced metadata search/discovery platform; DataHub's docs list many adopters and case studies (including "Optum: Data Mesh via DataHub" and "Saxo Bank: Enabling Data Discovery in Data Mesh").
- OpenMetadata positions itself as documentation/discovery/governance end-to-end, and has explicit documentation sections for data quality, observability, and data contracts.

Lineage standards

- OpenLineage is an open standard for lineage metadata collection with a standard API for capturing lineage events; it includes a reference backend implementation (Marquez) and integrations with pipeline tools.
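A lineage event in the OpenLineage style can be assembled as plain JSON. The field names below follow the published RunEvent shape (eventType, eventTime, run, job, inputs, outputs) to the best of our understanding; validate against the official JSON schema before relying on this exact structure:

```python
import json
import uuid
from datetime import datetime, timezone

def run_event(job_name, inputs, outputs, namespace="mesh"):
    """Build an OpenLineage-shaped RunEvent for a completed pipeline run."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs":  [{"namespace": namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": namespace, "name": n} for n in outputs],
    }

event = run_event("build_clv", inputs=["commerce.orders"], outputs=["customer.clv"])
payload = json.dumps(event)  # what a pipeline would POST to a lineage backend such as Marquez
```

Emitting such events from every pipeline run is what lets the catalog render cross-mesh lineage automatically.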

Data quality and trust

Data as a product requires “trustworthy and truthful” characteristics and explicitly links trust to contracts and product ownership.

Great Expectations is a widely used testing/validation framework for data quality, framing data quality checks as “expectations” over data.
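The "expectations" idea itself is small enough to sketch without the library: declarative checks that return a pass/fail result plus diagnostics. This only illustrates the concept, not the Great Expectations API:

```python
# Framework-free sketch of expectation-style checks. Each check returns a
# result dict with a success flag and the offending row indices, mirroring
# the shape of a validation result without any library dependency.

def expect_column_values_not_null(rows, column):
    bad = [i for i, r in enumerate(rows) if r.get(column) is None]
    return {"success": not bad, "unexpected_rows": bad}

def expect_column_values_between(rows, column, low, high):
    bad = [i for i, r in enumerate(rows)
           if r.get(column) is not None and not (low <= r[column] <= high)]
    return {"success": not bad, "unexpected_rows": bad}

rows = [{"order_id": "o1", "amount": 10.0},
        {"order_id": "o2", "amount": -5.0}]   # negative amount should fail the range check
result = expect_column_values_between(rows, "amount", 0, 10_000)
```

In a real mesh, the suite of checks and their latest results would be published alongside the product in the catalog.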

A concrete real-world implementation: Saxo Bank’s data workbench uses Apache Kafka as a core authoritative source, and a central data management application powered by DataHub and Great Expectations to help domains publish/manage data as products with ownership, lineage, metadata, and quality. Saxo also emphasizes capture of metadata at the point of origin and enabling domains to attach quality rules and execution results to increase trust without heavy central-team lifting.

Security: least privilege + fine-grained controls + policy-as-code

Data product “secure” is a first-class characteristic in data mesh.

Practical patterns:

- Central permissioning with domain-managed publication, using a platform capability like AWS Lake Formation (fine-grained access control for data lakes) or equivalents in other clouds/warehouses.
- Policy-as-code enforcement: Open Policy Agent (OPA) is designed for policy evaluation (Rego) against inputs such as API requests, enabling consistent authorization policy across services.
- Data plane controls: encryption, row/column-level controls, tokenization/masking, audit logs, and access request workflows, implemented uniformly via the platform's "paved roads."
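The policy-as-code pattern can be illustrated without OPA: policies evaluated against a JSON-like request document, deny rules checked first, default-deny otherwise. The rule set is invented for illustration; a real deployment would express it in Rego and evaluate it with OPA:

```python
# Deny rules short-circuit; otherwise any matching allow rule grants access;
# otherwise the decision is deny (default-deny posture).
POLICIES = [
    {"effect": "deny",
     "when": lambda req: req["classification"] == "pii"
                         and not req["subject"].get("pii_training_complete")},
    {"effect": "allow",
     "when": lambda req: req["subject"]["domain"] == req["resource_domain"]},
    {"effect": "allow",
     "when": lambda req: req["action"] == "read"
                         and req["classification"] == "internal"},
]

def decide(request):
    if any(p["when"](request) for p in POLICIES if p["effect"] == "deny"):
        return "deny"
    if any(p["when"](request) for p in POLICIES if p["effect"] == "allow"):
        return "allow"
    return "deny"  # default-deny

pii_req = {"subject": {"domain": "marketing", "pii_training_complete": False},
           "action": "read", "resource_domain": "customer", "classification": "pii"}
```

The value of the pattern is that the same decision function runs at build time (CI) and at run time (serving ports), so policy cannot drift between the two.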

Compliance: align data mesh + AI governance to evolving regulation

This section is necessarily high-level because compliance depends on industry, geography, and data types (PII/PHI/financial data).

EU AI Act
The official AI Act is Regulation (EU) 2024/1689. The European Commission states it entered into force on 1 August 2024. Many obligations apply in phases; legal analyses summarize that the majority of provisions become applicable in 2026, with different dates for prohibited practices and other categories.

GDPR
GDPR is Regulation (EU) 2016/679. Operationally, data products must classify and handle "personal data" appropriately as defined in Article 4.

HIPAA (US healthcare)
HHS guidance explains two de-identification methods under the HIPAA Privacy Rule: Expert Determination and Safe Harbor. If your mesh supports PHI access, platform guardrails must enforce these requirements.

AI risk management
NIST AI RMF is a voluntary framework intended to help organizations manage AI risks and promote trustworthy AI. NIST's GenAI Profile is a companion resource for generative AI systems. In practice, these should translate into model/data documentation (model cards, datasheets), evaluation requirements, and monitored controls.

AI and GenAI integration patterns for data products in a mesh

Integration pattern map

A useful way to integrate AI into a data mesh is to treat AI capabilities as consumers and producers of data products:

AI systems consume domain data products (training sets, features, embeddings, labels, ground truth).

AI systems produce new data products (predictions, scores, embeddings, synthetic data, evaluation artifacts).

GenAI systems add a third layer: they are “data consumers” for context (RAG) and also “metadata producers” (documentation, classification suggestions), which must be governed.

Training datasets and MLOps as mesh-native products

Google’s MLOps guidance emphasizes that an ML system is a software system and that CI/CD practices can apply to reliably build and operate ML systems at scale.

In a mesh, treat these as first-class products:

- Training dataset product: versioned snapshots, with datasheet-like metadata and lineage.
- Feature product(s): consistent definitions reusable across models.
- Model artifact product: registered, versioned, with evaluation reports (model cards).

Model Cards propose standardized documentation for trained models to clarify intended use, performance characteristics, and evaluation. Datasheets propose standardized documentation for datasets (motivation, composition, collection process, recommended uses).
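A model card can double as a machine-readable registration gate. A sketch loosely following the sections proposed in the Model Cards paper (intended use, metrics, caveats); the field names and publishability rule are illustrative, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class ModelCard:
    model_name: str
    version: str
    intended_use: str
    training_data: list                      # lineage: upstream data product addresses
    metrics: dict = field(default_factory=dict)
    caveats: list = field(default_factory=list)

    def is_publishable(self) -> bool:
        # Illustrative platform gate: refuse registry entry without
        # training-data lineage and at least one evaluation metric.
        return bool(self.training_data) and bool(self.metrics)

card = ModelCard(
    model_name="clv-regressor",
    version="1.2.0",
    intended_use="Rank customers by expected 12-month value for campaign targeting",
    training_data=["commerce.orders@v7", "customer.profile@v3"],
    metrics={"mae": 14.2, "r2": 0.61},
    caveats=["Not evaluated on customers with <30 days of history"],
)
```

Wiring such a gate into the registry makes documentation a precondition of deployment rather than an afterthought.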

Feature stores: connecting domain products to consistent online/offline features

Feast describes itself as an open-source feature store enabling teams to define, manage, discover, and serve features. Offline stores are used to build training datasets and materialize features into an online store for low-latency serving.

Mesh design implication: features should be produced/owned by the domain best positioned to define them, while the platform standardizes:

- feature definitions and metadata standards,
- point-in-time correctness for training set generation,
- consistency between training and serving.
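Point-in-time correctness is the subtle one of the three, so here is a minimal sketch: for each labeled event, take only the latest feature value observed strictly before the label timestamp, so future information never leaks into training rows. The data shapes are illustrative (timestamped dicts standing in for a feature store's offline tables):

```python
def point_in_time_join(labels, feature_rows):
    """labels: [{entity, ts, label}]; feature_rows: [{entity, ts, value}].
    Returns training rows using only features known before each label's ts."""
    training = []
    for lab in labels:
        candidates = [f for f in feature_rows
                      if f["entity"] == lab["entity"] and f["ts"] < lab["ts"]]
        feature = max(candidates, key=lambda f: f["ts"])["value"] if candidates else None
        training.append({"entity": lab["entity"], "feature": feature, "label": lab["label"]})
    return training

features = [{"entity": "c1", "ts": 1, "value": 0.2},
            {"entity": "c1", "ts": 5, "value": 0.9}]   # ts=5 is AFTER the label below
labels = [{"entity": "c1", "ts": 3, "label": 1}]
rows = point_in_time_join(labels, features)            # must pick 0.2, not the leaky 0.9
```

Feature stores such as Feast implement exactly this as-of semantics at scale; the sketch just shows why it matters.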

RAG: turning data products into “knowledge products” for GenAI

Retrieval-Augmented Generation (RAG) is a well-established research direction where generation is augmented by retrieving relevant documents; the original RAG paper frames it as combining parametric generation with a non-parametric memory (retriever + knowledge source).

A pragmatic industry integration uses:

1. Embedding pipeline data products (document corpora, chunking rules, embedding model version)
2. A vector index (vector database or relational extension)
3. A retrieval + generation service for user-facing applications

Vector database and similarity search foundations matter. FAISS provides research and implementation for billion-scale similarity search and underpins many ANN retrieval stacks. For “keep vectors where your data is,” pgvector is an open-source PostgreSQL extension for vector similarity search. Milvus positions itself as an open-source vector database built for GenAI applications.
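The retrieval core these systems optimize is small: cosine-similarity top-k over an index. A brute-force sketch with made-up document IDs and vectors (real corpora need the approximate-nearest-neighbor indexes that FAISS, pgvector, and Milvus provide):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k(query_vec, index, k=2):
    """index: list of (doc_id, vector); returns the k best-matching doc_ids."""
    scored = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

index = [("refund-policy", [0.9, 0.1]),
         ("shipping-faq",  [0.1, 0.9]),
         ("returns-howto", [0.8, 0.3])]
hits = top_k([1.0, 0.2], index, k=2)
```

In a mesh, the index itself is a data product: its embedding model version, chunking rules, and source corpus are part of its contract.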

RAG data flow diagram (mesh-aligned)

flowchart LR
 subgraph Domain["Domain data products"]
 DP["Data Product: Policies/Procedures\n(Data+Metadata+Contract)"]
 KP["Data Product: Knowledge Corpus\n(doc chunks + metadata)"]
 end

 subgraph Indexing["Indexing pipeline (platform-paved road)"]
 Clean["Clean & normalize"]
 Chunk["Chunk + enrich metadata"]
 Embed["Embed (embedding model vN)"]
 Store["Vector store / index"]
 end

 subgraph Runtime["Runtime (GenAI application)"]
 Query["User query"]
 Retrieve["Retriever (top-k)"]
 Context["Context pack\n(citations, filters)"]
 LLM["LLM/GenAI model"]
 Answer["Answer + citations"]
 end

 DP --> KP --> Clean --> Chunk --> Embed --> Store
 Query --> Retrieve --> Store
 Retrieve --> Context --> LLM --> Answer

Data labeling and synthetic data as products

Data labeling
Label Studio is positioned as a flexible open-source data labeling tool for multiple data types, used to prepare training data for CV/NLP/speech/video models. Snorkel's research argues labeling is a major bottleneck and introduces "weak supervision" via labeling functions to generate training labels without hand-labeling every example.

In a mesh, treat "labels" as products:

- Label definitions, taxonomies, and guidelines (business-owned)
- Label datasets with versioning + lineage to raw sources
- Label quality metrics (inter-annotator agreement, drift)

Synthetic data
SDV (Synthetic Data Vault) is a library intended for creating tabular synthetic data, training generative models to emulate patterns from real data. Synthetic data can help with scarce-class augmentation and safer sharing, but it is not automatically privacy-safe; governance must explicitly classify and risk-assess synthetic outputs (especially for re-identification risks).

Prompt engineering, prompt security, and “LLMOps”

Prompting is treated by providers as a disciplined engineering activity: OpenAI describes prompting as both art and science and provides strategies to get more consistent output. Anthropic’s prompt engineering overview stresses defining success criteria and building empirical evaluations before iterating.

Security is non-optional. OWASP’s Top 10 for LLM Applications (v1.1) lists categories including prompt injection, insecure output handling, training data poisoning, model DoS, supply chain vulnerabilities, sensitive info disclosure, insecure plugin design, excessive agency, overreliance, and model theft. Academic work on prompt injection against LLM-integrated applications shows the practical security risks of prompt injection in real systems.

Mesh implication: prompts, tools, and retrieval policies should be treated as governed artifacts:

- prompts are versioned and tested,
- tool access is least-privilege and sandboxed,
- retrieval sources are controlled by data product policies.
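Versioned-and-tested prompts can be sketched as a registry plus a small evaluation harness with explicit success criteria. Everything here is illustrative; the fake model function is a stand-in for a real LLM API call so the harness runs:

```python
# Prompts live in a versioned registry keyed by (name, version), like any
# other governed artifact; releases are gated on evaluation results.
PROMPT_REGISTRY = {
    ("summarize_incident", "v2"):
        "Summarize the incident below in one sentence, "
        "citing the affected data product.\n\n{incident}",
}

def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call, so this sketch is self-contained.
    return "Freshness SLO breach on commerce.orders caused by upstream schema change."

def evaluate(prompt_key, cases, model=fake_model):
    """cases: [{vars, required_substrings}] -> fraction of cases passing."""
    template = PROMPT_REGISTRY[prompt_key]
    passed = 0
    for case in cases:
        output = model(template.format(**case["vars"]))
        if all(s in output for s in case["required_substrings"]):
            passed += 1
    return passed / len(cases)

score = evaluate(("summarize_incident", "v2"),
                 [{"vars": {"incident": "orders feed 6h late"},
                   "required_substrings": ["commerce.orders"]}])
```

Real harnesses would use labeled evaluation sets and richer scoring (groundedness, refusal rate), but the gating mechanics are the same.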

Model serving and inference at edge: mesh-aware deployment

Kubernetes-native serving
KServe is positioned as a Kubernetes-native platform for serving predictive and generative AI models with standardized protocols, scaling to zero, GPU acceleration, and multi-framework support; it is also a CNCF incubating project. Seldon Core 2 positions itself as an MLOps/LLMOps framework for deploying and scaling AI systems in Kubernetes, including LLM systems.

High-performance inference
NVIDIA Triton Inference Server supports inference across cloud, data center, edge, and embedded devices, and supports multiple hardware backends (GPUs, CPUs, Inferentia).

Edge inference
ONNX Runtime documentation explicitly covers deploying ML models to IoT and edge devices, with considerations for when on-device inference is appropriate.

Mesh implication: serve models as products with:

- runtime SLOs (latency, availability),
- traceable lineage (training data products + feature products),
- controlled access to downstream data products.

Case studies, implementation roadmap, KPIs, costs, risks, and templates

Real-world examples and lessons

Zalando (data mesh in practice)
A widely shared deck describes a mindset shift from centralized ownership/data-lake thinking to decentralized ownership and cross-functional domain-data teams, emphasizing "data as a product" and an "ecosystem of data products," and framing data quality as a "contract between consumer and producer."

Saxo Bank (data mesh discovery + governance tooling)
Saxo's evolving architecture places Apache Kafka at the core as an authoritative source and uses a domain-enabling "Data Workbench" powered by DataHub and Great Expectations to provide discovery, ownership, lineage, metadata, and data quality. They explicitly capture metadata at the point of origin and tie the definition of a data product to its published form in Kafka plus its derived representation in Snowflake.

SSENSE (central bottleneck → data mesh extension of domain architecture)
SSENSE describes centralized data collection/experimentation creating a bottleneck and "pipeline havoc" from upstream changes, plus difficulty finding valid data. They cite adopting microservices/domain bounded contexts and viewing data mesh as a relatively straightforward extension, requiring decomposing and modeling analytical data within each domain.

HelloFresh (learning from "data mess" to platform-led adoption)
A re:Invent session summary describes an initial "data mesh → data mess" attempt due to lacking a playbook and domain/global model clarity, followed by building a platform with easier onboarding, abstraction of infra complexity, cost optimization, automated checks (including data contracts), and adoption KPIs. This source is a summary rather than official slides; treat details as directional unless validated against first-party materials.

Implementation roadmap and milestones

A practical roadmap should build the minimum viable mesh (platform + governance + first products), then scale by replication. Dehghani’s principle interplay explicitly warns that decentralization without platform increases cost and duplicated effort; the platform is intended to reduce cognitive load and total cost of ownership.

Milestone-based roadmap (parameterized)

Foundation phase (weeks 0–8)

- Domain mapping & "mesh charter": define domains, initial product candidates, and funding model.
- Define minimum governance standards: naming, identity/access patterns, contract schema, required metadata fields, SLO templates.
- Choose baseline stack (lakehouse vs warehouse vs hybrid) and catalog/lineage approach.

Platform MVP phase (months 2–5)

- Paved road v1: repo template, CI/CD pipeline, standard contract checks, catalog registration automation, lineage instrumentation (OpenLineage), basic observability.
- Security baseline: least-privilege patterns, policy-as-code enforcement, audit logging.
- A data product "golden path" that a domain team can use with minimal platform hand-holding.

First domain products phase (months 3–8)

- Ship 3–6 data products in 1–2 domains, working backwards from 1–3 concrete use cases (per the "Designing data products" method).
- Add quality gates (Great Expectations or equivalent) and publish SLO dashboards.

Scale-out phase (months 6–18)

- Expand to additional domains; seed domain enablement; reduce central platform toil.
- Standardize interoperability via federated governance decisions, embedded computationally.
- AI enablement: feature store integration, model registry, serving platform, embedding/RAG pipeline.

Mermaid timeline (illustrative)

gantt
 title Data Mesh + AI Data Products Roadmap (illustrative)
 dateFormat YYYY-MM-DD
 axisFormat %b %Y

 section Foundation
 Domain mapping + mesh charter :a1, 2026-03-01, 6w
 Governance minimum standards :a2, 2026-03-15, 6w

 section Platform MVP
 Paved road (CI/CD + contracts + catalog):b1, 2026-04-15, 10w
 Lineage + observability baseline :b2, 2026-05-01, 10w
 Security baseline (policy-as-code) :b3, 2026-04-15, 12w

 section First products
 3-6 data products in 1-2 domains :c1, 2026-05-15, 14w
 Consumer onboarding + SLO dashboards :c2, 2026-06-15, 12w

 section AI enablement
 Feature store integration :d1, 2026-07-01, 12w
 RAG/embedding pipeline (pilot) :d2, 2026-08-01, 12w
 Model serving + registry :d3, 2026-08-15, 14w

 section Scale
 Expand to more domains + governance ops :e1, 2026-09-15, 24w

KPIs for data products and for AI integration

Data mesh success is measurable when it reduces friction, increases reuse, and improves trust. Saxo explicitly cites lowering barriers to data democratization via self-service and automation and improving discovery and trust through metadata and quality rule attachment.

Recommended KPI families:

Adoption & usability

- Number of active data products; active consumers; % of domains producing at least one product
- Time-to-discover (median time from "need" to "found usable product")
- Consumer satisfaction (lightweight surveys + usage retention)

Reliability & trust

- SLO compliance per product (freshness, uptime, completeness)
- Data incident rate and MTTR for data-product incidents
- % of products with complete contracts + lineage + owners + classification

Delivery throughput

- Lead time from request → first usable product release
- Release frequency and change failure rate (data pipeline failures)

Cost efficiency

- Cost per product (infra + ops) and cost per query/training run
- Storage bloat / compaction efficiency (lakehouse)
- Feature store online serving cost per 1K predictions (if applicable)

AI-specific

- Model performance metrics + drift (monitored)
- Retrieval metrics for RAG (precision@k / recall@k proxies where you have labeled sets)
- "Groundedness"/hallucination rate (measured via evaluations)
- Security events: prompt injection attempts caught, sensitive info leakage incidents
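The retrieval metrics named above are straightforward to compute once a labeled set exists. A sketch of precision@k and recall@k over illustrative data:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved items that are relevant."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant items that appear in the top-k."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / len(relevant)

retrieved = ["doc1", "doc9", "doc3", "doc7"]  # ranked retriever output
relevant = {"doc1", "doc3"}                    # labeled ground truth for this query
p = precision_at_k(retrieved, relevant, k=3)   # 2 of the top-3 are relevant
r = recall_at_k(retrieved, relevant, k=3)      # both relevant docs were retrieved
```

Averaging these over a labeled query set gives a trackable KPI for each versioned embedding pipeline and index.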

Cost and effort estimates

Because org size/budget are unspecified, below are conservative FTE-month ranges (excluding major vendor licensing spend). These ranges assume you reuse existing cloud infrastructure and focus on building an internal platform team plus domain pilots.

Key drivers of cost: access control complexity, legacy integration, data quality maturity, and compliance requirements (GDPR/AI Act/industry-specific).

Key risks and mitigations

Decentralization without platform maturity → fragmentation
Mitigate by enforcing paved roads: templates, automated contract checks, computed policies, and required metadata.

Central team becomes a bottleneck for governance documentation
Saxo explicitly identifies the previous-generation approach (analysts reverse-engineering the landscape) as a bottleneck; mitigate by capturing metadata at origin and enabling domains to attach metadata and quality rules through self-service.

GenAI security failures (prompt injection, sensitive info disclosure, excessive agency)
The OWASP LLM Top 10 provides a practical categorization of threats; academic work demonstrates prompt injection risk in real applications. Mitigate via tool sandboxing, least privilege, output validation, and evaluation harnesses.

Overreliance on GenAI for correctness
OWASP explicitly identifies "Overreliance" as a risk category; mitigate with human-in-the-loop review for high-impact changes and automatically verifiable checks (tests, reconciliations).

Regulatory drift and uncertainty
EU AI Act obligations are phased; governance must be adaptable. Track official sources and implement configurable policy-as-code rather than manual, one-off compliance processes.

Templates and checklists

Data product design template

Identity and ownership

- Product name and unique address (URI)
- Owning domain, accountable owner(s), escalation path
- Consumer personas and key use cases (work backwards from at least one concrete use case)

Contract

- Inputs (sources) and output ports (SQL tables/views, APIs, streams, files)
- Schema (including keys) + semantic definitions + glossary mapping
- Change management policy (versioning, deprecation)
- Privacy/security classification and access policy

SLOs / SLIs

- Freshness, availability, completeness, correctness thresholds
- Measurement mechanism and publication method (catalog + dashboard)

Quality

- Validation rules (e.g., a Great Expectations suite)
- Reconciliation rules (source-to-product checks)
- Incident response runbook

Lineage and metadata

- Lineage instrumentation method (OpenLineage events)
- Required metadata fields for discoverability and understanding (owner, description, samples)

AI integration checklist for a data product

AI purpose

- Is the product (a) model input, (b) model output, (c) a feature/embedding/label, or (d) an evaluation artifact?

Training readiness

- Dataset datasheet exists (collection, biases, intended use)
- Privacy rules satisfied (GDPR/HIPAA if relevant)
- Point-in-time correctness and leakage checks (especially for features)

MLOps

- CI/CD/CT pipeline defined (build → train → evaluate → register → deploy)
- Model registry entry created with lineage and evaluation artifacts (MLflow)
- Model card created for each production model

Serving

- Serving SLOs defined and monitored
- Serving platform selected (KServe/Seldon/Triton)
- Edge inference requirements evaluated (ONNX Runtime, latency/offline constraints)

GenAI / RAG

- Embedding pipeline is versioned; chunking rules and metadata standards defined
- Vector store selection justified (pgvector vs Milvus vs other), with a backup/restore strategy
- Prompt evaluation harness exists; success criteria defined (provider guidance)
- Security controls implemented for OWASP threat categories (prompt injection, sensitive info disclosure, excessive agency, etc.)

Mermaid chart: domain relationships as a starting point

flowchart LR
 subgraph Commerce["Commerce domain"]
 Orders["Orders product\n(order facts)"]
 Returns["Returns product\n(return facts)"]
 end

 subgraph Customer["Customer domain"]
 Profile["Customer profile product"]
 CLV["Customer lifetime value product\n(derived)"]
 end

 subgraph Marketing["Marketing domain"]
 Campaign["Campaign performance product"]
 end

 Orders --> CLV
 Returns --> CLV
 Profile --> CLV
 CLV --> Campaign

This mirrors the “work backwards from use cases” portfolio mapping approach described in “Designing data products.”