From Seventeen Data Sources to One Governed Lakehouse: An AWS Architecture Guide

From Scattered Silos to Unified Intelligence

Jun 12, 2026

A retail company I advised last year had seventeen data sources feeding into five different analytics tools. Point-of-sale transactions from 40 physical stores landed in one database. Their ecommerce platform wrote to another. Social media engagement metrics sat in a third-party dashboard nobody queried. Customer service logs lived in yet another system. When the marketing team wanted to understand which in-store promotions drove online purchases, nobody could answer the question. The data existed, but it was scattered across silos with incompatible schemas, different retention policies, and no unified access control.

They spent four months building a centralized data warehouse. It worked for structured transactional data. But when they tried to incorporate social media feeds, clickstream logs, IoT sensor data from their warehouses, and unstructured customer feedback, the warehouse hit its limits. Storage costs climbed. Ingestion pipelines broke when schemas changed. The data engineering team spent more time maintaining pipelines than delivering insights.

This is the pattern I see repeatedly in companies that operate across physical and digital channels. The volume and variety of data outgrows any single system. The solution is not another database. It is a data lakehouse architecture that combines the flexibility of a data lake with the governance and query performance of a warehouse, built on open table formats that prevent vendor lock-in.

Why the Lakehouse Architecture Replaced the Pure Data Lake

The original data lake concept was simple: dump everything into S3, figure out structure later. That worked until teams needed to actually query the data. Without schema enforcement, ACID transactions, or proper cataloging, data lakes became data swamps. Files accumulated with no clear lineage, duplicates proliferated, and anyone who needed answers had to understand the physical layout of files across thousands of S3 prefixes.

Data warehouses solved the query problem but created a different one. Moving data from the lake into Redshift or another warehouse meant maintaining ETL pipelines that duplicated data, introduced latency, and required constant maintenance as source schemas evolved. Companies ended up paying for storage twice and reconciling discrepancies between lake and warehouse copies.

The lakehouse eliminates this dichotomy. AWS has standardized much of its managed lakehouse roadmap around Apache Iceberg, with S3 Tables and SageMaker Lakehouse built natively on the format. Iceberg brings database-like capabilities directly to files stored in S3: ACID transactions, schema evolution, time travel, partition evolution, and row-level operations happen on data in place, without copying it into a separate system. AWS formalized this architecture with SageMaker Lakehouse at re:Invent 2024, unifying S3 data lakes and Redshift warehouses under a single governance layer.

For an omnichannel retailer, this means point-of-sale transactions, ecommerce events, loyalty program activity, supply chain data, social media metrics, and customer service interactions all live in one governed storage layer. Analysts query with Athena. Data scientists access training data through SageMaker. Business intelligence dashboards run on Redshift. All engines read from the same copy of data, with consistent permissions enforced by Lake Formation.

Storage Layer: S3 Tables and the Iceberg Foundation

AWS launched S3 Tables in December 2024 as the first cloud object store with native Apache Iceberg support. Compared with self-managed Iceberg tables on standard S3 buckets, S3 Tables delivers up to 3x faster query performance and up to 10x higher transactions per second. Automatic maintenance (compaction, snapshot management, and orphaned file cleanup) happens without operational overhead.

For a commerce operation generating millions of transactions daily, this matters. Each point-of-sale event, each page view, each cart addition creates records that need to be queryable within minutes. S3 Tables handles the compaction that keeps query performance stable as data accumulates. Without automatic compaction, small files proliferate and queries slow down. Glue ETL jobs that previously ran nightly to compact files become unnecessary.

Storage Tiering Strategy

Not all data needs the same performance tier. A practical tiering strategy for retail data looks like this:

S3 Express One Zone serves data that requires consistent single-digit millisecond access: ML training datasets being actively used, intermediate pipeline results, and hot analytics tables under active experimentation. Pricing is region-specific (as low as $0.11 per GB per month in us-east-1 after the April 2025 reduction, higher in other regions), so it makes economic sense primarily for data accessed frequently by co-located compute. A SageMaker training job reading the same dataset for hundreds of epochs benefits enormously. A quarterly compliance report does not.

S3 Tables (standard storage class) holds the operational lakehouse: current and recent transactional data, active customer profiles, inventory positions, and real-time aggregations. Storage starts around $0.0265 per GB per month, with separate monitoring, request, and maintenance charges. It provides the performance and governance features needed for production analytics.

S3 Intelligent-Tiering covers data with unpredictable access patterns. Historical transactions older than 90 days, past campaign performance data, and archived social media metrics automatically move between frequent, infrequent (40% savings after 30 days), and archive instant access (68% savings after 90 days) tiers based on actual usage. No retrieval charges apply. Companies like Stripe report approximately 30% savings from this single configuration.

S3 Glacier Deep Archive at $0.00099 per GB per month stores data retained purely for regulatory compliance: seven-year transaction records required by tax authorities, historical customer consent records, and audit trails that might never be accessed again but must remain retrievable.
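To get a feel for how the tiers compare, here is a minimal sketch that totals storage-only cost from the per-GB prices quoted above. The Intelligent-Tiering base price of $0.023 per GB-month is my assumption (the text gives only the relative savings of the colder tiers), and the tier names are illustrative labels, not AWS identifiers.

```python
# Per-GB monthly prices quoted in the text (us-east-1 where stated).
PRICE_PER_GB_MONTH = {
    "express_one_zone": 0.11,
    "s3_tables": 0.0265,
    "glacier_deep_archive": 0.00099,
}

# ASSUMPTION: base price of the Intelligent-Tiering frequent tier.
INTELLIGENT_TIERING_FREQUENT = 0.023
# Relative savings of each Intelligent-Tiering tier, as quoted in the text.
INTELLIGENT_TIERING_SAVINGS = {
    "frequent": 0.0,
    "infrequent": 0.40,
    "archive_instant": 0.68,
}


def monthly_storage_cost(gb_by_tier: dict[str, float]) -> float:
    """Sum storage-only cost in USD; request, monitoring, and
    maintenance charges are deliberately ignored."""
    total = 0.0
    for tier, gb in gb_by_tier.items():
        if tier in PRICE_PER_GB_MONTH:
            total += gb * PRICE_PER_GB_MONTH[tier]
        elif tier in INTELLIGENT_TIERING_SAVINGS:
            discount = INTELLIGENT_TIERING_SAVINGS[tier]
            total += gb * INTELLIGENT_TIERING_FREQUENT * (1 - discount)
        else:
            raise KeyError(f"unknown tier: {tier}")
    return round(total, 2)


# Hypothetical footprint: 100 GB hot, 1 TB operational, 10 TB archived.
example = monthly_storage_cost(
    {"express_one_zone": 100, "s3_tables": 1000, "glacier_deep_archive": 10000}
)
```

The point of the exercise is the ratio rather than the absolute numbers: under these prices, ten terabytes in Deep Archive costs less than a hundred gigabytes in Express One Zone.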

Ingestion: Getting Data from Everywhere into One Place

An omnichannel operation generates data from point-of-sale systems, ecommerce platforms, mobile apps, CRM tools, supply chain systems, social media APIs, customer service platforms, and physical sensors (foot traffic counters, RFID inventory tracking, warehouse automation). Each source has different formats, schemas, and delivery patterns.

Zero-ETL: Eliminating Custom Pipelines

AWS Zero-ETL integrations come in two flavors with different targets and latency profiles.

Zero-ETL to Amazon Redshift supports Aurora MySQL and PostgreSQL, RDS for MySQL/PostgreSQL/Oracle, DynamoDB, and self-managed databases. For relational sources like Aurora, replication is log-based (binary logging for MySQL-compatible engines, logical replication for PostgreSQL-compatible ones) and delivers changes within seconds. This path is ideal for BI workloads that need near-real-time data in a warehouse.

Zero-ETL to S3/SageMaker Lakehouse (via Glue) delivers data as Apache Iceberg tables in S3. Supported sources include a growing set: DynamoDB, Oracle Database@AWS, and SaaS applications such as Salesforce, SAP, ServiceNow, Zendesk, Facebook Ads, and Instagram Ads. DynamoDB integration lands data incrementally every 15-30 minutes, not in real-time. SaaS sources sync on configurable schedules.

For a retailer running their ecommerce platform on Aurora PostgreSQL (zero-ETL to Redshift for dashboards) and their inventory system on DynamoDB (zero-ETL to S3 Iceberg for ML and broad analytics), this combination means product catalog changes, order records, and stock levels flow into the lakehouse automatically. No Glue jobs to author. No schema drift to debug at 3 AM. The Data Catalog tracks schema evolution as source tables change.

A marketing team’s Salesforce CRM data and their Facebook campaign metrics arrive in the same lakehouse through Glue zero-ETL connectors, queryable together, without a data engineer writing custom pipeline code.

Streaming Ingestion for Real-Time Data

Not everything moves in batches. Clickstream data from a website, real-time inventory movements, and point-of-sale events often need sub-second latency. Amazon Data Firehose delivers streaming data directly into Iceberg tables in S3, applying transformations and format conversions in flight. Kinesis Data Streams handles the highest throughput scenarios, processing millions of events per second from sources like IoT sensors in distribution centers or high-traffic promotional events.

AWS Glue for Complex Transformations

When data requires transformation beyond what zero-ETL provides, AWS Glue handles it. The latest runtime (Glue 5.1) ships with Spark 3.5.6, Iceberg 1.10.0 with format v3 support, and Iceberg materialized views. Glue 5.0, now widely adopted, brought Java 17, Python 3.11, Iceberg 1.7.1, and native fine-grained access control through Lake Formation, meaning Glue jobs respect the same column and row-level permissions as every other engine.

For cost-sensitive batch processing (daily aggregations, weekly reporting datasets, monthly compliance extracts), Glue Flex workers run at $0.29 per DPU-hour, 34% cheaper than standard, with the trade-off of variable start times.

Governance with AWS Lake Formation: The Central Nervous System

Governance is where most data lake implementations either succeed or fail. A lakehouse without proper governance becomes a liability, especially for companies handling customer personal data across multiple jurisdictions. AWS Lake Formation provides the governance layer that enforces access control across every engine that touches the data.

How Lake Formation Access Control Works

Lake Formation replaces the traditional approach of managing S3 bucket policies, IAM roles, and service-specific permissions separately for each analytics tool. Instead, permissions are defined once in Lake Formation and enforced uniformly whether the data is accessed through Athena, Redshift, EMR, Glue, or SageMaker.

The mechanism works through credential vending. When a user or service queries data through any supported engine, Lake Formation evaluates the requester’s permissions against the Data Catalog metadata. If the requester has access, Lake Formation issues temporary credentials scoped to exactly the data they are authorized to see. The underlying S3 objects remain inaccessible through direct S3 API calls unless separately authorized through IAM policies.

This dual-authorization model means both IAM policies and Lake Formation permissions must permit access for a request to succeed. Neither can override the other. A user with broad IAM permissions still cannot see data restricted by Lake Formation when querying through an integrated engine, which is why direct S3 access to the underlying buckets should be locked down separately.

Tag-Based Access Control at Scale

For a company with hundreds of tables and dozens of teams, granting permissions table-by-table becomes unmanageable. Lake Formation’s Tag-Based Access Control (LF-TBAC) solves this through metadata tags assigned to both data resources and principals.

A retail company might tag data with classification levels: sensitivity:public, sensitivity:internal, sensitivity:pii, sensitivity:restricted. Tables containing customer email addresses get tagged sensitivity:pii. Tables with aggregate sales metrics get tagged sensitivity:internal. Teams receive corresponding tag grants: the marketing analytics team gets access to sensitivity:internal and below. The customer data science team gets access to sensitivity:pii for specific use cases with additional row-level restrictions.

LF-Tag Expressions, introduced in late 2024, allow complex logical combinations. An expression like content_type:Sales AND region:LATAM grants access only to sales data from Latin American operations. These expressions compose with AND across tag keys and OR within tag values, supporting up to 1,000 reusable expressions per account.
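The AND-across-keys, OR-within-values rule is easy to misread, so here is a small in-memory sketch of the matching logic. It is a toy model, not the Lake Formation API: the `matches` function and the dictionary representation are my own, and it assumes each resource carries one value per tag key.

```python
def matches(expression: dict[str, set[str]], resource_tags: dict[str, str]) -> bool:
    """Return True when the resource satisfies the tag expression.

    Every key in the expression must be present on the resource (AND),
    and the resource's value for that key must be one of the allowed
    values (OR).
    """
    return all(
        resource_tags.get(key) in allowed_values
        for key, allowed_values in expression.items()
    )


# The expression from the text: content_type:Sales AND region:LATAM
latam_sales = {"content_type": {"Sales"}, "region": {"LATAM"}}

# "internal and below" expressed as OR within one tag key.
internal_and_below = {"sensitivity": {"public", "internal"}}

orders_latam = {"content_type": "Sales", "region": "LATAM", "sensitivity": "internal"}
orders_emea = {"content_type": "Sales", "region": "EMEA", "sensitivity": "internal"}
customer_emails = {"content_type": "CRM", "region": "LATAM", "sensitivity": "pii"}

print(matches(latam_sales, orders_latam))             # True
print(matches(latam_sales, orders_emea))              # False
print(matches(internal_and_below, customer_emails))   # False
```

A resource missing one of the expression's keys never matches, which is the safe default: untagged tables stay invisible until someone classifies them.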

When new tables arrive in the lakehouse, assigning the appropriate tags automatically grants access to the right teams. No ticket to an admin. No waiting days for permissions. The governance is built into the catalog itself.

Attribute-Based Access Control (ABAC)

Launched in April 2025, ABAC extends the model further by using IAM session tags and principal tags rather than Lake Formation-specific tags. Any IAM principal whose tags match the required attributes gains access automatically. For organizations using identity federation (Active Directory, Okta, Identity Center), user attributes from the identity provider flow through as session tags, enabling completely dynamic access without any Lake Formation administrator action per user.

A new hire in the finance department automatically gets access to financial data because their department attribute matches. When they transfer to operations, their access shifts with their identity attributes. No manual permission grants or revocations required.

Row-Level and Column-Level Filtering

Fine-grained access control operates at three levels within Lake Formation:

Column-level security restricts which columns a principal can see. A customer analytics team might access purchase history, product preferences, and engagement scores, but not credit card numbers, home addresses, or government identification numbers. Column restrictions are defined as include or exclude lists and support nested struct columns.

Row-level security uses PartiQL filter expressions to restrict which rows a principal can access. A regional operations manager sees only transactions from their region. A franchise partner sees only data from their locations. The filter expression store_region = 'EMEA' attached to a principal means every query they run, regardless of engine, returns only EMEA rows.

Cell-level filtering combines both. Different columns are restricted depending on which row is accessed. A compliance officer might see all columns for flagged transactions but only aggregated columns for normal transactions. This is implemented through data filters that specify both a row filter expression and column restrictions simultaneously:

-- Data filter: restrict-pii-non-consented
-- Row filter: consent_status != 'explicit'
-- Excluded columns: email, phone, address, national_id

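On its own, this filter exposes only the rows for customers who did not give explicit consent, with the email, phone, address, and national_id columns hidden on those rows. To also give users full detail for customers who did consent, pair it with a second data filter (row filter consent_status = 'explicit', no excluded columns); Lake Formation generally exposes the union of the data filters granted to a principal on a table. The filters apply transparently regardless of which engine queries the data.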
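A toy in-memory model helps pin down what a data filter like restrict-pii-non-consented actually exposes. This is a sketch of the semantics only, not Lake Formation code: the `DataFilter` class, the companion filter `full-detail-consented`, and the union rule in `visible_rows` are my assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class DataFilter:
    """A row predicate plus a set of columns to hide on matching rows."""
    name: str
    row_predicate: Callable[[Row], bool]
    excluded_columns: frozenset[str] = frozenset()

    def apply(self, rows: list[Row]) -> list[Row]:
        # Keep rows that pass the predicate, then drop excluded columns.
        return [
            {col: val for col, val in row.items() if col not in self.excluded_columns}
            for row in rows
            if self.row_predicate(row)
        ]


restrict_pii_non_consented = DataFilter(
    name="restrict-pii-non-consented",
    row_predicate=lambda r: r["consent_status"] != "explicit",
    excluded_columns=frozenset({"email", "phone", "address", "national_id"}),
)

# Hypothetical companion filter: full detail for consented customers.
full_detail_consented = DataFilter(
    name="full-detail-consented",
    row_predicate=lambda r: r["consent_status"] == "explicit",
)


def visible_rows(rows: list[Row], filters: list[DataFilter]) -> list[Row]:
    """Union of what each granted filter exposes (assumed combination
    rule; the two predicates here are disjoint, so no row repeats)."""
    out: list[Row] = []
    for data_filter in filters:
        out.extend(data_filter.apply(rows))
    return out


customers = [
    {"customer_id": 1, "consent_status": "explicit", "email": "a@example.com", "segment": "gold"},
    {"customer_id": 2, "consent_status": "none", "email": "b@example.com", "segment": "silver"},
]

only_restricted = restrict_pii_non_consented.apply(customers)
combined = visible_rows(customers, [restrict_pii_non_consented, full_detail_consented])
```

With only the first filter granted, customer 1 disappears entirely and customer 2 appears without the email column. Granting both filters yields customer 2 without PII and customer 1 in full.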

Cross-Account Sharing and Federation

Large organizations separate AWS accounts by business unit, environment, or regulatory boundary. Lake Formation cross-account sharing (Version 5) enables sharing hundreds of thousands of tables across accounts using wildcard patterns, without resource association limits. Permissions travel with the data, enforced identically in the consumer account.

Catalog federation extends this to external systems. Lake Formation can federate with Iceberg catalogs in Snowflake, Databricks, or any system exposing the Iceberg REST Catalog API. Row, column, and cell-level permissions apply to federated tables identically to local ones, meaning a third-party data partner’s Iceberg tables can be governed by your Lake Formation policies without moving the data.

Hybrid Access Mode for Migration

Organizations already operating with IAM-based data access do not need to migrate all at once. Hybrid access mode allows two permission models to coexist on the same catalog objects. Principals opted into Lake Formation get fine-grained column and row restrictions. Principals not opted in continue using their existing IAM policies. This enables incremental migration, starting with the most sensitive datasets and expanding governance progressively.

Personal Data Protection: GDPR, CCPA, and Chile’s Ley 21.719

Companies operating across regions face overlapping privacy regulations that directly impact data lake architecture. The three jurisdictions most relevant for companies with operations in Europe, the United States, and Latin America each impose specific technical requirements on how personal data is stored, processed, and deleted.

European Union: GDPR Requirements for Data Lakes

GDPR’s principles of data minimization, purpose limitation, and storage limitation create specific architectural constraints for data lakes. You cannot simply store everything indefinitely and figure out purpose later. The European Data Protection Board’s Coordinated Enforcement Framework on the right to erasure (report adopted February 2026, 32 DPAs participating, 764 controllers surveyed) explicitly flagged “deletion of personal data in the context of backups” as a major technical challenge. Seven recurring problems were identified, including insufficient erasure procedures and reliance on ineffective anonymization.

Right to erasure (Article 17) requires the ability to delete an individual’s personal data upon request. In a traditional append-only data lake built on Parquet files, deletion is technically complex because files are immutable. The EDPB’s April 2025 guidelines on blockchain and immutable storage establish a clear principle: if you cannot delete from a system, identifiable personal data should not reside there in identifiable form. Apache Iceberg addresses this through merge-on-read deletes (write delete markers, compact later) and copy-on-write deletes (rewrite data files excluding the deleted rows). A logical delete alone is not physical erasure, because older snapshots still reference the original files. Three maintenance steps close the gap: rewrite_data_files to rewrite files without the marked records, expire_snapshots to prevent time travel to the pre-deletion state and delete data files that only expired snapshots referenced, and remove_orphan_files to clean up leftover files in S3 that no table metadata tracks.

Purpose limitation (Article 5) means personal data collected for ecommerce transactions cannot be repurposed for behavioral profiling without a separate lawful basis. The Criteo case (EUR 40 million fine upheld March 2026 by French courts) confirmed that large-scale behavioral profiling requires valid, informed consent. Lake Formation’s column-level and row-level controls enforce this technically: the fraud detection team accesses transaction patterns, but the marketing team cannot access the same granular personal data without explicit consent records linked to that specific purpose.

Data Protection Impact Assessments (DPIAs) are mandatory before processing personal data at large scale. Building a data lakehouse that unifies customer data from physical stores, ecommerce, mobile apps, and social media qualifies as large-scale processing. The DPIA must document the technical and organizational measures that protect the data, which Lake Formation’s centralized governance makes auditable.

Cross-border transfers under the EU-US Data Privacy Framework (2023) are permitted to certified US organizations, but transfers to regions without an adequacy decision require Standard Contractual Clauses or Binding Corporate Rules. AWS Regions in the EU (Frankfurt, Ireland, Paris, Stockholm, Milan, Spain) allow data residency within the EU; the Zurich Region is in Switzerland, which is outside the EU but covered by an EU adequacy decision. S3 bucket policies and Lake Formation can enforce that PII never leaves EU Region boundaries while aggregate analytics flow to global dashboards.

United States: CCPA/CPRA and the Expanding State Privacy Patchwork

As of mid-2026, over twenty US states have enacted comprehensive privacy laws. California’s CPRA remains the strictest. New CPRA regulations covering cybersecurity audits, risk assessments, and automated decision-making technology were finalized in late 2025, with compliance timelines phasing in through 2027 depending on company revenue and the specific obligation. The California Privacy Protection Agency (now CalPrivacy) issued a $1.35 million fine against a large retailer in September 2025, and subsequently launched the Data Broker Enforcement Strike Force in November 2025 to escalate enforcement of data broker registration and deletion obligations. There is no federal comprehensive privacy law; the patchwork continues expanding.

Approximately twelve states now require honoring Universal Opt-Out signals (Global Privacy Control): California, Colorado, Connecticut, Montana, Texas, Delaware, Oregon, Minnesota, New Jersey, Nebraska, Maryland, and others. This creates an architectural requirement at the data collection layer: opt-out signals must be captured in real-time and propagated through to the data lakehouse, where they drive row-level filtering for downstream analytics and AI systems.

Right to delete lets consumers request deletion of their personal information, with a 45-day response window (extendable to 90 in California). Deletion must propagate through all downstream systems, backups, and analytics stores. Iceberg’s delete operations, tracked through snapshot history, provide an auditable deletion trail.

Right to opt out of sale/sharing requires technical controls that prevent opted-out consumers’ data from flowing to third-party analytics, advertising partners, or cross-context behavioral profiling. California’s “sale” definition is the broadest: monetary OR other valuable consideration, plus a separate “sharing” definition covering behavioral advertising even without payment. Lake Formation row-level filters enforce this: a filter expression like ccpa_opt_out != true on tables shared with advertising analytics ensures opted-out records never appear in those queries.

Data minimization varies by state. Maryland’s MODPA is the strictest, requiring that only data “reasonably necessary” for the disclosed purpose may be collected, which limits what you can even ingest into a data lake. Architecturally, this means separate storage zones with different retention policies. Transactional data necessary for order fulfillment lives in one zone with a 7-year retention. Behavioral clickstream data for personalization lives in another with a 13-month retention. S3 lifecycle policies automate the expiration.
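The two retention zones above can be sketched as an S3 lifecycle configuration. The prefixes and rule IDs are hypothetical, and 13 months is approximated as 396 days. I am assuming these rules target raw landing-zone objects in a general purpose bucket; expiring files underneath a live Iceberg table would leave its metadata pointing at missing objects, so retention there is better handled with row deletes and snapshot expiry.

```python
def expiration_rule(rule_id: str, prefix: str, days: int) -> dict:
    """Build one lifecycle rule that expires objects under a prefix."""
    return {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Expiration": {"Days": days},
    }


# Hypothetical zone prefixes mapped to the retention periods in the text.
RETENTION_DAYS = {
    "transactional/": 7 * 365,  # 7 years for order-fulfillment records
    "clickstream/": 396,        # roughly 13 months for behavioral data
}

lifecycle_configuration = {
    "Rules": [
        expiration_rule(f"expire-{prefix.rstrip('/')}", prefix, days)
        for prefix, days in RETENTION_DAYS.items()
    ]
}

# With boto3 installed and credentials configured, this dictionary has the
# shape expected by the LifecycleConfiguration argument of
# s3_client.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=...).
```

Keeping the retention periods in one mapping makes the policy reviewable by legal and compliance teams without reading infrastructure code.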

Chile: Ley 21.719 and Latin American Data Protection

Chile’s Personal Data Protection Law (Ley 21.719), published December 13, 2024, takes full effect on December 1, 2026. The 24-month transition period means organizations operating in Chile need their technical controls in place within months, not years. The law establishes the Agencia de Proteccion de Datos Personales as the enforcement authority and aligns Chile’s framework with GDPR-level requirements for the first time.

Lawful bases for processing mirror GDPR’s framework with six bases: consent, contractual necessity, legal obligation, vital interests, public interest, and legitimate interest, plus a Chile-specific basis for “economic/financial obligations.” Companies must document which basis applies to each processing activity. In the lakehouse architecture, this maps to metadata in the Data Catalog, associating each table and column with its documented lawful basis.

Data subject rights include access, rectification, deletion, portability (in “electronic, structured, generic and commonly used format,” which directly impacts data lake export capabilities), and the right to object to automated decision-making and profiling for marketing purposes. The deletion requirement creates the same architectural need as GDPR: Iceberg equality deletes triggered by verified data subject requests, with audit trails in CloudTrail proving compliance.

Cross-border transfer restrictions require one of three mechanisms: adequacy determination by the Agency, contractual clauses or Binding Corporate Rules with “adequate safeguards,” or certified compliance models. Controllers must publicly disclose whether data is transferred internationally and what safeguards exist. AWS’s Sao Paulo Region serves Latin American data residency requirements, though Chile does not yet have a local AWS Region. Organizations must evaluate transfer mechanisms and document them before the December 2026 enforcement date.

Penalties scale significantly. Minor infractions face fines up to approximately $397,000 USD (5,000 UTM), serious infractions up to $794,000 USD (10,000 UTM), and very serious infractions up to $1.59 million USD (20,000 UTM). For recidivist companies (two or more sanctions within 30 months), fines escalate to triple the base amount, and for large companies, up to 2-4% of prior year gross revenues. This revenue-based penalty structure gives the law real teeth for multinational retailers operating in Chile.

Breach notification must occur within 72 hours of becoming aware of a breach, with enhanced notification requirements for sensitive data breaches that require direct communication to affected individuals. Failure to notify the Agencia within this window triggers serious infractions under the penalty framework.

Technical Controls That Satisfy All Three Frameworks

The architectural patterns that satisfy privacy regulations across all three jurisdictions share common technical controls:

Data discovery and classification with Amazon Macie provides ML-based PII detection across S3 buckets. Macie continuously monitors for sensitive data patterns (credit card numbers, national identification numbers, email addresses, health information) and flags datasets that require elevated protection. AWS Glue’s Sensitive Data Detection transform identifies PII during ETL processing, with remediation options to redact, hash, or encrypt sensitive fields before data reaches the curated layer. For a retailer ingesting data from dozens of sources, automated classification catches PII in unexpected places, like customer names embedded in free-text order notes or phone numbers in delivery instructions.

Encryption at rest (SSE-KMS with customer-managed keys) and in transit (TLS 1.2+). Separate KMS keys per data classification level ensure that compromising one key does not expose all personal data. Lake Formation’s service role requires explicit KMS decrypt permissions, creating an additional authorization boundary. For the highest sensitivity requirements, DSSE-KMS provides dual-layer encryption meeting FIPS 140-2 Level 3 standards.

Pseudonymization through tokenization of direct identifiers (names, emails, phone numbers) before data enters the lakehouse. AWS Glue supports multiple techniques: SHA-256 hashing, HMAC with salt, format-preserving encryption, and KMS-based encryption of specific fields. The mapping between tokens and real identifiers lives in a separate, restricted data store. Analytics operate on pseudonymized data by default. Re-identification requires explicit access to the mapping, controlled through Lake Formation’s cell-level filtering. The EDPB’s January 2025 guidelines on pseudonymization confirm this as a valid technique for reducing identifiability while maintaining analytical utility.
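As a minimal illustration of the keyed-hash variant, the sketch below pseudonymizes an identifier with HMAC-SHA-256 from the Python standard library. It is a generic example rather than the Glue transform itself. The function name is mine, the lowercase normalization suits emails but is an assumption for other fields, and the key shown inline would in practice come from a secrets store.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable, keyed token for a direct identifier.

    The same input and key always produce the same token, so joins
    across tables still work, but the token cannot be recomputed
    without the key.
    """
    normalized = value.strip().lower()
    return hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


# ASSUMPTION: placeholder key for illustration only; never hard-code real keys.
demo_key = b"replace-with-key-from-a-secrets-store"

token_a = pseudonymize("Ana@Example.com ", demo_key)
token_b = pseudonymize("ana@example.com", demo_key)
token_other_key = pseudonymize("ana@example.com", b"a-different-key")
```

Using a keyed hash rather than plain SHA-256 matters because email addresses and phone numbers are guessable: without the key, an attacker cannot rebuild the token table by hashing a list of candidate values.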

Crypto-shredding provides an additional approach for data stored in formats where individual record deletion is impractical. Each customer’s data is encrypted with a customer-specific KMS key. When a deletion request arrives, destroying the key renders the data permanently unreadable without physically removing it from storage. This technique can support erasure objectives when keys are irreversibly destroyed and the residual ciphertext is no longer personal data under the applicable legal analysis. The EDPB has recognized it as a valid approach for immutable or append-only systems, though organizations should document their legal reasoning for each jurisdiction.

Automated retention enforcement through S3 lifecycle policies tied to data classification. PII-tagged data expires after the documented retention period. Regulatory archives move to Glacier Deep Archive. S3 Object Lock in Governance Mode prevents accidental deletion during retention periods while still allowing privileged deletion for regulatory requests. Consent records persist for the duration required to demonstrate lawful processing.

Audit logging through CloudTrail captures every data access event: who accessed what data, when, through which service, and from which IP address. Lake Formation’s auditable credential vending (introduced July 2024) includes IAM Identity Center user context, enabling attribution to individual humans rather than shared service roles. AWS Audit Manager automates compliance evidence collection for frameworks including GDPR, SOC 2, and PCI DSS.

Deletion pipelines triggered by data subject requests. A Step Functions workflow receives the request, identifies all tables containing the subject’s data through the Data Catalog (using Macie’s data maps to ensure completeness), executes Iceberg equality deletes across affected tables, runs rewrite_data_files and expire_snapshots to ensure physical erasure, logs the action in CloudTrail, and confirms completion within the regulatory timeframe (45 days for CCPA, “without undue delay” for GDPR and Chile’s Ley 21.719). The entire process is auditable and repeatable.
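The Iceberg portion of such a pipeline can be sketched as an ordered list of Spark SQL statements. The procedure names come from the Iceberg Spark procedures; the function, its parameters, and the catalog and table names are hypothetical. I am assuming a self-managed Iceberg table, since managed offerings such as S3 Tables run part of this maintenance themselves. Setting older_than to the current time removes all time travel history except the latest snapshot, which is deliberately aggressive.

```python
import re
from datetime import datetime, timezone

# Only allow simple identifiers as subject IDs, to avoid SQL injection.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]+")


def erasure_statements(
    catalog: str, table: str, subject_column: str, subject_id: str, now: datetime
) -> list[str]:
    """Return the ordered Spark SQL statements for one table.

    catalog, table ("db.tbl"), and subject_column are assumed to come
    from trusted configuration; subject_id is validated.
    """
    if not _SAFE_ID.fullmatch(subject_id):
        raise ValueError("subject_id contains unexpected characters")
    cutoff = now.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S") + ".000"
    return [
        # 1. Logical delete of the subject's rows.
        f"DELETE FROM {catalog}.{table} WHERE {subject_column} = '{subject_id}'",
        # 2. Rewrite any data file that has at least one delete file attached.
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', options => map('delete-file-threshold', '1'))",
        # 3. Drop snapshots that could still expose the pre-deletion state.
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', older_than => TIMESTAMP '{cutoff}', retain_last => 1)",
        # 4. Clean up files no longer tracked by table metadata.
        f"CALL {catalog}.system.remove_orphan_files(table => '{table}')",
    ]


statements = erasure_statements(
    "lake", "retail.orders", "customer_id", "c-123",
    datetime(2026, 6, 12, tzinfo=timezone.utc),
)
```

In a Step Functions workflow, each statement would typically run as its own task so that a failure in a later step can be retried without repeating the delete.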

Analytics and AI Integration: From Raw Data to Business Insight

A unified lakehouse serves multiple consumption patterns simultaneously, each engine optimized for different query profiles.

Interactive Analytics with Athena

Athena queries data directly in S3 Tables using standard SQL, charging $5 per terabyte scanned. For a retail operation running ad-hoc analyses (which promotions drove the most cross-channel conversion? which product categories show declining margins?), Athena requires zero infrastructure management. Columnar formats like Parquet, combined with Iceberg’s partition pruning, typically reduce scanned data by 90% or more compared to querying raw CSV or JSON files.

The economics favor Athena for exploratory queries run by analysts and data scientists. A complex query scanning 10 GB of optimized Parquet data costs $0.05. The same data in uncompressed JSON might require scanning 100 GB, costing $0.50. Proper Iceberg table design (partition by date, sort by frequently filtered columns) makes the difference between an analytics platform that costs hundreds of dollars per month and one that costs tens of thousands.
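The arithmetic behind those figures is simple enough to keep in a helper. This sketch uses the $5 per terabyte price from the text and assumes decimal units (1 TB = 1,000 GB), which reproduces the numbers above. It also applies the 10 MB per-query minimum that Athena bills; the query counts in the example are hypothetical.

```python
PRICE_PER_TB_USD = 5.00
GB_PER_TB = 1000      # ASSUMPTION: decimal terabytes, matching the text's figures
MIN_BILLED_GB = 0.01  # 10 MB minimum billed per query


def athena_query_cost(scanned_gb: float) -> float:
    """Cost in USD of a single query that scans `scanned_gb` gigabytes."""
    billed_gb = max(scanned_gb, MIN_BILLED_GB)
    return round(billed_gb / GB_PER_TB * PRICE_PER_TB_USD, 4)


def monthly_cost(scanned_gb_per_query: float, queries_per_month: int) -> float:
    """Monthly spend for a workload of identical queries."""
    return round(athena_query_cost(scanned_gb_per_query) * queries_per_month, 2)


parquet_query = athena_query_cost(10)    # optimized Parquet
json_query = athena_query_cost(100)      # uncompressed JSON

# Hypothetical workload: 20,000 analyst queries per month.
parquet_month = monthly_cost(10, 20_000)
json_month = monthly_cost(100, 20_000)
```

At that hypothetical volume, the same workload costs about $1,000 per month on optimized Parquet and about $10,000 on raw JSON, which is the order-of-magnitude gap the paragraph describes.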

Redshift for Production Dashboards

When queries need sub-second response times for dashboards viewed by hundreds of users, Redshift Serverless or provisioned clusters provide the performance that S3 scans cannot match. The lakehouse architecture means Redshift queries the same Iceberg tables through Redshift Spectrum or through the SageMaker Lakehouse unified access layer. No data copying. No synchronization pipelines. Redshift materializes frequently accessed data in its managed storage while reading less common data directly from S3.

Machine Learning Pipelines

Data scientists access lakehouse data through SageMaker Unified Studio, combining SQL queries with Python notebooks in a single environment. Feature engineering pipelines read raw transaction data from Iceberg tables, compute features (rolling averages, customer lifetime value estimates, purchase frequency patterns), and store results in SageMaker Feature Store for both batch training and real-time inference.

For model training on large datasets, S3 Express One Zone stages training data in a single availability zone co-located with GPU instances. The consistent single-digit millisecond latency keeps expensive accelerators fed with data continuously, reducing training time and cost. After training completes, the data moves to standard storage or gets deleted.

Generative AI and Agentic Systems: The Data Lakehouse as Foundation

The most significant shift in how organizations consume data lake assets in 2026 is not faster SQL engines or better dashboards. It is generative AI and autonomous agents that interact with structured and unstructured data to answer complex business questions, automate decisions, and execute multi-step workflows without human intervention at each step.

A well-governed data lakehouse is not just an analytics platform. It is the knowledge substrate that makes enterprise AI systems accurate, trustworthy, and auditable.

Retrieval-Augmented Generation from Lakehouse Data

Amazon Bedrock Knowledge Bases connects directly to S3 for retrieval-augmented generation. Product catalogs, customer FAQ databases, internal documentation, policy documents, and historical transaction patterns stored in the lakehouse feed RAG pipelines that power customer-facing AI assistants. The pipeline processes documents through parsing, chunking, embedding, and vector storage, then retrieves relevant context at query time to ground model responses in actual company data.
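To make the chunk, embed, store, retrieve sequence concrete, here is a toy sketch in plain Python. It is not the Bedrock API: the "embedding" is a simple word-count vector and the "vector store" is a list, both stand-ins chosen so the example runs without external services. The document IDs and texts are invented for illustration.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=40, overlap=10):
    """Split text into overlapping word windows."""
    words = text.split()
    step = max_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


def embed(text):
    """Toy embedding: a bag-of-words count vector (a real pipeline uses a model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(documents):
    """Chunk and embed every document; returns (doc_id, chunk_text, vector) entries."""
    return [
        (doc_id, piece, embed(piece))
        for doc_id, text in documents.items()
        for piece in chunk(text)
    ]


def retrieve(index, question, k=1):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(doc_id, piece) for doc_id, piece, _ in ranked[:k]]


if __name__ == "__main__":
    docs = {
        "returns-policy": "Electronics can be returned within 30 days with the original receipt.",
        "shipping-faq": "Standard shipping takes three to five business days.",
    }
    index = build_index(docs)
    print(retrieve(index, "Can I return electronics without a receipt?"))
```

The retrieved chunk is what gets passed to the model as context alongside the user's question.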

For an omnichannel retailer, this means a customer service AI assistant can reference the actual return policy for the specific product category, check the customer’s purchase history, and verify current inventory levels across locations, all from governed lakehouse data. The response is grounded in facts, not model hallucination.

The governance layer here is critical. Lake Formation ensures the AI system only retrieves information appropriate for the requesting user’s authorization level. A customer-facing chatbot accesses product information and the specific customer’s own order history. An internal operations agent accesses supply chain data and vendor pricing. The same underlying lakehouse serves both, but Lake Formation’s row and column-level controls ensure each AI system sees only what it should.

Natural Language to SQL: Democratizing Data Access

Bedrock Knowledge Bases supports natural language to SQL conversion against Redshift and Glue Data Catalog sources. This capability transforms who can query the lakehouse. A regional manager asking “what were our top-selling products in Santiago stores last quarter compared to online” gets a SQL query generated, executed against Iceberg tables, and returned as a natural language answer with the underlying numbers.

The practical impact for companies with mixed technical and non-technical teams is substantial. Instead of queuing requests with the data team, business users get direct but governed access to operational intelligence. Lake Formation permissions still apply, with one caveat: the generated SQL executes through a service role connected to the query engine, so by default permissions are evaluated against that role rather than the person asking. Per-user entitlements require the application layer to propagate the end user’s identity so that Lake Formation row-level filters are evaluated for that user. When this is configured, a store manager’s query is automatically filtered to their region’s data.
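The effect of identity propagation can be modeled in a few lines. This sketch does not call Lake Formation; it only imitates the outcome of row-level and column-level filtering for a given user context. The UserContext class, the column names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class UserContext:
    """Hypothetical end-user identity passed along with each generated query."""
    user_id: str
    region: Optional[str]            # None means no row-level restriction
    allowed_columns: FrozenSet[str]  # column-level allow-list


def apply_entitlements(rows, user):
    """Return only the rows and columns this user is entitled to see."""
    visible = []
    for row in rows:
        if user.region is not None and row["region"] != user.region:
            continue  # row-level filter
        visible.append({c: v for c, v in row.items() if c in user.allowed_columns})
    return visible


if __name__ == "__main__":
    sales = [
        {"region": "Santiago", "sku": "A1", "units": 12, "vendor_cost": 4.10},
        {"region": "Valparaiso", "sku": "A1", "units": 7, "vendor_cost": 4.10},
    ]
    store_manager = UserContext("mgr-01", "Santiago", frozenset({"region", "sku", "units"}))
    print(apply_entitlements(sales, store_manager))
```

Without the user context, every request would be evaluated as the service role and the region filter would have nothing to key on.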

Agentic AI Architectures on the Lakehouse

Agentic AI systems (autonomous programs that plan, execute multi-step tasks, use tools, and make decisions) represent the next evolution of how enterprises consume data assets. Amazon Bedrock Agents and the newer Amazon Bedrock AgentCore both integrate with lakehouse data as a primary tool.

An agentic system for inventory management might operate like this: the agent monitors real-time sales velocity from streaming data (Kinesis into Iceberg tables), compares against current stock levels (zero-ETL from the inventory database), checks historical patterns for seasonal demand shifts (queried from the lakehouse through Athena), and autonomously generates purchase orders when restock thresholds are met. Each decision is traceable. Each data access is logged through CloudTrail. Each permission boundary is enforced by Lake Formation.
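The decision step at the end of that loop might look something like the sketch below. It assumes the three data lookups have already happened and their results are passed in as plain values; the SkuState fields, the safety and cover windows, and the reorder formula are illustrative assumptions, not a recommended replenishment policy.

```python
import math
from dataclasses import dataclass


@dataclass
class SkuState:
    """Inputs the agent has gathered for one SKU (all field names hypothetical)."""
    sku: str
    units_on_hand: int            # from the inventory database
    units_sold_last_7_days: int   # from streaming sales data
    seasonal_multiplier: float    # from historical demand patterns
    lead_time_days: int           # supplier lead time


def restock_decision(state, safety_days=3, cover_days=14):
    """Return a purchase order dict if stock is below the reorder point, else None."""
    daily_velocity = state.units_sold_last_7_days / 7 * state.seasonal_multiplier
    reorder_point = daily_velocity * (state.lead_time_days + safety_days)
    if state.units_on_hand > reorder_point:
        return None
    order_qty = max(math.ceil(daily_velocity * cover_days - state.units_on_hand), 0)
    # Keep the inputs with the decision so the order can be audited later.
    return {
        "sku": state.sku,
        "order_qty": order_qty,
        "inputs": {
            "daily_velocity": daily_velocity,
            "reorder_point": reorder_point,
            "units_on_hand": state.units_on_hand,
        },
    }


if __name__ == "__main__":
    print(restock_decision(SkuState("A1", 40, 70, 1.5, 4)))
    print(restock_decision(SkuState("B2", 500, 70, 1.5, 4)))
```

Returning the inputs alongside the order is what makes each decision traceable after the fact.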

For customer experience automation, an agent handling a return request can verify the purchase in transaction data, check the product-specific return policy from the knowledge base, confirm current inventory at nearby stores for exchange options, process the refund through integration with the payments system, and update the customer’s profile. The lakehouse provides the factual grounding that prevents the agent from making promises it cannot keep.

S3 Tables MCP Server: AI Agents Querying Structured Data

AWS provides an MCP (Model Context Protocol) server for S3 Tables, enabling AI agents and coding assistants to discover Iceberg tables, inspect schemas, and execute queries through a standardized interface. Rather than writing custom integration code per data source, an agent connects to the MCP server and interacts with lakehouse tables programmatically. New tables added to the lakehouse become automatically discoverable. This is still an emerging integration pattern, but it signals the direction: AI agents treating governed lakehouse data as a first-class tool.

The Governance Imperative for AI Systems

Generative AI and agentic systems amplify the consequences of poor data governance. A SQL query that returns data it should not, because access controls were insufficient, is a mistake an analyst might catch. An autonomous agent making purchasing decisions based on data it should not have accessed, or using customer PII that should have been filtered, creates regulatory liability at machine speed.

This is why governance must precede AI integration. The tag-based access control, cell-level filtering, and audit trails described earlier are not optional infrastructure for organizations deploying AI on their data. They are the controls that make AI deployment defensible when regulators or auditors ask how you ensure your AI systems respect data boundaries, privacy rights, and purpose limitations.

AWS Clean Rooms adds another dimension for scenarios where AI systems need to process data from multiple organizations without exposing raw records. Cryptographic Computing for Clean Rooms (C3R) enables computation on encrypted data, and differential privacy controls add statistical noise to query results. For a retailer collaborating with a logistics partner to optimize delivery routes, both parties contribute data to a clean room where an AI model trains on the combined dataset without either party seeing the other’s raw customer records.
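To show what "adding statistical noise to query results" means in practice, here is a minimal sketch of the classic Laplace mechanism from the differential privacy literature. It is a generic textbook illustration, not a description of how Clean Rooms implements its controls; the function name and the epsilon values are assumptions.

```python
import random


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Return true_count plus Laplace noise with scale sensitivity / epsilon.

    The difference of two independent exponential draws with the same rate
    follows a Laplace distribution centered on zero.
    """
    scale = sensitivity / epsilon
    noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
    return true_count + noise


if __name__ == "__main__":
    rng = random.Random(7)  # seeded only so the demo is repeatable
    for epsilon in (0.1, 1.0, 10.0):
        samples = [round(noisy_count(1000, epsilon, rng), 1) for _ in range(3)]
        print(f"epsilon={epsilon}: {samples}")
```

A smaller epsilon means stronger privacy and noisier answers; a larger epsilon means answers closer to the true count.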

Cost Architecture: What Actually Drives the Bill

Data lakehouse costs come from four categories, and most organizations dramatically underestimate one of them.

Storage is usually the smallest component. At $0.023 per GB per month for S3 Standard (dropping to $0.004 with Glacier Instant Retrieval for infrequent data), even 100 TB of data costs approximately $2,300 per month before tiering optimizations. Intelligent-Tiering reduces this further without operational effort.
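A quick estimate using the per-GB prices quoted above shows how much tiering changes the picture. This sketch assumes decimal units and ignores request, retrieval, and minimum-duration charges, which matter for cold tiers; the 80% infrequent-access share is a hypothetical split.

```python
S3_STANDARD_USD_PER_GB = 0.023  # per GB-month, as quoted above
GLACIER_IR_USD_PER_GB = 0.004   # per GB-month, as quoted above
GB_PER_TB = 1000                # decimal units keep the numbers round


def monthly_storage_cost(total_tb, infrequent_fraction=0.0):
    """Approximate monthly storage cost with a share of data in the colder tier."""
    if not 0.0 <= infrequent_fraction <= 1.0:
        raise ValueError("infrequent_fraction must be between 0 and 1")
    total_gb = total_tb * GB_PER_TB
    cold_gb = total_gb * infrequent_fraction
    hot_gb = total_gb - cold_gb
    return hot_gb * S3_STANDARD_USD_PER_GB + cold_gb * GLACIER_IR_USD_PER_GB


if __name__ == "__main__":
    print(f"100 TB, all Standard:     ${monthly_storage_cost(100):,.0f}")
    print(f"100 TB, 80% infrequent:   ${monthly_storage_cost(100, 0.8):,.0f}")
```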

Compute dominates for active analytics. Athena charges per query based on data scanned. Redshift Serverless charges per RPU-hour. EMR charges per instance-hour. Glue charges per DPU-hour. The optimization lever is reducing data scanned through proper table design: correct partitioning (Iceberg hidden partitioning by date, bucket by customer ID), sort orders aligned with query patterns, and regular compaction to eliminate small files. Companies that invest in table optimization consistently report 30-60% reductions in compute costs.
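As an example of the partitioning described above, the sketch below renders an Athena-style CREATE TABLE statement for an Iceberg table partitioned by day and bucketed by customer ID. The table name, columns, bucket count, and S3 location are placeholders. It assumes a self-managed Iceberg table registered through Athena; S3 Tables and Spark use somewhat different DDL, and sort order is configured separately.

```python
def iceberg_ddl(table, columns, partition_transforms, location):
    """Render an Athena-style CREATE TABLE statement for an Iceberg table."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    parts = ", ".join(partition_transforms)
    return (
        f"CREATE TABLE {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"LOCATION '{location}'\n"
        "TBLPROPERTIES ('table_type' = 'ICEBERG')"
    )


# Hypothetical table: names, types, and bucket count are illustrative.
ddl = iceberg_ddl(
    table="retail.transactions",
    columns=[
        ("transaction_id", "string"),
        ("customer_id", "bigint"),
        ("store_id", "string"),
        ("amount", "decimal(12,2)"),
        ("event_ts", "timestamp"),
    ],
    # Hidden partitioning: queries filter on event_ts directly and the engine
    # prunes daily partitions without a separate date column.
    partition_transforms=["day(event_ts)", "bucket(16, customer_id)"],
    location="s3://amzn-s3-demo-bucket/warehouse/transactions/",
)

if __name__ == "__main__":
    print(ddl)
```

With this layout, a query that filters on a date range and a customer ID reads only the matching day partitions and one bucket in each.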

Data transfer catches organizations off guard. Cross-AZ transfers, cross-region replication, and data egress to the internet accumulate quietly. A data lake architecture that distributes compute across availability zones or requires frequent cross-region access generates transfer costs that can exceed storage costs. Co-locating compute and data (S3 Express One Zone in the same AZ as EMR clusters) eliminates cross-AZ charges for the hottest data.

Operational overhead does not appear on the AWS bill but represents real cost. Every custom ETL pipeline requires monitoring, debugging, and schema drift management. Zero-ETL integrations and managed services (S3 Tables automatic compaction, Glue crawlers, Lake Formation) reduce engineering time spent on undifferentiated infrastructure work.

Graviton-based instances (m8g, r8g) can deliver up to roughly 40% better price-performance for analytics compute compared to x86 equivalents, depending on the workload. Combined with Compute Savings Plans (up to 66% discount) and Spot instances for batch processing (up to 90% savings), the compute layer can be optimized significantly without architectural changes.

Where to Start

Building a data lakehouse is not an all-or-nothing commitment. The most successful implementations I have worked on started with a specific business question that could not be answered with existing systems, built the minimum architecture to answer it, and expanded from there.

Start by identifying your highest-value data integration. For most omnichannel retailers, that means connecting in-store transaction data with ecommerce behavior, creating a unified view of customer activity across channels. Set up an S3 table bucket, configure zero-ETL from your transactional database, apply Lake Formation governance from day one, and query with Athena. That minimal architecture can be operational within days using Lake Formation blueprints. Everything else (streaming ingestion, ML pipelines, cross-region replication) scales incrementally from that foundation.

The governance layer matters most when implemented early. Retrofitting access controls onto an ungoverned data lake is painful and error-prone. Defining your data classification scheme, tagging strategy, and compliance controls before data flows in means every dataset arrives governed by default.

If your organization is generating data across physical locations, digital platforms, and operational systems but struggling to unify that data into actionable intelligence while maintaining regulatory compliance, I can help you evaluate your current architecture and design a lakehouse implementation that fits your specific data sources, team capabilities, and compliance requirements. Reach out to explore what a modern data architecture could look like for your business.
