A $4M fashion brand's S3 data architecture has been the same for two years: two buckets, two access patterns, one nightly ETL job holding them together. The raw bucket — s3://brand-orders-raw/ — receives every Shopify order event, including customer email, phone number, billing and shipping address, and the last four digits of the payment method. The ML consultant's IAM role can read it. Nobody else can. The clean bucket — s3://brand-orders-analytics/ — contains the same orders with those six columns stripped, produced by a Lambda ETL job that runs at 2am every night. The Redash analyst, the growth team, and the agency can read it.
In January, Shopify added three new columns to the Orders API response — estimated_delivery_at, checkout_token, and duties_included. The Lambda ETL job, which was written to copy a specific column list, silently started producing a schema that didn't match the growing Parquet files. The analyst ran nine days of January reports before anyone noticed the clean bucket was 9 days stale and the column count was wrong. The fix took four hours: update the ETL column list, backfill the clean bucket, update the Athena table schema.
AWS Lake Formation's unified S3 file and catalog permission model eliminates the architecture that caused this failure. One Parquet dataset in one S3 location, registered with Lake Formation, governed by a single permission layer. The analyst's role gets a column exclusion grant — email, phone, billing address, and payment fields are absent from their Athena query results without being absent from the files. The ML consultant's SageMaker role gets direct S3 file credentials scoped to the same location, column grant configured separately. The analytics agency gets temporary credentials via a single API call that expire automatically. No ETL job. No second bucket. No silent schema drift at 2am.
Running two S3 buckets for customer data and a nightly ETL you no longer fully trust? Book a 30-min audit — Dev joins every call, we review your current S3 architecture and map the Lake Formation migration path for your specific data consumers. Written brief inside a week. No SDR layer.
Why the Two-Bucket Pattern Exists and What It Costs
The two-bucket pattern emerged because it was the most intuitive way to enforce data access control before fine-grained column permissions existed at scale. Put PII in a locked bucket. Copy the rest to an open bucket. Write an ETL job to keep them in sync. Every D2C analytics team that set this up made the same reasonable decision given the tools available at the time.
The costs accumulate:
Schema maintenance burden. Every time the upstream data source adds, renames, or removes a column — Shopify API updates, Odoo schema changes, new event properties — the ETL job must be updated. If it is not, the clean bucket silently diverges from the raw bucket. The divergence is rarely discovered immediately.
Lag in analytics freshness. A nightly ETL introduces a minimum 6 to 12 hour lag between when an order is placed and when it appears in the analytics bucket. For daily reporting this is acceptable. For same-day operational decisions during BFCM — how many orders have shipped, what is the current return rate — the lag is a problem.
Duplicate storage cost. Two copies of the same data at D2C order volumes — typically 50,000 to 500,000 annual orders — costs $30 to $150 per month in S3 storage, depending on data retention policy. Not budget-breaking, but it is waste that accumulates.
IAM permission sprawl. The clean bucket needs read access granted to the analyst, the growth team, the BI tool service account, the agency, and any other consumer added over time. Each new consumer is a new IAM policy attachment. Over two years, a D2C brand's clean bucket typically has 5 to 12 principals with read access, some of which belong to employees who have left or agencies whose engagement has ended.
How Lake Formation Column Exclusions Work
Lake Formation column exclusions operate at the Glue Data Catalog table level. The setup sequence for a D2C order analytics table:
Step 1: Register the S3 location. In the Lake Formation console, register s3://brand-orders-raw/orders/ as a data lake location. This designates it as a Lake Formation-governed location. Lake Formation creates a service-linked role with read/write access to the bucket.
Step 2: Create or connect the Glue table. If an Athena table already exists pointing at this location, it becomes a Lake Formation-governed table automatically after the location is registered. If not, create the Glue Data Catalog table with the full order schema — all columns including PII fields.
Step 3: Grant column-level permissions to the analyst role. In Lake Formation, grant the analyst's IAM role SELECT permission on the table with EXCLUDED_COLUMN_NAMES: [customer_email, customer_phone, billing_address, shipping_address, payment_last4, checkout_token]. When the analyst queries the table in Athena, those columns do not appear in the result set. They cannot be referenced in a SELECT or WHERE clause. The data is not masked or nulled — it is structurally absent from that role's view of the table.
Step 4: Grant the ML role direct file access. The SageMaker or EMR role gets a Lake Formation table grant with a narrower column list specific to what the forecasting model needs — typically order_id, created_at, line_items, total_price, shipping_region, and discount_codes. No PII needed for demand forecasting, and explicitly scoping the grant means the ML role cannot access PII columns even if the model code tries to read them.
When Shopify adds new columns in the next API update, the Glue table schema is updated once. The exclusion grants remain valid — new columns default to not-granted for roles with exclusion lists, and need to be explicitly added to grants for roles that should see them. The nightly ETL job and the second bucket are no longer in the picture.
Temporary Credentials for External Agencies
The GetTemporaryDataLocationCredentials API vends short-lived S3 credentials scoped to a specific Lake Formation-registered data location. For D2C brands that share data with analytics agencies, this replaces the most common data security anti-pattern in the industry: the permanent IAM access key sent over email.
The current workflow at most D2C brands: the agency engagement starts, someone creates an IAM user, attaches an S3 read policy to the clean bucket, generates access keys, sends them to the agency via email or Slack, and writes a calendar reminder to deactivate them in 90 days. The calendar reminder gets missed. The access keys remain active for 18 months after the engagement ends. The former agency employee who saved the credentials still has read access to 3 years of order data.
The Lake Formation workflow: the agency's AWS account ID is added as an external principal in the Lake Formation grant. The brand's system calls GetTemporaryDataLocationCredentials with the appropriate role, receives temporary S3 credentials with a configurable TTL between 15 minutes and 12 hours, and sends those credentials to the agency for the specific analysis session. The credentials expire automatically. For a multi-day engagement, the API is called each session. No IAM user, no permanent access key, no deactivation step to remember.
For agencies that need table-level SQL access rather than direct file access, Lake Formation cross-account data sharing grants the external account access to the Glue table, and the agency queries via their own Athena instance with the column exclusions applying cross-account. Our Athena for e-commerce log analytics post covers the Athena query security layer that sits alongside this.
Configuring the D2C Order Table in Glue and Athena
The practical Glue table configuration for a Shopify-to-S3 order analytics setup:
S3 prefix structure. Partition by year and month: s3://brand-orders-raw/orders/year=2025/month=01/. Athena uses partition pruning to avoid scanning full table on date-filtered queries. Lake Formation permissions apply to the registered root location and inherit to all prefixes beneath it.
File format. Parquet with Snappy compression. Parquet's columnar storage means Athena only reads the columns in the SELECT clause from S3 — column exclusions at the Lake Formation layer and Parquet's columnar reads at the storage layer work together to minimize data scanned per query. A 12-month order history for a $5M D2C brand typically occupies 2 to 8 GB in Parquet depending on order complexity and line item count.
Schema columns and types. Define PII columns explicitly in the Glue schema with clear names: customer_email STRING, customer_phone STRING, billing_address STRUCT. Explicit naming makes the Lake Formation exclusion grant readable and auditable — anyone reviewing the grant can see exactly which columns are excluded without needing to understand the data model.
Athena workgroup. Create a dedicated workgroup for the analytics use case with a query result location in a separate S3 prefix. Configure the workgroup to enforce Lake Formation permissions — this prevents analysts from bypassing column exclusions by querying the underlying S3 location directly through Athena's S3 access path. See our AWS infrastructure work for D2C for the full workgroup + Lake Formation + VPC endpoint configuration we use in production setups.
What Lake Formation Does Not Replace
Row-level filtering. Column exclusions control which columns a role can see. Row-level filters control which rows. If the analytics role should see only orders from a specific region, or only orders in a specific date range, row-level filters are a separate Lake Formation configuration from column exclusions. Both can be applied to the same grant for the same role — a role can be restricted to specific rows AND specific columns simultaneously.
Real-time data access. Lake Formation governs Athena queries, EMR Spark jobs, and Glue ETL jobs — batch and interactive workloads against files at rest in S3. It does not govern Kinesis Data Streams, real-time order webhooks, or live API calls to Shopify. The real-time Shopify webhook ingest that writes orders to S3 in the first place is outside Lake Formation's governance scope — it is a standard S3 write operation by a Lambda function with an IAM role.
Application-layer data access. Lake Formation controls data access for analytics tools and ML training pipelines. It does not govern which customer-facing application fields are visible to which users in Shopify or Odoo. Application-layer PII governance in Odoo is a separate configuration, covered in the demand forecasting and AI cost attribution setup we described in our AI cost visibility post.
Ready to retire the two-bucket pattern? Book 30 minutes with Dev — we audit your current Glue tables, S3 bucket policies, and IAM access patterns, and deliver a Lake Formation migration plan specific to your data consumers. Written brief inside a week.
Frequently Asked Questions
Does Lake Formation replace S3 bucket policies entirely?
Not entirely, but it becomes the primary access control layer for registered data locations. When a location is registered with Lake Formation, the recommended pattern is to configure the S3 bucket to deny all direct access except through the Lake Formation service role — a bucket policy with a deny-all except the service principal. This creates a single governance layer: you grant access in Lake Formation, not by editing S3 bucket policies or IAM inline policies for each consumer. For D2C brands, this matters when a new analyst joins or an agency engagement starts — instead of updating multiple S3 policies and IAM role attachments, you make one Lake Formation grant and the access flows through automatically.
What is the setup effort for a D2C brand already using Athena?
If you are already using Athena with a Glue Data Catalog table pointing at your S3 order data, enabling Lake Formation adds two main configuration steps: register the S3 location as a Lake Formation data lake location, and switch Athena from IAM-based to Lake Formation-based access control in the Athena settings. The Glue tables you already have become Lake Formation-governed automatically. Column exclusion grants are then configured in the Lake Formation console per IAM principal. For a brand with one order analytics table and two or three data consumers, initial setup runs 2 to 4 hours. Ongoing management is lighter than maintaining separate S3 bucket policies — all access grants are visible and auditable from a single Lake Formation console view.
Can the demand forecasting model read S3 files directly, or does it have to query through Athena?
Direct S3 file access is exactly what GetTemporaryDataLocationCredentials enables. The SageMaker notebook or EMR Spark job calls this API with the appropriate IAM role, receives temporary S3 credentials scoped to the registered data location, and reads the Parquet files directly — spark.read.parquet() or boto3 S3 client calls work without going through Athena SQL. Column-level exclusions configured in Lake Formation apply to Athena queries; for direct file reads, the column access is governed by what the Lake Formation grant explicitly allows for that role. This means the ML role's grant should list exactly the columns the forecasting model needs, rather than relying on exclusions — an explicit allow is more auditable than an implicit allow-all-except.
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
